Metadata-Version: 2.4
Name: nl-code
Version: 0.6.0
Summary: Primitives for research into LLMs and code
Author-email: Danielle Rothermel <danielle.rothermel@gmail.com>
Requires-Python: >=3.12
Requires-Dist: datasets>=3.0.0
Requires-Dist: dr-docker==0.4.5
Requires-Dist: fastapi>=0.135.3
Requires-Dist: pydantic>=2.12.0
Requires-Dist: python-dotenv>=1.2.2
Requires-Dist: typer>=0.24.1
Requires-Dist: uvicorn[standard]>=0.44.0
Provides-Extra: bigcodebench
Requires-Dist: beautifulsoup4>=4.12; extra == 'bigcodebench'
Requires-Dist: gensim>=4.3; extra == 'bigcodebench'
Requires-Dist: holidays>=0.60; extra == 'bigcodebench'
Requires-Dist: matplotlib>=3.9; extra == 'bigcodebench'
Requires-Dist: nltk>=3.9; extra == 'bigcodebench'
Requires-Dist: numpy>=1.26; extra == 'bigcodebench'
Requires-Dist: openpyxl>=3.1; extra == 'bigcodebench'
Requires-Dist: pandas>=2.2; extra == 'bigcodebench'
Requires-Dist: pypdf2>=3.0; extra == 'bigcodebench'
Requires-Dist: python-dateutil>=2.9; extra == 'bigcodebench'
Requires-Dist: python-docx>=1.1; extra == 'bigcodebench'
Requires-Dist: pytz>=2024.1; extra == 'bigcodebench'
Requires-Dist: regex>=2024.4; extra == 'bigcodebench'
Requires-Dist: reportlab>=4.2; extra == 'bigcodebench'
Requires-Dist: scikit-learn>=1.5; extra == 'bigcodebench'
Requires-Dist: scipy>=1.14; extra == 'bigcodebench'
Requires-Dist: seaborn>=0.13; extra == 'bigcodebench'
Requires-Dist: statsmodels>=0.14; extra == 'bigcodebench'
Provides-Extra: docker
Requires-Dist: dr-docker>=0.4.5; extra == 'docker'
Description-Content-Type: text/markdown

# nl-code

Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.

## Install

```bash
uv add nl-code                  # core
uv add "nl-code[docker]"        # + Docker execution via dr-docker
uv add "nl-code[bigcodebench]"  # + scientific libs for BigCodeBench/ClassEval
```

## Code Execution

Execute generated code in isolated Docker containers.

Three execution modes cover all supported dataset test formats:

- **function_call** — call a named function with inputs, compare return values (HumanEval)
- **assertion** — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
- **unittest** — exec code + unittest.TestCase classes (ClassEval)

Batch variants (`batch_run_test_cases`, `batch_run_assertion_tests`, `batch_run_unittest_tests`) process many code samples in a single container with auto-chunking.
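
Auto-chunking can be pictured as splitting the sample list into fixed-size groups and handing each group to one container run. The sketch below shows only the splitting step. The `chunked` helper and its `chunk_size` parameter are assumptions for illustration; how nl-code actually chooses chunk boundaries is not described in this README.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], chunk_size: int) -> Iterator[list[T]]:
    """Yield consecutive slices of at most chunk_size items, preserving order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# Seven placeholder samples split into groups of at most three.
samples = [f"sample_{i}" for i in range(7)]
batches = list(chunked(samples, chunk_size=3))
```

Each element of `batches` would then correspond to one batch call, with the last group allowed to be smaller than the rest.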

### Build The Docker Image

Build the execution image from the repo root:

```bash
docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .
```

This is the default runtime image used by the execution pipeline. The Dockerfile
installs both the `bigcodebench` dependency set and the pinned `dr-docker`
runtime dependency directly from `pyproject.toml`, so the image stays aligned
with the repo's declared execution requirements.

### Run The Docker Test Tier

Docker-dependent tests are marked with `@pytest.mark.docker` and are excluded
from the default `pytest` run.

Run them explicitly with:

```bash
uv run nl-code-test docker
```

You can pass extra pytest arguments through after `docker`, for example:

```bash
uv run nl-code-test docker -q tests/test_execution_runner.py
```

## Datasets

Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from Hugging Face, parsed into `Task` objects, and cached locally.

The corresponding raw task models preserve the original dataset inputs as `source__...` fields and expose richer derived artifacts such as:
- official prompt fields
- stripped and comment-preserving code stubs
- stripped and comment-preserving ground-truth code

Across task families, `new_official_prompt`, `new_code_stub`, and `new_code_stub_with_comments` provide a consistent interface for prompt/stub access even when the underlying dataset-specific field names differ.

`DatasetSlice` supports filtering, seeded shuffling, limits, and parallel accessors for common raw-task artifacts:
- `get_source_code(task_id)`
- `get_official_prompt(task_id)`
- `get_code_stub(task_id)`
- `get_code_stub_with_comments(task_id)`

## Dataset Explorer

A FastAPI + React app for browsing and comparing datasets. Run it from `ui/dataset-explorer/`.

## Headless Validation Runs

Dataset validation and debugging commands that import `matplotlib` should run headlessly by setting `MPLBACKEND=Agg`:

```bash
MPLBACKEND=Agg uv run python ...
```

## Rebuild Dataset Caches

Run the Docker-backed cache rebuilds with:

```bash
uv run python -m nl_code.datasets.cache_cli rebuild all
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
uv run python -m nl_code.datasets.cache_cli rebuild class-eval
uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro
```

`cache_cli rebuild` sets `MPLBACKEND=Agg` automatically.

Current observed results with the default execution image and env limits:

```text
humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
humaneval-pro: cached 163 tasks (163 raw, 1 flawed)
mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
class-eval: cached 98 tasks (98 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)
```

The remaining flawed samples above are dataset-level failures, not Docker
runtime failures.
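
To compare these counts across rebuilds, the summary lines can be parsed mechanically. The sketch below assumes the output keeps exactly the `name: cached N tasks (N raw, N flawed)` shape shown above; `CacheSummary` and `parse_summary` are hypothetical names and not part of `cache_cli`.

```python
import re
from typing import NamedTuple


class CacheSummary(NamedTuple):
    dataset: str
    cached: int
    raw: int
    flawed: int


SUMMARY_RE = re.compile(
    r"^(?P<dataset>[\w-]+): cached (?P<cached>\d+) tasks "
    r"\((?P<raw>\d+) raw, (?P<flawed>\d+) flawed\)$"
)


def parse_summary(text: str) -> list[CacheSummary]:
    """Parse rebuild summary lines, skipping any line that does not match."""
    rows = []
    for line in text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            rows.append(
                CacheSummary(
                    match["dataset"],
                    int(match["cached"]),
                    int(match["raw"]),
                    int(match["flawed"]),
                )
            )
    return rows
```

Lines that do not match the pattern are ignored, so the parser tolerates other log output interleaved with the summaries.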

The one currently known flawed HumanEval-Pro sample is `HumanEvalPro/24`, where
the docstring of the new function is not present in `new_solution`.
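
A check of this kind can be approximated with the standard `ast` module. The sketch below only tests whether a named function carries any docstring at all. The actual validation rule nl-code applies is not shown in this README, and `function_has_docstring` is a hypothetical helper.

```python
import ast


def function_has_docstring(source: str, function_name: str) -> bool:
    """Return True if `function_name` is defined in `source` with a docstring."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name == function_name:
            return ast.get_docstring(node) is not None
    # The function is not defined in the source at all.
    return False
```

Because it works on the parsed syntax tree, the check does not execute the solution code.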
