Metadata-Version: 2.4
Name: anonlm-pii
Version: 0.1.3
Summary: Open-source PII anonymization agent with reproducible benchmarking for OpenAI-compatible models
Project-URL: Homepage, https://github.com/aritzjl/anonlm
Project-URL: Repository, https://github.com/aritzjl/anonlm
Project-URL: Issues, https://github.com/aritzjl/anonlm/issues
Author: AnonLM Contributors
License: Apache-2.0
License-File: LICENSE
Keywords: anonymization,benchmarking,langgraph,llm,pii,privacy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: langchain-core>=1.2.16
Requires-Dist: langchain-openai>=1.1.10
Requires-Dist: langgraph>=1.0.10
Requires-Dist: pydantic>=2.12.5
Requires-Dist: python-dotenv>=1.2.1
Provides-Extra: dev
Requires-Dist: build>=1.2.2; extra == 'dev'
Requires-Dist: mypy>=1.15.0; extra == 'dev'
Requires-Dist: pre-commit>=4.2.0; extra == 'dev'
Requires-Dist: ruff>=0.11.0; extra == 'dev'
Requires-Dist: twine>=6.1.0; extra == 'dev'
Provides-Extra: docs
Provides-Extra: test
Requires-Dist: pytest>=8.0.0; extra == 'test'
Description-Content-Type: text/markdown

# AnonLM

AnonLM is an open-source Python library for LLM-based PII anonymization with reproducible benchmarking.

It provides:
- A configurable anonymization engine for OpenAI-compatible providers.
- A stable Python API for anonymize/deanonymize workflows.
- A unified CLI for anonymization and benchmark execution.
- Benchmark history artifacts for auditability and experiment tracking.

## Installation

```bash
pip install anonlm-pii
```

For development:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,test]"
```

## Quickstart (Python API)

```python
from anonlm import anonymize

result = anonymize("Contact Jane Doe at jane.doe@example.com or +34 600 123 456.")
print(result.anonymized_text)
print(result.mapping_forward)
print(result.chunking.chunk_count)
print(result.chunking.chunks)
```

## Quickstart (CLI)

```bash
# Text input
anonlm anonymize --text "Contact Jane Doe at jane.doe@example.com"

# File input -> JSON output
anonlm anonymize --file input.txt --output output.json

# Benchmark run
anonlm benchmark run --dataset datasets/pii_mvp_dataset.csv --split dev
```

## Configuration

Configuration is resolved in the following order of precedence (highest first):
1. Explicit CLI flags
2. Environment variables (`ANONLM_*`)
3. Provider defaults

Core environment variables:

| Variable | Description |
| --- | --- |
| `ANONLM_PROVIDER` | `openai`, `openrouter`, `groq`, or `custom` |
| `ANONLM_MODEL_NAME` | Model identifier |
| `ANONLM_BASE_URL` | OpenAI-compatible base URL |
| `ANONLM_API_KEY_ENV` | Env var name containing API key |
| `ANONLM_API_KEY` | API key value |
| `ANONLM_TEMPERATURE` | LLM sampling temperature |
| `ANONLM_MAX_CHUNK_CHARS` | Maximum chunk size, in characters |
| `ANONLM_CHUNK_OVERLAP_CHARS` | Overlap between consecutive chunks, in characters |

Provider examples:

```bash
# OpenAI
export ANONLM_PROVIDER=openai
export ANONLM_API_KEY=sk-...

# OpenRouter
export ANONLM_PROVIDER=openrouter
export ANONLM_API_KEY=...
export ANONLM_MODEL_NAME=openai/gpt-4o-mini

# Groq
export ANONLM_PROVIDER=groq
export ANONLM_API_KEY=...
export ANONLM_MODEL_NAME=llama-3.3-70b-versatile

# Custom OpenAI-compatible endpoint
export ANONLM_PROVIDER=custom
export ANONLM_BASE_URL=https://your.endpoint/v1
export ANONLM_API_KEY=...
```

## Benchmarking

Run a benchmark with deterministic document-based splits (`dev`, `val`, `final`):

```bash
anonlm benchmark run --dataset datasets/pii_mvp_dataset.csv --split dev --verbose
```

Optional benchmark controls:

```bash
anonlm benchmark run \
  --dataset datasets/pii_mvp_dataset.csv \
  --split val \
  --history-dir runs/benchmarks \
  --threshold-f1 0.80
```

Artifacts:
- JSON run detail: `runs/benchmarks/<timestamp>__<split>.json`
- CSV summary index: `runs/benchmarks/index.csv`
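
The CSV index can be post-processed with the standard library. A minimal sketch, using a synthetic in-memory index because the actual column names depend on your AnonLM version (inspect the header row of your `index.csv`):

```python
import csv
import io

# Synthetic stand-in for runs/benchmarks/index.csv; the columns shown
# (timestamp, split, f1) are illustrative assumptions, not the real schema.
synthetic_index = io.StringIO(
    "timestamp,split,f1\n"
    "2024-01-01T00-00-00,dev,0.82\n"
    "2024-01-02T00-00-00,val,0.79\n"
)

rows = list(csv.DictReader(synthetic_index))
dev_runs = [r for r in rows if r["split"] == "dev"]
print(dev_runs)
```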

See `docs/benchmarking.md` for protocol and interpretation guidelines.

## Public API

- `anonlm.anonymize(text: str, config: AnonLMConfig | None = None) -> AnonymizationResult`
- `anonlm.deanonymize(text: str, mapping_reverse: dict[str, str]) -> str`
- `anonlm.create_engine(config: AnonLMConfig | None = None) -> AnonymizationEngine`

`AnonymizationResult` includes chunking metadata in `result.chunking` (and in `result.to_dict()["chunking"]`):
- `chunk_count`: total chunks processed
- `chunks`: chunk content list in processing order
- `max_chunk_chars`: chunk size setting used
- `chunk_overlap_chars`: overlap setting used
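
Conceptually, `deanonymize` applies the reverse mapping of placeholders back to the original values. A standalone sketch of that idea (the `[PERSON_1]`-style placeholder format is an assumption for illustration, and this is not the library's implementation):

```python
def apply_reverse_mapping(text: str, mapping_reverse: dict[str, str]) -> str:
    # Replace each placeholder with the original value it stands for.
    for placeholder, original in mapping_reverse.items():
        text = text.replace(placeholder, original)
    return text

anonymized = "Contact [PERSON_1] at [EMAIL_1]."
mapping_reverse = {"[PERSON_1]": "Jane Doe", "[EMAIL_1]": "jane.doe@example.com"}
print(apply_reverse_mapping(anonymized, mapping_reverse))
# Contact Jane Doe at jane.doe@example.com.
```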

## Project status

Current status: `0.x` (early API hardening). Breaking changes may occur in minor releases until `1.0.0`.

## Next objectives

1. Reach `>90%` reliability with `gpt-oss-20b` on the current baseline dataset (`datasets/pii_mvp_dataset.csv`).
2. Build a stronger benchmark dataset, likely by adapting a PII dataset from Hugging Face and normalizing it to AnonLM's benchmark format.
3. Reach `>=90%` reliability with `gpt-oss-20b` on the new dataset.

## License

Apache-2.0
