Metadata-Version: 2.4
Name: finetunecheck
Version: 0.1.1
Summary: Automated base vs fine-tuned LLM comparison with forgetting detection, capability retention scoring, and visual diff reports.
Project-URL: Homepage, https://github.com/shuhulx/finetunecheck
Project-URL: Repository, https://github.com/shuhulx/finetunecheck
Project-URL: Issues, https://github.com/shuhulx/finetunecheck/issues
Author-email: Shuhul Razdan <shuhul.aiml@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: diskcache>=5.6
Requires-Dist: jinja2>=3.1
Requires-Dist: numpy>=1.24
Requires-Dist: peft>=0.10
Requires-Dist: plotly>=5.18
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: rouge-score>=0.1
Requires-Dist: scipy>=1.11
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.40
Requires-Dist: typer[all]>=0.12
Provides-Extra: all
Requires-Dist: anthropic>=0.25; extra == 'all'
Requires-Dist: llama-cpp-python>=0.2; extra == 'all'
Requires-Dist: mcp>=1.0; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: pytest-cov; extra == 'all'
Requires-Dist: pytest>=8.0; extra == 'all'
Requires-Dist: ruff; extra == 'all'
Requires-Dist: vllm>=0.4; extra == 'all'
Provides-Extra: api-judge
Requires-Dist: anthropic>=0.25; extra == 'api-judge'
Requires-Dist: openai>=1.0; extra == 'api-judge'
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: gguf
Requires-Dist: llama-cpp-python>=0.2; extra == 'gguf'
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == 'mcp'
Provides-Extra: vllm
Requires-Dist: vllm>=0.4; extra == 'vllm'
Description-Content-Type: text/markdown

# FineTuneCheck

**Diagnostic tool for LLM fine-tuning outcomes.**

Automated base-vs-fine-tuned comparison with forgetting detection, capability retention scoring, and visual diff reports.

[![PyPI](https://img.shields.io/pypi/v/finetunecheck)](https://pypi.org/project/finetunecheck/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://python.org)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-142%20passed-brightgreen.svg)]()

---

## The Problem

You fine-tuned a model. It's better at your task. But **what did it forget?**

Fine-tuning tends to improve target capabilities at the expense of general ones. Without measurement, you're shipping blind:

- Did safety alignment degrade?
- Is reasoning still intact?
- Are code capabilities broken?
- Was the trade-off worth it?

**FineTuneCheck answers these questions in one command.**

## Features

- **12 built-in probe categories** — reasoning, code, math, safety, chat quality, creative writing, summarization, extraction, classification, instruction following, multilingual, world knowledge
- **4 forgetting metrics** — Backward Transfer (BWT), Capability Retention Rate (CRR), Selective Forgetting Index (SFI), Safety Alignment Retention (SAR)
- **Multi-judge system** — exact match, F1, rule-based, ROUGE, LLM-as-judge
- **Deep analysis** — CKA similarity, spectral analysis, perplexity distribution shift, calibration (ECE), activation drift
- **Multi-run comparison** — Pareto frontier analysis across fine-tuning runs
- **5 verdict levels** — EXCELLENT → GOOD → GOOD_WITH_CONCERNS → POOR → HARMFUL
- **Composite ROI score** — 0-100 score balancing improvement vs forgetting cost
- **HTML/JSON/CSV/Markdown reports** — interactive Plotly charts, exportable results
- **MCP server** — 9 tools for AI assistant integration
- **LoRA + GGUF support** — works with PEFT adapters and quantized models

## Install

```bash
pip install finetunecheck
```

With optional backends:

```bash
pip install "finetunecheck[api-judge]"   # LLM-as-judge (Anthropic + OpenAI)
pip install "finetunecheck[vllm]"        # vLLM inference backend
pip install "finetunecheck[gguf]"        # GGUF model support
pip install "finetunecheck[mcp]"         # MCP server for AI assistants
pip install "finetunecheck[all]"         # Everything
```

## Quick Start

### CLI

```bash
# Full evaluation
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
  --profile code --report report.html

# Quick 5-minute check (20 samples, 4 categories)
ftcheck quick meta-llama/Llama-3-8B ./my-finetuned-model

# Compare multiple fine-tuning runs
ftcheck compare meta-llama/Llama-3-8B ./run1 ./run2 ./run3 \
  --report comparison.html

# Deep analysis (CKA, spectral, perplexity, calibration)
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model --deep

# List available probes and profiles
ftcheck list-probes
ftcheck list-profiles
```

### Python API

```python
from finetunecheck import EvalRunner
from finetunecheck.config import EvalConfig

config = EvalConfig(
    base_model="meta-llama/Llama-3-8B",
    finetuned_model="./my-finetuned-model",
    profile="code",
    deep_analysis=True,
)

runner = EvalRunner(config)
results = runner.run()

print(f"Verdict: {results.verdict.value}")        # GOOD_WITH_CONCERNS
print(f"ROI Score: {results.roi_score}")           # 72.5
print(f"BWT: {results.forgetting.backward_transfer:+.3f}")  # -0.082
print(f"Safety: {results.forgetting.safety_alignment_retention}")  # 0.97
```

## Probe Categories

| Category | Samples | Judge | What It Tests |
|----------|---------|-------|---------------|
| reasoning | 100+ | LLM | Logical deduction, chain-of-thought |
| code | 100+ | rule-based | Code generation, debugging |
| math | 100+ | exact match | Arithmetic, algebra, word problems |
| safety | 100+ | rule-based | Refusal of harmful prompts, alignment |
| chat_quality | 100+ | LLM | Helpfulness, coherence, tone |
| creative_writing | 100+ | LLM | Storytelling, style, creativity |
| summarization | 100+ | ROUGE | Compression, faithfulness |
| extraction | 100+ | F1 | Named entities, structured data |
| classification | 100+ | exact match | Sentiment, topic, intent |
| instruction_following | 100+ | rule-based | Format compliance, constraints |
| multilingual | 100+ | LLM | Translation, cross-lingual transfer |
| world_knowledge | 100+ | exact match | Facts, trivia, common sense |

## Forgetting Metrics

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **BWT** (Backward Transfer) | avg(ft − base) on non-target categories | Negative = forgetting |
| **CRR** (Capability Retention Rate) | ft_score / base_score per category | < 0.95 = meaningful regression |
| **SFI** (Selective Forgetting Index) | std(CRR values) | High = uneven forgetting |
| **SAR** (Safety Alignment Retention) | ft_safety / base_safety | < 0.90 → HARMFUL verdict |
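
As a rough illustration of how these metrics fall out of per-category scores, here is a minimal sketch. The function, score dictionaries, and category names are illustrative, not FineTuneCheck's API:

```python
# Illustrative sketch, not FineTuneCheck's internal API.
import statistics

def forgetting_metrics(base: dict[str, float], ft: dict[str, float],
                       target: str) -> dict[str, float]:
    """Derive BWT, mean CRR, SFI, and SAR from per-category scores in [0, 1]."""
    non_target = [c for c in base if c != target]
    bwt = statistics.mean(ft[c] - base[c] for c in non_target)  # negative = forgetting
    crr = {c: ft[c] / base[c] for c in base}                    # < 0.95 = regression
    return {
        "bwt": bwt,
        "mean_crr": statistics.mean(crr.values()),
        "sfi": statistics.stdev(crr.values()),                  # high = uneven forgetting
        "sar": ft["safety"] / base["safety"],                   # < 0.90 -> HARMFUL
    }
```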

## Verdict System

| Verdict | ROI Score | Meaning |
|---------|-----------|---------|
| **EXCELLENT** | 85-100 | Strong improvement, minimal forgetting |
| **GOOD** | 70-84 | Solid improvement, acceptable trade-offs |
| **GOOD_WITH_CONCERNS** | 50-69 | Improvement exists but forgetting is notable |
| **POOR** | 25-49 | Marginal improvement, significant forgetting |
| **HARMFUL** | 0-24 | Safety degraded or catastrophic forgetting |
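
The ROI bands translate into a simple mapping, with one caveat from the metrics table above: SAR < 0.90 forces a HARMFUL verdict regardless of ROI. A minimal sketch, not the library's internal scoring logic:

```python
# Illustrative banding per the table above; not FineTuneCheck's internals.
def verdict(roi_score: float, sar: float) -> str:
    if sar < 0.90:  # safety regression overrides the ROI bands
        return "HARMFUL"
    if roi_score >= 85:
        return "EXCELLENT"
    if roi_score >= 70:
        return "GOOD"
    if roi_score >= 50:
        return "GOOD_WITH_CONCERNS"
    if roi_score >= 25:
        return "POOR"
    return "HARMFUL"
```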

## Deep Analysis

Enable with `--deep` for additional diagnostics:

- **CKA Similarity** — per-layer representation alignment between base and fine-tuned models (sketched after this list)
- **Spectral Analysis** — effective rank changes, singular value distribution
- **Perplexity Distribution Shift** — KL divergence and Wasserstein distance of per-token perplexity
- **Calibration (ECE)** — expected calibration error before and after fine-tuning
- **Activation Drift** — per-layer cosine similarity, disrupted attention heads
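
For reference, the linear variant of CKA (Kornblith et al., 2019; see References) reduces to a short computation on centered activation matrices. A minimal sketch, independent of how FineTuneCheck extracts and aggregates activations internally:

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activations x, y of shape (n_samples, features)."""
    x = x - x.mean(axis=0)  # center each feature column
    y = y - y.mean(axis=0)
    num = np.linalg.norm(y.T @ x, "fro") ** 2  # ||Y^T X||_F^2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(num / den)
```

Values near 1.0 mean a layer's representation survived fine-tuning largely intact; dips localize the layers the fine-tune actually rewrote.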

## Multi-Run Comparison

```bash
ftcheck compare base_model ./run1 ./run2 ./run3 --report comparison.html
```

Outputs:
- Per-run verdicts and ROI scores
- **Best overall** (highest ROI)
- **Best target performance** (highest target improvement)
- **Least forgetting** (highest mean CRR)
- **Pareto frontier** — runs that no other run dominates across all metrics (see the sketch below)
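
A Pareto-frontier filter is small enough to sketch here. The metric keys and run values are placeholders, not FineTuneCheck's result schema; scores are oriented so that higher is better:

```python
# Illustrative Pareto-frontier filter; metric keys and values are placeholders.
def pareto_frontier(runs: list[dict], metrics: tuple[str, ...]) -> list[dict]:
    """Keep runs that no other run dominates (higher is better on every metric)."""
    def dominates(a: dict, b: dict) -> bool:
        return (all(a[m] >= b[m] for m in metrics)
                and any(a[m] > b[m] for m in metrics))
    return [r for r in runs
            if not any(dominates(other, r) for other in runs if other is not r)]

runs = [
    {"name": "run1", "roi_score": 72.5, "mean_crr": 0.96},
    {"name": "run2", "roi_score": 81.0, "mean_crr": 0.91},
    {"name": "run3", "roi_score": 68.0, "mean_crr": 0.90},
]
frontier = pareto_frontier(runs, metrics=("roi_score", "mean_crr"))  # run1, run2
```

Runs off the frontier are strictly worse trade-offs: some other run matches or beats them on every metric.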

## Custom Probes

```python
from finetunecheck.probes.registry import ProbeRegistry

# From CSV
ProbeRegistry.register_from_csv("my_probes.csv", name="custom", category="domain")

# From JSONL
ProbeRegistry.register_from_jsonl("my_probes.jsonl", name="custom", category="domain")
```

CSV format: `input,reference,difficulty,tags`
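
A hypothetical `my_probes.csv` (rows invented for illustration; each row uses a single tag, since the delimiter for multiple tags isn't specified here):

```
input,reference,difficulty,tags
"What does HTTP status 404 mean?","Not Found",easy,web
"Translate 'merci beaucoup' to English.","thank you very much",easy,translation
```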

## MCP Integration

Add to your AI assistant's MCP config:

```json
{
  "mcpServers": {
    "finetunecheck": {
      "command": "ftcheck",
      "args": ["serve", "--stdio"]
    }
  }
}
```

**9 MCP tools:** `run_evaluation`, `quick_check`, `compare_runs`, `get_forgetting_report`, `list_probes`, `list_profiles`, `get_probe_details`, `analyze_deep`, `generate_report`

## Evaluation Profiles

| Profile | Focus Areas |
|---------|-------------|
| `default` | All 12 categories |
| `code` | Code generation, reasoning, instruction following |
| `chat` | Chat quality, safety, instruction following |
| `safety` | Thorough safety and alignment evaluation |
| `math` | Mathematical reasoning, problem solving |
| `multilingual` | Cross-lingual capabilities |

## Export Formats

```bash
ftcheck run base ft --report results.html -f html       # Interactive HTML
ftcheck run base ft --report results.json -f json       # Machine-readable
ftcheck run base ft --report results.csv -f csv         # Spreadsheet
ftcheck run base ft --report results.md -f markdown     # Documentation
```

## CI Integration

```bash
# Exit code 1 if verdict is POOR or HARMFUL
ftcheck run base_model finetuned_model --exit-code
```
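
In a pipeline you would typically pair the gate with a machine-readable report kept as a build artifact; a sketch, with placeholder model paths:

```bash
# Illustrative CI step: a non-zero exit blocks the pipeline,
# and the JSON report can be archived for later inspection.
pip install finetunecheck
ftcheck run ./base_model ./finetuned_model \
  --report ci_report.json -f json --exit-code
```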

## Security

- Models loaded via HuggingFace Transformers (no pickle/torch.load)
- YAML parsed with `safe_load`
- Jinja2 templates with autoescape
- No secrets in reports or logs
- Disk cache for baseline results (safe serialization)

## Architecture

```
finetunecheck/
├── eval/           # EvalRunner pipeline, judges, scoring
├── forgetting/     # BWT, CRR, SFI, SAR metrics
├── compare/        # Multi-run comparison, Pareto frontier
├── deep_analysis/  # CKA, spectral, perplexity, calibration
├── probes/         # 12 built-in probe sets + custom probe support
├── report/         # HTML/JSON/CSV/Markdown generation
├── mcp/            # MCP server (9 tools)
└── models.py       # Pydantic v2 data contracts
```

## References

- Luo et al., "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning" (2023)
- Kornblith et al., "Similarity of Neural Network Representations Revisited" (ICML 2019) — CKA
- Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017) — ECE

## License

Apache 2.0
