Metadata-Version: 2.4
Name: model-auditor
Version: 0.1.9
Summary: A library for evaluating ML model performance across subgroups with stratified metrics and bootstrap confidence intervals
Author: Beatrice Brown-Mulry
License-Expression: MIT
Project-URL: Homepage, https://github.com/beatrice-b-m/model-auditor
Project-URL: Repository, https://github.com/beatrice-b-m/model-auditor
Project-URL: Issues, https://github.com/beatrice-b-m/model-auditor/issues
Keywords: machine-learning,model-evaluation,metrics,fairness,audit,classification,confidence-intervals,bootstrap
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.2
Requires-Dist: numpy>=2.1
Requires-Dist: scikit-learn>=1.5
Requires-Dist: tqdm>=4.0

# Model Auditor

A Python library for evaluating machine learning model performance across subgroups with support for stratified metrics, bootstrap confidence intervals, and hierarchical visualizations.

## Installation

```bash
pip install model-auditor
```

## Features

- **Stratified Evaluation**: Evaluate model metrics across different subgroups (e.g., by age, gender, region)
- **Bootstrap Confidence Intervals**: Calculate 95% confidence intervals for all supported metrics
- **Comprehensive Metrics**: Built-in support for classification metrics including:
  - Sensitivity, Specificity, Precision, Recall, F1 Score
  - AUROC, AUPRC
  - Matthews Correlation Coefficient (MCC)
  - F-beta Score (configurable beta)
  - TPR, TNR, FPR, FNR
  - Count metrics (N, TP, TN, FP, FN, Positive, Negative)
- **Threshold Optimization**: Automatic threshold selection using the Youden index
- **Hierarchical Visualization**: Generate data structures for sunburst/treemap plots
- **Extensible Design**: Protocol-based architecture for custom metrics

## Quick Start

```python
from model_auditor import Auditor
from model_auditor.metrics import Sensitivity, Specificity, AUROC, F1Score

# Initialize the auditor
auditor = Auditor()

# Add your data
auditor.add_data(df)

# Define stratification features
auditor.add_feature(name="age_group", label="Age Group")
auditor.add_feature(name="gender", label="Gender")

# Define the score column and threshold
auditor.add_score(name="risk_score", label="Risk Score", threshold=0.5)

# Define the outcome column
auditor.add_outcome(name="diagnosis", mapping={"positive": 1, "negative": 0})

# Set metrics to evaluate
auditor.set_metrics([
    Sensitivity(),
    Specificity(),
    AUROC(),
    F1Score()
])

# Run evaluation with bootstrap confidence intervals
results = auditor.evaluate_metrics(score_name="risk_score", n_bootstraps=1000)

# Convert results to a DataFrame
results_df = results.to_dataframe()
print(results_df)
```

## Threshold Optimization

Find the optimal decision threshold using the Youden index:

```python
auditor = Auditor()
auditor.add_data(df)
auditor.add_score(name="risk_score")
auditor.add_outcome(name="label")

# Find optimal threshold
optimal_threshold = auditor.optimize_score_threshold(score_name="risk_score")
# Output: Optimal threshold for 'risk_score' found at: 0.423
```
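
The Youden index is J = sensitivity + specificity − 1 (equivalently, TPR − FPR), maximized over candidate thresholds. As an illustrative sketch of the quantity being optimized (not the library's internals), the same threshold can be derived directly with scikit-learn:

```python
# Minimal sketch of Youden-index threshold selection, assuming binary labels
# in df["label"] and continuous scores in df["risk_score"].
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(df["label"], df["risk_score"])
j = tpr - fpr  # Youden's J = sensitivity + specificity - 1
best_threshold = thresholds[np.argmax(j)]
print(f"Optimal threshold: {best_threshold:.3f}")
```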

## Available Metrics

### Classification Metrics

| Metric | Class | Description |
|--------|-------|-------------|
| Sensitivity | `Sensitivity()` | TP / (TP + FN) |
| Specificity | `Specificity()` | TN / (TN + FP) |
| Precision | `Precision()` | TP / (TP + FP) |
| Recall | `Recall()` | TP / (TP + FN) |
| F1 Score | `F1Score()` | Harmonic mean of precision and recall |
| F-beta | `FBetaScore(beta=2.0)` | Weighted harmonic mean |
| MCC | `MatthewsCorrelationCoefficient()` | Matthews Correlation Coefficient |

### Ranking Metrics

| Metric | Class | Description |
|--------|-------|-------------|
| AUROC | `AUROC()` | Area Under ROC Curve |
| AUPRC | `AUPRC()` | Area Under Precision-Recall Curve |

### Rate Metrics

| Metric | Class | Description |
|--------|-------|-------------|
| TPR | `TPR()` | True Positive Rate |
| TNR | `TNR()` | True Negative Rate |
| FPR | `FPR()` | False Positive Rate |
| FNR | `FNR()` | False Negative Rate |

### Count Metrics

| Metric | Class | Description |
|--------|-------|-------------|
| N | `nData()` | Sample size |
| TP | `nTP()` | True positive count |
| TN | `nTN()` | True negative count |
| FP | `nFP()` | False positive count |
| FN | `nFN()` | False negative count |
| Positive | `nPositive()` | Positive class count |
| Negative | `nNegative()` | Negative class count |
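
Metrics from different tables can be mixed freely in a single `set_metrics()` call. A minimal sketch (only `Sensitivity`, `Specificity`, `AUROC`, and `F1Score` imports are shown elsewhere in this README; importing the remaining classes from `model_auditor.metrics` is an assumption):

```python
# Hypothetical combined metric set spanning classification, ranking,
# rate, and count metrics.
from model_auditor.metrics import Sensitivity, FBetaScore, AUPRC, FPR, nData

auditor.set_metrics([
    Sensitivity(),
    FBetaScore(beta=2.0),  # weights recall more heavily than precision
    AUPRC(),
    FPR(),                 # lower is better
    nData(),               # per-subgroup sample size
])
```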

## Custom Metrics

Create custom metrics by implementing the `AuditorMetric` protocol:

```python
from model_auditor.metrics import AuditorMetric
import pandas as pd

class AccuracyMetric(AuditorMetric):
    name = "accuracy"                  # key used for the result column
    label = "Accuracy"                 # shown when metric_labels=True
    inputs = ["tp", "tn", "fp", "fn"]  # per-row confusion-matrix columns consumed
    ci_eligible = True                 # include in bootstrap confidence intervals

    def data_call(self, data: pd.DataFrame) -> float:
        # Aggregate confusion-matrix counts over the subgroup being evaluated
        tp = data["tp"].sum()
        tn = data["tn"].sum()
        fp = data["fp"].sum()
        fn = data["fn"].sum()
        return (tp + tn) / (tp + tn + fp + fn)

# Use with the auditor
auditor.set_metrics([AccuracyMetric(), Sensitivity()])
```

## Hierarchical Visualization

Generate data for hierarchical plots (sunburst, treemap):

```python
from model_auditor.plotting import HierarchyPlotter

plotter = HierarchyPlotter()
plotter.set_data(df)
plotter.set_features(["region", "age_group", "gender"])
plotter.set_score(name="risk_score")
plotter.set_aggregator("median")  # or "mean", or a custom function

# Compile plot data
plot_data = plotter.compile(container="All Patients")

# Use with Plotly
import plotly.graph_objects as go

fig = go.Figure(go.Sunburst(
    labels=plot_data.labels,
    ids=plot_data.ids,
    parents=plot_data.parents,
    values=plot_data.values,
    marker=dict(colors=plot_data.colors)
))
fig.show()
```

### Custom Hierarchies

Define complex hierarchies with conditional features. Each `HItem` can carry a pandas-style `query` expression that scopes its branch to matching rows, so sibling items at the same level can describe different partitions of the data:

```python
from model_auditor.plotting.schemas import Hierarchy, HLevel, HItem

hierarchy = Hierarchy(levels=[
    HLevel([HItem(name="region")]),
    HLevel([
        HItem(name="urban_category", query="region == 'Urban'"),
        HItem(name="rural_category", query="region == 'Rural'")
    ]),
    HLevel([HItem(name="age_group")])
])

plotter.set_features(hierarchy)
```

## Disabling Confidence Intervals

For faster evaluation without confidence intervals:

```python
results = auditor.evaluate_metrics(score_name="risk_score", n_bootstraps=None)
```

## Output Format

Results are returned as nested dataclass objects that can be converted to DataFrames:

```python
# Get results as DataFrame
df = results.to_dataframe(n_decimals=3, metric_labels=True)

# Access specific feature results
gender_results = results.features["gender"].to_dataframe()

# Access specific level results
male_results = results.features["gender"].levels["Male"].to_dataframe()
```

## Controlling Feature Level Order

By default, feature levels appear in the order they were encountered in the
data.  To control the row order in exported DataFrames, assign the feature
column a `pd.Categorical` dtype with an explicit `categories` list before
passing the data to the auditor:

```python
import pandas as pd
from model_auditor import Auditor
from model_auditor.metrics import Sensitivity, Specificity

# Declare the desired display order for the 'age_group' column.
# Categories not present in the data still appear as rows (with NaN values).
df["age_group"] = pd.Categorical(
    df["age_group"],
    categories=["<30", "30-50", "50-70", ">70"],
    ordered=True,
)

auditor = Auditor()
auditor.add_data(df)
auditor.add_feature(name="age_group")
auditor.add_score(name="risk_score", threshold=0.5)
auditor.add_outcome(name="outcome")
auditor.set_metrics([Sensitivity(), Specificity()])

results = auditor.evaluate_metrics(score_name="risk_score", n_bootstraps=None)

# Rows appear in the declared order: <30, 30-50, 50-70, >70.
# If no rows belong to a declared category (e.g. '>70' is absent from the
# data), that category still appears as a row with NaN metric values.
df_out = results.features["age_group"].to_dataframe()
```

The same order is preserved in `style_dataframe()` and in the score-level
`ScoreEvaluation.to_dataframe()` / `ScoreEvaluation.style_dataframe()` exports.
Non-categorical feature columns are unaffected.



## Error Analysis

Use `evaluate_errors()` to understand which subgroups are over- or
under-represented within each confusion-matrix group (TP, TN, FP, FN).
For every feature level the *representation ratio* is computed:

    ratio = P(level | group) / P(level | full dataset)

A ratio of 1.0 means proportional representation; values above 1.0 indicate
over-representation in that confusion group, and values below 1.0 indicate
under-representation.
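
As a sanity check, the ratio for one level can be computed by hand. The sketch below is illustrative only (it assumes a 0/1 `label` column and a 0.5 threshold on `risk_score`; it is not part of the library API):

```python
# False positives: predicted positive (score >= threshold) but actually negative.
fp_mask = (df["risk_score"] >= 0.5) & (df["label"] == 0)

# P(level | group) / P(level | full dataset) for the '<30' age group within FP.
p_level_in_fp = (df.loc[fp_mask, "age_group"] == "<30").mean()
p_level_overall = (df["age_group"] == "<30").mean()
ratio = p_level_in_fp / p_level_overall  # > 1.0 means over-represented among FPs
```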

```python
# No additional metric setup required — evaluate_errors() uses RepresentationRatio
# by default.
error_results = auditor.evaluate_errors(score_name="risk_score", n_bootstraps=1000)

# Convert to a wide analysis-ready DataFrame.
# Rows: MultiIndex(feature, level)
# Columns: MultiIndex(section, metric)
#   Sections: "Class Balance", "Overall", "TP", "TN", "FP", "FN"
#   Sub-columns per group: N, % overall, % group, representation_ratio
df = error_results.to_dataframe()
print(df)

# Use metric_labels=True for human-readable column names:
df_labels = error_results.to_dataframe(metric_labels=True)

# Per-group deep inspection is still available:
tp_age = error_results.groups["tp"].features["age_group"]
print(tp_age.to_dataframe())
```

## Notebook Styling

For Jupyter notebooks, `style_dataframe(...)` returns a pandas `Styler` that colours cells by relative performance tier within each metric column.

```python
# Colour all levels in a feature by relative tier (default: performance metrics only)
display(results.features['age_group'].style_dataframe(n_decimals=3, metric_labels=True))

# Also colour count columns (N, TP, TN, …)
display(results.features['gender'].style_dataframe(include_count_metrics=True))

# Opt into custom colours
display(results.style_dataframe(
    low_color="#ffd6d6",
    medium_color="#fff9c4",
    high_color="#d0f0d0",
))
```

### Tier assignment

| Tier | Default colour | Meaning |
|------|---------------|----------|
| High | `#d4edda` (green) | Top third of values in the column |
| Medium | `#fff3cd` (yellow) | Middle third |
| Low | `#f8d7da` (red) | Bottom third |

Tiers are computed **per metric column** across all rows in the table. Lower-is-better metrics (`fpr`, `fnr`) are inverted: a lower value receives the high (green) tier.
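
One plausible reading of this rule, sketched for illustration (not the library's implementation), splits the rows of each metric column into thirds by percentile rank:

```python
import pandas as pd

def assign_tiers(col: pd.Series, lower_is_better: bool = False) -> pd.Series:
    """Assign 'low'/'medium'/'high' tiers by rank thirds within one metric column."""
    ranked = col.rank(pct=True)      # percentile rank in (0, 1]
    if lower_is_better:              # e.g. fpr, fnr: best values rank lowest
        ranked = 1.0 - ranked
    return pd.cut(
        ranked,
        bins=[0, 1 / 3, 2 / 3, 1],
        labels=["low", "medium", "high"],
        include_lowest=True,
    )
```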

### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `n_decimals` | `3` | Decimal places for numeric display |
| `metric_labels` | `False` | Use metric labels as column headers instead of names |
| `include_count_metrics` | `False` | Also style count columns (N, TP, TN, FP, FN, Pos., Neg.) |
| `low_color` | `"#f8d7da"` | Background colour for low-tier cells |
| `medium_color` | `"#fff3cd"` | Background colour for medium-tier cells |
| `high_color` | `"#d4edda"` | Background colour for high-tier cells |


## License

MIT License

## Author

Beatrice Brown-Mulry
