Metadata-Version: 2.4
Name: deepvariance-sdk
Version: 1.0.1
Summary: DeepVariance Python AutoML SDK — LLM-driven pipelines for tabular ML and image classification
Author: Deep Variance Dev Team
License-Expression: LicenseRef-proprietary
Project-URL: Homepage, https://deepvariance.com
Keywords: automl,llm,machine-learning,deep-learning,autogluon,pytorch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.3
Requires-Dist: psutil>=5.9
Requires-Dist: openai>=1.0
Requires-Dist: groq>=0.9
Requires-Dist: autogluon.tabular>=1.0
Requires-Dist: torch>=2.0
Requires-Dist: torchvision>=0.15
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: cython>=3.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.2; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.23; extra == "docs"
Dynamic: license-file

# DeepVariance SDK

**DeepVariance** is a Python AutoML SDK that combines LLM-driven code generation with [AutoGluon](https://auto.gluon.ai/) to automatically cast, clean, sample, preprocess, and train ML models on any tabular dataset — with a single `pipeline.run()` call.

---

## Table of Contents

- [How it works](#how-it-works)
- [Requirements](#requirements)
- [Installation](#installation)
- [Configuration](#configuration)
- [Quickstart](#quickstart)
- [Pipeline output](#pipeline-output)
- [PipelineConfig reference](#pipelineconfig-reference)
- [Progress callbacks](#progress-callbacks)
- [Build](#build)
- [Documentation](#documentation)
- [Development](#development)

---

## How it works

The `MLPipeline` executes 7 sequential layers against your DataFrame:

| #   | Layer                      | Type                 | What it does                                                     |
| --- | -------------------------- | -------------------- | ---------------------------------------------------------------- |
| 1   | `AutoCastLayer`            | LLM → code           | Infers and applies column types, encodes categoricals            |
| 2   | `DataProfilingLayer`       | Deterministic        | Computes feature + target statistics                             |
| 3   | `CorrelationLayer`         | Deterministic        | Pearson correlation matrix + mutual information scores           |
| 4   | `SamplingLayer`            | LLM → code           | Produces a stratified, representative sample                     |
| 5   | `PreprocessingLayer`       | LLM → code           | Generates and applies pandas transforms (imputation, scaling, …) |
| 6   | `ModelRecommendationLayer` | LLM → recommendation | Selects the best AutoGluon model codes for your task             |
| 7   | `ModelTrainingLayer`       | Deterministic        | Trains and evaluates a `TabularPredictor`, returns metrics       |

LLM-driven layers use a **retry loop** — if the generated code raises an exception, the error is fed back to the LLM for self-correction.
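
The retry loop can be sketched roughly as follows. This is an illustrative sketch, not the SDK's actual internals — `run_llm_layer`, `llm.generate`, and `max_retries` are hypothetical names:

```python
import traceback

def run_llm_layer(prompt, df, llm, max_retries=3):
    """Run one LLM-generated transform; feed failures back for self-correction."""
    last_error = None
    for _ in range(max_retries):
        # On a retry, append the previous traceback so the LLM can self-correct.
        full_prompt = (prompt if last_error is None
                       else f"{prompt}\n\nPrevious attempt failed:\n{last_error}")
        code = llm.generate(full_prompt)
        try:
            namespace = {"df": df}
            exec(code, namespace)    # run the generated transform
            return namespace["df"]   # transformed result
        except Exception:
            last_error = traceback.format_exc()
    raise RuntimeError(f"Layer failed after {max_retries} attempts:\n{last_error}")
```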

---

## Requirements

- Python ≥ 3.12
- A **DeepVariance API key** — email [founders@deepvariance.com](mailto:founders@deepvariance.com) or fill the contact form at [deepvariance.com](https://deepvariance.com)
- An **OpenAI** or **Groq** API key

---

## Installation

```bash
pip install deepvariance-sdk
```

Dependencies installed automatically: `pandas`, `numpy`, `scipy`, `scikit-learn`, `psutil`, `openai`, `groq`, `autogluon.tabular`, `torch`, `torchvision`

### Dev install (from source)

```bash
git clone https://github.com/deepvariance/deepvariance-sdk
cd deepvariance-sdk
uv venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"             # installs all deps + pytest, ruff, cython
```

---

## Configuration

The SDK reads credentials from environment variables. Set them in your shell
before running:

```bash
export DV_API_KEY=your-deepvariance-api-key
export OPENAI_API_KEY=sk-...
export GROQ_API_KEY=gsk_...   # fallback if OpenAI key is absent
```

The SDK resolves LLM providers in order: **OpenAI → Groq**. You only need one.
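
The resolution order amounts to a simple environment check; a minimal sketch (`resolve_provider` is an illustrative helper, not an SDK function):

```python
import os

def resolve_provider() -> str:
    """Illustrates the documented resolution order: OpenAI first, then Groq."""
    if os.getenv("OPENAI_API_KEY"):
        return "openai"
    if os.getenv("GROQ_API_KEY"):
        return "groq"
    raise RuntimeError("Set OPENAI_API_KEY or GROQ_API_KEY before running the pipeline")
```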

### Optional: load from a `.env` file (local dev)

`python-dotenv` is not required by the SDK, but it is a convenient way to
manage keys during local development.

```bash
pip install python-dotenv
```

Create a `.env` file at the project root (see `.env.example`):

```dotenv
# .env
DV_API_KEY=dv_...
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk_...
```

Then load it at the top of your script, **before** constructing `PipelineConfig`:

```python
from dotenv import load_dotenv
load_dotenv()          # reads .env into os.environ

import os
from deepvariance.pipelines.ml import MLPipeline
from deepvariance.typings import PipelineConfig

config = PipelineConfig(
    dv_api_key=os.getenv("DV_API_KEY"),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)
```

> **Never commit your `.env` file.** Add it to `.gitignore`:
> ```
> .env
> ```
> The `.env.example` file in the repo root shows all available environment variables.

---

## Quickstart

```python
import os
import pandas as pd

from deepvariance.pipelines.ml import MLPipeline
from deepvariance.typings import PipelineConfig

# 1. Load your data
data = pd.read_csv("your_dataset.csv")

# 2. Configure
config = PipelineConfig(
    dv_api_key=os.getenv("DV_API_KEY"),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    groq_api_key=os.getenv("GROQ_API_KEY"),
    sample_percentage=0.1,   # train on a 10% stratified sample
)

# 3. Run
pipeline = MLPipeline(config=config)
result = pipeline.run(data, target="your_target_column")

# 4. Inspect results
print(result["metrics"])
print(result["leaderboard"])
```

Run the bundled examples directly:

```bash
# Binary classification — Australia weather dataset
.venv/bin/python examples/ml_quickstart.py

# Regression — medical insurance dataset
.venv/bin/python examples/insurance_regression.py
```

---

## Pipeline output

`pipeline.run()` returns a dict:

| Key                  | Type                   | Description                                         |
| -------------------- | ---------------------- | --------------------------------------------------- |
| `metrics`            | `dict[str, float]`     | Accuracy, F1, ROC-AUC, RMSE, R², … (task-dependent) |
| `model`              | `TabularPredictor`     | Trained AutoGluon predictor                         |
| `leaderboard`        | `pd.DataFrame`         | All candidate models ranked by validation score     |
| `feature_importance` | `pd.DataFrame \| None` | Feature importance scores from the best model       |
| `run_stats`          | `dict`                 | Wall-clock duration and peak memory per layer       |
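
A minimal post-run inspection sketch. The `result` values below are illustrative stand-ins, and the per-layer `run_stats` schema shown is an assumption, not documented SDK output:

```python
# Illustrative stand-in for pipeline.run(...) output — real values are task-dependent.
result = {
    "metrics": {"accuracy": 0.91, "f1_macro": 0.88, "roc_auc": 0.95},
    "run_stats": {"ModelTrainingLayer": {"seconds": 42.0, "peak_mb": 512}},
}

# Pick the headline metric and report per-layer cost.
best_metric = max(result["metrics"], key=result["metrics"].get)
print(f"best metric: {best_metric} = {result['metrics'][best_metric]:.2f}")

for layer, stats in result["run_stats"].items():
    print(f"{layer}: {stats['seconds']:.1f}s, peak {stats['peak_mb']} MB")
```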

### Classification metrics

`accuracy`, `f1_macro`, `f1_weighted`, `precision_macro`, `precision_weighted`, `recall_macro`, `recall_weighted`, `cohen_kappa`, `mcc`, `roc_auc` (binary) / `roc_auc_ovr` (multiclass), `log_loss`

### Regression metrics

`rmse`, `mae`, `r2`, `median_ae`, `max_error`, `explained_var`, `mape`
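
For reference, the core regression metrics reduce to simple arithmetic. A pure-Python sketch of what the pipeline reports (the pipeline computes these for you via scikit-learn/AutoGluon):

```python
import math

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

n = len(y_true)
errors = [p - t for p, t in zip(y_pred, y_true)]

mae = sum(abs(e) for e in errors) / n             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error

mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)               # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot                          # coefficient of determination
```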

---

## PipelineConfig reference

```python
@dataclass
class PipelineConfig:
    dv_api_key: str | None = None           # DeepVariance API key (or set DV_API_KEY env var)
    openai_api_key: str | None = None       # OpenAI API key
    groq_api_key: str | None = None         # Groq API key (fallback)
    sample_percentage: float | None = None  # e.g. 0.1 → 10% sample fed to AutoGluon
    extra: dict[str, Any] = field(default_factory=dict)  # pipeline-specific overrides
```

`sample_percentage` controls the fraction of rows passed to AutoGluon after the LLM sampling stage. For large datasets (> 100k rows) a value of `0.1`–`0.2` keeps training fast while preserving the data distribution.
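
For example, a configuration for a large dataset might look like this. The `extra` keys are pipeline-specific overrides; the `time_limit` key shown here is a hypothetical example, not a documented option:

```python
import os
from deepvariance.typings import PipelineConfig

config = PipelineConfig(
    dv_api_key=os.getenv("DV_API_KEY"),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    sample_percentage=0.15,         # ~75k of 500k rows reach AutoGluon
    extra={"time_limit": 600},      # hypothetical per-pipeline override
)
```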

---

## Progress callbacks

Pass an `on_progress` callable to get real-time stage updates:

```python
def on_progress(stage: str, status: str) -> None:
    # stage  — e.g. "AutoCastLayer", "ModelTrainingLayer"
    # status — "start" | "complete" | "error"
    icon = {"start": "▶", "complete": "✓", "error": "✗"}.get(status, "·")
    print(f"  {icon}  {stage}: {status}")

result = pipeline.run(data, target="label", on_progress=on_progress)
```

---

## Build

The release wheel compiles all source to native C extensions via Cython —
no Python source is included in the distributed package.

```bash
# Install build dependencies (one-time)
uv pip install -e ".[dev]"

# Compile extensions in-place (for local dev / running tests against .so)
just build-ext

# Build a release wheel (compiled .so only, no .py source)
just build-wheel
# → dist/deepvariance_sdk-1.0.1-cp312-cp312-macosx_10_9_universal2.whl
```

For CI, build on each target platform (macOS arm64, Linux x86_64) and upload
all wheels to PyPI so users get the right binary for their machine.

---

## Documentation

The project includes **Sphinx-based documentation** under the `docs/` directory. To build the HTML locally:

```bash
# install docs dependencies (optional group)
uv pip install -e ".[docs]"    # or use pip/poetry/uv manually
cd docs
make html             # requires make; or run `sphinx-build -b html . _build/html`
```

The generated site will appear in `docs/_build/html/index.html`.

See `docs/quickstart.rst` for a getting-started guide and `docs/api.rst` for
an auto-generated API reference.

---

## Development

```bash
# Run tests
.venv/bin/python -m pytest tests/ -q

# Lint
.venv/bin/ruff check src/ tests/

# Format
.venv/bin/ruff format src/ tests/
```

All lint rules are configured in `pyproject.toml` under `[tool.ruff]`.
