Metadata-Version: 2.4
Name: data-job-healer
Version: 0.1.0
Summary: Self-healing agent for Databricks jobs using MLflow and Databricks Mosaic AI
Author: Data Engineering Team
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: databricks-sdk>=0.20.0
Requires-Dist: httpx>=0.26.0
Requires-Dist: jira>=3.5.0
Requires-Dist: mlflow>=2.10.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pygithub>=2.1.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == 'dev'
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.2.0; extra == 'dev'
Provides-Extra: mcp
Requires-Dist: fastmcp>=2.0.0; extra == 'mcp'
Description-Content-Type: text/markdown

# Data Job Healer

An autonomous self-healing agent for Databricks notebook jobs. When a job fails, it analyzes the error, generates a fix using Databricks Mosaic AI, validates it in an isolated test environment, and opens a GitHub PR or creates a JIRA ticket — automatically.

---

## Features

- **Automatic failure detection** — polls the Databricks Jobs API for failed runs
- **LLM-powered fix generation** — uses Databricks Foundation Model APIs (Llama 4 Maverick) to analyze errors and generate code fixes
- **Retry loop with feedback** — retries fix generation up to 3 times, passing validation errors back to the LLM as context
- **Isolated test validation** — runs fixes against sampled production data (1,000 rows) on serverless compute before applying anything
- **Smart routing** — creates a GitHub PR for fixable errors; opens a JIRA ticket for anything requiring human intervention; notifies via Slack in both cases
- **Experience store** — records past fix attempts in Unity Catalog and injects successful patterns into future LLM prompts
- **Full audit trail** — logs every healing attempt to Unity Catalog and MLflow

---

## Prerequisites

- Databricks workspace with **Jobs Serverless** and **Unity Catalog** enabled
- [Databricks CLI v0.200+](https://docs.databricks.com/dev-tools/cli/install.html) installed and configured
- A Databricks Foundation Model endpoint (e.g., `databricks-llama-4-maverick`)
- Python 3.10+
- `uv` (recommended) or `pip`
- GitHub Personal Access Token with `repo` scope _(optional, for PR creation)_
- JIRA API token _(optional, for ticket creation)_
- Slack Incoming Webhook URL _(optional, for notifications)_

---

## Step-by-Step Setup

### 1. Install the Databricks CLI

```bash
# macOS / Linux
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Verify installation
databricks --version
```

### 2. Authenticate with your workspace

```bash
databricks configure
# Prompts for:
#   Databricks Host: https://your-workspace.cloud.databricks.com
#   Token: dapi...
```

Or use environment variables:

```bash
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
```
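The Databricks SDK for Python resolves the same environment variables (or your CLI profile) through unified authentication, so a quick way to confirm the credentials work is a short probe like the following sketch:

```python
from databricks.sdk import WorkspaceClient

# Resolves credentials from DATABRICKS_HOST / DATABRICKS_TOKEN or ~/.databrickscfg
w = WorkspaceClient()
print(w.current_user.me().user_name)
```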

### 3. Clone the repository

```bash
git clone https://github.com/your-org/data-job-healer.git
cd data-job-healer
```

### 4. Deploy the bundle

```bash
# Install build tool if needed
pip install build

# Deploy to your workspace (creates jobs, uploads artifacts)
databricks bundle deploy -t dev
```

This will:
- Build the Python wheel (`data_job_healer-0.1.0-py3-none-any.whl`)
- Upload notebooks and the wheel to your workspace
- Create two scheduled jobs: **Data Job Healer - Monitor** (every 5 min) and **Feedback Ingestion** (every 6 hours)

### 5. Set up the test schema

Run the setup notebook once to create the required Unity Catalog schema and tables:

```bash
databricks workspace import \
  /Workspace/Users/$(databricks current-user me | jq -r .userName)/setup_test_environment \
  --file notebooks/setup_test_environment.py \
  --language PYTHON --format SOURCE

# Then open the imported notebook in the workspace UI and run all cells
```

This creates `<your_catalog>._test_healing` with the following tables:

| Table | Purpose |
|---|---|
| `_sampled_<table>` | Sampled production data (1,000 rows per table) |
| `_test_metadata` | Tracks sampled tables and cache status |
| `_test_results` | Test execution results per fix attempt |
| `_healing_log` | Full audit log of all healing attempts |
| `_healing_experiences` | Experience store for LLM few-shot learning |
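To confirm the schema and tables were created, run a quick check from any workspace notebook (a minimal sketch; replace the catalog name with the one you use as `test_catalog`):

```python
# Run in a Databricks notebook after the setup notebook completes
catalog = "your_catalog"  # must match the test_catalog job parameter
spark.sql(f"SHOW TABLES IN {catalog}._test_healing").show(truncate=False)
```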

### 6. Configure job parameters

Update the job parameters either in the **Databricks UI** (`Workflows → Jobs → Data Job Healer - Monitor → Edit → Tasks → Parameters`) or directly in `databricks.yml` before deploying.

**Required:**

| Parameter | Description |
|---|---|
| `test_catalog` | Your Unity Catalog name (e.g., `databricks_healer`) |
| `databricks_model_endpoint` | LLM endpoint (e.g., `databricks-llama-4-maverick`) |

**Integrations (all optional):**

| Parameter | Description |
|---|---|
| `github_token` | GitHub PAT — enables PR creation |
| `github_org` | GitHub org for auto-detecting repo from notebook path |
| `SLACK_WEBHOOK_URL` | Slack Incoming Webhook URL |
| `JIRA_URL` | JIRA instance URL (e.g., `https://your-org.atlassian.net`) |
| `JIRA_USER` | JIRA user email |
| `JIRA_API_TOKEN` | JIRA API token |
| `JIRA_PROJECT_KEY` | JIRA project key (e.g., `OPS`) |

**Behavior (optional):**

| Parameter | Default | Description |
|---|---|---|
| `auto_create_pr` | `false` | Automatically open GitHub PRs |
| `auto_create_ticket` | `false` | Automatically create JIRA tickets |
| `heal_lookback_minutes` | `60` | How far back to scan for failures |
| `heal_limit` | `10` | Max failures to process per run |

After editing `databricks.yml`, redeploy:

```bash
databricks bundle deploy -t dev
```

---

## Usage

### Automatic (scheduled)

After deploying, the monitor job runs every 5 minutes automatically. No action needed.

### Manual trigger via CLI

```bash
# Get the job ID
databricks jobs list | grep "Data Job Healer"

# Trigger a run
databricks jobs run-now --job-id <job_id>
```
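If you prefer the Python SDK over the CLI, a rough equivalent looks like this (a sketch using the standard `databricks-sdk` Jobs API; the job name comes from the bundle):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Find the monitor job by name and trigger a run
for job in w.jobs.list(name="Data Job Healer - Monitor"):
    w.jobs.run_now(job_id=job.job_id)
    print(f"Triggered job {job.job_id}")
```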

### Heal a specific failed run

```bash
# Via CLI (if the wheel is installed locally)
heal heal <run_id>

# List recent failures
heal list --hours 1 --limit 10
```

### Programmatic usage (in a notebook)

```python
from data_job_healer.agents import HealerAgent

agent = HealerAgent()

result = agent.heal_job_run(run_id=12345, auto_create_pr=True, auto_create_ticket=False)
print(f"Success: {result.success}")
print(f"PR URL: {result.pr_url}")
```

### View results

```sql
-- Unity Catalog audit log
SELECT timestamp, run_id, job_name, error_category, action_taken, success, pr_url
FROM <your_catalog>._test_healing._healing_log
ORDER BY timestamp DESC
LIMIT 50;
```
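Healing attempts are also traced to MLflow. A minimal sketch for pulling recent runs, assuming the healer logs to an experiment at `/Shared/data-job-healer` (adjust the path to match your deployment):

```python
import mlflow

# Hypothetical experiment path; point this at wherever your deployment logs healer runs
runs = mlflow.search_runs(experiment_names=["/Shared/data-job-healer"])
print(runs[["run_id", "status", "start_time"]].head())
```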

---

## Architecture

```
Failed Databricks Job Run
        │
        ▼
┌───────────────┐
│  Job Monitor  │  Polls Jobs API for failures in the last N minutes
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Error Analyzer│  Parses logs → classifies error → fixable or not?
└──────┬────────┘
       │
   ┌───┴──────────────────────────┐
   │ Fixable                      │ Unfixable (OOM, timeout, permissions)
   ▼                              ▼
┌──────────────┐           ┌─────────────┐
│ Data Sampler │           │ JIRA Ticket │ + Slack notification
│ (1k rows/tbl)│           └─────────────┘
└──────┬───────┘
       │
       ▼
┌──────────────┐   fail    ┌────────────────────────────────────┐
│ Fix Generator│──────────▶│ Retry (up to 3x with LLM feedback) │
│  (LLM)       │◀──────────└────────────────────────────────────┘
└──────┬───────┘
       │ pass
       ▼
┌──────────────┐
│ Fix Validator│  Runs fix in isolated test notebook on serverless compute
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  GitHub PR   │ + Slack notification + Unity Catalog log + MLflow trace
└──────────────┘
```

### Key components

| Component | File | Role |
|---|---|---|
| `HealerAgent` | `agents/healer_agent.py` | Main orchestrator |
| `FixGenerator` | `agents/fix_generator.py` | LLM-based fix generation with retry |
| `FixValidator` | `agents/fix_validator.py` | Validates fixes in test env + security scan |
| `ErrorParser` | `databricks/error_parser.py` | Classifies 18 error categories |
| `DataSampler` | `testing/data_sampler.py` | Samples production tables |
| `OutcomeHandler` | `agents/outcome_handler.py` | Creates PRs, tickets, Slack messages |
| `ExperienceStore` | `databricks/experience_store.py` | UC-based few-shot learning |
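`HealerAgent` wires these components together along the flow in the diagram. The retry-with-feedback loop, stripped to its essence, looks roughly like this (illustrative only; the actual method names live in `agents/healer_agent.py` and may differ):

```python
MAX_ATTEMPTS = 3

def heal(error_context, generator, validator):
    """Illustrative generate -> validate -> retry loop, not the project's implementation."""
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        fix = generator.generate(error_context, feedback=feedback)  # LLM call
        result = validator.validate(fix)                            # runs against sampled data
        if result.passed:
            return fix
        feedback = result.error  # fed back into the next prompt
    return None  # give up; route to JIRA / human review
```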

### Security

The validator scans every LLM-generated fix before running it. These patterns are rejected outright:

`%sh`, `%fs`, `dbutils.secrets`, `dbutils.notebook.run`, `os.system`, `subprocess`, `__import__`, `eval`, `exec`
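A minimal sketch of such a blocklist scan (illustrative; the real scan lives in `FixValidator` and may be more sophisticated):

```python
BLOCKED_PATTERNS = [
    "%sh", "%fs", "dbutils.secrets", "dbutils.notebook.run",
    "os.system", "subprocess", "__import__", "eval", "exec",
]

def is_safe(fix_code: str) -> bool:
    """Reject the generated fix outright if it contains any blocked pattern."""
    return not any(pattern in fix_code for pattern in BLOCKED_PATTERNS)
```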

---

## Supported Error Types

| Category | Auto-fixable |
|---|---|
| `SYNTAX_ERROR`, `SQL_ERROR` | Yes |
| `IMPORT_ERROR`, `NAME_ERROR`, `TYPE_ERROR` | Yes |
| `COLUMN_NOT_FOUND`, `SCHEMA_MISMATCH` | Yes (with data sampling) |
| `OUT_OF_MEMORY`, `TIMEOUT`, `PERMISSION_ERROR` | No → JIRA ticket |
| `CLUSTER_ERROR`, `SPARK_ERROR` | No → JIRA ticket |
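Routing on these categories is essentially a set-membership check. A sketch of the decision, using the category names above (the real mapping lives in `ErrorParser` and `OutcomeHandler`):

```python
AUTO_FIXABLE = {
    "SYNTAX_ERROR", "SQL_ERROR", "IMPORT_ERROR", "NAME_ERROR",
    "TYPE_ERROR", "COLUMN_NOT_FOUND", "SCHEMA_MISMATCH",
}

def route(category: str) -> str:
    """Send fixable errors to the fix pipeline, everything else to JIRA."""
    return "generate_fix" if category in AUTO_FIXABLE else "create_jira_ticket"
```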

---

## Troubleshooting

**`ModuleNotFoundError: No module named 'data_job_healer'`**
The monitor notebook installs the wheel at runtime. Re-run `databricks bundle deploy -t dev` and check that the artifact uploaded successfully.

**Validation always fails**
Query the healing log for the specific `validation_error`:
```sql
SELECT validation_error FROM <catalog>._test_healing._healing_log WHERE run_id = <run_id>;
```

**GitHub PR creation fails**
- Confirm `github_token` has `repo` scope
- Confirm the notebook is stored under `/Repos/{user}/{repo}/...`
- Confirm `github_org` matches exactly

**Data sampling fails**
- Check Unity Catalog permissions on the source tables
- Verify `test_catalog` exists and is accessible

---

## Local Development

```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Lint / format
ruff check src/
black src/

# Type check
mypy src/

# Build wheel manually
python -m build --wheel
```

---

## License

MIT