Metadata-Version: 2.4
Name: marco-dvcs
Version: 0.1.52
Summary: A minimal dataset versioning system for text data with a focus on reproducibility.
Home-page: https://github.com/Team-Marco-ACM/marco-package
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: Flask
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Marco Dataset Versioning System

A minimal dataset versioning system for text data with a strong focus on reproducibility and transparency. Treat your text datasets like code — immutable, versioned, reproducible, and explainable.

Marco is a lightweight Python library: initialize it in *any* machine learning project folder to version and preprocess your datasets safely, without altering your original files.

---

## 🚀 Installation (Linux / macOS)

On modern Linux distributions (Arch Linux, Ubuntu 23.04+), PEP 668 requires Python packages to be installed inside a virtual environment to prevent conflicts with system packages.

Follow these steps to safely install Marco into your ML project:

1. **Navigate to your ML project folder** (e.g. your bag-of-words project):
   ```bash
   cd ~/projects/my-bag-of-words-model
   ```

2. **Create and activate a Python Virtual Environment**:
   ```bash
   # Create a virtual environment named 'venv'
   python3 -m venv venv

   # Activate it (required every time you open a new terminal in this folder)
   source venv/bin/activate
   ```
   > You should now see `(venv)` at the start of your terminal prompt.

3. **Install Marco**:
   ```bash
   pip install marco-dvcs
   ```

---

## 🛠️ Usage Guide

Once `marco` is installed in your virtual environment, you have access to the full CLI.

### 1. Initialize a Repository

Initialize Marco tracking in your current directory. This creates a `.marco/` data versioning environment specific to that project.
```bash
marco init
```
This generates a single **Root Node** linking all future lineage chains, creates the registry files, and provisions a `raw/` directory to store your raw data.
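
For orientation, here is an illustrative layout of a Marco repository, assembled from the file names mentioned throughout this README (the exact tree may differ between releases):
```text
my-project/
├── raw/                          # untouched source data
└── .marco/
    ├── refs.json                 # tag registry
    ├── verification_log.jsonl    # append-only verification audit trail
    └── versions/
        └── <version_id>/
            ├── manifest.json
            └── raw_stats.json
```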


### 2. Create an Immutable Version

Upload a text/CSV/TSV dataset to create an immutable version. Marco computes a cryptographically secure SHA-256 hash from the raw data combined with the preprocessing configuration. It groups dataset versions into **Lineage Chains** based on the hash of the raw source data, auto-linking parents and children up to the Root Node.
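
As a minimal sketch of that scheme (not Marco's exact implementation; the config keys are made up for illustration), a version identifier of this kind can be derived like so:
```python
import hashlib
import json

def version_hash(raw_bytes: bytes, config: dict) -> str:
    """Sketch: derive a deterministic version ID from raw data + config."""
    h = hashlib.sha256()
    h.update(raw_bytes)
    # sort_keys makes the serialization independent of dict insertion order,
    # so identical pipelines always produce identical hashes
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return h.hexdigest()

# Hypothetical config keys, for illustration only
print(version_hash(b"label\ttext\n", {"lowercase": True, "dedup": True})[:8])
```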

**Interactive Mode** — if you don't supply a config file, Marco guides you through building the preprocessing pipeline (lowercasing, tokenization, stopword removal, deduplication):
```bash
marco upload my_dataset.csv -t v1-raw
```

**Config Mode** — supply a JSON config directly:
```bash
marco upload my_dataset.csv -c my_config.json -t v1-processed
```

**With Training Results Tracking** — attach any training results file (e.g., `.txt`, `.csv`, `.json`) to track model performance outcomes directly inside the version. The original file extension is preserved:
```bash
marco upload my_dataset.csv -t v1-trained -r ./results/training_results_report.txt
```

> **Note**: During upload, Marco automatically generates a `raw_stats.json` artifact inside the version directory. This file captures pristine, pre-pipeline dataset statistics — including vocabulary size, document length distributions (avg, median, std, min, max), and label distributions.
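
A hypothetical example of that artifact. The field names below are illustrative; the note above only guarantees vocabulary size, document length distributions, and label distributions:
```json
{
  "vocab_size": 300,
  "doc_length": {"avg": 20.0, "median": 18, "std": 6.4, "min": 3, "max": 61},
  "label_distribution": {"positive": 25, "negative": 15, "neutral": 10}
}
```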

### 3. Attach Training Results (Decoupled Workflow)

If you need to train your model on Marco's preprocessed data *before* you have the results, you can decouple the upload from the result tracking:

1. **Extract** the preprocessed data to your workspace:
   ```bash
   marco restore v1-raw -o ./my_training_data.tsv
   ```
2. **Train** your model on `my_training_data.tsv` to generate a result file (e.g., `results.txt`).
3. **Attach** the results back to your existing dataset version:
   ```bash
   marco attach-results v1-raw results.txt
   ```
*(Note: You can also use the shorter `marco attach` alias)*

### 4. List Versions

View all versions you've created, along with their tags, chains, and timestamps.
```bash
marco list
```

### 5. View Lineage Tree

View an ASCII-art representation of your project's history, tracing every version back through its lineage chain to the project's central **Root Node**.
```bash
marco lineage
```

### 6. Restore / Checkout Data

Extract the processed dataset from Marco's storage back into your active workspace for model training.
```bash
marco restore v1-processed -o ./training_data.tsv
```

### 7. Export / Import Versions

Easily share dataset versions with teammates by packing them into `.tar.gz` archives.
```bash
# Export version 'v1-raw' to the 'exports' folder
marco export v1-raw ./exports/

# Import an archive received from a teammate
marco import ./exports/marco_version_e5e0b767.tar.gz
```

### 8. Delete Versions

Delete a dataset version to recover disk space. Marco updates the lineage (`parents` history) of any descendant versions so that your Git-style history tree remains intact; the re-parenting rule is sketched after the commands below.
```bash
marco delete v1-raw
# or by hash
marco rm e5e0b767
```
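
The re-parenting rule is simple to state: every descendant of the deleted version inherits that version's own parents. A minimal sketch over a hypothetical `{child: [parents]}` map, not Marco's actual data model:
```python
def reparent(lineage: dict, deleted: str) -> dict:
    """Splice a deleted version out of a {child: [parents]} map so that
    descendants point at the deleted node's own parents instead."""
    inherited = lineage.pop(deleted, [])
    for child, parents in lineage.items():
        if deleted in parents:
            new_parents = []
            for p in parents:
                # Replace the deleted node with its inherited parents,
                # keeping order and avoiding duplicates
                for r in (inherited if p == deleted else [p]):
                    if r not in new_parents:
                        new_parents.append(r)
            lineage[child] = new_parents
    return lineage

# root -> v1 -> v2: deleting v1 re-parents v2 onto root
print(reparent({"v1": ["root"], "v2": ["v1"]}, "v1"))  # {'v2': ['root']}
```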

### 9. Diff Versions — Unified `---/+++` Format

`marco diff` instantly prints a human-readable summary of how key metrics changed (e.g. "15% increase in tokens") followed by a line-by-line unified diff directly to the terminal — no files written, no clutter.

**Diff the raw source data (default):**
```bash
marco diff v1.0 v2.0
```

**Diff the post-pipeline preprocessed output:**
```bash
marco diff v1.0 v2.0 --target preprocessed
```

**Diff the DAG preprocessing configuration:**
```bash
marco diff v1.0 v2.0 --target config
```

**Control how many context lines surround each hunk (default: 3):**
```bash
marco diff v1.0 v2.0 --context 10
```

**Save a full Markdown metrics report to a folder (opt-in):**
```bash
marco diff v1.0 v2.0 --save ./reports
```

**Example terminal output:**
```diff
📊 Summary: v1.0 → v2.0
  ► N Documents: 2.00% increase (50 → 51)
  ► N Tokens: 15.30% increase (1000 → 1153)
  ► Vocab Size: 5.10% increase (300 → 315)
─────────────────────────────────────────────────────────────

--- v1.0 (abc12345)/raw.txt
+++ v2.0 (def67890)/raw.txt
@@ -1,5 +1,6 @@
  positive  Great product, love it!
-negative  Terrible experience.
+negative  Poor experience, not recommended.
+negative  Worst purchase I ever made.
  positive  Highly recommend to everyone.
  neutral   It was okay, nothing special.
```
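
As the Architecture Overview below notes, the unified diff itself comes from Python's stdlib `difflib`. A minimal sketch producing the same `---`/`+++` format with a configurable context size (the file labels mimic the example above):
```python
import difflib

old = ["positive\tGreat product, love it!", "negative\tTerrible experience."]
new = ["positive\tGreat product, love it!",
       "negative\tPoor experience, not recommended."]

diff = difflib.unified_diff(
    old, new,
    fromfile="v1.0 (abc12345)/raw.txt",
    tofile="v2.0 (def67890)/raw.txt",
    n=3,           # context lines per hunk, cf. --context
    lineterm="",   # we join with newlines ourselves
)
print("\n".join(diff))
```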

### 10. Track Model Drift

Detect performance degradation across dataset versions instantly using **dynamic text mining**. Marco automatically scans your attached unstructured `results.txt` or `training_results.json` files for baseline performance metrics, using regex to extract Accuracy, Precision, Recall, and F1 from arbitrary terminal logs. It then spins up an internal, lightweight scikit-learn `MultinomialNB` pipeline to evaluate the new data in-memory, with no serialized `.pkl` file required. A sketch of both techniques follows the example output below.

```bash
marco drift v1-trained v2-processed
```

**Example terminal output** *(Metrics not found in the baseline text file are gracefully omitted from comparison!)*:
```text
MODEL DRIFT ANALYSIS
════════════════════════════════════════
Trained on:   V_OLD  (fe4403)
Evaluated on: V_NEW  (a3f9c1)
════════════════════════════════════════
Metric       V_OLD (baseline)  V_NEW (new)  Drop     Severity
──────────────────────────────────────────────────────────────
Accuracy     0.700             0.612        -12.6%   CRITICAL
F1           0.547             0.485        -11.3%   CRITICAL
════════════════════════════════════════
Overall: RETRAIN
"Significant dataset shift detected. Model cannot explain
 new data variance. Retraining recommended."
```
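
To make the mechanics concrete, here is a compressed sketch of the two techniques named above: regex mining of baseline metrics from an arbitrary log, plus an in-memory scikit-learn `MultinomialNB` evaluation. It is an illustration under simplified assumptions (toy data, accuracy only), not Marco's actual code:
```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def extract_metrics(log_text: str) -> dict:
    """Mine 'Accuracy: 0.700'-style numbers out of an arbitrary log."""
    pattern = r"(accuracy|precision|recall|f1)\s*[:=]\s*([01]?\.[0-9]+)"
    return {m.group(1).lower(): float(m.group(2))
            for m in re.finditer(pattern, log_text, re.IGNORECASE)}

baseline = extract_metrics("epoch 10 | Accuracy: 0.700 | F1: 0.547")

# Toy (label, text) rows standing in for two dataset versions
old_rows = [("pos", "great product love it"), ("neg", "terrible experience")] * 20
new_rows = [("pos", "highly recommend it"), ("neg", "worst purchase ever")] * 20

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit([text for _, text in old_rows], [label for label, _ in old_rows])
new_acc = accuracy_score([label for label, _ in new_rows],
                         model.predict([text for _, text in new_rows]))
drop = (new_acc - baseline["accuracy"]) / baseline["accuracy"]
print(f"Accuracy {baseline['accuracy']:.3f} -> {new_acc:.3f} ({drop:+.1%})")
```
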
---

### 11. KL Divergence & Token Analytics

Marco goes beyond simple vocabulary size tracking. It uses Kullback-Leibler (KL) divergence to measure how token probability distributions shift between two dataset versions, telling you whether common words disappeared or rare domain terms suddenly dominated.
```bash
marco token-analytics v1-raw v2-processed
```
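
For intuition: `D_KL(P ‖ Q) = Σ_t P(t) · log(P(t) / Q(t))`, where `P` and `Q` are the token probability distributions of the two versions. A minimal sketch with additive smoothing, so tokens absent from one version stay finite (illustrative, not Marco's exact estimator):
```python
import math
from collections import Counter

def kl_divergence(old_tokens: list, new_tokens: list, alpha: float = 1.0) -> float:
    """D_KL(P || Q) with additive (Laplace) smoothing over the joint vocab."""
    vocab = set(old_tokens) | set(new_tokens)
    p_counts, q_counts = Counter(old_tokens), Counter(new_tokens)
    p_total = len(old_tokens) + alpha * len(vocab)
    q_total = len(new_tokens) + alpha * len(vocab)
    kl = 0.0
    for t in vocab:
        p = (p_counts[t] + alpha) / p_total
        q = (q_counts[t] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl

print(kl_divergence("the cat sat".split(), "the dog barked loudly".split()))
```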

### 12. Evaluate Dataset Health

Get a comprehensive evaluation report covering dataset statistics, distributions, and potential issues for a specific dataset version.
```bash
marco evaluate v1-processed
```

---

## 🔬 Reproducibility Proof

Marco guarantees that a dataset version is not just stored — it is **provably reproducible**. The `marco/reproducibility_proof/` module lets you mathematically verify that re-running the exact same pipeline on the same raw data produces a byte-identical result.

### 1. Verify a Single Version

Re-runs the full preprocessing pipeline from scratch and compares the freshly computed output hash against the stored hash in the manifest. Before re-running the pipeline, it performs a **zero-trust file integrity check** to ensure the stored dataset itself has not been tampered with.

```bash
marco verify v1-raw
# or with a full hash
marco verify fe44032c
```

**Example output:**
```
🔬 Verifying version: fe44032c70164a71...

  Pipeline Step Verification:
    ✅  step_1 normalize_newlines
    ✅  step_2 lowercase

  Stored output_hash:   ed610672ea28e065...
  Recomputed hash:      ed610672ea28e065...

  ✅ VERIFIED — output hash matches. Version is reproducible.
```

- **On PASS**: the manifest is updated with `"verified": true`, a timestamp, the current environment snapshot, and per-step hashes (a hypothetical snippet follows this list).
- **On FAIL**: a `verification_report.json` is written inside the version folder with a row-level delta showing exactly which rows differ. Exit code is `1`.
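
A hypothetical manifest fragment after a PASS. Only the `verified` flag, a timestamp, the environment snapshot, and per-step hashes are guaranteed by the description above; the exact field names may differ:
```json
{
  "verified": true,
  "verified_at": "2026-03-18T14:01:00Z",
  "environment": {"python": "3.12.1", "numpy": "1.26.4", "pandas": "2.2.0"},
  "step_hashes": {"step_1": "aabbcc...", "step_2": "ddeeff..."}
}
```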

---

### 2. Verify All Versions at Once

Audit every version in the repository and get a clean summary table.

```bash
marco verify-all
```

**Example output:**
```
──────────────────────────────────────────────────────────────────────
  Marco Reproducibility Audit — 3 version(s) found
──────────────────────────────────────────────────────────────────────
🔬 Verifying version: a3f9c1d2...  ✅ VERIFIED
🔬 Verifying version: fe44032c...  ✅ VERIFIED
🔬 Verifying version: 83f77823...  ❌ FAIL
──────────────────────────────────────────────────────────────────────
  Summary: 2/3 versions passed.
──────────────────────────────────────────────────────────────────────
```

Returns exit code `1` if any version fails — making it CI/CD friendly.

---

### How the 7-Layer Reproducibility Engine Works

| # | Guarantee | How It Works |
|---|-----------|--------------|
| 1 | **Step-by-step hash verification** | Hashes the output after every individual DAG step. Pinpoints exactly which step introduced non-determinism (`✅ step_1`, `❌ step_2`). |
| 2 | **Byte-exact canonical enforcer** | Before hashing, data is serialized into a strictly defined canonical TSV format: Unix line endings only, fixed column order (`label`, `text`, `n_tokens`), UTF-8 encoding. Prevents false `FAIL` results from OS differences (Windows CRLF vs Linux LF). A combined sketch of layers 1 and 2 follows this table. |
| 3 | **Row-level delta on FAIL** | When a hash mismatch occurs, computes exactly which rows were added or removed between the stored output and the fresh recompute, saved in `verification_report.json`. |
| 4 | **Permanent audit trail** | Every verification run (PASS or FAIL) is appended to `.marco/verification_log.jsonl` with a timestamp, result, and full environment snapshot — a tamper-evident history. |
| 5 | **Environment fingerprinting** | Stores Python version, NumPy/Pandas versions, and platform info in the manifest at verification time. Warns on re-verification if the environment changed: `"Python 3.10 → 3.14 (may explain hash mismatch)"`. |
| 6 | **Stochastic operation detector** | Scans the pipeline DAG config for non-deterministic functions (e.g. `shuffle`, `random_sample`) that lack a `seed` parameter. Warns before verifying so expectations are set correctly. |
| 7 | **Manifest integrity update** | On PASS: sets `"verified": true` in `manifest.json` and persists per-step hashes. On FAIL: sets `"verified": false` and keeps `verification_report.json` alongside the version data. |
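
Layers 1 and 2 compose naturally: canonicalize, hash, run the next step, repeat. The sketch below follows the canonical rules stated in the table (fixed `label`/`text`/`n_tokens` column order, Unix line endings, UTF-8); the step function is a made-up stand-in, not Marco's real DAG engine:
```python
import hashlib

import pandas as pd

CANONICAL_COLUMNS = ["label", "text", "n_tokens"]

def canonical_bytes(df: pd.DataFrame) -> bytes:
    """Serialize to canonical TSV: fixed column order, LF endings, UTF-8."""
    tsv = df[CANONICAL_COLUMNS].to_csv(sep="\t", index=False, lineterminator="\n")
    return tsv.encode("utf-8")

def run_with_step_hashes(df: pd.DataFrame, steps: list):
    """Hash after every step so a mismatch pinpoints the offending step."""
    step_hashes = {}
    for i, step in enumerate(steps, start=1):
        df = step(df)
        step_hashes[f"step_{i}"] = hashlib.sha256(canonical_bytes(df)).hexdigest()
    return df, step_hashes

# Made-up single-step pipeline for illustration
df = pd.DataFrame({"label": ["pos"], "text": ["Hello World"], "n_tokens": [2]})
lowercase = lambda d: d.assign(text=d["text"].str.lower())
_, hashes = run_with_step_hashes(df, [lowercase])
print(hashes)  # {'step_1': '...'}
```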

---

### Verification Report (`verification_report.json`)

When a version fails, Marco writes a full diagnostic file at `.marco/versions/<version_id>/verification_report.json`:

```json
{
  "version_id": "fe44032c...",
  "verified_at": "2026-03-18T14:01:00Z",
  "result": "FAIL",
  "stored_hash": "ed610672...",
  "computed_hash": "a3f9c1d2...",
  "step_hashes": {
    "step_1": "aabbcc...",
    "step_2": "ddeeff..."
  },
  "stochastic_warnings": [],
  "environment_warnings": ["python_version changed: '3.10' → '3.14'"],
  "row_delta": {
    "count_stored": 1000,
    "count_computed": 998,
    "removed": [["pos", "hello world"], ["neg", "bad product"]],
    "added": []
  }
}
```

---

## 🧠 Architecture Overview

Marco decouples logic from the file system. All core engine operations live inside `marco/core/`:

- **`locker.py`** — File-based concurrency control using `.lock` files (pattern sketched after this list).
- **`repository.py`** — CRUD operations for dataset versions and `refs.json` tagging. Enforces file immutability by setting read-only permissions (`0o444`) on all version files immediately after creation.
- **`preprocessor.py`** — A robust Directed Acyclic Graph (DAG) preprocessing engine with deterministic topological execution.
- **`comparator.py`** — Version comparison via Markdown metric reports (`compare_versions`) and classic `---`/`+++` unified diffs (`diff_versions_text`) using Python's stdlib `difflib`.
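
The classic primitive behind a `.lock`-file scheme like the one `locker.py` describes is `os.open` with `O_CREAT | O_EXCL`, which fails atomically if the lock file already exists. A sketch of the pattern, not Marco's actual code:
```python
import os
import time
from contextlib import contextmanager

@contextmanager
def file_lock(path: str = ".marco/.lock", timeout: float = 10.0):
    """Exclusive lock via atomic lock-file creation: O_EXCL guarantees
    that exactly one process can create the file."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break  # lock acquired
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(0.05)  # back off and retry
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)

# with file_lock():
#     ...  # mutate .marco/ safely
```
The immutability rule described for `repository.py` is equally small in spirit: a `chmod` to `0o444` on each version file immediately after it is written.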

### Reproducibility Proof Engine (`marco/reproducibility_proof/`)

- **`canonicalizer.py`** — Enforces byte-exact canonical serialization (Unix line endings, fixed column order). Also detects stochastic/non-deterministic pipeline operations.
- **`audit_log.py`** — Appends every verification run (PASS or FAIL) to `.marco/verification_log.jsonl` with a full environment snapshot.
- **`verify.py`** — End-to-end reproducibility verification with step-by-step hashing, row-level delta reports, environment fingerprinting, and a `verify_all()` batch audit command.

---

Have fun building safer machine learning pipelines!
