Metadata-Version: 2.4
Name: mx8
Version: 1.0.8rc1
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Operating System :: OS Independent
Summary: MX8: bounded data runtime (Rust) exposed to Python.
License: UNLICENSED
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# mx8 (Python)

MX8 is a bounded-memory data runtime exposed to Python (built with PyO3 + maturin).

The v0 focus is “don’t OOM”: MX8 enforces backpressure with hard caps, so prefetch can’t run away.

Further docs:

- Python API: `../../docs/python_api.md`
- Vision labels/layout: `../../docs/vision_labels.md`
- S3/runtime tuning: `../../docs/s3_runtime_tuning.md`
- Memory contract: `../../docs/memory_contract.md`
- Video GA checklist: `../../docs/video_ga_checklist.md`
- Troubleshooting: `../../docs/troubleshooting.md`

## Install (from wheel)

Once you have a wheel (from CI or a local build):

- `python -m venv .venv && . .venv/bin/activate`
- `pip install mx8-*.whl`

## Install (from PyPI)

- `python -m venv .venv && . .venv/bin/activate`
- `pip install mx8`
- Optional vision/training deps: `pip install pillow numpy torch`

## Quickstart (local, no S3)

```python
import mx8

mx8.pack_dir(
    "/path/to/imagefolder",
    out="/path/to/mx8-dataset",
    shard_mb=512,
    label_mode="imagefolder",
    require_labels=True,
)

loader = mx8.image(
    "/path/to/mx8-dataset@refresh",
    batch=64,
    inflight=256 * 1024 * 1024,
    resize=(224, 224),  # (H,W); optional
)

print(loader.classes)  # ["cat", "dog", ...] if labels.tsv exists

for images, labels in loader:
    pass
```

## Zero-manifest load (raw prefix)

```python
import mx8

loader = mx8.load(
    "s3://bucket/raw-prefix/",
    recursive=True,  # default
    profile="balanced",
)

for batch in loader:
    pass
```

`mx8.run(...)` is the convenience wrapper that chooses local vs distributed mode from the environment (`WORLD_SIZE`).
`mx8.resolve(...)` is a short alias for `mx8.resolve_manifest_hash(...)`.
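
To illustrate the kind of environment check involved, here is a minimal sketch. It assumes the common launcher convention that `WORLD_SIZE > 1` means a distributed run; the exact rule `mx8.run(...)` applies is not documented here, and `pick_mode` is a hypothetical helper, not part of the mx8 API.

```python
import os


def pick_mode(env=None):
    """Return "distributed" when WORLD_SIZE > 1, else "local" (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("WORLD_SIZE", "1")
    try:
        world_size = int(raw)
    except ValueError:
        world_size = 1  # assumption: treat unparsable values as single-process
    return "distributed" if world_size > 1 else "local"


print(pick_mode({"WORLD_SIZE": "4"}))
print(pick_mode({}))
```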

## Mix multiple loaders

`mx8.mix(...)` composes existing loaders into one deterministic stream.
`weights` are sampling proportions (not model-loss weights).

```python
import mx8

loader_a = mx8.load("s3://bucket/dataset_a/@refresh", profile="balanced", tune=True)
loader_b = mx8.load("s3://bucket/dataset_b/@refresh", profile="balanced", tune=True)

mixed = mx8.mix(
    [loader_a, loader_b],
    weights=[1, 1],   # fairness baseline (50:50)
    seed=0,
    epoch=0,
)

for batch in mixed:
    pass

print(mixed.stats())
```

Skewed example:

```python
mixed = mx8.mix([loader_a, loader_b], weights=[7, 3], seed=0, epoch=0)
```

`seed` and `epoch` define deterministic schedule behavior:

- same `seed` + `epoch` => same source-pick sequence
- same `seed`, different `epoch` => controlled schedule variation
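
The two rules above can be pictured with a small pure-Python sketch. This is not MX8's scheduler: `pick_sequence` is a hypothetical function that draws source indices from a random generator seeded by the `(seed, epoch)` pair, which is one simple way to get the same reproducibility properties.

```python
import random


def pick_sequence(weights, seed, epoch, n):
    """Return n source indices drawn in proportion to `weights` (illustrative only)."""
    # A string seed is hashed deterministically by `random`, so the same
    # (seed, epoch) pair always produces the same stream of picks.
    rng = random.Random(f"{seed}:{epoch}")
    sources = list(range(len(weights)))
    return rng.choices(sources, weights=weights, k=n)


same_a = pick_sequence([7, 3], seed=0, epoch=0, n=10_000)
same_b = pick_sequence([7, 3], seed=0, epoch=0, n=10_000)
other = pick_sequence([7, 3], seed=0, epoch=1, n=10_000)

print(same_a == same_b)                  # identical seed + epoch
print(same_a == other)                   # epoch changed
print(same_a.count(0) / len(same_a))     # share of picks going to source 0
```

With weights `[7, 3]`, the share of picks for source 0 should land near 0.7 over a long enough run.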

`starvation` is an optional watchdog threshold, measured in scheduler ticks, that drives the starvation counters reported by `mixed.stats()`.
Set `MX8_MIX_SNAPSHOT=1` (and optionally `MX8_MIX_SNAPSHOT_PERIOD_TICKS=64`) to emit periodic `mix_snapshot` proof events.
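
As a rough mental model of a tick-based starvation counter, the sketch below counts how often each source goes `threshold` consecutive ticks without being picked. The counting rule and the name `starvation_counts` are assumptions made for illustration; the counters MX8 actually reports may be defined differently.

```python
def starvation_counts(picks, n_sources, threshold):
    """Count starvation episodes per source for a sequence of picks (illustrative)."""
    waiting = [0] * n_sources  # ticks since each source was last picked
    starved = [0] * n_sources  # episodes where the wait reached `threshold`
    for picked in picks:
        for src in range(n_sources):
            if src == picked:
                waiting[src] = 0
            else:
                waiting[src] += 1
                if waiting[src] == threshold:
                    starved[src] += 1  # count each episode once, when it crosses
    return starved


# Source 1 is picked only once in eight ticks.
print(starvation_counts([0, 0, 0, 1, 0, 0, 0, 0], n_sources=2, threshold=3))
```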

Minimal API naming note:
- Top-level APIs use short kwargs (`batch`, `ram_gb`, `coord`, `resume`, ...).
- Advanced objects keep explicit names:
  - `mx8.Constraints(max_inflight_bytes=..., max_ram_bytes=...)`
  - `mx8.RuntimeConfig(prefetch_batches=..., max_queue_batches=..., want=...)`
  - `mx8.DistributedDataLoader(..., autotune=..., resume_from=...)`

## Bounded memory (v0)

Set a hard cap and periodically print high-water marks:

```python
import mx8

loader = mx8.image(
    "/path/to/mx8-dataset@refresh",
    batch=64,
    inflight=256 * 1024 * 1024,
    queue=8,
    prefetch=4,
)

for step, (images, labels) in enumerate(loader):
    if step % 100 == 0:
        print(loader.stats())  # includes ram_high_water_bytes
```
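
To picture what a hard in-flight cap means, here is a minimal pure-Python sketch of byte-budget backpressure: a producer must reserve bytes before it fetches, and it blocks while the budget is exhausted. `ByteBudget` and its methods are hypothetical names for illustration, not part of the mx8 API; the real runtime is implemented in Rust and may work differently.

```python
import queue
import threading


class ByteBudget:
    """Blocks producers once `cap` bytes are in flight (illustrative only)."""

    def __init__(self, cap):
        self.cap = cap
        self.in_flight = 0
        self.high_water = 0
        self._cond = threading.Condition()

    def acquire(self, nbytes):
        if nbytes > self.cap:
            raise ValueError("an item larger than the cap can never be admitted")
        with self._cond:
            # Backpressure: wait until the item fits under the cap.
            while self.in_flight + nbytes > self.cap:
                self._cond.wait()
            self.in_flight += nbytes
            self.high_water = max(self.high_water, self.in_flight)

    def release(self, nbytes):
        with self._cond:
            self.in_flight -= nbytes
            self._cond.notify_all()


budget = ByteBudget(cap=1024)
q = queue.Queue()


def producer():
    for _ in range(50):
        item = bytes(256)
        budget.acquire(len(item))  # blocks while 1024 bytes are already queued
        q.put(item)
    q.put(None)  # sentinel: no more items


t = threading.Thread(target=producer)
t.start()
while (item := q.get()) is not None:
    budget.release(len(item))  # consuming an item frees its bytes
t.join()

print(budget.high_water <= budget.cap)
```

The high-water mark can never exceed the cap because the reservation happens before the data is produced, which is the property a value such as `ram_high_water_bytes` lets you check from the outside.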

Avoid patterns that intentionally accumulate batches:

```python
# ❌ Don't do this: holding every batch grows RSS no matter how the loader is capped
all_batches = list(loader)
```
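
If you need a summary of the whole dataset, a streaming reduction keeps memory flat: fold each batch into a small running aggregate and let the batch go. The sketch below uses `fake_loader`, a hypothetical stand-in generator that yields plain lists, so it runs without mx8 installed; a real loader would take its place.

```python
def fake_loader(n_batches, batch_size):
    """Stand-in for a loader: yields (images, labels) built from plain lists."""
    for i in range(n_batches):
        images = [[0.0] * 4 for _ in range(batch_size)]
        labels = [i % 2] * batch_size
        yield images, labels


label_counts = {}
n_samples = 0
for images, labels in fake_loader(n_batches=10, batch_size=8):
    n_samples += len(images)
    for label in labels:
        label_counts[label] = label_counts.get(label, 0) + 1
    # Nothing keeps a reference to `images`, so each batch can be freed
    # as soon as the next one replaces it.

print(n_samples, label_counts)
```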

## Labels (optional)

`label_mode="imagefolder"` is designed to scale:

- Per-sample records reference a numeric `label_id` (u64), not a repeated string.
- The human-readable mapping is stored once at `out/_mx8/labels.tsv`.
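
The idea behind this layout can be sketched in a few lines. The example assumes class names are sorted and numbered from zero (the convention used by torchvision's `ImageFolder`) and that the mapping file holds `id<TAB>name` rows; both are assumptions, since the actual ordering and `labels.tsv` column layout are not specified here. `build_label_table` is a hypothetical helper.

```python
import csv
import io


def build_label_table(class_names):
    """Map class names to dense integer IDs, sorted by name (assumed convention)."""
    return {name: idx for idx, name in enumerate(sorted(set(class_names)))}


samples = [("cat", "a.jpg"), ("dog", "b.jpg"), ("cat", "c.jpg")]
table = build_label_table(name for name, _ in samples)

# Per-sample records carry only the integer ID, never the repeated string.
records = [(path, table[name]) for name, path in samples]

# The name mapping is written once; the id<TAB>name layout is an assumption.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
for name, idx in sorted(table.items(), key=lambda kv: kv[1]):
    writer.writerow([idx, name])

print(records)
print(buf.getvalue())
```

The per-sample cost stays a fixed-width integer however long the class names are, and the mapping grows with the number of classes rather than the number of samples.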

If your input layout is mixed (files directly under the prefix *and* subfolders), `label_mode="auto"` may disable ImageFolder labeling. To enforce ImageFolder semantics, use:

```python
mx8.pack_dir(..., label_mode="imagefolder", require_labels=True)
```

