Metadata-Version: 2.4
Name: quarterbit
Version: 16.1.1
Summary: AXIOM - High-performance optimizer for deep learning with extreme memory efficiency
Home-page: https://quarterbit.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <info@quarterbit.dev>
License: Commercial
Project-URL: Homepage, https://quarterbit.dev
Project-URL: Documentation, https://quarterbit.dev/docs
Keywords: optimizer,adam,deep-learning,pytorch,gpu,memory-efficient,compression,axiom
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# QuarterBit - AXIOM Optimizer

**Memory-efficient optimizer for LLM training**

Drop-in Adam replacement with 1333x optimizer-state compression. Train larger language models on the same hardware.

## Features

- **1333x Optimizer-State Compression** - Train GPT/LLaMA/Gemma on smaller GPUs
- **16% Better Convergence** - Outperforms AdamW on the GPT-2 WikiText benchmark
- **Production Ready** - Gradient clipping, NaN detection, checkpointing
- **Full-Stack Trainer** - One-line training with all extensions auto-enabled
- **Two Tiers** - AXIOM (default) and AXIOM_2 (for 3B+ models on an 8 GB GPU)

## Requirements

- **Python 3.12+** (Windows or Linux)
- **PyTorch 2.0+** with CUDA
- **NVIDIA GPU** - Pascal or newer (GTX 10xx, RTX 20/30/40, T4, A100, H100)

## Installation

```bash
# PyTorch required (install first)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install QuarterBit
pip install quarterbit
```
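
To sanity-check the environment after installing, something like this (an illustrative snippet, not part of QuarterBit's API) confirms that a CUDA-enabled PyTorch build and the package itself import cleanly:

```python
import torch
import quarterbit  # fails here if the install is broken

# Expect a 2.x version string and True on a machine with a supported NVIDIA GPU
print(torch.__version__, torch.cuda.is_available())
```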

## Quick Start (Recommended)

**One-line full-stack training with all extensions enabled:**

```python
from quarterbit import AXIOM_Trainer

# Create trainer - automatically enables everything
trainer = AXIOM_Trainer(model, train_loader, val_loader)

# Train with full monitoring
results = trainer.fit(steps=2000)

# Results include: train_losses, val_ppls, peak_vram_gb, compression_stats
# Automatically saves: axiom_training_results.json, axiom_training_chart.png
```

**What AXIOM_Trainer does automatically:**
- Creates AXIOM_2 optimizer (1333x compression)
- Calls `register_hooks()` for gradient compression
- Tracks validation loss and perplexity
- Monitors peak VRAM usage
- Exports results to JSON and PNG charts

## Manual Quick Start

```python
from quarterbit import AXIOM

# Create optimizer (drop-in Adam replacement)
optimizer = AXIOM(model.parameters(), lr=1e-4)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).loss
    loss.backward()
    optimizer.step(loss=loss.item())  # Pass loss for adaptive learning
```

## Two Optimizer Tiers

### AXIOM (Default)
Standard mode with 1333x optimizer compression. Use this for most training.

```python
from quarterbit import AXIOM

opt = AXIOM(model.parameters(), lr=1e-4)
```

### AXIOM_2 (Large Models)
Compresses both the optimizer state and the gradients. Train 3B+ models on an 8 GB GPU.

```python
from quarterbit import AXIOM_2

opt = AXIOM_2(model.parameters(), lr=5e-3)
opt.register_hooks()  # IMPORTANT: Call before training loop

for batch in dataloader:
    loss = model(batch).loss
    loss.backward()  # Gradients compressed automatically
    opt.step(loss.item())
    opt.zero_grad()
```

**Memory comparison for a 2.8B-parameter model:**
| Optimizer | Gradients | Opt State | Total |
|-----------|-----------|-----------|-------|
| Adam | 11.2 GB | 22 GB | 33.2 GB |
| AXIOM | 11.2 GB | 16 MB | 11.2 GB |
| AXIOM_2 | 16 MB | 16 MB | 32 MB |
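
The Adam row follows from back-of-envelope arithmetic, assuming FP32 gradients and Adam's two FP32 moment buffers; the AXIOM rows reflect the library's reported compressed sizes. A quick sketch of that arithmetic:

```python
params = 2.8e9                                 # 2.8B parameters
fp32_bytes = 4

grads_gb = params * fp32_bytes / 1e9           # ~11.2 GB of FP32 gradients
adam_state_gb = 2 * grads_gb                   # exp_avg + exp_avg_sq, ~22.4 GB
print(f"Adam total: ~{grads_gb + adam_state_gb:.1f} GB")   # ~33.6 GB
```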

## API Reference

### AXIOM / AXIOM_2

```python
AXIOM(
    params,                    # Model parameters
    lr=0.001,                  # Learning rate
    weight_decay=0.01,         # Decoupled weight decay
    max_grad_norm=None,        # Gradient clipping (None = disabled)
    detect_anomaly=True,       # Raise error on NaN/Inf gradients
)

# Methods
optimizer.step(loss)           # Pass loss.item() for adaptive learning
optimizer.zero_grad()          # Clear gradients
optimizer.get_lr()             # Get current learning rate
optimizer.set_lr(lr)           # Set learning rate
optimizer.state_dict()         # For checkpointing
optimizer.load_state_dict(d)   # Restore checkpoint
optimizer.memory_usage()       # Print memory comparison
```
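
A minimal sketch of these options in a loop: gradient clipping via `max_grad_norm`, anomaly detection, and a simple linear warmup driven through the documented `set_lr()` method (the warmup schedule itself is illustrative, not part of the library):

```python
from quarterbit import AXIOM

optimizer = AXIOM(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,
    max_grad_norm=1.0,     # clip gradient norm to 1.0 before each step
    detect_anomaly=True,   # fail fast on NaN/Inf gradients
)

base_lr, warmup_steps = 1e-4, 500
for step, batch in enumerate(dataloader):
    if step < warmup_steps:
        # Illustrative linear warmup using set_lr()
        optimizer.set_lr(base_lr * (step + 1) / warmup_steps)

    optimizer.zero_grad()
    loss = model(batch).loss
    loss.backward()
    optimizer.step(loss=loss.item())  # pass the loss for adaptive learning
```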

### AXIOM_2 Additional Methods

```python
optimizer.register_hooks()     # Enable gradient compression (call once)
optimizer.remove_hooks()       # Disable gradient compression
```

### AXIOM_Trainer (Full-Stack)

```python
import torch
from quarterbit import AXIOM_Trainer, TrainerConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load model and data
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Create dataloaders
ds = load_dataset("wikitext", "wikitext-2-raw-v1")
# ... tokenize and create train_loader, val_loader ...

# Simple usage - just works
trainer = AXIOM_Trainer(model, train_loader, val_loader)
results = trainer.fit(steps=2000)

# With options
trainer = AXIOM_Trainer(
    model,
    train_loader,
    val_loader,
    lr=5e-4,                   # Learning rate
    eval_interval=200,         # Validate every 200 steps
    log_interval=100,          # Log every 100 steps
    checkpoint_interval=500,   # Save checkpoint every 500 steps (0=disabled)
    checkpoint_dir="checkpoints",
    save_results=True,         # Save JSON + PNG
    results_prefix="my_run"
)
results = trainer.fit(steps=5000)

# Results dict contains:
# - train_losses, val_losses, val_ppls (lists)
# - initial_train_loss, final_train_loss, train_improvement_pct
# - initial_val_ppl, final_val_ppl, val_improvement_pct
# - peak_vram_gb, tokens_per_sec, total_time_min
# - compression_total, compression_optimizer, compression_gradients

# Manual checkpoint save/load
trainer.save_checkpoint(step=1000, path="my_checkpoint.pt")
trainer.load_checkpoint("my_checkpoint.pt")

# Manual evaluation
val_loss, val_ppl = trainer.evaluate()

# Cleanup when done
trainer.cleanup()
```
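
The dataloader construction elided above could look roughly like this. It is a sketch only: the block size, batch size, and the convention of feeding `input_ids`/`labels` dicts (a common causal-LM setup) are illustrative assumptions, since this README does not specify the exact batch format AXIOM_Trainer expects.

```python
import torch
from torch.utils.data import DataLoader

def make_loader(split, block_size=512, batch_size=4, shuffle=False):
    # Tokenize the whole split and chop it into fixed-length blocks
    text = "\n\n".join(t for t in ds[split]["text"] if t.strip())
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ids = ids[: (len(ids) // block_size) * block_size].view(-1, block_size)
    # Causal-LM convention: labels mirror the inputs
    examples = [{"input_ids": b, "labels": b.clone()} for b in ids]
    return DataLoader(examples, batch_size=batch_size, shuffle=shuffle)

train_loader = make_loader("train", shuffle=True)
val_loader = make_loader("validation")
```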

## Checkpointing

```python
# Save
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
}
torch.save(checkpoint, 'checkpoint.pt')

# Load
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```

## What's Automatic vs Manual

| Component | Mode | What it does |
|-----------|------|--------------|
| AXIOM_Trainer | **Automatic** | Full training with all optimizations |
| AXIOM / AXIOM_2 | **Automatic** | Optimizer compression (1333x) |
| register_hooks() | **Automatic** | Gradient compression (683x) |
| AXIOM_CHECKPOINT | Manual | Activation compression (85%) |
| AXIOM_DDP | Manual | Distributed gradient compression (128x) |

- **Automatic** = just use it; works out of the box
- **Manual** = requires integration into your model/training code

## Manual Extensions

### AXIOM_CHECKPOINT (Activation Compression)

Reduces activation memory by 85%. Requires manual integration into your model's forward pass.

**When to use:** Large models where activation memory is the bottleneck (not optimizer memory).

```python
import torch
import torch.nn as nn
from quarterbit import AXIOM_CHECKPOINT

# Create checkpoint manager
# max_slots = number of layers to checkpoint
# max_n = max elements per activation (batch * seq * hidden)
actcp = AXIOM_CHECKPOINT(max_slots=32, max_n=4*512*4096)

class CheckpointedTransformer(nn.Module):
    def __init__(self, base_model, actcp):
        super().__init__()
        self.model = base_model
        self.actcp = actcp

    def forward(self, input_ids, attention_mask=None):
        hidden = self.model.embed_tokens(input_ids)

        for i, layer in enumerate(self.model.layers):
            # Store activation before each layer (compressed, 85% savings)
            self.actcp.store(hidden, slot=i)

            # Forward through layer
            hidden = layer(hidden, attention_mask=attention_mask)[0]

        return self.model.lm_head(self.model.norm(hidden))

    def restore_activation(self, slot):
        """Call during backward if needed."""
        return self.actcp.restore(slot)

# Usage
model = CheckpointedTransformer(base_model, actcp)

# Check memory savings
stats = actcp.memory_stats()
print(f"Compression: {stats['compression_ratio']:.1f}x")
print(f"Savings: {stats['savings_percent']:.0f}%")
```

### AXIOM_DDP (Distributed Gradient Compression)

Reduces all-reduce bandwidth by 128x. Requires manual integration into your distributed training loop.

```python
import torch
import torch.distributed as dist
from quarterbit import AXIOM_DDP

total_params = sum(p.numel() for p in model.parameters())
gc = AXIOM_DDP(n=total_params, top_k_percent=6.25)

# In your distributed training loop:
# 1. Flatten the per-parameter gradients into one vector and compress before all-reduce
all_gradients = torch.cat([p.grad.view(-1) for p in model.parameters()])
vals, idx, count = gc.compress(all_gradients)

# 2. All-reduce only compressed data (128x less bandwidth)
dist.all_reduce(vals)
dist.all_reduce(idx)

# 3. Decompress after all-reduce
full_grads = gc.decompress(vals, idx, count)

# Stats
print(gc.stats())
```

### DDP Helper Functions

```python
from quarterbit import compress_gradients_for_ddp, decompress_gradients_for_ddp

# Compress all model gradients
vals, idx, count, compressor = compress_gradients_for_ddp(model)

# ... distributed all-reduce on vals, idx ...

# Decompress back to model
decompress_gradients_for_ddp(model, vals, idx, count, compressor)
```

## Supported Models

AXIOM is optimized for **language models**:
- GPT-2, GPT-Neo, GPT-J
- LLaMA, LLaMA 2, LLaMA 3
- Gemma, Gemma 2
- Mistral, Mixtral
- Phi, Phi-2, Phi-3
- BERT, RoBERTa (fine-tuning)

## License

Commercial license required for production use.
Free for research and evaluation.

**https://quarterbit.dev**

---

Copyright 2026 Clouthier Simulation Labs. All rights reserved.
