Metadata-Version: 2.4
Name: quarterbit
Version: 17.0.1
Summary: AXIOM - High-performance optimizer for deep learning with extreme memory efficiency
Home-page: https://quarterbit.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <info@quarterbit.dev>
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://quarterbit.dev
Project-URL: Documentation, https://quarterbit.dev/docs
Keywords: optimizer,adam,deep-learning,pytorch,gpu,memory-efficient,compression,axiom
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# QuarterBit - AXIOM Optimizer

**Memory-efficient optimizer for LLM training**

Drop-in Adam replacement with 220x memory compression. Train larger language models on the same hardware.

## Features

- **220x Memory Compression** - Train GPT/LLaMA/Gemma on smaller GPUs
- **Better Convergence** - Outperforms AdamW on language model benchmarks
- **Production Ready** - Gradient clipping, NaN detection, checkpointing
- **Full-Stack Trainer** - One-line training with automatic monitoring

## Requirements

- **Python 3.12+** (Windows or Linux)
- **PyTorch 2.0+** with CUDA
- **NVIDIA GPU** - Pascal or newer (GTX 10xx, RTX 20/30/40, T4, A100, H100)

## Installation

```bash
# PyTorch required (install first)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install QuarterBit
pip install quarterbit
```
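
To confirm the environment before training, a quick sanity check with plain PyTorch (nothing QuarterBit-specific):

```python
import torch

print(torch.__version__)              # expect 2.0 or newer
print(torch.cuda.is_available())      # should print True with a working CUDA install
print(torch.cuda.get_device_name(0))  # e.g., "NVIDIA H100 80GB HBM3"
```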

## Quick Start

### Option 1: Full-Stack Trainer (Recommended)

```python
from quarterbit import AXIOM_Trainer

trainer = AXIOM_Trainer(model, train_loader, val_loader, lr=5e-4)
results = trainer.fit(steps=2000)

print(f"Final PPL: {results['final_val_ppl']:.2f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
```

### Option 2: Manual Training Loop

```python
from quarterbit import AXIOM

opt = AXIOM(model.parameters(), lr=1e-3)
opt.register_hooks()  # Enable gradient compression

for batch in dataloader:
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

## Memory Comparison (2.8B Model)

| Optimizer | Gradients | Opt State | Total |
|-----------|-----------|-----------|-------|
| Adam | 11.2 GB | 22 GB | 33 GB |
| AXIOM | 11.2 GB | 0.13 GB | 11.3 GB |
| AXIOM + hooks | 16 MB | 0.13 GB | 0.15 GB |
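
The Adam rows follow from simple bytes-per-parameter accounting. A back-of-envelope sketch, assuming FP32 gradients and two FP32 moment buffers as the table implies:

```python
params = 2.8e9                   # 2.8B-parameter model
grads_gb = params * 4 / 1e9      # FP32 gradients, 4 bytes each -> 11.2 GB
state_gb = params * 2 * 4 / 1e9  # Adam's m and v buffers in FP32 -> 22.4 GB
print(f"gradients: {grads_gb:.1f} GB, optimizer state: {state_gb:.1f} GB")
```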

## Benchmark: 34B Model on Single GPU

**The "impossible" made possible.** We trained Yi-1.5-34B on a single H100 (80GB) - a model that would normally require 8 GPUs with AdamW.

### Memory Analysis

| Component | AdamW | AXIOM | Compression |
|-----------|-------|-------|-------------|
| Model (FP16) | 68.8 GB | 68.8 GB | 1x |
| Gradients | 68.8 GB | 201 MB | **341x** |
| Optimizer (m+v) | 275.1 GB | 201 MB | **1365x** |
| **Total Needed** | **413 GB** | **69 GB** | **6.0x** |
| Fits H100 80GB? | NO | YES | - |

AdamW needs 413 GB. An H100 has 80 GB. That's 5.2x more memory than the card physically has.

### Training Results

```
Model:           Yi-1.5-34B (34.4B parameters)
Hardware:        1x H100 80GB
Training:        2000 steps, 98 minutes
Peak VRAM:       78.9 GB / 80 GB

Validation PPL:  20.20 → 4.92 (75.6% improvement)
```

The model converged properly and generates coherent text:

```
Prompt: "The future of artificial intelligence is"
Output: "The future of artificial intelligence is very bright. AI has already
        made a significant impact on our lives, and it is only going to become
        more prevalent in the years to come..."
```

### Cost & Environmental Impact

| | AdamW (8x H100) | AXIOM (1x H100) | Savings |
|--|-----------------|-----------------|---------|
| Cloud Cost | $45.80 | $5.73 | **88%** |
| Carbon (CO2) | 3.85 kg | 0.48 kg | **87%** |

**AXIOM makes frontier-scale training accessible on single GPUs.**

## AXIOM Optimizer

```python
from quarterbit import AXIOM

opt = AXIOM(
    params,                    # Model parameters
    lr=0.001,                  # Learning rate
    weight_decay=0.01,         # Decoupled weight decay
    max_grad_norm=None,        # Gradient clipping (None = disabled)
    detect_anomaly=True,       # Error on NaN/Inf gradients
    streak_boost=1.5,          # Convergence tuning
    snr_floor=0.5,             # Noise filtering
)
```

### Methods

```python
opt.register_hooks()           # Enable gradient compression (call once)
opt.remove_hooks()             # Disable gradient compression
opt.step(loss=loss.item())     # Update weights (pass loss value)
opt.zero_grad()                # Clear gradients
opt.get_lr()                   # Get current learning rate
opt.set_lr(0.0005)             # Change learning rate
opt.state_dict()               # Save optimizer state
opt.load_state_dict(state)     # Load optimizer state
opt.memory_usage()             # Print memory stats
```
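
These methods compose into simple schedules. For example, a linear warmup built from `get_lr()`/`set_lr()` (a sketch using only the methods above; the warmup length and base rate are illustrative):

```python
base_lr, warmup_steps = 1e-3, 500

for step, batch in enumerate(dataloader):
    # Ramp the learning rate linearly over the first warmup_steps updates
    opt.set_lr(base_lr * min(1.0, (step + 1) / warmup_steps))

    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```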

### Example: GPT-2 Training

```python
import torch
from transformers import GPT2LMHeadModel
from quarterbit import AXIOM

model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
opt = AXIOM(model.parameters(), lr=5e-4)
opt.register_hooks()

for epoch in range(3):
    for batch in train_loader:
        input_ids = batch["input_ids"].cuda()

        opt.zero_grad()
        outputs = model(input_ids, labels=input_ids)
        outputs.loss.backward()
        opt.step(loss=outputs.loss.item())

    print(f"Epoch {epoch}: Loss = {outputs.loss.item():.4f}")
```
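
After training, you can spot-check generation quality with standard `transformers` APIs (not part of QuarterBit):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
ids = tok("The future of artificial intelligence is", return_tensors="pt").input_ids.cuda()

out = model.generate(ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```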

## AXIOM_Trainer

Full-stack training with automatic monitoring, validation, and result export.

```python
from quarterbit import AXIOM_Trainer

trainer = AXIOM_Trainer(
    model,                     # PyTorch model
    train_loader,              # Training DataLoader
    val_loader,                # Validation DataLoader (optional)
    lr=5e-4,                   # Learning rate
    weight_decay=0.01,         # Weight decay
    max_grad_norm=None,        # Gradient clipping
    eval_interval=200,         # Validate every N steps
    log_interval=100,          # Log every N steps
    checkpoint_interval=500,   # Save every N steps (0 = disabled)
    checkpoint_dir="checkpoints",
    save_results=True,         # Export JSON + PNG
    results_prefix="my_run",
    device="cuda",
)

results = trainer.fit(steps=5000)
```

### Results Dictionary

```python
results = trainer.fit(steps=2000)

# Training metrics
results['train_losses']           # List of all training losses
results['initial_train_loss']     # First loss value
results['final_train_loss']       # Last loss value
results['train_improvement_pct']  # Percent improvement

# Validation metrics
results['val_losses']             # List of validation losses
results['val_ppls']               # List of perplexities
results['initial_val_ppl']        # Starting perplexity
results['final_val_ppl']          # Final perplexity
results['val_improvement_pct']    # Percent improvement

# Performance
results['peak_vram_gb']           # Peak GPU memory used
results['tokens_per_sec']         # Training speed
results['total_time_min']         # Total training time

# Compression stats
results['compression_total']      # Overall compression ratio
```
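
With `save_results=True` the trainer exports these metrics itself; you can also plot them straight from the dictionary (plain matplotlib, shown for illustration):

```python
import matplotlib.pyplot as plt

plt.plot(results['train_losses'], label='train loss')
plt.xlabel('step')
plt.ylabel('loss')
plt.legend()
plt.savefig('loss_curve.png')
```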

### Trainer Methods

```python
# Manual checkpoint
trainer.save_checkpoint(step=1000, path="checkpoint.pt")
trainer.load_checkpoint("checkpoint.pt")

# Manual evaluation
val_loss, val_ppl = trainer.evaluate(max_batches=50)

# Cleanup hooks
trainer.cleanup()
```
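
For instance, `evaluate()` can record a baseline before training (usage built only from the methods above):

```python
base_loss, base_ppl = trainer.evaluate(max_batches=50)
print(f"Baseline PPL: {base_ppl:.2f}")

results = trainer.fit(steps=2000)
print(f"PPL: {base_ppl:.2f} → {results['final_val_ppl']:.2f}")
```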

### Example: Complete Training Script

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import load_dataset
from quarterbit import AXIOM_Trainer

# Load model
model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load data
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

train_data = dataset["train"].map(tokenize, batched=True)
val_data = dataset["validation"].map(tokenize, batched=True)

# Return PyTorch tensors from the DataLoader (mapped columns are plain lists by default)
train_data.set_format("torch", columns=["input_ids", "attention_mask"])
val_data.set_format("torch", columns=["input_ids", "attention_mask"])

train_loader = DataLoader(train_data, batch_size=4, shuffle=True)
val_loader = DataLoader(val_data, batch_size=4)

# Train
trainer = AXIOM_Trainer(
    model,
    train_loader,
    val_loader,
    lr=5e-4,
    eval_interval=200,
    checkpoint_interval=1000,
)

results = trainer.fit(steps=2000)

print(f"Training complete!")
print(f"Val PPL: {results['initial_val_ppl']:.1f} → {results['final_val_ppl']:.1f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
```

## Extensions

### AXIOM_CHECKPOINT - Activation Compression

Reduces activation memory by 85%. Intended for large models where activations, not optimizer state, are the bottleneck.

```python
from quarterbit import AXIOM_CHECKPOINT

actcp = AXIOM_CHECKPOINT(max_slots=32, max_n=4*512*4096)  # max_n: largest activation element count (e.g., batch * seq * hidden)

# In your model's forward pass
actcp.store(hidden_states, slot=layer_idx)

# During backward
restored = actcp.restore(slot=layer_idx)

# Check savings
stats = actcp.memory_stats()
print(f"Compression: {stats['compression_ratio']:.1f}x")
```

### AXIOM_DDP - Distributed Gradient Compression

128x bandwidth reduction for multi-GPU training.

```python
from quarterbit import AXIOM_DDP
import torch.distributed as dist

compressor = AXIOM_DDP(n=total_params, top_k_percent=6.25)

# Compress before all-reduce
vals, idx, count = compressor.compress(gradients)

# All-reduce compressed data (128x smaller)
dist.all_reduce(vals)
dist.all_reduce(idx)

# Decompress
full_grads = compressor.decompress(vals, idx, count)
```
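
The snippet assumes a process group is already initialized. For a single-node `torchrun` launch, a minimal setup looks like this (standard `torch.distributed`, independent of QuarterBit):

```python
import os
import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```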

### AXIOM_TENSOR - Weight Compression

2.5x weight storage compression.

```python
from quarterbit import AXIOM_TENSOR

vla = AXIOM_TENSOR(model.parameters())
vla.sync_from_fp32()  # Compress weights
vla.sync_to_fp32()    # Restore weights

stats = vla.memory_stats()
```

### AXIOM_INT8 - Combined Compression

4.7x total compression combining optimizer and weight compression.

```python
from quarterbit import AXIOM_INT8

opt = AXIOM_INT8(model.parameters(), lr=1e-4)
```

## Checkpointing

```python
# Save
torch.save({
    'model': model.state_dict(),
    'optimizer': opt.state_dict(),
    'step': step,
}, 'checkpoint.pt')

# Load
ckpt = torch.load('checkpoint.pt', map_location='cuda')
model.load_state_dict(ckpt['model'])
opt.load_state_dict(ckpt['optimizer'])
step = ckpt['step']
```

## Supported Models

- GPT-2, GPT-Neo, GPT-J
- LLaMA, LLaMA 2, LLaMA 3
- Gemma, Gemma 2
- Mistral, Mixtral
- Phi, Phi-2, Phi-3
- BERT, RoBERTa (fine-tuning)

## License

Commercial license required for production use.
Free for research and evaluation.

**https://quarterbit.dev**

---

Copyright 2026 Clouthier Simulation Labs. All rights reserved.
