Metadata-Version: 2.4
Name: quarterbit
Version: 17.0.14
Summary: AXIOM - High-performance optimizer for deep learning with extreme memory efficiency
Home-page: https://quarterbit.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <info@quarterbit.dev>
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://quarterbit.dev
Project-URL: Documentation, https://quarterbit.dev/docs
Keywords: optimizer,adam,deep-learning,pytorch,gpu,memory-efficient,compression,axiom
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# QuarterBit - AXIOM Optimizer

**Memory-efficient optimizer for LLM training**

Drop-in Adam replacement with 220x memory compression. Train larger language models on the same hardware.

## Features

- **Train 3B Models on 8GB GPU** - Full FP16 training with all parameters
- **220x Memory Compression** - Proprietary optimizer + gradient compression
- **Proven Convergence** - 62-92% loss improvement on 774M+ models
- **Production Ready** - Gradient clipping, NaN detection, checkpointing
- **Full-Stack Trainer** - One-line training with automatic monitoring

## When to Use AXIOM

AXIOM trades convergence speed for memory efficiency. Use it when memory is your bottleneck:

| Model Size | GPU VRAM | Recommendation |
|------------|----------|----------------|
| < 500M | Any | Use **AdamW** - faster convergence |
| 500M - 3B | 8-16GB | Use **AXIOM FP16** - full training, all params |
| 3B - 9B | 8-16GB | Use **AXIOM FP16** if the weights fit in VRAM; otherwise **AXIOM + 4-bit** (fine-tuning only) |
| 7B+ | 8GB | **Not recommended** - 4-bit freezes most weights |

**Rule of thumb:** If your model fits in FP16 with AXIOM, you get full training. 4-bit quantization freezes the Linear layers (only ~5% trainable), which is fine-tuning, not full training.
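
As a rough way to act on this table, estimate the FP16 weight footprint from the parameter count and compare it to your VRAM before picking an optimizer. The helper below is only a sketch of that rule of thumb (it is not part of the quarterbit API, and it ignores activation/gradient headroom); the thresholds mirror the table above.

```python
def recommend_optimizer(n_params: float, vram_gb: float) -> str:
    """Illustrative rule of thumb based on the table above (not part of quarterbit)."""
    fp16_weights_gb = n_params * 2 / 1e9   # 2 bytes per parameter in FP16

    if n_params < 500e6:
        return "AdamW - small model, faster convergence"
    if fp16_weights_gb < vram_gb:          # no headroom budgeted; a real check would be stricter
        return "AXIOM FP16 - full training of all parameters"
    return "AXIOM + 4-bit - fine-tuning only (~5% of params trainable)"

print(recommend_optimizer(3e9, 16))        # e.g. a 3B model on a 16GB GPU
```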

## Requirements

- **Python 3.11 or 3.12** (Windows) or **Python 3.12** (Linux)
- **PyTorch 2.0+** with CUDA
- **NVIDIA GPU** - Pascal or newer (GTX 10xx, RTX 20/30/40, T4, A100, H100)
- **bitsandbytes** (optional) - For 4-bit quantization with large models

## Installation

```bash
# PyTorch required (install first)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install QuarterBit
pip install quarterbit

# Optional: For 7B+ model training with 4-bit quantization
pip install bitsandbytes transformers accelerate
```

## Quick Start

### Loading Large Models (Recommended)

Use `load_model()` to automatically load models in INT8 - avoids OOM on limited VRAM:

```python
from quarterbit import load_model, AXIOM

# Loads GPT-J 6B in ~6GB instead of 12GB
model = load_model("EleutherAI/gpt-j-6b")

opt = AXIOM(model.parameters(), lr=1e-4)
opt.register_hooks()
```

**Why?** Loading a model in FP16 and then quantizing peaks at roughly twice the final INT8 footprint. `load_model()` uses bitsandbytes to load the weights directly in INT8, so that peak never occurs.

```python
# Estimate memory before loading
from quarterbit import estimate_memory

info = estimate_memory("EleutherAI/gpt-j-6b")
print(f"FP16: {info['fp16_gb']:.1f} GB")
print(f"INT8: {info['int8_gb']:.1f} GB")
print(f"Recommendation: {info['recommendation']}")
```

### Option 1: Full-Stack Trainer (Recommended)

```python
from quarterbit import AXIOM_Trainer

trainer = AXIOM_Trainer(model, train_loader, val_loader, lr=5e-4)
results = trainer.fit(steps=2000)

print(f"Final PPL: {results['final_val_ppl']:.2f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
```

### Option 2: Manual Training Loop

```python
from quarterbit import AXIOM

opt = AXIOM(model.parameters(), lr=1e-3)

# Warmup phase (no compression)
for i, batch in enumerate(dataloader):
    if i == 100:  # Enable after warmup
        opt.register_hooks(lr_scale=25.0)

    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

## Memory Comparison (2.8B Model)

| Optimizer | Gradients | Opt State | Total |
|-----------|-----------|-----------|-------|
| Adam | 11.2 GB | 22 GB | 33 GB |
| AXIOM | 11.2 GB | 0.13 GB | 11.3 GB |
| AXIOM + hooks | 16 MB | 0.13 GB | 0.15 GB |
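
As a rough sanity check of the Adam column, the figures follow from parameter count times bytes per element; the sketch below shows that back-of-envelope arithmetic (the compressed AXIOM figures come from the library's own reporting and are not derived here).

```python
n_params = 2.8e9                          # 2.8B-parameter model

grads_gb = n_params * 4 / 1e9             # FP32 gradients: ~11.2 GB
adam_state_gb = n_params * 2 * 4 / 1e9    # Adam m + v in FP32: ~22.4 GB

print(f"Gradients: {grads_gb:.1f} GB, Adam state: {adam_state_gb:.1f} GB")
```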

## How AXIOM Compares to Other Methods

### vs State-of-the-Art (2024-2025)

| Method | Optimizer | Gradients | Total | Notes |
|--------|-----------|-----------|-------|-------|
| [GaLore](https://arxiv.org/abs/2403.03507) | 8x | 1x | 8x | Low-rank projection (SVD) |
| [APOLLO](https://arxiv.org/abs/2412.05270) | 8x | 1x | 8x | Random projection |
| [LOMO](https://github.com/OpenLMLab/LOMO) | ∞ | ∞ | ∞ | Fused backward (SGD-like) |
| [8-bit Adam](https://github.com/TimDettmers/bitsandbytes) | 4x | 1x | 4x | INT8 quantization |
| [Adafactor](https://arxiv.org/abs/1804.04235) | 2x | 1x | 2x | Factored states |
| **AXIOM** | **280x** | **200x** | **220x** | Proprietary compression |

**AXIOM provides 27x more compression** than GaLore/APOLLO. The tradeoff: AXIOM's lossy compression means slower convergence per step, but enables training models that wouldn't fit otherwise.

### Optimizer State Compression (7B Model)

| Method | Storage | Memory | Compression |
|--------|---------|--------|-------------|
| AdamW | Full FP32 m + v | 56 GB | 1x |
| 8-bit Adam | INT8 m + v | 14 GB | 4x |
| GaLore | Low-rank states | 7 GB | 8x |
| APOLLO | Random projection | 7 GB | 8x |
| Adafactor | Factored states | 28 GB | 2x |
| **AXIOM** | Proprietary | **0.2 GB** | **280x** |

### Gradient Compression (7B Model)

| Method | Approach | Memory | Compression |
|--------|----------|--------|-------------|
| FP32 | Full precision | 28 GB | 1x |
| FP16 | Half precision | 14 GB | 2x |
| GaLore | Low-rank projection | 3.5 GB | 8x |
| **AXIOM** | Proprietary | **0.14 GB** | **200x** |

### Total Memory (Optimizer + Gradients)

| Setup | Total Memory | Compression | Single GPU? |
|-------|--------------|-------------|-------------|
| AdamW FP32 | 84 GB | 1x | Needs A100 |
| 8-bit Adam | 28 GB | 3x | Needs A100 |
| GaLore | 10.5 GB | 8x | RTX 3090/4090 |
| APOLLO | 10.5 GB | 8x | RTX 3090/4090 |
| ZeRO-3 / FSDP | 10 GB per GPU | 8x | No (cluster) |
| **AXIOM + hooks** | **0.34 GB** | **220x** | **Yes (laptop)** |

### Compression Scales with Model Size

| Model Size | AXIOM Compression | Best Competitor |
|------------|-------------------|-----------------|
| 774M (GPT-2) | 170x | 8x (GaLore) |
| 7B | 220x | 8x (GaLore) |
| 34B | 853x | 8x (GaLore) |

### What Makes AXIOM Different

- **27x better compression** than state-of-the-art (GaLore, APOLLO)
- **Single GPU** - No distributed training required
- **Drop-in replacement** - Just change your optimizer
- **Both optimizer AND gradients** - Other methods only compress one
- **Scales with model size** - Larger models = higher compression
- **Production quality** - 60-90% of Adam convergence, enough for fine-tuning

## 7B-9B Models on Laptop GPU (4-bit Fine-tuning)

**Fine-tune billion-parameter models on a $1,500 laptop.** With 4-bit quantization, only embeddings and layer norms are trainable (~5% of parameters). This is partial fine-tuning, not full training.
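
To see this for yourself after loading a model in 4-bit, you can count which parameters still require gradients. This snippet is illustrative only; it assumes a bitsandbytes-quantized HuggingFace model is already loaded as `model` (as in the code further below), and parameter counts for packed 4-bit weights are approximate.

```python
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable / 1e6:.1f}M / {total / 1e6:.1f}M "
      f"({100 * trainable / total:.1f}%)")  # roughly the embeddings + layer norms
```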

### Verified Results

| Model | Parameters | Loss Improvement | Peak VRAM | Hardware |
|-------|------------|------------------|-----------|----------|
| Mistral-7B | 7B | 92.6% | 13.9 GB | RTX 4070 Laptop |
| Yi-1.5-9B | 9B | 62.7% | 16.9 GB | RTX 4070 Laptop |

### Configuration

```
Hardware:        RTX 4070 Laptop GPU (8GB VRAM)
Quantization:    4-bit (bitsandbytes)
Optimizer:       AXIOM with warmup + lr_scale=25
```

### Memory Breakdown

| Component | Normal Training | AXIOM + 4-bit |
|-----------|----------------|---------------|
| Model weights | 28 GB (FP32) | 4.5 GB (4-bit) |
| Optimizer states | 56 GB (Adam) | ~50 MB (AXIOM) |
| Gradients | 28 GB | Compressed |
| **Total** | **112 GB** | **~8 GB** |

**Cost comparison:** $15,000+ A100 → $1,500 gaming laptop

### Code

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from quarterbit import AXIOM
import torch

# Load 7B model in 4-bit
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb,
    device_map="auto"
)

# AXIOM optimizer over the remaining trainable parameters (~5%)
trainable = [p for p in model.parameters() if p.requires_grad]
opt = AXIOM(trainable, lr=1e-4)

def train_step(model, batch):
    # Forward pass; `batch` is a dict of tensors from your tokenized DataLoader
    return model(**batch).loss

batches = iter(dataloader)  # your DataLoader of tokenized batches

# Warmup phase (10 steps without hooks)
for step in range(10):
    loss = train_step(model, next(batches))
    opt.zero_grad()
    loss.backward()
    opt.step(loss=loss.item())

# Enable compression
opt.register_hooks(lr_scale=25.0)

# Training phase
for step in range(50):
    loss = train_step(model, next(batches))
    opt.zero_grad()
    loss.backward()
    opt.step(loss=loss.item())
```

**AXIOM democratizes LLM training - anyone with a gaming laptop can now fine-tune 7B models.**

---

## Benchmark: 34B Model on Single GPU

**The "impossible" made possible.** We trained Yi-1.5-34B on a single H100 (80GB) - a model that would normally require 8 GPUs with AdamW.

### Memory Analysis

| Component | AdamW | AXIOM | Compression |
|-----------|-------|-------|-------------|
| Model (FP16) | 68.8 GB | 68.8 GB | 1x |
| Gradients | 68.8 GB | 201 MB | **341x** |
| Optimizer (m+v) | 275.1 GB | 201 MB | **1365x** |
| **Total Needed** | **413 GB** | **69 GB** | **853x** (gradients + optimizer) |
| Fits H100 80GB? | NO | YES | - |

AdamW needs 413 GB. An H100 has 80 GB - that's 5.2x the memory that physically exists on the card.

### Training Results

```
Model:           Yi-1.5-34B (34.4B parameters)
Hardware:        1x H100 80GB
Training:        2000 steps, 98 minutes
Peak VRAM:       78.9 GB / 80 GB

Validation PPL:  20.20 → 4.92 (75.6% improvement)
```

The model converged properly and generates coherent text:

```
Prompt: "The future of artificial intelligence is"
Output: "The future of artificial intelligence is very bright. AI has already
        made a significant impact on our lives, and it is only going to become
        more prevalent in the years to come..."
```

### Cost & Environmental Impact

| | AdamW (8x H100) | AXIOM (1x H100) | Savings |
|--|-----------------|-----------------|---------|
| Cloud Cost | $45.80 | $5.73 | **88%** |
| Carbon (CO2) | 3.85 kg | 0.48 kg | **87%** |

**AXIOM makes frontier-scale training accessible on single GPUs.**

## AXIOM Optimizer

```python
from quarterbit import AXIOM

opt = AXIOM(
    params,                    # Model parameters
    lr=0.001,                  # Learning rate
    weight_decay=0.01,         # Decoupled weight decay
    max_grad_norm=None,        # Gradient clipping (None = disabled)
    detect_anomaly=True,       # Error on NaN/Inf gradients
    streak_boost=1.5,          # Convergence tuning
    snr_floor=0.5,             # Noise filtering
)
```

### Methods

```python
opt.register_hooks()           # Enable gradient compression (call once)
opt.remove_hooks()             # Disable gradient compression
opt.step(loss=loss.item())     # Update weights (pass loss value)
opt.zero_grad()                # Clear gradients
opt.get_lr()                   # Get current learning rate
opt.set_lr(0.0005)             # Change learning rate
opt.state_dict()               # Save optimizer state
opt.load_state_dict(state)     # Load optimizer state
opt.memory_usage()             # Print memory stats
```

### Warmup Strategy (Recommended)

For best results, use warmup before enabling gradient compression:

```python
opt = AXIOM(model.parameters(), lr=5e-4)
batches = iter(dataloader)               # your training DataLoader

# Phase 1: Warmup (50-100 steps, no hooks)
for step in range(100):
    opt.zero_grad()
    loss = model(**next(batches)).loss   # HuggingFace-style forward pass
    loss.backward()
    opt.step(loss=loss.item())

# Phase 2: Enable compression with lr_scale
opt.register_hooks(lr_scale=25.0)

# Phase 3: Continue training with compression
for step in range(100, 2000):
    opt.zero_grad()
    loss = model(**next(batches)).loss
    loss.backward()
    opt.step(loss=loss.item())
```

**Why warmup?** The optimizer needs to build accurate statistics before compression. `lr_scale=25.0` compensates for gradient magnitude reduction from compression.

### Hooks: Memory vs Quality Tradeoff

| Mode | Memory | Convergence | Use When |
|------|--------|-------------|----------|
| Hooks ON (`register_hooks()`) | 220x compression | Use warmup | Large models (2B+) |
| Hooks OFF (default) | 3x compression | Faster convergence | Small models where quality matters most |

For most users: **use hooks ON with warmup** for large models, **hooks OFF** for small models.

### Example: GPT-2 Training

```python
import torch
from transformers import GPT2LMHeadModel
from quarterbit import AXIOM

model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
opt = AXIOM(model.parameters(), lr=5e-4)
opt.register_hooks()

for epoch in range(3):
    for batch in train_loader:
        input_ids = batch["input_ids"].cuda()

        opt.zero_grad()
        outputs = model(input_ids, labels=input_ids)
        outputs.loss.backward()
        opt.step(loss=outputs.loss.item())

    print(f"Epoch {epoch}: last-batch loss = {outputs.loss.item():.4f}")
```

## AXIOM_Trainer

Full-stack training with automatic monitoring, validation, and result export.

**Works with any HuggingFace model** - GPT-2, LLaMA, Mistral, Phi, etc. The trainer expects a HuggingFace-style API (`model(input_ids, labels=...) -> outputs.loss`).

```python
from quarterbit import AXIOM_Trainer

trainer = AXIOM_Trainer(
    model,                     # PyTorch model
    train_loader,              # Training DataLoader
    val_loader,                # Validation DataLoader (optional)
    lr=5e-4,                   # Learning rate
    weight_decay=0.01,         # Weight decay
    max_grad_norm=None,        # Gradient clipping
    eval_interval=200,         # Validate every N steps
    log_interval=100,          # Log every N steps
    checkpoint_interval=500,   # Save every N steps (0 = disabled)
    checkpoint_dir="checkpoints",
    save_results=True,         # Export JSON + PNG
    results_prefix="my_run",
    device="cuda",
)

results = trainer.fit(steps=5000)
```

### Results Dictionary

```python
results = trainer.fit(steps=2000)

# Training metrics
results['train_losses']           # List of all training losses
results['initial_train_loss']     # First loss value
results['final_train_loss']       # Last loss value
results['train_improvement_pct']  # Percent improvement

# Validation metrics
results['val_losses']             # List of validation losses
results['val_ppls']               # List of perplexities
results['initial_val_ppl']        # Starting perplexity
results['final_val_ppl']          # Final perplexity
results['val_improvement_pct']    # Percent improvement

# Performance
results['peak_vram_gb']           # Peak GPU memory used
results['tokens_per_sec']         # Training speed
results['total_time_min']         # Total training time

# Compression stats
results['compression_total']      # Overall compression ratio
```

### Trainer Methods

```python
# Manual checkpoint
trainer.save_checkpoint(step=1000, path="checkpoint.pt")
trainer.load_checkpoint("checkpoint.pt")

# Manual evaluation
val_loss, val_ppl = trainer.evaluate(max_batches=50)

# Cleanup hooks
trainer.cleanup()
```

### Example: Complete Training Script

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import load_dataset
from quarterbit import AXIOM_Trainer

# Load model
model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load data
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
val_data = dataset["validation"].map(tokenize, batched=True, remove_columns=["text"])
train_data.set_format("torch", columns=["input_ids", "attention_mask"])
val_data.set_format("torch", columns=["input_ids", "attention_mask"])

train_loader = DataLoader(train_data, batch_size=4, shuffle=True)
val_loader = DataLoader(val_data, batch_size=4)

# Train
trainer = AXIOM_Trainer(
    model,
    train_loader,
    val_loader,
    lr=5e-4,
    eval_interval=200,
    checkpoint_interval=1000,
)

results = trainer.fit(steps=2000)

print("Training complete!")
print(f"Val PPL: {results['initial_val_ppl']:.1f} → {results['final_val_ppl']:.1f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
```

---

## Verity: Cross-Hardware Deterministic Training

**Bit-exact reproducible training across different GPUs.**

Verity is a separate training system that guarantees identical weights whether you train on RTX 4070, Tesla P100, or T4. Same bytes, same model, any NVIDIA GPU.

### AXIOM vs Verity

| | AXIOM | Verity |
|--|-------|--------|
| **Goal** | Memory efficiency | Bit-exact reproducibility |
| **Compression** | 220x | None (full precision) |
| **Speed** | Fast | 3-10x slower |
| **Cross-hardware** | Same loss, different weights | **Identical weights** |
| **Use case** | Production training | Research, auditing, debugging |

**They cannot be combined.** AXIOM's compression introduces quantization noise. Verity requires full precision for determinism.

### When to Use Each

```
Need to train large models on limited VRAM?  → AXIOM
Need bit-exact reproducibility across GPUs?  → Verity
Need both?                                   → Train with Verity, deploy with AXIOM
```

### Verity Quick Start

```python
import torch.nn as nn

from quarterbit.verity import ops
from quarterbit.verity.ops import VerityAdam

# Build VLA layers (deterministic matmul, layernorm, softmax)
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = ops.VLALinear(768, 768)
        self.ln = ops.VLALayerNorm(768)

    def forward(self, x):
        return self.ln(self.linear(x))

model = MyModel().cuda()

# Use VerityAdam (deterministic sqrt)
optimizer = VerityAdam(model.parameters(), lr=1e-3)

# Training is now bit-exact across GPUs
for batch in dataloader:              # batches of shape (B, 768) on the GPU
    loss = model(batch).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

### Verified Hardware

| GPU | Architecture | Checksum After Training |
|-----|--------------|------------------------|
| RTX 4070 | Ada (sm_89) | `43f9992d56edbf52` |
| Tesla P100 | Pascal (sm_60) | `43f9992d56edbf52` |
| Tesla T4 | Turing (sm_75) | `43f9992d56edbf52` |

Three generations. Same checksum. Bit-exact.
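
One simple way to reproduce this kind of comparison yourself is to hash the trained weights after moving them to CPU. The snippet below is only an illustration (the exact checksum format Verity reports is not specified here), but any stable digest over the raw weight bytes will expose even a single-bit difference between runs.

```python
import hashlib
import torch

def weight_checksum(model: torch.nn.Module) -> str:
    """Illustrative digest over all weights, in a fixed (sorted) order."""
    h = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        h.update(name.encode())
        h.update(tensor.detach().cpu().contiguous().numpy().tobytes())
    return h.hexdigest()[:16]

print(weight_checksum(model))  # identical strings => bit-exact weights
```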

### How It Works

Standard GPU training is non-deterministic because:
- Reduction order varies (`sum([a,b,c])` computed differently)
- Floating-point is non-associative (`(a+b)+c ≠ a+(b+c)`)
- Hardware sqrt/rsqrt implementations differ

Verity solves this with:
- **VLA accumulator** - 4-limb FP64 with TwoSum error capture (see the sketch below)
- **Newton-Raphson sqrt** - Deterministic across all GPUs
- **Order-independent reduction** - Same result regardless of thread execution order
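
For intuition, the sketch below shows the classic TwoSum step (the exact rounding error of one floating-point addition) and the non-associativity it compensates for. It is a plain-Python illustration of the idea, not Verity's CUDA implementation.

```python
def two_sum(a: float, b: float):
    """Return (s, err) with s = fl(a + b) and a + b == s + err exactly."""
    s = a + b
    b_virtual = s - a
    err = (a - (s - b_virtual)) + (b - b_virtual)
    return s, err

# Floating-point addition is not associative...
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False

# ...but carrying the TwoSum error term recovers the lost bits, so a
# multi-limb accumulator can sum in any order and agree bit-for-bit.
s, err = two_sum(1e16, 1.0)
print(s, err)   # 1e16, 1.0 - the 1.0 survives in the error limb
```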

### Performance

| Model Size | Verity Overhead |
|------------|-----------------|
| 1.6M params | ~1x (negligible) |
| 10M params | ~3x slower |
| 120M params | ~5-10x slower |

The slowdown is the price of determinism. Use Verity when reproducibility matters more than speed.

### Documentation

See [docs/VERITY_CROSS_HARDWARE.md](docs/VERITY_CROSS_HARDWARE.md) for full technical details.

---

## Extensions

### AXIOM_CHECKPOINT - Activation Compression

Reduces activation memory by 85%. For large models where activations are the bottleneck.

```python
from quarterbit import AXIOM_CHECKPOINT

actcp = AXIOM_CHECKPOINT(max_slots=32, max_n=4*512*4096)

# In your model's forward pass
actcp.store(hidden_states, slot=layer_idx)

# During backward
restored = actcp.restore(slot=layer_idx)

# Check savings
stats = actcp.memory_stats()
print(f"Compression: {stats['compression_ratio']:.1f}x")
```

### AXIOM_DDP - Distributed Gradient Compression

128x bandwidth reduction for multi-GPU training.

```python
from quarterbit import AXIOM_DDP
import torch.distributed as dist

gc = AXIOM_DDP(n=total_params, top_k_percent=6.25)

# Compress before all-reduce
vals, idx, count = gc.compress(gradients)

# All-reduce compressed data (128x smaller)
dist.all_reduce(vals)
dist.all_reduce(idx)

# Decompress
full_grads = gc.decompress(vals, idx, count)
```

## Checkpointing

```python
# Save
torch.save({
    'model': model.state_dict(),
    'optimizer': opt.state_dict(),
    'step': step,
}, 'checkpoint.pt')

# Load
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
opt.load_state_dict(ckpt['optimizer'])
```

## Supported Models

- GPT-2, GPT-Neo, GPT-J
- LLaMA, LLaMA 2, LLaMA 3
- Gemma, Gemma 2
- Mistral, Mixtral
- Phi, Phi-2, Phi-3
- BERT, RoBERTa (fine-tuning)

## License

Commercial license required for production use.
Free for research and evaluation.

**https://quarterbit.dev**

---

Copyright 2026 Clouthier Simulation Labs. All rights reserved.
