Metadata-Version: 2.4
Name: quarterbit
Version: 18.2.1
Summary: AXIOM Full Stack - 4.8x memory reduction with TRAINABLE-QUANT + AXIOM optimizer
Home-page: https://quarterbit.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <info@quarterbit.dev>
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://quarterbit.dev
Project-URL: Documentation, https://quarterbit.dev/docs
Keywords: optimizer,adam,deep-learning,pytorch,gpu,memory-efficient,compression,axiom
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# QuarterBit - AXIOM Full Stack

**4.8x memory reduction for LLM training**

Train larger models on the same hardware with automatic weight compression + optimizer compression.

## Features

- **Train 774M+ Models on 8GB GPU** - Full training with all parameters
- **4.8x Memory Reduction** - Weight quantization + optimizer compression combined
- **Automatic Weight Compression** - INT4 base + trainable refinement (1.6x)
- **800x Optimizer Compression** - Proprietary state management
- **Same Training Quality** - Matches AdamW convergence
- **Production Ready** - Gradient clipping, NaN detection, checkpointing
- **Gradient Checkpointing** - ON by default, saves 50-85% activation memory
- **Full-Stack Trainer** - One-line training with automatic compression
- **Optimal LR by Default** - lr=1e-3 tuned for gradient compression (60% faster convergence)
- **Auto Early Stopping** - Prevents overfitting automatically

## When to Use AXIOM

AXIOM trades a small amount of compute for large memory savings. Use it when VRAM is your bottleneck:

| Model Size | GPU VRAM | Recommendation |
|------------|----------|----------------|
| < 500M | Any | Use **AdamW** - fits in memory |
| 500M - 1B | 8GB | Use **AXIOM** - enables full training |
| 774M - 3B | 8-16GB | Use **AXIOM Full Stack** - 4.8x reduction |
| 3B+ | 16GB+ | Use **AXIOM Full Stack** - required |
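As a rule of thumb, the table above collapses to a size check. The helper below is illustrative only (`recommend` is not part of the quarterbit API); it keys on parameter count, which is what primarily drives the recommendation:

```python
def recommend(params_millions):
    """Rule-of-thumb optimizer choice from the table above (illustrative only)."""
    if params_millions < 500:
        return "AdamW"             # model + optimizer states fit in memory as-is
    if params_millions < 774:
        return "AXIOM"             # optimizer compression enables full training
    return "AXIOM Full Stack"      # weight + optimizer compression (4.8x)

print(recommend(124))   # AdamW
print(recommend(774))   # AXIOM Full Stack
```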

## Requirements

- **Python 3.11 or 3.12** (Windows) or **Python 3.12** (Linux)
- **PyTorch 2.0+** with CUDA
- **NVIDIA GPU** - Pascal or newer (GTX 10xx, RTX 20/30/40, T4, A100, H100)

## Installation

```bash
# PyTorch required (install first)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install QuarterBit
pip install quarterbit
```

## Quick Start

### Option 1: Full-Stack Trainer (Recommended)

Automatic weight compression + optimizer compression + early stopping:

```python
from quarterbit import AXIOM_Trainer

# Everything optimal by default:
# - lr=1e-3 (tuned for gradient compression)
# - early_stopping=True (prevents overfitting)
# - quantize_weights=True (1.6x weight compression)
trainer = AXIOM_Trainer(model, train_loader, val_loader)
results = trainer.fit(steps=2000)

print(f"Final PPL: {results['final_val_ppl']:.2f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
```

### Option 2: Manual Control

```python
from quarterbit import AXIOM, make_trainable_quantized

# Step 1: Compress weights (1.6x reduction)
model = make_trainable_quantized(model)

# Step 2: Use AXIOM optimizer (800x state compression)
opt = AXIOM(model.parameters(), lr=1e-3)
opt.register_hooks()  # Enable gradient compression

for batch in dataloader:
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

### Option 3: Optimizer Only

```python
from quarterbit import AXIOM

# Just optimizer compression (no weight quantization)
opt = AXIOM(model.parameters(), lr=1e-3)
opt.register_hooks()

for batch in dataloader:
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

## Memory Comparison (774M Model - GPT-2 Large)

| Component | Standard | AXIOM Full Stack | Reduction |
|-----------|----------|------------------|-----------|
| Weights | 3.1 GB | 1.9 GB | 1.6x |
| Optimizer | 6.2 GB | 8 MB | 800x |
| **Total** | **9.3 GB** | **~2 GB** | **4.8x** |

Standard training needs 9.3+ GB of VRAM; with Full Stack, the same run fits in an 8 GB GPU.
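The table's totals follow from simple byte accounting. This is back-of-the-envelope arithmetic (decimal gigabytes, FP32 weights, Adam's two FP32 state tensors), not output from the library:

```python
params = 774e6                       # GPT-2 Large parameter count
GB = 1e9                             # the table rounds in decimal gigabytes

weights_fp32 = params * 4 / GB       # 4 bytes/param      -> ~3.1 GB
adam_states = params * 8 / GB        # m + v in FP32      -> ~6.2 GB
standard = weights_fp32 + adam_states            # ~9.3 GB total

compressed = weights_fp32 / 1.6 + 0.008          # 1.6x weights + ~8 MB states
print(f"{standard:.1f} GB -> {compressed:.1f} GB "
      f"({standard / compressed:.1f}x)")         # 9.3 GB -> 1.9 GB (4.8x)
```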

## Verified Results

| Model | Parameters | PPL (before) | PPL (after) | Peak VRAM |
|-------|------------|--------------|-------------|-----------|
| GPT-2 Base | 124M | 64.8 | 2.3 | 1.8 GB |
| GPT-2 Large | 774M | 39.2 | 22.1 | 4.9 GB |

Training quality matches AdamW - loss curves are identical.

## How AXIOM Compares

### vs State-of-the-Art (2024-2025)

| Method | Optimizer | Gradients | Total | Notes |
|--------|-----------|-----------|-------|-------|
| [GaLore](https://arxiv.org/abs/2403.03507) | 8x | 1x | 8x | Low-rank projection |
| [APOLLO](https://arxiv.org/abs/2412.05270) | 8x | 1x | 8x | Random projection |
| [8-bit Adam](https://github.com/TimDettmers/bitsandbytes) | 4x | 1x | 4x | INT8 states |
| **AXIOM Full Stack** | **800x** | **200x** | **4.8x total** | Weight + optimizer |

## Weight Compression

The `make_trainable_quantized()` function compresses model weights while keeping them fully trainable:

```python
from quarterbit import make_trainable_quantized, verify_trainable

# Compress weights
model = make_trainable_quantized(model)

# Verify trainability
stats = verify_trainable(model)
print(f"Trainable: {stats['trainable_pct']:.1f}%")  # Should be ~100%
```

**How it works:**
- Compresses weights to compact representation
- Adds trainable refinement parameters
- Gradients flow through refinement → full training capability
- 1.6x memory reduction with zero quality loss
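The core idea - a frozen low-bit base plus a trainable refinement that the gradients flow through - can be sketched in a few lines of plain-Python arithmetic. This is illustrative only, not the library's internals:

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: floats -> integers in [-8, 7] plus a scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    base = [max(-8, min(7, round(w / scale))) for w in weights]
    return base, scale

def effective_weight(base, scale, refinement):
    """Frozen INT4 base + trainable refinement = the weight the model uses."""
    return [q * scale + r for q, r in zip(base, refinement)]

w = [0.31, -0.72, 0.05, 0.98]
base, scale = quantize_int4(w)
refine = [wi - q * scale for wi, q in zip(w, base)]   # residual the refinement learns
restored = effective_weight(base, scale, refine)
print([round(x, 2) for x in restored])   # [0.31, -0.72, 0.05, 0.98]
```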

**Supported layers:**
- `nn.Linear` - Standard PyTorch linear layers
- `nn.Embedding` - Embedding tables
- `Conv1D` - HuggingFace GPT-2 style layers

## AXIOM Optimizer

```python
from quarterbit import AXIOM

opt = AXIOM(
    params,                    # Model parameters
    lr=0.001,                  # Learning rate
    weight_decay=0.01,         # Decoupled weight decay
    max_grad_norm=None,        # Gradient clipping (None = disabled)
    detect_anomaly=True,       # Error on NaN/Inf gradients
)
```

### Methods

```python
opt.register_hooks()           # Enable gradient compression (call once)
opt.remove_hooks()             # Disable gradient compression
opt.step(loss=loss.item())     # Update weights (pass loss value)
opt.zero_grad()                # Clear gradients
opt.get_lr()                   # Get current learning rate
opt.set_lr(0.0005)             # Change learning rate
opt.state_dict()               # Save optimizer state
opt.load_state_dict(state)     # Load optimizer state
opt.memory_usage()             # Print memory stats
```

### Warmup Strategy (Recommended)

For best results, use warmup before enabling gradient compression:

```python
opt = AXIOM(model.parameters(), lr=5e-4)

# Phase 1: Warmup (50-100 steps, no hooks)
for step, batch in zip(range(100), dataloader):
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())

# Phase 2: Enable compression with lr_scale
opt.register_hooks(lr_scale=25.0)

# Phase 3: Continue training with compression
for step, batch in zip(range(100, 2000), dataloader):
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

## AXIOM_Trainer

Full-stack training with automatic compression, monitoring, and result export.

```python
from quarterbit import AXIOM_Trainer

trainer = AXIOM_Trainer(
    model,                     # PyTorch model
    train_loader,              # Training DataLoader
    val_loader,                # Validation DataLoader (optional, but recommended)
    lr=1e-3,                   # Learning rate (optimal for gradient compression)
    weight_decay=0.01,         # Weight decay
    quantize_weights=True,     # Enable weight compression (default: ON)
    gradient_checkpointing=True,  # Recompute activations to save memory (default: ON)
    early_stopping=True,       # Stop when val loss stops improving (default: ON)
    early_stopping_patience=3, # Stop after N evals without improvement
    warmup_steps=100,          # LR warmup for stable compression
    max_grad_norm=1.0,         # Gradient clipping
    eval_interval=200,         # Validate every N steps
    log_interval=100,          # Log every N steps
    checkpoint_interval=500,   # Save every N steps (0 = disabled)
    save_results=True,         # Export JSON + PNG
)

results = trainer.fit(steps=5000)
```

**Why lr=1e-3?** Gradient compression adds noise. Higher LR compensates, achieving 60% faster convergence than lr=5e-4. Early stopping prevents overfitting that can occur with high LR after ~15 epochs.

### Results Dictionary

```python
results = trainer.fit(steps=2000)

# Training metrics
results['train_losses']           # List of all training losses
results['final_train_loss']       # Last loss value
results['train_improvement_pct']  # Percent improvement

# Validation metrics
results['val_ppls']               # List of perplexities
results['final_val_ppl']          # Final perplexity
results['val_improvement_pct']    # Percent improvement
results['best_val_loss']          # Best validation loss achieved

# Early stopping
results['steps']                  # Actual steps completed
results['steps_requested']        # Steps requested (may differ if early stopped)
results['early_stopped']          # True if training stopped early

# Performance
results['peak_vram_gb']           # Peak GPU memory used
results['tokens_per_sec']         # Training speed
results['compression_total']      # Overall compression ratio
```

## Example: Complete Training Script

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import load_dataset
from quarterbit import AXIOM_Trainer

# Load model
model = GPT2LMHeadModel.from_pretrained("gpt2").cuda().half()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load data
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
val_data = dataset["validation"].map(tokenize, batched=True, remove_columns=["text"])
train_data.set_format("torch")  # yield tensors, not lists, from the DataLoader
val_data.set_format("torch")

train_loader = DataLoader(train_data, batch_size=4, shuffle=True)
val_loader = DataLoader(val_data, batch_size=4)

# Train with Full Stack - all optimizations ON by default
trainer = AXIOM_Trainer(model, train_loader, val_loader)
results = trainer.fit(steps=2000)

print("Training complete!")
print(f"Val PPL: {results['initial_val_ppl']:.1f} → {results['final_val_ppl']:.1f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
print(f"Early stopped: {results['early_stopped']}")
```

## Extensions

### AXIOM_CHECKPOINT - Activation Compression

Reduces activation memory by 85% - useful for large models where activations, not weights, are the bottleneck.

```python
from quarterbit import AXIOM_CHECKPOINT

actcp = AXIOM_CHECKPOINT(max_slots=32, max_n=4*512*4096)

# In your model's forward pass
actcp.store(hidden_states, slot=layer_idx)

# During backward
restored = actcp.restore(slot=layer_idx)

# Check savings
stats = actcp.memory_stats()
print(f"Compression: {stats['compression_ratio']:.1f}x")
```

### AXIOM_DDP - Distributed Gradient Compression

128x bandwidth reduction for multi-GPU training.

```python
from quarterbit import AXIOM_DDP
import torch.distributed as dist

gc = AXIOM_DDP(n=total_params, top_k_percent=6.25)

# Compress before all-reduce
vals, idx, count = gc.compress(gradients)

# All-reduce compressed data (128x smaller)
dist.all_reduce(vals)
dist.all_reduce(idx)

# Decompress
full_grads = gc.decompress(vals, idx, count)
```
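Conceptually, the compress/decompress round trip is top-k gradient sparsification: keep only the largest-magnitude `top_k_percent` of entries plus their indices, and scatter them back after the all-reduce. A plain-Python sketch of the idea (not the library's implementation):

```python
def topk_compress(grads, top_k_percent=6.25):
    """Keep only the top_k_percent largest-magnitude entries (values + indices)."""
    k = max(1, int(len(grads) * top_k_percent / 100))
    idx = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)[:k]
    return [grads[i] for i in idx], idx

def topk_decompress(vals, idx, n):
    """Scatter the kept values back into a dense, mostly-zero gradient."""
    full = [0.0] * n
    for v, i in zip(vals, idx):
        full[i] = v
    return full

g = [0.01, -2.5, 0.03, 0.9] * 8                # 32 gradient entries
vals, idx = topk_compress(g)                    # 6.25% of 32 -> keeps 2 entries
full = topk_decompress(vals, idx, len(g))
print(len(vals), full[1])                       # 2 -2.5
```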

## Checkpointing

```python
# Save
torch.save({
    'model': model.state_dict(),
    'optimizer': opt.state_dict(),
    'step': step,
}, 'checkpoint.pt')

# Load
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
opt.load_state_dict(ckpt['optimizer'])
```

## Supported Models

- GPT-2, GPT-Neo, GPT-J
- LLaMA, LLaMA 2, LLaMA 3
- Gemma, Gemma 2
- Mistral, Mixtral
- Phi, Phi-2, Phi-3
- BERT, RoBERTa (fine-tuning)

## License

Commercial license required for production use.
Free for research and evaluation.

**https://quarterbit.dev**

---

Copyright 2026 Clouthier Simulation Labs. All rights reserved.
