Metadata-Version: 2.4
Name: quarterbit
Version: 19.6.16
Summary: Memory-efficient training for large language models
Home-page: https://quarterbit.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <info@quarterbit.dev>
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://quarterbit.dev
Project-URL: Documentation, https://quarterbit.dev/docs
Keywords: optimizer,adam,deep-learning,pytorch,gpu,memory-efficient,compression,axiom
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: ninja
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# QuarterBit

Memory-efficient training for large language models.

## Features

- **VLA Weight Compression** - Train a 70B model on an 80 GB GPU, or a 7B model on 8 GB
- **Streaming Model Loader** - Load huge models without memory spikes
- **Long Context Training** - 2K-token context on 8 GB, 8K+ on 24 GB; scales with VRAM
- **Activation Compression** - 61% activation memory savings (PHI-ActCP)
- **Optimizer Compression** - 1000x+ compressed optimizer state
- **Multi-GPU Support** - Optional gradient compression for distributed training
- **Full-Stack Trainer** - One-line training with all optimizations
- **Gradient Checkpointing** - ON by default
- **Auto Early Stopping** - Prevents overfitting automatically
- **Production Ready** - Gradient clipping, NaN detection, checkpointing

## Requirements

- **Python 3.11 or 3.12** (Windows) or **Python 3.12** (Linux)
- **PyTorch 2.0+** with CUDA
- **NVIDIA GPU** - Pascal or newer (GTX 10xx, RTX 20/30/40, T4, A100, H100)
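
A quick environment sanity check before installing. This snippet is not part of QuarterBit; it only mirrors the requirements above (Python >= 3.11, CUDA-enabled PyTorch, compute capability 6.0+ for Pascal-or-newer GPUs):

```python
import sys

def python_ok(version=(sys.version_info.major, sys.version_info.minor)):
    """True if the interpreter meets the >=3.11 requirement."""
    return version >= (3, 11)

print("Python OK:", python_ok())

# GPU checks are only meaningful once PyTorch is installed
try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        # Pascal and newer GPUs report compute capability 6.x or higher
        print("Pascal or newer:", torch.cuda.get_device_capability() >= (6, 0))
except ImportError:
    print("PyTorch not installed -- install it before QuarterBit")
```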

## Installation

```bash
# PyTorch required (install first)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install QuarterBit
pip install quarterbit
```

## Quick Start

### Option 1: Full-Stack Trainer (Recommended)

Automatic weight compression + optimizer compression + early stopping:

```python
from quarterbit import AXIOM_Trainer

trainer = AXIOM_Trainer(model, train_loader, val_loader)
results = trainer.fit(steps=2000)

print(f"Final PPL: {results['final_val_ppl']:.2f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
```

### Option 2: Manual Control

```python
from quarterbit import AXIOM, make_trainable_quantized

# Step 1: Compress weights
model = make_trainable_quantized(model)

# Step 2: Use AXIOM optimizer
opt = AXIOM(model.parameters(), lr=1e-3)
opt.register_hooks()  # Enable gradient compression

for batch in dataloader:
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

### Option 3: Optimizer Only

```python
from quarterbit import AXIOM

# Just optimizer compression (no weight quantization)
opt = AXIOM(model.parameters(), lr=1e-3)
opt.register_hooks()

for batch in dataloader:
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

## Weight Compression

The `make_trainable_quantized()` function compresses model weights while keeping them fully trainable:

```python
from quarterbit import make_trainable_quantized, verify_trainable

# Compress weights
model = make_trainable_quantized(model)

# Verify trainability
stats = verify_trainable(model)
print(f"Trainable: {stats['trainable_pct']:.1f}%")  # Should be ~100%
```

**Supported layers:**
- `nn.Linear` - Standard PyTorch linear layers
- `nn.Embedding` - Embedding tables
- `Conv1D` - HuggingFace GPT-2 style layers
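
To estimate coverage before compressing, you can scan a model for the layer types listed above. The name-based matching below is an illustrative heuristic, not how the library necessarily detects layers internally:

```python
# Class names from the supported-layers list above
SUPPORTED_LAYER_NAMES = {"Linear", "Embedding", "Conv1D"}

def is_compressible(module) -> bool:
    """True if a module's class name matches a supported layer type."""
    return type(module).__name__ in SUPPORTED_LAYER_NAMES

# Example usage with a PyTorch model:
# supported = sum(is_compressible(m) for m in model.modules())
# print(f"{supported} compressible modules found")
```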

## AXIOM Optimizer

```python
from quarterbit import AXIOM

opt = AXIOM(
    params,                    # Model parameters
    lr=0.001,                  # Learning rate
    weight_decay=0.01,         # Decoupled weight decay
    max_grad_norm=None,        # Gradient clipping (None = disabled)
    detect_anomaly=True,       # Error on NaN/Inf gradients
)
```

### Methods

```python
opt.register_hooks()           # Enable gradient compression (call once)
opt.remove_hooks()             # Disable gradient compression
opt.step(loss=loss.item())     # Update weights (pass loss value)
opt.zero_grad()                # Clear gradients
opt.get_lr()                   # Get current learning rate
opt.set_lr(0.0005)             # Change learning rate
opt.state_dict()               # Save optimizer state
opt.load_state_dict(state)     # Load optimizer state
opt.memory_usage()             # Print memory stats
```
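
Since the method list above exposes `get_lr()`/`set_lr()` rather than a built-in scheduler, a manual schedule can be wired into the training loop. The decay interval and factor below are illustrative, not recommended defaults:

```python
def step_decay(base_lr: float, step: int, decay_every: int = 1000, gamma: float = 0.5) -> float:
    """Multiply the learning rate by `gamma` every `decay_every` steps."""
    return base_lr * (gamma ** (step // decay_every))

# Inside a training loop, using AXIOM's set_lr from the method list above:
# opt.set_lr(step_decay(1e-3, step))
```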

### Warmup Strategy (Recommended)

For best results, use warmup before enabling gradient compression:

```python
opt = AXIOM(model.parameters(), lr=5e-4)
data_iter = iter(dataloader)

# Phase 1: Warmup (50-100 steps, no hooks)
for step in range(100):
    batch = next(data_iter)
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())

# Phase 2: Enable compression with lr_scale
opt.register_hooks(lr_scale=25.0)

# Phase 3: Continue training with compression
for step in range(100, 2000):
    batch = next(data_iter)
    opt.zero_grad()
    loss = model(batch).loss
    loss.backward()
    opt.step(loss=loss.item())
```

## AXIOM_Trainer

Full-stack training with automatic compression, monitoring, and result export.

```python
from quarterbit import AXIOM_Trainer

trainer = AXIOM_Trainer(
    model,                     # PyTorch model
    train_loader,              # Training DataLoader
    val_loader,                # Validation DataLoader (optional, but recommended)
    lr=1e-3,                   # Learning rate
    weight_decay=0.01,         # Weight decay
    quantize_weights=True,     # Enable weight compression (default: ON)
    gradient_checkpointing=True,  # Recompute activations to save memory (default: ON)
    early_stopping=True,       # Stop when val loss stops improving (default: ON)
    early_stopping_patience=3, # Stop after N evals without improvement
    warmup_steps=100,          # LR warmup for stable compression
    max_grad_norm=1.0,         # Gradient clipping
    eval_interval=200,         # Validate every N steps
    log_interval=100,          # Log every N steps
    checkpoint_interval=500,   # Save every N steps (0 = disabled)
    save_results=True,         # Export JSON + PNG
)

results = trainer.fit(steps=5000)
```

### Results Dictionary

```python
results = trainer.fit(steps=2000)

# Training metrics
results['train_losses']           # List of all training losses
results['final_train_loss']       # Last loss value
results['train_improvement_pct']  # Percent improvement

# Validation metrics
results['val_ppls']               # List of perplexities
results['final_val_ppl']          # Final perplexity
results['val_improvement_pct']    # Percent improvement
results['best_val_loss']          # Best validation loss achieved

# Early stopping
results['steps']                  # Actual steps completed
results['steps_requested']        # Steps requested (may differ if early stopped)
results['early_stopped']          # True if training stopped early

# Performance
results['peak_vram_gb']           # Peak GPU memory used
results['tokens_per_sec']         # Training speed
results['compression_total']      # Overall compression ratio
```
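
The results dictionary is plain Python, so a subset of it can be exported for experiment tracking. The values below are illustrative placeholders; real values come from `trainer.fit()`:

```python
import json

# Illustrative values only -- in practice: results = trainer.fit(steps=2000)
results = {
    "final_val_ppl": 32.7,
    "peak_vram_gb": 6.9,
    "steps": 1400,
    "steps_requested": 2000,
    "early_stopped": True,
}

# Keep just the headline metrics
summary = {k: results[k] for k in ("final_val_ppl", "peak_vram_gb", "early_stopped")}
with open("run_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```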

## Example: Complete Training Script

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import load_dataset
from quarterbit import AXIOM_Trainer

# Load model
model = GPT2LMHeadModel.from_pretrained("gpt2").cuda().half()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load data
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(examples):
    tokens = tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror inputs
    return tokens

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
val_data = dataset["validation"].map(tokenize, batched=True, remove_columns=["text"])
train_data.set_format("torch")  # yield tensors instead of Python lists
val_data.set_format("torch")

train_loader = DataLoader(train_data, batch_size=4, shuffle=True)
val_loader = DataLoader(val_data, batch_size=4)

# Train with Full Stack
trainer = AXIOM_Trainer(model, train_loader, val_loader)
results = trainer.fit(steps=2000)

print("Training complete!")
print(f"Val PPL: {results['initial_val_ppl']:.1f} → {results['final_val_ppl']:.1f}")
print(f"Peak VRAM: {results['peak_vram_gb']:.1f} GB")
print(f"Early stopped: {results['early_stopped']}")
```

## Extensions

### AXIOM_CHECKPOINT - Activation Compression

Reduces activation memory during training.

```python
from quarterbit import AXIOM_CHECKPOINT

actcp = AXIOM_CHECKPOINT(max_slots=32, max_n=4*512*4096)

# In your model's forward pass
actcp.store(hidden_states, slot=layer_idx)

# During backward
restored = actcp.restore(slot=layer_idx)

# Check savings
stats = actcp.memory_stats()
print(f"Compression: {stats['compression_ratio']:.1f}x")
```

### AXIOM_DDP - Distributed Gradient Compression

Bandwidth reduction for multi-GPU training.

```python
from quarterbit import AXIOM_DDP
import torch.distributed as dist

compressor = AXIOM_DDP(n=total_params, top_k_percent=6.25)  # avoid shadowing stdlib `gc`

# Compress before all-reduce
vals, idx, count = compressor.compress(gradients)

# All-reduce compressed data
dist.all_reduce(vals)
dist.all_reduce(idx)

# Decompress
full_grads = compressor.decompress(vals, idx, count)
```
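
To illustrate what top-k gradient compression does, independent of the AXIOM_DDP API, here is a plain-Python sketch of the underlying idea: keep only the largest-magnitude fraction of entries (at `top_k_percent=6.25`, 1 in 16) plus their indices, and scatter them back into a dense zero vector on decompression:

```python
def topk_compress(grads, top_k_percent=6.25):
    """Keep only the largest-magnitude fraction of gradient entries."""
    k = max(1, int(len(grads) * top_k_percent / 100))
    # Indices of the k entries with the largest absolute value
    idx = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)[:k]
    vals = [grads[i] for i in idx]
    return vals, idx

def topk_decompress(vals, idx, n):
    """Scatter kept values back into a dense vector of length n."""
    full = [0.0] * n
    for v, i in zip(vals, idx):
        full[i] = v
    return full
```

A production implementation would operate on GPU tensors (e.g. `torch.topk`) rather than Python lists, but the compress/decompress contract is the same.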

## Checkpointing

```python
# Save
torch.save({
    'model': model.state_dict(),
    'optimizer': opt.state_dict(),
    'step': step,
}, 'checkpoint.pt')

# Load
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
opt.load_state_dict(ckpt['optimizer'])
```

## Supported Models

- GPT-2, GPT-Neo, GPT-J
- LLaMA, LLaMA 2, LLaMA 3
- Gemma, Gemma 2
- Mistral, Mixtral
- Phi, Phi-2, Phi-3
- BERT, RoBERTa (fine-tuning)

## License

Commercial license required for production use.
Free for research and evaluation.

**https://quarterbit.dev**

---

Copyright 2026 Clouthier Simulation Labs. All rights reserved.
