Metadata-Version: 2.4
Name: simgen-vla
Version: 4.1.0
Summary: TRUE ZERO Exact ODE Simulation. Zero accumulation error. Perfectly reproducible.
Home-page: https://simgen.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <kyle@simgen.dev>
License: Proprietary
Project-URL: Homepage, https://simgen.dev
Project-URL: Documentation, https://simgen.dev/docs
Project-URL: Repository, https://github.com/clouthier-simulation-labs/simgen
Keywords: exact-arithmetic,GPU,precision,lossless,scientific-computing,machine-learning,deep-learning,simulation,finance,HPC,cuda,pytorch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Physics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: cython; extra == "dev"
Requires-Dist: torch>=2.0; extra == "dev"
Requires-Dist: numpy>=1.20; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# SimGen VLA - Zero-Error GPU Arithmetic

**The first library to achieve TRUE zero accumulation error on GPU.**

[![PyPI version](https://badge.fury.io/py/simgen-vla.svg)](https://pypi.org/project/simgen-vla/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Proprietary-red.svg)](LICENSE)

```
THE PROBLEM: Same code + different GPU = DIFFERENT ANSWER

RTX 4090:    1.0000000001
Tesla T4:    0.9999999998
A100:        1.0000000003
             ^^^^^^^^^^^^^^ CHAOS

THE SOLUTION: VLA checksums are BIT-IDENTICAL across ALL GPUs

RTX 4090:    6ece6956f187064f
Tesla T4:    6ece6956f187064f
A100:        6ece6956f187064f
             ^^^^^^^^^^^^^^^^ IDENTICAL
```

## Installation

```bash
pip install simgen-vla
```

**Requirements:**
- Python 3.10+
- PyTorch 2.0+ with CUDA
- NVIDIA GPU (sm_60 to sm_90: Pascal, Volta, Turing, Ampere, Ada, Hopper)

---

## NEW in v3.5.3: GPU ModularTensor - TRUE ZERO on CUDA

**ModularTensor** brings TRUE ZERO exact arithmetic to GPU with full CUDA acceleration.

```python
from simgen import vla

# GPU exact arithmetic (TRUE ZERO error)
a = vla.ModularTensor.from_fraction(1, 3, shape=(10000,), device='cuda')
b = vla.ModularTensor.from_fraction(1, 6, shape=(10000,), device='cuda')
c = a + b  # Exactly 1/2 for all 10,000 elements

# Exact equality check
expected = vla.ModularTensor.from_fraction(1, 2, shape=(10000,), device='cuda')
print((c == expected).all())  # True - TRUE ZERO!

# 100K iterations with TRUE ZERO
acc = vla.ModularTensor.from_int(0, shape=(1,), device='cuda')
delta = vla.ModularTensor.from_fraction(1, 100000, shape=(1,), device='cuda')
for _ in range(100000):
    acc = acc + delta
print((acc == vla.ModularTensor.from_int(1, shape=(1,), device='cuda')).all())  # True!
```

### CPU ModularRational (also available)

```python
from simgen import vla

# CPU exact arithmetic
a = vla.ModularRational.from_fraction(1, 3)
b = vla.ModularRational.from_fraction(1, 6)
c = a + b  # Exactly 1/2
print(c == vla.ModularRational.from_fraction(1, 2))  # True
```

### Why Modular Arithmetic?

- **TRUE ZERO error** - Not ~1e-15, but mathematically ZERO
- **Constant memory** - Fixed memory per value regardless of operation count
- **GPU accelerated** - 444M ops/sec on RTX 4070
- **Chaotic systems** - Lorenz attractor integrated for 50,000 steps with exact time reversibility
- **Vectors & Matrices** - Full tensor operations on GPU

### When to use each precision level

| Type | Device | Error | Use Case |
|------|--------|-------|----------|
| `VLADecimal` | GPU | ~1e-15 | Production simulations, GPU speed |
| `ModularTensor` | GPU | TRUE ZERO | Financial, cryptographic, verification |
| `ModularRational` | CPU | TRUE ZERO | Exact scalar arithmetic |

---

## VLADecimal - GPU-Native Extended Precision

**VLADecimal** is a GPU-native extended precision type (106+ bit mantissa) that keeps ALL operations on GPU. No CPU conversions until you explicitly request a Python Decimal.

```python
import torch
from simgen import vla

# Create GPU-native extended precision tensors
x = vla.Decimal(torch.randn(1000, device='cuda'))
y = vla.Decimal(torch.randn(1000, device='cuda'))

# All operations stay on GPU with full precision
result = (x + y * 2).sum()

# Display exact value (converts to Python Decimal only for display)
print(result)  # VLADecimal(-12.34567890123456789...)

# Explicit conversion when you need Python Decimal
exact_value = result.to_decimal()  # decimal.Decimal object

# Convert back to torch.Tensor when done
tensor = result.to_torch()  # float64 tensor
```

### VLADecimal Features

- **82 methods** - Full arithmetic, reductions, linear algebra, trig, and more
- **GPU-native** - All operations stay on GPU until you explicitly convert
- **Chainable** - `(x + y).sum().sqrt()` preserves precision throughout (see the sketch after this list)
- **Indexing** - `x[0]`, `x[1:10]`, `x[::2]` all return VLADecimal
- **Shape ops** - `reshape`, `view`, `squeeze`, `transpose`, `flatten`, etc.
- **Factory functions** - `vla.Decimal_zeros()`, `vla.Decimal_randn()`, etc.

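A short sketch exercising the chaining, indexing, and shape behaviour listed above. It uses only names from this list and the factory functions in the next section (`Decimal_randn`, `Decimal_ones`, `reshape`, slicing, `.to_decimal()`, `.to_torch()`); treat the exact signatures as assumptions rather than a definitive reference.

```python
from simgen import vla

x = vla.Decimal_randn((64, 64))     # factory functions from the next section
y = vla.Decimal_ones((64, 64))

# Chaining: the whole expression stays in extended precision
rms = (x * x + y).sum().sqrt()

# Indexing and shape ops return VLADecimal, not plain tensors
first_row = x[0]
evens = x.flatten()[::2]
wide = x.reshape(32, 128)

# Convert only at the edges of your pipeline
print(rms.to_decimal())             # exact Python Decimal, for display/audit
out = wide.to_torch()               # back to a float64 torch.Tensor
```
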
### Factory Functions

```python
# Create from exact fractions (TRUE zero representation error)
x = vla.Decimal_frac(1, 3)  # Exact 1/3

# Create zeros/ones/random
zeros = vla.Decimal_zeros((100, 100))
ones = vla.Decimal_ones((100, 100))
randn = vla.Decimal_randn((100, 100))

# Concatenate/stack VLADecimal tensors
combined = vla.Decimal_cat([x, y], dim=0)
stacked = vla.Decimal_stack([x, y], dim=0)
```

---

## Quick Start (Standard API)

```python
import torch
from simgen import vla

# Create test data
x = torch.randn(10000, device='cuda')

# Exact sum with zero accumulation error
result = vla.sum(x)

# Get cross-GPU checksum (SAME on any GPU!)
checksum = vla.checksum(result)
print(f"Checksum: {checksum}")  # e.g., "6ece6956f187064f"
```

## The Killer Feature: Cross-GPU Reproducibility

```python
# This checksum is IDENTICAL on RTX 4070, Tesla T4, A100, H100...
result = vla.matmul(A, B)
checksum = vla.checksum(result)

# Verify reproducibility
vla.verify(result, "6ece6956f187064f")  # Raises if mismatch
```

---

## Global Enable Mode

Patch ALL PyTorch operations with one line:

```python
import torch
from simgen import vla

vla.enable()  # Now ALL torch ops use VLA!

# These now use exact arithmetic automatically:
torch.sum(x)       # Uses VLA internally
torch.matmul(A, B) # Uses VLA internally
model(input)       # Entire model uses VLA!

vla.disable()  # Restore standard PyTorch ops
```

### Context Manager

```python
with vla.mode():
    # All operations in this block use VLA
    result = torch.sum(x)
    output = model(input)
# Back to standard PyTorch outside the block
```

---

## Complete API Reference

### Core Reductions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.sum(x)` | Exact sum | `vla.sum(tensor)` |
| `vla.mean(x)` | Exact mean | `vla.mean(tensor)` |
| `vla.var(x)` | Exact variance | `vla.var(tensor)` |
| `vla.std(x)` | Exact std deviation | `vla.std(tensor)` |
| `vla.norm(x, p=2)` | Exact Lp norm | `vla.norm(tensor)` |
| `vla.prod(x)` | Exact product | `vla.prod(tensor)` |
| `vla.cumsum(x)` | Exact cumulative sum | `vla.cumsum(tensor)` |
| `vla.logsumexp(x)` | Numerically stable log-sum-exp | `vla.logsumexp(tensor)` |
| `vla.min(x)` | Minimum value | `vla.min(tensor)` |
| `vla.max(x)` | Maximum value | `vla.max(tensor)` |

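For orientation, a short sketch combining several of the reductions above on one tensor. The function names come from the table; the outputs are assumed to be ordinary CUDA tensors unless you request `return_vla=True` (see the chaining section below).

```python
import torch
from simgen import vla

x = torch.randn(1_000_000, device='cuda')

total   = vla.sum(x)      # exact sum
average = vla.mean(x)     # exact mean
spread  = vla.std(x)      # exact standard deviation
running = vla.cumsum(x)   # exact running totals, same shape as x

# The exact mean should agree with the exact sum divided by the element count
print(torch.allclose(average, total / x.numel()))
```
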
### Matrix Operations

| Function | Description | Example |
|----------|-------------|---------|
| `vla.dot(a, b)` | Exact dot product | `vla.dot(x, y)` |
| `vla.matmul(a, b)` | Exact matrix multiply | `vla.matmul(A, B)` |
| `vla.mm(a, b)` | Alias for matmul | `vla.mm(A, B)` |
| `vla.bmm(a, b)` | Batched matmul | `vla.bmm(batch_A, batch_B)` |
| `vla.linear(x, w, b)` | Linear layer | `vla.linear(x, weight, bias)` |
| `vla.einsum(eq, *ops)` | Einstein summation | `vla.einsum('ij,jk->ik', A, B)` |

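A sketch of the matrix helpers used together. It assumes `vla.linear` follows the familiar torch convention (`input @ weight.T + bias`) and that `vla.einsum` takes `torch.einsum`-style notation, as the Example column suggests.

```python
import torch
from simgen import vla

A = torch.randn(256, 128, device='cuda')
B = torch.randn(128, 64,  device='cuda')
x = torch.randn(32, 128,  device='cuda')
w = torch.randn(64, 128,  device='cuda')   # torch-style weight: (out, in)
b = torch.randn(64,       device='cuda')

C = vla.matmul(A, B)                       # exact 256x64 product
D = vla.einsum('ij,jk->ik', A, B)          # same product via einsum
y = vla.linear(x, w, b)                    # exact linear layer

print(vla.checksum(C))                     # deterministic cross-GPU checksum
```
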
### Element-wise Arithmetic

| Function | Description | Example |
|----------|-------------|---------|
| `vla.add(a, b)` | Addition | `vla.add(x, y)` |
| `vla.sub(a, b)` | Subtraction | `vla.sub(x, y)` |
| `vla.mul(a, b)` | Multiplication | `vla.mul(x, y)` |
| `vla.div(a, b)` | Division | `vla.div(x, y)` |
| `vla.neg(x)` | Negation | `vla.neg(x)` |
| `vla.abs(x)` | Absolute value | `vla.abs(x)` |
| `vla.pow(x, n)` | Power | `vla.pow(x, 2)` |
| `vla.clamp(x, min, max)` | Clamp values | `vla.clamp(x, 0, 1)` |
| `vla.fmod(x, y)` | Float modulo | `vla.fmod(x, y)` |

### Transcendental Functions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.exp(x)` | Exponential | `vla.exp(x)` |
| `vla.log(x)` | Natural log | `vla.log(x)` |
| `vla.sqrt(x)` | Square root | `vla.sqrt(x)` |
| `vla.rsqrt(x)` | Reciprocal sqrt | `vla.rsqrt(x)` |

### Trigonometric Functions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.sin(x)` | Sine | `vla.sin(x)` |
| `vla.cos(x)` | Cosine | `vla.cos(x)` |
| `vla.tan(x)` | Tangent | `vla.tan(x)` |
| `vla.asin(x)` | Inverse sine | `vla.asin(x)` |
| `vla.acos(x)` | Inverse cosine | `vla.acos(x)` |
| `vla.atan(x)` | Inverse tangent | `vla.atan(x)` |
| `vla.atan2(y, x)` | Two-arg atan | `vla.atan2(y, x)` |

### Hyperbolic Functions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.sinh(x)` | Hyperbolic sine | `vla.sinh(x)` |
| `vla.cosh(x)` | Hyperbolic cosine | `vla.cosh(x)` |
| `vla.tanh(x)` | Hyperbolic tangent | `vla.tanh(x)` |

### Rounding Functions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.floor(x)` | Floor | `vla.floor(x)` |
| `vla.ceil(x)` | Ceiling | `vla.ceil(x)` |
| `vla.round(x)` | Round | `vla.round(x)` |
| `vla.trunc(x)` | Truncate | `vla.trunc(x)` |

### Comparison Functions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.sign(x)` | Sign function | `vla.sign(x)` |
| `vla.eq(x, y)` | Equal | `vla.eq(x, y)` |
| `vla.ne(x, y)` | Not equal | `vla.ne(x, y)` |
| `vla.lt(x, y)` | Less than | `vla.lt(x, y)` |
| `vla.le(x, y)` | Less or equal | `vla.le(x, y)` |
| `vla.gt(x, y)` | Greater than | `vla.gt(x, y)` |
| `vla.ge(x, y)` | Greater or equal | `vla.ge(x, y)` |
| `vla.where(c, x, y)` | Conditional | `vla.where(cond, x, y)` |

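A small masking sketch with the comparison helpers above, assuming they return boolean masks the way their torch counterparts do.

```python
import torch
from simgen import vla

x = torch.randn(10_000, device='cuda')
y = torch.randn(10_000, device='cuda')

mask = vla.gt(x, y)                        # elementwise x > y
best = vla.where(mask, x, y)               # pick the larger value per element

# Clip negative picks to zero, then sum exactly
zeros = torch.zeros_like(best)
clipped = vla.where(vla.lt(best, zeros), zeros, best)
print(vla.sum(clipped))
```
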
### Activation Functions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.relu(x)` | ReLU | `vla.relu(x)` |
| `vla.sigmoid(x)` | Sigmoid | `vla.sigmoid(x)` |
| `vla.leaky_relu(x, slope)` | Leaky ReLU | `vla.leaky_relu(x, 0.01)` |

### Signal Processing

| Function | Description | Example |
|----------|-------------|---------|
| `vla.fft(x)` | 1D FFT | `vla.fft(signal)` |
| `vla.ifft(x)` | 1D Inverse FFT | `vla.ifft(spectrum)` |
| `vla.rfft(x)` | Real FFT | `vla.rfft(signal)` |
| `vla.irfft(x)` | Inverse Real FFT | `vla.irfft(spectrum)` |
| `vla.conv2d(x, w)` | 2D Convolution | `vla.conv2d(image, kernel)` |

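A round-trip sketch with the signal helpers, assuming `vla.rfft`/`vla.irfft` follow the `torch.fft` conventions (real input in, complex half-spectrum out); treat that as an assumption rather than documented behaviour.

```python
import torch
from simgen import vla

signal = torch.randn(4096, device='cuda')

spectrum = vla.rfft(signal)                # real FFT of a real signal
recovered = vla.irfft(spectrum)            # back to the time domain

# Round-trip error should sit at the library's precision floor
print((recovered - signal).abs().max())
```
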
### Linear Algebra

| Function | Description | Example |
|----------|-------------|---------|
| `vla.trace(A)` | Matrix trace | `vla.trace(matrix)` |
| `vla.det(A)` | Determinant | `vla.det(matrix)` |
| `vla.inv(A)` | Matrix inverse | `vla.inv(matrix)` |
| `vla.solve(A, B)` | Solve Ax=B | `vla.solve(A, b)` |

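And a sketch of the dense linear-algebra helpers, assuming `vla.solve(A, B)` solves `A @ X = B` as the table suggests:

```python
import torch
from simgen import vla

A = torch.randn(256, 256, device='cuda')
b = torch.randn(256, 1,   device='cuda')

x = vla.solve(A, b)                        # solve A @ x = b
residual = vla.sub(vla.matmul(A, x), b)    # should be ~zero at the precision floor
print(vla.norm(residual))

print(vla.trace(A))                        # exact trace
```
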
### Loss Functions

| Function | Description | Example |
|----------|-------------|---------|
| `vla.mse_loss(pred, target)` | MSE Loss | `vla.mse_loss(pred, y)` |

### Input Precision Utilities

| Function | Description | Example |
|----------|-------------|---------|
| `vla.is_exact(x)` | Check if float is binary-exact | `vla.is_exact(0.125)` → `True` |
| `vla.to_exact(x)` | Snap to nearest binary-exact | `vla.to_exact(0.001)` → `0.0009765625` |
| `vla.frac(n, d)` | Create exact fraction tensor | `vla.frac(1, 1024)` |
| `vla.dyadic(x)` | Find closest p/2^q rational | `vla.dyadic(0.001)` |
| `VLADecimal.to_decimal()` | Convert to exact Python Decimal | `x.to_decimal()` |

---

## Precision Chaining with `return_vla`

For maximum precision in chained operations, use `return_vla=True`:

```python
# Standard: each call collapses to a regular tensor, so precision is lost at every step
r = vla.sqrt(vla.add(vla.mul(x, x), vla.mul(y, y)))

# Chained: full precision preserved through entire computation
x2 = vla.mul(x, x, return_vla=True)
y2 = vla.mul(y, y, return_vla=True)
r2 = vla.add(x2, y2, return_vla=True)
r = vla.sqrt(r2)  # Final collapse to tensor
```

This is critical for:
- Orbital mechanics simulations
- Long-running numerical integrations
- Financial calculations
- Any computation with many sequential operations

---

## Exact Decimal Values

**Preferred:** Use VLADecimal for GPU-native operations with exact conversion:

```python
import torch
from simgen import vla

x = vla.Decimal(torch.randn(1000, device='cuda'))
result = x.sum()

# Get exact Python Decimal
exact_value = result.to_decimal()
print(exact_value)  # Full precision Decimal
```

**Legacy `_exact` functions** (still supported):

| Function | Description |
|----------|-------------|
| `vla.sum_exact(x)` | Returns Decimal sum |
| `vla.dot_exact(a, b)` | Returns Decimal dot product |
| `vla.mean_exact(x)` | Returns Decimal mean |

---

## Reproducibility & Verification

### Checksums

```python
# Compute deterministic checksum
result = vla.matmul(A, B)
cs = vla.checksum(result)  # "6ece6956f187064f"

# Full 64-char SHA256
full_cs = vla.checksum_hex(result)
```

### Verification

```python
# Verify result matches expected checksum
vla.verify(result, "6ece6956f187064f")  # Raises ValueError if mismatch

# Non-raising version
is_valid = vla.verify(result, "6ece6956f187064f", raise_on_mismatch=False)
```

---

## Examples

### Example 1: The Kahan Sum Test

Standard floating-point fails this classic test:

```python
import torch
from simgen import vla

# 1e20 + 10000 ones - 1e20 = should be 10000
data = torch.tensor([1e20] + [1.0]*10000 + [-1e20], device='cuda')

print(f"FP32: {data.sum().item()}")           # 0.0 (WRONG!)
print(f"FP64: {data.double().sum().item()}")  # 0.0 (WRONG!)
print(f"VLA:  {vla.sum(data).item()}")        # 10000.0 (CORRECT!)
```

### Example 2: Cross-GPU Verification

```python
import torch
from simgen import vla

torch.manual_seed(42)
A = torch.randn(1024, 1024, device='cuda')
B = torch.randn(1024, 1024, device='cuda')

result = vla.matmul(A, B)
checksum = vla.checksum(result)

print(f"Checksum: {checksum}")
# This EXACT checksum will be produced on ANY NVIDIA GPU:
# RTX 4070, Tesla T4, A100, H100, etc.
```

### Example 3: Orbital Mechanics

```python
import torch
from simgen import vla

# Satellite orbital parameters
r0, v0 = 6779.0, 7.66  # km, km/s (ISS orbital radius and speed)
GM = 398600.4418       # km^3/s^2

x = torch.tensor([r0], device='cuda')
y = torch.tensor([0.0], device='cuda')
vx = torch.tensor([0.0], device='cuda')
vy = torch.tensor([v0], device='cuda')
dt = torch.tensor([1.0], device='cuda')

# Propagate orbit with chained precision
for _ in range(55000):  # ~10 orbits
    # Compute radius with full precision chain
    x2 = vla.mul(x, x, return_vla=True)
    y2 = vla.mul(y, y, return_vla=True)
    r2 = vla.add(x2, y2, return_vla=True)
    r = vla.sqrt(r2)

    # Gravitational acceleration
    r3 = vla.mul(vla.mul(r, r, return_vla=True), r)
    ax = vla.div(vla.mul(torch.tensor([-GM], device='cuda'), x), r3)
    ay = vla.div(vla.mul(torch.tensor([-GM], device='cuda'), y), r3)

    # Update position and velocity (explicit Euler step)
    x = vla.add(x, vla.mul(vx, dt))
    y = vla.add(y, vla.mul(vy, dt))
    vx = vla.add(vx, vla.mul(ax, dt))
    vy = vla.add(vy, vla.mul(ay, dt))

final_r = vla.sqrt(vla.add(vla.mul(x, x), vla.mul(y, y)))
print(f"Final orbital radius: {final_r.item():.4f} km")
# With VLA: minimal drift. With FP64: kilometers of error.
```

### Example 4: Financial Calculations

```python
import torch
from simgen import vla

# Portfolio values (mixed magnitudes)
positions = torch.tensor([
    1_000_000_000.00,  # $1B position
    0.01,              # 1 cent
    -999_999_999.99,   # Large short
    50_000.50,         # Medium position
], device='cuda')

# Standard sum loses the penny
print(f"FP32 sum: ${positions.sum().item():,.2f}")

# VLA preserves every cent
print(f"VLA sum:  ${vla.sum(positions).item():,.2f}")

# For audit trails, use exact
exact_total = vla.sum_exact(positions)
print(f"Exact:    ${exact_total}")  # Decimal with full precision
```

### Example 5: Neural Network with VLA

```python
import torch
import torch.nn as nn
from simgen import vla

# Enable VLA globally
vla.enable()

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
).cuda()

optimizer = torch.optim.Adam(model.parameters())

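# `dataloader` is assumed to be an existing torch.utils.data.DataLoader
# yielding (input, target) batches; it is not defined in this snippet.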
for epoch in range(10):
    for x, y in dataloader:
        pred = model(x.cuda())
        loss = vla.mse_loss(pred, y.cuda())  # Exact loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

vla.disable()
```

---

## System Information

```python
from simgen import vla
vla.info()
```

Output:
```
============================================================
VLA - VIGIL Lossless Arithmetic for Scientific Computing
============================================================
Version: 3.5.3
Backend: native_cubin
Device: NVIDIA GeForce RTX 4070
Architecture: sm_89

Features:
  - ModularRational: TRUE ZERO exact arithmetic
  - VLADecimal: GPU-native extended precision (82 methods)
  - Core: sum, mean, var, std, norm, dot, matmul
  - Trig: sin, cos, tan, asin, acos, atan, atan2
  - Hyper: sinh, cosh, tanh
  - Signal: fft, ifft, rfft, irfft, conv2d
  - LinAlg: trace, det, inv, solve
  - Compare: eq, ne, lt, le, gt, ge, where
  - Verify: checksum, verify (cross-GPU reproducibility)

Precision: Machine epsilon squared (10^-32 vs 10^-16)
============================================================
```

---

## Supported GPU Architectures

| Architecture | GPUs | Compute Capability |
|-------------|------|-------------------|
| Pascal | GTX 1080, P100 | sm_60, sm_61 |
| Volta | V100 | sm_70 |
| Turing | RTX 2080, T4 | sm_75 |
| Ampere | RTX 3090, A100 | sm_80, sm_86 |
| Ada Lovelace | RTX 4090, 4080, 4070 | sm_89 |
| Hopper | H100 | sm_90 |

---

## Performance

VLA achieves exact results while maintaining GPU performance:

| Operation | Matrix Size | CPU Decimal | VLA GPU | Speedup |
|-----------|-------------|-------------|---------|---------|
| matmul | 1024x1024 | 37 min | 0.2s | **12,922x** |
| matmul | 4096x4096 | 1.6 days | 10s | **13,934x** |
| matmul | 10240x10240 | 25.5 days | 2.7 min | **13,848x** |
| matmul | 20480x20480 | 204 days | ~22 min | **~13,000x** |

---

## How It Works

VLA uses proprietary precision-preserving arithmetic that:

1. **Captures all rounding errors** during computation
2. **Maintains full precision** through operation chains
3. **Produces deterministic results** regardless of thread ordering
4. **Runs on native CUDA kernels** - no Python overhead

The result is mathematically exact to the precision of the input, with cross-GPU reproducibility guaranteed by deterministic algorithms.

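The exact kernels are proprietary, so what follows is not SimGen's algorithm; it is only a plain-Python sketch of point 1 above, the idea of capturing rounding error with an error-free transformation (Knuth's two-sum) instead of discarding it.

```python
def two_sum(a: float, b: float):
    """Knuth's error-free transformation: s + err == a + b exactly."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

def naive_sum(values):
    total = 0.0
    for v in values:
        total += v               # each += silently discards its rounding error
    return total

def compensated_sum(values):
    total, correction = 0.0, 0.0
    for v in values:
        total, err = two_sum(total, v)
        correction += err        # keep the error instead of throwing it away
    return total + correction

data = [1e20] + [1.0] * 10_000 + [-1e20]
print(naive_sum(data))        # 0.0     - the 10,000 ones are lost
print(compensated_sum(data))  # 10000.0 - the captured errors restore the exact sum
```
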
---

## Understanding VLA's Guarantee

### What VLA Guarantees: ZERO ACCUMULATION Error

VLA eliminates **accumulation error** - the error that compounds as arithmetic operations are chained. Every `+`, `-`, `*`, `/` is mathematically exact. This means:

- **Order independence**: `(a + b) + c = a + (b + c)` always (not guaranteed by IEEE 754; see the sketch below)
- **Cross-GPU reproducibility**: Same computation = identical result on any GPU
- **No error growth**: Million-step simulations don't accumulate drift

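The order-independence point is easy to verify for ordinary IEEE 754 doubles, where grouping really does change the answer; a tiny stdlib-only check:

```python
a, b, c = 0.1, 0.2, 0.3

left  = (a + b) + c
right = a + (b + c)

print(left == right)   # False under IEEE 754 doubles
print(left, right)     # 0.6000000000000001 vs 0.6
```

VLA's claim is that the same experiment on its exact types compares equal no matter how the additions are grouped.
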
### What VLA Cannot Fix: INPUT REPRESENTATION Error

VLA cannot fix errors that exist **before** it sees your data. When you write `0.001` in Python, it's already corrupted:

```python
# 0.001 is NOT exactly representable in binary
# Python stores it as 0.001000000000000000020816681711721685228...
x = torch.tensor([0.001], device='cuda')
# The error already exists BEFORE VLA sees this tensor
```

### Demonstration: Binary-Exact vs Non-Binary-Exact

```python
import torch
from simgen import vla

# TEST 1: Non-binary-exact input (0.001)
# 0.001 requires infinite bits in binary - stored as approximation
increment = 0.001
expected = 100.0  # 0.001 * 100,000 iterations

x_vla = torch.tensor([0.0], device='cuda')
x_fp64 = torch.tensor([0.0], device='cuda', dtype=torch.float64)

for _ in range(100_000):
    x_vla = vla.add(x_vla, torch.tensor([increment], device='cuda'))
    x_fp64 += increment

# Both show ~same error because INPUT was corrupted
# VLA: 100.00000000133288 (error from input representation)
# FP64: 100.00000000133288 (same error)

# TEST 2: Binary-exact input (0.125 = 1/8 = 2^-3)
# This is EXACTLY representable in binary!
increment = 0.125
expected = 12500.0  # 0.125 * 100,000 iterations

x_vla = torch.tensor([0.0], device='cuda')
x_fp32 = torch.tensor([0.0], device='cuda', dtype=torch.float32)

for _ in range(100_000):
    x_vla = vla.add(x_vla, torch.tensor([increment], device='cuda'))
    x_fp32 += increment

# VLA: 12500.0 EXACTLY (TRUE ZERO error)
# FP32: 12499.9990234375 (accumulated rounding errors)
```

### Binary-Exact Values (TRUE Zero Error)

Use these values for demonstrations or when you need guaranteed zero error:

| Value | Binary Representation | Exact? |
|-------|----------------------|--------|
| 0.5 | 2^-1 | Yes |
| 0.25 | 2^-2 | Yes |
| 0.125 | 2^-3 | Yes |
| 0.0625 | 2^-4 | Yes |
| 0.001 | Infinite binary expansion | NO |
| 0.1 | Infinite binary expansion | NO |
| 0.3 | Infinite binary expansion | NO |

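You can check the table above for any literal with the standard library's `fractions` module, which exposes the exact rational value a Python float actually stores; this uses only the stdlib, no SimGen API.

```python
from fractions import Fraction

# Binary-exact literal: the stored value is exactly 1/8
print(Fraction(0.125))   # 1/8

# Non-exact literals: the stored value is the nearest double, not 1/10 or 1/1000
print(Fraction(0.1))     # 3602879701896397/36028797018963968
print(Fraction(0.001))   # 1152921504606847/1152921504606846976
```

The stored value of `0.001` here is the same approximation quoted earlier in this section.
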
### Input Precision Utilities

VLA provides utilities to help you work with exact inputs:

```python
import torch
from simgen import vla

# Check if a value is binary-exact
vla.is_exact(0.125)  # True
vla.is_exact(0.001)  # False

# Create exact fractions (power-of-2 denominators are exact)
dt = vla.frac(1, 1024)  # Exact 0.0009765625 instead of 0.001
step = vla.frac(1, 8)   # Exact 0.125

# Find closest dyadic rational (p/2^q)
num, denom, exact_val, error = vla.dyadic(0.001)
# (1, 1024, 0.0009765625, 0.0000234375)

# Snap to nearest binary-exact value
vla.to_exact(0.001)  # 0.0009765625

# See exact stored value with VLADecimal
x = vla.Decimal(torch.tensor([0.001], device='cuda'))
print(x.to_decimal())  # Shows exact stored representation
```

### The Key Insight

```
IEEE 754 with 0.001:  Input error + Accumulation error = Large error
VLA with 0.001:       Input error + ZERO              = Input error only
VLA with 0.125:       ZERO        + ZERO              = TRUE ZERO
```

**VLA guarantees your arithmetic is perfect.** If you want perfect results, also ensure your inputs are perfectly representable - or use `vla.frac()` to create exact fractions.

---

## Version History

- **v3.5.3** - ModularTensor: TRUE ZERO exact arithmetic on GPU (CUDA accelerated, 444M ops/sec)
- **v3.5.2** - ModularRational: TRUE ZERO error for all operations (vectors, matrices, chaotic systems)
- **v3.5.0** - VLADecimal: GPU-native extended precision type with 82 methods
- **v3.4.6** - Fixed VLAResult device handling for chained operations
- **v3.4.5** - Universal `return_vla` support for all 70+ functions
- **v3.4.0** - Native CUDA cubins for 8 GPU architectures
- **v3.0.0** - Complete rewrite with VLAResult precision container
- **v2.0.0** - Initial public release

---

## License

Proprietary. All rights reserved.
(c) 2025-2026 Clouthier Simulation Labs

- **Website:** https://simgen.dev
- **Contact:** kyle@simgen.dev
