Metadata-Version: 2.4
Name: simgen-vla
Version: 3.2.3
Summary: VLA: Zero-Error GPU Arithmetic for Scientific Computing. Exact results, every calculation.
Home-page: https://simgen.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <kyle@simgen.dev>
License: Proprietary
Project-URL: Homepage, https://simgen.dev
Project-URL: Documentation, https://simgen.dev/docs
Project-URL: Repository, https://github.com/clouthier-simulation-labs/simgen
Keywords: exact-arithmetic,GPU,precision,lossless,scientific-computing,machine-learning,deep-learning,simulation,finance,HPC,cuda,pytorch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Physics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Provides-Extra: cuda12
Requires-Dist: cupy-cuda12x>=12.0; extra == "cuda12"
Provides-Extra: cuda11
Requires-Dist: cupy-cuda11x>=11.0; extra == "cuda11"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: mpmath; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# SimGen VLA

**Exact GPU Arithmetic for PyTorch**

Drop-in replacement for `torch.sum`, `torch.matmul`, and 50+ operations. Zero floating-point error. Deterministic results across GPUs.

## Installation

```bash
pip install simgen-vla
```

**Requirements:** Python 3.10+, PyTorch 2.0+, CUDA GPU (Turing/Ampere/Ada/Hopper)

## The Problem

Floating-point arithmetic accumulates error:

```python
import torch

x = torch.tensor([1e16, 1.0, -1e16], device='cuda', dtype=torch.float64)
print(torch.sum(x))  # 0.0 (WRONG - lost the 1.0)
```
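
The 1.0 vanishes because it is smaller than the gap between adjacent float64 values at magnitude 10^16, so the intermediate `1e16 + 1.0` rounds straight back to `1e16`. You can check the spacing directly:

```python
import math

# Gap between adjacent float64 values near 1e16:
print(math.ulp(1e16))  # 2.0 -- an added 1.0 cannot change the stored value
```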

## The Solution

```python
from simgen import vla

x = torch.tensor([1e16, 1.0, -1e16], device='cuda', dtype=torch.float64)
print(vla.sum(x))  # 1.0 (EXACT)
```

## API

```python
from simgen import vla

# Core operations - exact results
result = vla.sum(x)           # Zero accumulation error
result = vla.matmul(A, B)     # Exact matrix multiply
result = vla.dot(a, b)        # Exact dot product

# Numerically stable
result = vla.softmax(logits)
result = vla.layer_norm(x, weight, bias)
result = vla.cross_entropy(logits, targets)

# Enable globally (patches torch ops)
vla.enable()
torch.sum(x)      # Now uses VLA
torch.matmul(A,B) # Now uses VLA
vla.disable()

# System info
vla.info()
```
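
If you patch globally, a `try`/`finally` guarantees the stock ops are restored even when the patched code raises. This is a standard Python pattern, shown here with the documented `enable`/`disable` calls:

```python
import torch
from simgen import vla

x = torch.randn(1_000_000, device='cuda', dtype=torch.float64)

vla.enable()
try:
    total = torch.sum(x)  # routed through VLA while enabled
finally:
    vla.disable()  # always restore the stock torch ops
```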

## Use Cases

### Scientific Computing
ODE solvers, N-body simulations, physics engines - anywhere error compounds over millions of timesteps.

```python
# Energy-conserving integration
for step in range(1_000_000):
    energy = vla.sum(kinetic + potential)  # No drift
```

### Financial Calculations
Exact arithmetic for transactions, risk calculations, regulatory compliance.

```python
# 1M transactions of $0.0001: summed with zero accumulation error
total = vla.sum(transactions)
```
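
The drift itself is easy to reproduce with stock PyTorch; float32 exaggerates the effect (the exact magnitude varies by dtype, device, and reduction order):

```python
import torch

transactions = torch.full((1_000_000,), 0.0001, dtype=torch.float32)
print(transactions.sum())  # typically not exactly 100.0
```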

### Reproducible Simulations
Same arithmetic result on any GPU. Order-independent sums. Deterministic reductions.

```python
# N-body simulation - same trajectory every run
for step in range(1_000_000):
    forces = vla.sum(pairwise_forces)  # Deterministic
    positions = positions + velocities * dt  # elementwise update; the reduction above is the error-prone step
```
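
Order independence is straightforward to verify with the documented `vla.sum` (a quick sanity check, not a formal test):

```python
import torch
from simgen import vla

x = torch.randn(1_000_000, device='cuda', dtype=torch.float64)
shuffled = x[torch.randperm(len(x), device='cuda')]
assert vla.sum(x) == vla.sum(shuffled)  # same bits regardless of order
```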

### Debugging Numerical Issues
When your training diverges or your simulation explodes, VLA lets you rule out floating-point error as the cause.

```python
# Is it a bug or floating-point?
with vla.mode():
    result = suspicious_computation()  # if it still misbehaves, the bug isn't floating-point
```

## Operations (55 Kernels)

| Category | Functions |
|----------|-----------|
| **Reductions** | `sum`, `mean`, `var`, `std`, `norm`, `dot`, `prod`, `cumsum`, `logsumexp`, `min`, `max` |
| **Matrix** | `matmul`, `mm`, `bmm`, `linear` |
| **Activations** | `softmax`, `log_softmax`, `relu`, `gelu`, `silu`, `sigmoid`, `tanh` |
| **Normalization** | `layer_norm`, `rms_norm`, `batch_norm`, `group_norm` |
| **Loss** | `cross_entropy`, `mse_loss` |
| **Math** | `exp`, `log`, `sqrt`, `rsqrt`, `pow`, `abs`, `clamp` |
| **Advanced** | `scaled_dot_product_attention`, `conv2d`, `embedding` |

## Precision Comparison

| Operation | Standard FP64 | VLA |
|-----------|--------------|-----|
| Sum (1M terms) | ~10^-10 relative error | **0** |
| MatMul (1024x1024) | ~10^-7 relative error | **< 10^-15** |
| Order sensitivity | Results vary | **Deterministic** |

## Supported GPUs

| Architecture | GPUs |
|--------------|------|
| Turing (sm_75) | T4, RTX 2080 |
| Ampere (sm_80/86) | A100, RTX 3090 |
| Ada (sm_89) | RTX 4070/4080/4090 |
| Hopper (sm_90) | H100 |

## Performance

VLA typically runs at ~1.2-1.5x the cost of the standard operation in exchange for exact results:

| Operation | Standard | VLA |
|-----------|----------|-----|
| Sum (1M) | 0.12ms | 0.15ms |
| MatMul (1024x1024) | 0.8ms | 1.1ms |
| Softmax | 0.05ms | 0.06ms |
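
Timings depend on GPU, dtype, and shape; CUDA events give a simple way to reproduce the numbers on your own hardware (a sketch; the shape is illustrative):

```python
import torch
from simgen import vla

x = torch.randn(1_000_000, device='cuda', dtype=torch.float64)
vla.sum(x)  # warm-up: exclude one-time launch/compile costs

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
vla.sum(x)
end.record()
torch.cuda.synchronize()  # wait for the kernel before reading the timer
print(f"vla.sum (1M): {start.elapsed_time(end):.3f} ms")
```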

## How It Works

VLA uses proprietary precision-preserving arithmetic to capture every rounding error incurred during computation, so the result is the mathematically exact value of the operation applied to the floating-point inputs.

- Multi-level error tracking captures every lost bit
- Precompiled CUDA kernels for each GPU architecture
- No Python overhead - pure CUDA execution
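
VLA's kernels are proprietary, but "capturing rounding error" is the same idea behind classic compensated summation. Here is a minimal CPU sketch of Neumaier's variant of Kahan summation, shown purely to illustrate error tracking (this is not VLA's implementation):

```python
def neumaier_sum(values):
    """Compensated summation: track the low-order bits each addition loses."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for v in values:
        t = total + v
        if abs(total) >= abs(v):
            comp += (total - t) + v  # low-order bits of v were rounded away
        else:
            comp += (v - t) + total  # low-order bits of total were rounded away
        total = t
    return total + comp

print(neumaier_sum([1e16, 1.0, -1e16]))  # 1.0, where the naive sum gives 0.0
```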

## Version History

- **v2.9.0** - Focused API: exact arithmetic primitives only
- **v2.8.0** - 55 kernels, Windows support, stress-tested
- **v2.0.0** - Native CUDA backend, VLAResult container

## License

Proprietary. All rights reserved.
(c) 2025-2026 Clouthier Simulation Labs

**Website:** https://simgen.dev
**Contact:** kyle@simgen.dev
