Metadata-Version: 2.4
Name: slotllm
Version: 0.1.0
Summary: A rate-limit-aware concurrency layer for batch LLM workloads, with optional database-backed distributed coordination.
Project-URL: Repository, https://github.com/datamachineworks/slotllm
Author: Data Machine Works Ltd
License-Expression: MIT
License-File: LICENSE
Keywords: batch,concurrency,llm,rate-limit,slots
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: pydantic>=2.0
Requires-Dist: tokencost
Provides-Extra: all
Requires-Dist: aiosqlite; extra == 'all'
Requires-Dist: instructor; extra == 'all'
Requires-Dist: litellm; extra == 'all'
Requires-Dist: mypy; extra == 'all'
Requires-Dist: psycopg-pool; extra == 'all'
Requires-Dist: psycopg[binary]>=3.1; extra == 'all'
Requires-Dist: pytest; extra == 'all'
Requires-Dist: pytest-asyncio; extra == 'all'
Requires-Dist: pytest-cov; extra == 'all'
Requires-Dist: ruff; extra == 'all'
Requires-Dist: time-machine; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Requires-Dist: time-machine; extra == 'dev'
Provides-Extra: instructor
Requires-Dist: instructor; extra == 'instructor'
Provides-Extra: litellm
Requires-Dist: litellm; extra == 'litellm'
Provides-Extra: postgres
Requires-Dist: psycopg-pool; extra == 'postgres'
Requires-Dist: psycopg[binary]>=3.1; extra == 'postgres'
Provides-Extra: sqlite
Requires-Dist: aiosqlite; extra == 'sqlite'
Description-Content-Type: text/markdown

# slotllm

**Rate-limit-aware concurrency for batch LLM workloads.**

[![PyPI](https://img.shields.io/pypi/v/slotllm)](https://pypi.org/project/slotllm/)
[![Python](https://img.shields.io/pypi/pyversions/slotllm)](https://pypi.org/project/slotllm/)
[![CI](https://github.com/datamachineworks/slotllm/actions/workflows/ci.yml/badge.svg)](https://github.com/datamachineworks/slotllm/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

## The Problem

LLM APIs enforce rate limits (requests-per-minute, tokens-per-day, etc.) that are easy to hit when processing hundreds or thousands of prompts. Most batching code either ignores these limits (and eats 429 errors), or serializes everything (and wastes time). When you add multiple models, cost tracking, and multi-worker coordination, the boilerplate explodes.

**slotllm** handles all of this. You declare your rate limits, hand over a list of prompts, and it processes them as fast as your budgets allow — no retries, no 429s, no guesswork.

## Quick Start

```python
import asyncio
from slotllm import BatchRunner, RateLimitConfig
from slotllm.adapters.litellm import LiteLLMCaller
from slotllm.backends.memory import MemoryBackend

async def main():
    caller = LiteLLMCaller()
    backend = MemoryBackend()
    configs = [RateLimitConfig(model_id="gpt-4o-mini", rpm=50, rpd=5000)]

    async with BatchRunner(caller, backend, configs) as runner:
        results = await runner.run_simple(
            ["Summarize quantum computing in one sentence.",
             "What is the capital of France?",
             "Explain recursion to a 5-year-old."],
            model_id="gpt-4o-mini",
        )

    for r in results:
        print(r.response.content)

asyncio.run(main())
```

## Installation

```bash
# Core (in-memory backend, bring your own caller)
pip install slotllm

# With LiteLLM adapter (100+ LLM providers)
pip install "slotllm[litellm]"

# With SQLite backend (multi-process coordination)
pip install "slotllm[sqlite]"

# With PostgreSQL backend (distributed coordination)
pip install "slotllm[postgres]"

# Everything
pip install "slotllm[all]"
```

## Backends

slotllm uses a **slot backend** to track rate-limit budgets. Choose the one that fits your deployment:

| Backend | Coordination | Persistence | Best for |
|---------|-------------|-------------|----------|
| `MemoryBackend` | Single process | None | Scripts, notebooks, dev |
| `SQLiteBackend` | Multiple processes (same machine) | Disk | CLI tools, local workers |
| `PostgresBackend` | Multiple machines | Database | Production, distributed |

All backends implement the same `SlotBackend` interface, so switching is a one-line change.

```python
# Memory (default — zero config)
from slotllm.backends.memory import MemoryBackend
backend = MemoryBackend()

# SQLite
from slotllm.backends.sqlite import SQLiteBackend
backend = SQLiteBackend(db_path="slots.db")

# PostgreSQL
from slotllm.backends.postgres import PostgresBackend
backend = PostgresBackend(dsn="postgresql://user:pass@localhost/mydb")
```

## Bring Your Own Caller

slotllm doesn't lock you into a specific LLM client. Any object with a `call` method matching this signature works:

```python
from slotllm.caller import Response

class MyCaller:
    async def call(self, model_id: str, messages: list[dict], **kwargs) -> Response:
        # Placeholder: replace `my_llm_client` with your own LLM client
        result = await my_llm_client.chat(model=model_id, messages=messages)
        return Response(
            content=result.text,
            input_tokens=result.usage.input,
            output_tokens=result.usage.output,
            model_id=model_id,
        )
```

That's it — no base class, no registration. Pass your caller to `BatchRunner` and go.
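Because the contract is just a method signature, a stub caller for tests takes only a few lines. A sketch (`Response` is redefined locally as a plain dataclass so the snippet runs standalone; in real code you would import it from `slotllm.caller`):

```python
import asyncio
from dataclasses import dataclass

# Local stand-in for slotllm.caller.Response, so this snippet
# runs without the library installed.
@dataclass
class Response:
    content: str
    input_tokens: int
    output_tokens: int
    model_id: str

class StubCaller:
    """Returns canned responses without touching the network."""

    async def call(self, model_id: str, messages: list[dict], **kwargs) -> Response:
        last = messages[-1]["content"]
        return Response(
            content=f"echo: {last}",
            input_tokens=len(last.split()),
            output_tokens=2,
            model_id=model_id,
        )

resp = asyncio.run(
    StubCaller().call("gpt-4o-mini", [{"role": "user", "content": "hello world"}])
)
print(resp.content)  # echo: hello world
```

This is useful for exercising batching, routing, and cost-tracking logic in unit tests without spending API credits.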

## Structured Outputs

When using the [LiteLLM](https://github.com/BerriAI/litellm) adapter with [Instructor](https://github.com/jxnl/instructor), pass a `response_model` to get validated Pydantic objects back:

```python
from pydantic import BaseModel
from slotllm.adapters.litellm import LiteLLMCaller

class City(BaseModel):
    name: str
    country: str
    population: int

caller = LiteLLMCaller()

# Assumes a `runner` set up as in Quick Start; response_model can also
# be passed per-item via RequestItem kwargs:
results = await runner.run_simple(
    ["Tell me about Tokyo.", "Tell me about Paris."],
    model_id="gpt-4o-mini",
    response_model=City,  # Each result.response.content is a City instance
)
```

Requires `pip install "slotllm[instructor]"`.

## Cost Tracking

Track spend across models with `CostTracker`:

```python
from decimal import Decimal
from slotllm import BatchRunner, CostTracker, RateLimitConfig

tracker = CostTracker()
# Optional: register custom prices (overrides tokencost lookups)
tracker.register_price(
    "gpt-4o-mini",
    input_price_per_token=Decimal("0.00000015"),
    output_price_per_token=Decimal("0.0000006"),
)

runner = BatchRunner(caller, backend, configs, cost_tracker=tracker)
results = await runner.run(items)

print(tracker.summary())
# {
#     "total_cost_usd": Decimal("0.0042"),
#     "by_model": {"gpt-4o-mini": Decimal("0.0042")},
#     "total_requests": 100,
#     "total_input_tokens": 15000,
#     "total_output_tokens": 5000,
# }
```

Each `BatchResult` also exposes its own per-request cost as `cost_usd`.
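The arithmetic behind a per-request cost is simple; the reason for `Decimal` is that float rounding drift compounds over thousands of requests. A stdlib-only sketch using the same illustrative prices as the `register_price` example above:

```python
from decimal import Decimal

# Illustrative per-token prices (same values as the register_price example).
IN_PRICE = Decimal("0.00000015")
OUT_PRICE = Decimal("0.0000006")

def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Exact USD cost for one request, with no float rounding."""
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# 15,000 input tokens + 5,000 output tokens:
print(request_cost(15_000, 5_000))  # 0.00525000
```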

## Multi-Model Routing

Define multiple models with different priorities. Items without an explicit `model_id` are routed to the model with the lowest `priority` value (typically the cheapest) first. When that model's budget runs out, slotllm automatically falls back to the next model in priority order:

```python
from slotllm import BatchRunner, RateLimitConfig, RequestItem

configs = [
    RateLimitConfig(model_id="gpt-4o-mini", rpm=100, rpd=10_000, priority=0),
    RateLimitConfig(model_id="gpt-4o", rpm=20, rpd=2_000, priority=1),
]

items = [
    RequestItem(messages=[{"role": "user", "content": prompt}])
    # No model_id → auto-routed to cheapest available
    for prompt in my_prompts
]

runner = BatchRunner(caller, backend, configs)
results = await runner.run(items)
```

## Architecture

```
slotllm/
├── runner.py              # BatchRunner — orchestrates everything
├── rate_limit.py          # RateLimitConfig, SlotBudget, compute_budget()
├── caller.py              # Caller protocol + Response model
├── cost.py                # CostTracker (Decimal precision, tokencost)
├── backends/
│   ├── base.py            # SlotBackend ABC + Usage model
│   ├── memory.py          # In-memory (single process)
│   ├── sqlite.py          # SQLite (multi-process, WAL mode)
│   └── postgres.py        # PostgreSQL (distributed)
└── adapters/
    └── litellm.py         # LiteLLMCaller + Instructor support
```

**Flow:** `BatchRunner.run()` → acquires slots from backend → calls LLM via caller → records usage → tracks cost → returns results in input order.
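One note on the last step of that flow: if the runner launches calls with something like `asyncio.gather`, in-order results come for free, because `gather` returns results in argument order regardless of completion order. A library-free illustration:

```python
import asyncio
import random

async def fake_call(prompt: str) -> str:
    # Finishes after a random delay, so completion order is arbitrary.
    await asyncio.sleep(random.random() * 0.01)
    return f"result:{prompt}"

async def run_batch(prompts: list[str]) -> list[str]:
    # gather() returns results in argument order, not completion order.
    return await asyncio.gather(*(fake_call(p) for p in prompts))

print(asyncio.run(run_batch(["a", "b", "c"])))
# ['result:a', 'result:b', 'result:c']
```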

## Comparison with Alternatives

| | slotllm | LiteLLM Router | LiteLLM Proxy | Manual retries |
|---|---|---|---|---|
| **Rate-limit budgeting** | Proactive slot reservation | Retry after 429 | Retry after 429 | Retry after 429 |
| **Multi-process coordination** | SQLite / Postgres | No | Yes (server) | No |
| **Cost tracking** | Built-in (Decimal) | Limited | Dashboard | DIY |
| **Setup** | `pip install slotllm` | `pip install litellm` | Docker + config | None |
| **Bring your own caller** | Yes (Protocol) | No (litellm only) | No | Yes |
| **Batch-native** | Yes | No | No | No |

slotllm **complements** LiteLLM — the default adapter uses LiteLLM under the hood for provider support. slotllm adds the concurrency and budgeting layer on top.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, running tests, and contribution guidelines.

## License

[MIT](LICENSE)
