Metadata-Version: 2.4
Name: edgequake-litellm
Version: 0.1.4
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Dist: pytest>=8.0 ; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24 ; extra == 'dev'
Requires-Dist: maturin>=1.7 ; extra == 'dev'
Requires-Dist: mypy>=1.8 ; extra == 'dev'
Requires-Dist: ruff>=0.3 ; extra == 'dev'
Provides-Extra: dev
Summary: Drop-in LiteLLM replacement backed by Rust — same API, 10× lower latency
Keywords: llm,litellm,openai,anthropic,gemini,mistral,ai,rust
Home-Page: https://github.com/raphaelmansuy/edgequake-llm
Author: EdgeQuake Contributors
Maintainer: EdgeQuake Contributors
License: Apache-2.0
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Bug Tracker, https://github.com/raphaelmansuy/edgequake-llm/issues
Project-URL: Documentation, https://github.com/raphaelmansuy/edgequake-llm/blob/main/edgequake-litellm/README.md
Project-URL: Homepage, https://github.com/raphaelmansuy/edgequake-llm
Project-URL: Repository, https://github.com/raphaelmansuy/edgequake-llm

# edgequake-litellm

**Drop-in LiteLLM replacement backed by Rust — same API, lower overhead.**

[![PyPI](https://img.shields.io/pypi/v/edgequake-litellm)](https://pypi.org/project/edgequake-litellm/)
[![Python](https://img.shields.io/pypi/pyversions/edgequake-litellm)](https://pypi.org/project/edgequake-litellm/)
[![CI](https://github.com/raphaelmansuy/edgequake-llm/actions/workflows/python-ci.yml/badge.svg)](https://github.com/raphaelmansuy/edgequake-llm/actions/workflows/python-ci.yml)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue)](../LICENSE-APACHE)

`edgequake-litellm` wraps the [`edgequake-llm`](https://crates.io/crates/edgequake-llm) Rust core via [PyO3](https://pyo3.rs/), providing a high-performance drop-in for [LiteLLM](https://github.com/BerriAI/litellm). Swap the import — the rest of your code stays unchanged.

```python
# Before
import litellm

# After — same API, Rust-backed
import edgequake_litellm as litellm
```

## Features

- **LiteLLM-compatible API** — `completion()`, `acompletion()`, `stream()`, `embedding()`, same call signatures, same response shape (`resp.choices[0].message.content`).
- **Multi-provider routing** — OpenAI, Anthropic, Gemini, Mistral, OpenRouter, xAI, Ollama, LM Studio, HuggingFace, and more, via `provider/model` strings.
- **Async-native** — built on Tokio; sync and async Python both supported.
- **Single wheel per platform** — uses PyO3's `abi3-py39` stable ABI, so one `.whl` per platform covers Python 3.9–3.13+.
- **Zero Python runtime dependencies** — the Rust extension is self-contained.
- **Full type annotations** — ships with `py.typed` and `.pyi` stubs.
- **`max_completion_tokens` support** — works across OpenAI model families, including those that require this field (`o1`, `o3-mini`, `o4-mini`, `gpt-4.1`, `gpt-4.1-nano`).
- **Cache hit tokens** — `resp.cache_hit_tokens` exposes OpenAI prompt cache hits and Anthropic cache reads.
- **Reasoning tokens** — `resp.thinking_tokens` surfaces o-series reasoning and Claude extended thinking token counts.

## What's New in 0.1.1

- **`max_completion_tokens` fixed** for OpenAI o-series and gpt-4.1 model families (previously returned 400 Bad Request).
- **`resp.cache_hit_tokens`** — new property returning tokens served from provider cache (`None` if not applicable).
- **`resp.thinking_tokens`** — new property returning reasoning/thinking token count for o-series and Claude models.
- Both new properties are included in `resp.to_dict()`.

See [CHANGELOG.md](CHANGELOG.md) for the full history.

## Installation

```bash
pip install edgequake-litellm
```

## Quick Start

```python
import edgequake_litellm as litellm   # drop-in import alias

# ── Synchronous chat ────────────────────────────────────────────────────────
resp = litellm.completion(
    "openai/gpt-4o-mini",
    [{"role": "user", "content": "Hello, world!"}],
)
# litellm-compatible access
print(resp.choices[0].message.content)
# convenience shortcut
print(resp.content)

# ── Asynchronous chat ───────────────────────────────────────────────────────
import asyncio

async def main():
    resp = await litellm.acompletion(
        "anthropic/claude-3-5-haiku-20241022",
        [{"role": "user", "content": "Tell me a joke."}],
        max_tokens=128,
        temperature=0.8,
    )
    print(resp.choices[0].message.content)

asyncio.run(main())

# ── Streaming (async generator) ─────────────────────────────────────────────
async def stream_example():
    messages = [{"role": "user", "content": "Count to five."}]
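    # acompletion() is a coroutine: await it first, then iterate the generator it returns
    async for chunk in await litellm.acompletion("openai/gpt-4o", messages, stream=True):
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(stream_example())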

# ── Embeddings ──────────────────────────────────────────────────────────────
result = litellm.embedding(
    "openai/text-embedding-3-small",
    ["Hello world", "Rust is fast"],
)
# litellm-compatible access
print(result.data[0].embedding[:3])
# legacy list access still works
print(len(result), len(result[0]))  # 2 1536
```

## Provider Routing

Pass `provider/model` as the first argument — the prefix selects the provider:

| Provider     | Example model string                                |
|--------------|-----------------------------------------------------|
| OpenAI       | `openai/gpt-4o`                                    |
| Anthropic    | `anthropic/claude-3-5-sonnet-20241022`              |
| Google Gemini| `gemini/gemini-2.0-flash`                          |
| Mistral      | `mistral/mistral-large-latest`                      |
| OpenRouter   | `openrouter/meta-llama/llama-3.1-70b-instruct`      |
| xAI          | `xai/grok-3-beta`                                  |
| Ollama       | `ollama/llama3.2`                                  |
| LM Studio    | `lmstudio/local-model`                             |
| HuggingFace  | `huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1` |
| Mock (tests) | `mock/any-name`                                    |
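
As a rough illustration of the routing convention, the sketch below splits a model string at the first `/`, so OpenRouter and HuggingFace identifiers keep their own slashes in the model part. `split_model_string` is a hypothetical helper written for this README, not part of the `edgequake_litellm` API, and the library's actual parsing may differ.

```python
from typing import Optional, Tuple


def split_model_string(model: str, default_provider: Optional[str] = None) -> Tuple[str, str]:
    """Split 'provider/model' at the first slash (hypothetical helper)."""
    provider, sep, name = model.partition("/")
    if sep and provider and name:
        return provider, name
    # No usable prefix: fall back to a default provider if one was given.
    if default_provider is None:
        raise ValueError(f"no provider prefix in {model!r} and no default provider set")
    return default_provider, model


print(split_model_string("openai/gpt-4o"))
print(split_model_string("openrouter/meta-llama/llama-3.1-70b-instruct"))
```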

## API Reference

### `completion(model, messages, **kwargs) → ModelResponseCompat`

Synchronous chat completion. Blocks but releases the GIL during Rust I/O so other Python threads keep running.

```python
resp = litellm.completion(
    "openai/gpt-4o",
    messages,
    max_tokens=256,
    temperature=0.7,
    system="You are a helpful assistant.",
    max_completion_tokens=256,  # alias for max_tokens; required for o1/o3/gpt-4.1 models
    seed=42,
    response_format={"type": "json_object"},  # or {"type": "text"}
)

# All of these access the same content:
resp.choices[0].message.content   # litellm path
resp.content                       # shortcut
resp["choices"][0]["message"]["content"]  # dict-style

resp.usage.total_tokens
resp.model
resp.response_ms                  # latency in milliseconds
resp.to_dict()                    # plain dict

# New in 0.1.1 — cache and reasoning token metadata
resp.cache_hit_tokens             # int | None — tokens served from provider cache
resp.thinking_tokens              # int | None — reasoning tokens (o-series, Claude)
resp.thinking_content             # str | None — visible thinking text (Claude)

# The same data via usage object:
resp.usage.cache_read_input_tokens  # same as resp.cache_hit_tokens
resp.usage.reasoning_tokens         # same as resp.thinking_tokens
```
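
Because the call releases the GIL while waiting on I/O, several blocking `completion()` calls can overlap in a thread pool. The sketch below shows one way to arrange that. `complete_many` and its `completion_fn` parameter are illustrative names, and a stub stands in for `litellm.completion` so the snippet runs without API keys.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Message = Dict[str, str]


def complete_many(
    completion_fn: Callable[..., Any],
    model: str,
    prompts: List[str],
    max_workers: int = 4,
) -> List[Any]:
    """Run one blocking completion call per prompt; results keep prompt order."""

    def one(prompt: str) -> Any:
        messages: List[Message] = [{"role": "user", "content": prompt}]
        return completion_fn(model, messages)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, whichever call finishes first.
        return list(pool.map(one, prompts))


def fake_completion(model: str, messages: List[Message]) -> str:
    # Stand-in for litellm.completion so the example is self-contained.
    return f"{model}: {messages[0]['content'].upper()}"


results = complete_many(fake_completion, "mock/any-name", ["a", "b", "c"])
print(results)
```

With the real library installed, passing `litellm.completion` as `completion_fn` should work the same way, assuming the positional `(model, messages)` call shape shown above.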

### `acompletion(model, messages, stream=False, **kwargs)`

Async chat completion. Returns `ModelResponseCompat` or (if `stream=True`) `AsyncGenerator[StreamChunkCompat, None]`.

```python
# Non-streaming
resp = await litellm.acompletion("openai/gpt-4o", messages)

# Streaming
async for chunk in await litellm.acompletion("openai/gpt-4o", messages, stream=True):
    print(chunk.choices[0].delta.content or "", end="")
```
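
For batches of prompts, the non-streaming form combines naturally with `asyncio.gather`. The following is a minimal sketch of bounded concurrency. `gather_completions`, `acompletion_fn`, and `limit` are names chosen for this example, and `fake_acompletion` replaces `litellm.acompletion` so the code runs offline.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def gather_completions(
    acompletion_fn: Callable[..., Awaitable[Any]],
    model: str,
    prompts: List[str],
    limit: int = 5,
) -> List[Any]:
    """Await one completion per prompt, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt: str) -> Any:
        async with semaphore:
            return await acompletion_fn(model, [{"role": "user", "content": prompt}])

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_acompletion(model: str, messages: List[Dict[str, str]]) -> str:
    # Stand-in for litellm.acompletion: echoes the prompt reversed.
    await asyncio.sleep(0)
    return messages[0]["content"][::-1]


replies = asyncio.run(gather_completions(fake_acompletion, "mock/any-name", ["abc", "xyz"]))
print(replies)
```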

### `stream(model, messages, **kwargs) → AsyncGenerator[StreamChunk, None]`

Low-level streaming that yields raw `StreamChunk` objects:

```python
async for chunk in litellm.stream("openai/gpt-4o", messages):
    if chunk.content:
        print(chunk.content, end="")
    elif chunk.is_finished:
        print(f"\n[stop: {chunk.finish_reason}]")
```

### `embedding(model, input, **kwargs) → EmbeddingResponseCompat`

Synchronous embeddings. Returns an `EmbeddingResponseCompat` that supports both litellm-style and legacy list-style access:

```python
result = litellm.embedding("openai/text-embedding-3-small", ["foo", "bar"])

# litellm path
result.data[0].embedding

# backwards-compatible list access
for vec in result:          # iterates List[float]
    print(len(vec))
result[0]                   # List[float]
len(result)                 # number of vectors
```
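
Since `result[0]` is a plain list of floats, vectors can be compared without extra dependencies. The sketch below computes cosine similarity in pure Python. `cosine_similarity` is a helper written for this example (it is not provided by the package), and the toy vectors stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for result[0] and result[1].
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```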

### `aembedding(model, input, **kwargs) → EmbeddingResponseCompat`

Async embeddings — same return type as `embedding()`.

### `stream_chunk_builder(chunks, messages=None) → ModelResponseCompat`

Reconstruct a full `ModelResponseCompat` from a collected list of streaming chunks:

```python
from edgequake_litellm import stream_chunk_builder

chunks = []
async for chunk in litellm.stream("openai/gpt-4o", messages):
    chunks.append(chunk)

full = stream_chunk_builder(chunks, messages=messages)
print(full.content)
```

## Configuration

Module-level globals mirror `litellm`:

```python
import edgequake_litellm as litellm

litellm.set_verbose = True      # enable debug logging
litellm.drop_params = True      # drop unknown params (always True)

# Set default provider / model
litellm.set_default_provider("anthropic")
litellm.set_default_model("claude-3-5-haiku-20241022")

# Now the provider prefix can be omitted:
resp = litellm.completion("claude-3-5-haiku-20241022", messages)
```

## Exception Hierarchy

Exceptions mirror LiteLLM for painless migration:

```python
import time

import edgequake_litellm as litellm

try:
    resp = litellm.completion("openai/gpt-4o", messages)
except litellm.AuthenticationError as e:
    print(f"Check your API key: {e}")
except litellm.RateLimitError:
    time.sleep(5)
except litellm.ContextWindowExceededError:
    # trim messages and retry
    pass
except litellm.NotFoundError:      # alias for ModelNotFoundError
    pass
except litellm.APIConnectionError:
    pass
```

All exceptions (`AuthenticationError`, `RateLimitError`, `ContextWindowExceededError`, `ModelNotFoundError`, `Timeout`, `APIConnectionError`, `APIError`) are also available from `edgequake_litellm.exceptions`.
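
Transient errors such as rate limits are often handled with a retry loop. The sketch below is one possible exponential-backoff wrapper, not something shipped with the package. `call_with_retries` and its parameters are hypothetical, and `FakeRateLimitError` stands in for `litellm.RateLimitError` so the example runs offline.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call()`; on a listed exception wait base_delay * 2**n, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


class FakeRateLimitError(Exception):
    """Stand-in for litellm.RateLimitError."""


state = {"calls": 0}


def flaky() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    state["calls"] += 1
    if state["calls"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"


delays: List[float] = []
result = call_with_retries(flaky, (FakeRateLimitError,), sleep=delays.append)
print(result, delays)
```

In real use the wrapped call might look like `call_with_retries(lambda: litellm.completion("openai/gpt-4o", messages), (litellm.RateLimitError, litellm.APIConnectionError))`.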

## Environment Variables

Each provider reads its credentials (or, for local servers, its host URL) from the environment variable listed below:

| Provider     | Environment variable                                      |
|--------------|-----------------------------------------------------------|
| OpenAI       | `OPENAI_API_KEY`                                         |
| Anthropic    | `ANTHROPIC_API_KEY`                                      |
| Gemini       | `GEMINI_API_KEY`                                         |
| Mistral      | `MISTRAL_API_KEY`                                        |
| OpenRouter   | `OPENROUTER_API_KEY`                                     |
| xAI          | `XAI_API_KEY`                                            |
| HuggingFace  | `HF_TOKEN`                                               |
| Ollama       | `OLLAMA_HOST` (default: `http://localhost:11434`)         |
| LM Studio    | `LMSTUDIO_HOST` (default: `http://localhost:1234`)        |

Defaults can also be set via `LITELLM_EDGE_PROVIDER` / `LITELLM_EDGE_MODEL`.
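
A pre-flight check can catch a missing key before the first request is sent. The sketch below encodes the API-key rows of the table above. `API_KEY_VARS` and `missing_credentials` are illustrative names rather than part of the package, and the provider keys are assumed to match the routing prefixes.

```python
import os
from typing import Dict, Iterable, List, Mapping, Optional

# Copied from the table above; Ollama and LM Studio use host URLs with defaults.
API_KEY_VARS: Dict[str, str] = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "xai": "XAI_API_KEY",
    "huggingface": "HF_TOKEN",
}


def missing_credentials(
    providers: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the variable names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    missing: List[str] = []
    for provider in providers:
        var = API_KEY_VARS.get(provider)
        # Providers without an API-key variable (e.g. local servers) are skipped.
        if var is not None and not env.get(var):
            missing.append(var)
    return missing


# A fake environment keeps the example independent of the real one.
print(missing_credentials(["openai", "ollama", "huggingface"], env={"OPENAI_API_KEY": "sk-test"}))
```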

## Development

### Prerequisites

- Rust ≥ 1.83 (`rustup toolchain install stable`)
- Python ≥ 3.9
- `pip install maturin`

### Build from source

```bash
git clone https://github.com/raphaelmansuy/edgequake-llm.git
cd edgequake-llm/edgequake-litellm

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

pip install maturin pytest pytest-asyncio ruff mypy

# Build & install in dev mode (incremental Rust + Python)
maturin develop --release

# Run unit tests (mock provider — no API keys needed)
pytest tests/ -k "not e2e" -v
```

### Running E2E tests

```bash
export OPENAI_API_KEY=sk-...
pytest tests/test_e2e_openai.py -v
```

### Publishing

```bash
# Bump version in pyproject.toml AND Cargo.toml (must match), then:
git tag py-v0.2.0
git push --tags
# GitHub Actions builds and publishes to PyPI automatically.
```

## License

Apache-2.0 — see [LICENSE-APACHE](../LICENSE-APACHE).


