Metadata-Version: 2.4
Name: cyllama-vulkan
Version: 0.2.9
Summary: cyllama is a comprehensive zero-dependency Python library for local AI inference using the state-of-the-art llama.cpp, whisper.cpp, and stable-diffusion.cpp ecosystem.
Keywords: llama,llm,inference,cython,llama.cpp,whisper.cpp,stable-diffusion.cpp,gguf,ai,machine-learning
Author-Email: Shakeeb Alireza <shakfu@users.noreply.github.com>
License-Expression: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Cython
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Project-URL: Homepage, https://github.com/shakfu/cyllama
Project-URL: Repository, https://github.com/shakfu/cyllama
Project-URL: Issues, https://github.com/shakfu/cyllama/issues
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# cyllama - Fast, Pythonic AI Inference

cyllama is a comprehensive zero-dependency Python library for local AI inference built on the state-of-the-art `.cpp` ecosystem:

- **[llama.cpp](https://github.com/ggml-org/llama.cpp)** - Text generation, chat, embeddings, and text-to-speech
- **[whisper.cpp](https://github.com/ggerganov/whisper.cpp)** - Speech-to-text transcription and translation
- **[stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp)** - Image and video generation

It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.

**[Documentation](https://shakfu.github.io/cyllama/)** | **[PyPI](https://pypi.org/project/cyllama/)** | **[Changelog](CHANGELOG.md)**

## Features

- High-level API -- `complete()`, `chat()`, and the `LLM` class for quick prototyping and text generation
- Streaming -- token-by-token output with callbacks
- Batch processing -- process multiple prompts 3-10x faster
- GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)
- Speculative decoding -- 2-3x speedup with draft models
- Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling
- RAG -- retrieval-augmented generation with local embeddings and [sqlite-vector](https://github.com/sqliteai/sqlite-vector)
- Speech recognition -- whisper.cpp transcription and translation
- Image/video generation -- stable-diffusion.cpp handles image, image-edit, and video models
- OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints
- Framework integrations -- OpenAI API client, LangChain LLM interface

## Installation

### From PyPI

```sh
pip install cyllama
```

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.

### GPU-Accelerated Variants

GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only for now):

```sh
pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)
```

All variants install the same `cyllama` Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.

You can verify which backend is active after installation:

```sh
cyllama info
```

You can also query the backend configuration at runtime:

```python
from cyllama import _backend
print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal
```

### Build from source with a specific backend

```sh
GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama
```

## Command-Line Interface

cyllama provides a unified CLI for all major functionality:

```bash
# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf                      # interactive mode
cyllama chat -m models/llama.gguf -n 1024              # interactive, up to 1024 tokens per response
cyllama chat -m models/llama.gguf --stats              # show session stats on exit

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim                        # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."  # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."           # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info       # build and backend information
cyllama memory -m models/llama.gguf  # GPU memory estimation
```

Run `cyllama --help` or `cyllama <command> --help` for full usage. See [CLI Cheatsheet](docs/cli-cheatsheet.md) for the complete reference.

## Quick Start

```python
from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)
```

## Key Features

### Simple by Default, Powerful When Needed

**High-Level API** - Get started in seconds:

```python
from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!
```

**Streaming Support** - Real-time token-by-token output:

```python
for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
```

### Performance Optimized

**Batch Processing** - Process multiple prompts 3-10x faster:

```python
from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")
```

**Speculative Decoding** - 2-3x speedup with draft models:

```python
from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)
```

**Memory Optimization** - Smart GPU layer allocation:

```python
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
```

**N-gram Cache** - 2-10x speedup for repetitive text:

```python
from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
```

**Response Caching** - Cache LLM responses for repeated prompts:

```python
from cyllama import LLM

# Enable caching with 100 entries and 1 hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)

response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()
```

Note: Caching requires a fixed seed (not the default random sentinel) since random seeds produce non-deterministic output. Streaming responses are not cached.
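Why a fixed seed matters can be sketched in plain Python. The `ToyResponseCache` below is a hypothetical illustration (not cyllama's internal implementation) of an LRU cache with a TTL, keyed on `(prompt, seed)`: with a random seed the key changes on every call, so nothing would ever hit.

```python
import time
from collections import OrderedDict

class ToyResponseCache:
    """Minimal LRU + TTL cache keyed on (prompt, seed).

    A sketch of why response caching needs a deterministic seed:
    a fresh random seed per call would make every key unique.
    """

    def __init__(self, maxsize=100, ttl=3600):
        self.maxsize, self.ttl = maxsize, ttl
        self._store = OrderedDict()  # (prompt, seed) -> (response, timestamp)
        self.hits = self.misses = 0

    def get(self, prompt, seed):
        key = (prompt, seed)
        entry = self._store.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            self._store.move_to_end(key)  # LRU: mark as recently used
            self.hits += 1
            return entry[0]
        self.misses += 1
        return None

    def put(self, prompt, seed, response):
        self._store[(prompt, seed)] = (response, time.monotonic())
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used

cache = ToyResponseCache()
if cache.get("What is Python?", seed=42) is None:      # miss: not yet cached
    cache.put("What is Python?", seed=42, response="A language.")
cache.get("What is Python?", seed=42)                  # hit: same prompt + fixed seed
```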

### Framework Integrations

**OpenAI-Compatible API** - Drop-in replacement:

```python
from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)
```

**LangChain Integration** - Seamless ecosystem access:

```python
from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")
```

### Agent Framework

Cyllama includes a zero-dependency agent framework with three agent architectures:

**ReActAgent** - Reasoning + Acting agent with tool calling:

```python
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)
```

**ConstrainedAgent** - Grammar-enforced tool calling for 100% reliability:

```python
from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls
```

**ContractAgent** - Contract-based agent with C++26-inspired pre/post conditions:

```python
from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
```

See [Agents Overview](docs/agents_overview.md) for detailed agent documentation.

### Speech Recognition

**Whisper Transcription** - Transcribe audio files with timestamps:

```python
from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0  # timestamps are in centiseconds (10 ms units)
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```

See [Whisper docs](docs/whisper.md) for full documentation.

### Stable Diffusion

**Image Generation** - Generate images from text using stable-diffusion.cpp:

```python
from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
image.save("output.png")
```

**Advanced Generation** - Full control with SDContext:

```python
from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE
)
```

**CLI Tool** - Command-line interface:

```bash
# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info
```

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See [Stable Diffusion docs](docs/stable_diffusion.md) for full documentation.

### RAG (Retrieval-Augmented Generation)

**CLI** - Query your documents from the command line:

```bash
# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding
```

**Simple RAG** - Query your documents with LLMs:

```python
from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons."
])

# Query
response = rag.query("What is Python?")
print(response.text)
```

**Load Documents** - Support for multiple file formats:

```python
from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")
```

**Hybrid Search** - Combine vector and keyword search:

```python
from cyllama.rag import RAG, HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)

store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
```
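The weighted combination can be illustrated with a toy score-fusion function. This is a hypothetical sketch that assumes hybrid ranking is a weighted sum of per-document vector and keyword scores; `HybridStore`'s actual ranking formula and score normalization may differ.

```python
def fuse_scores(vector_scores, fts_scores, vector_weight=0.7, fts_weight=0.3):
    """Combine per-document vector-similarity and full-text-search scores
    into a single ranking. Documents missing from one side score 0.0 there.
    A hypothetical sketch, not HybridStore's exact formula."""
    doc_ids = set(vector_scores) | set(fts_scores)
    fused = {
        doc: vector_weight * vector_scores.get(doc, 0.0)
             + fts_weight * fts_scores.get(doc, 0.0)
        for doc in doc_ids
    }
    # Highest combined score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse_scores(
    vector_scores={"doc1": 0.9, "doc2": 0.4},
    fts_scores={"doc2": 1.0, "doc3": 0.8},
)
# doc1 leads on vector similarity alone; doc2 is lifted by its keyword match
```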

**Embedding Cache** - Speed up repeated queries with LRU caching:

```python
from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)

embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")
```

**Agent Integration** - Use RAG as an agent tool:

```python
from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")
```

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage. See [RAG Overview](docs/rag_overview.md) for full documentation.

### Common Utilities

**GGUF File Manipulation** - Inspect and modify model files:

```python
from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")
```

**Structured Output** - JSON schema to grammar conversion (pure Python, no C++ dependency):

```python
from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)
```

**Huggingface Model Downloads**:

```python
from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download specific file to custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf"
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()
```

## What's Inside

### Text Generation (llama.cpp)

- [x] **Full llama.cpp API** - Complete Cython wrapper with strong typing
- [x] **High-Level API** - Simple, Pythonic interface (`LLM`, `complete`, `chat`)
- [x] **Streaming Support** - Token-by-token generation with callbacks
- [x] **Batch Processing** - Efficient parallel inference
- [x] **Multimodal** - LLAVA and vision-language models
- [x] **Speculative Decoding** - 2-3x inference speedup with draft models

### Speech Recognition (whisper.cpp)

- [x] **Full whisper.cpp API** - Complete Cython wrapper
- [x] **High-Level API** - Simple `transcribe()` function
- [x] **Multiple Formats** - WAV, MP3, FLAC, and more
- [x] **Language Detection** - Automatic or specified language
- [x] **Timestamps** - Word and segment-level timing

### Image & Video Generation (stable-diffusion.cpp)

- [x] **Full stable-diffusion.cpp API** - Complete Cython wrapper
- [x] **Text-to-Image** - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2
- [x] **Image-to-Image** - Transform existing images
- [x] **Inpainting** - Mask-based editing
- [x] **ControlNet** - Guided generation with edge/pose/depth
- [x] **Video Generation** - Wan, CogVideoX models
- [x] **Upscaling** - ESRGAN 4x upscaling

### Cross-Cutting Features

- [x] **GPU Acceleration** - Metal, CUDA, Vulkan backends
- [x] **Memory Optimization** - Smart GPU layer allocation
- [x] **Agent Framework** - ReActAgent, ConstrainedAgent, ContractAgent
- [x] **Framework Integration** - OpenAI API, LangChain, FastAPI

## Why Cyllama?

**Performance**: Compiled Cython wrappers with minimal overhead

- Strong type checking at compile time
- Zero-copy data passing where possible
- Efficient memory management
- Native integration with llama.cpp optimizations

**Simplicity**: From 50 lines to 1 line for basic generation

- Intuitive, Pythonic API design
- Automatic resource management
- Sensible defaults, full control when needed

**Production-Ready**: Battle-tested and comprehensive

- 1460+ passing tests with extensive coverage
- Comprehensive documentation and examples
- Proper error handling and logging
- Framework integration for real applications

**Up-to-Date**: Tracks bleeding-edge llama.cpp

- Regular updates with latest features
- All high-priority APIs wrapped
- Performance optimizations included

## Status

**Current Version**: 0.2.9 (Apr 2026)
**llama.cpp Version**: b8802
**Build System**: scikit-build-core + CMake
**Test Coverage**: 1460+ tests passing

### Platform & GPU Availability

Pre-built wheels on PyPI:

| Package | Backend | Platform | Arch | Linking |
|---|---|---|---|---|
| `cyllama` | CPU | Linux | x86_64 | static |
| `cyllama` | CPU | Windows | x86_64 | static |
| `cyllama` | Metal | macOS | arm64 (Apple Silicon) | static |
| `cyllama` | Metal | macOS | x86_64 (Intel) | static |
| `cyllama-cuda12` | CUDA 12.4 | Linux | x86_64 | dynamic |
| `cyllama-rocm` | ROCm 6.3 | Linux | x86_64 | dynamic |
| `cyllama-sycl` | Intel SYCL (oneAPI 2025.3) | Linux | x86_64 | dynamic |
| `cyllama-vulkan` | Vulkan | Linux | x86_64 | dynamic |

Additional platform wheels are planned, starting with Vulkan and CUDA 12 support on Windows.

Build from source (any platform with a C++ toolchain):

| Backend | macOS | Linux | Windows |
|---|---|---|---|
| CPU | `make build-cpu` | `make build-cpu` | `make build-cpu` |
| Metal | `make build-metal` (default) | -- | -- |
| CUDA | -- | `make build-cuda` | `make build-cuda` |
| ROCm (HIP) | -- | `make build-hip` | -- |
| Vulkan | `make build-vulkan` | `make build-vulkan` | `make build-vulkan` |
| SYCL | -- | `make build-sycl` | -- |
| OpenCL | `make build-opencl` | `make build-opencl` | `make build-opencl` |

All source builds support both static (`make build-<backend>`) and dynamic (`make build-<backend>-dynamic`) linking.

### Recent Releases

- **v0.2.9** (Apr 2026) - Fixed CUDA image generation crash (SD now statically links its own vendored ggml by default), `--stats` works in streaming mode, exposed `LlamaContext.get_perf_data()` / `LlamaSampler.get_perf_data()`, `MtmdContextParams.warmup` property, replaced deprecated `mtmd_image_tokens_get_nx/ny` with `mtmd_decoder_pos` API, llama.cpp b8802, stable-diffusion.cpp master-567-ee5bf95
- **v0.2.8** (Apr 2026) - Expanded Cython bindings for `LlamaContextParams` (`flash_attn_type`, `embeddings`, `op_offload`, `swa_full`, `kv_unified`), ~30 new `WhisperFullParams` properties, `SDSampleParams`/`SDImageGenParams` additions (skip-layer guidance, custom sigmas, LoRA, IP-Adapter, Photo Maker, step-cache surface), `whisper_cpp.disable_logging()`, `cyllama transcribe -v` flag, centralized defaults in `cyllama._defaults` aligned with llama.cpp C library, Gemma 4 interactive chat fix, Qwen3 reasoning-block truncation fix, CUDA wheel double-free fix
- **v0.2.7** (Apr 2026) - SD defaults aligned with C library: `wtype` auto-detect (fixes blank images on CUDA), `sample_method`/`scheduler` auto-resolve, `eta` changed from 0.0 to infinity sentinel
- **v0.2.6** (Apr 2026) - Removed accidental `pytest-review` runtime dependency from 0.2.5
- **v0.2.5** (Apr 2026) - Typed loader exceptions, concurrent-use guard on `LLM`/`Embedder`/`WhisperContext`/`SDContext`, persistent RAG vector store (`cyllama rag --db`), corpus deduplication, vendored jinja2 chat templates (fixes Gemma 4 and other non-substring-detectable templates), Qwen3 `<think>`-block stripping + n-gram repetition guard, readline history for REPLs, memory-leak regression tests, llama.cpp b8757
- **v0.2.4** (Apr 2026) - Unified CLI (`cyllama gen`, `chat`, `embed`, `rag`, ...), `cyllama rag` command-line RAG, Ctrl+C during inference, embeddings endpoint, Embedder logging fix, interactive chat token limit fix
- **v0.2.3** (Apr 2026) - SD flow_shift black-image fix, GPU OOM validation, dynamic Linux install fixes, wheel backend discovery after auditwheel/delvewheel rename, CLI entry point, wheel smoke tests, OpenCL targets, CUDA tuning flags
- **v0.2.2** (Apr 2026) - CUDA wheel size stability (PTX-only sm_75), portability flags moved from manage.py to CI workflows
- **v0.2.1** (Mar 2026) - Code quality hardening: GIL release for whisper/encode, async stream fixes, memory-aware embedding cache, CI robustness, 30+ bug fixes, 1150+ tests
- **v0.2.0** (Mar 2026) - Dynamic-linked GPU wheels (CUDA, ROCm, SYCL, Vulkan) on PyPI, unified ggml, sqlite-vector vendored
- **v0.1.21** (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled
- **v0.1.20** (Feb 2026) - Update llama.cpp + stable-diffusion.cpp
- **v0.1.19** (Dec 2025) - Metal fix for stable-diffusion.cpp
- **v0.1.18** (Dec 2025) - Remaining stable-diffusion.cpp wrapped
- **v0.1.16** (Dec 2025) - Response class, Async API, Chat templates
- **v0.1.12** (Nov 2025) - Initial wrapper of stable-diffusion.cpp
- **v0.1.11** (Nov 2025) - ACP support, build improvements
- **v0.1.10** (Nov 2025) - Agent Framework, bug fixes
- **v0.1.9** (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation
- **v0.1.8** (Nov 2025) - Speculative decoding API
- **v0.1.7** (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache
- **v0.1.6** (Nov 2025) - Multimodal test fixes
- **v0.1.5** (Oct 2025) - Mongoose server, embedded server
- **v0.1.4** (Oct 2025) - Memory estimation, performance optimizations

See [CHANGELOG.md](CHANGELOG.md) for complete release history.

## Building from Source

To build `cyllama` from source:

1. Ensure you have a recent version of `python3` (currently tested on Python 3.13).

2. Git clone the latest version of `cyllama`:

    ```sh
    git clone https://github.com/shakfu/cyllama.git
    cd cyllama
    ```

3. We use [uv](https://github.com/astral-sh/uv) for package management. If you don't have it, install it via the link above, then sync the environment:

    ```sh
    uv sync
    ```

4. Type `make` in the terminal.

    This will:

    1. Download and build `llama.cpp`, `whisper.cpp` and `stable-diffusion.cpp`
    2. Install them into the `thirdparty` folder
    3. Build `cyllama` using scikit-build-core + CMake

### Build Commands

```sh
# Full build (default: static linking, builds llama.cpp from source)
make              # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic  # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel        # Creates wheel in dist/
make dist         # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu    # CPU only
make build-metal  # macOS Metal (default on macOS)
make build-cuda   # NVIDIA CUDA
make build-vulkan # Vulkan (cross-platform)
make build-hip    # AMD ROCm
make build-sycl   # Intel SYCL
make build-opencl # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda           # Static wheel
make wheel-cuda-dynamic   # Dynamic wheel with shared libs

# Clean and rebuild
make clean        # Remove build artifacts + dynamic libs
make reset        # Full reset including thirdparty and .venv
make remake       # Clean rebuild with tests

# Code quality
make lint         # Lint with ruff (auto-fix)
make format       # Format with ruff
make typecheck    # Type check with mypy
make qa           # Run all: lint, typecheck, format

# Memory leak detection
make leaks        # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check        # Validate wheels with twine
make publish      # Upload to PyPI
make publish-test # Upload to TestPyPI
```
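The RSS-growth heuristic behind `make leaks` can be sketched in a few lines. This is an illustrative approximation, not the project's actual script: it flags a run when resident memory grows by more than the threshold between the first and last cycle.

```python
def rss_growth_exceeds(samples, threshold=0.20):
    """Return True if resident-set size grew more than `threshold`
    (as a fraction of the baseline) between the first and last cycle.
    A sketch of the heuristic behind `make leaks`, not the real script."""
    baseline, final = samples[0], samples[-1]
    return (final - baseline) / baseline > threshold

# Ten cycles of (simulated) RSS readings in MB: stable memory use
stable = [100, 101, 100, 102, 101, 100, 101, 102, 101, 102]
# Ten cycles where memory keeps climbing: a likely leak
leaking = [100, 108, 115, 124, 131, 140, 148, 155, 163, 171]

print(rss_growth_exceeds(stable))   # False: only 2% growth
print(rss_growth_exceeds(leaking))  # True: 71% growth
```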

### GPU Acceleration

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):

```sh
# Static builds (all libs compiled in)
make build-cuda
make build-vulkan

# Dynamic builds (shared libs installed alongside extension)
make build-cuda-dynamic
make build-vulkan-dynamic

# Multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build
```

See [Build Backends](docs/build_backends.md) for comprehensive backend build instructions.

### Multi-GPU Configuration

For systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:

```python
from cyllama import LLM, GenerationConfig

# Use a specific GPU (GPU index 1)
llm = LLM("model.gguf", main_gpu=1)

# Multi-GPU with layer splitting (default mode)
llm = LLM("model.gguf", split_mode=1, n_gpu_layers=-1)

# Multi-GPU with tensor parallelism (row splitting)
llm = LLM("model.gguf", split_mode=2, n_gpu_layers=-1)

# Custom tensor split: 30% GPU 0, 70% GPU 1
llm = LLM("model.gguf", tensor_split=[0.3, 0.7])

# Full configuration via GenerationConfig
config = GenerationConfig(
    main_gpu=0,
    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW
    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1
    n_gpu_layers=-1
)
llm = LLM("model.gguf", config=config)
```

**Split Modes:**

- `0` (NONE): Single GPU only, uses `main_gpu`
- `1` (LAYER): Split layers and KV cache across GPUs (default)
- `2` (ROW): Tensor parallelism - split layers with row-wise distribution
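To get a feel for how `tensor_split` proportions map onto per-GPU layer counts, here is a rough sketch. Treat it as an approximation: llama.cpp's real allocator also weighs KV-cache and scratch-buffer memory when placing layers.

```python
def layers_per_gpu(tensor_split, n_layers):
    """Distribute n_layers across GPUs in proportion to tensor_split.
    Fractions are normalized, so [1, 2] and [0.333, 0.667] are equivalent.
    A rough sketch only -- llama.cpp's allocator also accounts for
    KV-cache and buffer memory."""
    total = sum(tensor_split)
    shares = [w / total for w in tensor_split]
    counts = [int(s * n_layers) for s in shares]
    counts[-1] += n_layers - sum(counts)  # give rounding remainder to the last GPU
    return counts

print(layers_per_gpu([0.3, 0.7], n_layers=32))  # [9, 23]
```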

## Testing

The `tests` directory in this repo provides extensive examples of using cyllama.

However, as a first step, you should download a smallish LLM in `.gguf` format from [huggingface](https://huggingface.co/models?search=gguf). A good small model to start with, and the one assumed by the tests, is [Llama-3.2-1B-Instruct-Q8_0.gguf](https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf). `cyllama` expects models to be stored in a `models` folder inside the cloned `cyllama` directory. To create the `models` directory (if it doesn't exist) and download this model, just type:

```sh
make download
```

This is roughly equivalent to:

```sh
cd cyllama
mkdir models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
```

Now you can test it using `llama-cli` or `llama-simple`:

```sh
bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"
```

With 1460+ passing tests, the library is ready for both quick prototyping and production use:

```sh
make test  # Run full test suite
```

You can also explore interactively:

```sh
python3 -i scripts/start.py
```

```python
>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)
```

## Documentation

Full documentation is available at [https://shakfu.github.io/cyllama/](https://shakfu.github.io/cyllama/) (built with MkDocs).

To serve docs locally: `make docs-serve`

- **[User Guide](docs/user_guide.md)** - Comprehensive guide covering all features
- **[CLI Cheatsheet](docs/cli-cheatsheet.md)** - Complete CLI reference for all commands
- **[API Reference](docs/api_reference.md)** - Complete API documentation
- **[RAG Overview](docs/rag_overview.md)** - Retrieval-augmented generation guide
- **[Cookbook](docs/cookbook.md)** - Practical recipes and patterns
- **[Changelog](CHANGELOG.md)** - Complete release history
- **Examples** - See `tests/examples/` for working code samples

## Roadmap

### Completed

- [x] Full llama.cpp API wrapper with Cython
- [x] High-level API (`LLM`, `complete`, `chat`)
- [x] Async API support (`AsyncLLM`, `complete_async`, `chat_async`)
- [x] Response class with stats and serialization
- [x] Built-in chat template system (llama.cpp templates)
- [x] Batch processing utilities
- [x] OpenAI-compatible API client
- [x] LangChain integration
- [x] Speculative decoding
- [x] GGUF file manipulation
- [x] JSON schema to grammar conversion
- [x] Model download helper
- [x] N-gram cache
- [x] OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer) with chat and embeddings
- [x] Whisper.cpp integration
- [x] Multimodal support (LLAVA)
- [x] Memory estimation utilities
- [x] Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)
- [x] Stable Diffusion (stable-diffusion.cpp) - image/video generation
- [x] RAG utilities (text chunking, document processing)

### Future

- [ ] Web UI for testing

## Contributing

Contributions are welcome! Please see the [User Guide](docs/user_guide.md) for development guidelines.

## License

This project wraps [llama.cpp](https://github.com/ggml-org/llama.cpp), [whisper.cpp](https://github.com/ggml-org/whisper.cpp), and [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) which all follow the MIT licensing terms, as does cyllama.
