Metadata-Version: 2.4
Name: trillim
Version: 0.5.0
Summary: The fastest inference framework to run BitNet models on CPUs
Project-URL: Repository, https://github.com/Vineet-Vinod/Trillim
Project-URL: Issues, https://github.com/Vineet-Vinod/Trillim/issues
Author-email: Vineet V <vineetv314@gmail.com>
License: MIT License
        
        Copyright (c) 2026 Trillim.
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
        ---
        
        Proprietary Components
        
        The following components are NOT covered by the MIT License above and are
        governed by the Trillim Proprietary EULA below:
        
          - Pre-compiled binaries          trillim/_bin/inference, trillim/_bin/trillim-quantize
          - Wheel build script             scripts/build_wheels.py
        
        ---
        
        Trillim Proprietary End-User License Agreement (EULA)
        
        Copyright (c) 2026 Trillim. All rights reserved.
        
        1. GRANT OF LICENSE.  Trillim ("Trillim") grants you a non-exclusive,
           non-transferable, revocable license to use the closed components listed
           above solely for the purpose of running Trillim-compatible models on your
           own hardware.  You may use the closed components as part of applications
           you build, provided those applications do not expose the closed components
           as a standalone service or library.
        
        2. RESTRICTIONS.  You may NOT:
           (a) reverse engineer, decompile, disassemble, or otherwise attempt to
               derive the source code of any closed component, whether distributed
               as source or as a compiled binary;
           (b) redistribute, sublicense, rent, lease, or lend the closed components
               outside of the official Trillim package (i.e., the package distributed
               via PyPI under the name "trillim" or via Trillim's official GitHub
               releases);
           (c) modify, create derivative works of, or remove any proprietary notices
               from the closed components;
           (d) use the closed components to build a competing product that replicates
               the core functionality of Trillim's kernel library or quantizer.
        
        3. OWNERSHIP.  Trillim retains all right, title, and interest in and to the
           closed components, including all intellectual property rights therein.
        
        4. NO WARRANTY.  THE CLOSED COMPONENTS ARE PROVIDED "AS IS" WITHOUT WARRANTY
           OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
           OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT.
        
        5. LIMITATION OF LIABILITY.  IN NO EVENT SHALL TRILLIM BE LIABLE FOR ANY
           INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL, OR PUNITIVE DAMAGES ARISING
           OUT OF OR RELATED TO YOUR USE OF THE CLOSED COMPONENTS, REGARDLESS OF THE
           THEORY OF LIABILITY.
        
        6. TERMINATION.  This license terminates automatically if you violate any of
           its terms.  Upon termination, you must destroy all copies of the closed
           components in your possession.
License-File: LICENSE
License-File: THIRD_PARTY_LICENSES
Keywords: 1-bit,bitnet,cpu,inference,llm,ternary
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: ddgs==9.10.0
Requires-Dist: fastapi==0.128.0
Requires-Dist: huggingface-hub==0.36.0
Requires-Dist: jinja2==3.1.0
Requires-Dist: prompt-toolkit==3.0.52
Requires-Dist: trafilatura==2.0.0
Requires-Dist: transformers==4.57.1
Requires-Dist: uvicorn[standard]==0.40.0
Provides-Extra: dev
Requires-Dist: ruff==0.15.0; extra == 'dev'
Provides-Extra: voice
Requires-Dist: faster-whisper==1.2.1; extra == 'voice'
Requires-Dist: pocket-tts==1.0.3; extra == 'voice'
Requires-Dist: python-multipart==0.0.22; extra == 'voice'
Description-Content-Type: text/markdown

# Trillim

[What is Trillim?](docs/about-trillim.md)

## Quick Start

### Installation

- Python 3.12+ required
- Install with [uv](https://docs.astral.sh/uv/) (recommended) or pip

Pick your platform for full instructions:

- [macOS](docs/install-mac.md)
- [Linux](docs/install-linux.md)
- [Windows](docs/install-windows.md)

> **Note:** The rest of this README shows bare `trillim` commands. If you're using uv, prefix each command with `uv run` (e.g. `uv run trillim chat ...`).

### Quantize your own model

If you have a HuggingFace BitNet model with safetensors weights:

```bash
# Quantize model weights → qmodel.tensors + rope.cache
trillim quantize <path-to-model> --model

# Optionally extract a PEFT LoRA adapter → qmodel.lora
trillim quantize <path-to-model> --adapter <path-to-adapter>
```

## Chat

Start an interactive conversation in your terminal:

```bash
trillim chat Trillim/BitNet-TRNQ
```

Multi-turn conversations are supported with automatic prompt caching for fast follow-ups. Use `/new` to start a fresh conversation, or `q` to quit.

See the [Chat guide](docs/chat.md) for details on LoRA adapters, sampling parameters, and performance tips.

## Search-Augmented Chat

Trillim supports pluggable inference harnesses. For web-search-enabled models, use:

```bash
trillim chat Trillim/BitNet-Search-TRNQ --harness search
```

By default, search uses DuckDuckGo (`ddgs`). To use Brave:

```bash
export SEARCH_API_KEY=<your_api_key>
trillim chat Trillim/BitNet-Search-TRNQ --harness search --search-provider brave
```

The search harness emits status markers while it runs search and synthesis steps.
See [Chat](docs/chat.md#search-mode) for full behavior and troubleshooting.

## API Server

Trillim includes an OpenAI-compatible API server:

```bash
# Start the server
trillim serve Trillim/BitNet-TRNQ

# With voice pipeline (speech-to-text + text-to-speech)
# Requires optional `voice` dependencies:
# docs/server.md -> "Voice Optional Dependencies"
trillim serve Trillim/BitNet-TRNQ --voice
```

Endpoints:
- `POST /v1/chat/completions` — chat completions (streaming supported)
- `POST /v1/completions` — text completions
- `GET /v1/models` — list loaded models
- `POST /v1/models/load` — hot-swap models, LoRA adapters, and harness/search settings at runtime
- `POST /v1/audio/transcriptions` — speech-to-text (with `--voice`)
- `POST /v1/audio/speech` — text-to-speech (with `--voice`)
- `GET /v1/voices` — list available TTS voices
- `POST /v1/voices` — register a custom voice from audio (see [Voice Cloning Setup](#voice-cloning-setup))

To enable the search harness on the server side, start the server normally, then set `"harness": "search"` (plus the optional `"search_provider"`) through `POST /v1/models/load`.
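As a minimal sketch (assuming the server is running at the default `http://localhost:8000`), that request can be built with only the standard library; the `harness` and `search_provider` fields are the ones named above:

```python
import json
import urllib.request

# Body for POST /v1/models/load: "harness" and "search_provider" are
# the fields described above; "brave" also requires SEARCH_API_KEY to
# be set in the server's environment.
payload = {"harness": "search", "search_provider": "brave"}
req = urllib.request.Request(
    "http://localhost:8000/v1/models/load",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```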

Works with the OpenAI Python client out of the box:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="BitNet-TRNQ",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

See the [Server guide](docs/server.md) for full endpoint documentation, request/response schemas, the Python SDK, and voice pipeline usage.

## LoRA Adapters

Trillim supports PEFT LoRA adapters as bf16 corrections on top of the ternary base model. The adapter lives in its own directory (separate from the base model) and must be quantized first:

```bash
# Quantize a PEFT adapter into Trillim's format
trillim quantize <path-to-base-model> --adapter <path-to-adapter>

# Chat with the base model + adapter
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir>

# Or pull a pre-quantized adapter and use it by ID
trillim pull Trillim/BitNet-GenZ-LoRA-TRNQ
trillim chat Trillim/BitNet-TRNQ --lora Trillim/BitNet-GenZ-LoRA-TRNQ
```

Adapters can also be hot-swapped at runtime via the API server's `POST /v1/models/load` endpoint. See the [Server guide](docs/server.md) for details.

## Runtime Quantization

Separately from the offline `trillim quantize` step (which converts model weights to ternary), Trillim can quantize specific layers at inference time to reduce memory usage. This is controlled with two flags available on both `chat` and `serve`:

- **`--lora-quant <type>`** — quantize LoRA adapter layers. Options: `none`, `int8`, `q4_0`, `q5_0`, `q6_k`, `q8_0`. Only applies when using `--lora`.
- **`--unembed-quant <type>`** — quantize the unembedding (output projection) layer. Options: `int8`, `q4_0`, `q5_0`, `q6_k`, `q8_0`.

```bash
# Quantize LoRA layers to int8 for lower memory
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir> --lora-quant int8

# Quantize the unembed layer to q4_0
trillim chat Trillim/BitNet-TRNQ --unembed-quant q4_0

# Both at once
trillim serve Trillim/BitNet-TRNQ --lora-quant q8_0 --unembed-quant q4_0
```

Lower quantization levels (e.g. `q4_0`) use less memory at a small quality cost. These options can also be set per-request when hot-swapping models via `POST /v1/models/load`. See the [CLI reference](docs/cli.md) for the full flag list.
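For the per-request form, a hot-swap body might look like the sketch below. The snake_case field names here are an assumption mirroring the CLI flags, not a documented schema; the authoritative request format is in the [Server guide](docs/server.md).

```python
import json

# Hypothetical POST /v1/models/load body: the quantization field names
# mirror the CLI flags (--lora-quant / --unembed-quant) and are an
# assumption, not a confirmed schema.
payload = {
    "model": "Trillim/BitNet-TRNQ",
    "lora_quant": "q8_0",
    "unembed_quant": "q4_0",
}
body = json.dumps(payload)
print(body)
```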

## Voice Cloning Setup

The voice pipeline (`--voice`) includes 8 predefined voices that work out of the box: `alba`, `marius`, `javert`, `jean`, `fantine`, `cosette`, `eponine`, `azelma`.

To register **custom voices** (voice cloning via `POST /v1/voices`), you need to accept the PocketTTS model terms and authenticate with HuggingFace:

1. Go to [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) on HuggingFace and accept the model's terms.
2. Create a token on HuggingFace (under Access Tokens) with `Read` permissions.
3. Log in locally so the token is available for downloading the voice-cloning weights:

```bash
hf auth login
```

This only needs to be done once. After that, custom voice registration works automatically. If you skip this step, you'll get an error when trying to register a custom voice — predefined voices will still work fine.

## Supported Architectures

- `BitnetForCausalLM` — BitNet with ternary weights and ReLU² activation
- `LlamaForCausalLM` — Llama-style with SiLU activation

## Platform Support

| Platform | Status |
|----------|--------|
| x86_64 (AVX2) | Supported |
| ARM64 (NEON) | Supported |

Thread count is auto-detected as `num_cores - 2`. Override it with the `--threads N` flag.
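The default can be reproduced in a couple of lines; note that clamping to a minimum of 1 thread on very small machines is an assumption here, not documented behavior:

```python
import os

# Auto-detected default from the rule above: num_cores - 2.
# The floor of 1 (for 1-2 core machines) is an assumption.
num_cores = os.cpu_count() or 1
default_threads = max(1, num_cores - 2)
print(default_threads)
```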

## Documentation

- [What is Trillim?](docs/about-trillim.md) — overview, motivation, and who it's for
- Install — [macOS](docs/install-mac.md) | [Linux](docs/install-linux.md) | [Windows](docs/install-windows.md)
- [CLI Reference](docs/cli.md) — all commands and flags
- [Chat](docs/chat.md) — interactive chat interface
- [Server](docs/server.md) — API endpoints, Python SDK, and OpenAI client usage

## License

The Trillim Python SDK source code is MIT-licensed. The C++ inference engine binaries (`inference`, `trillim-quantize`) bundled in the pip package are **proprietary** — you may use them as part of Trillim but may not reverse-engineer or redistribute them separately. See [LICENSE](LICENSE) for full terms.
