Metadata-Version: 2.4
Name: tinfer-ai
Version: 0.1.1
Summary: Tinfer — Tiny Inference Engine. Run LLMs locally with GPU acceleration.
Author-email: Sunil <kumawatsunil0663@gmail.com>
License: MIT
Project-URL: Documentation, https://sunilk240.github.io/tinfer-documentation/
Keywords: llm,inference,ai,gpu,cuda,gguf,local
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.28.0

# Tinfer — Tiny Inference Engine

Run LLMs locally with GPU acceleration. Built on [llama.cpp](https://github.com/ggml-org/llama.cpp). No cloud, no API keys, no C++ build tools.

## Install

```bash
pip install tinfer-ai
```

## Upgrade

```bash
pip install --upgrade tinfer-ai
```

## Uninstall

```bash
pip uninstall tinfer-ai
```

---

## Quick Start

```bash
# 1. Download a model (GGUF format)
pip install huggingface-hub
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir ./models

# 2. Run
tinfer -m ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello, what is AI?" -n 100
```

---

## Three Ways to Use

### 1. CLI — Direct Chat

```bash
tinfer -m model.gguf -p "Explain quantum computing" -n 200 -ngl 99
```

Key flags: `-m` model path, `-p` prompt, `-n` max tokens, `-ngl` GPU layers, `-c` context size, `--temp` temperature.
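
For example, combining these flags (prompt and values illustrative):

```bash
# 4K context window, low temperature for more deterministic output
tinfer -m model.gguf -p "Summarize the GGUF format" -n 200 -c 4096 --temp 0.2
```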

### 2. Server — WebUI + API

```bash
tinfer-server -m model.gguf --port 8080 -ngl 99
# Open http://localhost:8080 for the built-in chat interface
```

Key flags: `--port`, `--host`, `-ngl`, `-c`, `--api-key`, `--embedding`, `--slots`.
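
A sketch of a LAN-exposed, key-protected server (values illustrative; assumes `--api-key` gates requests with the usual OpenAI-style Bearer header):

```bash
tinfer-server -m model.gguf --host 0.0.0.0 --port 8080 --api-key my-secret -ngl 99
# Clients would then send: Authorization: Bearer my-secret
```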

### 3. HTTP API — OpenAI-Compatible

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
```

Works with any OpenAI SDK — just change `base_url` to `http://localhost:8080/v1`.
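
For example, with the official `openai` Python package (a minimal sketch; the model name is a placeholder, since the server serves whatever `-m` loaded, and the key only matters if the server was started with `--api-key`):

```python
from openai import OpenAI

# Point the SDK at the local tinfer-server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # placeholder; the server runs whatever -m loaded
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```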

---

## Inference Types

| Type | Description |
|------|-------------|
| **Text Generation** | Standard LLM chat and completion |
| **Vision / Multimodal** | Image understanding, OCR, visual QA |
| **Embedding & Reranking** | Semantic search, text similarity |
| **LoRA Fine-Tuned** | Run fine-tuned adapters with `--lora` flag |
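
A minimal `--lora` sketch (paths are placeholders; the adapter must already be in GGUF format, e.g. via the Model Conversion feature below):

```bash
tinfer -m base-model.gguf --lora my-adapter.gguf -p "Hello" -n 100
```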

## Key Features

| Feature | Description |
|---------|-------------|
| **Layer Offloading** | Run models larger than VRAM — Disk → CPU → GPU |
| **PagedAttention** | Zero-fragmentation KV cache with Copy-on-Write |
| **KV Cache Eviction** | Effectively unbounded generation length via smart eviction |
| **Model Conversion** | Convert HuggingFace models & LoRA adapters to GGUF |
| **Quantization** | 30+ types (Q4_K_M, Q5_K_M, Q8_0, IQ, etc.) |
| **Benchmarking** | Measure tokens/sec with `tinfer-bench` |
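
For example, a benchmarking sketch (assuming `tinfer-bench` takes the same `-m` and `-ngl` flags as the other tools):

```bash
tinfer-bench -m model.gguf -ngl 99
```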

## Tools Included

| Command | Purpose |
|---------|---------|
| `tinfer` | CLI inference and chat |
| `tinfer-server` | HTTP server with WebUI |
| `tinfer-bench` | Performance benchmarking |
| `tinfer-quantize` | Model quantization |
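
A quantization sketch, assuming `tinfer-quantize` mirrors llama.cpp's quantize interface (input file, output file, quantization type); consult the documentation for the exact arguments:

```bash
tinfer-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```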

---

## Documentation

📖 **Full documentation:** [https://sunilk240.github.io/tinfer-documentation/](https://sunilk240.github.io/tinfer-documentation/)

---

## License

MIT
