Metadata-Version: 2.4
Name: alta-acceleration
Version: 0.3.0
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Linguistic
Summary: High-performance Rust acceleration for ALTA Tokenizer (Kinyarwanda BPE)
Keywords: tokenizer,bpe,kinyarwanda,rust,nlp
Author-email: Yali Labs <yalilabs24@gmail.com>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Issues, https://github.com/Nschadrack/Kin-Tokenizer/issues
Project-URL: Repository, https://github.com/Nschadrack/Kin-Tokenizer

# kin-merge

High-performance Rust acceleration for [ALTA Tokenizer](https://pypi.org/project/altatokenizer/) — a BPE tokenizer for Kinyarwanda.

This package provides optional Rust-powered functions for faster BPE training, encoding, and text preprocessing. When it is installed, `alta_tokenizer` detects and uses it automatically.

## Installation

```bash
pip install kin-merge
```

Or install with the main tokenizer:

```bash
pip install altatokenizer[fast]
```

## What it accelerates

| Operation | Function | Benefit |
|-----------|----------|---------|
| BPE merge loop | `rust_bpe_train_loop`, `RustBpeTrainer.step` | 3–5x |
| File streaming + pretokenization | `rust_stream_pretokenize_file` | Memory-efficient |
| Text encoding | `rust_encode_text_raw_batch` | 4–5x |
| Text preprocessing | `rust_preprocess_text` | 3–5x |
| Pair counting | `rust_count_pairs` | Parallel via Rayon |

## Usage

You don't need to import `kin_merge` directly. Once installed, `alta_tokenizer` detects it automatically and uses the Rust functions where available. If not installed, the tokenizer falls back to pure Python.
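The detect-and-fall-back behavior described above can be sketched as follows. This is an illustrative pattern only, not the actual detection code inside `alta_tokenizer`; the `HAVE_RUST` flag and the string-in/string-out signature of `rust_preprocess_text` are assumptions for the sake of the example.

```python
# Illustrative sketch of optional Rust acceleration with a pure-Python
# fallback. Not the tokenizer's real detection code.
try:
    # Rust-backed function, available when kin_merge is installed.
    from kin_merge import rust_preprocess_text
    HAVE_RUST = True
except ImportError:
    HAVE_RUST = False

    def rust_preprocess_text(text: str) -> str:
        # Hypothetical pure-Python stand-in: collapse runs of whitespace.
        return " ".join(text.split())

cleaned = rust_preprocess_text("  Muraho   neza  ")
```

Callers never branch on the backend themselves; the same function name works either way, which is why no direct import of `kin_merge` is needed.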

## License

MIT

