Metadata-Version: 2.4
Name: fast-bpe-rs
Version: 0.3.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: datasets>=4.1 ; extra == 'dev'
Requires-Dist: distro>=1.9 ; extra == 'dev'
Requires-Dist: memory-profiler>=0.61 ; extra == 'dev'
Requires-Dist: maturin>=1.7,<2.0 ; extra == 'dev'
Requires-Dist: pre-commit>=4.2 ; extra == 'dev'
Requires-Dist: psutil>=7.0 ; extra == 'dev'
Requires-Dist: py-cpuinfo>=9.0 ; extra == 'dev'
Requires-Dist: pytest>=8.3 ; extra == 'dev'
Requires-Dist: rustbpe>=0.1.0 ; extra == 'dev'
Requires-Dist: ruff>=0.11.13 ; extra == 'dev'
Requires-Dist: tiktoken>=0.12 ; extra == 'dev'
Requires-Dist: twine>=6.1 ; extra == 'dev'
Provides-Extra: dev
License-File: LICENSE
Summary: Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.
Keywords: bpe,nlp,pyo3,rust,tokenizer
Author: fast-bpe-rs contributors
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# fast-bpe-rs

A blazing-fast [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokenizer — written in Rust, usable from Python.

---

## Quick start

### Install

```bash
pip install fast-bpe-rs
```

> No prebuilt wheel for your platform? `pip` will compile from source. You'll need a [Rust toolchain](https://rustup.rs) installed first.

### Train

```python
from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # regex pattern for pre-splitting text
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
```

For real corpora, use a GPT-style split pattern:

```python
bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)
```

### Encode & Decode

```python
ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
```

---

## License

[Apache 2.0](LICENSE)

