Metadata-Version: 2.4
Name: deeplatent-nlp
Version: 0.3.4
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: pyarrow>=17.0.0
Requires-Dist: regex>=2024.11.6
Requires-Dist: tokenizers>=0.20.3
Requires-Dist: zstandard>=0.23.0
Requires-Dist: transformers>=4.0.0 ; extra == 'all'
Requires-Dist: huggingface-hub>=0.14.0 ; extra == 'all'
Requires-Dist: zstandard>=0.21.0 ; extra == 'all'
Requires-Dist: pytest>=7.0.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0 ; extra == 'dev'
Requires-Dist: black>=23.0.0 ; extra == 'dev'
Requires-Dist: isort>=5.0.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0.0 ; extra == 'dev'
Requires-Dist: zstandard>=0.21.0 ; extra == 'dev'
Requires-Dist: sphinx>=6.0.0 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.0.0 ; extra == 'docs'
Requires-Dist: transformers>=4.0.0 ; extra == 'hf'
Requires-Dist: huggingface-hub>=0.14.0 ; extra == 'hf'
Requires-Dist: zstandard>=0.21.0 ; extra == 'prepare'
Provides-Extra: all
Provides-Extra: dev
Provides-Extra: docs
Provides-Extra: hf
Provides-Extra: prepare
Summary: DeepLatent - Morphology-aware tokenizer for Arabic/English bilingual text with native Rust core
Keywords: tokenizer,arabic,nlp,morphology,sarf,deeplatent,bpe,myte,transformers,huggingface,bilingual,rust
Author-email: Mohammed Almaghrabi <almaghrabima@gmail.com>
Maintainer-email: Mohammed Almaghrabi <almaghrabima@gmail.com>
License: CC-BY-NC-4.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Bug Tracker, https://github.com/almaghrabima/deeplatent/issues
Project-URL: Documentation, https://huggingface.co/almaghrabima/deeplatent-tokenizer
Project-URL: Homepage, https://github.com/almaghrabima/deeplatent
Project-URL: HuggingFace, https://huggingface.co/almaghrabima/deeplatent-tokenizer
Project-URL: Repository, https://github.com/almaghrabima/deeplatent

# DeepLatent

**DeepLatent** - SARF Tokenizer for Arabic/English bilingual text with native Rust core.

This package provides the SARF (Sarf-Aware Representation Framework) tokenizer, which applies morpheme-level preprocessing before BPE tokenization. This brings Arabic/English fertility parity to 1.09, compared with 1.94 without the preprocessing (see Performance).

## Installation

```bash
pip install deeplatent-nlp
```

### Optional Extras

```bash
pip install "deeplatent-nlp[hf]"    # HuggingFace transformers integration
pip install "deeplatent-nlp[all]"   # Everything (quotes keep shells such as zsh from globbing the brackets)
```

### Building from Source

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh  # Install Rust
pip install .
```

## Quick Start

### Native Mode (Recommended)

Native mode uses the compiled Rust core. It requires encrypted tokenizer data files
(generated by `scripts/prepare_tokenizer_data.py` or bundled during build):

```python
from deeplatent import SARFTokenizer

# Auto-detect bundled data
tokenizer = SARFTokenizer.from_native()

# Or load from explicit paths
tokenizer = SARFTokenizer.from_native(
    morpheme_map_path="path/to/morpheme_map.bin.enc",
    bpe_data_path="path/to/bpe.bin.enc",
)

# Encode
ids = tokenizer.encode("مرحبا بكم في العالم")
print(f"Token IDs: {ids}")
print(f"Token count: {len(ids)}")

# Decode
text = tokenizer.decode(ids)
print(f"Decoded: {text}")

# Batch operations
texts = ["مرحبا", "Hello world", "كتب الطالب الدرس"]
batch_ids = tokenizer.encode_batch(texts)
decoded = tokenizer.decode_batch(batch_ids)
```

### HuggingFace Mode

```python
from deeplatent import SARFTokenizer

# Requires: pip install deeplatent-nlp[hf]
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Full HF-compatible API
result = tokenizer.encode(
    "مرحبا بكم",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
```

## Roundtrip Guarantee

The SARF tokenizer guarantees an exact roundtrip up to normalization:

```
decode(encode(text)) == normalize(text)
```

The encoder applies Arabic text normalization (the same normalization used during
BPE training) before tokenization: character variants are unified, diacritics are
stripped, and Arabic-Indic digits are converted to ASCII. Decoding therefore returns
the **normalized** form of the input, not the raw input.

```python
from deeplatent import SARFTokenizer

tokenizer = SARFTokenizer.from_native()

# English roundtrips exactly
text = "Hello world"
assert tokenizer.decode(tokenizer.encode(text)) == text

# Arabic roundtrips to normalized form
assert tokenizer.decode(tokenizer.encode("أحمد")) == "احمد"

# Character variants produce identical token IDs
assert tokenizer.encode("أحمد") == tokenizer.encode("احمد")

# Diacritics are stripped
assert tokenizer.encode("كَتَبَ") == tokenizer.encode("كتب")

# Arabic-Indic digits map to ASCII
assert tokenizer.encode("١٢٣") == tokenizer.encode("123")
```

### What Gets Normalized

| Input | Output | Rule |
|-------|--------|------|
| أ إ آ ٱ | ا | Alef unification |
| ى | ي | Ya normalization |
| ؤ | و | Hamza-on-waw |
| ئ | ي | Hamza-on-ya |
| كَتَبَ | كتب | Diacritic removal |
| ـعربيـ | عربي | Tatweel removal |
| ١٢٣ | 123 | Arabic-Indic digit conversion |
| Zero-width chars | *(removed)* | ZWJ/ZWNJ/BOM cleanup |

This follows common Arabic NLP preprocessing practice and is comparable to
tokenizers that apply Unicode normalization to their input.
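
The rules in the table above can be approximated in pure Python. The following is a minimal sketch, not the Rust implementation: the name `normalize_sketch` is hypothetical, and the exact code point sets (for example, which diacritics and zero-width characters are covered) are assumptions.

```python
# Illustrative stand-in for the normalization table; the Rust core may
# cover more or fewer code points than the sets assumed here.

ALEF_VARIANTS = "\u0623\u0625\u0622\u0671"  # alef with hamza above/below, madda, wasla
DIACRITICS = "".join(chr(cp) for cp in range(0x064B, 0x0653))  # fathatan .. sukun
TATWEEL = "\u0640"
ZERO_WIDTH = "\u200c\u200d\ufeff"  # ZWNJ, ZWJ, BOM

_mapping = {ch: "\u0627" for ch in ALEF_VARIANTS}  # alef unification
_mapping["\u0649"] = "\u064a"  # alef maqsura -> ya
_mapping["\u0624"] = "\u0648"  # hamza-on-waw -> waw
_mapping["\u0626"] = "\u064a"  # hamza-on-ya -> ya
# Arabic-Indic digits U+0660..U+0669 -> ASCII 0..9
_mapping.update({chr(0x0660 + d): str(d) for d in range(10)})
# Characters mapped to None are deleted by str.translate
_mapping.update({ch: None for ch in DIACRITICS + TATWEEL + ZERO_WIDTH})

_TABLE = str.maketrans(_mapping)


def normalize_sketch(text: str) -> str:
    """Apply the table's rules in a single pass over the string."""
    return text.translate(_TABLE)
```

Because every rule is a one-character substitution or deletion, a single translation table is enough and the rules cannot interfere with each other. For the actual behavior, use `normalize_arabic_text` from the Rust core.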

### Validated on eval_1b

Roundtrip fidelity was verified on 10,000 samples from the eval_1b dataset:

```
Samples tested:  10,000
Passed:          10,000 (100.00%)
Failed:          0
Avg tokens/char: 0.3649
```
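
A check of this kind can be sketched as a small harness. This is not the script that produced the numbers above: `check_roundtrip` is a hypothetical helper, the toy callables stand in for the tokenizer so the example runs without the package, and counting characters on the raw input (rather than the normalized text) is an assumption.

```python
from typing import Callable, Dict, Iterable, List


def check_roundtrip(
    samples: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    normalize: Callable[[str], str],
) -> Dict[str, float]:
    """Count samples for which decode(encode(text)) == normalize(text)."""
    passed = failed = n_tokens = n_chars = 0
    for text in samples:
        ids = encode(text)
        n_tokens += len(ids)
        n_chars += len(text)  # assumption: characters counted on the raw input
        if decode(ids) == normalize(text):
            passed += 1
        else:
            failed += 1
    return {
        "tested": passed + failed,
        "passed": passed,
        "failed": failed,
        "tokens_per_char": n_tokens / n_chars if n_chars else 0.0,
    }


# Toy stand-ins: lowercase as "normalization", one token per character.
def toy_normalize(text: str) -> str:
    return text.lower()


def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in toy_normalize(text)]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


report = check_roundtrip(["Hello", "World"], toy_encode, toy_decode, toy_normalize)
print(report)
```

With the real package, `tokenizer.encode`, `tokenizer.decode`, and `normalize_arabic_text` would take the place of the toy callables.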

## Performance

| Metric | With SARF | Without |
|--------|-----------|---------|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | **1.09** | 1.94 |

*Fertility = average tokens per word; lower is better. Parity = Arabic fertility
divided by English fertility; values closer to 1.0 mean more equal treatment of the two languages.*
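
As a worked example of the two metrics, the sketch below computes fertility from any `encode` callable and parity as the Arabic/English ratio. The helper names are hypothetical, and splitting words on whitespace is an assumption about how words were counted.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def parity(arabic_fertility: float, english_fertility: float) -> float:
    """Ratio of Arabic to English fertility; 1.0 means equal treatment."""
    return arabic_fertility / english_fertility


# Reproduce the parity row from the fertility rows of the table.
print(round(parity(2.29, 2.10), 2))  # with SARF: 1.09
print(round(parity(5.65, 2.91), 2))  # without:   1.94
```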

## Supported Platforms

Pre-built wheels are published for every release:

| Platform | Architectures | Python |
|----------|--------------|--------|
| Linux (manylinux) | x86_64, aarch64 | 3.8 - 3.13 |
| macOS | x86_64, arm64 | 3.10 - 3.13 |
| Windows | x86_64 | 3.10 - 3.13 |

A source distribution is also available for other platforms; building it requires a Rust toolchain.
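
To see which case applies to a given machine, the table can be encoded as a lookup. This is a rough sketch: `has_prebuilt_wheel` is a hypothetical helper that is not part of the package, and it does not check the manylinux (glibc) requirement, so musl-based Linux systems may still need a source build.

```python
import platform
import sys
from typing import Sequence

# (platform.system(), platform.machine()) -> minimum supported Python 3 minor
# version, transcribed from the table above. Windows reports x86_64 as "AMD64".
_WHEEL_MIN_MINOR = {
    ("Linux", "x86_64"): 8,
    ("Linux", "aarch64"): 8,
    ("Darwin", "x86_64"): 10,
    ("Darwin", "arm64"): 10,
    ("Windows", "AMD64"): 10,
}
_MAX_MINOR = 13


def has_prebuilt_wheel(system: str, machine: str, version: Sequence[int]) -> bool:
    """Return True if the table lists a wheel for this platform and Python."""
    major, minor = version[0], version[1]
    min_minor = _WHEEL_MIN_MINOR.get((system, machine))
    return major == 3 and min_minor is not None and min_minor <= minor <= _MAX_MINOR


current = has_prebuilt_wheel(platform.system(), platform.machine(), sys.version_info)
print("pre-built wheel expected" if current else "source build (Rust toolchain) expected")
```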

## API Reference

### Loading

```python
from deeplatent import SARFTokenizer

# Native mode — fast, no network, no Python dependencies
tokenizer = SARFTokenizer.from_native()
tokenizer = SARFTokenizer.from_native("morpheme_map.bin.enc", "bpe.bin.enc")

# HuggingFace mode — full transformers compatibility
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Local directory (HF format)
tokenizer = SARFTokenizer.from_directory("./my_tokenizer")
```

### Encoding

```python
# Single text
ids = tokenizer.encode("مرحبا بكم")

# Batch
batch_ids = tokenizer.encode_batch(["مرحبا", "Hello", "كتب الدرس"])

# HF mode options
result = tokenizer.encode("text", padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
```

### Decoding

```python
# Single sequence
text = tokenizer.decode(ids)

# Batch
texts = tokenizer.decode_batch(batch_ids)
```

### Token Inspection

```python
# Tokenize to strings
tokens = tokenizer.tokenize("مرحبا بكم")

# Convert between tokens and IDs
token_id = tokenizer.token_to_id("hello")
token_str = tokenizer.id_to_token(42)

# Vocabulary info
print(tokenizer.vocab_size)          # 65792
print(tokenizer.using_native)        # True
print(tokenizer.preprocessing_enabled)  # True
```

### Normalization (Rust Core)

```python
from deeplatent._core import normalize_arabic_text

normalized = normalize_arabic_text("أحمد")  # "احمد"
```

## What is SARF?

**SARF (صَرْف)** is the Arabic term for **morphology**. In Arabic linguistics,
*sarf* refers to the system that governs word formation, roots and patterns
(جذر / وزن), prefixes, suffixes, infixes, tense, gender, number, and derivation.

Most tokenizers treat Arabic as bytes or characters. **SARF treats Arabic as a language.**

## License

**CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0 International).

For commercial licensing: almaghrabima@gmail.com

## Links

- [PyPI Package](https://pypi.org/project/deeplatent-nlp/)
- [HuggingFace Model](https://huggingface.co/almaghrabima/deeplatent-tokenizer)
- [Evaluation Dataset](https://huggingface.co/datasets/almaghrabima/eval-test-data)

