Metadata-Version: 2.4
Name: is-it-slop-preprocessing
Version: 0.5.0
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development
Classifier: Topic :: Text Processing
Classifier: Typing :: Typed
Requires-Dist: numpy>=2.0
Requires-Dist: scipy>=1.14
Summary: Fast TF-IDF vectorization. Preprocessing step for `is-it-slop` package, written in Rust.
Keywords: AI-text-detector,ML,TF-IDF,Tokenization,ai-detection,machine-learning,onnx,pyo3,rust,text-classification
Author-email: SamBroomy <36888606+SamBroomy@users.noreply.github.com>
License: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Project-URL: Documentation, https://github.com/SamBroomy/is-it-slop/blob/main/README.md
Project-URL: Homepage, https://github.com/SamBroomy/is-it-slop/blob/main/python/is-it-slop-preprocessing/README.md
Project-URL: Issues, https://github.com/SamBroomy/is-it-slop/issues
Project-URL: Repository, https://github.com/SamBroomy/is-it-slop

# is-it-slop-preprocessing

Fast TF-IDF text vectorization for training AI text detection models.

Implementation in Rust with Python bindings.

> **Note for inference users:** If you only want to use the AI text detection model for predictions, install [`is-it-slop`](https://pypi.org/project/is-it-slop/) instead. This preprocessing library is intended mainly for the training step, or for accessing the preprocessing pipeline directly.

The Python bindings allow us to use the same Rust-based text preprocessing at training and inference time, ensuring consistency between model training and deployment.

## Features

- **Token n-grams**: Uses tiktoken BPE token sequences (not characters/words)
- **sklearn-compatible API**: Drop-in replacement for training pipelines
- **Parallel processing**: Automatic multi-threading via Rust/rayon
- **Multiple serialization formats**: rkyv (default), bincode, and JSON support
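
To make the "token n-grams" feature concrete, here is a minimal pure-Python sketch of extracting 2- to 4-token n-grams from a sequence of token IDs. It is an illustration only: the `token_ngrams` helper and the token IDs are hypothetical, and the library itself does this in Rust over tiktoken BPE output.

```python
def token_ngrams(token_ids, n_min=2, n_max=4):
    """Return every contiguous n-gram of length n_min..n_max as a tuple."""
    grams = []
    for n in range(n_min, n_max + 1):
        # A sequence of length L has L - n + 1 contiguous n-grams (none if L < n).
        for start in range(len(token_ids) - n + 1):
            grams.append(tuple(token_ids[start:start + n]))
    return grams


# Hypothetical token IDs standing in for the output of a BPE tokenizer.
tokens = [101, 7, 42, 9, 13]
ngrams = token_ngrams(tokens)

for gram in ngrams:
    print(gram)
```

Each distinct n-gram then acts as one "term" in the TF-IDF vocabulary, in place of a word or character sequence.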

## Installation

```bash
pip install is-it-slop-preprocessing
```

## Quick Start

```python
from is_it_slop_preprocessing import TfidfVectorizer, VectorizerParams

# Configure vectorizer (n-gram range is fixed at 2-4 tokens)
params = VectorizerParams(
    min_df=10,           # Ignore terms in < 10 docs
    max_df=0.8,          # Ignore terms in > 80% of docs
    sublinear_tf=True    # Apply log scaling to term frequencies
)

# Fit and transform training data
# (train_texts and test_texts are assumed to be lists of strings you supply)
vectorizer, X_train = TfidfVectorizer.fit_transform(train_texts, params)

# Transform test data
X_test = vectorizer.transform(test_texts)

# Save vectorizer for inference
vectorizer.save("tfidf_vectorizer.rkyv")
```
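
The parameters above can be illustrated with a small pure-Python sketch. It assumes they follow the scikit-learn conventions suggested by the comments in the Quick Start: an integer `min_df` is an absolute document count, a float `max_df` is a fraction of documents, and `sublinear_tf` replaces a raw count `tf` with `1 + ln(tf)`. The helper names are hypothetical and are not part of this package's API.

```python
import math
from collections import Counter


def filter_by_df(docs, min_df, max_df):
    """Keep terms found in at least min_df docs and at most a max_df fraction of docs."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    return {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df}


def sublinear_tf(count):
    """Dampen a raw term count: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


# Toy documents, already split into terms.
docs = [
    ["a", "b", "b"],
    ["a", "c"],
    ["a", "b"],
    ["a", "d"],
    ["a", "b", "e"],
]

vocab = filter_by_df(docs, min_df=2, max_df=0.8)
print(vocab)            # "a" is too common, "c"/"d"/"e" are too rare
print(sublinear_tf(2))  # a term seen twice is weighted less than double
```

With realistic corpora, a setting such as `min_df=10` prunes rare n-grams that would otherwise inflate the vocabulary, while `max_df=0.8` drops n-grams so common that they carry little signal.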

## Platform Support

Pre-built wheels are available for:

- **Linux**: x86_64, aarch64 (manylinux_2_28)
- **macOS**: Apple Silicon (ARM64)
- **Windows**: x86_64

## License

MIT

