Metadata-Version: 2.4
Name: rusket
Version: 0.1.25
Requires-Dist: zensical>=0.0.23 ; extra == 'docs'
Requires-Dist: numpy>=1.24 ; extra == 'pandas'
Requires-Dist: pandas>=2.0 ; extra == 'pandas'
Requires-Dist: polars>=0.20 ; extra == 'polars'
Provides-Extra: docs
Provides-Extra: pandas
Provides-Extra: polars
License-File: LICENSE
Summary: Blazing-fast FP-Growth and Association Rules — pure Rust via PyO3
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/bmsuisse/rusket
Project-URL: Repository, https://github.com/bmsuisse/rusket.git

<p align="center">
  <img src="docs/assets/logo.svg" alt="rusket logo" width="200" height="200" />
</p>

<p align="center">
  <strong>Blazing-fast Market Basket Analysis and Recommender Engines (ALS, BPR, FP-Growth, PrefixSpan) for Python, powered by Rust.</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/rusket/"><img src="https://img.shields.io/pypi/v/rusket?color=%2334D058&logo=pypi&logoColor=white" alt="PyPI"></a>
  <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10%2B-blue?logo=python&logoColor=white" alt="Python"></a>
  <a href="https://www.rust-lang.org/"><img src="https://img.shields.io/badge/rust-1.83%2B-orange?logo=rust" alt="Rust"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="License"></a>
  <a href="https://bmsuisse.github.io/rusket/"><img src="https://img.shields.io/badge/docs-Zensical-blue" alt="Docs"></a>
</p>

---

`rusket` is a **modern library for Market Basket Analysis and Recommender Engines**. 
**Arrow-backed, fully compatible with Spark, and written entirely in Rust** (via [PyO3](https://pyo3.rs/)), it delivers **2–15× speed-ups** and dramatically lower memory usage compared to traditional Python implementations.

It features **Alternating Least Squares (ALS)** and **Bayesian Personalized Ranking (BPR)** for collaborative filtering, as well as **FP-Growth** (parallel via Rayon), **Eclat** (vertical bitset mining), **HUPM** (High-Utility Pattern Mining via EFIM), and **PrefixSpan** (sequential pattern mining). It serves as a **drop-in replacement** for [`mlxtend`](https://rasbt.github.io/mlxtend/)'s APIs, natively supporting **Pandas** (including Arrow backend), **Polars**, and **sparse DataFrames** out of the box.

All algorithms expose both a **functional API** (`mine(df, ...)`) and an **OOP class API** (`FPGrowth.from_transactions(df).mine()`) that flows naturally from raw transaction logs.

---

## ✨ Highlights

| | `rusket` | `mlxtend` |
|---|---|---|
| **Core language** | Rust (PyO3) | Pure Python |
| **Algorithms** | ALS, BPR, PrefixSpan, FP-Growth, Eclat, HUPM | FP-Growth only |
| **Recommender API** | ✅ Hybrid Engine + i2i Similarity | ❌ |
| **Graph & Embeddings** | ✅ NetworkX Export, Vector DB Export | ❌ |
| **OOP class API** | ✅ `FPGrowth.from_transactions(df).mine()` | ❌ |
| **Pandas dense input** | ✅ C-contiguous `np.uint8` | ✅ |
| **Pandas Arrow backend** | ✅ Arrow zero-copy (pandas 2.0+) | ❌ Not supported |
| **Pandas sparse input** | ✅ Zero-copy CSR → Rust | ❌ Densifies first |
| **Polars input** | ✅ Arrow zero-copy | ❌ Not supported |
| **Spark / distributed** | ✅ `mine_grouped`, `rules_grouped`, `prefixspan_grouped`, `hupm_grouped`, `recommend_batches` | ❌ |
| **Parallel mining** | ✅ Rayon work-stealing | ❌ Single-threaded |
| **Memory** | Low (native Rust buffers) | High (Python objects) |
| **API compatibility** | ✅ Drop-in replacement | — |
| **Metrics** | 12 built-in metrics | 9 |

---

## 📦 Installation

```bash
pip install rusket
# or with uv:
uv add rusket
```

**Optional extras:**

```bash
# Polars support
pip install "rusket[polars]"

# Pandas/NumPy support (usually already installed)
pip install "rusket[pandas]"
```

---

## 🚀 Quick Start

### "Frequently Bought Together" — Grocery Checkout Data

Identify which products co-occur most in customer baskets — the foundation of cross-sell widgets, promotional bundles, and shelf placement decisions.

```python
import pandas as pd
from rusket import mine, association_rules

# One week of supermarket checkout data (1 row = 1 receipt, 1 col = 1 SKU)
receipts = pd.DataFrame({
    "milk":         [1, 1, 0, 1, 1, 0, 1],
    "bread":        [1, 0, 1, 1, 0, 1, 1],
    "butter":       [1, 0, 1, 0, 0, 1, 0],
    "eggs":         [0, 1, 1, 0, 1, 0, 1],
    "coffee":       [0, 1, 0, 0, 1, 1, 0],
    "orange_juice": [1, 0, 0, 1, 0, 0, 1],
}, dtype=bool)

# Step 1 — which SKU combinations appear in ≥40% of receipts?
# method="auto" selects FP-Growth or Eclat based on catalogue density
freq = mine(receipts, min_support=0.4, use_colnames=True)

# Step 2 — keep rules with ≥60% confidence
rules = association_rules(
    freq,
    num_itemsets=len(receipts),
    metric="confidence",
    min_threshold=0.6,
)

# Lift > 1 means customers buy these together more than chance alone
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("lift", ascending=False))
```

---

### 🛒 E-Commerce Order Lines (Long Format)

Real-world data arrives as `(order_id, sku)` rows from a database — not one-hot matrices.

#### Functional API

```python
import pandas as pd
from rusket import from_transactions, mine

# Order line export from your e-commerce backend
orders = pd.DataFrame({
    "order_id": [1001, 1001, 1001, 1002, 1002, 1003, 1003],
    "sku":      ["HDPHONES", "USB_DAC", "AUX_CABLE",
                 "HDPHONES", "CARRY_CASE",
                 "USB_DAC",  "AUX_CABLE"],
})

# Convert long-format → one-hot boolean matrix, then mine
ohe  = from_transactions(orders, transaction_col="order_id", item_col="sku")
freq = mine(ohe, min_support=0.3, use_colnames=True)
print(freq)
```

#### OOP Class API

All mining algorithms expose a class-based API that goes straight from order lines to recommendations:

```python
from rusket import FPGrowth, Eclat, AutoMiner

model = AutoMiner.from_transactions(
    orders,
    transaction_col="order_id",
    item_col="sku",
    min_support=0.3,
)

freq  = model.mine(use_colnames=True)
rules = model.association_rules(metric="confidence", min_threshold=0.6)

# Which accessories should be suggested when headphones are in the cart?
suggestions = model.recommend_items(["HDPHONES"], n=3)
# → e.g. ["USB_DAC", "AUX_CABLE", "CARRY_CASE"]
```

Or use the explicit type variants:

```python
from rusket import from_pandas, from_polars

ohe = from_pandas(orders, transaction_col="order_id", item_col="sku")
ohe = from_polars(pl_orders, transaction_col="order_id", item_col="sku")
ohe = from_transactions([["HDPHONES", "USB_DAC"], ["HDPHONES", "CARRY_CASE"]])  # list of lists
```

> **Spark** is also supported: `from_spark(spark_df)` calls `.toPandas()` internally.

---

### ⚡ Eclat — Large SKU Catalogues

`eclat` uses a vertical bitset representation plus hardware `popcnt` for fast support counting. It is ideal for **large SKU catalogues** where each basket contains only a handful of items out of thousands (low density, typically < 0.15).

```python
import pandas as pd
from rusket import eclat, association_rules

# Fashion e-tailer: 5 receipts, basket contains only a subset of the catalogue
baskets = pd.DataFrame({
    "jeans":    [True, True, False, True, True],
    "t_shirt":  [True, False, True,  True, False],
    "sneakers": [True, True, True,  False, True],
    "belt":     [False, True, True,  False, True],
})

# Eclat — same API as fpgrowth, typically faster on sparse catalogues
freq  = eclat(baskets, min_support=0.4, use_colnames=True)
rules = association_rules(freq, num_itemsets=len(baskets), min_threshold=0.6)
print(rules)
```

#### When to use which?

You almost always want `rusket.mine(method="auto")`. It computes the density of your dataset (`nnz / (rows * cols)`) and applies the [Borgelt heuristic (2003)](https://borgelt.net/doc/eclat/eclat.html) to pick the best algorithm under the hood (see the sketch after the table):

| Scenario | Algorithm chosen by `method="auto"` |
|---|---|
| Large SKU catalogue, small basket size (density < 0.15) | `eclat` (bitset/SIMD intersections) |
| Smaller catalogue, dense baskets (density > 0.15) | `fpgrowth` (FP-tree traversals) |
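
A minimal sketch of that density check (illustrative only; the real selection runs inside `mine`, using the 0.15 cut-off from the table above):

```python
import pandas as pd

def pick_method(df: pd.DataFrame, cutoff: float = 0.15) -> str:
    """Illustrative stand-in for method='auto' — not rusket's actual code."""
    values = df.to_numpy(dtype=bool)
    density = values.sum() / values.size  # nnz / (rows * cols)
    return "eclat" if density < cutoff else "fpgrowth"
```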

---

### 🐻‍❄️ Polars Input — Reading from Data Lake Parquet

For teams running a modern data stack with Parquet files on S3/GCS/Azure Blob, `rusket` natively accepts [Polars](https://pola.rs/) DataFrames. Data is transferred via Arrow zero-copy buffers — **no conversion overhead**.

The fastest path from a data lake to "Frequently Bought Together" rules:

```python
import polars as pl
from rusket import mine, association_rules

# ── 1. Read a one-hot basket matrix directly from S3/GCS/local Parquet ──
# Columns = SKUs (bool), rows = receipts — produced by your dbt or Spark pipeline
baskets = pl.read_parquet("s3://data-lake/gold/basket_ohe.parquet")
print(f"Loaded {baskets.shape[0]:,} receipts × {baskets.shape[1]} SKUs")

# ── 2. Mine frequent combinations ────────────────────────────────────
freq = mine(baskets, min_support=0.02, use_colnames=True, max_len=3)
print(f"Found {len(freq):,} frequent itemsets")
print(freq.sort_values("support", ascending=False).head(10))

# ── 3. Generate cross-sell rules ────────────────────────────────────
rules = association_rules(freq, num_itemsets=len(baskets), metric="lift", min_threshold=1.2)
print(f"Rules with lift > 1.2: {len(rules):,}")
print(
    rules[["antecedents", "consequents", "confidence", "lift"]]
    .sort_values("lift", ascending=False)
    .head(8)
)
```

> **How it works under the hood:**  
> Polars → Arrow buffer → `np.uint8` (zero-copy) → Rust `fpgrowth_from_dense`

---

### 💎 High-Utility Pattern Mining (HUPM) — Profit-Driven Bundle Discovery

Frequent items aren't always the most profitable. HUPM finds product combinations that generate the **highest total gross margin** — even if they appear rarely. `rusket` implements the state-of-the-art **EFIM** algorithm in Rust.

#### OOP Class API

```python
import pandas as pd
from rusket import HUPM

# Specialty foods retailer: receipt line items with gross margin per unit sold
orders = pd.DataFrame({
    "receipt_id": [1, 1, 1, 2, 2, 3, 3],
    "product": ["aged_cheese", "wine_flight", "charcuterie",
                "aged_cheese", "charcuterie",
                "wine_flight", "charcuterie"],
    "margin": [8.50, 12.00, 6.50,   # receipt 1 — margin per item
               8.50, 6.50,           # receipt 2
               12.00, 6.50],         # receipt 3
})

# Find all product bundles generating ≥ €20 total margin across all receipts
high_margin = HUPM.from_transactions(
    orders,
    transaction_col="receipt_id",
    item_col="product",
    utility_col="margin",
    min_utility=20.0,
).mine()
print(high_margin.head())
# e.g. aged_cheese + wine_flight + charcuterie → total margin 27.0 (receipt 1)
```

#### Functional API

```python
from rusket import mine_hupm

high_margin = mine_hupm(
    data=orders,
    transaction_col="receipt_id",
    item_col="product",
    utility_col="margin",
    min_utility=20.0,
)
print(high_margin.head())
```

---

### 📊 Sparse Pandas Input

For very sparse datasets (e.g. e-commerce with thousands of SKUs), use Pandas `SparseDtype` to minimize memory. `rusket` passes the raw CSR arrays straight to Rust — **no densification ever happens**.

```python
import pandas as pd
import numpy as np
from rusket import mine

rng = np.random.default_rng(7)
n_rows, n_cols = 30_000, 500

# Very sparse: average basket size ≈ 3 items out of 500
p_buy = 3 / n_cols
matrix = rng.random((n_rows, n_cols)) < p_buy
products = [f"sku_{i:04d}" for i in range(n_cols)]

df_dense = pd.DataFrame(matrix.astype(bool), columns=products)
df_sparse = df_dense.astype(pd.SparseDtype("bool", fill_value=False))

dense_mb = df_dense.memory_usage(deep=True).sum() / 1e6
sparse_mb = df_sparse.memory_usage(deep=True).sum() / 1e6
print(f"Dense  memory: {dense_mb:.1f} MB")
print(f"Sparse memory: {sparse_mb:.1f} MB  ({dense_mb / sparse_mb:.1f}× smaller)")

# Same API, same results — just faster and lighter
freq = mine(df_sparse, min_support=0.01, use_colnames=True)
print(f"Frequent itemsets: {len(freq):,}")
```

> **How it works under the hood:**  
> Sparse DataFrame → COO → CSR → `(indptr, indices)` → Rust `fpgrowth_from_csr`
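
A hedged illustration of that pipeline with the pandas sparse accessor (`rusket` performs the equivalent conversion internally):

```python
# pandas sparse accessor → COO → CSR, reusing df_sparse from above
coo = df_sparse.sparse.to_coo()
csr = coo.tocsr()
indptr, indices = csr.indptr, csr.indices  # the two arrays handed to Rust
```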

---

### 🌊 Out-of-Core Processing (FPMiner Streaming)

For datasets scaling to **billion-row** sizes that don't fit in memory, use the `FPMiner` accumulator. It accepts chunks of `(txn_id, item_id)` pairs, sorts each chunk in place as it arrives, and performs a memory-safe **k-way merge** across all chunks to build the CSR matrix on the fly, avoiding large memory spikes.

```python
import numpy as np
from rusket import FPMiner

n_items = 5_000
miner = FPMiner(n_items=n_items)

# Feed chunks incrementally (e.g. from Parquet/CSV/SQL)
for chunk in dataset:
    txn_ids = chunk["txn_id"].to_numpy(dtype=np.int64)
    item_ids = chunk["item_id"].to_numpy(dtype=np.int32)
    
    # Fast O(k log k) per-chunk sort
    miner.add_chunk(txn_ids, item_ids)

# Stream k-way merge and mine in one pass!
# Returns a DataFrame with 'support' and 'itemsets' just like fpgrowth()
freq = miner.mine(min_support=0.001, max_len=3)
```

**Memory efficiency:** The peak memory overhead at `mine()` time is just $O(k)$ for the cursors (where $k$ is the number of chunks), plus the final compressed CSR allocation. 
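
As a concrete way to produce those chunks, the loop above can be fed straight from Parquet row batches via `pyarrow` (a sketch; the file path and column names are hypothetical):

```python
import numpy as np
import pyarrow.parquet as pq
from rusket import FPMiner

miner = FPMiner(n_items=5_000)
pf = pq.ParquetFile("events.parquet")  # hypothetical event-log file
for batch in pf.iter_batches(columns=["txn_id", "item_id"], batch_size=1_000_000):
    miner.add_chunk(
        batch.column("txn_id").to_numpy().astype(np.int64),
        batch.column("item_id").to_numpy().astype(np.int32),
    )
freq = miner.mine(min_support=0.001, max_len=3)
```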

---

### 🌩️ Distributed Computing with Apache Spark

`rusket` ships a full Spark integration layer in `rusket.spark`. All algorithms run as **Native Arrow UDFs** via `applyInArrow` — Rust is called directly on each executor, with zero Python overhead per row.

#### How it works

```
PySpark DataFrame
  └─► groupby(group_col).applyInArrow(...)
        └─► Arrow Table (per partition / per group)
              └─► Polars zero-copy conversion
                    └─► rusket Rust extension (on the executor)
                          └─► results → PyArrow → PySpark DataFrame
```

#### Full Example — Retail Basket Analysis per Store

```python
from pyspark.sql import SparkSession
from rusket.spark import mine_grouped, rules_grouped

spark = SparkSession.builder.appName("rusket-demo").getOrCreate()

# ── 1. Load your OHE transaction table (one row = one basket) ──────────────
#    Schema: store_id (string), bread (bool), butter (bool), milk (bool), ...
spark_df = spark.read.parquet("s3://data/baskets/")

# ── 2. Mine frequent itemsets per store in parallel ──────────────────────────
#    Each Spark task calls the Rust FP-Growth/Eclat engine on its Arrow batch.
freq_df = mine_grouped(
    spark_df,
    group_col="store_id",
    min_support=0.05,    # 5% support per store
    method="auto",       # auto-selects FP-Growth or Eclat
)
# freq_df schema: store_id | support (double) | itemsets (array<string>)

# ── 3. Count transactions per store (needed for rule support) ────────────────
from pyspark.sql import functions as F
counts = (
    spark_df.groupby("store_id")
    .agg(F.count("*").alias("n"))
    .rdd.collectAsMap()          # {"store_1": 12000, "store_2": 8500, ...}
)

# ── 4. Generate association rules per store ──────────────────────────────────
rules_df = rules_grouped(
    freq_df,
    group_col="store_id",
    num_itemsets=counts,         # pass per-group counts as a dict
    metric="confidence",
    min_threshold=0.6,
)
# rules_df schema: store_id | antecedents | consequents | confidence | lift | ...

rules_df.orderBy("lift", ascending=False).show(10, truncate=False)
```

#### Sequential Patterns per Category

```python
from rusket.spark import prefixspan_grouped

# event_log schema: category_id, user_id, item_id, event_ts
event_log = spark.read.parquet("s3://data/events/")

seq_df = prefixspan_grouped(
    event_log,
    group_col="category_id",   # mine independently per product category
    user_col="user_id",        # sequence identifier within the group
    time_col="event_ts",       # ordering column
    item_col="item_id",
    min_support=50,            # absolute count: pattern must appear in ≥50 sessions
    max_len=4,
)
# seq_df schema: category_id | support (long) | sequence (array<string>)
seq_df.show(5, truncate=False)
```

#### High-Utility Patterns per Region

```python
from rusket.spark import hupm_grouped

# profit_log schema: region_id, txn_id, item_id, profit
profit_log = spark.read.parquet("s3://data/profit/")

utility_df = hupm_grouped(
    profit_log,
    group_col="region_id",
    transaction_col="txn_id",
    item_col="item_id",
    utility_col="profit",
    min_utility=500.0,         # only itemsets with combined profit ≥ €500
)
# utility_df schema: region_id | utility (double) | itemset (array<long>)
utility_df.show(5, truncate=False)
```

#### Batch Recommendations across the Cluster

```python
from rusket.spark import recommend_batches
from rusket import ALS

# 1. Train an ALS model locally (or load a pre-trained one)
#    events_pd: a pandas event log with user_id / item_id columns
als = ALS(factors=64, iterations=15).from_transactions(
    events_pd,
    user_col="user_id",
    item_col="item_id",
)

# 2. Scale-out scoring: one recommendation row per user
user_df = spark.read.parquet("s3://data/users/").select("user_id")

recs_df = recommend_batches(user_df, model=als, user_col="user_id", k=10)
# recs_df schema: user_id (string) | recommended_items (array<int>)
recs_df.show(5, truncate=False)
```

> **Tip — Databricks / Delta Lake:** All functions return a standard PySpark DataFrame, so you can write results back with `.write.format("delta").save(...)` or `.saveAsTable(...)` directly.

---

### 🔄 Migrating from mlxtend

`rusket` is a **drop-in replacement**. The only API difference is `num_itemsets`:

```diff
- from mlxtend.frequent_patterns import fpgrowth, association_rules
+ from rusket import mine, association_rules

- freq  = fpgrowth(df, min_support=0.05, use_colnames=True)
+ freq  = mine(df, min_support=0.05, use_colnames=True)

- rules = association_rules(freq, metric="lift", min_threshold=1.2)
+ rules = association_rules(freq, num_itemsets=len(df),             # ← add this
+                           metric="lift", min_threshold=1.2)
```

> **Why `num_itemsets`?** This makes support calculation explicit and avoids a hidden internal pandas join that `mlxtend` performs.
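
In effect, the absolute basket counts behind each metric are recovered with plain arithmetic (illustrative numbers):

```python
support = 0.05          # relative support from the itemsets frame
num_itemsets = 20_000   # transactions in the original dataset
count = round(support * num_itemsets)  # → 1,000 baskets contain the itemset
```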

**Gotchas:**
1. Input must be `bool` or `0/1` integers — `rusket` warns if you pass floats
2. Polars is supported natively — just pass the DataFrame directly
3. Sparse pandas DataFrames work too — and use much less RAM

---

## 📖 API Reference

### OOP Class API

Every algorithm in `rusket` exposes a **class-based API** in addition to the functional helpers. All classes share a unified interface inherited from `BaseModel`:

| Class | Inherits from | Description |
|-------|--------------|-------------|
| `FPGrowth` | `Miner`, `RuleMinerMixin` | FP-Tree parallel mining |
| `Eclat` | `Miner`, `RuleMinerMixin` | Vertical bitset mining |
| `AutoMiner` | `Miner`, `RuleMinerMixin` | Auto-selects FP-Growth or Eclat |
| `HUPM` | `Miner` | High-Utility Pattern Mining (EFIM) |
| `PrefixSpan` | `Miner` | Sequential pattern mining |
| `ALS` | `ImplicitRecommender` | Alternating Least Squares CF |
| `BPR` | `ImplicitRecommender` | Bayesian Personalized Ranking CF |

All classes share the following data-ingestion class methods inherited from `BaseModel`:

```python
# Load from long-format (transaction_id, item_id) DataFrame or list of lists
model = FPGrowth.from_transactions(df, transaction_col="order_id", item_col="item", min_support=0.3)

# Typed convenience aliases — same result
model = FPGrowth.from_pandas(df,  ...)
model = FPGrowth.from_polars(pl_df, ...)
model = FPGrowth.from_spark(spark_df, ...)
```

`Miner` subclasses (`FPGrowth`, `Eclat`, `AutoMiner`) additionally mix in `RuleMinerMixin`, giving a fluent pipeline:

```python
model  = AutoMiner.from_transactions(df, min_support=0.3)
freq   = model.mine(use_colnames=True)             # pd.DataFrame [support, itemsets]
rules  = model.association_rules(metric="lift")    # pd.DataFrame [antecedents, consequents, ...]
recs   = model.recommend_items(["bread", "milk"])  # list of suggested items
```

`ImplicitRecommender` subclasses (`ALS`, `BPR`) expose:

```python
model = ALS(factors=64, iterations=15).fit(user_item_csr)
# — or directly from an event log —
model = ALS(factors=64).from_transactions(df, user_col="user_id", item_col="item_id")

items, scores = model.recommend_items(user_id=42, n=10, exclude_seen=True)
users, scores = model.recommend_users(item_id=99, n=5)
```

---

### `mine` (functional)

```python
rusket.mine(
    df,
    min_support: float = 0.5,
    null_values: bool = False,
    use_colnames: bool = False,
    max_len: int | None = None,
    method: str = "auto",
    verbose: int = 0,
) -> pd.DataFrame
```

Dynamically selects the best mining algorithm via a density heuristic; prefer it over calling `fpgrowth` or `eclat` directly. Equivalent to `AutoMiner(...).mine()`.

| Parameter | Type | Description |
|-----------|------|-------------|
| `df` | `pd.DataFrame` \| `pl.DataFrame` \| `np.ndarray` | One-hot encoded input (bool / 0-1). Dense, sparse, or Polars. |
| `min_support` | `float` | Minimum support threshold in `(0, 1]`. |
| `null_values` | `bool` | Allow NaN values in `df` (pandas only). |
| `use_colnames` | `bool` | Return column names instead of integer indices in itemsets. |
| `max_len` | `int \| None` | Maximum itemset length. `None` = unlimited. |
| `method` | `"auto" \| "fpgrowth" \| "eclat"` | Algorithm to use. `"auto"` selects Eclat when density < 0.15, otherwise FP-Growth. |
| `verbose` | `int` | Verbosity level. |

**Returns** a `pd.DataFrame` with columns `['support', 'itemsets']`.

---

### `fpgrowth` (functional)

```python
rusket.fpgrowth(
    df,
    min_support: float = 0.5,
    null_values: bool = False,
    use_colnames: bool = False,
    max_len: int | None = None,
    verbose: int = 0,
) -> pd.DataFrame
```

Equivalent to `FPGrowth(...).mine()`. See class table above.

| Parameter | Type | Description |
|-----------|------|-------------|
| `df` | `pd.DataFrame` \| `pl.DataFrame` \| `np.ndarray` | One-hot encoded input (bool / 0-1). Dense, sparse, or Polars. |
| `min_support` | `float` | Minimum support threshold in `(0, 1]`. |
| `null_values` | `bool` | Allow NaN values in `df` (pandas only). |
| `use_colnames` | `bool` | Return column names instead of integer indices in itemsets. |
| `max_len` | `int \| None` | Maximum itemset length. `None` = unlimited. |
| `verbose` | `int` | Verbosity level (kept for API compatibility with mlxtend). |

**Returns** a `pd.DataFrame` with columns `['support', 'itemsets']`.

---

### `eclat` (functional)

```python
rusket.eclat(
    df,
    min_support: float = 0.5,
    null_values: bool = False,
    use_colnames: bool = False,
    max_len: int | None = None,
    verbose: int = 0,
) -> pd.DataFrame
```

Equivalent to `Eclat(...).mine()`. Same parameters as `fpgrowth`. Uses vertical bitset representation (Eclat algorithm) instead of FP-Tree.

**Returns** a `pd.DataFrame` with columns `['support', 'itemsets']`.

---

### `association_rules` (functional)

```python
rusket.association_rules(
    df,
    num_itemsets: int,
    metric: str = "confidence",
    min_threshold: float = 0.8,
    support_only: bool = False,
    return_metrics: list[str] = [...],  # all 12 metrics by default
) -> pd.DataFrame
```

Alternatively, if you used the OOP API, call `model.association_rules(metric=..., min_threshold=...)` directly — `num_itemsets` is tracked automatically.

| Parameter | Type | Description |
|-----------|------|-------------|
| `df` | `pd.DataFrame` | Output from `fpgrowth()`. |
| `num_itemsets` | `int` | Number of transactions in the original dataset (`len(df_original)`). |
| `metric` | `str` | Metric to filter rules on (see table below). |
| `min_threshold` | `float` | Minimum value of `metric` to include a rule. |
| `support_only` | `bool` | Only compute support; fill other columns with `NaN`. |
| `return_metrics` | `list[str]` | Subset of metrics to include in the result. |

**Returns** a `pd.DataFrame` with columns `antecedents`, `consequents`, plus all requested metric columns.

#### Available Metrics

| Metric | Formula / Description |
|--------|----------------------|
| `support` | P(A ∪ B) |
| `confidence` | P(B \| A) |
| `lift` | confidence / P(B) |
| `leverage` | support − P(A)·P(B) |
| `conviction` | (1 − P(B)) / (1 − confidence) |
| `zhangs_metric` | Symmetrical correlation measure |
| `jaccard` | Jaccard similarity between A and B |
| `certainty` | Certainty factor |
| `kulczynski` | Average of P(B\|A) and P(A\|B) |
| `representativity` | Rule coverage across transactions |
| `antecedent support` | P(A) |
| `consequent support` | P(B) |
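
A quick worked example of these formulas from raw supports (illustrative values, standard definitions):

```python
s_a, s_b, s_ab = 0.5, 0.4, 0.3             # P(A), P(B), P(A ∪ B)

confidence = s_ab / s_a                    # 0.6
lift       = confidence / s_b              # 1.5
leverage   = s_ab - s_a * s_b              # 0.1
conviction = (1 - s_b) / (1 - confidence)  # 1.5
```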

---

### `from_transactions` (functional)

```python
rusket.from_transactions(
    data,
    transaction_col: str | None = None,
    item_col: str | None = None,
) -> pd.DataFrame
```

Converts long-format transactional data to a one-hot boolean matrix. Accepts Pandas DataFrames, Polars DataFrames, Spark DataFrames, or `list[list[...]]`.
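
For instance (illustrative output; the exact column order may differ):

```python
import pandas as pd
from rusket import from_transactions

long_df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sku":      ["milk", "bread", "milk"],
})
ohe = from_transactions(long_df, transaction_col="order_id", item_col="sku")
# one row per order, one boolean column per SKU:
#     bread   milk
# 0    True   True
# 1   False   True
```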

### `from_pandas` / `from_polars` / `from_spark`

Explicit typed variants of `from_transactions` for specific DataFrame types:

```python
rusket.from_pandas(df, transaction_col=None, item_col=None) -> pd.DataFrame
rusket.from_polars(df, transaction_col=None, item_col=None) -> pd.DataFrame
rusket.from_spark(df, transaction_col=None, item_col=None)  -> pd.DataFrame
```

---

## 🧠 Advanced Pattern & Recommendation Algorithms

`rusket` provides more than just basic market basket analysis. It includes an entire suite of modern algorithms and a high-level Business Recommender API.

### 🎯 ALS & BPR Collaborative Filtering

Both models learn user and item embeddings from **implicit feedback** (purchases, clicks, plays) and power personalised recommendations at scale. Use **ALS** for broad serendipitous discovery; use **BPR** when you care only about top-N ranking.

```python
from rusket import ALS, BPR

# ── "For You" homepage — music streaming platform ────────────────────
# event log: user_id | track_id | plays (optional weight)
plays = pd.DataFrame({
    "user_id":  [101, 101, 102, 102, 103, 103, 103],
    "track_id": ["T01", "T03", "T01", "T05", "T02", "T03", "T05"],
    "plays":    [12, 5, 8, 3, 20, 1, 7],  # play count as confidence weight
})

als = ALS(factors=64, iterations=15, alpha=40.0).from_transactions(
    plays, user_col="user_id", item_col="track_id", rating_col="plays"
)

# Top-10 tracks for user 101, excluding already-played tracks
tracks, scores = als.recommend_items(user_id=101, n=10, exclude_seen=True)

# Which users are most likely to enjoy track T05? — useful for email campaigns
users, scores = als.recommend_users(item_id="T05", n=50)

# BPR — optimise ranking directly rather than reconstruction
# (user_item_csr: a scipy.sparse CSR matrix of user × item interactions)
bpr = BPR(factors=64, learning_rate=0.05, iterations=150).fit(user_item_csr)
```

### 🎯 Hybrid Recommender API

Combine **Collaborative Filtering** (ALS/BPR) with **Frequent Pattern Mining** to cover every placement surface — personalised homepage ("For You") and active cart ("Frequently Bought Together") — in a single engine.

```python
from rusket import ALS, Recommender, mine, association_rules

# 1. Train on purchase history (implicit feedback)
als = ALS(factors=64, iterations=15).fit(user_item_csr)

# 2. Mine co-purchase rules from basket data
freq  = mine(basket_ohe, min_support=0.01)
rules = association_rules(freq, num_itemsets=n_receipts)

# 3. Create the Hybrid Engine
rec = Recommender(als_model=als, rules_df=rules)

# "For You" homepage — personalised for customer 1001
items, scores = rec.recommend_for_user(user_id=1001, n=5)

# Blend CF + product embeddings (e.g. from a PIM or sentence-transformer)
items, scores = rec.recommend_for_user(user_id=1001, n=5, alpha=0.7,
                                       target_item_for_semantic="HDPHONES")

# Active cart cross-sell — "Frequently Bought Together"
add_ons = rec.recommend_for_cart(["USB_DAC", "AUX_CABLE"], n=3)

# Overnight batch — score all customers, write to CRM
batch_df = rec.predict_next_chunk(user_history_df, user_col="customer_id", k=5)
```

### 🔍 Analytics Helpers

```python
from rusket import find_substitutes, customer_saturation

# Identify cannibalizing SKUs (lift < 1.0) for assortment rationalisation
subs = find_substitutes(rules_df, max_lift=0.8)
#  antecedents  consequents  lift
#  (Cola A,)    (Cola B,)    0.61   ← these products hurt each other's sales

# Segment customers by category penetration (decile 10 = buy everything; 1 = barely engaged)
saturation = customer_saturation(
    purchases_df, user_col="customer_id", category_col="category_id"
)
```

### 📈 BPR & Sequential Patterns

- **BPR (Bayesian Personalized Ranking):** Directly optimises ranking of positive interactions over negative ones — ideal for newsfeeds, playlists, and app recommendation surfaces that prioritise top-N precision.
- **Sequential Pattern Mining (PrefixSpan):** Discovers ordered patterns across time (e.g., "Subscriber signed up for broadband → mobile plan → premium bundle" or "Customer viewed Camera → 2 weeks later bought Lens"). 

`rusket` natively extracts PrefixSpan sequences from **Pandas, Polars, and PySpark** event logs with zero-copy Arrow mapping:

#### OOP Class API

```python
from rusket import PrefixSpan

# Telco product adoption journeys — what sequence of subscriptions do customers follow?
# df: customer_id | subscription_date | product_id
model = PrefixSpan.from_transactions(
    subscription_events,
    transaction_col="customer_id",
    item_col="product_id",
    time_col="subscription_date",
    min_support=50,    # at least 50 customers follow this path
    max_len=4,
)
freq_seqs = model.mine()
# e.g. [broadband] → [mobile] → [tv_bundle] appears in 312 journeys
```

#### Functional API

```python
from rusket.prefixspan import sequences_from_event_log, prefixspan

sequences, mapping = sequences_from_event_log(
    df=subscription_events,
    user_col="customer_id",
    time_col="subscription_date",
    item_col="product_id",
)

freq_seqs = prefixspan(sequences, min_support=50, max_len=4)
```

### 🕸️ Graph Analytics & Embeddings

Integrate natively with the modern GenAI/LLM stack (a usage sketch follows the list):

- **Vector Export:** Export user/item factors to a Pandas `DataFrame` ready for FAISS/Qdrant using `rusket.export_item_factors`.
- **Item-to-Item Similarity:** Fast Cosine Similarity on embeddings using `rusket.similar_items(als_model, item_id)`.
- **Graph Generation:** Automatically convert association rules into a `networkx` directed graph for community detection using `rusket.viz.to_networkx(rules)`.
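
A minimal sketch tying these helpers together (assumes a trained `als` model and a `rules` DataFrame from `association_rules`; the `item_id` value is illustrative):

```python
import rusket
from rusket.viz import to_networkx

# Item embeddings → DataFrame ready for FAISS/Qdrant ingestion
vectors = rusket.export_item_factors(als)

# Nearest neighbours of one SKU in embedding space
similar = rusket.similar_items(als, item_id="HDPHONES")

# Association rules → networkx directed graph for community detection
graph = to_networkx(rules)
```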

---

## ⚡ Benchmarks

### Scale Benchmarks (1M → 200M rows)

| Scale | `from_transactions` → fpgrowth | Direct CSR → Rust | **Speedup** |
|---|:---:|:---:|:---:|
| 1M rows | 5.0s | **0.1s** | **50×** |
| 10M rows | 24.4s | **1.2s** | **20×** |
| 50M rows | 63.1s | **4.0s** | **15×** |
| 100M rows (20M txns × 200k items) | 134.2s | **10.1s** | **13×** |
| **200M rows** (40M txns × 200k items) | 246.8s | **17.6s** | **14×** |

#### Power-user path: Direct CSR → Rust

```python
import numpy as np
from scipy import sparse as sp
from rusket import mine

# Build CSR directly from integer IDs (no pandas!)
# txn_ids / item_ids: integer ID arrays straight from your event log
csr = sp.csr_matrix(
    (np.ones(len(txn_ids), dtype=np.int8), (txn_ids, item_ids)),
    shape=(n_transactions, n_items),
)
freq = mine(csr, min_support=0.001, max_len=3,
            use_colnames=True, column_names=item_names)
```

> At 100M rows, the mining step takes **1.3 seconds** — the bottleneck is entirely the CSR build.

### Real-World Datasets

| Dataset | Transactions | Items | `rusket` | `mlxtend` | Speedup |
|---------|:----------:|:-----:|:--------:|:---------:|:-------:|
| [andi_data.txt](https://github.com/andi611/Apriori-and-Eclat-Frequent-Itemset-Mining) | 8,416 | 119 | **9.7 s** (22.8M itemsets) | **TIMEOUT** 💥 | ∞ |
| [andi_data2.txt](https://github.com/andi611/Apriori-and-Eclat-Frequent-Itemset-Mining) | 540,455 | 2,603 | **7.9 s** | 16.2 s | **2×** |

Run benchmarks yourself:

```bash
uv run python benchmarks/bench_scale.py       # Scale benchmark + Plotly chart
uv run python benchmarks/bench_realworld.py   # Real-world datasets
uv run pytest tests/test_benchmark.py -v -s   # pytest-benchmark
```

---

## 🏗 Architecture

### Data Flow

```
pandas dense         ──► np.uint8 array (C-contiguous)  ──► Rust fpgrowth_from_dense
pandas Arrow backend ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
pandas sparse        ──► CSR int32 arrays               ──► Rust fpgrowth_from_csr
polars               ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
numpy ndarray        ──► np.uint8 (C-contiguous)        ──► Rust fpgrowth_from_dense
```

All mining and rule generation happens **inside Rust**. No Python loops, no round-trips.
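
The dense pandas row of that table, for instance, boils down to one contiguous-buffer conversion before the Rust call (a conceptual sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [True, False], "b": [False, True]})  # any one-hot frame
arr = np.ascontiguousarray(df.to_numpy(dtype=np.uint8))      # C-contiguous uint8
assert arr.flags["C_CONTIGUOUS"] and arr.dtype == np.uint8
```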

### The 1 Billion Row Architecture

To pass the "1 Billion Row" threshold without OOM crashes, `rusket` employs a zero-allocation mining loop:
- **Eclat Scratch Buffers:** `intersect_count_into` writes intersections directly into thread-local, pre-allocated scratch buffers and computes `popcnt` in the same pass, with **early-exit** termination the moment a combination provably cannot reach `min_support`.
- **FP-Growth Parallel Tree Build:** Conditional FP-trees are built concurrently inside the Rayon parallel mining step, replacing the standard sequential loop and eliminating memory-contention bottlenecks.
- **`AHashMap` Deduplication:** Fast O(N) duplicate-basket counting replaces the standard O(N log N) unstable sort in the core pipeline.
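
A conceptual Python rendering of that early-exit intersection (the production code is Rust; names here are illustrative):

```python
def intersect_count(a: list[int], b: list[int], min_count: int) -> int:
    """AND two bitsets stored as 64-bit words, counting set bits and
    aborting as soon as min_count is provably unreachable."""
    count = 0
    for i, (wa, wb) in enumerate(zip(a, b)):
        count += (wa & wb).bit_count()            # popcnt per word
        best_possible = count + (len(a) - i - 1) * 64
        if best_possible < min_count:
            return -1                             # early exit: prune this branch
    return count
```
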
### Project Structure

```
├── src/                          # Rust core (PyO3)
│   ├── lib.rs                    # Module root & Python bindings
│   ├── fpgrowth.rs               # FP-Tree construction + FP-Growth mining (Rayon parallel)
│   ├── eclat.rs                  # Eclat vertical mining (bitset intersection + popcnt)
│   ├── als.rs                    # ALS collaborative filtering (CG + Cholesky + Anderson)
│   ├── bpr.rs                    # Bayesian Personalized Ranking (Hogwild! SGD)
│   ├── hupm.rs                   # High-Utility Pattern Mining (EFIM algorithm)
│   ├── prefixspan.rs             # Sequential pattern mining (PrefixSpan)
│   └── association_rules.rs      # Rule generation + 12 metrics (Rayon parallel)
│
├── rusket/                       # Python wrappers & validation
│   ├── __init__.py               # Package root
│   ├── model.py                  # BaseModel / Miner / ImplicitRecommender / RuleMinerMixin
│   ├── fpgrowth.py               # FPGrowth class + fpgrowth() functional API
│   ├── eclat.py                  # Eclat class + eclat() functional API
│   ├── mine.py                   # AutoMiner class + mine() functional API
│   ├── als.py                    # ALS collaborative filtering model
│   ├── bpr.py                    # BPR collaborative filtering model
│   ├── hupm.py                   # HUPM class + hupm() / mine_hupm() functional API
│   ├── prefixspan.py             # PrefixSpan class + prefixspan() functional API
│   ├── recommend.py              # Recommender / NextBestAction / score_potential
│   ├── analytics.py              # find_substitutes / customer_saturation
│   ├── similarity.py             # similar_items()
│   ├── export.py                 # export_item_factors()
│   ├── streaming.py              # FPMiner / mine_duckdb / mine_spark
│   ├── spark.py                  # mine_grouped / prefixspan_grouped / hupm_grouped /
│   │                             #   rules_grouped / recommend_batches / to_spark
│   ├── transactions.py           # from_transactions / from_pandas / from_polars /
│   │                             #   from_spark / from_transactions_csr
│   ├── viz.py                    # to_networkx()
│   ├── _validation.py            # Input validation
│   └── _rusket.pyi               # Type stubs for Rust extension
│
├── tests/                        # Comprehensive test suite
├── benchmarks/                   # Real-world benchmark scripts
├── docs/                         # Zensical documentation
└── pyproject.toml                # Build config (maturin)
```

---

## 🧑‍💻 Development

### Prerequisites

- **Rust** 1.83+ (`rustup update`)
- **Python** 3.10+
- [**uv**](https://docs.astral.sh/uv/) (recommended package manager)

### Getting Started

```bash
# Clone
git clone https://github.com/bmsuisse/rusket.git
cd rusket

# Build Rust extension in dev mode
uv run maturin develop --release

# Run the full test suite
uv run pytest tests/ -x -q

# Type-check the Python layer
uv run pyright rusket/

# Cargo check (Rust)
cargo check
```

### Run Examples

```bash
# Getting started
uv run python examples/01_getting_started.py

# Market basket analysis with Faker
uv run python examples/02_market_basket_faker.py

# Polars input
uv run python examples/03_polars_input.py

# Sparse input
uv run python examples/04_sparse_input.py

# Large-scale mining (100k+ rows)
uv run python examples/05_large_scale.py

# mlxtend migration guide
uv run python examples/06_mlxtend_migration.py
```

---

## 🤖 AI Disclosure

A large part of this library — including the Rust core algorithms, the Python wrappers, the OOP class hierarchy, and the Spark integration layer — was written with substantial assistance from **AI pair-programming tools** (specifically [Google Gemini / Antigravity](https://deepmind.google/technologies/gemini/)). Human review, benchmarking, and architectural decisions were applied throughout.

We believe in transparency about AI-assisted development. The algorithms are correct, the tests pass, and the performance numbers are real — but if you find a bug or a piece of "AI slop", please open an issue!

---

## 📜 License

[MIT License](LICENSE)

