Metadata-Version: 2.4
Name: coil_python
Version: 0.6.0
Summary: COIL - Token-optimized structured data encoding for LLM pipelines. Compress JSON into compact, schema-aware blocks that reduce LLM token usage by 40-70%.
Author: Muthukumaran S
License: MIT
Project-URL: Homepage, https://pypi.org/project/coil_python
Project-URL: Repository, https://github.com/muthukumaran/coil-python
Project-URL: Issues, https://github.com/muthukumaran/coil-python/issues
Keywords: llm,token-optimization,token-reduction,json,compression,prompt-engineering,prompt-compression,ai-infra,protocol,encoding,context-window,rag,ai-agent,structured-data,openai,anthropic,gpt,claude
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.6; extra == "tiktoken"
Provides-Extra: charts
Requires-Dist: matplotlib>=3.7; extra == "charts"
Provides-Extra: all
Requires-Dist: tiktoken>=0.6; extra == "all"
Requires-Dist: matplotlib>=3.7; extra == "all"

# 🧬 coil_python — COIL v0.6.0

> Token-optimized structured data encoding for LLM pipelines.

**COIL** (Compact Object Input Language) compresses JSON into a compact,
schema-aware representation that reduces LLM token usage by roughly
**40–70%** on table-heavy payloads while remaining fully lossless and
reversible.

Unlike general-purpose compression, COIL is designed around how transformers
tokenize text. It detects table-shaped arrays, builds a greedy value-alias map
using hill-climbing optimisation, and encodes rows as pipe-delimited strings —
squeezing the same information into far fewer tokens.

---

## ✨ Features

- **40–70% token reduction** on structured JSON payloads
- **Lossless round-trip** — integers, floats, booleans, None, and arrays all
  survive encode → decode with their exact Python types
- **Nested object support** — deep dicts are flattened to dot-notation keys
  and reconstructed on decode
- **Hill-climbing vmap** — Algorithm 1 from the COIL paper; only commits
  substitutions that strictly reduce token count (no negative savings)
- **Per-file type registry** — `types_file_path()` derives a name such as
  `input_coil_types.json` alongside your data, so pipelines need not share one
  global file
- **Terminal visuals** — `bar_chart`, `graph`, `report` for instant feedback
- **matplotlib charts** — `show_charts` for richer visual analysis (optional)
- **CLI** — encode, decode, and stats from the command line
- **tiktoken optional** — falls back to a character-length heuristic if not installed

---

## 📦 Installation

```bash
pip install coil_python
```

With real token counting (recommended for production):

```bash
pip install "coil_python[tiktoken]"
```

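With matplotlib charts, or with every optional extra:

```bash
pip install "coil_python[charts]"   # matplotlib only
pip install "coil_python[all]"      # tiktoken + matplotlib
```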

---

## 🚀 Quick Start

```python
import json
import coil_python as coil

# Load your data
with open("data.json") as f:
    data = json.load(f)

# Encode → writes data_coil_types.json alongside the encoded output
type_file = coil.types_file_path("data.json")   # "data_coil_types.json"
encoded   = coil.encode(data, type_file=type_file)

# Save encoded output
with open("data_encoded.json", "w") as f:
    json.dump(encoded, f, indent=2)

# Decode → restores exact types
decoded = coil.decode(encoded, type_file=type_file)

# Verify losslessness
print(coil.is_lossless(data, decoded))  # True

# Stats
s = coil.stats(data, encoded, decoded, out="coil_stats.json")
coil.report(s)
```

---

## 🖥️ CLI

After installation the `coil` command is available:

```bash
# Encode
coil encode data.json
# → data_encoded.json  +  data_coil_types.json

# Encode with explicit output paths
coil encode data.json -o out_encoded.json --type-file out_types.json

# Decode
coil decode data_encoded.json
# → data_encoded_decoded.json

# Stats dashboard
coil stats data.json data_encoded.json

# Library info
coil info
```

Or via the module:

```bash
python -m coil_python encode data.json
python -m coil_python decode data_encoded.json
python -m coil_python stats data.json data_encoded.json
```

---

## 📘 API Reference

### `coil.encode(data, *, type_file="coil_types.json", return_types=False)`

Encodes a Python object into COIL format.

COIL automatically:
- Detects table-shaped arrays (`list[dict]`) and encodes them positionally
- Flattens nested dicts to dot-notation keys (`{"a": {"b": 1}}` → `{"a.b": 1}`)
- Builds a hill-climbing value-alias map — only commits aliases that save tokens
- Stores precise column types to *type_file* for lossless decode

```python
encoded = coil.encode(data)

# Per-file type registry (recommended — avoids collisions)
type_file = coil.types_file_path("employees.json")  # "employees_coil_types.json"
encoded   = coil.encode(data, type_file=type_file)

# Get the type registry inline
encoded, registry = coil.encode(data, return_types=True)
```

---

### `coil.decode(encoded_data, *, type_file="coil_types.json")`

Restores original data from COIL encoding.

```python
decoded = coil.decode(encoded)
decoded = coil.decode(encoded, type_file="employees_coil_types.json")
```

The `type_file` must match the path used during `encode()`.

---

### `coil.verify(original, encoded, *, type_file="coil_types.json")`

Decodes and checks losslessness in one step.

```python
result = coil.verify(data, encoded)
print(result["lossless"])   # True / False
decoded = result["decoded"]
```

---

### `coil.stats(original, encoded, decoded=None, *, out="coil_stats.json")`

Computes compression metrics and saves them to *out*.

```python
s = coil.stats(data, encoded, decoded)
print(s["comparison"]["token_saving_percent"])   # e.g. 62.79
print(s["comparison"]["compression_ratio"])       # e.g. 2.69
print(s["lossless"])                              # True
```

Returns a dict with `original`, `encoded`, `comparison`, and optionally `lossless`.
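
The two headline numbers in `comparison` follow from the token counts alone. A minimal sketch of the arithmetic, assuming the ratio is original divided by encoded and both values are rounded to two decimals. The helper name `compression_metrics` is hypothetical and not part of the library:

```python
def compression_metrics(original_tokens: int, encoded_tokens: int) -> dict:
    """Derive saving percentage and compression ratio from two token counts."""
    if original_tokens <= 0 or encoded_tokens <= 0:
        raise ValueError("token counts must be positive")
    saved = original_tokens - encoded_tokens
    return {
        "token_saving_percent": round(saved / original_tokens * 100, 2),
        "compression_ratio": round(original_tokens / encoded_tokens, 2),
    }


# Token counts taken from the sample report in this README.
metrics = compression_metrics(8154, 3034)
print(metrics)  # {'token_saving_percent': 62.79, 'compression_ratio': 2.69}
```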

---

### `coil.is_lossless(original, decoded)`

Deep structural equality check for any JSON-compatible object.

```python
print(coil.is_lossless(data, decoded))  # True
```
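
Plain `==` in Python treats `1`, `1.0`, and `True` as equal, so a type-preserving check has to compare types explicitly. The sketch below shows one way a deep, type-strict comparison could work. Whether `is_lossless` is exactly this strict is an assumption, and `strict_equal` is a hypothetical name:

```python
def strict_equal(a, b) -> bool:
    """Deep equality that also requires matching Python types."""
    if type(a) is not type(b):
        return False
    if isinstance(a, dict):
        return a.keys() == b.keys() and all(strict_equal(a[k], b[k]) for k in a)
    if isinstance(a, list):
        return len(a) == len(b) and all(strict_equal(x, y) for x, y in zip(a, b))
    return a == b


print(strict_equal({"n": 1, "ok": True}, {"n": 1, "ok": True}))  # True
print(strict_equal({"n": 1}, {"n": 1.0}))                        # False
print({"n": 1} == {"n": 1.0})                                    # True
```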

---

### `coil.bar_chart(stats, metric="tokens")`

Horizontal bar chart in the terminal.

```python
coil.bar_chart(stats, "tokens")   # token count comparison
coil.bar_chart(stats, "bytes")    # byte size comparison
coil.bar_chart(stats, "ratio")    # compression ratio bar
coil.bar_chart(stats, "chars")    # character count
```

---

### `coil.graph(stats, metric="full")`

Multi-metric summary graph.

```python
coil.graph(stats, "savings")   # token & byte savings with colour bars
coil.graph(stats, "size")      # side-by-side size comparison
coil.graph(stats, "full")      # everything: savings + size + quality
```

---

### `coil.report(stats)`

Full compression dashboard — the clearest single-call summary.

```python
coil.report(stats)
```

```
╔══════════════════════════════════════════════╗
║           COIL Compression Report            ║
╠══════════════════════════════════════════════╣
║  Original  │ 8154   tokens │ 32614    bytes  ║
║  Encoded   │ 3034   tokens │ 12134    bytes  ║
╠══════════════════════════════════════════════╣
║  Token saved  : █████████████░░░░░░░ 62.79%  ║
║  Bytes saved  : █████████████░░░░░░░ 62.8%   ║
╠══════════════════════════════════════════════╣
║  Compression  : 2.69x                        ║
║  Lossless     : ✅ YES                        ║
╚══════════════════════════════════════════════╝
```
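
The savings bars in the report are fixed-width. A minimal sketch of how such a bar could be drawn, assuming 20 cells and round-to-nearest filling. The helper `render_bar` is hypothetical and is not the library's renderer:

```python
def render_bar(percent: float, width: int = 20) -> str:
    """Render a percentage as a fixed-width bar of filled and empty cells."""
    clamped = max(0.0, min(100.0, percent))
    filled = round(clamped / 100 * width)
    return "█" * filled + "░" * (width - filled)


bar = render_bar(62.79)
print(f"Token saved  : {bar} 62.79%")
```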

---

### `coil.show_charts(stats)`

Matplotlib bar charts for token count, byte size, and savings summary.
Requires `pip install "coil_python[charts]"`.

```python
coil.show_charts(stats)
```

---

### `coil.types_file_path(input_file)`

Derives the type-registry filename from an input file path.

```python
coil.types_file_path("data/employees.json")
# → "data/employees_coil_types.json"
```
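
The naming rule can be reproduced with `pathlib`. This is a sketch of the mapping shown above, not the library's code. How `types_file_path` treats Windows separators or files without an extension is not documented here, and `derive_types_path` is a hypothetical name:

```python
from pathlib import PurePosixPath


def derive_types_path(input_file: str) -> str:
    """Return '<dir>/<stem>_coil_types.json' for a given input file path."""
    path = PurePosixPath(input_file)
    return str(path.with_name(f"{path.stem}_coil_types.json"))


print(derive_types_path("data/employees.json"))  # data/employees_coil_types.json
print(derive_types_path("data.json"))            # data_coil_types.json
```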

---

### `coil.debug_mode(flag=True)`

Enable verbose internal logging.

```python
coil.debug_mode(True)
# [COIL] Encoding → type_file=coil_types.json, model=default
# [COIL] Encoding complete
```

---

### `coil.set_model(model_name)`

Select the tokenizer backend for token counting.

```python
coil.set_model("gpt-4o-mini")   # uses tiktoken if installed
coil.set_model("claude-4")      # anthropic (heuristic fallback)
coil.set_model("default")       # character-length heuristic (no deps)
```

Supported: `"gpt-4o"`, `"gpt-4o-mini"`, `"gpt-4.1"`, `"claude-3"`,
`"claude-4"`, `"gemini"`, `"mistral"`, `"default"`.

---

### `coil.info()`

Returns library metadata.

```python
coil.info()
# {
#   "library":   "coil_python",
#   "version":   "0.6.0",
#   "ecosystem": "python",
#   "model":     "default",
#   "debug":     False,
#   "purpose":   "Token-optimized structured data encoding for LLMs"
# }
```

---

## 🔁 Complete Example

```python
import json
import coil_python as coil

# ── Setup ──────────────────────────────────────────────────────────────────────
coil.set_model("gpt-4o-mini")   # use tiktoken for accurate token counts
coil.debug_mode(False)

# ── Load ───────────────────────────────────────────────────────────────────────
with open("telemetry.json") as f:
    data = json.load(f)

# ── Encode ─────────────────────────────────────────────────────────────────────
type_file = coil.types_file_path("telemetry.json")
encoded   = coil.encode(data, type_file=type_file)

with open("telemetry_encoded.json", "w") as f:
    json.dump(encoded, f, indent=2)

# ── Decode ─────────────────────────────────────────────────────────────────────
decoded = coil.decode(encoded, type_file=type_file)

with open("telemetry_decoded.json", "w") as f:
    json.dump(decoded, f, indent=2)

# ── Verify ─────────────────────────────────────────────────────────────────────
result = coil.verify(data, encoded, type_file=type_file)
print("Lossless:", result["lossless"])

# ── Analyse ────────────────────────────────────────────────────────────────────
s = coil.stats(data, encoded, decoded, out="telemetry_stats.json")

coil.report(s)
coil.bar_chart(s, "tokens")
coil.bar_chart(s, "ratio")
coil.graph(s, "full")
coil.show_charts(s)   # requires matplotlib

print(coil.info())
```

---

## 🧠 How COIL Works

COIL applies a two-stage algorithm to each table-shaped array it encounters:

**Stage 1 — Structural analysis**
Arrays of uniform dicts are detected as tables. Nested dicts are flattened to
dot-notation keys (`user.name`, `location.city`) so the entire record fits
into a single flat row. The original nesting is reconstructed on decode.
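
Stage 1 can be sketched in a few lines. The code below assumes that "uniform" means every dict has the same key set and that keys never contain a literal dot. The functions `is_table`, `flatten`, and `unflatten` are illustrative names, not the library's internals:

```python
def is_table(value) -> bool:
    """True for a non-empty list of dicts that all share the same keys."""
    if not isinstance(value, list) or not value:
        return False
    if not all(isinstance(row, dict) for row in value):
        return False
    first_keys = set(value[0])
    return all(set(row) == first_keys for row in value)


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-notation keys."""
    flat = {}
    for key, val in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict) and val:
            flat.update(flatten(val, path))
        else:
            flat[path] = val  # scalars, lists, and empty dicts stay as leaves
    return flat


def unflatten(flat: dict) -> dict:
    """Rebuild the nested structure from dot-notation keys."""
    nested = {}
    for path, val in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = val
    return nested


rows = [
    {"user": {"name": "Ada"}, "location": {"city": "Pune"}, "score": 7},
    {"user": {"name": "Lin"}, "location": {"city": "Oslo"}, "score": 9},
]
flat_rows = [flatten(r) for r in rows]
print(is_table(rows))   # True
print(flat_rows[0])     # {'user.name': 'Ada', 'location.city': 'Pune', 'score': 7}
```

Round-tripping each flat row through `unflatten` returns the original nested records, which is the property the decode step relies on.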

**Stage 2 — Hill-climbing vmap (Algorithm 1)**
1. Candidate values are sorted by `freq(v) × tokenCount(v)` descending.
2. Each iteration finds the single best alias assignment.
3. An alias is committed only if it strictly reduces the token count.
4. A stall counter stops the loop after 6 consecutive non-improving candidates.
5. Alias tokens are compact: `V1..V9, VA..VZ, VAA..`.
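
The steps above can be sketched as a single greedy pass. This is a simplified illustration, not the library's implementation. It assumes the `ceil(len / 4)` token heuristic, a cost model in which each alias adds one `alias=value` map entry, and string-valued cells. The names `token_count`, `alias_names`, and `build_vmap` are hypothetical:

```python
import math
from collections import Counter
from itertools import count, islice, product
from string import ascii_uppercase


def token_count(text: str) -> int:
    """Character-length heuristic: roughly one token per four characters."""
    return math.ceil(len(text) / 4)


def alias_names():
    """Yield V1..V9, then VA..VZ, then VAA, VAB, and so on."""
    for digit in range(1, 10):
        yield f"V{digit}"
    for length in count(1):
        for letters in product(ascii_uppercase, repeat=length):
            yield "V" + "".join(letters)


def build_vmap(values, max_stall: int = 6) -> dict:
    """Greedy alias selection that only commits strictly positive savings."""
    freq = Counter(values)
    # Step 1: rank candidates by freq(v) * tokenCount(v), highest first.
    candidates = sorted(freq, key=lambda v: freq[v] * token_count(v), reverse=True)
    names = alias_names()
    alias = next(names)
    vmap, stall = {}, 0
    for value in candidates:
        if stall >= max_stall:
            break  # Step 4: too many consecutive non-improving candidates.
        before = freq[value] * token_count(value)
        # Assumed cost: aliased occurrences plus one map entry "alias=value".
        after = freq[value] * token_count(alias) + token_count(f"{alias}={value}")
        if after < before:  # Step 3: commit only a strict reduction.
            vmap[value] = alias
            alias = next(names)
            stall = 0
        else:
            stall += 1
    return vmap


values = ["temperature_sensor"] * 5 + ["ok"] * 5 + ["humidity_sensor"] * 3
vmap = build_vmap(values)
print(vmap)
print(list(islice(alias_names(), 11)))
```

In this example the short value `"ok"` gets no alias: replacing it would cost more tokens than it saves once the map entry is counted, which is the "no negative savings" rule in action.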

The result is a `META` header (column order, alias map) and a `BODY` of
pipe-delimited rows. Column names appear once instead of once per record and
repeated values shrink to short aliases, so the output stays compact while
remaining fully decodable.
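
To make the shape of that output concrete, here is a sketch that turns flat, uniform rows into a header plus pipe-delimited lines. The field names inside `META` (`cols`, `vmap`) and the use of `str()` for cells are assumptions for illustration. The library's real layout, escaping rules, and type handling may differ, and `to_pipe_rows` is a hypothetical helper:

```python
def to_pipe_rows(rows: list[dict], vmap: dict[str, str]) -> dict:
    """Encode uniform flat rows as a column header plus pipe-delimited lines."""
    cols = list(rows[0])
    body = []
    for row in rows:
        cells = []
        for col in cols:
            text = str(row[col])
            cells.append(vmap.get(text, text))  # substitute alias when one exists
        body.append("|".join(cells))
    # Hypothetical layout: store the alias map inverted so decode can look up values.
    return {
        "META": {"cols": cols, "vmap": {alias: value for value, alias in vmap.items()}},
        "BODY": body,
    }


rows = [
    {"sensor": "temperature_sensor", "value": 21.5, "status": "ok"},
    {"sensor": "temperature_sensor", "value": 22.1, "status": "ok"},
    {"sensor": "humidity_sensor", "value": 40.0, "status": "ok"},
]
block = to_pipe_rows(rows, {"temperature_sensor": "V1", "humidity_sensor": "V2"})
print(block["BODY"])  # ['V1|21.5|ok', 'V1|22.1|ok', 'V2|40.0|ok']
```

Because every cell becomes a string here, restoring exact types on decode needs the separate type registry described in this README.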

---

## ⚠️ Important: type_file pairing

`encode()` writes a type registry file and `decode()` reads it.
**Both calls must use the same `type_file` path**, and the file must exist
when you call `decode()`.

```python
# ✅ Correct — same type_file path
encoded = coil.encode(data, type_file="my_types.json")
decoded = coil.decode(encoded, type_file="my_types.json")

# ❌ Fragile: both calls rely on the default coil_types.json in the working
#    directory, so decode() fails wherever that file is missing
encoded = coil.encode(data)           # writes coil_types.json locally
decoded = coil.decode(encoded)        # reads coil_types.json, which must exist
```

For multi-file pipelines, use `types_file_path()` to generate a unique name
per input file automatically.
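
When encoded data moves between machines, a small guard can fail early with a clear message before decoding starts. The helper `require_type_file` below is a hypothetical wrapper, not part of the library:

```python
import tempfile
from pathlib import Path


def require_type_file(type_file: str) -> Path:
    """Return the registry path, or fail early with a clear message."""
    path = Path(type_file)
    if not path.is_file():
        raise FileNotFoundError(
            f"type registry {path} not found; copy it from the machine that ran encode()"
        )
    return path


# Demonstrate the guard with a throwaway registry file.
with tempfile.TemporaryDirectory() as tmp:
    registry = Path(tmp) / "my_types.json"
    registry.write_text("{}")
    print(require_type_file(str(registry)).name)  # my_types.json
```

The returned path can then be passed on, for example `coil.decode(encoded, type_file=str(require_type_file("my_types.json")))`.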

---

## 📊 Benchmark

Tested on `mixed_test.json` (smart-city analytics payload with 6 embedded tables,
nested objects, mixed types, nulls, and array columns):

| Metric | Value |
|---|---|
| Original tokens | 3,000 |
| Encoded tokens | 1,823 |
| Token saving | **39.2%** |
| Byte saving | **39.3%** |
| Compression ratio | **1.65×** |
| Lossless | ✅ |

On highly repetitive datasets (sensor logs, transaction records) savings reach
**60–70%** with compression ratios above **2.5×**.

---

## 📋 Known Limitations

- Token counts use a `ceil(len / 4)` heuristic unless `tiktoken` is installed.
  Install with `pip install "coil_python[tiktoken]"` for accurate counts.
- COIL falls back to the original data when encoding would not reduce size,
  so not every JSON document will be compressed.
- The `type_file` must be kept alongside the encoded JSON for decode to work.
  It is not embedded in the encoded output.
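
The fallback described in the first limitation can be sketched as follows. The heuristic matches the `ceil(len / 4)` rule stated above. The surrounding function names and the choice to fall back when tiktoken does not recognise a model name are assumptions, not the library's code:

```python
import math


def heuristic_token_count(text: str) -> int:
    """Approximate token count as ceil(len / 4)."""
    return math.ceil(len(text) / 4)


def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Use tiktoken when it is importable, otherwise fall back to the heuristic."""
    try:
        import tiktoken
    except ImportError:
        return heuristic_token_count(text)
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        return heuristic_token_count(text)  # model name unknown to tiktoken
    return len(encoding.encode(text))


print(heuristic_token_count('{"status": "ok"}'))  # 16 characters -> 4
```

The heuristic is cheap and dependency-free, but real tokenizers split punctuation-heavy JSON differently, so savings measured with it are estimates.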

---

## 📜 License

MIT

---

## 👤 Author

**Muthukumaran S** — Creator of the COIL Protocol

If you use COIL in research, please cite it as:
*COIL — Compact Object Input Language, 2026.*
