Metadata-Version: 2.4
Name: arpeggio-shredder
Version: 0.2.0
License: AGPL-3.0-or-later OR Commercial
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: description
Dynamic: description-content-type
Dynamic: license
Dynamic: license-file

# arpeggio-shredder

Reversible shredding (flattening) and unshredding (rebuilding) of semi-structured JSONL into Apache Arrow RecordBatches.

**Source Code**: [https://codeberg.org/alwyna/shredder](https://codeberg.org/alwyna/shredder)

This package provides a **thin Python binding** over a C++ core that converts JSON documents into a columnar “atoms” representation suitable for Arrow / Parquet workflows, while preserving the ability to reconstruct the original documents exactly.

The scope is intentionally narrow: **flattening with reversibility**, not general JSON processing.

## Key properties

- **Reversible**: JSON → Arrow atoms → JSON
- **Columnar-first**: output is optimized for Arrow-native pipelines
- **Deterministic identity**: optional object and transaction tagging
- **Minimal Python surface**: High-level `shredder` package or raw `shredder_ext` extension

## Installation

```bash
pip install arpeggio-shredder
```

## Usage

Shredder 0.2.0 supports two explicit "codec" strategies: `json` (UTF-8 text) and `arrow` (native Struct/List DOM).

### Known limitations / Non-goals for 0.2.x

- **No support for array-of-array**: Shredder throws a `runtime_error` on nested arrays (e.g., `[[1,2]]`).
- **Row expansion semantics (no sibling back-fill)**: Fields encountered *after* an array may only be present in the last expanded row. Reconstruction remains correct via internal metadata.
- **Depth/complexity caveat**: The iterative core is frozen for 0.2.0. We prioritize stability over redesigning the traversal engine for extreme nesting.
- **Codec strategy is explicit**: `codec="json"` requires strings; `codec="arrow"` requires Arrow DOM (struct/list). No auto-detection.
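
Because array-of-array input is rejected, it can be useful to screen documents before shredding. The following is a minimal sketch using only the standard library. `find_nested_array` is a hypothetical helper, not part of the `shredder` API, and it assumes the limitation applies only to an array that directly contains another array, as in the `[[1,2]]` example.

```python
import json


def find_nested_array(value, path="$"):
    """Return the path of the first array found directly inside another array, or None."""
    if isinstance(value, list):
        for i, item in enumerate(value):
            if isinstance(item, list):
                return f"{path}[{i}]"
            found = find_nested_array(item, f"{path}[{i}]")
            if found is not None:
                return found
    elif isinstance(value, dict):
        for key, item in value.items():
            found = find_nested_array(item, f"{path}.{key}")
            if found is not None:
                return found
    return None


lines = ['{"id": 1, "tags": ["a", "b"]}', '{"id": 2, "grid": [[1, 2]]}']

# Keep only documents that contain no array-of-array before handing them to the shredder.
safe = [line for line in lines if find_nested_array(json.loads(line)) is None]
rejected = [line for line in lines if find_nested_array(json.loads(line)) is not None]
```

Here `safe` holds the first document and `rejected` holds the second.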

### 1. JSON Text Codec (Default)

> **Warning**: The `json` codec expects **UTF-8 JSON strings** (e.g., an array of type `pa.string()`), not Python dicts or Arrow structs. No silent casting is performed.

The classic workflow: JSONL → Arrow atoms → JSONL.

```python
import pyarrow as pa
import shredder

# 1. Load JSONL as Arrow string array
lines = ['{"id": 1, "msg": "hello"}', '{"id": 2, "msg": "world"}']
array = pa.array(lines, type=pa.string())

# 2. Shred into columnar "atoms"
# codec="json" is default, but can be explicit
atoms = shredder.shred(array, codec="json")

# 3. Unshred back to JSON
# Returns RecordBatch with single column 'doc'
reconstructed = shredder.unshred(atoms, codec="json")
json_docs = reconstructed.column(0)

for doc in json_docs:
    print(doc.as_py())
```
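
To confirm a round trip, the reconstructed documents can be compared with the input. The sketch below compares parsed values rather than raw text, so it does not depend on whitespace or key order in the output. `same_documents` is a hypothetical helper, and it assumes one output document per input line, in the same order.

```python
import json


def same_documents(original_lines, reconstructed_lines):
    """Compare two JSONL sequences by parsed value, ignoring whitespace and key order."""
    if len(original_lines) != len(reconstructed_lines):
        return False
    return all(
        json.loads(a) == json.loads(b)
        for a, b in zip(original_lines, reconstructed_lines)
    )


# With the example above, this would be called as:
#   same_documents(lines, json_docs.to_pylist())
```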

### 2. Arrow DOM Codec

New in 0.2.0: Treat Arrow nested structures as the source of truth. No JSON parsing involved.

```python
import pyarrow as pa
import pyarrow.json as pajson
import shredder

# 1. Read JSONL into Arrow-native columns
# table = pajson.read_json("input.jsonl")
table = pa.table({"id": [1, 2], "val": [10.5, 20.0]})

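# 2. Convert to a single struct column (Arrow DOM input)
# Table columns are ChunkedArrays; from_arrays() expects plain Arrays
arrays = [col.combine_chunks() for col in table.columns]
doc_struct = pa.StructArray.from_arrays(arrays, names=table.column_names)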

# 3. Use the stateful Shredder class
s = shredder.Shredder(codec="arrow")

# 4. Shred/Unshred
atoms = s.shred(doc_struct)
reconstructed = s.unshred(atoms)

# Output 'doc' column is the reconstructed StructArray
doc_out = reconstructed.column(0)
print(doc_out.to_pylist())
```

## How it works

1. **Codecs**:
   - `JsonTextCodec`: 1-col UTF-8 string → internal DOM → atoms.
   - `ArrowDomCodec`: 1-col Arrow struct/list/etc → internal DOM → atoms.
2. **Flattening**: Nested objects are flattened into path-based columns (e.g., `root_user_name`).
3. **Row Expansion**: Arrays nested inside objects are "shredded" by duplicating parent values for each array element, adding an `__idx` column to preserve order. Arrays of arrays are not supported.
4. **Identity**: The engine automatically injects `__obj_id` and `__txn_id` (BLAKE3 hashes of content) to ensure perfect reconstruction.
5. **Reversibility**: Shredder preserves enough metadata to rebuild the exact original structure, regardless of the input codec.
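
The flattening and row-expansion steps above can be illustrated in pure Python. This is a conceptual sketch only, not the C++ core's algorithm: `shred_sketch` and `flatten` are hypothetical names, only top-level arrays are handled, identity columns are omitted, and the real engine's column layout and metadata may differ. It mirrors the documented behavior that values seen before an array are duplicated per element, while fields after the array land only in the last expanded row.

```python
def flatten(obj, path, out):
    """Write the scalars of nested objects into `out` under underscore-joined paths."""
    for key, value in obj.items():
        col = f"{path}_{key}"
        if isinstance(value, dict):
            flatten(value, col, out)
        elif isinstance(value, list):
            raise ValueError(f"this sketch handles top-level arrays only: {col}")
        else:
            out[col] = value


def shred_sketch(doc, prefix="root"):
    """Return one row per array element, or a single row if the document has no arrays."""
    rows, current = [], {}
    for key, value in doc.items():
        col = f"{prefix}_{key}"
        if isinstance(value, list):
            parent = dict(current)  # values seen so far are duplicated per element
            for idx, item in enumerate(value):
                if isinstance(item, list):
                    raise ValueError("array-of-array is not supported")
                current = dict(parent, __idx=idx)  # __idx preserves element order
                if isinstance(item, dict):
                    flatten(item, col, current)
                else:
                    current[col] = item
                rows.append(current)
        elif isinstance(value, dict):
            flatten(value, col, current)
        else:
            # A field after an array is written to the last expanded row only.
            current[col] = value
    return rows or [current]


doc = {"user": {"name": "ada"}, "tags": ["x", "y"], "late": True}
rows = shred_sketch(doc)
```

In this example `rows` has two entries, both carrying `root_user_name`, and only the second carries `root_late`.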

## Native extension

The package ships with a **prebuilt native extension (`.so`)** built against Apache Arrow and exposed via pybind11.

- No runtime compilation
- No system Arrow installation required
- Shared libraries are bundled into the wheel

## Platform support

- **OS**: Linux (manylinux-compatible)
- **Architecture**: x86_64
- **Python**: CPython 3.12
- **ABI**: glibc (manylinux)

Other platforms are not currently supported.
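
A script that depends on this package may want to check the running interpreter against the list above before importing it. The following is a minimal sketch using only the standard library; `is_supported` and `current_platform_supported` are hypothetical helpers, and the check assumes that `platform.libc_ver()` reports `glibc` on the supported systems.

```python
import platform
import sys


def is_supported(system, machine, implementation, version, libc):
    """True when the given values match the supported platform list."""
    return (
        system == "Linux"
        and machine == "x86_64"
        and implementation == "CPython"
        and tuple(version[:2]) == (3, 12)
        and libc == "glibc"
    )


def current_platform_supported():
    """Apply the check to the running interpreter."""
    return is_supported(
        platform.system(),
        platform.machine(),
        platform.python_implementation(),
        sys.version_info,
        platform.libc_ver()[0],
    )
```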

## Relationship to the C++ project

This package is the **Python distribution layer** for the Shredder C++ project hosted on Codeberg: [https://codeberg.org/alwyna/shredder](https://codeberg.org/alwyna/shredder).

## License

This package is dual-licensed:

- **AGPL-3.0-or-later** for open-source use and networked deployments
- **Commercial license** for proprietary or closed-source use

Commercial licensing is available via https://arpeggio.one/shop.
