Metadata-Version: 2.4
Name: markdownify-rs
Version: 0.1.2
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Operating System :: OS Independent
Summary: Rust implementation of Python markdownify with a Python API
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# markdownify-rs

A Rust implementation of the Python `markdownify` library, with output parity as the primary goal.

## Python bindings

Build and install locally with maturin (uv):
```bash
uv venv
uv pip install maturin
.venv/bin/maturin develop --features python
```

Build via pip (PEP 517):
```bash
uv pip install .
```

Usage:
```python
from markdownify_rs import markdownify

print(markdownify("<b>Hello</b>"))
```

Batch usage (parallelized in Rust):
```python
from markdownify_rs import markdownify_batch

outputs = markdownify_batch(["<b>Hello</b>", "<i>World</i>"])
```

Markdown-adjacent utilities (submodule):
```python
from markdownify_rs.markdown_utils import (
    split_into_chunks,
    split_into_chunks_batch,
    coalesce_small_chunks,
    link_percentage,
    link_percentage_batch,
    filter_by_link_percentage,
    strip_links_with_substring,
    strip_links_with_substring_batch,
    remove_large_tables,
    remove_large_tables_batch,
    remove_lines_with_substring,
    remove_lines_with_substring_batch,
    fix_newlines,
    fix_newlines_batch,
    split_on_dividers,
    strip_html_and_contents,
    strip_html_and_contents_batch,
    strip_data_uri_images,
    text_pipeline_batch,
)

chunks = split_into_chunks(text, how="sections")
chunks_batch = split_into_chunks_batch([text1, text2], how="sections")
cleaned = strip_links_with_substring(text, "javascript")
cleaned_batch = strip_links_with_substring_batch([text1, text2], "javascript")
filtered = filter_by_link_percentage([text1, text2], threshold=0.5)
pipelined = text_pipeline_batch(
    [text1, text2],
    steps=[
        ("strip_links_with_substring", {"substring": "javascript"}),
        ("remove_large_tables", {"max_cells": 200}),
        ("fix_newlines", {}),
    ],
)
```
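Semantically, `text_pipeline_batch` applies each named step, in order, to every document. A pure-Python sketch of that dispatch (the step functions below are simplified stand-ins, not the real Rust implementations):

```python
import re

def _fix_newlines(text):
    # Stand-in: collapse runs of 3+ newlines down to a blank line.
    return re.sub(r"\n{3,}", "\n\n", text)

def _remove_lines_with_substring(text, substring):
    # Stand-in: drop any line containing the substring.
    return "\n".join(line for line in text.splitlines() if substring not in line)

# Step registry: pipeline step name -> callable taking (text, **kwargs).
STEPS = {
    "fix_newlines": _fix_newlines,
    "remove_lines_with_substring": _remove_lines_with_substring,
}

def text_pipeline_batch_sketch(texts, steps):
    # Apply each (name, kwargs) step in order to every document.
    out = []
    for text in texts:
        for name, kwargs in steps:
            text = STEPS[name](text, **kwargs)
        out.append(text)
    return out
```

The Rust version can additionally parallelize the outer loop across documents, like the other `_batch` functions.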

Notes:
- `code_language_callback` is not yet supported in the Python bindings.

CLI:
```bash
markdownify-rs input.html
cat input.html | markdownify-rs
```

## Parity hacks (scraper vs. BeautifulSoup)

These are explicit, ad hoc behaviors added on top of `scraper`/`html5ever` to match
`python-markdownify` (BeautifulSoup + html.parser) output. They are intentionally
quirky and may be replaced with more “correct” behavior once parity is stable.

- **`<br>` parser quirk**: With BeautifulSoup’s html.parser, if a non‑self‑closing
  `<br>` appears before a self‑closing `<br/>`, the later `<br/>` can be treated like
  an opening `<br>` whose contents run until that implicit `<br>` is closed (usually
  when its parent closes). We emulate this by removing the content between that
  `<br/>` and the closing tag that ends the implicit `<br>` (ignoring `<br>` tags
  inside comments/scripts), which matches python-markdownify’s output.
- **Leading whitespace reconstruction**: html.parser preserves whitespace‑only text
  nodes that html5ever drops (notably between `<html>` children and at the start of
  `<body>`). We reconstruct the normalized leading whitespace prefix (using the same
  “single space vs. single newline” rules as BeautifulSoup’s `endData`) and merge it
  with the converter output, carrying it across non‑block tags and empty custom
  elements whose contents are only comments/whitespace.
- **Table header inference**: For tables whose header row is effectively empty,
  we do not force a “---” separator row, matching python-markdownify’s behavior.
- **Top-level `<td>/<th>` wrapping**: If input is a bare `<td>`/`<th>`, we wrap it
  in a `<table><tr>…</tr></table>` fragment to align with python-markdownify output.
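The last hack can be sketched in pure Python (the real logic lives in Rust; the regex-based tag detection here is a simplification):

```python
import re

def wrap_bare_cell(html: str) -> str:
    # Sketch of the parity hack: a fragment whose outermost element is a
    # bare <td> or <th> gets wrapped in a minimal table so the converter
    # sees valid table structure. The [\s>/] lookahead avoids matching
    # longer tag names such as <thead>.
    if re.match(r"\s*<(td|th)[\s>/]", html, re.IGNORECASE):
        return f"<table><tr>{html}</tr></table>"
    return html
```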

## Benchmarks

Datasets
- Michigan Statutes (JSONL, 241 HTML documents).
  - Total HTML bytes: 101,029,525 (~96.35 MiB).
  - Largest document: 8,034,686 bytes (~7.66 MiB).
  - Source file size: 102,856,616 bytes (~98.10 MiB).
- Law websites (CSV, 3,136 HTML documents).
  - Total HTML bytes: 111,747,114 (~106.57 MiB).
  - Largest document: 1,381,380 bytes (~1.32 MiB).
  - Source file size: 148,486,852 bytes (~141.61 MiB).
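The per-corpus byte statistics above can be reproduced with a sketch like the following (the `html` field name is an assumption about the JSONL schema):

```python
import json

def jsonl_stats(path):
    # Count documents and sum/max UTF-8 HTML byte lengths across a JSONL
    # corpus. Assumes one JSON object per line with an "html" field.
    total = largest = count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            size = len(json.loads(line)["html"].encode("utf-8"))
            total += size
            largest = max(largest, size)
            count += 1
    return count, total, largest
```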

Run
```bash
# Michigan Statutes (JSONL)
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify --dist-name markdownify --label markdownify

# Law websites (CSV)
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify --dist-name markdownify --label markdownify
```

Python binding comparison (both run through Python, 2026-01-28, Apple M3, macOS 14.6 / Darwin 24.6.0, Python 3.13.0)

Michigan Statutes (JSONL)
- `markdownify_rs` `convert_all` (241 docs): time 2.266594 s, throughput 42.508 MiB/s
- `markdownify_rs` `convert_all_batch` (241 docs): time 0.538012 s, throughput 179.084 MiB/s
- `markdownify_rs` `convert_largest` (8,034,686 bytes): time 187.941 ms, throughput 40.771 MiB/s
- `markdownify` `convert_all` (241 docs): time 29.654787 s, throughput 3.249 MiB/s
- `markdownify` `convert_largest` (8,034,686 bytes): time 4.496880 s, throughput 1.704 MiB/s

Speedup summary (wall-clock time, lower is better)
| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.266594 s | 0.538012 s | 29.654787 s | 13.08x (+1208.34%) | 55.12x (+5411.92%) | 4.21x (+321.29%) |
| convert_largest | 187.941 ms | n/a | 4.496880 s | 23.93x (+2292.71%) | n/a | n/a |

Law websites (CSV)
- `markdownify_rs` `convert_all` (3,136 docs): time 2.596691 s, throughput 41.041 MiB/s
- `markdownify_rs` `convert_all_batch` (3,136 docs): time 0.672013 s, throughput 158.584 MiB/s
- `markdownify_rs` `convert_largest` (1,381,380 bytes): time 54.482 ms, throughput 24.180 MiB/s
- `markdownify` `convert_all` (3,136 docs): time 17.680570 s, throughput 6.028 MiB/s
- `markdownify` `convert_largest` (1,381,380 bytes): time 280.459 ms, throughput 4.697 MiB/s

Speedup summary (wall-clock time, lower is better)
| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.596691 s | 0.672013 s | 17.680570 s | 6.81x (+580.89%) | 26.31x (+2530.99%) | 3.86x (+286.40%) |
| convert_largest | 54.482 ms | n/a | 280.459 ms | 5.15x (+414.77%) | n/a | n/a |
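The speedup and throughput columns are derived from the raw wall-clock times and corpus sizes; as a sanity check (numbers copied from the Michigan Statutes run above):

```python
# Re-derive the reported figures from the raw measurements.
total_bytes = 101_029_525                      # Michigan Statutes corpus size
rs_time, rs_batch_time, py_time = 2.266594, 0.538012, 29.654787

speedup = py_time / rs_time                    # rs vs py, convert_all
batch_speedup = py_time / rs_batch_time        # batch vs py, convert_all
throughput_mib_s = total_bytes / (1024 ** 2) / rs_time
speedup_pct = (speedup - 1) * 100              # the "+...%" form in the tables
```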

