Metadata-Version: 2.4
Name: markdownify-rs
Version: 0.1.5
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Operating System :: OS Independent
Summary: Rust implementation of Python markdownify with a Python API
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# markdownify-rs

Rust implementation of Python markdownify with output parity as the primary goal.

## Python bindings

Build and install locally with maturin (uv):
```bash
uv venv
uv pip install maturin
.venv/bin/maturin develop --features python
```

Build via pip (PEP 517):
```bash
uv pip install .
```

Usage:
```python
from markdownify_rs import markdownify

print(markdownify("<b>Hello</b>"))
```

BS4-style read-only HTML API:
```python
from markdownify_rs import Soup

soup = Soup("<div class='main'><p>Hello</p></div>")
p = soup.find("p")
print(p.get_text())                  # "Hello"
print(soup.select("div.main > p"))   # [<Tag <p>Hello</p>>]
print(p.parent.name)                 # "div"
```

BS4 batch query API (GIL-free execution in Rust):
```python
from markdownify_rs import QueryPlan, run_batch

plan = QueryPlan()
plan.add_select_count("links", "a[href]")
plan.add_select_one_text("title", "title")
plan.add_select_all_texts("link_texts", "a", separator=" ", strip=True)
plan.add_get_text("all_text", " ", True)

rows = run_batch(["<html><title>A</title><a href='/x'>x</a></html>"], plan)
print(rows[0]["links"], rows[0]["title"], rows[0]["link_texts"])  # 1 A ['x']
```

Full Python bs4 API + QueryPlan guide:
- `PYTHON_BS4_API.md`

Markdown -> HTML usage (Python-Markdown-style API):
```python
from markdownify_rs import markdown, Markdown

print(markdown("# Hello", extensions=["tables", "footnotes"]))

md = Markdown(extensions=["toc", "admonition"])
print(md.convert("[TOC]\n\n# Title"))
print(md.toc_tokens)  # [{"level": 1, "id": "...", "name": "Title"}]

# Optional CommonMark/GFM-ish toggles (default False):
print(markdown("visit https://example.com", autolink=True))
print(markdown("- [ ] task", tasklist=True))
print(markdown("~~old~~", strikethrough=True))

# Parsing mode:
# - python_compat (default): best-effort Python-Markdown compatibility
# - fast: pure comrak/CommonMark fast path
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="python_compat"))
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="fast"))
```

Full Python markdown-to-HTML API quickstart (all public args):
- `PYTHON_MARKDOWN_TO_HTML_QUICKSTART.md`

Batch usage (parallelized in Rust):
```python
from markdownify_rs import markdownify_batch

outputs = markdownify_batch(["<b>Hello</b>", "<i>World</i>"])
```

Markdown-adjacent utilities (submodule):
```python
from markdownify_rs.markdown_utils import (
    split_into_chunks,
    split_into_chunks_batch,
    coalesce_small_chunks,
    link_percentage,
    link_percentage_batch,
    filter_by_link_percentage,
    strip_links_with_substring,
    strip_links_with_substring_batch,
    remove_large_tables,
    remove_large_tables_batch,
    remove_lines_with_substring,
    remove_lines_with_substring_batch,
    fix_newlines,
    fix_newlines_batch,
    split_on_dividers,
    strip_html_and_contents,
    strip_html_and_contents_batch,
    strip_data_uri_images,
    text_pipeline_batch,
)

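# Sample inputs (placeholders; substitute your own Markdown strings).
text = "# Title\n\nSee [a link](javascript:void(0)) for details.\n"
text1 = text
text2 = "## Other\n\nPlain paragraph.\n"

chunks = split_into_chunks(text, how="sections")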
chunks_batch = split_into_chunks_batch([text1, text2], how="sections")
cleaned = strip_links_with_substring(text, "javascript")
cleaned_batch = strip_links_with_substring_batch([text1, text2], "javascript")
filtered = filter_by_link_percentage([text1, text2], threshold=0.5)
pipelined = text_pipeline_batch(
    [text1, text2],
    steps=[
        ("strip_links_with_substring", {"substring": "javascript"}),
        ("remove_large_tables", {"max_cells": 200}),
        ("fix_newlines", {}),
    ],
)
```

Notes:
- `code_language_callback` is not yet supported in the Python bindings.

CLI:
```bash
markdownify-rs input.html
cat input.html | markdownify-rs
```

## Parity hacks (scraper vs. BeautifulSoup)

These are explicit, ad hoc behaviors added on top of `scraper`/`html5ever` to match
`python-markdownify` (BeautifulSoup + html.parser) output. They are intentionally
quirky and may be replaced with more “correct” behavior once parity is stable.

- **`<br>` parser quirk**: With BeautifulSoup’s html.parser, if a non‑self‑closing
  `<br>` appears before a self‑closing `<br/>`, the later `<br/>` can be treated as
  an opening `<br>`. Its contents then run until that implicit `<br>` is closed,
  usually when its parent closes. We emulate this by removing the content between
  that `<br/>` and the closing tag that ends the implicit `<br>`, which matches
  python-markdownify’s output. `<br>` tags inside comments/scripts are ignored.
- **Leading whitespace reconstruction**: html.parser preserves whitespace‑only text
  nodes that html5ever drops (notably between `<html>` children and at the start of
  `<body>`). We reconstruct the normalized leading whitespace prefix (using the same
  “single space vs. single newline” rules as BeautifulSoup’s `endData`) and merge it
  with the converter output, carrying it across non‑block tags and empty custom
  elements whose contents are only comments/whitespace.
- **Table header inference**: To match python-markdownify, we do not force a “---”
  separator row for tables whose header row is effectively empty.
- **Top-level `<td>/<th>` wrapping**: If input is a bare `<td>`/`<th>`, we wrap it
  in a `<table><tr>…</tr></table>` fragment to align with python-markdownify output.
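
Two of the simpler rules above can be illustrated in plain Python. This is a minimal sketch, not the crate's Rust implementation: the function names `collapse_whitespace_only` and `wrap_bare_cell` are hypothetical, and the whitespace rule assumes that a whitespace-only text node collapses to a single newline if it contains one and to a single space otherwise.

```python
import re

# ASCII whitespace characters considered by the collapse rule (an assumption).
ASCII_SPACES = " \n\t\f\r"


def collapse_whitespace_only(text: str) -> str:
    """Collapse a whitespace-only string to a single newline or a single space."""
    if text and not text.strip(ASCII_SPACES):
        return "\n" if "\n" in text else " "
    # Strings with any non-whitespace content are returned unchanged.
    return text


# Matches input that starts with a <td> or <th> tag (but not <thead>, <tbody>, ...).
_BARE_CELL = re.compile(r"\s*<(td|th)\b", re.IGNORECASE)


def wrap_bare_cell(html: str) -> str:
    """Wrap a bare table cell in a minimal <table><tr>...</tr></table> fragment."""
    if _BARE_CELL.match(html):
        return f"<table><tr>{html}</tr></table>"
    return html


print(repr(collapse_whitespace_only("  \n  ")))
print(repr(collapse_whitespace_only("   ")))
print(wrap_bare_cell("<td>x</td>"))
```

The real converter applies these rules while walking the parsed tree rather than on raw strings; the string-based version here only shows the intended input/output relationship.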

## Benchmarks

Datasets
- Michigan Statutes (JSONL, 241 HTML documents).
  - Total HTML bytes: 101,029,525 (~96.35 MiB).
  - Largest document: 8,034,686 bytes (~7.66 MiB).
  - Source file size: 102,856,616 bytes (~98.10 MiB).
- Law websites (CSV, 3,136 HTML documents).
  - Total HTML bytes: 111,747,114 (~106.57 MiB).
  - Largest document: 1,381,380 bytes (~1.32 MiB).
  - Source file size: 148,486,852 bytes (~141.61 MiB).

Run
```bash
# Michigan Statutes (JSONL)
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify --dist-name markdownify --label markdownify

# Law websites (CSV)
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify --dist-name markdownify --label markdownify
```

Python binding comparison (both run through Python, 2026-01-28, Apple M3, macOS 14.6 / Darwin 24.6.0, Python 3.13.0)

Michigan Statutes (JSONL)
- `markdownify_rs` `convert_all` (241 docs): time 2.266594 s, throughput 42.508 MiB/s
- `markdownify_rs` `convert_all_batch` (241 docs): time 0.538012 s, throughput 179.084 MiB/s
- `markdownify_rs` `convert_largest` (8,034,686 bytes): time 187.941 ms, throughput 40.771 MiB/s
- `markdownify` `convert_all` (241 docs): time 29.654787 s, throughput 3.249 MiB/s
- `markdownify` `convert_largest` (8,034,686 bytes): time 4.496880 s, throughput 1.704 MiB/s

Speedup summary (wall-clock times: lower is better; speedups: higher is better)
| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.266594 s | 0.538012 s | 29.654787 s | 13.08x (+1208.34%) | 55.12x (+5411.92%) | 4.21x (+321.29%) |
| convert_largest | 187.941 ms | n/a | 4.496880 s | 23.93x (+2292.71%) | n/a | n/a |

Law websites (CSV)
- `markdownify_rs` `convert_all` (3,136 docs): time 2.596691 s, throughput 41.041 MiB/s
- `markdownify_rs` `convert_all_batch` (3,136 docs): time 0.672013 s, throughput 158.584 MiB/s
- `markdownify_rs` `convert_largest` (1,381,380 bytes): time 54.482 ms, throughput 24.180 MiB/s
- `markdownify` `convert_all` (3,136 docs): time 17.680570 s, throughput 6.028 MiB/s
- `markdownify` `convert_largest` (1,381,380 bytes): time 280.459 ms, throughput 4.697 MiB/s

Speedup summary (wall-clock times: lower is better; speedups: higher is better)
| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.596691 s | 0.672013 s | 17.680570 s | 6.81x (+580.89%) | 26.31x (+2530.99%) | 3.86x (+286.40%) |
| convert_largest | 54.482 ms | n/a | 280.459 ms | 5.15x (+414.77%) | n/a | n/a |
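
The throughput and speedup figures in this section can be reproduced from the byte counts and wall-clock times. The sketch below uses the Michigan Statutes `convert_all` numbers. The helper names are hypothetical and not part of `scripts/bench_python.py`, and it assumes throughput is total HTML bytes divided by elapsed time, and speedup is baseline time divided by candidate time.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(total_bytes: int, seconds: float) -> float:
    """Throughput in MiB/s: input bytes processed per second of wall-clock time."""
    return total_bytes / MIB / seconds


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def percent_faster(ratio: float) -> float:
    """Express a speedup ratio as a percentage gain (2.0x -> +100%)."""
    return (ratio - 1.0) * 100.0


# Michigan Statutes, convert_all (values copied from the results above).
total_html_bytes = 101_029_525
rs_seconds = 2.266594   # markdownify_rs
py_seconds = 29.654787  # markdownify

rs_throughput = throughput_mib_per_s(total_html_bytes, rs_seconds)
ratio = speedup(py_seconds, rs_seconds)

print(f"throughput: {rs_throughput:.3f} MiB/s")          # reported: 42.508 MiB/s
print(f"speedup: {ratio:.2f}x (+{percent_faster(ratio):.2f}%)")  # reported: 13.08x (+1208.34%)
```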

Markdown -> HTML parity/speed report:
```bash
.venv/bin/python scripts/report_markdown_to_html_parity_speed.py \
  --corpus-dir /tmp/test_markdowns \
  --report BENCHMARKS_MARKDOWN_TO_HTML.md
```

