Metadata-Version: 2.2
Name: cutysoup
Version: 0.1.1
Summary: High-performance HTML parsing library with BeautifulSoup4-compatible API
Author: CutySoup Team
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: C++
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Project-URL: Homepage, https://github.com/cutysoup/cutysoup
Project-URL: Repository, https://github.com/cutysoup/cutysoup
Requires-Python: >=3.7
Requires-Dist: beautifulsoup4>=4.13.4
Requires-Dist: chardet>=5.2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.4; extra == "dev"
Requires-Dist: beautifulsoup4>=4.9.0; extra == "dev"
Description-Content-Type: text/markdown

# CutySoup

High-performance HTML parsing for Python with a BeautifulSoup4-compatible API (Gumbo + C++ via pybind11).

## Why CutySoup

If you like the BeautifulSoup4 API but need more throughput, CutySoup moves the heavy work (parsing and common queries) into a native extension while keeping the same mental model: `find`, `find_all`, `select`, `get_text`, navigation, and DOM manipulation.
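
To make the shared mental model concrete, here is a rough stdlib sketch of what a `find_all("p", class_=...)`-style query has to do while walking a document. CutySoup performs the equivalent traversal in its native extension, which is where the speedup comes from; the `ClassFinder` name and structure below are illustrative only, not CutySoup internals:

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collect the text of tags matching a target tag name and class,
    roughly the work behind a find_all("p", class_="intro") call."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self._in_match = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self._in_match = True
            self.results.append("")

    def handle_data(self, data):
        if self._in_match:
            self.results[-1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._in_match = False

finder = ClassFinder("p", "intro")
finder.feed("<html><body><p class='intro'>Hello</p><p>skip</p></body></html>")
print(finder.results)  # ['Hello']
```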

Status: the repo's compatibility suite reports **100% BeautifulSoup4 API coverage**, with **148/148 tests passing**.

## Install

```bash
pip install cutysoup
```

If you're installing from source (or no wheel exists for your platform), you'll need a C++ toolchain and CMake.

Platforms: macOS, Linux, and Windows (binary wheels when available; otherwise builds from source).

## Quickstart

```python
from cutysoup import CutySoup

soup = CutySoup("<html><body><p class='intro'>Hello</p></body></html>")

print(soup.p.text)                 # Hello
print(soup.find("p", class_="intro"))
print(soup.select_one("p.intro"))
```

## Performance

Benchmarks vary with hardware, Python version, and document shape. The numbers below were collected with:

- `cutysoup==0.1.0` (this repo build)
- `beautifulsoup4==4.13.4`
- Python `3.8.20`

Command:

```bash
uv run python benchmarks/portable_benchmark.py --sizes l
```

For the `l` fixture (about **639 KB HTML**, **2000** repeated `<article>` blocks), the mean speedups vs BeautifulSoup4 (`html.parser`) were:

| Operation | Speedup |
| --- | ---: |
| `parse` | **7.5x** |
| `parse_and_scrape_articles(loop)` | **3.8x** |
| `find_all_tag(p)` | **178x** |
| `find_all_class(.text)` | **1762x** |
| `select_descendant(article .link)` | **7.3x** |
| `href_batch(all a href)` | **18.8x** |
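
Absolute numbers will differ on your machine. For a quick sanity check without the full benchmark script, a minimal harness can be sketched with the stdlib `timeit` module; `parse_stdlib` below is only a stand-in parse step using `html.parser` — swap in a `CutySoup(doc)` or `BeautifulSoup(doc, "html.parser")` call to compare the two libraries:

```python
import timeit
from html.parser import HTMLParser

# Synthetic document shaped like the benchmark fixtures (repeated <article> blocks).
DOC = "<html><body>" + "<article><p class='text'>hi</p></article>" * 200 + "</body></html>"

def parse_stdlib(doc=DOC):
    """Stand-in parse step; replace with CutySoup(doc) or BeautifulSoup(doc, 'html.parser')."""
    parser = HTMLParser()
    parser.feed(doc)
    parser.close()

def bench(fn, repeat=5, number=10):
    """Mean seconds per call from the fastest of `repeat` runs of `number` calls each."""
    runs = timeit.repeat(fn, repeat=repeat, number=number)
    return min(runs) / number

print(f"parse: {bench(parse_stdlib) * 1e3:.2f} ms")
```

Taking the minimum over several runs (rather than the mean of all runs) is the usual way to reduce noise from background load.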

To reproduce the full table (including `s` and `m` sizes) and write a JSON report:

```bash
uv run python benchmarks/portable_benchmark.py --sizes s,m,l --json-out /tmp/cutysoup_portable_benchmark.json
```

## Extra APIs (Optional)

CutySoup ships convenience helpers that go beyond the BeautifulSoup4 API and are useful for performance-sensitive scraping:

```python
from cutysoup import CutySoup

soup = CutySoup('<a href="/one">One</a><a href="/two">Two</a>')
print(soup.get_all_hrefs())  # ['/one', '/two']
```
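
In plain BeautifulSoup4 the same extraction is a Python-level loop, `[a.get("href") for a in soup.find_all("a")]`. The stdlib sketch below shows the per-element work that loop implies — presumably the kind of per-call overhead a batched helper like `get_all_hrefs` avoids by doing one native pass:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Gather every <a href="..."> value in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

collector = HrefCollector()
collector.feed('<a href="/one">One</a><a href="/two">Two</a>')
print(collector.hrefs)  # ['/one', '/two']
```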

There is also a structured extraction helper, `soup.query(...)`, which pulls data out of a document using CSS selectors and transforms; see `docs/CSQL_GUIDE.md`.

## Development

Recommended workflow (creates a virtual environment and builds the native extension):

```bash
uv sync
```

Run tests:

```bash
uv run pytest tests/
```

Local build script (alternative):

```bash
python dev_build.py
```

## License

MIT
