Metadata-Version: 2.4
Name: officemd
Version: 0.1.2
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: MIT License
Requires-Dist: rich>=13.0
License-File: LICENSE
Summary: Native Python bindings for OfficeMD document extraction
Keywords: markdown,docx,xlsx,pptx,pdf,document,extraction
Home-Page: https://github.com/ThomAub/officemd
Author: Thomas Aubry
License: MIT
Requires-Python: >=3.12
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/ThomAub/officemd
Project-URL: Repository, https://github.com/ThomAub/officemd

# officemd

Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean Markdown and a structured JSON IR, with Docling JSON output for DOCX, XLSX, and PPTX.

## Install

```bash
uv add officemd
# or
pip install officemd
```

To run the CLI without adding it to a project:

```bash
uvx officemd markdown report.docx
```

## CLI

```bash
officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd render report.docx
officemd diff old.docx new.docx
```
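
The CLI can also be driven from a Python script when the SDK is not imported directly. The sketch below is a minimal illustration, not part of officemd: `build_markdown_command` and `run_markdown` are hypothetical helpers, and it assumes that `officemd markdown` writes the converted Markdown to standard output. Only the `markdown` subcommand and the `--sheets` flag shown above are used.

```python
import subprocess


def build_markdown_command(path, sheets=None):
    """Build the argument list for `officemd markdown` (hypothetical helper)."""
    cmd = ["officemd", "markdown", str(path)]
    if sheets:
        # --sheets takes a comma-separated list, as in the CLI example above.
        cmd += ["--sheets", ",".join(sheets)]
    return cmd


def run_markdown(path, sheets=None):
    """Run the CLI and return its stdout.

    Assumption: the CLI prints the Markdown to standard output.
    """
    result = subprocess.run(
        build_markdown_command(path, sheets),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.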

## SDK

```python
from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes

content = Path("report.docx").read_bytes()

# Markdown
print(markdown_from_bytes(content, format="docx"))

# Structured JSON IR
print(extract_ir_json(content, format="docx"))

# Docling JSON
print(docling_from_bytes(content, format="docx"))
```

## Supported Formats

| Format | Extension | Markdown | JSON IR | Docling |
|--------|-----------|----------|---------|---------|
| Word | .docx | yes | yes | yes |
| Excel | .xlsx | yes | yes | yes |
| CSV | .csv | yes | yes | - |
| PowerPoint | .pptx | yes | yes | yes |
| PDF | .pdf | yes | yes | - |
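
The table above can be encoded as a small lookup so that callers pick the `format` argument from a file extension and check Docling support before calling `docling_from_bytes`. This is a sketch under stated assumptions: `infer_format`, `supports_docling`, and `to_markdown` are hypothetical helpers, and the format strings other than `"docx"` (`"xlsx"`, `"csv"`, `"pptx"`, `"pdf"`) are assumed to match the extensions, which the SDK example does not confirm.

```python
from pathlib import Path

# Extension -> (format string, Docling supported), transcribed from the table.
# Assumption: format strings other than "docx" mirror the file extension.
SUPPORTED_FORMATS = {
    ".docx": ("docx", True),
    ".xlsx": ("xlsx", True),
    ".csv": ("csv", False),
    ".pptx": ("pptx", True),
    ".pdf": ("pdf", False),
}


def infer_format(path):
    """Return the format string for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return SUPPORTED_FORMATS[suffix][0]


def supports_docling(fmt):
    """Return True if the table lists Docling output for this format."""
    return any(name == fmt and docling for name, docling in SUPPORTED_FORMATS.values())


def to_markdown(path):
    """Convert a file to Markdown using the inferred format."""
    # Imported lazily so the helpers above work without officemd installed.
    from officemd import markdown_from_bytes

    return markdown_from_bytes(Path(path).read_bytes(), format=infer_format(path))
```

Rejecting unknown extensions up front gives a clear error before any bytes are read or passed to the native extension.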

## License

MIT

