Metadata-Version: 2.4
Name: unword
Version: 0.1.0
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Text Processing :: Markup
Summary: Convert legacy MS Word .doc files to Markdown — inspired by antiword
Keywords: word,doc,parser,antiword,markdown
License-Expression: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/dridk/unword

# unword

Convert legacy Microsoft Word `.doc` files (OLE/CFB format) to Markdown. Inspired by [antiword](http://www.winfield.demon.nl/).

Extracts body text with heading levels, page breaks, and textbox contents. No external dependencies (no LibreOffice, no COM).

## Installation

### CLI (Rust)

```bash
cargo build --release
```

### Python

Requires [maturin](https://www.maturin.rs/) and a virtual environment:

```bash
uv venv .venv && source .venv/bin/activate
maturin develop
```

Or build a wheel:

```bash
maturin build --release
pip install target/wheels/unword-*.whl
```

## Usage

### CLI

```bash
# Print to stdout
unword -i document.doc

# Write to file
unword -i document.doc -o output.md
```

### Python

```python
import unword

doc = unword.parse_doc(open("document.doc", "rb").read())

print(doc.body_text)      # Markdown string with headings
print(doc.textboxes)      # List of textbox strings
```

### Rust library

```rust
let data = std::fs::read("document.doc")?;
let doc = unword::parse_doc(&data)?;
println!("{}", doc.body_text);
```

## Output format

- Headings are rendered as `#`, `##`, `###`, etc. based on Word styles
- Paragraphs are separated by blank lines
- Page breaks become `---`
- Textboxes are extracted separately

## Tests

```bash
# Rust
cargo test

# Python
pytest tests/test_python.py
```

## License

MIT

