Metadata-Version: 2.4
Name: groundmark
Version: 0.3.6
Summary: PDF to Markdown conversion and quote-to-bbox resolution
Project-URL: Homepage, https://github.com/populationgenomics/groundmark
Project-URL: Bug Tracker, https://github.com/populationgenomics/groundmark/issues
Author-email: Tobias Sargeant <tobias.sargeant@gmail.com>
License: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Requires-Dist: pypdf
Requires-Dist: pypdfium2
Requires-Dist: seq-smith
Provides-Extra: anthropic
Requires-Dist: pydantic-ai-slim[anthropic]; extra == 'anthropic'
Provides-Extra: bedrock
Requires-Dist: pydantic-ai-slim[bedrock]; extra == 'bedrock'
Provides-Extra: google
Requires-Dist: pydantic-ai-slim[google]; extra == 'google'
Provides-Extra: openai
Requires-Dist: pydantic-ai-slim[openai]; extra == 'openai'
Description-Content-Type: text/markdown

# groundmark

<img src="https://raw.githubusercontent.com/populationgenomics/groundmark/main/groundmark.webp" alt="groundmark" width="200">

PDF to Markdown conversion and quote-to-bbox resolution.

## What it does

1. **Convert**: Send PDF pages to a vision-capable LLM (via [Pydantic AI](https://ai.pydantic.dev/)) to produce clean Markdown with `<!--page-->` markers between pages.
2. **Resolve**: Given verbatim quote strings, locate them in the source PDF and return bounding box coordinates. Uses [pypdfium2](https://github.com/pypdfium2-team/pypdfium2) for per-character bbox extraction and [seq-smith](https://github.com/populationgenomics/seq-smith) for Smith-Waterman alignment.
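
To give a feel for the alignment step, here is a minimal pure-Python sketch of Smith-Waterman local alignment. It is not seq-smith's API and not groundmark's implementation; the function name `smith_waterman` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration. It shows why a quote can still be located when the PDF text differs from it slightly.

```python
def smith_waterman(query: str, text: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> tuple[int, int, int]:
    """Return (score, start, end) so that text[start:end] best matches query locally."""
    rows, cols = len(query) + 1, len(text) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells are clamped at 0 so an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == text[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # match or mismatch
                              score[i - 1][j] + gap,      # query char missing from text
                              score[i][j - 1] + gap)      # extra char in text
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == text[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


# The extracted PDF text has a dropped letter ("presnted"); the quote is still found.
text = "xxx the patient presnted with fever"
_, start, end = smith_waterman("the patient presented with", text)
print(text[start:end])  # -> the patient presnted with
```

Once the aligned character span is known, the per-character bounding boxes for that span can be merged into the boxes returned for the quote.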

## Quick Start

```python
import asyncio
from groundmark import DocumentIndex
from groundmark.convert import Config, convert

async def main():
    # Read the PDF; the context manager closes the file handle promptly
    with open("document.pdf", "rb") as f:
        pdf_bytes = f.read()

    # PDF -> Markdown (requires pydantic-ai, install with e.g. groundmark[bedrock])
    result = await convert(pdf_bytes, Config(model="bedrock:au.anthropic.claude-sonnet-4-6"))
    print(result.markdown[:500])

    # Resolve verbatim quotes to PDF bounding boxes
    doc = DocumentIndex(pdf_bytes)
    resolved = doc.resolve(["the patient presented with"])
    # -> {"the patient presented with": [(page, BBox(top, left, bottom, right)), ...]}

    # The DocumentIndex can be reused for multiple resolve calls against the same PDF
    more = doc.resolve(["another quote from the same paper"])

if __name__ == "__main__":
    asyncio.run(main())
```
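
Because the converted Markdown carries `<!--page-->` markers between pages, it can be split back into per-page chunks. The helper below is a hypothetical sketch, not part of groundmark; it assumes the marker appears exactly as `<!--page-->` and only between pages.

```python
PAGE_MARKER = "<!--page-->"  # marker described in "What it does"


def split_pages(markdown: str) -> list[str]:
    """Split converted Markdown into one string per page (hypothetical helper)."""
    return [page.strip() for page in markdown.split(PAGE_MARKER)]


pages = split_pages("# Title\n\nIntro text.\n\n<!--page-->\n\n## Methods")
for number, page in enumerate(pages, start=1):
    print(f"page {number}: {len(page)} characters")
```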

## Installation

```bash
# Resolve only (no LLM dependencies)
uv add groundmark

# With an LLM provider extra for conversion (one or more of: anthropic, bedrock, google, openai)
uv add "groundmark[bedrock]"
```

## Configuration

### Timeouts

The LLM call for PDF-to-Markdown conversion can take several minutes for large documents, especially with Opus on Bedrock. Timeout defaults by provider:

| Provider | Default | Environment Variable |
|----------|---------|---------------------|
| Bedrock (boto3) | 300s | `AWS_READ_TIMEOUT` |
| Anthropic (httpx) | 600s | — (use `ModelSettings(timeout=...)`) |

For Bedrock with Opus, 300s may not be enough. Set a higher timeout:

```bash
export AWS_READ_TIMEOUT=600
```
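
The same setting can be applied from Python when exporting a shell variable is inconvenient, for example in a notebook. This is a sketch under one assumption: that the variable is read when the Bedrock client is created, so it has to be set before the first conversion call.

```python
import os

# Assumption: AWS_READ_TIMEOUT is read when the Bedrock client is created,
# so set it before calling convert(). setdefault keeps any value already exported.
os.environ.setdefault("AWS_READ_TIMEOUT", "600")
```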

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
