doctr

Index PDFs, DOCX, XLSX, and embedded files into a tree.

Deterministic, local-first indexing. Plug in any LLM later for chat.

Quickstart

pip install -e '.[dev,office,docling]'

from doctr import index_document
idx = index_document(
    "/path/to/report.pdf",
    include_embedded=True,
    max_embedded_depth=2,
)
tree = idx.to_pageindex_dict(include_empty_nodes=False)

1. Document Input

Accept `pdf`, `docx`, `xlsx`, `xlsm`, `md`, `txt`, and `msg` files through one entrypoint, `index_document`.

from doctr import index_document
idx = index_document("/path/to/file.docx")
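
Because a single entrypoint handles every format, it can be convenient to filter paths by extension before indexing a whole folder. The following is a minimal sketch of such a check. The extension set is copied from the list above; the `SUPPORTED_EXTENSIONS` constant and the `is_supported` helper are hypothetical names for illustration and are not part of doctr.

```python
from pathlib import Path

# Extensions copied from the list above. The constant and the helper are
# illustrative assumptions, not part of the doctr API.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".xlsm", ".md", ".txt", ".msg"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the accepted list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for candidate in ["/path/to/file.docx", "/path/to/archive.zip"]:
        print(candidate, is_supported(candidate))
```

Paths that pass the check could then be handed to `index_document` one at a time.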

2. Docling Conversion

Use Docling for layout analysis, OCR, and reading-order extraction.

from doctr import DoclingConverterAdapter
converted = DoclingConverterAdapter().convert("/path/to/file.pdf")

3. Tree Index Builder

Produce PageIndex-style nodes with IDs and page ranges.

from doctr import DocumentPipeline
p = DocumentPipeline()
idx = p.build_tree_index(converted=converted)

4. Retrieval/Chat Layer

Retrieve compact context, then send it to Sonar or any other model.

ctx = p.retrieve_for_chat(
    idx,
    "What changed in supervision?",
    top_k=6,
)
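
To make the idea of retrieving a compact context concrete, here is a toy keyword-overlap ranker over a PageIndex-style tree. It is only an illustration of the general approach and is not how `retrieve_for_chat` is implemented, which this page does not describe. The helper names (`tokenize`, `flatten`, `rank_nodes`) are hypothetical, and the tiny tree reuses only the node titles and IDs from the sample output on this page.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flatten(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("nodes", []):
        yield from flatten(child)


def rank_nodes(tree, query, top_k=6):
    """Return up to top_k node IDs ordered by token overlap with the query.

    Nodes with no overlap are dropped; ties are broken by node ID.
    """
    query_tokens = tokenize(query)
    scored = []
    for node in flatten(tree):
        text = f"{node.get('title', '')} {node.get('summary', '')}"
        score = len(query_tokens & tokenize(text))
        if score > 0:
            scored.append((score, node["node_id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [node_id for _, node_id in scored[:top_k]]


# Illustrative tree: titles and IDs only, summaries left out.
TREE = {
    "title": "Financial Stability",
    "node_id": "0006",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007"},
    ],
}

if __name__ == "__main__":
    print(rank_nodes(TREE, "monitoring vulnerabilities"))
```

The text of the selected nodes would then be concatenated and passed to whichever model handles the chat turn.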

Sample Output

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    }
  ]
}
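
The nested `nodes` lists can be walked recursively to print an outline with page ranges. The sketch below copies the sample output above into a Python dict and assumes every node carries `node_id`, `title`, `start_index`, and `end_index`, with children under an optional `nodes` key, as in that sample. The `iter_nodes` helper is a hypothetical name, not a doctr function.

```python
# The sample output above, copied as a Python dict.
SAMPLE = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ...",
        }
    ],
}


def iter_nodes(node, depth=0):
    """Yield (depth, node) pairs for a node and its descendants, depth first."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


if __name__ == "__main__":
    for depth, node in iter_nodes(SAMPLE):
        indent = "  " * depth
        print(f"{indent}{node['node_id']} {node['title']} "
              f"[{node['start_index']}-{node['end_index']}]")
```

The same traversal works on the dict returned by `to_pageindex_dict` if it follows this shape.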

Core Python Functions