doctr
Deterministic, local-first indexing. Plug any LLM later for chat.

Install:

    pip install -e '.[dev,office,docling]'

Quick start:

    from doctr import index_document

    idx = index_document(
        "/path/to/report.pdf",
        include_embedded=True,
        max_embedded_depth=2,
    )
    tree = idx.to_pageindex_dict(include_empty_nodes=False)

1. Document Input

Accept `pdf/docx/xlsx/xlsm/md/txt/msg` files through a single entrypoint.

    from doctr import index_document

    idx = index_document("/path/to/file.docx")
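
A single entrypoint for several formats implies some check on the file type before indexing. The sketch below shows one way a caller could validate a path up front, assuming the format is decided by file suffix. `SUPPORTED_SUFFIXES` and `check_supported` are hypothetical names for illustration and are not part of doctr's API.

```python
from pathlib import Path

# Formats listed in this README; the set name and helper are illustrative only.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".xlsm", ".md", ".txt", ".msg"}


def check_supported(path: str) -> str:
    """Return the lowercased suffix, or raise ValueError if it is not listed."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return suffix
```

Checking before the call keeps batch jobs from failing halfway through a folder that mixes supported and unsupported files.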

2. Docling Conversion

Use Docling for layout analysis, OCR, and reading-order extraction.

    from doctr import DoclingConverterAdapter

    converted = DoclingConverterAdapter().convert("/path/to/file.pdf")

3. Build Tree Index

Produce PageIndex-style nodes with IDs and page ranges.

    from doctr import DocumentPipeline

    p = DocumentPipeline()
    idx = p.build_tree_index(converted=converted)
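
To make "nodes with IDs and page ranges" concrete, here is a minimal sketch of how a flat heading list could be nested into such nodes. It is an illustration under stated assumptions, not doctr's implementation: `build_tree`, the `(level, title, start_page)` input shape, the zero-padded IDs assigned in document order, and the rule that a node ends where the next heading at the same or a shallower level starts are all hypothetical.

```python
from typing import Any


def build_tree(
    headings: list[tuple[int, str, int]], last_page: int
) -> list[dict[str, Any]]:
    """Nest (level, title, start_page) headings into PageIndex-style nodes."""
    roots: list[dict[str, Any]] = []
    stack: list[tuple[int, dict[str, Any]]] = []  # path from a root to the current node
    for i, (level, title, start) in enumerate(headings):
        # Assumed end rule: start page of the next heading at the same or a
        # shallower level, falling back to the last page of the document.
        end = last_page
        for next_level, _, next_start in headings[i + 1:]:
            if next_level <= level:
                end = next_start
                break
        node = {
            "title": title,
            "node_id": f"{i:04d}",  # zero-padded, assigned in document order
            "start_index": start,
            "end_index": end,
            "nodes": [],
        }
        # Pop back up until the top of the stack is a strict ancestor.
        while stack and stack[-1][0] >= level:
            stack.pop()
        (stack[-1][1]["nodes"] if stack else roots).append(node)
        stack.append((level, node))
    return roots
```

Because IDs and ranges depend only on heading order and page numbers, the same input always yields the same tree, which fits the deterministic indexing goal stated at the top.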

4. Retrieve for Chat

Retrieve compact context, then send it to Sonar or any other model.

    ctx = p.retrieve_for_chat(
        idx, "What changed in supervision?", top_k=6
    )
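
The retrieve-then-send pattern can be sketched without any model dependency. The example below ranks node dicts by keyword overlap with the question and formats the winners into a prompt string. It is a toy stand-in, not the ranking that `retrieve_for_chat` uses: `tokenize`, `top_k_nodes`, `build_prompt`, and the assumption that each node carries `node_id`, `title`, and an optional `summary` are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_nodes(nodes: list[dict], question: str, top_k: int = 6) -> list[dict]:
    """Rank nodes by how many question words appear in title + summary."""
    q = tokenize(question)
    scored = [
        (len(q & tokenize(n["title"] + " " + n.get("summary", ""))), n)
        for n in nodes
    ]
    # list.sort is stable, so ties keep document order and results are deterministic.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [n for score, n in scored[:top_k] if score > 0]


def build_prompt(nodes: list[dict], question: str) -> str:
    """Join the selected nodes into a compact context block for a chat model."""
    context = "\n".join(
        f"[{n['node_id']}] {n['title']}: {n.get('summary', '')}" for n in nodes
    )
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The resulting string can be passed as the user message to whichever chat API is in use; keeping retrieval separate from the model call is what lets the model be swapped later.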

Example PageIndex-style output:

    {
      "title": "Financial Stability",
      "node_id": "0006",
      "start_index": 21,
      "end_index": 22,
      "summary": "The Federal Reserve ...",
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "node_id": "0007",
          "start_index": 22,
          "end_index": 28,
          "summary": "The Federal Reserve's monitoring ..."
        }
      ]
    }
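
A nested dict of this shape is easy to consume with a short recursive walk. The sketch below prints an indented outline from any such tree. `iter_nodes` and `outline` are hypothetical helper names, and the code assumes only the keys visible in the example above, with `nodes` optional on leaf nodes.

```python
from typing import Any, Iterator


def iter_nodes(
    node: dict[str, Any], depth: int = 0
) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):  # leaf nodes may omit "nodes"
        yield from iter_nodes(child, depth + 1)


def outline(root: dict[str, Any]) -> list[str]:
    """Render one indented line per node: ID, title, and index range."""
    return [
        f"{'  ' * depth}{n['node_id']} {n['title']} "
        f"({n['start_index']}-{n['end_index']})"
        for depth, n in iter_nodes(root)
    ]
```

Calling `print("\n".join(outline(tree)))` on a loaded tree gives a quick table of contents for checking that sections and ranges look sensible before wiring up retrieval.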

API:

    index_document(...)
    DocumentIndexer.index_document(...)
    DocumentIndexer.index_with_ocr(...)
    DocumentPipeline.document_input(...)
    DocumentPipeline.docling_conversion(...)
    DocumentPipeline.build_tree_index(...)
    DocumentPipeline.retrieve_for_chat(...)
    retrieve_context(...)