Metadata-Version: 2.4
Name: geotech-report-extraction
Version: 0.5.3
Summary: Extract geotechnical data from PDF reports and output DIGGS XML
Author-email: Sean O'Connell <soconnell345@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/soconnell345/GeotechReportExtraction
Project-URL: Repository, https://github.com/soconnell345/GeotechReportExtraction
Project-URL: Issues, https://github.com/soconnell345/GeotechReportExtraction/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: lxml>=4.9
Requires-Dist: xgboost>=2.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: pandas>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24.0; extra == "pdf"
Provides-Extra: ocr
Requires-Dist: pymupdf>=1.24.0; extra == "ocr"
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Requires-Dist: Pillow>=9.0; extra == "ocr"
Provides-Extra: vision
Requires-Dist: anthropic>=0.30.0; extra == "vision"
Provides-Extra: validation
Requires-Dist: pydiggs>=0.1.5; extra == "validation"
Provides-Extra: geo
Requires-Dist: pyproj>=3.6; extra == "geo"
Requires-Dist: geopy>=2.4; extra == "geo"
Requires-Dist: folium>=0.15; extra == "geo"
Requires-Dist: shapely>=2.0; extra == "geo"
Requires-Dist: geopandas>=0.14; extra == "geo"
Requires-Dist: contextily>=1.5; extra == "geo"
Requires-Dist: utm>=0.7; extra == "geo"
Requires-Dist: mgrs>=1.4; extra == "geo"
Requires-Dist: matplotlib>=3.7; extra == "geo"
Provides-Extra: all
Requires-Dist: geotech-report-extraction[geo,ocr,validation,vision]; extra == "all"
Dynamic: license-file

# Geotech Report Extraction

Extract geotechnical borehole data from PDF reports or Azure Document Intelligence JSON exports and output [DIGGS 2.6](https://diggsml.org/) XML.

[![PyPI version](https://badge.fury.io/py/geotech-report-extraction.svg)](https://pypi.org/project/geotech-report-extraction/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- Parse borehole logs from geotechnical reports (Langan, Schnabel, and generic formats)
- Extract soil layers, SPT blow counts, groundwater levels, and lab test results
- **Azure Document Intelligence (DI) JSON input** for cloud/serverless workflows
- **Palantir Foundry integration** with ready-to-use Spark transforms
- XGBoost page classifier for automatic boring log identification
- Template-based appendix cover page detection and report structure analysis
- Optional vision-based extraction using Anthropic Claude or GPT-4o via Palantir Funhouse
- Output DIGGS 2.6 XML for interoperability
- Geospatial utilities for coordinate conversion and boring location mapping

## Installation

```bash
pip install geotech-report-extraction
```

Core dependencies include XGBoost, scikit-learn, and pandas for ML-based page classification.
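The classifier itself is internal to the package, but the general idea behind text-based page classification can be sketched with TF-IDF features and a gradient-boosted model. This illustration uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, and the page texts and labels are invented toy data, not the package's training set:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy page texts: boring logs mention depths and blow counts, other pages do not.
pages = [
    "BORING LOG B-1 DEPTH FT BLOWS PER 6 IN SPT N-VALUE",
    "BORING LOG B-2 SAMPLE DEPTH BLOWS SPT GROUNDWATER",
    "TABLE OF CONTENTS INTRODUCTION SITE DESCRIPTION",
    "APPENDIX A LABORATORY TEST RESULTS SUMMARY",
]
labels = [1, 1, 0, 0]  # 1 = boring log page

# Turn page text into sparse TF-IDF features.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(pages)

# Fit a small gradient-boosted classifier on the toy data.
clf = GradientBoostingClassifier(n_estimators=10, random_state=0)
clf.fit(X.toarray(), labels)

# Classify an unseen page.
new_page = ["BORING LOG B-3 DEPTH BLOWS SPT"]
pred = clf.predict(vectorizer.transform(new_page).toarray())
```

In the real package the features and model are trained on labeled report pages; this sketch only shows the shape of the pipeline.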

### Optional extras

```bash
# PDF parsing (PyMuPDF)
pip install "geotech-report-extraction[pdf]"

# OCR support (Tesseract)
pip install "geotech-report-extraction[ocr]"

# Vision LLM extraction (Anthropic Claude)
pip install "geotech-report-extraction[vision]"

# Geospatial utilities (coordinate conversion, mapping)
pip install "geotech-report-extraction[geo]"

# Everything
pip install "geotech-report-extraction[all]"
```

The quotes keep shells such as zsh from interpreting the square brackets as glob patterns.
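As one example of what the `geo` extra supports, boring coordinates in latitude/longitude can be projected to UTM with `pyproj` (one of the extra's dependencies). This is a generic pyproj sketch, independent of this package's own API; the coordinates are an arbitrary example point:

```python
from pyproj import Transformer

# WGS84 lat/lon -> UTM zone 18N (EPSG:32618).
# always_xy=True means arguments are (lon, lat) and results are (easting, northing).
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32618", always_xy=True)
easting, northing = transformer.transform(-74.0, 40.7)
```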

## Quick Start

### From PDF

```python
from geotech_report_extraction import extract_report

result = extract_report("report.pdf")

# With vision LLM
result = extract_report("report.pdf", use_vision=True, vision_api_key="sk-...")
```

### From Azure Document Intelligence JSON

```python
from geotech_report_extraction.di_reader import extract_from_di_json

result = extract_from_di_json("report_di.json")

# Or with a pre-parsed dict
result = extract_from_di_json(di_data_dict, file_label="my_report")
```

### Palantir Foundry

See `foundry_transforms/boring_log_pipeline.py` for a three-stage Spark pipeline:

1. **Flatten** raw DI JSON files into a per-page tabular dataset
2. **Identify** boring log pages and group by boring ID
3. **Extract** samples, soil layers, and water levels per boring
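Outside Foundry, the first (flatten) stage can be approximated locally with pandas, a core dependency. This sketch assumes the Azure DI v3 layout result shape (`analyzeResult.pages[].lines[].content`); the helper name and the sample payload are illustrative, not part of this package's API:

```python
import pandas as pd

def flatten_di_pages(di: dict, file_label: str) -> pd.DataFrame:
    """Flatten a DI layout result into one row per page of concatenated text."""
    rows = []
    for page in di.get("analyzeResult", {}).get("pages", []):
        text = " ".join(line["content"] for line in page.get("lines", []))
        rows.append({
            "file": file_label,
            "page_number": page.get("pageNumber"),
            "text": text,
        })
    return pd.DataFrame(rows)

# Minimal synthetic DI result for demonstration.
di = {
    "analyzeResult": {
        "pages": [
            {"pageNumber": 1, "lines": [{"content": "BORING LOG B-1"},
                                        {"content": "DEPTH 5 FT"}]},
            {"pageNumber": 2, "lines": [{"content": "APPENDIX A"}]},
        ]
    }
}
df = flatten_di_pages(di, "my_report")
```

The Spark version in `foundry_transforms/boring_log_pipeline.py` does the same flattening at dataset scale before the identify and extract stages.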

### Report Structure Analysis

After page classification, the appendix cover page analyzer identifies report structure:

```python
from geotech_report_extraction.report_structure import analyze_report_structure

# pages: list of dicts with 'text' and 'predicted_class' keys
structure = analyze_report_structure(pages)

for section in structure.sections:
    print(f"Appendix {section.letter}: {section.title} ({section.appendix_type})")
    print(f"  Pages {section.start_page}–{section.end_page}")
```

The analyzer builds a per-report template from classifier-predicted covers, then uses that template to confirm predictions and recover covers the classifier missed. It works across firm formats (Schnabel's structured headers, Langan's minimal covers).
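The template-matching idea can be illustrated with stdlib fuzzy matching. This is only an illustration of the approach, not the package's actual implementation; the template string, page texts, and 0.5 threshold are invented for the example:

```python
from difflib import SequenceMatcher

def cover_score(template: str, page_text: str) -> float:
    """Similarity between a known cover-page template and a candidate page."""
    return SequenceMatcher(None, template.lower(), page_text.lower()).ratio()

# Template learned from a classifier-predicted cover page.
template = "APPENDIX A BORING LOGS"

candidates = {
    3: "APPENDIX A BORING LOGS",          # the confirmed cover
    7: "APPENDIX B LABORATORY RESULTS",   # plausibly another cover
    12: "SILTY SAND TRACE GRAVEL N=12",   # boring log content, not a cover
}

scores = {page: cover_score(template, text) for page, text in candidates.items()}
covers = [page for page, score in scores.items() if score > 0.5]
```

A real analyzer would also use positional cues (covers precede runs of same-class pages), but the scoring step looks roughly like this.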

## CLI

```bash
geotech-extract report.pdf -o output.xml
```

## License

MIT
