Metadata-Version: 2.4
Name: pdf-intelligent-splitter
Version: 1.0.0
Summary: Intelligent PDF document splitter based on LLM and OCR
Home-page: https://github.com/loudlous/pdf-intelligent-splitter
Author: loudlous
Author-email: 1948259843@qq.com
Keywords: python,pdf,split,ocr,llm,document,pdf splitter
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: Topic :: Office/Business
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26.4
Requires-Dist: Pillow>=11.3.0
Requires-Dist: psutil>=5.9.8
Requires-Dist: tqdm>=4.67.1
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: paddlepaddle>=2.5.0
Requires-Dist: paddleocr>=2.7.0
Requires-Dist: openai>=1.0.0
Requires-Dist: python-dotenv>=1.0.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary


# PDF Intelligent Splitter

An intelligent PDF document splitting tool based on Large Language Models (LLM) and OCR, designed to split merged PDF documents into multiple independent documents.

## 🚀 Features

1. **Automatic OCR Recognition**: Uses PaddleOCR for text recognition and switches between GPU and CPU automatically
2. **Intelligent Splitting Strategy**:
   - **Priority 1**: Uses PDF table of contents for precise splitting (zero cost, no model calls)
   - **Priority 2**: Uses LLM for intelligent splitting when TOC is unavailable
3. **Universal Document Support**: Works with all document types (legal documents, academic papers, general documents, etc.)
4. **Automatic File Naming**: Generates standardized filenames based on document titles
5. **Configurable Keywords**: Supplementary materials and page type keywords are configurable for easy extension

## 📋 Requirements

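- Python 3.8+ as declared by the package (note that the pinned minimums `numpy>=1.26.4` and `Pillow>=11.3.0` only install on Python 3.9 or newer)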
- GPU support (optional, auto-detected)
- Sufficient disk space for OCR results and split PDFs

## 📦 Installation

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Core Dependencies

**OCR Support (Required)**:
```bash
pip install paddlepaddle paddleocr
```

**PDF Processing (Required)**:
```bash
pip install pymupdf
```

**LLM API (Required)**:
```bash
pip install openai
```

**Optional Dependencies** (optional at runtime, although `python-dotenv` is listed in the package's `Requires-Dist`):
```bash
pip install python-dotenv  # Environment variable support
```

## ⚙️ Configuration

### Environment Variables

Create a `.env` file in the project root (optional):

```bash
# LLM API Configuration
LLM_API_KEY=your_api_key_here
LLM_API_BASE_URL=https://api.example.com/v1

# Or use DeepSeek
DEEPSEEK_API_KEY=your_deepseek_key_here
```

### Security Notice

⚠️ **For security and open-source best practices, this project does NOT hardcode any API keys in the source code.**

You **must** configure LLM access parameters through environment variables:
- `LLM_API_KEY` or `DEEPSEEK_API_KEY`: Your API key (required)
- `LLM_API_BASE_URL`: API base URL (optional, defaults to `https://one-api.maas.com.cn/v1`)

If API keys are not set, the tool will raise an error when attempting to use LLM-based splitting.
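
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `load_llm_config` is hypothetical, and the precedence of `LLM_API_KEY` over `DEEPSEEK_API_KEY` is an assumption.

```python
import os

# Default base URL as stated in this README.
DEFAULT_BASE_URL = "https://one-api.maas.com.cn/v1"


def load_llm_config(env=None):
    """Return (api_key, base_url) from an environment-style mapping, or raise."""
    env = os.environ if env is None else env
    # Assumption: LLM_API_KEY wins when both keys are set.
    api_key = env.get("LLM_API_KEY") or env.get("DEEPSEEK_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Set LLM_API_KEY or DEEPSEEK_API_KEY before using LLM-based splitting"
        )
    base_url = env.get("LLM_API_BASE_URL") or DEFAULT_BASE_URL
    return api_key, base_url
```

Passing a plain dictionary instead of `os.environ` makes the behavior easy to check without touching the real environment.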

## 🎯 Usage

### Basic Usage

```bash
python pdf-split.py <input.pdf> -o <output_dir>
```

### Full Parameters

```bash
# -o               Output directory
# --document-type  Document type: general/legal/academic
# --ocr-json       Use existing OCR JSON file
# --use-gpu        Force GPU usage (if available)
# --use-cpu        Force CPU usage (alternative to --use-gpu)
# --image-scale    Image scale factor (default: 1.0, lower for large files)
python pdf-split.py <input.pdf> \
    -o <output_dir> \
    --document-type <type> \
    --ocr-json <path> \
    --use-gpu \
    --image-scale <scale>
```

### Examples

```bash
# Basic splitting
python pdf-split.py document.pdf -o ./result

# Specify document type
python pdf-split.py academic_papers.pdf -o ./result --document-type academic

# Use existing OCR results (skip OCR step)
python pdf-split.py document.pdf -o ./result --ocr-json ./ocr_result.json

# Large file optimization (reduce memory usage)
python pdf-split.py large_document.pdf -o ./result --image-scale 0.5
```

## 📤 Output

After splitting, the output directory contains:

- `split_points.json`: Split point information (JSON format)
  - `total_pages`: Total number of pages
  - `splits`: List of split results
    - `start_page`: Start page number
    - `end_page`: End page number
    - `title`: Document title
- `*_ocr.json`: OCR recognition results (optional, for subsequent processing)
- `01_<title>.pdf`, `02_<title>.pdf`, ...: Split PDF files
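
For downstream processing, `split_points.json` can be read with the standard library. The sketch below relies only on the field names listed above; the helper name `summarize_splits` is hypothetical.

```python
import json
from pathlib import Path


def summarize_splits(path):
    """Return (total_pages, ['<start>-<end>: <title>', ...]) from split_points.json."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    lines = [
        f'{split["start_page"]}-{split["end_page"]}: {split["title"]}'
        for split in data["splits"]
    ]
    return data["total_pages"], lines
```

For example, `summarize_splits("./result/split_points.json")` would return the page count together with one line per split document.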

## 🔧 How It Works

### 1. OCR Recognition Phase

- Automatically detects GPU availability
- Uses PaddleOCR for text recognition
- Generates simplified OCR JSON (only key information: page number, page height, text, and Y coordinates)

### 2. Splitting Strategy

**Strategy 1: Table of Contents Splitting (Priority)**
- Automatically detects TOC pages in the PDF
- Extracts TOC entries and page numbers
- Splits precisely based on TOC (zero cost, no model calls)
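
As an illustration of the last step, TOC entries can be turned into page ranges with plain arithmetic. This is a sketch under assumptions: entries are `(title, start_page)` tuples with distinct 1-based start pages, and `toc_to_splits` is a hypothetical name, not the tool's internal function.

```python
def toc_to_splits(entries, total_pages):
    """Turn (title, start_page) TOC entries into inclusive 1-based page ranges."""
    ordered = sorted(entries, key=lambda entry: entry[1])
    splits = []
    for i, (title, start) in enumerate(ordered):
        # Each document ends on the page before the next one starts;
        # the last document runs to the end of the PDF.
        end = ordered[i + 1][1] - 1 if i + 1 < len(ordered) else total_pages
        splits.append({"start_page": start, "end_page": end, "title": title})
    return splits
```

Pages before the first TOC entry (such as the TOC itself) are not assigned to any split here and would need separate handling.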

**Strategy 2: LLM Intelligent Splitting**
- Extracts key page information (headers, first 3 lines, page type)
- Builds compact prompts and sends them to the LLM
- LLM analyzes document structure and returns splitting suggestions
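
A compact prompt of this kind might be assembled as in the sketch below. The per-page dictionary keys (`page`, `header`, `lines`, `page_type`), the instruction wording, and the function name `build_prompt` are all assumptions made for illustration; the tool's real prompt is not shown in this README.

```python
def build_prompt(pages, max_lines=3):
    """Build a compact prompt from per-page summaries (hypothetical page format)."""
    rows = []
    for page in pages:
        # Keep only the first few lines of each page to limit token usage.
        first_lines = " | ".join(page["lines"][:max_lines])
        rows.append(
            f'[p{page["page"]}] type={page["page_type"]} '
            f'header={page["header"]!r} text={first_lines}'
        )
    instructions = (
        "The pages below come from several documents merged into one PDF. "
        "Return JSON with a 'splits' list of start_page, end_page, and title."
    )
    return instructions + "\n" + "\n".join(rows)
```

The resulting string could then be sent as a single user message through any OpenAI-compatible chat endpoint.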

### 3. Post-processing

- Corrects overlapping pages
- Merges supplementary materials (appendices, references, etc.)
- Validates complete page coverage
- Normalizes filenames
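
Two of these steps, overlap correction and coverage validation, can be sketched as follows. The function names are hypothetical and the trimming rule (an earlier split gives up the overlapping pages) is an assumption about how overlaps are resolved.

```python
def fix_overlaps(splits):
    """Trim each split so it ends before the next one starts (inclusive 1-based pages)."""
    ordered = sorted(splits, key=lambda s: s["start_page"])
    fixed = []
    for i, split in enumerate(ordered):
        split = dict(split)  # copy so the caller's data is not mutated
        if i + 1 < len(ordered):
            next_start = ordered[i + 1]["start_page"]
            split["end_page"] = min(split["end_page"], next_start - 1)
        fixed.append(split)
    return fixed


def covers_all_pages(splits, total_pages):
    """Return True when the splits cover pages 1..total_pages exactly once each."""
    expected = 1
    for split in sorted(splits, key=lambda s: s["start_page"]):
        if split["start_page"] != expected or split["end_page"] < split["start_page"]:
            return False
        expected = split["end_page"] + 1
    return expected == total_pages + 1
```

Running `covers_all_pages` after `fix_overlaps` catches any remaining gaps, for example when the LLM skipped a page between two documents.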

## 🎨 Configurable Keywords

The tool uses a configurable keyword system for easy extension:

### Supplementary Material Keywords

```python
SUPPLEMENT_KEYWORDS = {
    'appendix': ['appendix', '附录'],
    'references': ['references', 'bibliography', '参考文献'],
    'supplementary': ['supplementary', '补充材料', 'supplement']
}
```

### Page Type Keywords

```python
PAGE_TYPE_KEYWORDS = {
    'toc': ['目录', 'contents', 'table of contents', '目 录'],
    'abstract': ['abstract', '摘要'],
    'references': ['references', 'bibliography', '参考文献'],
    'title_page': ['abstract', '摘要', 'introduction', '引言', ...]
}
```

You can modify these configurations in the code to adapt to different document types.
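
To show how such a keyword table might be applied and extended, here is a minimal sketch. The matching rule (case-insensitive substring search over the first three lines) and the function name `classify_page` are assumptions; the `title_page` entry is left out because its keyword list is elided above.

```python
PAGE_TYPE_KEYWORDS = {
    "toc": ["目录", "contents", "table of contents", "目 录"],
    "abstract": ["abstract", "摘要"],
    "references": ["references", "bibliography", "参考文献"],
}


def classify_page(lines, keywords=PAGE_TYPE_KEYWORDS, max_lines=3):
    """Return the first page type whose keyword appears in the leading lines, else None."""
    head = " ".join(lines[:max_lines]).lower()
    for page_type, words in keywords.items():
        if any(word.lower() in head for word in words):
            return page_type
    return None


# Extending the table for another document type (hypothetical 'cover' category).
custom_keywords = dict(PAGE_TYPE_KEYWORDS, cover=["cover letter"])
```

Calling `classify_page(["Cover Letter"], custom_keywords)` would then report the new `cover` type without any change to the matching logic.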

## ⚡ Performance Optimization

### Memory Optimization

- Automatically reduces image scale for large files (`--image-scale 0.5-0.6`)
- Reduces OCR batch size for memory-constrained environments
- Releases memory promptly by calling `gc.collect()`

### Token Optimization

- OCR JSON only contains key information (text and Y coordinates)
- Prompts only include headers and first 3 lines of text
- For large files, prompt content is automatically truncated so that all pages are still processed

### GPU Acceleration

- Automatically detects GPU availability
- Supports PaddleOCR GPU acceleration
- Automatically falls back to CPU when GPU is unavailable
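
The detection and fallback logic could look roughly like this. It is a sketch, not the tool's implementation: `select_device` is a hypothetical name, and the check assumes a CUDA build of PaddlePaddle with at least one visible device counts as "GPU available".

```python
def select_device(force_cpu=False):
    """Return 'gpu' or 'cpu', falling back to CPU when no usable GPU is found."""
    if force_cpu:
        return "cpu"
    try:
        # Imported lazily so the sketch also runs where PaddlePaddle is absent.
        import paddle

        gpu_ok = (
            paddle.device.is_compiled_with_cuda()
            and paddle.device.cuda.device_count() > 0
        )
    except Exception:
        # Any import or driver problem is treated as "no GPU".
        gpu_ok = False
    return "gpu" if gpu_ok else "cpu"
```

Treating every failure as "no GPU" mirrors the fallback behavior described above: a broken driver degrades to CPU instead of aborting the run.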

## 🐛 Troubleshooting

### Common Issues

1. **OCR Initialization Failed**
   - Check if PaddleOCR is correctly installed
   - Check GPU drivers and CUDA version
   - Try using `--use-cpu` to force CPU usage

2. **Out of Memory (Exit code 137)**
   - Reduce `--image-scale` (e.g., 0.5)
   - Use existing OCR JSON to skip OCR step
   - Process files in smaller batches

3. **Inaccurate Splitting Results**
   - Check OCR quality (view OCR JSON)
   - Try different document type parameters
   - Verify LLM API is working correctly

4. **API Call Failed**
   - Check API key and base URL configuration
   - Verify network connectivity
   - Check API service status

## 📝 Notes

1. **Splitting Principle**: When in doubt, the tool prefers over-splitting to incorrectly merging separate documents
2. **Title Extraction**: If a clear title cannot be extracted, header information or default titles will be used
3. **Page Coverage**: The tool validates that all pages are covered without gaps or overlaps
4. **File Naming**: Special characters in filenames are replaced with underscores to ensure filesystem compatibility
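
The file-naming rule in the last note can be illustrated with a short sketch. Which characters count as "special" is an assumption here (common filesystem-reserved characters plus whitespace), as are the function name `make_filename` and the `untitled` fallback.

```python
import re


def make_filename(index, title, default="untitled"):
    """Build '<NN>_<title>.pdf', replacing unsafe characters with underscores."""
    # Collapse runs of reserved characters and whitespace into one underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return f"{index:02d}_{cleaned or default}.pdf"
```

Non-ASCII titles such as Chinese document names pass through unchanged, since only the listed characters are replaced.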

## 📄 License

This tool is released under the MIT License (per the package classifiers) and is provided as-is. Please refer to the LICENSE file for details.

## 📚 Documentation

- [Usage Guide](docs/USAGE.md)
- [Configuration Guide](docs/CONFIG.md)
- [Architecture Overview](docs/ARCHITECTURE.md)

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 Changelog

### v1.0.0
- Initial release
- Support for TOC-based and LLM-based splitting
- GPU/CPU automatic switching
- Configurable keyword system
- Token optimization and performance improvements

## 🙏 Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for OCR capabilities
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF) for PDF processing
- OpenAI-compatible API providers for LLM support

---

**Made with ❤️ for the open-source community**

