Metadata-Version: 2.4
Name: pdfa-parser
Version: 1.0.1
Summary: Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF.
Project-URL: Homepage, https://github.com/Ilusinusmate/pdfa-parser
Project-URL: Issues, https://github.com/Ilusinusmate/pdfa-parser/issues
Project-URL: Repository, https://github.com/Ilusinusmate/pdfa-parser
Author: João
License-Expression: GPL-3.0-or-later
License-File: LICENSE
Keywords: conversion,ghostscript,pdf,pdf-a,pdfa,validation,verapdf
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: reportlab>=4.1; extra == 'dev'
Description-Content-Type: text/markdown

# pdfa-parser

> Convert PDFs to **PDF/A** using [GhostScript](https://www.ghostscript.com/) and validate compliance with [VeraPDF](https://verapdf.org/) — **zero-config, batteries included**.

**pdfa-parser** is a Python library (Python ≥ 3.10) that wraps GhostScript for
PDF → PDF/A conversion and VeraPDF for conformance validation. External tools
that are not already on the system are **fetched automatically** on first use,
so `pip install` is the only setup step.

```python
from pdfa_parser import create_parser

parser = create_parser()
parser.convert("input.pdf", "output.pdf")

result = parser.validate("output.pdf")
print(result.compliant)  # True if output.pdf passes validation
```

---

## Features

- **PDF → PDF/A conversion** via GhostScript (levels 1, 2, 3).
- **PDF/A validation** via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
- **Zero config** — GhostScript, Java (JRE), and VeraPDF are resolved
  automatically (system PATH → `apt-get` → binary download).
- **Works in Docker** — `pip install pdfa-parser` in a bare `python:3.x-slim`
  image is all you need.
- **Sync & async** — every public method has an `a_` async counterpart.
- **Factory function** (`create_parser()`) for instant quick start.
- **Adapter pattern** — swap GhostScript / VeraPDF for any CLI tool by
  implementing `IBaseAdapter`.
- **CLI** — `pdfa-parser input.pdf output.pdf` or `python -m pdfa_parser`.
- **Typed** — ships with `py.typed` marker and full type annotations.

---

## Installation

```bash
pip install pdfa-parser
```

That's it. No system packages to install, no manual binary setup.

> **Development install** (with test dependencies):
>
> ```bash
> pip install -e ".[dev]"
> ```

---

## Quick start

### Python API

```python
from pdfa_parser import create_parser

# Create a parser (GhostScript + VeraPDF are auto-resolved)
parser = create_parser()

# Convert a PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")

# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant)   # True / False
print(result.profile)     # "PDF/A-2B validation profile"

# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
```

> **Tip:** `PdfaParser` is a convenience alias for `PdfParser` — both work:
>
> ```python
> from pdfa_parser import PdfaParser          # alias
> from pdfa_parser import PdfParser           # canonical name
> from pdfa_parser import create_parser       # recommended factory
> ```

### Conversion only (no VeraPDF)

```python
parser = create_parser(with_verapdf=False)
parser.convert("input.pdf", "output.pdf")
```

### Async API

Every public method has an `a_`-prefixed async counterpart:

```python
import asyncio
from pdfa_parser import create_parser

async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
```
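
Because the async methods are coroutines, several conversions can be awaited together. The following is a minimal sketch that assumes only the `a_convert(input, output)` method documented above. `convert_many` and its `limit` parameter are hypothetical helpers, not part of the library, and whether running several GhostScript processes in parallel actually helps depends on the machine.

```python
import asyncio
from pathlib import Path
from typing import Any, Iterable


async def convert_many(
    parser: Any, pairs: Iterable[tuple[str, str]], limit: int = 4
) -> list[Path]:
    """Convert (input, output) pairs concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def convert_one(src: str, dst: str) -> Path:
        # The semaphore caps how many conversions are in flight at once.
        async with semaphore:
            return await parser.a_convert(src, dst)

    # gather() returns results in the same order as `pairs`.
    return await asyncio.gather(*(convert_one(src, dst) for src, dst in pairs))
```

It could be driven with `asyncio.run(convert_many(create_parser(), [("a.pdf", "a_pdfa.pdf"), ("b.pdf", "b_pdfa.pdf")]))`.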

### CLI

```bash
# Basic conversion
pdfa-parser input.pdf output.pdf

# With validation
pdfa-parser input.pdf output.pdf --validate

# PDF/A level 1, flavour 1b
pdfa-parser input.pdf output.pdf --level 1 --validate --flavour 1b

# Also works as a module
python -m pdfa_parser input.pdf output.pdf --validate
```
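
When the CLI is driven from a script, the command line can be assembled programmatically. This sketch uses only the flags shown above (`--level`, `--validate`, `--flavour`). `build_command` and `run_cli` are hypothetical helpers, and the meaning of the CLI's exit code is not documented here, so it is returned as-is.

```python
import subprocess
from typing import Optional


def build_command(
    src: str,
    dst: str,
    *,
    level: Optional[int] = None,
    validate: bool = False,
    flavour: Optional[str] = None,
) -> list[str]:
    """Assemble a `pdfa-parser` command line from the flags shown above."""
    command = ["pdfa-parser", src, dst]
    if level is not None:
        command += ["--level", str(level)]
    if validate:
        command.append("--validate")
    if flavour is not None:
        command += ["--flavour", flavour]
    return command


def run_cli(src: str, dst: str, **options) -> int:
    """Run the CLI and return its raw exit code."""
    return subprocess.run(build_command(src, dst, **options)).returncode
```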

---

## How dependency resolution works

On first use, the library checks for each tool in this order:

| Tool        | 1. System PATH    | 2. Package manager            | 3. Download                                                                           |
| ----------- | ----------------- | ----------------------------- | ------------------------------------------------------------------------------------- |
| GhostScript | `gs` / `gswin64c` | `apt-get install ghostscript` | GitHub archive (fallback)                                                             |
| Java (JRE)  | `java`            | —                             | [Adoptium Temurin 21](https://adoptium.net/)                                          |
| VeraPDF     | —                 | —                             | [Maven Central JAR](https://repo1.maven.org/maven2/org/verapdf/apps/greenfield-apps/) |

- Binaries are stored in `~/.local/share/pdfa-parser/bin/` (or `src/bin/`
  during development).
- The JRE and VeraPDF JAR are downloaded once and reused across runs.
- You can force a specific binary by supplying a custom adapter that returns
  its path (see [Advanced usage](#advanced-usage)).
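
The lookup order in the table can be pictured as a chain of strategies tried in sequence. The sketch below is not the library's implementation: `Strategy`, `from_path`, and `resolve_binary` are hypothetical names used only to illustrate the idea, and only the system PATH step is implemented.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional

# A strategy returns the path to a usable binary, or None if it cannot help.
Strategy = Callable[[], Optional[Path]]


def from_path(*names: str) -> Strategy:
    """Build a strategy that searches the system PATH for any of `names`."""

    def strategy() -> Optional[Path]:
        for name in names:
            found = shutil.which(name)
            if found:
                return Path(found)
        return None

    return strategy


def resolve_binary(strategies: Iterable[Strategy]) -> Path:
    """Return the first path produced by the strategies, tried in order."""
    for strategy in strategies:
        path = strategy()
        if path is not None:
            return path
    raise FileNotFoundError("no strategy could provide the binary")
```

For GhostScript, the chain might look like `resolve_binary([from_path("gs", "gswin64c"), install_with_apt, download_archive])`, where the last two callables are placeholders for the package-manager and download steps.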

---

## Public API reference

### Top-level imports

```python
from pdfa_parser import (
    create_parser,      # Factory — recommended entry point
    PdfParser,          # Core class (canonical name)
    PdfaParser,         # Alias for PdfParser
    ValidationResult,   # Dataclass returned by validate()
    DependencyManager,  # Manual dependency orchestration
    # For custom adapters:
    IBaseAdapter,
    BinaryExecuter,
    GhostScriptAdapter,
    VeraPDFAdapter,
)
```

### `create_parser(**kwargs) → PdfParser`

| Parameter       | Type             | Default | Description                            |
| --------------- | ---------------- | ------- | -------------------------------------- |
| `pdfa_level`    | `int`            | `2`     | PDF/A conformance level (1, 2, 3)      |
| `with_verapdf`  | `bool`           | `True`  | Attach VeraPDF for validation          |
| `extra_gs_args` | `tuple[str,...]` | `()`    | Extra flags for every GhostScript call |

### `PdfParser` methods

| Method                       | Returns            | Description                        |
| ---------------------------- | ------------------ | ---------------------------------- |
| `convert(input, output)`     | `Path`             | Convert PDF to PDF/A               |
| `validate(file, *, flavour)` | `ValidationResult` | Check PDF/A compliance via VeraPDF |
| `convert_and_validate(…)`    | `ValidationResult` | Convert then validate in one call  |
| `a_convert(…)`               | `Path`             | Async convert                      |
| `a_validate(…)`              | `ValidationResult` | Async validate                     |
| `a_convert_and_validate(…)`  | `ValidationResult` | Async convert + validate           |

All path parameters accept both `str` and `pathlib.Path`.

### `ValidationResult`

| Field       | Type   | Description                             |
| ----------- | ------ | --------------------------------------- |
| `compliant` | `bool` | `True` if the PDF satisfies the profile |
| `profile`   | `str`  | Profile name (e.g. `"PDF/A-2B …"`)      |
| `details`   | `str`  | Raw XML snippet for debugging           |
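
The three fields are enough to build a short report. In this sketch, `ResultLike` is a stand-in dataclass that mirrors the documented fields so the snippet runs without the library installed, and `describe` is a hypothetical helper, not part of the package.

```python
from dataclasses import dataclass


@dataclass
class ResultLike:
    """Stand-in with the same fields as the documented ValidationResult."""

    compliant: bool
    profile: str
    details: str


def describe(result: ResultLike, max_details: int = 200) -> str:
    """Summarise a result; on failure, append a truncated excerpt of the raw XML."""
    if result.compliant:
        return f"OK: compliant with {result.profile}"
    excerpt = result.details[:max_details]
    return f"FAILED: not compliant with {result.profile}\n{excerpt}"
```

Since `describe` only reads attributes, an object returned by `parser.validate(...)` should work in place of `ResultLike`, assuming it exposes the fields listed in the table.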

---

## Advanced usage

### Custom adapters

```python
from pathlib import Path

from pdfa_parser import BinaryExecuter, IBaseAdapter, PdfParser


class MyGSAdapter(IBaseAdapter):
    """Point the parser at a specific GhostScript build."""

    def get_binary_path(self) -> Path:
        return Path("/opt/gs-10/bin/gs")


# Wrap the adapter in a BinaryExecuter and hand it to the parser.
parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,
    extra_gs_args=("-dQUIET",),
)
```

### Manual dependency management

```python
from pdfa_parser import DependencyManager

m = DependencyManager()

# Check availability without downloading
print(m.ghostscript.is_available())  # True / False
print(m.verapdf.is_available())

# Force download / resolution
gs_path = m.ensure_ghostscript()
verapdf_path = m.ensure_verapdf()
```

---

## Project structure

```
pdfa-parser/
├── src/pdfa_parser/
│   ├── __init__.py             # Public API, create_parser(), PdfaParser alias
│   ├── __main__.py             # python -m pdfa_parser
│   ├── main.py                 # CLI entry-point
│   ├── pdf_parser.py           # PdfParser – convert / validate
│   ├── settings.py             # Lazy binary-path resolution
│   ├── data/
│   │   ├── PDFA_def.ps         # Bundled PostScript for PDF/A OutputIntent
│   │   └── srgb.icc            # Bundled sRGB ICC profile
│   ├── dependencies/
│   │   ├── _base.py            # Dependency / ResolutionStrategy ABCs
│   │   ├── _ghostscript.py     # GhostScript strategies
│   │   ├── _jre.py             # JRE (Adoptium) strategies
│   │   ├── _verapdf.py         # VeraPDF (Maven JAR) strategies
│   │   └── _manager.py         # DependencyManager orchestrator
│   ├── interfaces/
│   │   ├── base_adapter.py     # IBaseAdapter (ABC)
│   │   └── binary_executer.py  # BinaryExecuter (facade)
│   └── implementations/
│       ├── ghostscript_adapter.py
│       └── verapdf_adapter.py
├── tests/
│   ├── conftest.py             # Fixtures, skip markers, PDF generation
│   ├── test_unit.py            # Unit tests (no binaries needed)
│   ├── test_integration.py     # Integration tests (real binaries)
│   ├── test_sample_files.py    # Tests for bundled sample PDFs
│   ├── test_dependencies.py    # Dependency resolution tests
│   └── files/
│       ├── sample_pdf.pdf      # Regular PDF sample
│       └── sample_pdfa.pdf     # PDF/A sample
├── pyproject.toml
├── LICENSE
└── README.md
```

---

## Testing

```bash
# Everything (integration tests auto-skip if binaries are missing)
pytest -v

# Unit tests only (no binaries required)
pytest tests/test_unit.py -v

# Integration + sample file tests
pytest tests/test_integration.py tests/test_sample_files.py -v
```

### Test suites

| Suite                  | Tests | Requires binaries | What it covers                                                                |
| ---------------------- | ----: | ----------------- | ----------------------------------------------------------------------------- |
| `test_unit.py`         |    26 | No                | Helpers, XML parsing, arg building, mocked convert/validate, async, factory   |
| `test_dependencies.py` |    38 | No                | Dependency resolution strategies, DependencyManager, backward-compat shim     |
| `test_integration.py`  |    20 | Yes               | Real GS conversion, VeraPDF validation, round-trip, async, multiple PDF types |
| `test_sample_files.py` |     7 | Yes               | Bundled sample PDFs: conversion, validation, round-trip (sync + async)        |

Integration tests generate PDFs using **reportlab** (portrait, landscape,
coloured shapes, multi-page, text-heavy) and run them through the full
GhostScript → VeraPDF pipeline. Tests are **auto-skipped** when binaries are
not available.
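
One common way to implement this kind of auto-skip is to probe the system PATH before the tests run. The helper below is a sketch under that assumption. `missing_binaries` is a hypothetical name, and the project's own `conftest.py` may detect binaries differently, for example by also checking its download directory.

```python
import shutil
from typing import Callable, Optional, Sequence


def missing_binaries(
    names: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the entries of `names` that cannot be found on the system PATH."""
    return [name for name in names if which(name) is None]


# With pytest installed, a skip marker could be derived from it, for example:
#   requires_gs = pytest.mark.skipif(
#       bool(missing_binaries(["gs"])), reason="GhostScript not found"
#   )
```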

### Docker smoke test

```bash
# Build the wheel
uv build --wheel

# Run in a clean Python container
docker run --rm \
  -v "$PWD/dist:/dist" \
  -v "$PWD/tests:/tests" \
  python:3.13-slim \
  bash -c "pip install /dist/*.whl && python /tests/docker_smoke_test.py"
```

---

## Requirements

| Requirement | Version | Notes                                               |
| ----------- | ------- | --------------------------------------------------- |
| Python      | ≥ 3.10  | No runtime dependencies beyond the standard library |
| GhostScript | any     | System PATH, then `apt-get`, then archive download  |
| Java (JRE)  | ≥ 11    | Auto-downloaded from Adoptium if missing            |
| VeraPDF     | 1.26.5  | Auto-downloaded from Maven Central                  |

---

## License

[GPLv3+](LICENSE)
