Metadata-Version: 2.4
Name: iscc-tika
Version: 0.4.0
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: pdoc ; extra == 'docs'
Requires-Dist: pytest ; extra == 'test'
Requires-Dist: pytest-timeout ; extra == 'test'
Requires-Dist: scikit-learn ; extra == 'test'
Provides-Extra: docs
Provides-Extra: test
Summary: Fast text and metadata extraction from documents using Apache Tika compiled to native code
Home-Page: https://github.com/iscc/iscc-tika
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/iscc/iscc-tika
Project-URL: Repository, https://github.com/iscc/iscc-tika

# iscc-tika Python Bindings

This package provides Python bindings for iscc-tika, a library for fast text and metadata
extraction from documents using Apache Tika compiled to native code.

## Installation

```bash
pip install iscc-tika
```

## Usage

Extracting a file to string:

```python
from iscc_tika import Extractor

# Create a new extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# Uncomment if you need XML output
# extractor = extractor.set_xml_output(True)

# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)
```

Extracting a file, URL, or bytearray to a buffered stream:

```python
from iscc_tika import Extractor

extractor = Extractor()
# for file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# for url
# reader, metadata = extractor.extract_url("https://www.google.com")
# for bytearray
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

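# Collect the raw bytes first and decode once at the end: decoding each
# chunk separately can fail when a multi-byte UTF-8 character is split
# across a chunk boundary.
data = bytearray()
chunk = reader.read(4096)
while len(chunk) > 0:
    data += chunk
    chunk = reader.read(4096)
result = data.decode("utf-8")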

print(result)
print(metadata)
```
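
The read loop above can be wrapped in a reusable helper. The following is a minimal sketch, assuming the reader's `read(size)` returns `bytes` or `bytearray` and an empty value at end of stream, as the loop above suggests. `read_text` is a hypothetical name and not part of iscc-tika. It uses the standard library's incremental decoder, so a multi-byte UTF-8 character split across two chunks is still decoded correctly. An `io.BytesIO` object stands in for the reader so the example runs without any document on disk.

```python
import codecs
import io


def read_text(reader, chunk_size=4096, encoding="utf-8"):
    """Read a byte stream to the end and return the decoded text.

    Hypothetical helper (not part of iscc-tika). `reader` is any object
    with a read(size) method that returns bytes or bytearray.
    """
    # The incremental decoder keeps incomplete multi-byte sequences
    # between calls, so chunk boundaries cannot corrupt characters.
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        parts.append(decoder.decode(bytes(chunk)))
    # final=True flushes the decoder and raises if the stream ended
    # in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Stand-in for the reader returned by extract_file(); a tiny chunk size
# forces multi-byte characters to be split across reads.
demo = io.BytesIO("Grüße, 世界".encode("utf-8"))
print(read_text(demo, chunk_size=3))
```

With a real extractor, the assumed usage would be `text = read_text(reader)` in place of the manual loop.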

Extracting a file with OCR:

```python
from iscc_tika import Extractor, TesseractOcrConfig

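# set_language takes a Tesseract language code ("deu" is German);
# use the code that matches the language of your document.
extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))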
result, metadata = extractor.extract_file_to_string(
    "../../test_files/documents/eng-ocr.pdf"
)

print(result)
print(metadata)
```

