Metadata-Version: 2.2
Name: noLZSS
Version: 1.2.0
Summary: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization (high-performance C++ core with Python bindings)
Author: Omer Kerner
License: BSD-3-Clause
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: C++
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Science/Research
Project-URL: Homepage, https://github.com/OmerKerner/noLZSS
Requires-Python: >=3.8
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: numpy>=1.20.0; extra == "test"
Requires-Dist: pandas>=1.3.0; extra == "test"
Requires-Dist: matplotlib>=3.5.0; extra == "test"
Requires-Dist: scipy>=1.7.0; extra == "test"
Provides-Extra: benchmark
Requires-Dist: matplotlib>=3.5.0; extra == "benchmark"
Provides-Extra: plotting
Requires-Dist: matplotlib>=3.5.0; extra == "plotting"
Provides-Extra: genomics
Requires-Dist: numpy>=1.20.0; extra == "genomics"
Requires-Dist: scipy>=1.7.0; extra == "genomics"
Provides-Extra: panel
Requires-Dist: numpy>=1.20.0; extra == "panel"
Requires-Dist: pandas>=1.3.0; extra == "panel"
Requires-Dist: holoviews>=1.15.0; extra == "panel"
Requires-Dist: bokeh>=2.4.0; extra == "panel"
Requires-Dist: panel>=1.0.0; extra == "panel"
Requires-Dist: datashader>=0.14.0; extra == "panel"
Requires-Dist: colorcet>=3.0.0; extra == "panel"
Description-Content-Type: text/markdown

# noLZSS

[![Build Wheels](https://github.com/OmerKerner/noLZSS/actions/workflows/wheels.yml/badge.svg)](https://github.com/OmerKerner/noLZSS/actions/workflows/wheels.yml)
[![Documentation](https://github.com/OmerKerner/noLZSS/actions/workflows/docs.yml/badge.svg)](https://omerkerner.github.io/noLZSS/)
<img align="right" src="assets/logo.png" alt="noLZSS Logo" width=200px/>

**Non-overlapping Lempel–Ziv–Storer–Szymanski factorization**

High-performance Python library for text factorization using compressed suffix trees. The library provides efficient algorithms for finding non-overlapping factors in text data, with both in-memory and file-based processing capabilities. Based on a paper by Dominik Köppl - [Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees](https://doi.org/10.3390/a14020044)

## Features

- 🚀 **High Performance**: Uses compressed suffix trees (SDSL) for optimal factorization speed
- 💾 **Memory Efficient**: File-based processing for large datasets without loading everything into memory
- 🐍 **Python Bindings**: Easy-to-use Python interface with proper GIL management
- 📊 **Multiple Output Formats**: Get factors as lists, counts, or binary files
- 🔧 **Flexible API**: Support for both strings and files with optional performance hints
- 🧬 **Genomics Support**: Specialized functions for FASTA file processing of nucleotide and protein sequences
- ⚡ **C++ Extensions**: High-performance C++ implementations for memory-intensive operations

## Installation

### From Source (Development)

```bash
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
pip install -e .
```

### Requirements

- Python 3.8+
- C++17 compatible compiler
- CMake 3.20+

## Quick Start

### Basic Usage

```python
import noLZSS

# Factorize a text string
text = b"abracadabra"
factors = noLZSS.factorize(text)
print(factors)  # [(0, 1, 0), (1, 1, 1), (2, 1, 2), (3, 1, 0), (4, 1, 4), (5, 1, 0), (6, 1, 6), (7, 4, 0)]
```

### Working with Files

```python
# Factorize text from a file
factors = noLZSS.factorize_file("large_text.txt")
print(f"Found {len(factors)} factors")

# Just count factors without storing them (memory efficient)
count = noLZSS.count_factors_file("large_text.txt")
print(f"Total factors: {count}")

# Write factors to binary file for later processing
noLZSS.write_factors_binary_file("input.txt", "factors.bin")
```

### Genomics Applications

```python
import noLZSS.genomics

# Process DNA sequences from FASTA file
results = noLZSS.genomics.read_nucleotide_fasta("sequences.fasta")
for seq_id, factors in results:
    print(f"Sequence {seq_id}: {len(factors)} factors")
```

#### Statistical Significance Analysis

Distinguish signal from noise in genomic factorizations by comparing with shuffled genomes:

```python
from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_factors.bin",
    "shuffled_factors.bin",
    tau_expected_fp=1.0
)

print(f"Minimum significant factor length: {result['L_star']}")
```

See [Factor Length Significance Analysis](docs/genomics_significance.md) for detailed methodology.

### Batch Processing

For processing multiple FASTA files, noLZSS provides batch processing scripts:

#### Local Batch Processing
```bash
# Process multiple files locally with parallel downloads and factorization
python -m noLZSS.genomics.batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --mode with_reverse_complement \
    --max-workers 4
```

#### LSF Cluster Batch Processing
```bash
# Process on LSF cluster with optimal resource allocation
python -m noLZSS.genomics.lsf_batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --max-threads 16 \
    --mode with_reverse_complement
```

The LSF batch processor:
- Estimates time, memory, and disk requirements from benchmark data
- Allocates optimal threads per job based on file size
- Submits jobs with `bsub` and tracks completion
- Provides consolidated failure reports without log spam

See [LSF Batch Factorizer Documentation](docs/LSF_BATCH_FACTORIZER.md) for details.

## Algorithm Details

The library implements the **Non-overlapping Lempel-Ziv-Storer-Szymanski (LZSS)** factorization algorithm using:

- **Compressed Suffix Trees**: Built using the SDSL (Succinct Data Structure Library)
- **Range Minimum Queries**: For efficient lowest common ancestor computations
- **Sink-based Processing**: Memory-efficient processing using callback functions

### Tie-Breaking for DNA Factorization

When factorizing DNA sequences with reverse complement awareness, if both a forward match and a reverse complement match have the same length, the **forward match is preferred**. Among candidates of the same type, the one with the earliest position wins.

## Performance

- **Time Complexity**: 𝒪(𝑛 lg<sup>ϵ</sup> 𝑛) for factorization, where n is input length, and 𝜖 ∈ (0,1]
- **Space Complexity**: 𝒪(𝑛lg𝜎) for suffix tree construction, where 𝜎 is the alphabet size
- **Memory Usage**: File-based processing uses minimal memory for large files
- **C++ Extensions**: Specialized high-performance functions for memory-intensive genomics operations

## Documentation

Complete documentation is available at **[omerkerner.github.io/noLZSS](https://omerkerner.github.io/noLZSS/)**

The documentation includes:
- **Python API Reference**: Complete Python API with examples and parameter descriptions
- **C++ API Reference**: Auto-generated C++ API documentation from source code
- **Genomics Module**: Specialized functions for biological sequence analysis
- **Examples and Tutorials**: Comprehensive usage examples and best practices

## Development

### Building from Source

```bash
# Clone repository
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/
```

## License

This project is licensed under the BSD 3-Clause License (see `LICENSE`).

The repository vendors third-party components (notably SDSL v3). Third-party license texts and attribution are provided in `THIRD_PARTY_LICENSES.txt`.

## Citation

If you use this library in your research, please cite:

```bibtex
@software{noLZSS,
  title = {noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization},
  author = {Kerner, Omer},
  url = {https://github.com/OmerKerner/noLZSS},
  year = {2024}
}
```

