Metadata-Version: 2.4
Name: pynw
Version: 0.1.2
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Typing :: Typed
Requires-Dist: numpy>=1.16
License-File: LICENSE
Summary: Rust-accelerated Needleman-Wunsch global sequence alignment for precomputed score matrices.
Keywords: needleman-wunsch,sequence-alignment,dynamic-programming,bioinformatics,edit-distance
Author-email: Christopher Deutsch <hello@cdeutsch.org>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/chrisdeutsch/pynw
Project-URL: Issues, https://github.com/chrisdeutsch/pynw/issues
Project-URL: Repository, https://github.com/chrisdeutsch/pynw

# pynw

[![CI](https://github.com/chrisdeutsch/pynw/actions/workflows/CI.yml/badge.svg)](https://github.com/chrisdeutsch/pynw/actions/workflows/CI.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

Rust-accelerated [Needleman-Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)
global sequence alignment for precomputed score matrices.

Inspired by
[`scipy.optimize.linear_sum_assignment`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html):
the same goal of robust, high-performance global matching from a NumPy-first
interface, but specialized for **ordered** sequence alignment where gaps
(insertions/deletions) are modeled.

## Features

- **Fast:** Core Dynamic Programming and traceback are implemented in optimized Rust. Evaluates a `1000x1000` matrix in single-digit milliseconds.
- **NumPy-First:** The interface accepts NumPy arrays directly and requires no intermediate Python objects, keeping the hot path completely free of Python overhead.
- **Domain-Agnostic:** Any metric (e.g. edit costs, cosine similarities, log-likelihoods) can be provided as a pre-computed score matrix.

## Installation

```sh
pip install pynw
```

Requires Python 3.10+ and NumPy 1.16+.

## Quick start

```python
import numpy as np
from pynw import needleman_wunsch

seq1 = np.array(list("GATTACA"))
seq2 = np.array(list("GCATGCA"))

# Vectorized score matrix construction
# 1.0 for match, -1.0 for mismatch
score_matrix = np.where(seq1[:, None] == seq2[None, :], 1.0, -1.0)

score, seq1_idx, seq2_idx = needleman_wunsch(
    score_matrix,
    gap_penalty_seq1=-1.0,
    gap_penalty_seq2=-1.0,
)

# score: 2.0
# seq1_idx: array([ 0, -1,  1,  2,  3,  4,  5,  6])  (-1 = gap in seq1)
# seq2_idx: array([ 0,  1,  2, -1,  3,  4,  5,  6])  (-1 = gap in seq2)

for i, j in zip(seq1_idx, seq2_idx):
    s1 = seq1[i] if i >= 0 else "-"
    s2 = seq2[j] if j >= 0 else "-"
    print(f"{s1} <-> {s2}")
# G <-> G
# - <-> C
# A <-> A
# T <-> -
# T <-> T
# A <-> G
# C <-> C
# A <-> A
```

## Performance

When comparing with `scipy.optimize.linear_sum_assignment`, note:

1. **Different problems:** `linear_sum_assignment` solves unconstrained bipartite matching (order does not matter). `pynw` solves ordered sequence alignment (order matters, gaps allowed).
2. **Different time complexities:** `linear_sum_assignment` runs in $O(N^3)$ time. `pynw` evaluates alignments in $O(NM)$ time.

This $O(NM)$ scaling makes `pynw` significantly faster for sequence alignment. A typical `1000x1000` alignment takes **< 10ms** on modern CPUs (compared to > 60ms for `linear_sum_assignment`).

## API

```python
needleman_wunsch(
    score_matrix: np.ndarray,
    gap_penalty_seq1: float = -1.0,
    gap_penalty_seq2: float = -1.0,
) -> tuple[float, np.ndarray, np.ndarray]
```

**Parameters:**

- `score_matrix`: 2D array of shape `(n, m)` where `score_matrix[i, j]` is
  the similarity score for aligning element `i` of sequence 1 with element `j`
  of sequence 2. Non-2D inputs raise `ValueError`. Non-float inputs are cast
  to `float64` automatically.
- `gap_penalty_seq1`: Penalty when a gap is inserted in sequence 1 (sequence
  2 advances). Typically negative. This can be thought of as the cost of a deletion from sequence 1 or an insertion into sequence 2.
- `gap_penalty_seq2`: Penalty when a gap is inserted in sequence 2 (sequence
  1 advances). Typically negative. This can be thought of as the cost of a deletion from sequence 2 or an insertion into sequence 1.

**Returns:**

- `score`: The optimal alignment score.
- `seq1_idx`: `np.intp` array of indices into sequence 1 at each alignment
  position. `-1` indicates a gap.
- `seq2_idx`: `np.intp` array of indices into sequence 2 at each alignment
  position. `-1` indicates a gap.

The match-score component is composable, like `linear_sum_assignment`:

```python
matched = (seq1_idx >= 0) & (seq2_idx >= 0)
score_matrix[seq1_idx[matched], seq2_idx[matched]].sum()  # match contribution
```

Empty matrices are supported and return a score of `0.0` with empty index
arrays.

## Why use this instead of `linear_sum_assignment`?

`linear_sum_assignment` solves unconstrained bipartite matching where order does
not matter. Use `needleman_wunsch` when alignment must **preserve order** and
model gaps (insertions/deletions), e.g. for strings, time series, or token
sequences.

## Why a precomputed score matrix?

Computing pairwise scores can be just as expensive as the alignment itself.
By accepting a precomputed matrix, `pynw` lets you build scores in a vectorized
fashion (e.g. with NumPy or a domain-specific library) outside of the
Python-level loop, keeping the hot path fast. This design also makes the
interface agnostic to the scoring function: any metric - edit costs, embedding
similarities, log-likelihoods - works as long as you can fill an `(n, m)`
array.

The trade-off is that you must allocate the full score matrix up front, which
uses O(nm) memory even when the scoring rule could be expressed more compactly.

## Scoring convention

- Matrix entries are added to the total score on diagonal moves.
- Gap penalties are added on gap moves, so they are typically negative.

## Deterministic tie-breaking

When multiple moves are co-optimal, `pynw` uses:

```text
DIAG > UP > LEFT
```

This affects which optimal path is returned, not the optimal score.

## Edit-distance parameterizations

Needleman-Wunsch can reproduce common metrics with the right matrix/gap
settings:

| Metric               | match | mismatch | gap      | NW score equals |
|----------------------|-------|----------|----------|-----------------|
| Levenshtein distance | 0     | -1       | -1       | `-distance`     |
| Indel distance       | 0     | -2       | -1       | `-distance`     |
| LCS length           | 1     | 0        | 0        | `lcs_length`    |
| Hamming distance     | 0     | -1       | `-(n+1)` | `-distance`     |

For Hamming, strings must have equal length.

## Development

This repository uses [pixi](https://pixi.sh) for development:

```sh
pixi install
pixi run build          # build the Rust extension
pixi run test           # run deterministic tests
pixi run test-hypothesis  # run property-based tests
pixi run lint           # run ruff linter
pixi run typecheck      # run mypy
```

## License

[MIT](LICENSE)

