Metadata-Version: 2.4
Name: pynw
Version: 0.4.1
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Dist: numpy>=1.22
License-File: LICENSE
Summary: Rust-accelerated Needleman-Wunsch global sequence alignment for precomputed score matrices.
Keywords: needleman-wunsch,sequence-alignment,dynamic-programming,bioinformatics,edit-distance
Author-email: Christopher Deutsch <hello@cdeutsch.org>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/chrisdeutsch/pynw
Project-URL: Issues, https://github.com/chrisdeutsch/pynw/issues
Project-URL: Repository, https://github.com/chrisdeutsch/pynw

# pynw

[![Check](https://github.com/chrisdeutsch/pynw/actions/workflows/check.yml/badge.svg)](https://github.com/chrisdeutsch/pynw/actions/workflows/check.yml)
[![Build](https://github.com/chrisdeutsch/pynw/actions/workflows/build.yml/badge.svg)](https://github.com/chrisdeutsch/pynw/actions/workflows/build.yml)
[![PyPI](https://img.shields.io/pypi/v/pynw)](https://pypi.org/project/pynw/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

Rust-accelerated [Needleman-Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)
global sequence alignment for user-supplied similarity matrices. Python bindings
are built with [PyO3](https://pyo3.rs).

Align two ordered sequences given any precomputed pairwise similarity matrix.
Unlike string-distance or bioinformatics libraries, which are designed around
specific alphabets and scoring rules, `pynw` accepts whatever scores you
provide: cosine similarity of embeddings, model outputs, or distance metrics.

## Features

- **Fast:** Alignment runs in $O(nm)$ time; a `1000×1000` matrix takes <10 ms on modern CPUs.
- **NumPy-first:** Pass NumPy arrays directly, no conversion needed.
- **Domain-agnostic:** Operates on a precomputed similarity matrix; the scoring function is up to you.
- **Asymmetric gaps:** Penalize inserts and deletes independently.

## Installation

Requires Python 3.10+ and NumPy 1.22+. Prebuilt wheels for Linux, macOS, and
Windows are published on PyPI.

```sh
pip install pynw
```

On platforms without a prebuilt wheel, pip will build from the source
distribution. This requires a [Rust toolchain](https://rustup.rs/) (1.85+).

## Quick start

Using `pynw` involves three steps:

1. **Build a similarity matrix.** Compute pairwise scores between every element
   of your two sequences using any scoring function.
2. **Run the alignment.** Pass the matrix and a gap penalty to
   `needleman_wunsch`. The gap penalty controls when leaving an element unmatched
   is preferable to a low-scoring match.
3. **Interpret the results.** The returned edit operations tell you which
   elements were aligned, inserted, or deleted.

The example below aligns two word sequences. The similarity matrix is built from
[GloVe](https://nlp.stanford.edu/projects/glove/) cosine similarities, letting
semantically related words align even without an exact match:

```python
import numpy as np
from pynw import needleman_wunsch, alignment_indices

source = np.array(
    ["clever", "sneaky", "fox", "leaped"]
)
target = np.array(
    ["sly", "fox", "jumped", "across"]
)

# Cosine similarity from GloVe (glove-wiki-gigaword-50)
similarity_matrix = np.array([
    # sly     fox     jumped  across
    [ 0.65,   0.25,   0.06,   0.20],  # clever
    [ 0.57,   0.06,  -0.14,  -0.05],  # sneaky
    [ 0.26,   1.00,   0.30,   0.41],  # fox
    [-0.00,   0.07,   0.77,   0.35],  # leaped
])

# Each gap deducts 0.5 from the total score; increase the penalty to force
# more alignments, decrease it to allow more gaps
score, editops = needleman_wunsch(similarity_matrix, gap_penalty=-0.5)
src_idx, tgt_idx = alignment_indices(editops)

# Reconstruct aligned sequences; masked positions are gaps
aligned_src = np.ma.array(source).take(src_idx).filled("-")
aligned_tgt = np.ma.array(target).take(tgt_idx).filled("-")

print(f"Score: {round(score, 2)}")
for s, t in zip(aligned_src, aligned_tgt):
    print(f"  {s:10s}  {t}")

# Score: 1.42
#   clever     sly       (semantic match)
#   sneaky     -         (deleted)
#   fox        fox       (exact match)
#   leaped     jumped    (semantic match)
#   -          across    (inserted)
```

## User Guide

`pynw` exposes two alignment functions and a helper for interpreting results.
Which function you need depends on whether you need just the score or the full
alignment.

### Score only: `needleman_wunsch_score`

Use `needleman_wunsch_score` when you only need the alignment score, for
example when ranking or filtering many sequence pairs. It skips the traceback
entirely, using O(m) memory instead of O(nm):

```python
from pynw import needleman_wunsch_score

score = needleman_wunsch_score(similarity_matrix)
```

### Score and alignment: `needleman_wunsch`

Use `needleman_wunsch` when you need to know _how_ the sequences were aligned.
It returns the optimal score along with an array of edit operations (`editops`).
Each element in the editops array is one of three `EditOp` values:

- `EditOp.Align`: a source element is matched with a target element.
- `EditOp.Delete`: a source element is consumed with no matching target element
  (gap in target).
- `EditOp.Insert`: a target element is consumed with no matching source element
  (gap in source).

The editops array alone is enough for aggregate statistics:

```python
from pynw import EditOp, needleman_wunsch

score, editops = needleman_wunsch(similarity_matrix)

n_aligned = np.sum(editops == EditOp.Align)
n_inserted = np.sum(editops == EditOp.Insert)
n_deleted = np.sum(editops == EditOp.Delete)
```

### Reconstructing the alignment: `alignment_indices`

Use `alignment_indices` when you need to map alignment positions back to the
original sequences. It converts the editops array into two masked index arrays: one
for the source, one for the target. Each array has one entry per alignment
position. Positions where the corresponding sequence has a gap are masked.

```python
from pynw import alignment_indices

src_idx, tgt_idx = alignment_indices(editops)
```

Because the indices are masked arrays, `take` propagates the mask and `filled`
substitutes a value at gap positions. This makes it easy to reconstruct aligned
sequences with gap markers:

```python
source = np.ma.array(["the", "quick", "fox"])
target = np.ma.array(["the", "slow", "red", "fox"])

aligned_src = source.take(src_idx).filled("-")
aligned_tgt = target.take(tgt_idx).filled("-")
# aligned_src: ['the', 'quick', '-',   'fox']
# aligned_tgt: ['the', 'slow',  'red', 'fox']
```

When iterating over a masked array, masked positions yield the `np.ma.masked`
sentinel instead of an integer. This means you can iterate over `editops` and the
index arrays together without checking masks explicitly. In the diff example
below, `s` is masked at Insert positions and `t` is masked at Delete positions,
so only the valid index is used in each branch:

```python
for op, s, t in zip(editops, src_idx, tgt_idx):
    if op == EditOp.Align:
        print(f"  {source[s]}")
    elif op == EditOp.Delete:
        print(f"- {source[s]}")
    elif op == EditOp.Insert:
        print(f"+ {target[t]}")
```

### Asymmetric gap penalties

By default, `gap_penalty` applies equally to insertions and deletions. To
penalize them independently, pass `insert_penalty` and/or `delete_penalty`:

```python
score, editops = needleman_wunsch(
    similarity_matrix,
    insert_penalty=-0.3,
    delete_penalty=-0.7,
)
```

When set, these override `gap_penalty` for the corresponding direction. This is
useful when the cost of missing a source element differs from the cost of
introducing a spurious target element.

## Details

### Precomputed similarity matrix

`pynw` takes a precomputed `(n, m)` similarity matrix rather than a scoring
function. This means the entire alignment runs in compiled Rust code with no
Python callbacks, and you can build scores with vectorized NumPy operations
rather than element-wise Python loops.

The trade-off is that you must allocate the full matrix up front, which uses
$O(nm)$ memory even when the scoring rule could be expressed more compactly.

### Scoring

The total alignment score is the sum of similarity-matrix entries for matched
positions and gap penalties for insertions/deletions. Gap penalties are
typically negative. By default a single `gap_penalty` applies to both
directions; set `insert_penalty` and/or `delete_penalty` to penalise them
independently.

When multiple alignments achieve the same optimal score, `pynw` breaks ties
deterministically: `Align > Delete > Insert`.

### Edit-distance parameterizations

Needleman-Wunsch can reproduce common metrics with the right similarity-matrix
values and gap penalty:

| Metric               | `S[i,j]` match | `S[i,j]` mismatch | `gap_penalty` | NW score equals |
| -------------------- | -------------- | ----------------- | ------------- | --------------- |
| Levenshtein distance | 0              | -1                | -1            | `-distance`     |
| Indel distance       | 0              | -2                | -1            | `-distance`     |
| LCS length           | 1              | 0                 | 0             | `lcs_length`    |
| Hamming distance     | 0              | -1                | `-(n+1)`      | `-distance`     |

For Hamming distance, strings must have equal length.

## API

Full API documentation is available at
[chrisdeutsch.github.io/pynw](https://chrisdeutsch.github.io/pynw/pynw.html).
To browse locally:

```sh
pixi run docs          # generate static HTML in site/
pixi run docs-serve    # serve live docs in the browser
```

## Development

This repository uses [pixi](https://pixi.sh) for development:

```sh
pixi install
pixi run build            # build the Rust extension
pixi run test             # run deterministic tests
pixi run lint             # run all pre-commit checks (ruff, cargo fmt, prettier, markdownlint, taplo, actionlint)
pixi run check            # run all pre-push checks (cargo clippy, mypy)
pixi run docs             # generate API docs in site/
pixi run docs-serve       # serve API docs in the browser
```

Linting and formatting are managed by [lefthook](https://github.com/evilmartians/lefthook)
and run automatically as git hooks.

## Related projects

- [rapidfuzz](https://github.com/rapidfuzz/RapidFuzz):
  Highly optimized string distances (Levenshtein, Jaro-Winkler, etc.) with
  scoring, edit operations, and alignment. The better choice when you only
  need standard string metrics.
- [sequence-align](https://github.com/kensho-technologies/sequence_align):
  Rust-accelerated Needleman-Wunsch and Hirschberg for token sequences with
  built-in match/mismatch scoring.
- [`scipy.optimize.linear_sum_assignment`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html):
  Solves unconstrained bipartite matching ($O(N^3)$) where order does not
  matter.
- [BioPython `Bio.Align.PairwiseAligner`](https://biopython.org/docs/dev/Tutorial/chapter_align.html):
  Needleman-Wunsch/Smith-Waterman for biological sequences with
  alphabet-based substitution matrices (built-in and custom).

## License

[MIT](LICENSE)

