Metadata-Version: 2.2
Name: error-align
Version: 0.1.0b8
Summary: Text-to-text alignment algorithm for speech recognition error analysis.
Keywords: speech-recognition,text-alignment,NLP,evaluation
Author-Email: Lasse Borgholt <lb@corti.ai>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: C++
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Project-URL: Homepage, https://github.com/corticph/error-align
Requires-Python: <3.15,>=3.10
Requires-Dist: tqdm>=4.67.1
Requires-Dist: unidecode>=1.4.0
Requires-Dist: regex>=2025.9.18
Provides-Extra: evaluation
Requires-Dist: rapidfuzz>=3.13.0; python_version == "3.12" and extra == "evaluation"
Requires-Dist: num2words>=0.5.14; python_version == "3.12" and extra == "evaluation"
Requires-Dist: datasets>=3.3.2; python_version == "3.12" and extra == "evaluation"
Requires-Dist: soundfile>=0.13.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: librosa>=0.11.0; python_version == "3.12" and extra == "evaluation"
Requires-Dist: click>=8.2.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: pandas>=2.3.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: scipy>=1.16.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: textgrid>=1.6.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: nemo-toolkit[all]>=2.4.0; python_version == "3.12" and extra == "evaluation"
Requires-Dist: transformers>=4.51.3; python_version == "3.12" and extra == "evaluation"
Requires-Dist: backoff>=2.2.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: gitpython>=3.1.45; python_version == "3.12" and extra == "evaluation"
Requires-Dist: pyphen>=0.17.2; python_version == "3.12" and extra == "evaluation"
Requires-Dist: matplotlib>=3.10.6; python_version == "3.12" and extra == "evaluation"
Requires-Dist: numpy<2.0,>=1.22; python_version == "3.12" and extra == "evaluation"
Requires-Dist: lilcom==1.8.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: torchcodec<0.10.0,>=0.9.1; python_version == "3.12" and extra == "evaluation"
Requires-Dist: torch<2.10,>=2.9; python_version == "3.12" and extra == "evaluation"
Requires-Dist: torchvision==0.24; python_version == "3.12" and extra == "evaluation"
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/logo.svg" alt="ErrorAlign Logo" width="100%"/>
</p>

<p align="center">
  <img src="https://img.shields.io/badge/python-%203.10%20|%203.11%20|%203.12%20|%203.13%20|%203.14-green" alt="Python Versions">
  <img src="https://img.shields.io/codecov/c/github/corticph/error-align/main.svg?style=flat-square" alt="Coverage" style="margin-left:5px;">
  <img src="https://img.shields.io/pypi/v/error-align.svg" alt="PyPI" style="margin-left:5px;">
  <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License" style="margin-left:5px;">
</p>

<br/>

**Text-to-text alignment algorithm for speech recognition error analysis.** ErrorAlign helps you dig deeper into your speech recognition projects by accurately aligning each word in a reference transcript with the model-generated transcript. Unlike traditional methods, such as Levenshtein-based alignment, it is not restricted to simple one-to-one alignment, but can map a single reference word to multiple words or subwords in the model output. This enables quick and reliable identification of error patterns in rare words, names, or domain-specific terms that matter most for your application.

→ **Update [2025-12-10]:** As of version `0.1.0b5`, `error-align` will include a word-level pass to efficiently identify unambiguous matches, along with C++ extensions to accelerate beam search and backtrace construction. The combined speedup is ~15× over the pure-Python implementation ⚡

[//]: <> (https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/logo_gpt.svg)

__Contents__ | [Installation](#installation) | [Quickstart](#quickstart) | [Citation and Research](#citation) |



<a name="installation">

## Installation

```
pip install error-align
```


<a name="quickstart">

## Quickstart
```python
from error_align import error_align

ref = "Some things are worth noting!"
hyp = "Something worth nothing period?"

alignments = error_align(ref, hyp)
```

Resulting alignments:
```python
Alignment(SUBSTITUTE: "Some"- -> "Some"),
Alignment(SUBSTITUTE: -"thing" -> "things"),
Alignment(DELETE: "are"),
Alignment(MATCH: "worth" == "worth"),
Alignment(SUBSTITUTE: "noting" -> "nothing"),
Alignment(INSERT: "period")
```


<a name="citation">

## Citation and Research

```
@article{borgholt2025text,
  title={A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems},
  author={Borgholt, Lasse and Havtorn, Jakob and Igel, Christian and Maal{\o}e, Lars and Tan, Zheng-Hua},
  journal={arXiv preprint arXiv:2509.24478},
  year={2025}
}
```

__To reproduce results from the paper:__
- Install with extra evaluation dependencies - only supported with Python 3.12:
  - `pip install error-align[evaluation]`
- Clone this repository:
  - `git clone https://github.com/corticph/error-align.git`
- Navigate to the evaluation directory:
  - `cd error-align/evaluation`
- Transcribe a dataset for evaluation. For example:
  - `python transcribe_dataset.py --model_name whisper --dataset_name commonvoice --language_code fr`
- Run evaluation script on the output file. For example:
  - `python evaluate_dataset.py --transcript_file transcribed_data/whisper_commonvoice_test_fr.parquet`

__Notes:__
- To reproduce results on the `primock57` dataset, first run: `python prepare_primock57.py`.
- Use the `--help` flag to see all available options for `transcribe_dataset.py` and `evaluate_dataset.py`.
- All results reported in the paper are based on the test sets.

__Collaborators:__

<br/>

<div>
  <a href="https://corti.ai">
    <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/corti.png" alt="Corti" height="75">
  </a>
  <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/empty.png" alt="" width="30">
  <a href="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/aau.png">
    <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/aau.png" alt="Aalborg University" height="75">
  </a>
  <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/empty.png" alt="" width="30">
  <a href="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/pcai.png">
    <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/pcai.png" alt="Pioneer Centre for Artificial Intelligence" height="75">
  </a>
  <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/empty.png" alt="" width="30">
  <a href="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/dtu.png">
    <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/dtu.png" alt="Technical University of Denmark" height="75">
  </a>
  <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/empty.png" alt="" width="30">
  <a href="https://www.ku.dk/">
    <img src="https://raw.githubusercontent.com/corticph/error-align/refs/heads/main/.github/assets/ucph.png" alt="University of Copenhagen" height="75">
  </a>
</div>

---
