Metadata-Version: 2.4
Name: text2num
Version: 3.0.1
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: Filters
Classifier: Natural Language :: French
Classifier: Natural Language :: English
Classifier: Natural Language :: Spanish
Classifier: Natural Language :: Portuguese
Classifier: Natural Language :: German
Classifier: Natural Language :: Dutch
Classifier: Natural Language :: Italian
License-File: LICENSE
Summary: Parse and convert numbers written in French, Spanish, English, Portuguese, German, Dutch or Italian into their digit representation.
Keywords: French,Spanish,English,Portuguese,German,Italian,Dutch,NLP,words-to-numbers
Author-email: Allo-Media <contact@allo-media.fr>
License-Expression: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Changelog, https://github.com/allo-media/text2num/releases
Project-URL: Documentation, https://text2num.readthedocs.io
Project-URL: Homepage, https://github.com/allo-media/text2num
Project-URL: Issues, https://github.com/allo-media/text2num/issues
Project-URL: Repository, https://github.com/allo-media/text2num

# text2num

A fast and robust text2num converter library a library for recognizing, parsing and transcribing into digits (base 10) numbers expressed in natural language. It works on strings as well as on custom token lists.

No IA involved: it is low on resources (and energy!) consumption and the latency is very small.

[![Documentation status](https://readthedocs.org/projects/text2num/badge/?version=latest)](https://text2num.readthedocs.io/en/latest/?badge=latest)

`text2num` is a python package that provides functions and parser classes for:

- Parsing numbers expressed as natural language words and converting them to integer values.
- Detection of ordinal, cardinal and decimal numbers in a stream of natural language words and get their decimal digit representations.

Supported natural languages (in alphabetical order):

* Dutch;
* English;
* French;
* German;
* Italian;
* Portuguese (Brazilian and European);
* Spanish.

## Versions 3.X vs 2.X

This new generation of `text2num` relies on a [new and improved algorithm implemented in Rust](https://crates.io/crates/text2num) whereas the 2.X branch
is in pure python and uses a less capable algorithm and has now been retired.

**You don't need Rust to install and run text2num!** as we provide precompiled wheels.

### Backward incompatible changes:

- dropped support for signed numbers — the feature was broken anyway;
- parsing mode is relaxed by default (i.e. "_greedy_") — you can use punctuation (e.g. commas) to separate groups in text, or voice pauses if processing Speech-to-Text token streams;
- the `threshold` optional parameter to `alpha2digits` now applies to both ordinals and cardinals.
  As a consequence the signature of `alpha2digits` has changed.
- the Russian and Catalan languages have not been ported yet.


## Installation

We provide pre-compiled wheels for Linux, MacOS and Windows:

|               | py 3.8 | py 3.9 | py 3.10 | py 3.11 | py 3.12 | py 3.13 |
| ------------- | -------|------- | --------|---------|---------|---------|
| Linux aarch64 |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| Linux armv7l  |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| Linux ppc64le |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| Linux s390x   |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| Linux x86_64  |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| Linux i686    |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| win32         |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| win amd64     |   ✅   |  ✅    |  ✅     |  ✅     |   ✅    |  ✅     |
| macos 11      |   ❌   |  ❌    |  ❌     |  ✅     |   ✅    |  ✅     |

So all you need to do is this:

```
pip install text2num
# or, if using an `uv` project:
uv add text2num
```

## Quickstart by example

Not every supported language is covered in these examples, but it gives you an idea.

### Parse and convert

French examples:

```python

    >>> from text_to_num import text2num
    >>> text2num('quatre-vingt-quinze', "fr")
    95

    >>> text2num('nonante-cinq', "fr")
    95

    >>> text2num('mille neuf cent quatre-vingt dix-neuf', "fr")
    1999

    >>> text2num('dix-neuf cent quatre-vingt dix-neuf', "fr")
    1999

    >>> text2num("cinquante et un million cinq cent soixante dix-huit mille trois cent deux", "fr")
    51578302

    >>> text2num('mille mille deux cents', "fr")
    Traceback (most recent call last):
        ...
    ValueError: invalid literal for text2num: 'mille mille deux cents'

```

English examples:

```python

    >>> from text_to_num import text2num

    >>> text2num("fifty-one million five hundred seventy-eight thousand three hundred two", "en")
    51578302

    >>> text2num("eighty-one", "en")
    81

```

Spanish examples:

```python

    >>> from text_to_num import text2num
    >>> text2num("ochenta y uno", "es")
    81

    >>> text2num("nueve mil novecientos noventa y nueve", "es")
    9999

    >>> text2num("cincuenta y tres millones doscientos cuarenta y tres mil setecientos veinticuatro", "es")
    53243724

```

Portuguese examples:

```python

    >>> from text_to_num import text2num
    >>> text2num("trinta e dois", "pt")
    32

    >>> text2num("mil novecentos e seis", "pt")
    1906

    >>> text2num("vinte e quatro milhões duzentos mil quarenta e sete", "pt")
    24200047

```

German examples:

```python

    >>> from text_to_num import text2num

    >>> text2num("einundfünfzigmillionenfünfhundertachtundsiebzigtausenddreihundertzwei", "de")
    51578302

    >>> text2num("ein und achtzig", "de")
    81

```

### Find and transcribe

Any numbers, even ordinals.

French:

```python

    >>> from text_to_num import alpha2digit
    >>> sentence = (
    ...     "Huit cent quarante-deux pommes, vingt-cinq chiens, mille trois chevaux, "
    ...     "douze mille six cent quatre-vingt-dix-huit clous.\n"
    ...     "Quatre-vingt-quinze vaut nonante-cinq. On tolère l'absence de tirets avant les unités : "
    ...     "soixante seize vaut septante six.\n"
    ...     "Nombres en série : douze, quinze, zéro zéro quatre, vingt, cinquante-deux, cent trois, cinquante deux, "
    ...     "trente et un.\n"
    ...     "Ordinaux: cinquième troisième vingt et unième centième mille deux cent trentième.\n"
    ...     "Décimaux: douze virgule quatre-vingt dix-neuf, cent vingt virgule zéro cinq ; "
    ...     "mais soixante zéro deux."
    ... )
    >>> print(alpha2digit(sentence, "fr"))
    842 pommes, 25 chiens, 1003 chevaux, 12698 clous.
    95 vaut 95. On tolère l'absence de tirets avant les unités : 76 vaut 76.
    Nombres en série : 12, 15, 004, 20, 52, 103, 52, 31.
    Ordinaux: 5ème 3ème 21ème 100ème 1230ème.
    Décimaux: 12,99, 120,05 ; mais 60 02.

    >>> sentence = "Cinquième premier deuxième troisième vingt et unième centième mille deux cent trentième."
    >>> print(alpha2digit(sentence, "fr"))
    5ème 1er 2ème 3ème 21ème 100ème 1230ème.

```

English:

```python

    >>> from text_to_num import alpha2digit

    >>> text = "On May twenty-third, I bought twenty-five cows, twelve chickens and one hundred twenty five point four zero kg of potatoes."
    >>> alpha2digit(text, "en")
    'On May 23rd, I bought 25 cows, 12 chickens and 125.40 kg of potatoes.'

    >>> alpha2digit("I finished the race in the twelfth position!", "en")
    'I finished the race in the 12th position!'

```

Spanish:

```python

    >>> from text_to_num import alpha2digit

    >>> text = "Compramos veinticinco vacas, doce gallinas y ciento veinticinco coma cuarenta kg de patatas."
    >>> alpha2digit(text, "es")
    'Compramos 25 vacas, 12 gallinas y 125,40 kg de patatas.'

    >>> text = "Compramos veinticinco vacas, doce gallinas y ciento veinticinco punto cuarenta kg de patatas."
    >>> alpha2digit(text, "es")
    'Compramos 25 vacas, 12 gallinas y 125.40 kg de patatas.'

    >>> text = "Ella ha quedado tercera"
    >>> alpha2digit(text, "es", threshold=0)
    'Ella ha quedado 3ª'

```

Portuguese:

```python

    >>> from text_to_num import alpha2digit

    >>> text = "Comprámos vinte e cinco vacas, doze galinhas e cento e vinte e cinco vírgula quarenta kg de batatas."
    >>> alpha2digit(text, "pt")
    'Comprámos 25 vacas, 12 galinhas e 125,40 kg de batatas.'

    >>> text = "Ordinais: quinto, terceiro, vigésima, vigésimo primeiro, centésimo quarto"
    >>> alpha2digit(text, "pt")
    'Ordinais: 5º, 3º, 20ª, 21º, 104º'

```

German:

```python

    >>> from text_to_num import alpha2digit

    >>> text = "Ich habe fünfundzwanzig Kühe, zwölf Hühner und einhundertfünfundzwanzig kg Kartoffeln gekauft."
    >>> alpha2digit(text, "de")
    'Ich habe 25 Kühe, 12 Hühner und 125 kg Kartoffeln gekauft.'

    >>> text = "Die Telefonnummer lautet dreiunddreißig neun sechzig null sechs zwölf einundzwanzig."
    >>> alpha2digit(text, "de")
    'Die Telefonnummer lautet 33 9 60 06 12 21.'

    >>> text = "Der zweiundzwanzigste Januar zweitausendzweiundzwanzig."
    >>> alpha2digit(text, "de")
    'Der 22. Januar 2022.'

    >>> text = "Es ist ein Buch mit dreitausend Seiten aber nicht das erste."
    >>> alpha2digit(text, "de", threshold=0)
    'Es ist 1 Buch mit 3000 Seiten aber nicht das 1..'

    >>> text = "Pi ist drei Komma eins vier und so weiter, aber nicht drei Komma vierzehn :-p"
    >>> alpha2digit(text, "de", threshold=0)
    'Pi ist 3,14 und so weiter, aber nicht 3 Komma 14 :-p'

```

### Working with tokens

Imagine that we have an ASR application that returns a transcript as a list of tokens (text, start timestamp, end timestamp)
where the timestamps are integers representing milliseconds relative to the beginning of the speech.

```python

from text_to_num import (Token, find_numbers)


class DecodedWord(Token):
    def __init__(self, text, start, end):
        self._text = text
        self.start = start
        self.end = end

    def text(self):
        return self._text

    def nt_separated(self, previous):
        # we consider a voice gap of more that 100 ms as significant
        return self.start - previous.end > 100


# Let's simulate ASR output

stream = [
    DecodedWord("We", 0, 100),
    DecodedWord("have", 100, 200),
    DecodedWord("respectively", 200, 400),
    DecodedWord("twenty", 400, 500),
    DecodedWord("nine", 610, 700),
    DecodedWord("and", 700, 800),
    DecodedWord("thirty", 800, 900),
    DecodedWord("four", 950, 1000),
    DecodedWord("dollars", 1010, 1410)
]

occurences = find_numbers(stream, "en")

for num in occurences:
    print(f"found number {num.text} ({num.value}) at range [{num.start}, {num.end}] in the stream")
```

When executed, that code snippet prints::

```
found number 20 (20.0) at range [3, 4] in the stream
found number 9 (9.0) at range [4, 5] in the stream
found number 34 (34.0) at range [6, 8] in the stream
```


Read the complete documentation on [ReadTheDocs](http://text2num.readthedocs.io/).

## Contribute

Join us on https://github.com/allo-media/text2num

