Metadata-Version: 2.4
Name: aadil_nazar_sindhi_nlp
Version: 1.1.3
Summary: A comprehensive Sindhi NLP Suite (Lemmatizer & Spellchecker)
Home-page: https://github.com/aadilnazar/sindhi_nlp
Author: Aadil Nazar
Author-email: adilhussainburiro14912@gmail.com
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-python
Dynamic: summary


# Sindhi NLP Suite (Aadil Nazar)

## 👤 About the Author

**Aadil Nazar** is a **Data Engineer** and **Computational Linguistics Researcher**. This suite is the result of intensive research into low-resource language digitization, designed to provide the same level of NLP sophistication for Sindhi that exists for major global languages.

---

## 🛠 Project Overview

The **Sindhi NLP Suite** is a high-performance, integrated toolkit for processing the Sindhi language. It moves beyond simple string matching by incorporating **Morphological Analysis**, **Orthographic Confusion Mapping**, and **Levenshtein Edit Distance** algorithms.

### **Core Capabilities:**

#### **1. The Sindhi Lemmatizer (Morphological Engine)**

This isn't just a stemmer; it’s a rule-based lemmatizer that understands the grammatical structure of Sindhi.

* **Verb Stemming:** Handles complex suffixes like `ائيندا` (future habitual) or `يندڙ` (habitual participle) to find the base root.
* **Noun Pluralization Rules:** Automatically reverts plurals (ending in `يون`, `ون`, `ين`) to their singular masculine or feminine forms.
* **Linguistic Metadata:** Returns POS tags, gender (masculine/feminine), and number (singular/plural) for every analyzed token.
* **Synonym Support:** Integrated WordNet lookup to provide contextual synonyms.
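
The suffix-stripping idea behind the noun rules can be sketched as follows. This is a minimal illustration, not the package's implementation: `TOY_LEXICON`, `PLURAL_SUFFIXES`, and `toy_analyze` are hypothetical names, the lexicon holds a single word, and the bare `ن` suffix is an assumption added only so that the `ڪتابن` example used later in this README resolves.

```python
# Hypothetical toy lexicon mapping a root to its POS tag.
# The real package ships its own dictionary.
TOY_LEXICON = {"ڪتاب": "noun"}

# Plural suffixes named in the list above, longest first so that
# 'يون' is tried before 'ون'. The bare 'ن' is an assumption (see lead-in).
PLURAL_SUFFIXES = ("يون", "ون", "ين", "ن")


def toy_analyze(word, lexicon=TOY_LEXICON):
    """Return root, tag, and number for a word using suffix stripping."""
    if word in lexicon:
        return {"root": word, "tag": lexicon[word], "number": "singular"}
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Accept the stem only if the lexicon confirms it is a real root.
            if stem in lexicon:
                return {"root": stem, "tag": lexicon[stem], "number": "plural"}
    return {"root": word, "tag": "unknown", "number": "unknown"}


info = toy_analyze("ڪتابن")
print(info["root"], info["tag"], info["number"])
```

Checking each candidate stem against a lexicon is what separates this from blind stemming: a word that merely happens to end in a plural-looking suffix is left untouched.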

#### **2. The Sindhi Spellchecker (Logic-Driven)**

A morphology-aware spellchecker that reduces false positives (valid inflected forms flagged as misspellings) by cross-referencing with the Lemmatizer.

* **Confusion Map System:** Uses a custom mapping to handle phonetically similar characters that are frequently swapped in digital typing (e.g., `ھ` vs `ح`, `س` vs `ص`, `ز` vs `ذ`).
* **Edit Distance 1 & 2:** Suggests corrections that are one or two single-character edits (insertion, deletion, or substitution) away from the input.
* **Normalization:** Strips diacritics (Zabar, Zer, Pesh) automatically to ensure matching is based on core orthography.
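
A rough sketch of the normalization and confusion-map steps is shown below. It assumes the diacritics to strip are the Arabic-script marks in the range U+064B to U+0652 (which covers Zabar, Zer, and Pesh) and uses only the three character pairs listed above. `normalize`, `confusion_variants`, and `CONFUSION_PAIRS` are illustrative names, not the package's API, and the real confusion map may be larger.

```python
import re

# Arabic-script vowel marks U+064B..U+0652, which include
# Zabar (U+064E), Pesh (U+064F), and Zer (U+0650).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Confusable pairs taken from the list above (assumed subset of the real map).
CONFUSION_PAIRS = [("ھ", "ح"), ("س", "ص"), ("ز", "ذ")]

# Build a two-way lookup so each letter maps to its confusable partner.
SWAPS = {}
for first, second in CONFUSION_PAIRS:
    SWAPS[first] = second
    SWAPS[second] = first


def normalize(word):
    """Strip diacritics so comparison uses the bare letters only."""
    return DIACRITICS.sub("", word)


def confusion_variants(word):
    """Return every spelling reachable by swapping one confusable letter."""
    variants = set()
    for i, ch in enumerate(word):
        if ch in SWAPS:
            variants.add(word[:i] + SWAPS[ch] + word[i + 1:])
    return variants


# 'اسلاح' is 'اصلاح' typed with 'س' in place of 'ص'.
candidates = confusion_variants(normalize("اسلاح"))
print("اصلاح" in candidates)
```

Because each variant changes exactly one letter, the candidate set grows linearly with word length and stays cheap to check against a dictionary.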

---

## 🚀 Installation

```bash
pip install aadil-nazar-sindhi-nlp
```

---

## 💻 Technical Usage

### **Advanced Spellchecking with Suggestions**

The spellchecker first checks the dictionary, then the lemma, then the confusion map, and finally calculates edit distances.

```python
from aadil_nazar_sindhi_nlp import SindhiSpellchecker

checker = SindhiSpellchecker()

# Test a word with an orthographic confusion error.
# Input: 'اسلاح', i.e. 'اصلاح' typed with 'س' in place of 'ص'
result = checker.check("اسلاح")

print(f"Correct: {result['correct']}")
print(f"Suggestions: {result['suggestions']}")
```
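
The lookup order described above (dictionary, lemma, confusion map, edit distance) can be approximated with the toy cascade below. Everything here is an assumption made for illustration: `toy_check` and `edits1` are hypothetical helpers, the dictionary is passed in as a plain Python set, and the rule that a word whose lemma is known counts as correct is a guess at how the stages combine. Only the `correct` and `suggestions` keys mirror the result shown in the example above.

```python
def edits1(word, alphabet):
    """All strings one insertion, deletion, or substitution away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + replaces + inserts)


def toy_check(word, dictionary, lemma_of=None, swaps=None):
    """Run the four lookup stages in order and stop at the first one that hits."""
    # Stage 1: exact dictionary match.
    if word in dictionary:
        return {"correct": True, "suggestions": []}
    # Stage 2: the word's lemma is known (lemma_of is a caller-supplied function).
    if lemma_of is not None and lemma_of(word) in dictionary:
        return {"correct": True, "suggestions": []}
    # Stage 3: swap one confusable letter and look the result up.
    swaps = swaps or {}
    confused = {
        word[:i] + swaps[ch] + word[i + 1:]
        for i, ch in enumerate(word)
        if ch in swaps
    }
    hits = sorted(confused & dictionary)
    if hits:
        return {"correct": False, "suggestions": hits}
    # Stage 4: edit distance 1, falling back to distance 2 only if needed.
    alphabet = {ch for entry in dictionary for ch in entry}
    near = edits1(word, alphabet)
    hits = sorted(near & dictionary)
    if not hits:
        far = {e2 for e1 in near for e2 in edits1(e1, alphabet)}
        hits = sorted(far & dictionary)
    return {"correct": False, "suggestions": hits}


toy_dictionary = {"اصلاح", "ڪتاب"}
print(toy_check("اسلاح", toy_dictionary, swaps={"س": "ص", "ص": "س"}))
```

Ordering the stages from cheapest to most expensive means the distance-2 search, which generates by far the most candidates, runs only when every earlier stage has failed.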

### **Deep Linguistic Analysis**

Use the Lemmatizer to extract a word's root, part-of-speech tag, and grammatical number.

```python
from aadil_nazar_sindhi_nlp import SindhiLemmatizer

lem = SindhiLemmatizer()

# Analyze a plural inflected word: 'ڪتابن'
data = lem.analyze_word("ڪتابن")

print(f"Root: {data['root']}")      # Output: ڪتاب
print(f"Tag: {data['tag']}")        # Output: noun
print(f"Number: {data['number']}")  # Output: plural
```

---

## 📊 Data Engineering & Performance

* **O(1) Lookup:** Dictionary validation uses Python sets and dicts (hash maps), so each membership test takes average-case constant time regardless of dictionary size.
* **LRU Caching:** Variant generation is wrapped in `functools.lru_cache`, so words that repeat across sentences are not recomputed.
* **Unicode Standardized:** Handles the 52-letter Sindhi alphabet as UTF-8 text without corrupting characters.
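
The two performance techniques listed above combine roughly as in this sketch. `WORDS`, `is_known`, and `deletion_variants` are hypothetical stand-ins rather than names from the package, and single-character deletion is used only as a simple example of a variant generator.

```python
from functools import lru_cache

# Hypothetical stand-in for the package dictionary; a frozenset is hash-based,
# so membership tests take constant time on average.
WORDS = frozenset({"اصلاح", "ڪتاب"})


def is_known(word):
    """Average-case O(1) membership test."""
    return word in WORDS


@lru_cache(maxsize=10_000)
def deletion_variants(word):
    """Every string formed by deleting one character from word.

    The result is a frozenset so the cached value cannot be mutated by callers.
    """
    return frozenset(word[:i] + word[i + 1:] for i in range(len(word)))


# The second call with the same word is answered from the cache.
deletion_variants("ڪتابن")
deletion_variants("ڪتابن")
print(deletion_variants.cache_info())
```

`lru_cache` requires hashable arguments, which plain strings satisfy. Returning an immutable value avoids a common pitfall where one caller mutates a cached list and silently changes what later callers receive.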

---

## 🗺 The Roadmap (Growing Ecosystem)

This package is the foundation. As a researcher, I am actively building and will soon integrate:

* **Sindhi POS Tagger:** To identify parts of speech in full sentence contexts.
* **Named Entity Recognition (NER):** For extracting names, dates, and locations.
* **Stopword Filtering:** For cleaning Sindhi text for Machine Learning models.

---

## ⚖ License

Distributed under the **MIT License**. See `LICENSE.txt` for details.
