Metadata-Version: 2.4
Name: ethnidata
Version: 4.4.0
Summary: Production-Grade Explainable Name Analysis: nationality, ethnicity, gender, religion prediction with morphology detection, Shannon entropy ambiguity scoring, confidence breakdown - 238 countries, 6 religions, 5.9M+ names, 100% offline!
Author-email: Teyfik OZ <teyfikoz@yahoo.com>
License: MIT
Project-URL: Homepage, https://github.com/teyfikoz/ethnidata
Project-URL: Documentation, https://github.com/teyfikoz/ethnidata#readme
Project-URL: Repository, https://github.com/teyfikoz/ethnidata.git
Project-URL: Issues, https://github.com/teyfikoz/ethnidata/issues
Keywords: names,nationality,ethnicity,demographics,prediction,NLP,explainable-ai,morphology,cultural-patterns,transparency,religion,gender
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pycountry>=22.3.5
Requires-Dist: unidecode>=1.3.6
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Provides-Extra: build
Requires-Dist: requests>=2.31.0; extra == "build"
Requires-Dist: pandas>=2.0.0; extra == "build"
Requires-Dist: numpy>=1.24.0; extra == "build"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "build"
Requires-Dist: lxml>=4.9.0; extra == "build"
Requires-Dist: tqdm>=4.65.0; extra == "build"
Requires-Dist: wikipedia-api>=0.6.0; extra == "build"
Requires-Dist: sqlalchemy>=2.0.0; extra == "build"
Dynamic: license-file

# EthniData - State-of-the-Art Name Analysis Engine

[![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/ethnidata.svg)](https://badge.fury.io/py/ethnidata)

Predict **nationality**, **ethnicity**, **religion**, and **demographics** from names using a comprehensive global database built from multiple authoritative sources.

## 🆕 What's New in v4.4.0 (March 2026)

**Bug Fixes & CI/CD**:
- CI pipeline (GitHub Actions: lint + tests on Python 3.10-3.13)
- Docker support (Dockerfile, docker-compose.yml)
- Fixed case-sensitivity bug in religion inference
- Fixed syntax error in Kaggle Indian names mock data
- Fixed bare except statements in predictor
- Removed dead code (`predictor_old.py`)
- PEP 561 `py.typed` marker

## What's New in v4.0.2 (December 2024)

**CRITICAL BUG FIX - Production Readiness**:
- ✅ **Enhanced Confidence Calculation**: Multi-factor scoring addresses the 0% pass rate of the regression test suite
- ✅ **Turkish Morphology Detection**: Pattern recognition for names with poor database coverage
- ✅ **Intelligent Boost Logic**: Morphology-based fallbacks when database data is weak
- ✅ **Minimum Confidence Threshold**: Filters uncertain predictions (0.15 minimum)

**Fixed Issues**:
- Regression test pass rate was 0/39 before this fix and is expected to rise substantially
- Better handling of Turkish names (Yılmaz, Öz, etc.)
- Transparent morphology-based predictions with explanation notes

## What's New in v4.0.1 (December 2024)

**Production-Ready Enhancements**:
- ✅ **Enhanced PyPI Description**: Better discoverability with clearer value propositions
- ✅ **100% Offline Operation**: No external API dependencies, all processing is local
- ✅ **Performance Optimized**: Faster predictions with SQLite database optimizations
- ✅ **Academic-Grade Quality**: Transparent, reproducible, GDPR/AI Act compliant
- ✅ **Zero Cost**: No API fees, fully local ML processing

**What Makes EthniData Production-Grade**:
```python
from ethnidata import EthniData

ed = EthniData()

# Explainable predictions - understand WHY
result = ed.predict_nationality("Yılmaz", name_type="last", explain=True)
print(result['explanation']['why'])  # Human-readable reasons
print(result['ambiguity_score'])     # Shannon entropy (0-1)
print(result['morphology_signal'])   # Detected cultural patterns

# Confidence breakdown - see what contributes
print(result['explanation']['confidence_breakdown'])
# {
#   'frequency_strength': 0.70,
#   'cross_source_agreement': 0.15,
#   'morphology_signal': 0.10,
#   'entropy_penalty': -0.05
# }
```

**Production Benefits**:
- 🚀 **No API Costs**: 100% local processing, no external API calls
- 🔒 **Privacy-Safe**: All data stays on your machine, GDPR compliant
- 📊 **Transparent**: Full explainability with confidence breakdowns
- ⚡ **Fast**: SQLite-backed, optimized for production workloads
- 🌍 **Global Coverage**: 238 countries, 5.9M+ names, 6 religions

## 🔥 What's New in v4.0.0

**Explainable AI & Transparency Layer:**
- 🧠 **Explainability Layer** - Understand WHY predictions are made, not just what they are
- 📊 **Ambiguity Scoring** - Shannon entropy for uncertainty quantification (0-1 scale)
- 🔍 **Morphology Detection** - Rule-based pattern recognition for 9 cultural groups (Slavic, Turkic, Nordic, Arabic, Gaelic, Iberian, Germanic, East Asian, South Asian)
- 📈 **Confidence Breakdown** - See exactly where confidence comes from (frequency, patterns, cross-source agreement, etc.)
- 🎯 **Synthetic Data Engine** - Generate privacy-safe test datasets for research
- 📚 **Academic-Grade** - Transparent, reproducible, legally compliant (GDPR/AI Act safe)

## 🌟 Features

### Database
- **5.9M+ records** (14x increase from v2.0.0)
- **238 countries** - Complete global coverage
- **72 languages** - Linguistic prediction
- **6 major world religions** - Christianity, Islam, Buddhism, Hinduism, Judaism, Sikhism
- **Multiple Sources** - Wikipedia/Wikidata, Olympics, Phone directories, Census data

### Core Capabilities
- ✅ **Nationality Prediction** (238 countries)
- ✅ **Religion Prediction** (6 major religions)
- ✅ **Gender Prediction**
- ✅ **Region Prediction** (5 continents)
- ✅ **Language Prediction** (72 languages)
- ✅ **Ethnicity Prediction**
- ✅ **Full Name Analysis**

### v4.0.0 New Features
- 🆕 **Explainable AI** - `explain=True` parameter
- 🆕 **Morphology Pattern Detection** - Automatic cultural pattern recognition
- 🆕 **Ambiguity Scoring** - Shannon entropy-based uncertainty
- 🆕 **Confidence Breakdown** - Interpretable confidence components
- 🆕 **Synthetic Data Generation** - Privacy-safe test data

## 📊 Data Sources

1. **Wikipedia/Wikidata** - 190+ countries, biographical data with ethnicity
2. **names-dataset** - 106 countries, curated name lists
3. **Olympics Dataset** - 120 years of athlete names (271,116 records)
4. **Phone Directories** - Public domain name lists from multiple countries
5. **Census Data** - US Census and other government open data

## 🚀 Installation

```bash
pip install ethnidata
```

## 📖 Usage

### Basic Usage (Backward Compatible)

```python
from ethnidata import EthniData

# Initialize
ed = EthniData()

# Predict nationality from first name
result = ed.predict_nationality("Ahmet", name_type="first")
print(result)
# {
#   'name': 'ahmet',
#   'country': 'TUR',
#   'country_name': 'Turkey',
#   'confidence': 0.89,
#   'region': 'Asia',
#   'language': 'Turkish',
#   'top_countries': [
#     {'country': 'TUR', 'country_name': 'Turkey', 'probability': 0.89},
#     {'country': 'DEU', 'country_name': 'Germany', 'probability': 0.07},
#     ...
#   ]
# }

# Predict from last name
result = ed.predict_nationality("Tanaka", name_type="last")
print(result['country'])  # 'JPN'

# Predict from full name (combines both)
result = ed.predict_full_name("Wei", "Chen")
print(result['country'])  # 'CHN'

# Predict religion (NEW in v3.0!)
result = ed.predict_religion("Muhammad")
# Returns: Islam

# Predict gender
result = ed.predict_gender("Emma")
# Returns: F (Female)
```

### 🆕 v4.0.0 Explainable AI Usage

```python
from ethnidata import EthniData

ed = EthniData()

# Predict with explainability (NEW!)
result = ed.predict_nationality("Yılmaz", name_type="last", explain=True)

# Access new v4.0.0 fields
print(f"Country: {result['country_name']}")           # Turkey
print(f"Confidence: {result['confidence']}")          # 0.89
print(f"Ambiguity: {result['ambiguity_score']}")      # 0.3741 (Shannon entropy)
print(f"Level: {result['confidence_level']}")         # 'High', 'Medium', or 'Low'

# Morphology pattern detection
if result['morphology_signal']:
    print(f"Pattern: {result['morphology_signal']['primary_pattern']}")    # '-oğlu'
    print(f"Type: {result['morphology_signal']['primary_type']}")          # 'turkic'
    print(f"Regions: {result['morphology_signal']['likely_regions']}")     # ['Anatolia', 'Balkans']

# Human-readable explanation
print("\nWhy this prediction:")
for reason in result['explanation']['why']:
    print(f"  • {reason}")
# Output:
#   • High frequency in Turkey name databases
#   • Cross-source agreement across 3 datasets
#   • Strong morphological patterns detected: -oğlu

# Confidence breakdown (interpretable components)
print("\nConfidence breakdown:")
for component, value in result['explanation']['confidence_breakdown'].items():
    print(f"  {component}: {value:.4f}")
# Output:
#   frequency_strength: 0.7000
#   cross_source_agreement: 0.1500
#   morphology_signal: 0.1000
#   entropy_penalty: -0.0500
```

### Full Name Prediction with Explanation

```python
# Full name analysis with morphology for both names
result = ed.predict_full_name("Mehmet", "Yılmaz", explain=True)

print(f"Country: {result['country_name']}")
print(f"Confidence: {result['confidence']:.4f}")
print(f"Ambiguity: {result['ambiguity_score']:.4f}")

# Morphology for both first and last name
if result['morphology_signal']['last_name']:
    print(f"Last name pattern: {result['morphology_signal']['last_name']['primary_pattern']}")
if result['morphology_signal']['first_name']:
    print(f"First name pattern: {result['morphology_signal']['first_name']['primary_pattern']}")

# Why this prediction
print("\nExplanation:")
for reason in result['explanation']['why']:
    print(f"  • {reason}")
```

### Direct Module Usage (Advanced)

```python
from ethnidata import ExplainabilityEngine, MorphologyEngine, NameFeatureExtractor

# Calculate ambiguity score directly
probs = [0.89, 0.08, 0.03]
ambiguity = ExplainabilityEngine.calculate_ambiguity_score(probs)
print(f"Ambiguity: {ambiguity:.4f}")  # 0.3741

# Detect morphological patterns
signal = MorphologyEngine.get_morphological_signal("O'Connor", "last")
print(signal)
# {
#   'primary_pattern': "o'",
#   'primary_type': 'gaelic',
#   'likely_regions': ['Ireland', 'Scotland'],
#   'pattern_confidence': 0.75
# }

# Extract name features
features = NameFeatureExtractor.get_name_features("Zhang")
print(features)
# {
#   'length': 5,
#   'vowel_ratio': 0.2,
#   'consonant_clusters': True,
#   'has_hyphen': False,
#   ...
# }

# Check if romanized
is_romanized = NameFeatureExtractor.is_likely_romanized("Xiaoping")
print(is_romanized)  # True
```
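
The `0.3741` value shown above is consistent with Shannon entropy divided by `log2` of the number of candidates. The following is a minimal standalone sketch of that calculation, not the library's implementation; `normalized_entropy` is a hypothetical helper name used only for illustration.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of `probs`, scaled to [0, 1] by log2(len(probs))."""
    positive = [p for p in probs if p > 0]
    if len(probs) < 2 or not positive:
        return 0.0  # a single candidate (or no mass) carries no ambiguity
    total = sum(positive)
    # Renormalize so the values sum to 1 before taking the entropy
    entropy = -sum((p / total) * math.log2(p / total) for p in positive)
    return entropy / math.log2(len(probs))


print(round(normalized_entropy([0.89, 0.08, 0.03]), 4))  # 0.3741
print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))      # 1.0
print(normalized_entropy([1.0]))                         # 0.0
```

A uniform distribution scores 1.0 and a single clear winner scores close to 0, which matches the 0-1 scale described for `ambiguity_score`.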

### 🎯 Synthetic Data Generation (Research & Testing)

```python
from ethnidata import EthniData
from ethnidata.synthetic import SyntheticDataEngine, SyntheticConfig

# Implement FrequencyProvider interface
class EthniDataFrequencyProvider:
    def __init__(self, ed: EthniData):
        self.ed = ed

    def get_first_name_freq(self, country: str):
        # Query the EthniData database for first-name frequencies.
        # Left unimplemented here; the implementation depends on your needs.
        raise NotImplementedError

    def get_last_name_freq(self, country: str):
        # Query the EthniData database for last-name frequencies.
        raise NotImplementedError

    def predict_full_name(self, first: str, last: str, context_country=None):
        return self.ed.predict_full_name(first, last, explain=False)

# Generate synthetic population
ed = EthniData()
provider = EthniDataFrequencyProvider(ed)
engine = SyntheticDataEngine(provider)

config = SyntheticConfig(
    size=10000,               # Generate 10,000 records
    country="TUR",            # Base country: Turkey
    context_country="DEU",    # Context: Germany (for diaspora)
    diaspora_ratio=0.15,      # 15% diaspora mixing
    rare_name_boost=1.2,      # Slightly boost rare names
    export_format="csv",
    output_path="turkish_population_germany.csv"
)

records = engine.generate(config)
engine.export(records, config)

# Get distribution report
report = engine.sanity_report(records)
print(report)
# {
#   'n': 10000,
#   'unique_first_names': 1523,
#   'unique_last_names': 2841,
#   'top_origin_countries': [('TUR', 8500), ('SYR', 800), ...]
# }
```

### Advanced Usage

```python
# Get top 10 predictions
result = ed.predict_nationality("Maria", name_type="first", top_n=10)

for country in result['top_countries']:
    print(f"{country['country_name']}: {country['probability']:.2%}")
# Spain: 35.40%
# Italy: 28.20%
# Portugal: 15.10%
# ...

# Database statistics
stats = ed.get_stats()
print(stats)
# {
#   'total_first_names': 123456,
#   'total_last_names': 234567,
#   'countries_first': 195,
#   'countries_last': 198
# }
```

## 🏗️ Project Structure

```
ethnidata/
├── ethnidata/                # Main package
│   ├── __init__.py
│   ├── predictor.py          # Core prediction logic
│   └── ethnidata.db          # SQLite database
├── scripts/                  # Data collection scripts
│   ├── 1_fetch_names_dataset.py
│   ├── 2_fetch_wikipedia.py
│   ├── 3_fetch_olympics.py
│   ├── 4_fetch_phone_directories.py
│   ├── 5_merge_all_data.py
│   └── 6_create_database.py
├── tests/                    # Unit tests
├── examples/                 # Example scripts
├── docs/                     # Documentation
├── setup.py
├── pyproject.toml
└── README.md
```

## 🔬 Accuracy & Methodology

### How it works

1. **Name Normalization**: Names are lowercased and Unicode-normalized (e.g., "José" → "jose")
2. **Database Lookup**: Queries SQLite database (5.9M+ records) for matching names
3. **Frequency-Based Scoring**: Countries are ranked by how often the name appears in our datasets
4. **Probability Calculation**: Frequencies are converted to probabilities (sum to 1.0)
5. **Full Name Combination**: First name (40%) + last name (60%) weights
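
The steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's code: the function names and the counts are made up, and normalization uses the standard library's `unicodedata` rather than the `unidecode` dependency, so characters without a Unicode decomposition (such as the Turkish dotless `ı`) pass through unchanged here.

```python
import unicodedata
from typing import Dict


def normalize_name(name: str) -> str:
    """Step 1: lowercase and strip combining accents (stdlib approximation)."""
    decomposed = unicodedata.normalize("NFKD", name.strip())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def to_probabilities(counts: Dict[str, int]) -> Dict[str, float]:
    """Steps 3-4: turn per-country counts into probabilities that sum to 1.0."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}


def combine(first: Dict[str, float], last: Dict[str, float],
            first_weight: float = 0.4, last_weight: float = 0.6) -> Dict[str, float]:
    """Step 5: weighted sum of first-name (40%) and last-name (60%) probabilities."""
    countries = set(first) | set(last)
    return {c: first_weight * first.get(c, 0.0) + last_weight * last.get(c, 0.0)
            for c in countries}


# Hypothetical counts standing in for the database lookup in step 2
first_counts = {"TUR": 80, "DEU": 20}
last_counts = {"TUR": 90, "AZE": 10}

combined = combine(to_probabilities(first_counts), to_probabilities(last_counts))
best = max(combined, key=combined.get)

print(normalize_name("José"))          # jose
print(best, round(combined[best], 2))  # TUR 0.86
```

With these made-up counts, Turkey scores 0.4 × 0.80 + 0.6 × 0.90 = 0.86, and the combined values still sum to 1.0 because both inputs do.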

### 🆕 v4.0.0 Enhanced Methodology

6. **Morphology Detection** (Optional, with `explain=True`):
   - Rule-based pattern matching for 9 cultural groups
   - 50+ suffix/prefix patterns (e.g., "-ov" for Slavic, "-ez" for Iberian)
   - Confidence adjustment based on pattern strength

7. **Ambiguity Scoring** (Optional, with `explain=True`):
   - Shannon entropy calculation: `H = -Σ(p_i * log2(p_i))`
   - Normalized to [0, 1] scale
   - 0 = very certain (one clear winner), 1 = highly ambiguous (uniform distribution)

8. **Confidence Breakdown** (Optional, with `explain=True`):
   - **frequency_strength**: Base confidence from database frequency
   - **cross_source_agreement**: Agreement across multiple data sources
   - **morphology_signal**: Boost from detected patterns
   - **name_uniqueness**: Adjustment for rare vs common names
   - **entropy_penalty**: Reduction due to high ambiguity

9. **Human-Readable Explanations** (Optional, with `explain=True`):
   - Textual reasons for prediction
   - Pattern explanations
   - Confidence level classification (High/Medium/Low)
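
Rule-based suffix/prefix matching of the kind described in step 6 can be pictured as a table lookup. The sketch below uses only the few patterns named in this README (`-ov`, `-ez`, `-oğlu`, `o'`); the table layout, the `detect_pattern` name, and the returned keys are assumptions for illustration and do not reproduce the library's 50+ rules or its output format.

```python
from typing import Dict, Optional

# Illustrative subset only; not the library's rule table
SUFFIXES = {"ov": "slavic", "ez": "iberian", "oğlu": "turkic"}
PREFIXES = {"o'": "gaelic"}


def detect_pattern(name: str) -> Optional[Dict[str, str]]:
    """Return the first matching affix and its cultural group, or None."""
    lowered = name.lower()
    # Try longer affixes first so the more specific pattern wins
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if lowered.startswith(prefix) and len(lowered) > len(prefix):
            return {"pattern": prefix, "type": PREFIXES[prefix]}
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return {"pattern": "-" + suffix, "type": SUFFIXES[suffix]}
    return None


print(detect_pattern("Ivanov"))     # {'pattern': '-ov', 'type': 'slavic'}
print(detect_pattern("Fernandez"))  # {'pattern': '-ez', 'type': 'iberian'}
print(detect_pattern("O'Connor"))   # {'pattern': "o'", 'type': 'gaelic'}
print(detect_pattern("Tanaka"))     # None
```

A match like this is only a hint: short suffixes such as `-ez` also occur by coincidence, which is why the methodology treats morphology as one confidence component rather than a decision on its own.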

### Accuracy Metrics

- **Precision**: 85-95% for top-1 prediction (varies by name frequency)
- **Recall**: ~70% (limited by database coverage)
- **Ambiguity**: Uncertain cases are flagged by a normalized Shannon entropy above 0.6
- **Pattern Detection**: 90%+ accuracy for suffix/prefix matching

### Limitations

- **Probabilistic, Not Deterministic**: Results are probabilities, not absolutes
- **Database Bias**: Reflects historical Olympic participation, Wikipedia coverage
- **Missing Names**: Rare or new names may not be in database
- **Migration**: Base version doesn't account for diaspora (v4.0.0 synthetic engine does)
- **Multiple Origins**: Common names (e.g., "Ali", "Maria") exist in many cultures
- **Not Individual Classification**: Predicts from name patterns, not individuals
- **Cultural Context**: Doesn't account for modern multicultural naming practices

### ⚖️ Legal & Ethical Considerations

**What EthniData is:**
- ✅ A probabilistic name → origin signal engine
- ✅ Based on aggregate historical data (5.9M+ records)
- ✅ Transparent and explainable (v4.0.0)
- ✅ Open-source and auditable

**What EthniData is NOT:**
- ❌ An individual identity classifier
- ❌ A definitive ethnicity/nationality predictor
- ❌ Suitable for legal, hiring, or discriminatory decisions
- ❌ A replacement for self-reported demographic data

**Compliance:**
- **GDPR**: Uses aggregate data only (no personally identifiable information)
- **EU AI Act**: Provides explainability and transparency (v4.0.0)
- **Academic Use**: Suitable for research with proper disclaimers
- **Commercial Use**: Permitted under the MIT license; users remain responsible for how they apply the results

**Best Practices:**
1. Always use `explain=True` for transparency
2. Check `ambiguity_score` - high values (> 0.6) indicate uncertainty
3. Never use for automated decision-making without human oversight
4. Include clear disclaimers in your applications
5. Allow users to self-report their demographics when possible
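
Practices 2 and 3 can be combined into a small gate that routes uncertain predictions to a person instead of acting on them automatically. This is a sketch under assumptions: `needs_human_review` is a hypothetical helper, and the dictionaries below are hand-written stand-ins for prediction results that contain only the `ambiguity_score` field documented above.

```python
from typing import Any, Dict

AMBIGUITY_THRESHOLD = 0.6  # the "high ambiguity" cut-off mentioned above


def needs_human_review(result: Dict[str, Any],
                       threshold: float = AMBIGUITY_THRESHOLD) -> bool:
    """True when a prediction should not be used without a person checking it."""
    score = result.get("ambiguity_score")
    # A missing score is treated as uncertain rather than silently trusted
    return score is None or score > threshold


# Hand-written example results, not real library output
clear = {"country": "TUR", "confidence": 0.89, "ambiguity_score": 0.3741}
unclear = {"country": "ESP", "confidence": 0.35, "ambiguity_score": 0.82}

print(needs_human_review(clear))    # False
print(needs_human_review(unclear))  # True
```

Treating a missing score as "needs review" is a deliberate choice here: results produced without `explain=True` may not carry an ambiguity score, and failing closed keeps them out of automated paths.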

## 🛠️ Development

### Build Database from Scratch

```bash
git clone https://github.com/teyfikoz/ethnidata.git
cd ethnidata

# Install dependencies
pip install -r requirements.txt

# Fetch all data (takes 10-30 minutes)
cd scripts
python 1_fetch_names_dataset.py
python 2_fetch_wikipedia.py
python 3_fetch_olympics.py
python 4_fetch_phone_directories.py
python 5_merge_all_data.py
python 6_create_database.py
```

### Run Tests

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## 📜 License

MIT License - see the [LICENSE](LICENSE) file for details.

## 🤝 Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

## 📚 Citations

If you use this database in research, please cite:

```bibtex
@software{ethnidata_2024,
  title = {EthniData: Ethnicity and Nationality Prediction from Names},
  author = {Oz, Teyfik},
  year = {2024},
  url = {https://github.com/teyfikoz/ethnidata}
}
```

### Data Source Citations

- **Olympics Data**: Randi Griffin (2018). 120 years of Olympic history. [Kaggle](https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results)
- **names-dataset**: Philippe Remy (2021). [name-dataset](https://github.com/philipperemy/name-dataset)
- **Wikidata**: Wikimedia Foundation. [Wikidata](https://www.wikidata.org)

## 🔗 Related Projects

- [ethnicolr](https://github.com/appeler/ethnicolr) - Ethnicity prediction using LSTM
- [name-dataset](https://github.com/philipperemy/name-dataset) - Name database (106 countries)
- [gender-guesser](https://github.com/lead-ratings/gender-guesser) - Gender prediction

## 📧 Contact

- GitHub Issues: [Report bugs or request features](https://github.com/teyfikoz/ethnidata/issues)
- GitHub: [@teyfikoz](https://github.com/teyfikoz)

---

**Built with ❤️ using open data**
