Metadata-Version: 2.4
Name: gdeltnews-rs
Version: 0.1.3
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Requires-Dist: requests>=2.0
Requires-Dist: tqdm>=4.0
Requires-Dist: boolean-py>=5.0
License-File: LICENSE
Summary: Fast GDELT Web NGrams news reconstruction engine (Rust-powered)
Keywords: gdelt,webngrams,news,nlp,rust
Author: Kerem Tugberk Capraz
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# gdeltnews-rs

Rust-powered reconstruction of full-text news articles from the [GDELT Web News NGrams 3.0](https://blog.gdeltproject.org/announcing-the-new-web-news-ngrams-3-0-dataset/) dataset.

This is a from-scratch reimplementation of the [gdeltnews](https://github.com/iandreafc/gdeltnews) Python package by Andrea Fronzetti Colladon and Roberto Vestrelli. It preserves the same reconstruction algorithm and output quality while delivering significantly faster performance through a Rust core exposed via PyO3.

## Why This Exists

The original `gdeltnews` package demonstrated that full-text news articles can be reconstructed from GDELT's fragmented n-gram data with up to 95% textual fidelity (see the [paper](https://doi.org/10.3390/bdcc10020045)). However, the Python implementation is CPU-bound on the reconstruction step — the authors reported **1 hour 8 minutes** to process 39 files on a single core.

This package reduces that to seconds.

## Performance

Benchmarked on a 1 MB gzipped GDELT Web NGrams file (42K fragments, 18 articles):

| Implementation | Time | Speedup |
|---|---|---|
| Original Python (single core) | 12.3s | 1x |
| **gdeltnews-rs** | **0.18s** | **67x** |

On a larger 55 MB file (687K fragments, 1,176 articles), **gdeltnews-rs** completed in **11.3s**. The Python benchmark on this file was stopped early because it was impractically slow, so there is no direct comparison. However, the original authors reported **1 hour 8 minutes** to process 39 files on a single core, roughly 1.7 minutes per file. Extrapolating from the 1 MB benchmark ratio, the real speedup on larger files is likely well above 67x: larger articles with more fragments amplify Rust's advantage, because zero-copy slice comparisons avoid the repeated list allocations that dominate the Python runtime.

These numbers include gzip decompression, JSON parsing, fragment creation, overlap-based assembly, and JSONL output — the entire pipeline. The Rust version also reads `.gz` files directly, eliminating the separate decompression step.

## What Changed from the Original

### Architecture

- **Rust core via PyO3**: The CPU-intensive path (JSON parsing, gzip decompression, fragment creation, overlap assembly, JSONL output) is implemented in Rust and compiled to a native Python extension. Users install with `pip` and get compiled wheels — no Rust toolchain needed.
- **Python wrapper**: Download (HTTP/I/O bound) and filtermerge (light CPU) remain in Python. The `__init__.py` re-exports everything for a clean API (sketched below).
- **Rayon parallelism**: The original used Python's `multiprocessing.Pool`, which requires serializing data across process boundaries. Rust's rayon provides low-overhead work-stealing thread parallelism at two levels: across files and across articles within each file. No `freeze_support()` needed.
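
As a rough sketch, the re-export layer can be as simple as the following. The extension module name `_engine` is a placeholder invented for illustration; only the public names match the API shown in this README.

```python
# python/gdeltnews_rs/__init__.py -- illustrative sketch, not the actual file.
# "_engine" is a placeholder name for the compiled PyO3 extension module.
from ._engine import reconstruct_file, reconstruct_file_to_jsonl  # Rust core
from .download import download        # HTTP/I/O-bound, stays in Python
from .reconstruct import reconstruct  # high-level wrapper over the Rust core
from .filtermerge import filtermerge  # Boolean filtering + URL dedup

__all__ = [
    "download",
    "reconstruct",
    "reconstruct_file",
    "reconstruct_file_to_jsonl",
    "filtermerge",
]
```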

### Output Format

- **JSONL instead of pipe-delimited CSV**: The original used `|` as a delimiter with `QUOTE_NONE`, which breaks silently when article text contains `|` or newlines. JSONL (one JSON object per line) handles all text content safely and is natively supported by pandas: `pd.read_json("output.jsonl", lines=True)`.
- **Reconstruction metadata**: Each output record includes `fragments_used` and `fragments_total` counts, so you can assess reconstruction completeness per article.

### Algorithm

The reconstruction algorithm is **identical** to the original: greedy maximum-overlap assembly with position constraints. This was a deliberate decision. The original algorithm achieves up to 95% textual fidelity as validated against EventRegistry ground truth. The performance gain comes entirely from the implementation language (see the sketch after this list):

- **Zero-copy slice comparison**: Python's `result_words[-k:] == words[:k]` creates two new lists on every comparison. Rust's `result[len-k..] == candidate[..k]` compares memory in place.
- **No garbage collector**: Python allocates and deallocates thousands of temporary list objects during the overlap search. Rust works on borrowed slices into existing buffers, with no heap allocation in the search loop.
- **Native string handling**: No interpreter overhead on the tight inner loops.
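
For intuition, here is a minimal Python sketch of greedy maximum-overlap assembly. It is an illustration of the idea, not the package's actual code, and it omits the position constraints and cleanup passes described elsewhere in this README:

```python
def best_overlap(result_words, words):
    """Largest k such that the last k words of result_words
    equal the first k words of words."""
    for k in range(min(len(result_words), len(words)), 0, -1):
        # This slice comparison is the Python hot spot: it allocates two
        # temporary lists on every probe, which the Rust core avoids by
        # comparing borrowed slices in place.
        if result_words[-k:] == words[:k]:
            return k
    return 0


def assemble(fragments):
    """Greedily extend the text with whichever remaining fragment
    overlaps the current suffix the most."""
    remaining = [f.split() for f in fragments]
    result = remaining.pop(0)
    while remaining:
        k_best, i_best = 0, None
        for i, words in enumerate(remaining):
            k = best_overlap(result, words)
            if k > k_best:
                k_best, i_best = k, i
        if i_best is None:
            break  # nothing left overlaps the current suffix; stop early
        result.extend(remaining.pop(i_best)[k_best:])
    return " ".join(result)


frags = ["the quick brown fox", "brown fox jumps over", "jumps over the lazy dog"]
print(assemble(frags))  # -> the quick brown fox jumps over the lazy dog
```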

### What Was Removed

- **GUI**: The tkinter GUI is not included. This package is library-only.
- **CSV output**: Replaced by JSONL. The filtermerge module reads JSONL.
- **Decompression step**: The Rust engine reads `.gz` files directly via `flate2`. No need to decompress to disk first.
- **`freeze_support()` / multiprocessing boilerplate**: Rayon threads work in any context including Jupyter notebooks.

### What Was Kept

- **Same reconstruction quality**: The greedy overlap algorithm, position constraints, slash artifact removal, and circular overlap cleanup are all preserved exactly.
- **Same download logic**: HTTP downloads from `data.gdeltproject.org` with the same URL format and time range enumeration.
- **Same Boolean query syntax**: `AND`, `OR`, `NOT`, parentheses, and quoted phrases for filtermerge (see the example after this list).
- **Same language/URL filtering**: Filter by language code and URL substrings.
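
For example, all of these constructs can combine in a single query. The query string below is purely illustrative:

```python
from gdeltnews_rs import filtermerge

filtermerge(
    "output",
    "final.jsonl",
    query='("climate change" OR "global warming") AND policy AND NOT opinion',
)
```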

## Install

```bash
pip install gdeltnews-rs
```

Pre-built wheels are provided for common platforms. If no wheel matches your platform, `pip` will build from source (requires a Rust toolchain).

For development:

```bash
git clone <this-repo>
cd gdeltnews
python -m venv .venv && source .venv/bin/activate
pip install maturin requests tqdm "boolean.py>=5.0"
maturin develop --release
```

## Quickstart

### Step 1: Download GDELT files

```python
from gdeltnews_rs import download

download(
    "2025-01-15T10:00:00",
    "2025-01-15T13:59:00",
    outdir="gdeltdata",
)
```

Unlike the original package, there is no `decompress` parameter. The Rust engine reads `.gz` files directly.

### Step 2: Reconstruct articles

```python
from gdeltnews_rs import reconstruct

reconstruct(
    "gdeltdata",
    "output",
    language="en",
    url_filters=["reuters.com", "nytimes.com"],
)
```

This processes all `.gz` and `.json` files in the input directory in parallel and writes one JSONL file per input file to the output directory.

Works in scripts, notebooks, anywhere — no `freeze_support()` needed.

### Step 3: Filter and deduplicate

```python
from gdeltnews_rs import filtermerge

filtermerge(
    "output",
    "final.jsonl",
    query='(elections OR primaries) AND democrats AND NOT republicans',
)
```

Reads all JSONL files from the output directory, applies the Boolean query, deduplicates by URL (keeping the longest text), and writes a single JSONL file.
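
Conceptually, the dedup step keeps one record per URL, preferring the longest reconstructed text. A minimal sketch of that idea in plain Python (not the package's internal code):

```python
import json
from pathlib import Path


def dedup_by_url(jsonl_dir):
    """Keep, for each URL, the record with the longest text field."""
    best = {}
    for path in sorted(Path(jsonl_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                url = record["url"]
                if url not in best or len(record["text"]) > len(best[url]["text"]):
                    best[url] = record
    return list(best.values())
```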

### Low-level API

For more control, you can call the Rust engine directly:

```python
from gdeltnews_rs import reconstruct_file, reconstruct_file_to_jsonl

# Get articles as Python objects
articles = reconstruct_file("path/to/file.json.gz", language="en")
for article in articles:
    print(article.url, len(article.text), f"{article.fragments_used}/{article.fragments_total}")

# Or write directly to JSONL
count = reconstruct_file_to_jsonl("input.json.gz", "output.jsonl", language="en")
```

## Output Format

Each line in the JSONL output is a JSON object:

```json
{
  "text": "Full reconstructed article text...",
  "date": "2025-01-15",
  "url": "https://www.reuters.com/article/...",
  "source": "reuters.com",
  "fragments_used": 285,
  "fragments_total": 285
}
```
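
Because the output is plain JSONL, it loads directly into pandas. The snippet below is an illustrative check that computes per-article reconstruction completeness from the metadata fields:

```python
import pandas as pd

df = pd.read_json("final.jsonl", lines=True)

# 1.0 means every fragment found for the article was placed during assembly.
df["completeness"] = df["fragments_used"] / df["fragments_total"]
print(df[["url", "completeness"]].sort_values("completeness").head())
```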

## Project Structure

```
gdeltnews-rs/
├── Cargo.toml                  # Rust dependencies and build config
├── pyproject.toml              # maturin-based Python packaging
├── src/
│   ├── lib.rs                  # PyO3 module entry point
│   ├── parse.rs                # JSON parsing + gzip decompression
│   ├── fragment.rs             # Fragment creation from pre/ngram/post
│   ├── assemble.rs             # Greedy maximum-overlap assembly
│   └── pipeline.rs             # File + article level parallelism
└── python/
    └── gdeltnews_rs/
        ├── __init__.py         # Public API re-exports
        ├── download.py         # GDELT HTTP downloads
        ├── reconstruct.py      # High-level reconstruction wrapper
        └── filtermerge.py      # Boolean query filtering + URL dedup
```

## Credits and Citation

This package is built on the methodology and research of:

**Andrea Fronzetti Colladon** (Roma Tre University) and **Roberto Vestrelli** (University of Perugia), who designed the reconstruction algorithm, validated it against ground truth data, and released the original [gdeltnews](https://github.com/iandreafc/gdeltnews) Python package.

If you use this package in research, please cite their paper:

> Fronzetti Colladon, A., & Vestrelli, R. (2026). Free Access to World News: Reconstructing Full-Text Articles from GDELT. *Big Data and Cognitive Computing*, 10(2), 45. [https://doi.org/10.3390/bdcc10020045](https://doi.org/10.3390/bdcc10020045)

## License

GPL-3.0, same as the original package.

