Metadata-Version: 2.4
Name: f1-blog-pipeline
Version: 0.1.0
Summary: A library for generating F1 blog posts from RSS feeds using AI
Author: Adrian
Project-URL: Homepage, https://github.com/yourusername/f1-blog-pipeline
Project-URL: Repository, https://github.com/yourusername/f1-blog-pipeline
Keywords: f1,formula-1,blog,rss,ai,gemini
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: google-genai>=0.4.0
Requires-Dist: playwright>=1.40.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"

# F1 Blog Pipeline

A Python library for generating F1 blog posts from RSS feeds. It extracts full articles from F1 news sources and turns them into comprehensive blog posts with Google's Gemini API.

## Features

- 📰 **RSS Feed Parsing** with date filtering support
- 🤖 **AI-Powered Blog Generation** using Gemini API
- 🌐 **Web Scraping** with Playwright for full article extraction
- 📅 **Flexible Date Filtering** (today, yesterday, specific dates, date ranges)
- 🔧 **Library-First Design** - use programmatically in your applications
- ⚙️ **Configurable** feeds, extraction settings, and generation parameters
- 🛡️ **Robust Error Handling** with partial results on failures

## Installation

### Basic Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/f1-blog-pipeline.git
cd f1-blog-pipeline

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
```

### Development Installation

```bash
# Install in editable mode
pip install -e .
```

## Quick Start

```python
from f1_blog_pipeline import F1BlogPipeline

# Initialize pipeline with API key
pipeline = F1BlogPipeline(api_key='your-gemini-api-key')

# Run full pipeline
result = pipeline.run_full_pipeline(filter_date='today')

# Check results
if result.success:
    print(f"Extracted {result.articles_extracted} articles")
    print(f"Blog: {len(result.blog_content)} characters")
```

## Configuration

### Environment Variables

Create a `.env` file (optional):

```bash
# Required
GEMINI_API_KEY=your_gemini_api_key_here

# Optional
GEMINI_MODEL=gemini-2.0-flash
```

**Note:** The library does NOT load `.env` files automatically. Your application is responsible for credential management (use `python-dotenv` or your own config system).

**Security:** Never commit `.env` files to version control; the `.env` file is in `.gitignore` by default. Use `.env` files for local development only, and rely on your CI/CD platform's secrets management for production deployments.

### Custom Feeds

```python
custom_feeds = {
    "Autosport": "https://www.autosport.com/rss/feed/f1",
    "The Race": "https://www.the-race.com/feed/",
    "RaceFans": "https://www.racefans.net/feed/"
}

pipeline = F1BlogPipeline(
    api_key='your-key',
    feeds=custom_feeds,
    max_per_feed=10
)
```

### Configuration File

Copy `config.yaml.example` to `config.yaml` and customize:

```yaml
feeds:
  Autosport: "https://www.autosport.com/rss/feed/f1"
  Motorsport: "https://www.motorsport.com/rss/f1/news"

extraction:
  max_per_feed: 5
  timeout: 30
  delay: 2

generation:
  model: "gemini-2.0-flash"
  temperature: 0.7
```
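The library does not ship a loader for this file, but the sections map directly onto the constructor parameters listed under API Reference. A minimal sketch of that mapping, assuming the config has already been parsed into a dict (e.g. with `yaml.safe_load`):

```python
# Sketch: flatten a parsed config mapping into F1BlogPipeline keyword
# arguments. Section and key names follow config.yaml.example above;
# this loader is an illustration, not part of the library.
cfg = {
    "feeds": {"Autosport": "https://www.autosport.com/rss/feed/f1"},
    "extraction": {"max_per_feed": 5, "timeout": 30, "delay": 2},
    "generation": {"model": "gemini-2.0-flash", "temperature": 0.7},
}

# Flatten the extraction and generation sections into one kwargs dict.
kwargs = {"feeds": cfg["feeds"], **cfg["extraction"], **cfg["generation"]}

# Then construct the pipeline with it:
# pipeline = F1BlogPipeline(api_key="your-key", **kwargs)
```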

## Usage

### Basic Pipeline

```python
from f1_blog_pipeline import F1BlogPipeline

# Initialize
pipeline = F1BlogPipeline(api_key='your-key')

# Run full pipeline
result = pipeline.run_full_pipeline(filter_date='today')

# Access results
print(result.success)              # True/False
print(result.articles_extracted)   # Number of articles
print(result.blog_content)         # Generated blog markdown
print(result.errors)               # List of errors if any
```

### Individual Stages

```python
# Just extract articles
articles = pipeline.extract_articles(filter_date='today')

# Generate blog from existing articles
blog = pipeline.generate_blog(articles, author='Your Name')
```

### Custom Configuration

```python
import logging

pipeline = F1BlogPipeline(
    api_key='your-key',
    feeds={'Source': 'https://...'},
    max_per_feed=10,
    timeout=45000,
    delay=3,
    model='gemini-2.0-pro',
    temperature=0.8,
    max_tokens=8000,
    output_dir='./output',
    logger_level=logging.DEBUG
)
```

### Credential Management

The library is designed to work with any credential management strategy:

```python
# Option 1: Pass API key directly
pipeline = F1BlogPipeline(api_key='sk-...')

# Option 2: Use environment variable (your app sets it)
import os
os.environ['GEMINI_API_KEY'] = get_key_from_your_config()
pipeline = F1BlogPipeline()  # Falls back to env var

# Option 3: Load from .env in your app (library doesn't do this)
from dotenv import load_dotenv
load_dotenv()  # YOUR responsibility, not library's
pipeline = F1BlogPipeline()
```

### Date Filtering

```python
# Today's articles only
result = pipeline.run_full_pipeline(filter_date='today')

# Yesterday
result = pipeline.run_full_pipeline(filter_date='yesterday')

# Specific date
result = pipeline.run_full_pipeline(filter_date='2026-02-22')

# Last N days
result = pipeline.run_full_pipeline(days_back=7)

# No filter (latest articles per feed)
result = pipeline.run_full_pipeline()
```

## Examples

See the `examples/` directory for detailed usage examples:

- [`basic_usage.py`](examples/basic_usage.py) - Basic library usage
- [`custom_feeds.py`](examples/custom_feeds.py) - Custom RSS feeds
- [`date_range.py`](examples/date_range.py) - Date filtering examples

## Output Formats

### Extracted Articles

The pipeline saves extracted articles to `articles/extracted_articles.txt` in this format:

```text
================================================================================
Source: Autosport
Title: Hamilton explains Mercedes 2026 strategy shift
Authors: ['Scott Mitchell-Malm']
URL: https://www.autosport.com/f1/news/...

[Full article text content with proper formatting and paragraphs...]

================================================================================
Source: Motorsport
Title: Red Bull confirms major technical update for Bahrain
...
```
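Because each record is delimited by a fixed line of `=` characters with `Key: Value` headers, the file is easy to load back into structured data. A sketch of such a parser (not part of the library; field names mirror the sample above):

```python
# Sketch: split extracted_articles.txt back into per-article records.
SAMPLE = """\
================================================================================
Source: Autosport
Title: Hamilton explains Mercedes 2026 strategy shift
URL: https://www.autosport.com/f1/news/example

Full article text content...
"""

def parse_articles(text: str) -> list[dict]:
    records = []
    # Records are separated by an 80-character line of '='.
    for chunk in text.split("=" * 80):
        lines = chunk.strip().splitlines()
        if not lines:
            continue
        record, body = {}, []
        for line in lines:
            key, sep, value = line.partition(": ")
            if sep and key in {"Source", "Title", "Authors", "URL"}:
                record[key.lower()] = value
            else:
                body.append(line)
        record["text"] = "\n".join(body).strip()
        records.append(record)
    return records

articles = parse_articles(SAMPLE)
```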

### Generated Blog Post

The pipeline generates a markdown blog post with YAML frontmatter:

```markdown
---
title: "2026 Testing Unpacked: Regulation Chaos and Team Performance"
date: 2026-02-22
author: "Adrian"
tags: [f1, testing, 2026-season, bahrain, ferrari, mercedes, red-bull]
summary: "Welcome to the 2026 era. If pre-season testing in Bahrain..."
---

# 2026 Testing Unpacked: Regulation Chaos and Team Performance

Welcome to the 2026 era. If pre-season testing in Bahrain has taught us anything...

## Team Performance Analysis

**Mercedes** emerged as the early pace-setters:
- Russell topped FP1 with a 1:31.247
- New power unit showing promising reliability
- Aerodynamic package generating consistent downforce

**Ferrari** showed mixed results:
- Leclerc struggled with rear stability in high-speed corners
- Team working on suspension geometry adjustments
- Power unit performance on par with expectations

## Technical Updates

The new 2026 regulations have forced teams to rethink...
```
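Static site generators usually consume this frontmatter directly, but if you post-process the output yourself, the header can be split off with a few string operations. A sketch that handles only simple `key: value` lines (a real YAML parser such as PyYAML is more robust):

```python
# Sketch: separate the YAML frontmatter from the markdown body of a
# generated post. Illustration only; not part of the library.
POST = """\
---
title: "2026 Testing Unpacked"
date: 2026-02-22
author: "Adrian"
---

# 2026 Testing Unpacked

Welcome to the 2026 era...
"""

def split_frontmatter(post: str) -> tuple[dict, str]:
    # Drop everything up to the opening '---' fence, then split the
    # header from the body at the closing fence.
    _, _, rest = post.partition("---\n")
    header, _, body = rest.partition("---\n")
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(": ")
        meta[key] = value.strip('"')
    return meta, body.strip()

meta, body = split_frontmatter(POST)
```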

## Project Structure

```
f1_app/
├── f1_blog_pipeline/          # Main library package
│   ├── __init__.py           # Public API
│   ├── pipeline.py           # Pipeline orchestrator
│   ├── core/                 # Core modules
│   │   ├── rss_parser.py     # RSS feed parsing with date filtering
│   │   ├── article_extractor.py  # Article extraction
│   │   ├── blog_generator.py     # AI blog generation
│   │   └── text_cleaner.py       # Text cleaning utilities
├── examples/                 # Usage examples
├── articles/                 # Extracted articles (generated)
├── posts/                    # Generated blog posts (generated)
├── requirements.txt          # Dependencies
├── pyproject.toml           # Package configuration
├── config.yaml.example      # Example configuration
└── README.md                # This file
```

## API Reference

### F1BlogPipeline

Main pipeline class for orchestrating the entire workflow.

```python
class F1BlogPipeline:
    def __init__(
        self,
        api_key: Optional[str] = None,
        feeds: Optional[Dict[str, str]] = None,
        max_per_feed: int = 5,
        timeout: int = 30000,
        delay: int = 2,
        model: str = "gemini-2.0-flash",
        temperature: float = 0.7,
        max_tokens: int = 4000,
        output_dir: Optional[str] = None,
        logger_level: Optional[int] = None
    )
```

### PipelineResult

Result object returned by pipeline execution.

```python
@dataclass
class PipelineResult:
    success: bool                          # Overall success
    articles_extracted: int                # Number of articles
    articles: List[Dict[str, Any]]        # Extracted articles
    blog_content: Optional[str]           # Generated blog
    errors: List[Dict[str, str]]          # Errors
    warnings: List[str]                   # Warnings
```
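Since the result is a plain dataclass, code that consumes pipeline results can be unit-tested against a hand-built instance instead of running the real pipeline. A stand-in mirroring the fields above (the class name here is hypothetical):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Stand-in mirroring the PipelineResult fields, for testing downstream
# code without invoking the real pipeline.
@dataclass
class FakePipelineResult:
    success: bool
    articles_extracted: int
    articles: List[Dict[str, Any]] = field(default_factory=list)
    blog_content: Optional[str] = None
    errors: List[Dict[str, str]] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

result = FakePipelineResult(success=True, articles_extracted=2,
                            blog_content="# Draft")
```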

## Advanced Usage

### Batch Processing

Generate blog posts for multiple dates:
```python
from datetime import datetime, timedelta
from f1_blog_pipeline import F1BlogPipeline

pipeline = F1BlogPipeline(api_key='your-key')

# Generate posts for last 7 days
for days_ago in range(7):
    date = datetime.now() - timedelta(days=days_ago)
    date_str = date.strftime('%Y-%m-%d')
    
    result = pipeline.run_full_pipeline(filter_date=date_str)
    if result.success:
        with open(f'posts/blog_{date_str}.md', 'w', encoding='utf-8') as f:
            f.write(result.blog_content)
        print(f"✓ Generated blog for {date_str}")
```

### Custom Model Configuration

Test different Gemini models:
```python
pipeline = F1BlogPipeline(
    api_key='your-key',
    model='gemini-2.0-pro',
    temperature=0.9,  # More creative
    max_tokens=8000   # Longer output
)
result = pipeline.run_full_pipeline(filter_date='today')
```

Available models:
- `gemini-2.0-flash` (fast, good quality, cost-effective)
- `gemini-2.0-pro` (higher quality, slower, more expensive)
- `gemini-2.0-flash-exp-preview` (experimental features)

## Error Handling

The pipeline continues on partial failures:

```python
result = pipeline.run_full_pipeline(filter_date='today')

if result.success:
    print(f"Success! {result.articles_extracted} articles")
else:
    print("Pipeline failed")
    for error in result.errors:
        print(f"[{error['stage']}] {error['message']}")

# Check warnings even on success
for warning in result.warnings:
    print(f"Warning: {warning}")
```

## Requirements

- Python 3.10+
- google-genai >= 0.4.0
- playwright >= 1.40.0
- python-dotenv >= 1.0.0

## Troubleshooting

### Playwright Browsers Not Installed

```bash
playwright install chromium
```

### API Key Not Found

```python
# Option 1: Pass directly
pipeline = F1BlogPipeline(api_key='your-key')

# Option 2: Use environment variable
import os
os.environ['GEMINI_API_KEY'] = 'your-key'
pipeline = F1BlogPipeline()

# Option 3: Load from .env file
from dotenv import load_dotenv
load_dotenv()
pipeline = F1BlogPipeline()
```

### Bot Detection / 403 Errors

Increase delay between requests:
```python
pipeline = F1BlogPipeline(delay=5)  # 5 seconds
result = pipeline.run_full_pipeline(filter_date='today')
```

## License

MIT

## Contributing

Contributions welcome! Please open an issue or submit a pull request.

## Acknowledgments

- Powered by [Google Gemini API](https://ai.google.dev/)
- Browser automation by [Playwright](https://playwright.dev/)
- RSS feeds from Autosport, Motorsport.com, and other F1 news sources
