Metadata-Version: 2.3
Name: sibi-flux
Version: 2026.2.1
Summary: Sibi Toolkit: A collection of tools for Data Analysis/Engineering.
Author: Luis Valverde
Author-email: Luis Valverde <lvalverdeb@gmail.com>
Requires-Dist: pandas>=2.3.3
Requires-Dist: pyarrow>=22.0.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pydantic-settings>=2.12.0
Requires-Dist: dask>=2025.11.0
Requires-Dist: fsspec>=2025.10.0
Requires-Dist: s3fs>=2025.10.0
Requires-Dist: sqlalchemy>=2.0.44
Requires-Dist: psycopg2>=2.9.11
Requires-Dist: pymysql>=1.1.2
Requires-Dist: clickhouse-connect>=0.10.0
Requires-Dist: concurrent-log-handler>=0.9.28
Requires-Dist: rich>=14.2.0
Requires-Dist: filelock>=3.20.1
Requires-Dist: tqdm>=4.67.1
Requires-Dist: watchdog>=6.0.0
Requires-Dist: tornado==6.5.4
Requires-Dist: typer>=0.21.0
Requires-Dist: psutil>=6.1.1
Requires-Dist: httpx>=0.28.1
Requires-Dist: opentelemetry-api>=1.38.0
Requires-Dist: opentelemetry-exporter-otlp>=1.38.0
Requires-Dist: opentelemetry-sdk>=1.38.0
Requires-Dist: deep-translator>=1.11.4
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: distributed>=2025.11.0
Requires-Dist: sibi-flux[distributed,geospatial,mcp] ; extra == 'complete'
Requires-Dist: osmnx>=2.0.7 ; extra == 'geospatial'
Requires-Dist: geopandas>=1.1.2 ; extra == 'geospatial'
Requires-Dist: geopy>=2.4.1 ; extra == 'geospatial'
Requires-Dist: folium>=0.20.0 ; extra == 'geospatial'
Requires-Dist: osmium>=4.2.0 ; extra == 'geospatial'
Requires-Dist: shapely>=2.0.0 ; extra == 'geospatial'
Requires-Dist: networkx>=3.6.1 ; extra == 'geospatial'
Requires-Dist: mcp>=1.1.2 ; extra == 'mcp'
Requires-Dist: fastapi>=0.127.0 ; extra == 'mcp'
Requires-Dist: uvicorn>=0.40.0 ; extra == 'mcp'
Requires-Dist: httpx>=0.28.1 ; extra == 'mcp'
Requires-Python: >=3.11
Provides-Extra: complete
Provides-Extra: geospatial
Provides-Extra: mcp
Description-Content-Type: text/markdown

# SibiFlux: The Production Data Toolkit

SibiFlux is a production-grade data engineering toolkit designed for Python 3.11+. It provides a unified ecosystem for project scaffolding, declarative datacube management, automated resource generation, and agentic interoperability through the Model Context Protocol (MCP).

## Key Features

- **Declarative Datacubes**: Define data structures in YAML and auto-generate specialized Python classes.
- **Automated Init Engine**: Transform modular YAML configs and `.env` files into strongly-typed Pydantic `settings.py`.
- **High-Performance Storage**: Unified fsspec and PyArrow storage registry with depot isolation and path traversal protection.
- **Agentic Ready**: Native MCP (Model Context Protocol) integration to expose your data pipelines directly to AI agents.
- **Scalable Workloads**: First-class support for Dask and distributed computing.
- **Geospatial Power**: Integrated OSMnx helpers and PBF handling for complex GIS workflows.
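
To make the declarative style concrete, a datacube definition might look like the following. The field names and layout here are hypothetical illustrations of the idea, not the exact schema SibiFlux uses:

```yaml
# conf/cubes/sales.yaml -- hypothetical example, not the actual SibiFlux schema
name: sales
source:
  database: analytics        # key into conf/databases.yaml
  table: fact_sales
storage:
  depot: silver              # depot defined in conf/storage.yaml
  format: parquet
fields:
  - {name: order_id, type: int64}
  - {name: amount,   type: float64}
  - {name: sold_at,  type: timestamp}
```

From a definition like this, the toolkit generates a specialized Python class so downstream code never hard-codes table names or storage paths.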

---

## Tech Stack

- **Core**: Python 3.11+
- **Configuration**: [Pydantic Settings v2](https://docs.pydantic.dev/latest/concepts/pydantic_settings/)
- **CLI Framework**: [Typer](https://typer.tiangolo.com/)
- **Data Engineering**: [Pandas](https://pandas.pydata.org/), [PyArrow](https://arrow.apache.org/docs/python/index.html), [Dask](https://www.dask.org/)
- **Storage Layer**: [fsspec](https://filesystem-spec.readthedocs.io/), [s3fs](https://s3fs.readthedocs.io/)
- **Databases**: SQLAlchemy, ClickHouse, MySQL, PostgreSQL
- **Web/API**: FastAPI, Uvicorn, MCP
- **Orchestration**: Docker, Docker Compose, uv

---

## Prerequisites

- **Python 3.11+**
- **[uv](https://github.com/astral-sh/uv)**: Extremely fast Python package manager.
- **Docker & Docker Compose**: For local services (Postgres, ClickHouse).
- **Poethepoet**: Task runner (installed via dev dependencies).

---

## Getting Started

### 1. Clone & Install

```bash
git clone https://github.com/lvalverdeb/sibi-flux.git
cd sibi-flux
uv sync --all-extras
```

### 2. Environment Setup

Copy the example environment file:

```bash
cp .env.linux .env
```

Initialize the project configuration:

```bash
# This generates conf/settings.py and conf/credentials/
sibi-flux env --env-file .env
```

### 3. Start Local Infrastructure

```bash
docker-compose up -d
```

### 4. Project Scaffolding

```bash
# Initialize a new project structure
sibi-flux init my_project --app

# Create a specific application within the project
sibi-flux create_app my_app
```

### 5. Datacube Workflow

```bash
# Propose cubes from a database domain
sibi-flux propose-cubes my_db_domain my_app

# Generate the app-specific extensions
sibi-flux create-cubes my_app
```

---

## Architecture

### Directory Structure

```text
├── src/sibi_flux/
│   ├── init/             # Project bootstrapping & logic generation
│   ├── datacube/         # Datacube orchestration, discovery & mapping
│   ├── storage/          # Unified storage manager (fsspec + pyarrow)
│   ├── mcp/              # MCP Router & Resource registration
│   ├── osmnx_helper/     # Geospatial PBF handling & Graph loading
│   ├── config/           # Unified settings & config management
│   ├── parquet/          # Optimized parquet I/O with gatekeeping
│   ├── df_helper/        # Rich DataFrame manipulation utils
│   └── orchestration/    # Dask/Distributed workload management
├── conf/                 # Generated configuration (settings.py)
├── solutions/            # Reference implementations & examples
├── test_prj/             # Standard test project layout
└── Dockerfile            # Production-grade container image
```

### Data Lifecycle

1. **Declarative Phase**: Developers define discovery rules and datacube mappings in `conf/discovery_params.yaml`.
2. **Generation Phase**: `sibi-flux dc` commands scan data sources and generate Python wrappers.
3. **Runtime Phase**:
   - `ProjectService` detects the root and loads environment-specific `settings.py`.
   - `StorageManager` establishes connections to S3/WebDAV/Local.
   - `Datacube` instances provide high-level abstractions over PyArrow/Dask.
4. **Exposure Phase**: The `MCP Router` registers datacubes as resources for external consumption (e.g., by LLM agents).

### Unified Storage Registry

SibiFlux implements a robust storage registry that isolates distinct "depots" (e.g., `bronze`, `silver`, `gold`).

- **Isolation**: Each depot is mapped to a designated path.
- **Performance**: Automatic switching between `fsspec` and the native (C++) PyArrow `S3FileSystem`.
- **Security**: Built-in path traversal protection (blocking `..` in joined segments).
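
The traversal check can be sketched like this. The function name and signature are hypothetical, shown only to illustrate the normalize-then-verify pattern, not SibiFlux's actual API:

```python
import posixpath

def safe_join(depot_root: str, *segments: str) -> str:
    """Join path segments under a depot root, rejecting traversal attempts.

    Illustrative sketch of the idea -- not the actual SibiFlux implementation.
    """
    # Normalize first so "a/../b" collapses before we compare prefixes.
    joined = posixpath.normpath(posixpath.join(depot_root, *segments))
    root = posixpath.normpath(depot_root)
    # Anything that escapes the depot root (via ".." or an absolute
    # segment) will no longer share the root prefix after normalization.
    if not (joined == root or joined.startswith(root + "/")):
        raise ValueError(f"Security Risk: Path traversal detected in {segments!r}")
    return joined
```

Normalizing before the prefix check is the key step: it catches both relative escapes (`../secrets`) and absolute segments (`/etc/passwd`), which `posixpath.join` would otherwise let reset the path.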

---

## Environment Configuration

### Required Variables

| Variable | Description | Default |
|---|---|---|
| `DB_URL` | Primary database connection string | `sqlite:///:memory:` |
| `FS_TYPE` | Storage type (`s3`, `file`, `webdav`) | `s3` |
| `STORAGE_PATH` | Base path for all data artifacts | - |
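
A minimal `.env` for local development, using only the variables from the table above, might look like this (the values are examples, not project defaults):

```bash
# .env -- example local setup
DB_URL=postgresql+psycopg2://postgres:postgres@localhost:5432/sibi
FS_TYPE=file
STORAGE_PATH=/var/lib/sibi-flux/data
```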

### Modular YAMLs

The `init` engine populates `conf/` with modular YAML files:

- `storage.yaml`: S3 layouts and filesystem definitions.
- `databases.yaml`: Database connection mapping.
- `osmnx.yaml`: Geospatial storage references.
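
As an illustration, a `storage.yaml` mapping depots to filesystems might take a shape like the following. The keys shown are hypothetical, meant only to convey the depot-to-path mapping described above:

```yaml
# conf/storage.yaml -- hypothetical shape, for illustration only
depots:
  bronze:
    fs_type: s3
    path: s3://my-bucket/bronze
  silver:
    fs_type: s3
    path: s3://my-bucket/silver
  local_cache:
    fs_type: file
    path: /var/lib/sibi-flux/cache
```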

---

## Available Scripts (Poe)

| Command | Description |
|---|---|
| `poe dev` | Start development server with reload |
| `poe test` | Run the full pytest suite |
| `poe dc-sync` | Synchronize datacube registry |
| `poe dc-discover` | Discover new data structures |
| `poe test-snapshots` | Run regression snapshot tests |

---

## Testing

SibiFlux uses snapshot-based ("golden master") testing to verify its code-generation output.

```bash
# Run all tests
pytest tests/

# Update snapshots (Golden Masters)
UPDATE_SNAPSHOTS=1 pytest
```
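
The golden-master pattern behind `UPDATE_SNAPSHOTS=1` can be sketched as below. This is an illustrative helper, not the suite's actual fixture code:

```python
import os
from pathlib import Path

def assert_matches_snapshot(generated: str, snapshot_path: Path) -> None:
    """Compare generated output against a golden-master file on disk.

    With UPDATE_SNAPSHOTS=1 (or on first run), the snapshot is (re)written
    instead of asserted. Illustrative pattern only -- the real test suite's
    helpers may differ.
    """
    if os.environ.get("UPDATE_SNAPSHOTS") == "1" or not snapshot_path.exists():
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(generated)
        return
    expected = snapshot_path.read_text()
    assert generated == expected, f"Output diverged from snapshot {snapshot_path}"
```

Reviewing snapshot diffs in version control is what makes this style effective: any change to the generators shows up as a readable diff against the committed golden files.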

---

## Deployment

### Docker

The project includes a production-ready multi-stage `Dockerfile`.

```bash
docker build -t sibi-flux:latest .
```

### Docker Compose

The `docker-compose.yml` orchestrates:

- **API**: The main SibiFlux web service (FastAPI).
- **Dask Scheduler**: For distributed task execution.
- **Postgres/ClickHouse**: Local data stores for development.

---

## Troubleshooting

- **ImportError**: Ensure you have run `uv sync` to install all dependencies.
- **Connection Error**: Check your `.env` variables and ensure Docker services are running.
- **Storage Risk**: If you see "Security Risk: Path traversal detected," ensure your dynamic paths do not contain `..` or absolute prefixes.

---

## Contributing

1. Follow the **Conventional Commits** standard.
2. Ensure all tests pass (`poe test`).
3. Add snapshots for any new code generation logic.
