Metadata-Version: 2.4
Name: feature-sql-tool
Version: 0.2.1
Summary: SQL feature lineage analysis and unified vector SQL generation toolkit
Author-email: Your Name <you@example.com>
License: MIT
Project-URL: Homepage, https://github.com/rwgunner/feature_sql_tool
Project-URL: Repository, https://github.com/rwgunner/feature_sql_tool
Project-URL: Issues, https://github.com/rwgunner/feature_sql_tool/issues
Keywords: sql,lineage,feature,sqlglot,ml
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sqlglot<28,>=25
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Dynamic: license-file

# feature-sql-tool

`feature_sql_tool` is a Python library for:

- parsing SQL feature definitions with `sqlglot`
- extracting feature lineage down to physical source columns
- tracking intermediate computed aliases
- classifying filter-only intermediate features
- building a unified SQL query for multiple features with safe reusable CTE detection

## Main ideas

The library treats SQL as a dependency graph rather than plain text. It separates:

- value dependencies
- filter dependencies
- join dependencies
- computed intermediate aliases
- physical source columns

This makes it possible to:

- explain how a final feature is computed
- detect shared computation between multiple features
- generate a reusable SQL plan for model input vectors
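
To make the dependency-graph idea concrete, here is a minimal, standard-library-only sketch. It is not the library's internal model: `DependencyGraph`, the edge-kind constants, and the column names are hypothetical and chosen only for illustration. It shows how typed edges let you resolve a feature down to physical source columns and spot intermediates that only influence filters.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical edge kinds mirroring the dependency types listed above.
VALUE, FILTER, JOIN = "value", "filter", "join"


@dataclass
class DependencyGraph:
    # node -> list of (upstream node, edge kind). A node without an entry
    # is treated as a physical source column.
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def add(self, node: str, upstream: str, kind: str) -> None:
        self.edges.setdefault(node, []).append((upstream, kind))

    def upstream(self, node: str, kinds: set[str] | None = None) -> set[str]:
        """Return every node reachable from `node` via edges of the given kinds."""
        seen: set[str] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent, kind in self.edges.get(current, []):
                if (kinds is None or kind in kinds) and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = DependencyGraph()
feature = "avg_paid_amount_30d"
graph.add(feature, "paid_amount", VALUE)        # computed intermediate alias
graph.add(feature, "is_recent", FILTER)         # only used in a WHERE clause
graph.add(feature, "clients.client_id", JOIN)   # join key
graph.add("paid_amount", "payments.amount", VALUE)
graph.add("is_recent", "payments.paid_at", VALUE)

everything = graph.upstream(feature)
via_values = graph.upstream(feature, {VALUE})

# Physical source columns: reachable nodes that have no upstream of their own.
source_columns = {n for n in everything if n not in graph.edges}

# Filter-only intermediates: computed aliases never reached through a pure value path.
filter_only_intermediates = {n for n in everything - via_values if n in graph.edges}

print(sorted(source_columns))
print(sorted(filter_only_intermediates))
```

Keeping the edge kind on every edge is what allows the same traversal to answer both "which columns does this feature read?" and "which intermediates affect only row selection?".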

## Quick example

```python
from pathlib import Path

from feature_sql_tool import FeatureSpec, FeatureSqlTool

features = [
    FeatureSpec(
        feature_name="avg_payment_30d",
        sql_file_path=Path("sql_samples/feature_avg_payment_30d.sql"),
        final_alias="avg_paid_amount_per_payment_30d",
        entity_key="client_id",
        dialect="spark",
        grain="client_id",
    ),
    FeatureSpec(
        feature_name="cnt_paid_txn_30d",
        sql_file_path=Path("sql_samples/feature_cnt_paid_txn_30d.sql"),
        final_alias="cnt_paid_txn_30d",
        entity_key="client_id",
        dialect="spark",
        grain="client_id",
    ),
]

tool = FeatureSqlTool()

# Analyze each feature and print its lineage summary.
results = tool.analyze_features(features)
for result in results:
    print(result.feature_spec.feature_name)
    print(result.source_columns)
    print(result.intermediate_features)
    print(result.filter_only_intermediate_features)

# Build a single query that computes all features together.
sql = tool.build_unified_sql(features)
print(sql)
```

## Package structure

- `models/` – dataclasses and domain models
- `parser/` – file loading, SQL parsing, AST normalization
- `scope/` – scope, relation and alias registries
- `lineage/` – recursive column resolution and lineage extraction
- `graph/` – graph container, classifiers and graph merge
- `planner/` – reusable subgraph detection and execution planning
- `generator/` – final SQL rendering
- `reporting/` – JSON reports
- `sql_samples/` – example SQL files
- `examples/` – runnable example scripts

## Current scope

This version implements the second-iteration architecture, covering:

- recursive resolution through CTE and alias sources
- distinguishing computed aliases from passthrough aliases
- filter-only intermediate feature classification
- safe grouping of features with identical execution signatures
- unified SQL generation with reusable aggregate CTEs

This is still a best-effort implementation: validate its output against your own SQL corpus before using it in production.
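
As an illustration of grouping features with identical execution signatures, the sketch below uses only the standard library. The `PlannedFeature` class, the choice of signature fields (source table, grain, filter set), and the rendering helper are assumptions made for this example, not the planner's actual data model.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedFeature:
    # Hypothetical, simplified stand-in for an analyzed feature.
    name: str
    source_table: str
    grain: str
    filters: frozenset[str]
    aggregate: str  # SQL expression that produces the feature value


def signature(feature: PlannedFeature) -> tuple[str, str, frozenset[str]]:
    """Features with equal signatures scan the same rows at the same grain."""
    return (feature.source_table, feature.grain, feature.filters)


def group_by_signature(features: list[PlannedFeature]) -> dict:
    groups: dict = defaultdict(list)
    for feature in features:
        groups[signature(feature)].append(feature)
    return dict(groups)


def render_cte(cte_name: str, group: list[PlannedFeature]) -> str:
    """Render one aggregate CTE that computes every feature in the group."""
    first = group[0]
    select_list = ",\n    ".join(f"{f.aggregate} AS {f.name}" for f in group)
    where = " AND ".join(sorted(first.filters)) or "TRUE"
    return (
        f"{cte_name} AS (\n"
        f"  SELECT\n    {first.grain},\n    {select_list}\n"
        f"  FROM {first.source_table}\n"
        f"  WHERE {where}\n"
        f"  GROUP BY {first.grain}\n"
        f")"
    )


paid_30d = frozenset({"status = 'paid'", "paid_at >= date_sub(current_date(), 30)"})
features = [
    PlannedFeature("avg_paid_amount_30d", "payments", "client_id", paid_30d, "AVG(amount)"),
    PlannedFeature("cnt_paid_txn_30d", "payments", "client_id", paid_30d, "COUNT(*)"),
    PlannedFeature("max_amount_all_time", "payments", "client_id", frozenset(), "MAX(amount)"),
]

groups = group_by_signature(features)

# Groups with more than one feature are candidates for a shared CTE.
shared = [[f.name for f in group] for group in groups.values() if len(group) > 1]
cte_sql = render_cte("paid_30d_agg", list(groups.values())[0])

print(shared)
print(cte_sql)
```

Requiring the whole signature to match is a deliberately conservative rule: two features share a CTE only when they read the same table, at the same grain, under the same filters, so merging them cannot change either result.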
