Metadata-Version: 2.4
Name: querylens
Version: 0.1.14
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Topic :: Software Development
Classifier: Topic :: Utilities
Summary: SQL inspection CLI: static checks, lineage, PR review, cost diff
Author-email: Maria Dubyaga <kraftaa@gmail.com>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/kraftaa/querylens
Project-URL: Repository, https://github.com/kraftaa/querylens

# querylens

Catch risky analytics SQL before it runs and explain what complex queries actually do.

`querylens` scans SQL repositories, ranks the riskiest files first, and reviews pull requests for regressions. It flags reliability and cost risks, such as full table scans, wide joins, missing filters, and expensive query patterns, before they merge and hit production.

## Quick Demo

Input SQL:

```sql
SELECT
  c.customer_id,
  SUM(o.amount) AS revenue
FROM customers c
JOIN orders o
  ON c.id = o.customer_id
GROUP BY c.customer_id
```

Command:

```bash
cargo run -- lineage examples/revenue.sql
```

Output:

```text
examples/revenue.sql
Projections:
revenue
 └─ SUM(orders.amount)
c.customer_id
 └─ customers.customer_id
```

## Install

Prebuilt binary:

```bash
curl -L -o querylens.tar.gz https://github.com/kraftaa/querylens/releases/download/v0.1.11/querylens-macos-aarch64.tar.gz
tar -xzf querylens.tar.gz
./querylens --help
```

Homebrew:

```bash
brew install kraftaa/tap/querylens
```

Build from source:

```bash
cargo build
# enable AWS Bedrock provider if needed
# cargo build --features bedrock
```

## Scan estimates (optional)

- Athena (post-run calibration): `querylens explain --athena-query-execution-id <id> --athena-region us-east-1 --file your.sql` uses `aws athena get-query-execution` to pull `DataScannedInBytes` from a past run. Use this to compare a real scan to our estimate or as the “before” value in PR cost diff. Not for brand-new queries.
- Manual override: `--scan-tb 1.5` or `--scan-bytes 1500000000000`.
- Table stats file (offline, no cloud calls): `--stats-file stats.json`

Example `stats.json`:

```json
{
  "tables": {
    "athena.sales": {
      "total_bytes": 1200000000000,
      "row_count": 3500000000,
      "partition_columns": ["ds"],
      "partitions_per_year": 365
    },
    "athena.customers": {
      "total_bytes": 8000000000
    }
  }
}
```

Heuristics:
- No WHERE → assume full scan.
- Filters on non-partition columns → ~0.7 of table size.
- Equality/range on partition column → small slice (2–5% by default, or days/partitions if provided).
- Sum per-table bytes to show “Estimated scan” and PR cost deltas.
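As a hypothetical walk-through of these heuristics, take `athena.sales` from the `stats.json` example above (1.2 TB total) and assume a 3% partition slice, which sits inside the documented 2–5% default band:

```bash
total_bytes=1200000000000   # athena.sales from the stats.json example

# No WHERE clause -> assume full scan
awk -v b="$total_bytes" 'BEGIN { printf "full scan:       %.2f TB\n", b / 1e12 }'

# Filter on a non-partition column -> ~0.7 of table size
awk -v b="$total_bytes" 'BEGIN { printf "non-partition:   %.2f TB\n", b * 0.7 / 1e12 }'

# Equality on partition column "ds" -> small slice (3% assumed here)
awk -v b="$total_bytes" 'BEGIN { printf "partition slice: %.0f GB\n", b * 0.03 / 1e9 }'
```

The last line is the number that would feed into "Estimated scan" and the PR cost delta.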

## Features

| Feature | Description |
|---|---|
| Static SQL checks | Detect risky patterns (`SELECT *`, possible Cartesian joins, wildcard `LIKE`, etc.) |
| Column lineage | Trace projection, filter, and join lineage |
| Query explanation | Summarize tables, joins, aggregations, and likely query meaning |
| Table extraction | List tables used by a query |
| Folder scanning | Analyze a directory of SQL files |
| Rule controls | Disable rules or override severity by `rule_id` |
| Athena mode | Extra heuristics for partition/cost patterns |
| CI thresholds | Fail on `low|medium|high` severity |

## Why This Exists

SQL pipelines grow quickly and are hard to review.

When a metric looks wrong, teams need to:

- trace where output columns come from
- catch risky query patterns before they hit production
- understand query intent quickly

`querylens` helps with deterministic checks and optional LLM explanations.

## Detect Risky SQL Patterns

Example query:

```sql
SELECT *
FROM orders o
JOIN customers c
```

Command:

```bash
cargo run -- --file examples/bad_join.sql --static-only
```

Expected findings include:

- `SELECT *`
- possible Cartesian join (`JOIN` without `ON/USING`)

Example subquery pattern:

```bash
cargo run -- --file examples/subquery.sql --static-only
```

Expected suggestion includes:

- consider replacing `IN (SELECT ...)` with `JOIN` or `EXISTS`
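The rewrite that suggestion points at looks roughly like this (table and column names are hypothetical):

```sql
-- Before: membership check via IN over a subquery
SELECT o.id, o.amount
FROM orders o
WHERE o.customer_id IN (SELECT customer_id FROM vip_customers);

-- After: semi-join via EXISTS, which many engines plan more efficiently
SELECT o.id, o.amount
FROM orders o
WHERE EXISTS (
  SELECT 1 FROM vip_customers v
  WHERE v.customer_id = o.customer_id
);
```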

## How Is This Different From dbt?

dbt builds and runs transformation pipelines.

`querylens` analyzes SQL itself:

- detect risky query patterns
- trace lineage in query text
- explain query logic

They complement each other: dbt for orchestration/modeling, `querylens` for query inspection.

## Distribution

### GitHub Releases (prebuilt binaries)

Tagging `v*` triggers `.github/workflows/release.yml` and publishes:

- `querylens-macos-aarch64.tar.gz`
- `querylens-linux-x86_64.tar.gz`
- `SHA256SUMS`

Create a release tag:

```bash
git tag v0.1.2
git push origin v0.1.2
```

### Homebrew tap

Use the formula template at:

- `packaging/homebrew/querylens.rb`

For each release:

1. Set `version` (without `v`).
2. Fill `__SHA256_MACOS_AARCH64__` and `__SHA256_LINUX_X86_64__` from `SHA256SUMS`.
3. Commit the formula in your tap repo (for example `kraftaa/homebrew-tap`) as `Formula/querylens.rb`.
4. Users install with:

```bash
brew install kraftaa/tap/querylens
```

## Usage

Note: all analysis commands are static-only; they never execute your SQL against a database.

### Subcommands

```bash
cargo run -- lineage <file.sql>
cargo run -- lineage <file.sql> --column revenue
cargo run -- risk <file.sql>
cargo run -- risk <file.sql> --summary-only
cargo run -- guard <file.sql> --max-risk high --deny-rule CROSS_JOIN
cargo run -- simulate <file.sql> --limit 100
cargo run -- tables <file.sql>
cargo run -- explain <file.sql>
cargo run -- analyze <dir> --glob "*.sql"
cargo run -- analyze <dir> --glob "*.sql" --changed-only
cargo run -- analyze <dir> --glob "*.sql" --changed-only --changed-base main
cargo run -- analyze <dir> --glob "*.sql" --top 10
cargo run -- analyze <dir> --glob "*.sql" --top 10 --verbose
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql"
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql" --ci
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql" --markdown
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql" --cost-diff --stats-file stats.json
```

### PR review mode

```bash
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql"
```

Example output:

```text
SQL Inspect PR Review
Base: main
Head: HEAD

PR status: PASS

No new SQL risk regressions detected.

1 changed SQL file
0 new HIGH-risk queries
0 queries lost a partition filter
0 ORDER BY without LIMIT regressions
0 possible join amplification regressions
0 files increased estimated scan cost

File: models/example.sql
Previous risk: HIGH
Current risk: HIGH
Risk trend: unchanged
Still risky because:
- SELECT_STAR
Estimated scan: unknown -> unknown
Estimated scan delta: unknown
```

Compact CI mode:

```bash
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql" --ci
```

```text
PR status: PASS
Changed SQL files: 1
New HIGH-risk queries: 0
Partition filter regressions: 0
ORDER BY without LIMIT regressions: 0
Join amplification regressions: 0
Files with higher estimated scan: 0
```

Markdown mode for PR comments:

```bash
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql" --markdown
```

Cost regression mode:

```bash
cargo run -- pr-review --base main --head HEAD --dir models --glob "*.sql" --cost-diff --stats-file stats.json
```

Example output:

```text
SQL Cost Regression

1 changed SQL file

File: models/revenue.sql

Estimated scan change:
Before: 22 GB
After: 1.40 TB
Increase: 63.6x

Cost regression: HIGH

Reason:
Filter removed: orders.order_date >= DATE '2026-01-01'

Recommendation:
Restore a selective WHERE or partition predicate.
```

### Repo scan summary

`analyze` now highlights severity shape and hotspots, not only rule counts.

Useful flags:

- `--top <N>` number of hotspot files to display (default `5`)
- `--changed-only` limit to changed files in working tree/staging
- `--changed-base <ref>` (with `--changed-only`) limit to committed diff vs ref (for example `main`)

Example:

```text
SQL Inspect Report
Scope: current selection

Analyzed 225 SQL files

Top risks:
1. 37 HIGH-risk files
2. 117 files likely scan full tables
3. 18 files have complex multi-join patterns
4. 12 files contain CROSS JOIN or likely Cartesian behavior

Severity shape:
HIGH: 37 files
MEDIUM: 88 files
LOW: 100 files

Most severe files:
- models/a.sql  HIGH  SELECT_STAR, FULL_TABLE_SCAN_LIKELY
- models/b.sql  HIGH  WIDE_JOIN_GRAPH, MISSING_WHERE
```

Verbose mode:

```bash
cargo run -- analyze . --glob "*.sql" --top 10 --verbose
```

### Query Risk Scanner

```bash
cargo run -- risk examples/bad_join.sql
cargo run -- risk examples/bad_join.sql --summary-only
cargo run -- risk examples/bad_join.sql --scan-tb 2.3
cargo run -- risk examples/bad_join.sql --scan-bytes 2300000000000
```

Example output:

```text
examples/bad_join.sql
Risk score: HIGH

Reasons:
- select star: SELECT *
- missing where: No WHERE clause

Estimated scan: 2.30 TB
```

Compact risk summary:

```text
SQL Inspect Risk
File: examples/bad_join.sql
Risk: HIGH
Estimated scan: unknown
Top reasons:
- select star: SELECT *
- missing where: No WHERE clause
- full table scan likely: Likely full table scan
```

Use one of:

- `--scan-tb <value>` for TB units directly
- `--scan-bytes <value>` to auto-convert bytes to TB
- `--athena-query-execution-id <id>` to fetch `DataScannedInBytes` from Athena via `aws` CLI
- `--athena-region <region>` optional region override for Athena lookup
- `--stats-file <path.json>` to estimate scan bytes from table-level stats

Example `stats.json`:

```json
{
  "tables": {
    "orders": { "total_bytes": 1000000000000 },
    "customers": { "total_bytes": 200000000000 }
  }
}
```

### Query explanation

```bash
cargo run -- explain examples/query.sql
```

Example output:

```text
Query explanation
Meaning: total amount per id, created at, email
Tables: orders, customers, order_items
Join: customers.id = orders.customer_id; order_items.order_id = orders.id
Aggregation: SUM(order_items.quantity * order_items.unit_price) AS total_amount
```

### Block dangerous queries

```bash
cargo run -- guard examples/bad_join.sql --max-risk high --deny-rule CROSS_JOIN --deny-rule FULL_TABLE_SCAN_LIKELY
```

Example output:

```text
SQL Inspect Guard
Policy: default

Status: FAIL
Risk: HIGH

Blocking violations
- FULL_TABLE_SCAN_LIKELY
- SELECT_STAR

Why blocked
This query shape is likely to scan most rows and can materially increase cost

Exit code: 2
```

Exit code is `2` when blocked, so this works directly in CI.
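A minimal CI gate on that exit code might look like this sketch; a stub function stands in for the real `querylens guard` invocation so the control flow is self-contained:

```bash
# Stand-in for `querylens guard ...`; pretend the query was blocked (exit 2)
querylens_guard() { return 2; }

if querylens_guard; then
  echo "guard: PASS"
else
  status=$?
  echo "guard: blocked (exit $status)"
  # in real CI you would `exit "$status"` here to fail the job
fi
```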

### Simulate safer preview query

```bash
cargo run -- simulate examples/query.sql --limit 100
```

### Column-specific lineage

```bash
cargo run -- lineage examples/revenue.sql --column revenue
```

Example output:

```text
examples/revenue.sql
Projections:
revenue
 └─ SUM(orders.amount)
```

### Main Analyze Command (LLM + static)

Provide one of:

- `--sql "<query>"`
- `--file <path>`
- `--dir <path>`

Optional:

- `--provider openai|bedrock|local`
- `--dialect generic|athena`
- `--static-only`
- `--fail-on low|medium|high`
- `--glob "*.sql"`
- `--config querylens.toml`
- `--json`

OpenAI example:

```bash
export OPENAI_API_KEY="..."
export OPENAI_MODEL="gpt-4.1-mini"
cargo run -- --provider openai --file examples/query.sql
```

Bedrock example:

```bash
export AWS_REGION="us-east-1"
export BEDROCK_MODEL_ID="anthropic.claude-3-5-sonnet-20241022-v2:0"
cargo run -- --provider bedrock --file examples/query.sql
```

Local OpenAI-compatible server example:

```bash
export LOCAL_LLM_BASE_URL="http://127.0.0.1:8080"
export LOCAL_LLM_MODEL="llama_instruct.gguf"
cargo run -- --provider local --file examples/query.sql --json
```

## Config

Create `querylens.toml`:

```toml
dialect = "athena"
fail_on = "high"
glob = "*.sql"
suggest_limit_for_exploratory = true
static_only = false

[rules.SELECT_STAR]
enabled = true
severity = "high"

[rules.MISSING_WHERE]
enabled = true
severity = "medium"
```

Rule controls:

- `enabled = false` disables a finding by `rule_id`
- `severity = "low|medium|high"` overrides severity

Example:

```toml
[rules.SELECT_STAR]
enabled = false

[rules.MISSING_WHERE]
severity = "low"
```

## CI Usage

Fail a build when risky SQL is found:

```bash
cargo run -- --dir models --dialect athena --fail-on high
```

Or subcommand mode:

```bash
cargo run -- analyze models --glob "*.sql"
cargo run -- analyze models --glob "*.sql" --changed-only
```

### Inline rule suppression

Add suppression comments directly in SQL when needed:

```sql
-- querylens: disable=SELECT_STAR,MISSING_WHERE
SELECT * FROM some_small_reference_table
```

Suppression applies to matching `rule_id` values for that file.

## Examples Folder

Ready-to-run examples:

- `examples/query.sql`
- `examples/revenue.sql`
- `examples/bad_join.sql`
- `examples/subquery.sql`
- `examples/silver_proposal_attachments.sql`

## Project Layout

```text
querylens/
  Cargo.toml
  src/
    main.rs
    analyzer.rs
    insights.rs
    config.rs
    prompt.rs
    providers/
      openai.rs
      bedrock.rs
      local.rs
  examples/
```

## Troubleshooting

- Missing env vars: set the variables the chosen provider requires (for example `OPENAI_API_KEY`, or `AWS_REGION` for Bedrock).
- Unexpected model JSON shape: run with `--json` and inspect the raw response.
- Secrets: `.env` and `.env.*` are gitignored, so credentials stay out of the repo.

