The AI-powered evals framework for ecommerce search.

Ecommerce search evaluation powered by LLM-as-a-Judge. Domain-aware scoring, deterministic checks, and full IR metrics.

For search teams tired of eyeballing result pages.

$ pip install veritail

Everything you need to evaluate search

LLM-as-a-Judge

Every search result scored for relevance using structured, domain-aware rubrics tailored to your industry vertical.

14 Industry Verticals

Built-in domain expertise for automotive, electronics, fashion, groceries, medical, and more — with per-query overlay classification at zero extra cost.

Deterministic Quality Checks

Catch price outliers, missing titles, duplicate SKUs, and ranking mistakes with fast, rule-based checks that run before LLM scoring.

Metrics & Reports

NDCG, MRR, MAP, Precision@K, and attribute-match metrics with rich HTML reports and A/B comparison views.

What you get after a run

IR Metrics

Metric     Value
NDCG@5     0.9829
NDCG@10    0.9829
MRR        0.9800
MAP        0.9751
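
For readers who want to see how numbers like these can be derived from 0-3 relevance grades, here is a minimal sketch of NDCG@K, MRR, and average precision. It is not Veritail's implementation. The assumptions are mine: gain is linear in the grade, the ideal ordering is computed from the retrieved list only, and a grade of 2 or higher counts as "relevant" for the binary metrics. All function names are hypothetical.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k graded results (linear gain)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(grades, relevant_at=2):
    """1/rank of the first result graded >= relevant_at, or 0.0 if none."""
    for rank, g in enumerate(grades, start=1):
        if g >= relevant_at:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(grades_per_query, relevant_at=2):
    """Average reciprocal rank across queries."""
    ranks = [reciprocal_rank(g, relevant_at) for g in grades_per_query]
    return sum(ranks) / len(ranks) if ranks else 0.0


def average_precision(grades, relevant_at=2):
    """Mean of precision values taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, g in enumerate(grades, start=1):
        if g >= relevant_at:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


if __name__ == "__main__":
    # Hypothetical grades for one query, listed in ranked order.
    grades = [3, 1, 3, 0, 2]
    print("NDCG@5:", round(ndcg_at_k(grades, 5), 4))
    print("RR:", reciprocal_rank(grades))
    print("AP:", round(average_precision(grades), 4))
```

MAP is the mean of `average_precision` over all queries, in the same way that MRR is the mean of `reciprocal_rank`.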

Avg Score by Position

Bars should decrease top to bottom if ranking is correct.

Position   Avg Score
#1         2.88
#2         3.00
#3         2.52
#4         1.84
#5         1.16
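
The "bars should decrease" rule is easy to check mechanically. The sketch below uses a hypothetical helper, not part of Veritail, to list adjacent positions where the lower slot has a higher average score than the slot above it. With the sample values above, it reports the pair formed by positions 1 and 2.

```python
def position_inversions(avg_scores):
    """Return (upper, lower) 1-based position pairs where the lower slot scores higher."""
    return [
        (i + 1, i + 2)
        for i in range(len(avg_scores) - 1)
        if avg_scores[i + 1] > avg_scores[i]
    ]


# Average scores by position, taken from the sample report above.
sample_scores = [2.88, 3.00, 2.52, 1.84, 1.16]
print(position_inversions(sample_scores))  # [(1, 2)]
```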

Query Corrections

Original        Corrected        Verdict
intrior paint   interior paint   appropriate
led lighrs      red lights       inappropriate

Veritail catches when autocorrect fixes a typo but changes the product.

Worst Performing Queries

Query                            Type        NDCG@10
self drilling drywall anchors    long-tail   0.7829
pex crimp tool                   broad       0.8554
1/2 inch pex tubing 100ft red    long-tail   0.9697

Autocomplete Quality

Metric          Value
Avg Relevance   2.94/3
Avg Diversity   2.50/3
Total Flagged   4

Deterministic Checks

Check                     Passed   Failed
Near-duplicate products   103      22
Out-of-stock prominence   124      1
Price outlier             108      17
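
Rule-based checks of this kind need no model calls. The following is an illustrative sketch only: the `Result` fields, the function names, and the "more than 3x away from the median" outlier rule are assumptions made for the example, not Veritail's actual checks or thresholds.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Result:
    # Hypothetical fields for illustration.
    sku: str
    title: str
    price: float


def missing_titles(results):
    """SKUs whose title is empty or whitespace-only."""
    return [r.sku for r in results if not r.title.strip()]


def duplicate_skus(results):
    """SKUs that appear more than once, each reported once."""
    seen, dupes = set(), []
    for r in results:
        if r.sku in seen and r.sku not in dupes:
            dupes.append(r.sku)
        seen.add(r.sku)
    return dupes


def price_outliers(results, ratio=3.0):
    """SKUs priced more than `ratio` times above or below the median price."""
    prices = [r.price for r in results if r.price > 0]
    if not prices:
        return []
    mid = median(prices)
    return [
        r.sku
        for r in results
        if r.price > 0 and (r.price > mid * ratio or r.price < mid / ratio)
    ]


# Made-up result set for one query.
sample = [
    Result("A", "Joist hanger", 4.0),
    Result("B", "", 5.0),
    Result("A", "Joist hanger", 4.0),
    Result("C", "Hanger kit", 60.0),
]
print(missing_titles(sample), duplicate_skus(sample), price_outliers(sample))
```

Because checks like these are cheap, running them before LLM scoring lets obvious defects surface without spending judge calls on them.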

Search Result Judgment

Query: simpson strong tie joist hanger 2x10

#1 Simpson Strong-Tie LUS210 2x10 Joist Hanger: 3/3 (match)

Exact specification match. Brand, product type, and 2x10 dimensional rating all satisfied. In stock at position 1.

#2 Simpson Strong-Tie LUS28 2x8 Joist Hanger: 1/3 (mismatch)

Query specifies 2x10 but this is rated for 2x8. A 2x8 hanger cannot safely support a 2x10 joist — dimensional mismatch makes this a structural failure risk.

Autocomplete Judgment

Prefix: "dec"

Suggestions: deck screws, deck stain, deck boards, composite deck railing, decaf coffee, deck post cap

Relevance: 2/3, Diversity: 2/3

Five of six suggestions target deck materials, but "decaf coffee" is completely off-domain — a grocery item with no relevance to home improvement, degrading the suggestion set.

Three steps to scored results

1. Load queries

Import your search queries. Veritail classifies each query and assigns domain-specific overlays automatically.

2. Evaluate

Your search adapter returns results. Deterministic checks flag defects, then the LLM judge scores each result on a 0–3 rubric.

3. Report

Get IR metrics, per-query breakdowns, and rich HTML reports. Compare two search configurations side by side.

Up and running in minutes

Install

$ pip install veritail
# With Anthropic / Gemini support:
$ pip install "veritail[cloud]"

Create an adapter

# my_adapter.py
from veritail import SearchAdapter, SearchResult

class MyAdapter(SearchAdapter):
    def search(self, query: str, **kwargs) -> list[SearchResult]:
        # Call your search engine here
        return [
            SearchResult(
                title="Product Name",
                price=29.99,
                url="https://example.com/product",
            )
        ]

Run evaluation

$ veritail --adapter my_adapter.MyAdapter \
          --queries queries.csv \
          --vertical electronics