Evaluators Reference

Complete reference for all evaluation classes and functions in HoneyHive.

Base Classes

BaseEvaluator

Base class for all custom evaluators.

class honeyhive.evaluation.evaluators.BaseEvaluator(name, **kwargs)[source]

Bases: object

Base class for custom evaluators.

__init__(name, **kwargs)[source]

Initialize the evaluator.

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate the given inputs and outputs.

Return type:

Dict[str, Any]

__call__(inputs, outputs, ground_truth=None, **kwargs)[source]

Make the evaluator callable.

Return type:

Dict[str, Any]

Example

from honeyhive.evaluation import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__("custom_evaluator", **kwargs)
        self.threshold = threshold

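    def _compute_score(self, outputs):
        # Hypothetical helper for illustration (not provided by BaseEvaluator):
        # scores 1.0 for a non-empty response, 0.0 otherwise
        return 1.0 if outputs.get("response") else 0.0

    def evaluate(self, inputs, outputs, ground_truth=None, **kwargs):
        # Custom evaluation logic
        score = self._compute_score(outputs)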
        return {
            "score": score,
            "passed": score >= self.threshold
        }
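
The relationship between evaluate() and __call__() can be illustrated without the library. The sketch below is a stand-in, not HoneyHive's implementation: SketchBaseEvaluator and NonEmptyEvaluator are hypothetical names, and it assumes that __call__ simply delegates to evaluate, which the reference above suggests (same signature and return type) but does not state.

```python
from typing import Any, Dict


class SketchBaseEvaluator:
    """Hypothetical stand-in for a base evaluator (not the HoneyHive class)."""

    def __init__(self, name: str, **kwargs: Any) -> None:
        self.name = name
        self.config = kwargs  # extra options kept for subclasses

    def evaluate(self, inputs, outputs, ground_truth=None, **kwargs) -> Dict[str, Any]:
        raise NotImplementedError("subclasses implement evaluate()")

    def __call__(self, inputs, outputs, ground_truth=None, **kwargs) -> Dict[str, Any]:
        # Assumption: calling the instance forwards to evaluate()
        return self.evaluate(inputs, outputs, ground_truth, **kwargs)


class NonEmptyEvaluator(SketchBaseEvaluator):
    def __init__(self, threshold: float = 0.5, **kwargs: Any) -> None:
        super().__init__("non_empty", **kwargs)
        self.threshold = threshold

    def evaluate(self, inputs, outputs, ground_truth=None, **kwargs) -> Dict[str, Any]:
        # Score 1.0 for a non-empty response, 0.0 otherwise
        score = 1.0 if outputs.get("response") else 0.0
        return {"score": score, "passed": score >= self.threshold}


checker = NonEmptyEvaluator()
result = checker({}, {"response": "hi"})  # same as checker.evaluate(...)
print(result)
```

Under this assumption an evaluator instance can be passed anywhere a plain evaluation function is expected.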

Built-in Evaluators

ExactMatchEvaluator

Evaluates exact string matching between expected and actual outputs.

class honeyhive.evaluation.evaluators.ExactMatchEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for exact string matching.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the exact match evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate exact match between expected and actual outputs.

Return type:

Dict[str, Any]

Description

The ExactMatchEvaluator checks if the actual output exactly matches the expected output. String comparisons are case-insensitive and whitespace is stripped.

Example

from honeyhive.evaluation import ExactMatchEvaluator

evaluator = ExactMatchEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "The answer is 42"},
    outputs={"response": "The answer is 42"}
)
# Returns: {"exact_match": 1.0, "expected": "...", "actual": "..."}

# Case-insensitive matching
result = evaluator.evaluate(
    inputs={"expected": "hello"},
    outputs={"response": "HELLO"}
)
# Returns: {"exact_match": 1.0, ...}
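
The comparison rule described for ExactMatchEvaluator can be sketched in plain Python. This is an illustration rather than the library's code: exact_match is a hypothetical function, and it assumes that "whitespace is stripped" means leading and trailing whitespace and that case is ignored via lower().

```python
def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 when the strings match after normalization, else 0.0."""

    def normalize(text: str) -> str:
        # Assumption: strip surrounding whitespace and ignore case
        return text.strip().lower()

    return 1.0 if normalize(expected) == normalize(actual) else 0.0


print(exact_match("hello", "  HELLO "))                       # 1.0
print(exact_match("The answer is 42", "The answer is 43"))    # 0.0
```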

F1ScoreEvaluator

Evaluates F1 score for text similarity.

class honeyhive.evaluation.evaluators.F1ScoreEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for F1 score calculation.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the F1 score evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate F1 score between expected and actual outputs.

Return type:

Dict[str, Any]

Description

The F1ScoreEvaluator computes the F1 score between predicted and ground truth text based on word-level token overlap. It calculates precision and recall and combines them into an F1 score.

Formula

precision = |predicted_words ∩ ground_truth_words| / |predicted_words|
recall = |predicted_words ∩ ground_truth_words| / |ground_truth_words|
f1_score = 2 * (precision * recall) / (precision + recall)

Example

from honeyhive.evaluation import F1ScoreEvaluator

evaluator = F1ScoreEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "the quick brown fox"},
    outputs={"response": "the fast brown fox"}
)
# Returns: {"f1_score": 0.75}  # 3 out of 4 words match
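
The formula above can be worked through in a few lines of Python. This is a minimal sketch, not the evaluator's source: f1_score is a hypothetical function, and it assumes words are lowercased, split on whitespace, and compared as sets. It also returns 0.0 when there is no overlap, which avoids dividing by zero.

```python
def f1_score(predicted: str, ground_truth: str) -> float:
    """Word-level F1 between two texts, following the formula above."""
    # Assumption: lowercase, whitespace-split words treated as sets
    predicted_words = set(predicted.lower().split())
    ground_truth_words = set(ground_truth.lower().split())

    overlap = len(predicted_words & ground_truth_words)
    if overlap == 0:
        return 0.0  # precision + recall would be 0, so F1 is defined as 0

    precision = overlap / len(predicted_words)
    recall = overlap / len(ground_truth_words)
    return 2 * (precision * recall) / (precision + recall)


# 3 shared words out of 4 on each side: precision = recall = 0.75
print(f1_score("the fast brown fox", "the quick brown fox"))
```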

SemanticSimilarityEvaluator

Evaluates semantic similarity using embeddings.

class honeyhive.evaluation.evaluators.SemanticSimilarityEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for semantic similarity using basic heuristics.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the semantic similarity evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate semantic similarity between expected and actual outputs.

Return type:

Dict[str, Any]

Description

The SemanticSimilarityEvaluator uses embeddings to compute semantic similarity between texts. Unlike exact match or F1 score, it compares meaning rather than token overlap, so a paraphrase can score highly even when it shares few words with the expected text.

Example

from honeyhive.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator(
    embedding_model="text-embedding-ada-002",
    threshold=0.8
)

result = evaluator.evaluate(
    inputs={"expected": "The weather is nice today"},
    outputs={"response": "It's a beautiful day outside"}
)
# Returns: {"similarity": 0.85, "passed": True}
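
One common way to turn two embedding vectors into a similarity score is cosine similarity. The sketch below assumes that approach, which this reference does not confirm for SemanticSimilarityEvaluator. The function names are hypothetical and the three-dimensional vectors are made up to stand in for real embeddings.

```python
import math
from typing import Any, Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def similarity_result(expected_vec, actual_vec, threshold: float = 0.8) -> Dict[str, Any]:
    similarity = cosine_similarity(expected_vec, actual_vec)
    return {"similarity": similarity, "passed": similarity >= threshold}


# Made-up vectors; a real embedding model would produce these
close = similarity_result([1.0, 0.0, 1.0], [1.0, 0.1, 0.9])
far = similarity_result([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(close["passed"], far["passed"])
```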

Evaluation Decorators

evaluator

Decorator for defining synchronous evaluators.

honeyhive.evaluation.evaluators.evaluator(_name=None, _session_id=None, **_kwargs)[source]

Decorator for synchronous evaluation functions.

Parameters:
  • _name (str | None) – Evaluation name

  • _session_id (str | None) – Session ID for tracing

  • **_kwargs (Any) – Additional evaluation parameters

Return type:

Callable[[Callable], Callable]

Description

The evaluator decorator converts a regular function into an evaluator that can be used with the HoneyHive evaluation system.

Example

from honeyhive import evaluator

@evaluator
def length_check(inputs, outputs, ground_truth=None, min_length=10):
    """Check if output meets minimum length requirement."""
    text = outputs.get("response", "")
    length = len(text)

    return {
        "length": length,
        "meets_minimum": length >= min_length,
        "score": 1.0 if length >= min_length else 0.0
    }

# Use with experiments evaluate()
from honeyhive.experiments import evaluate

results = evaluate(
    function=lambda datapoint: {"response": datapoint["inputs"].get("input", "")},
    dataset=[{"inputs": {"input": "test"}, "ground_truth": {}}],
    evaluators=[length_check]
)
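
For readers unfamiliar with decorators that work both bare (@evaluator) and with arguments, the general pattern can be sketched as follows. This is not HoneyHive's implementation: simple_evaluator and REGISTRY are hypothetical names, and the registry is only one possible way to make decorated functions discoverable.

```python
import functools
from typing import Any, Callable, Dict, Optional

# Hypothetical registry mapping evaluator names to wrapped functions
REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {}


def simple_evaluator(_func: Optional[Callable] = None, *, name: Optional[str] = None):
    """Illustrative decorator usable as @simple_evaluator or @simple_evaluator(name=...)."""

    def decorate(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(inputs, outputs, ground_truth=None, **kwargs):
            return func(inputs, outputs, ground_truth, **kwargs)

        REGISTRY[name or func.__name__] = wrapper
        return wrapper

    # Bare usage passes the function directly; parameterized usage does not
    if _func is not None:
        return decorate(_func)
    return decorate


@simple_evaluator
def length_check(inputs, outputs, ground_truth=None, min_length=10):
    length = len(outputs.get("response", ""))
    return {"length": length, "score": 1.0 if length >= min_length else 0.0}


@simple_evaluator(name="short_ok")
def short_check(inputs, outputs, ground_truth=None):
    return {"score": 1.0 if len(outputs.get("response", "")) < 10 else 0.0}


print(length_check({}, {"response": "test"}))
print(sorted(REGISTRY))
```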

aevaluator

Decorator for defining asynchronous evaluators.

honeyhive.evaluation.evaluators.aevaluator(_name=None, _session_id=None, **_kwargs)[source]

Decorator for asynchronous evaluation functions.

Parameters:
  • _name (str | None) – Evaluation name

  • _session_id (str | None) – Session ID for tracing

  • **_kwargs (Any) – Additional evaluation parameters

Return type:

Callable[[Callable], Callable]

EvaluatorMeta

Metaclass for evaluator type handling.

class honeyhive.experiments.evaluators.EvaluatorMeta[source]

Bases: type

Metaclass for evaluator accessor pattern.

TerminalColors

Terminal color constants for formatted output.

class honeyhive.experiments.evaluators.TerminalColors[source]

Bases: object

ANSI terminal color codes for output formatting.

HEADER = '\x1b[95m'
OKBLUE = '\x1b[94m'
OKCYAN = '\x1b[96m'
OKGREEN = '\x1b[92m'
WARNING = '\x1b[93m'
FAIL = '\x1b[91m'
ENDC = '\x1b[0m'
BOLD = '\x1b[1m'
UNDERLINE = '\x1b[4m'
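
ANSI codes like the TerminalColors constants are typically used by wrapping text between a color code and the reset code. The sketch below copies three of the values listed for TerminalColors into local variables so that it runs without the library; colorize is a hypothetical helper.

```python
# Values copied from the TerminalColors constants
OKGREEN = "\x1b[92m"
FAIL = "\x1b[91m"
ENDC = "\x1b[0m"


def colorize(text: str, color: str) -> str:
    """Wrap text in a color code and reset formatting afterwards."""
    return f"{color}{text}{ENDC}"


print(colorize("PASSED", OKGREEN))
print(colorize("FAILED", FAIL))
```

Omitting the reset code leaves the terminal in the chosen color for all subsequent output.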

Description

The aevaluator decorator is used for async evaluators that need to make asynchronous calls (e.g., API calls for LLM-based evaluation).

Example

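import os

import aiohttp
from honeyhive import aevaluator

def parse_grade(result):
    # Hypothetical helper: assumes the model replies with a bare number from 0 to 100
    return float(result["choices"][0]["message"]["content"].strip())

@aevaluator
async def llm_grader(inputs, outputs, ground_truth=None):
    """Use an LLM to grade the output."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.openai.com/v1/chat/completions",
            # The API key is read from the environment
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={
                "model": "gpt-4",
                "messages": [{
                    "role": "user",
                    "content": f"Grade this output from 0 to 100. Reply with only the number: {outputs['response']}"
                }]
            }
        ) as response:
            result = await response.json()
            grade = parse_grade(result)

            return {
                "grade": grade,
                "score": grade / 100.0
            }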
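
The shape of an async evaluator can be tried without network access or the library. In the sketch below, fake_llm_grade is a made-up stand-in for the API call, the decorator is omitted, and asyncio.gather is used to show how several evaluations could run concurrently. How HoneyHive schedules async evaluators is not specified here.

```python
import asyncio


async def fake_llm_grade(text: str) -> int:
    # Stand-in for a network call: waits briefly, then returns a made-up grade
    await asyncio.sleep(0.01)
    return min(100, len(text) * 10)


async def llm_grader(inputs, outputs, ground_truth=None):
    """Async evaluator with the same signature as the example above."""
    grade = await fake_llm_grade(outputs["response"])
    return {"grade": grade, "score": grade / 100.0}


async def grade_all(responses):
    # gather() awaits all evaluator coroutines concurrently, preserving order
    return await asyncio.gather(
        *(llm_grader({}, {"response": r}) for r in responses)
    )


results = asyncio.run(grade_all(["hi", "hello there"]))
print(results)
```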

Data Models

EvaluationResult

Result model for evaluation outputs.

class honeyhive.evaluation.evaluators.EvaluationResult(score, metrics, feedback=None, metadata=None, evaluation_id=<factory>, timestamp=None)[source]

Bases: object

Result of an evaluation.

Parameters:
score: float
metrics: Dict[str, Any]
feedback: str | None = None
metadata: Dict[str, Any] | None = None
evaluation_id: str
timestamp: str | None = None

Fields

  • score (float): Numeric score from evaluation

  • metrics (Dict[str, Any]): Additional metrics

  • feedback (Optional[str]): Text feedback

  • metadata (Optional[Dict[str, Any]]): Additional metadata

  • evaluation_id (str): Unique ID for this evaluation

  • timestamp (Optional[str]): Timestamp of evaluation

Example

from honeyhive.evaluation import EvaluationResult

result = EvaluationResult(
    score=0.85,
    metrics={"accuracy": 0.9, "latency": 250},
    feedback="Good response, minor improvements possible",
    metadata={"model": "gpt-4", "version": "1.0"}
)
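
The <factory> default shown for evaluation_id in the signature indicates a dataclass-style field whose value is generated per instance. The sketch below mirrors the documented fields in a hypothetical EvaluationResultSketch class. Using a UUID string as the generated ID is an assumption, since the source only shows that a factory is used.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class EvaluationResultSketch:
    """Hypothetical mirror of the documented fields (not the HoneyHive class)."""

    score: float
    metrics: Dict[str, Any]
    feedback: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None
    # Assumption: a fresh UUID string per instance
    evaluation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: Optional[str] = None


first = EvaluationResultSketch(score=0.85, metrics={"accuracy": 0.9})
second = EvaluationResultSketch(score=0.40, metrics={})

# Each instance receives its own ID without the caller supplying one
print(first.evaluation_id != second.evaluation_id)
```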

EvaluationContext

Context information for evaluation runs.

class honeyhive.evaluation.evaluators.EvaluationContext(project, source, session_id=None, metadata=None)[source]

Bases: object

Context for evaluation runs.

Parameters:
project: str
source: str
session_id: str | None = None
metadata: Dict[str, Any] | None = None

Fields

  • project (str): Project name

  • source (str): Source of evaluation

  • session_id (Optional[str]): Session identifier

  • metadata (Optional[Dict[str, Any]]): Additional context

Example

from honeyhive.evaluation import EvaluationContext

context = EvaluationContext(
    project="my-llm-app",
    source="production",
    session_id="session-123",
    metadata={"user_id": "user-456"}
)

Evaluation Functions

evaluate

Main function for running experiments with evaluation.

Note

evaluate() is exported from honeyhive.experiments, not honeyhive.evaluation. The honeyhive.evaluation module is deprecated; use honeyhive.experiments for new code. See Core Functions for full documentation.

from honeyhive.experiments import evaluate, evaluator

@evaluator
def check_length(inputs, outputs, ground_truth=None):
    words = len(outputs.get("response", "").split())
    return {"word_count": words, "score": 1.0 if words >= 5 else 0.0}

def my_function(inputs, ground_truth):
    return {"response": "Generated response for: " + inputs.get("prompt", "")}

result = evaluate(
    function=my_function,
    dataset=[
        {"inputs": {"prompt": "What is AI?"}, "ground_truth": {}},
        {"inputs": {"prompt": "Explain ML"}, "ground_truth": {}},
    ],
    evaluators=[check_length],
    project="my-project",
    name="baseline-eval"
)
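
To make the roles of function, dataset, and evaluators concrete, the loop below sketches how the three pieces could fit together. run_evaluation is a hypothetical stand-in, not HoneyHive's evaluate(): it skips tracing, project and run naming, and any concurrency, and it assumes the function is called with (inputs, ground_truth) as in the example above.

```python
from typing import Any, Callable, Dict, List


def run_evaluation(function: Callable, dataset: List[Dict[str, Any]],
                   evaluators: List[Callable]) -> List[Dict[str, Any]]:
    """Run the function on each datapoint, then score its output."""
    rows = []
    for datapoint in dataset:
        inputs = datapoint["inputs"]
        ground_truth = datapoint.get("ground_truth")
        outputs = function(inputs, ground_truth)
        # One metrics entry per evaluator, keyed by the evaluator's function name
        metrics = {ev.__name__: ev(inputs, outputs, ground_truth) for ev in evaluators}
        rows.append({"inputs": inputs, "outputs": outputs, "metrics": metrics})
    return rows


def check_length(inputs, outputs, ground_truth=None):
    words = len(outputs.get("response", "").split())
    return {"word_count": words, "score": 1.0 if words >= 5 else 0.0}


def my_function(inputs, ground_truth):
    return {"response": "Generated response for: " + inputs.get("prompt", "")}


rows = run_evaluation(
    function=my_function,
    dataset=[
        {"inputs": {"prompt": "What is AI?"}, "ground_truth": {}},
        {"inputs": {"prompt": "Explain ML"}, "ground_truth": {}},
    ],
    evaluators=[check_length],
)
for row in rows:
    print(row["metrics"]["check_length"])
```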

See Also