Evaluators Reference
====================

Complete reference for all evaluation classes and functions in HoneyHive.

.. contents:: Table of Contents
   :local:
   :depth: 2

Base Classes
------------

BaseEvaluator
~~~~~~~~~~~~~

Base class for all custom evaluators.

.. autoclass:: honeyhive.evaluation.evaluators.BaseEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__, __call__

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import BaseEvaluator
   
   class CustomEvaluator(BaseEvaluator):
       def __init__(self, threshold=0.5, **kwargs):
           super().__init__("custom_evaluator", **kwargs)
           self.threshold = threshold
       
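       def evaluate(self, inputs, outputs, ground_truth=None, **kwargs):
           # Custom evaluation logic
           score = self._compute_score(outputs)
           return {
               "score": score,
               "passed": score >= self.threshold
           }

       def _compute_score(self, outputs):
           # Placeholder metric: replace with your own scoring logic.
           return 1.0 if outputs.get("response") else 0.0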

Built-in Evaluators
-------------------

ExactMatchEvaluator
~~~~~~~~~~~~~~~~~~~

Evaluates exact string matching between expected and actual outputs.

.. autoclass:: honeyhive.evaluation.evaluators.ExactMatchEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Description
^^^^^^^^^^^

The ExactMatchEvaluator checks if the actual output exactly matches the expected output.
String comparisons are case-insensitive and whitespace is stripped.
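
The normalization described above can be sketched in plain Python. This is an
illustrative stand-in, not the library's implementation: the function name
``normalized_exact_match`` is hypothetical, and the sketch assumes that
"stripped" refers to leading and trailing whitespace only.

.. code-block:: python

   def normalized_exact_match(expected, actual):
       """Return 1.0 if the two values match after normalization, else 0.0."""

       def normalize(value):
           # Assumption: trim leading/trailing whitespace and ignore case.
           return str(value).strip().lower()

       return 1.0 if normalize(expected) == normalize(actual) else 0.0

Under these assumptions, ``"hello"`` and ``"  HELLO "`` compare as equal, while
``"hello"`` and ``"help"`` do not.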

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import ExactMatchEvaluator
   
   evaluator = ExactMatchEvaluator()
   
   result = evaluator.evaluate(
       inputs={"expected": "The answer is 42"},
       outputs={"response": "The answer is 42"}
   )
   # Returns: {"exact_match": 1.0, "expected": "...", "actual": "..."}
   
   # Case-insensitive matching
   result = evaluator.evaluate(
       inputs={"expected": "hello"},
       outputs={"response": "HELLO"}
   )
   # Returns: {"exact_match": 1.0, ...}

F1ScoreEvaluator
~~~~~~~~~~~~~~~~

Computes a word-level F1 score between the predicted and expected text.

.. autoclass:: honeyhive.evaluation.evaluators.F1ScoreEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Description
^^^^^^^^^^^

The F1ScoreEvaluator computes the F1 score between predicted and ground truth text
based on word-level token overlap. It calculates precision and recall and combines
them into an F1 score.

Formula
^^^^^^^

.. code-block:: text

   precision = |predicted_words ∩ ground_truth_words| / |predicted_words|
   recall = |predicted_words ∩ ground_truth_words| / |ground_truth_words|
   f1_score = 2 * (precision * recall) / (precision + recall)
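
As a rough illustration, the formula above can be written directly in Python.
This sketch is not the library's code: the name ``word_f1`` is hypothetical,
and lowercasing plus whitespace tokenization are assumptions. It follows the
set-based formula as written, so repeated words count once.

.. code-block:: python

   def word_f1(predicted, ground_truth):
       """Word-level F1 over the sets of lowercase, whitespace-split words."""
       predicted_words = set(predicted.lower().split())
       ground_truth_words = set(ground_truth.lower().split())

       overlap = len(predicted_words & ground_truth_words)
       if overlap == 0:
           # Covers empty inputs and avoids dividing by zero.
           return 0.0

       precision = overlap / len(predicted_words)
       recall = overlap / len(ground_truth_words)
       return 2 * precision * recall / (precision + recall)

   # Three of the four distinct words overlap, so precision = recall = 0.75.
   print(word_f1("the fast brown fox", "the quick brown fox"))  # 0.75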

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import F1ScoreEvaluator
   
   evaluator = F1ScoreEvaluator()
   
   result = evaluator.evaluate(
       inputs={"expected": "the quick brown fox"},
       outputs={"response": "the fast brown fox"}
   )
   # Returns: {"f1_score": 0.75}  # 3 out of 4 words match

SemanticSimilarityEvaluator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Evaluates semantic similarity using embeddings.

.. autoclass:: honeyhive.evaluation.evaluators.SemanticSimilarityEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Description
^^^^^^^^^^^

The SemanticSimilarityEvaluator uses embeddings to compute semantic similarity
between texts. Unlike exact match or F1 score, which compare surface tokens, it
can rate paraphrases as similar even when they share few words.
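
The comparison step can be sketched as follows. The source does not say which
similarity measure the evaluator uses, so cosine similarity is an assumption
here, and the helper names ``cosine_similarity`` and ``similarity_result`` are
hypothetical. Short lists of floats stand in for real embedding vectors.

.. code-block:: python

   import math

   def cosine_similarity(a, b):
       """Cosine of the angle between two equal-length vectors."""
       dot = sum(x * y for x, y in zip(a, b))
       norm_a = math.sqrt(sum(x * x for x in a))
       norm_b = math.sqrt(sum(y * y for y in b))
       if norm_a == 0.0 or norm_b == 0.0:
           # A zero vector has no direction; treat it as dissimilar.
           return 0.0
       return dot / (norm_a * norm_b)

   def similarity_result(expected_vec, actual_vec, threshold=0.8):
       """Score two embedding vectors and apply a pass/fail threshold."""
       similarity = cosine_similarity(expected_vec, actual_vec)
       return {"similarity": similarity, "passed": similarity >= threshold}

In this sketch, vectors pointing in the same direction score close to 1.0 and
pass the threshold, while orthogonal vectors score 0.0 and fail.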

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import SemanticSimilarityEvaluator
   
   evaluator = SemanticSimilarityEvaluator(
       embedding_model="text-embedding-ada-002",
       threshold=0.8
   )
   
   result = evaluator.evaluate(
       inputs={"expected": "The weather is nice today"},
       outputs={"response": "It's a beautiful day outside"}
   )
   # Returns, for example: {"similarity": 0.85, "passed": True}

Evaluation Decorators
---------------------

evaluator
~~~~~~~~~

Decorator for defining synchronous evaluators.

.. autofunction:: honeyhive.evaluation.evaluators.evaluator

Description
^^^^^^^^^^^

The ``evaluator`` decorator converts a regular function into an evaluator that can be
used with the HoneyHive evaluation system.

Example
^^^^^^^

.. code-block:: python

   from honeyhive import evaluator
   
   @evaluator
   def length_check(inputs, outputs, ground_truth=None, min_length=10):
       """Check if output meets minimum length requirement."""
       text = outputs.get("response", "")
       length = len(text)
       
       return {
           "length": length,
           "meets_minimum": length >= min_length,
           "score": 1.0 if length >= min_length else 0.0
       }
   
   # Use with experiments evaluate()
   from honeyhive.experiments import evaluate
   
   results = evaluate(
       function=lambda datapoint: {"response": datapoint["inputs"].get("input", "")},
       dataset=[{"inputs": {"input": "test"}, "ground_truth": {}}],
       evaluators=[length_check]
   )

EvaluatorMeta
~~~~~~~~~~~~~

Metaclass for evaluator type handling.

.. autoclass:: honeyhive.experiments.evaluators.EvaluatorMeta
   :members:
   :undoc-members:
   :show-inheritance:

TerminalColors
~~~~~~~~~~~~~~

Terminal color constants for formatted output.

.. autoclass:: honeyhive.experiments.evaluators.TerminalColors
   :members:
   :undoc-members:
   :show-inheritance:

aevaluator
~~~~~~~~~~

Decorator for defining asynchronous evaluators.

.. autofunction:: honeyhive.evaluation.evaluators.aevaluator

Description
^^^^^^^^^^^

The ``aevaluator`` decorator is used for async evaluators that need to make
asynchronous calls (e.g., API calls for LLM-based evaluation).

Example
^^^^^^^

.. code-block:: python

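   import os

   import aiohttp
   from honeyhive import aevaluator

   @aevaluator
   async def llm_grader(inputs, outputs, ground_truth=None):
       """Use an LLM to grade the output."""
       # The OpenAI API requires a bearer token; read it from the environment.
       headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
       async with aiohttp.ClientSession(headers=headers) as session:
           async with session.post(
               "https://api.openai.com/v1/chat/completions",
               json={
                   "model": "gpt-4",
                   "messages": [{
                       "role": "user",
                       "content": f"Grade this output: {outputs['response']}"
                   }]
               }
           ) as response:
               result = await response.json()

       # parse_grade is a user-defined helper (not part of HoneyHive) that
       # extracts a 0-100 grade from the model's reply.
       grade = parse_grade(result)
       return {
           "grade": grade,
           "score": grade / 100.0
       }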
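
The example above calls ``parse_grade``, which HoneyHive does not provide. One
possible sketch of such a helper is shown here. It assumes the model replies
with a number between 0 and 100 somewhere in its message and that the response
follows the Chat Completions layout (``choices[0].message.content``); the
function name and the clamping behavior are illustrative choices.

.. code-block:: python

   import re

   def parse_grade(result):
       """Extract a 0-100 integer grade from a Chat Completions response dict."""
       content = result["choices"][0]["message"]["content"]

       # Take the first run of digits in the reply as the grade.
       match = re.search(r"\d+", content)
       if match is None:
           raise ValueError(f"No numeric grade found in: {content!r}")

       # Clamp so that score = grade / 100.0 stays within [0.0, 1.0].
       return max(0, min(100, int(match.group())))

A stricter prompt, for example one that asks the model to reply with only the
number, makes this kind of parsing more reliable.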

Data Models
-----------

EvaluationResult
~~~~~~~~~~~~~~~~

Result model for evaluation outputs.

.. autoclass:: honeyhive.evaluation.evaluators.EvaluationResult
   :members:
   :undoc-members:
   :show-inheritance:

Fields
^^^^^^

- **score** (float): Numeric score from evaluation
- **metrics** (Dict[str, Any]): Additional metrics
- **feedback** (Optional[str]): Text feedback
- **metadata** (Optional[Dict[str, Any]]): Additional metadata
- **evaluation_id** (str): Unique ID for this evaluation
- **timestamp** (Optional[str]): Timestamp of evaluation

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import EvaluationResult
   
   result = EvaluationResult(
       score=0.85,
       metrics={"accuracy": 0.9, "latency": 250},
       feedback="Good response, minor improvements possible",
       metadata={"model": "gpt-4", "version": "1.0"}
   )

EvaluationContext
~~~~~~~~~~~~~~~~~

Context information for evaluation runs.

.. autoclass:: honeyhive.evaluation.evaluators.EvaluationContext
   :members:
   :undoc-members:
   :show-inheritance:

Fields
^^^^^^

- **project** (str): Project name
- **source** (str): Source of evaluation
- **session_id** (Optional[str]): Session identifier
- **metadata** (Optional[Dict[str, Any]]): Additional context

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import EvaluationContext
   
   context = EvaluationContext(
       project="my-llm-app",
       source="production",
       session_id="session-123",
       metadata={"user_id": "user-456"}
   )

Evaluation Functions
--------------------

evaluate
~~~~~~~~

Main function for running experiments with evaluation.

.. note::
   ``evaluate()`` is exported from ``honeyhive.experiments``, not ``honeyhive.evaluation``.
   The ``honeyhive.evaluation`` module is deprecated — use ``honeyhive.experiments`` for new code.
   See :doc:`/reference/experiments/core-functions` for full documentation.

.. code-block:: python

   from honeyhive.experiments import evaluate, evaluator

   @evaluator
   def check_length(inputs, outputs, ground_truth=None):
       words = len(outputs.get("response", "").split())
       return {"word_count": words, "score": 1.0 if words >= 5 else 0.0}

   def my_function(inputs, ground_truth):
       return {"response": "Generated response for: " + inputs.get("prompt", "")}

   result = evaluate(
       function=my_function,
       dataset=[
           {"inputs": {"prompt": "What is AI?"}, "ground_truth": {}},
           {"inputs": {"prompt": "Explain ML"}, "ground_truth": {}},
       ],
       evaluators=[check_length],
       project="my-project",
       name="baseline-eval"
   )

See Also
--------

- :doc:`/reference/experiments/experiments` - Experiments API

