Built-in Evaluators

Pydantic Evals provides several built-in evaluators for common evaluation tasks.

Comparison Evaluators

EqualsExpected

Check if the output exactly equals the expected output from the case.

from pydantic_evals.evaluators import EqualsExpected

EqualsExpected()

Parameters: None

Returns: bool - True if ctx.output == ctx.expected_output

Example:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[
        Case(
            name='addition',
            inputs='2 + 2',
            expected_output='4',
        ),
    ],
    evaluators=[EqualsExpected()],
)

Notes:

  • Skips evaluation if expected_output is None (returns empty dict {})
  • Uses Python's == operator, so works with any comparable types
  • For structured data (dicts, lists, models), equality is checked recursively, so nested values must match exactly
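
To illustrate both notes (nested equality for structured data, and skipping when expected_output is missing), here is a minimal sketch; the case values are illustrative:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[
        # Compared with ==, so the nested structure must match exactly
        Case(
            name='structured',
            inputs='profile for Alice',
            expected_output={'name': 'Alice', 'tags': ['admin']},
        ),
        # No expected_output, so EqualsExpected skips this case
        Case(name='no-expectation', inputs='anything'),
    ],
    evaluators=[EqualsExpected()],
)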

Equals

Check if the output equals a specific value.

from pydantic_evals.evaluators import Equals

Equals(value='expected_result')

Parameters:

  • value (Any): The value to compare against
  • evaluation_name (str | None): Custom name for this evaluation in reports

Returns: bool - True if ctx.output == value

Example:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Equals

# Check output is always "success"
dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        Equals(value='success', evaluation_name='is_success'),
    ],
)

Use Cases:

  • Checking for sentinel values
  • Validating consistent outputs
  • Testing classification into specific categories

Contains

Check if the output contains a specific value or substring.

from pydantic_evals.evaluators import Contains

Contains(
    value='substring',
    case_sensitive=True,
    as_strings=False,
)

Parameters:

  • value (Any): The value to search for
  • case_sensitive (bool): Case-sensitive comparison for strings (default: True)
  • as_strings (bool): Convert both values to strings before checking (default: False)
  • evaluation_name (str | None): Custom name for this evaluation in reports

Returns: EvaluationReason - Pass/fail with explanation

Behavior:

For strings: checks substring containment

  • Contains(value='hello', case_sensitive=False)
  • Matches: "Hello World", "say hello", "HELLO"
  • Doesn't match: "hi there"

For lists/tuples: checks membership

  • Contains(value='apple')
  • Matches: ['apple', 'banana'], ('apple',)
  • Doesn't match: ['apples', 'orange']

For dicts: checks key-value pairs

  • Contains(value={'name': 'Alice'})
  • Matches: {'name': 'Alice', 'age': 30}
  • Doesn't match: {'name': 'Bob'}

Example:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Check for required keywords
        Contains(value='terms and conditions', case_sensitive=False),
        # Check for PII (fail if found)
        # Note: use a custom evaluator that returns False when PII is found (see the sketch below)
    ],
)
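
As the comment above notes, "must not contain" checks are better written as a custom evaluator. A rough sketch, where the regex is only a stand-in for a real PII detector and the class name is illustrative:

import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class NoEmailAddress(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Fail (return False) when an email-like pattern appears in the output
        return re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', str(ctx.output)) is None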

Use Cases:

  • Required content verification
  • Keyword detection
  • PII/sensitive data detection
  • Multi-value validation

Type Validation

IsInstance

Check if the output is an instance of a type with the given name.

from pydantic_evals.evaluators import IsInstance

IsInstance(type_name='str')

Parameters:

  • type_name (str): The type name to check (uses __name__ or __qualname__)
  • evaluation_name (str | None): Custom name for this evaluation in reports

Returns: EvaluationReason - Pass/fail with type information

Example:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Check output is always a string
        IsInstance(type_name='str'),
        # Check for Pydantic model
        IsInstance(type_name='MyModel'),
        # Check for dict
        IsInstance(type_name='dict'),
    ],
)

Notes:

  • Matches against both __name__ and __qualname__ of the type
  • Works with built-in types (str, int, dict, list, etc.)
  • Works with custom classes and Pydantic models
  • Checks the entire MRO (Method Resolution Order) for inheritance

Use Cases:

  • Format validation
  • Structured output verification
  • Type consistency checks
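
For structured output verification, a sketch with a concrete Pydantic model (the model, case, and field names are illustrative):

from pydantic import BaseModel

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance


class Invoice(BaseModel):
    total: float
    currency: str


dataset = Dataset(
    cases=[Case(inputs='Parse invoice #123')],
    evaluators=[
        # Passes if the output is an Invoice (or a subclass, via the MRO check)
        IsInstance(type_name='Invoice'),
    ],
)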

Performance Evaluation

MaxDuration

Check if task execution time stays within a maximum threshold.

from datetime import timedelta

from pydantic_evals.evaluators import MaxDuration

MaxDuration(seconds=2.0)
# or
MaxDuration(seconds=timedelta(seconds=2))

Parameters:

  • seconds (float | timedelta): Maximum allowed duration

Returns: bool - True if ctx.duration <= seconds

Example:

from datetime import timedelta

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import MaxDuration

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # SLA: must respond in under 2 seconds
        MaxDuration(seconds=2.0),
        # Or using timedelta
        MaxDuration(seconds=timedelta(milliseconds=500)),
    ],
)

Use Cases:

  • SLA compliance
  • Performance regression testing
  • Latency requirements
  • Timeout validation
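
Durations are measured around the task call during evaluation. A minimal sketch, assuming the dataset is run with evaluate_sync (the task below is a stand-in for a real model call):

import asyncio

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import MaxDuration


async def answer(question: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real model call
    return '4'


dataset = Dataset(
    cases=[Case(inputs='What is 2 + 2?')],
    evaluators=[MaxDuration(seconds=2.0)],
)

report = dataset.evaluate_sync(answer)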

See Also: Concurrency & Performance


LLM-as-a-Judge

LLMJudge

Use an LLM to evaluate subjective qualities based on a rubric.

from pydantic_evals.evaluators import LLMJudge

LLMJudge(
    rubric='Response is accurate and helpful',
    model='openai:gpt-4o',
    include_input=False,
    include_expected_output=False,
    model_settings=None,
    score=False,
    assertion={'include_reason': True},
)

Parameters:

  • rubric (str): The evaluation criteria (required)
  • model (Model | KnownModelName | None): Model to use (default: 'openai:gpt-4o')
  • include_input (bool): Include task inputs in the prompt (default: False)
  • include_expected_output (bool): Include expected output in the prompt (default: False)
  • model_settings (ModelSettings | None): Custom model settings
  • score (OutputConfig | False): Configure score output (default: False)
  • assertion (OutputConfig | False): Configure assertion output (default: includes reason)

Returns: Depends on score and assertion parameters (see below)

Output Modes:

By default, returns a boolean assertion with reason:

  • LLMJudge(rubric='Response is polite')
  • Returns: {'LLMJudge_pass': EvaluationReason(value=True, reason='...')}

Return a score (0.0 to 1.0) instead:

  • LLMJudge(rubric='Response quality', score={'include_reason': True}, assertion=False)
  • Returns: {'LLMJudge_score': EvaluationReason(value=0.85, reason='...')}

Return both score and assertion:

  • LLMJudge(rubric='Response quality', score={'include_reason': True}, assertion={'include_reason': True})
  • Returns: {'LLMJudge_score': EvaluationReason(value=0.85, reason='...'), 'LLMJudge_pass': EvaluationReason(value=True, reason='...')}

Customize evaluation names:

  • LLMJudge(rubric='Response is factually accurate', assertion={'evaluation_name': 'accuracy', 'include_reason': True})
  • Returns: {'accuracy': EvaluationReason(value=True, reason='...')}

Example:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[
        # Basic accuracy check
        LLMJudge(
            rubric='Response is factually accurate',
            include_input=True,
        ),
        # Quality score with different model
        LLMJudge(
            rubric='Overall response quality',
            model='anthropic:claude-3-7-sonnet-latest',
            score={'evaluation_name': 'quality', 'include_reason': False},
            assertion=False,
        ),
        # Check against expected output
        LLMJudge(
            rubric='Response matches the expected answer semantically',
            include_input=True,
            include_expected_output=True,
        ),
    ],
)
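
The model_settings parameter accepts the same settings object used by Pydantic AI models. A sketch pinning the judge to a low temperature for more repeatable judgments (assuming ModelSettings from pydantic_ai.settings is available):

from pydantic_ai.settings import ModelSettings

from pydantic_evals.evaluators import LLMJudge

judge = LLMJudge(
    rubric='Response is concise and free of filler',
    # Lower temperature tends to make judgments more repeatable
    model_settings=ModelSettings(temperature=0.0),
)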

See Also: LLM Judge Deep Dive


Span-Based Evaluation

HasMatchingSpan

Check if OpenTelemetry spans match a query (requires Logfire configuration).

from pydantic_evals.evaluators import HasMatchingSpan

HasMatchingSpan(
    query={'name_contains': 'tool_call'},
    evaluation_name='called_tool',
)

Parameters:

  • query (SpanQuery): Query to match against spans
  • evaluation_name (str | None): Custom name for this evaluation in reports

Returns: bool - True if any span matches the query

Example:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Check that a specific tool was called
        HasMatchingSpan(
            query={'name_contains': 'search_database'},
            evaluation_name='used_database',
        ),
        # Check for errors
        HasMatchingSpan(
            query={'has_attributes': {'error': True}},
            evaluation_name='had_errors',
        ),
        # Check duration constraints
        HasMatchingSpan(
            query={
                'name_equals': 'llm_call',
                'max_duration': 2.0,  # seconds
            },
            evaluation_name='llm_fast_enough',
        ),
    ],
)
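
For these queries to match anything, the task must emit OpenTelemetry spans, for example via Logfire instrumentation. A rough sketch (the span name and configure call are illustrative; adapt them to your own instrumentation):

import logfire

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

logfire.configure()  # spans must be recorded for HasMatchingSpan to see them


async def lookup(query: str) -> str:
    # This span is what name_contains='search_database' matches against
    with logfire.span('search_database'):
        return 'found it'


dataset = Dataset(
    cases=[Case(inputs='find user 42')],
    evaluators=[
        HasMatchingSpan(
            query={'name_contains': 'search_database'},
            evaluation_name='used_database',
        ),
    ],
)

report = dataset.evaluate_sync(lookup)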

See Also: Span-Based Evaluation


Quick Reference Table

Evaluator        Purpose                    Return Type        Cost  Speed
EqualsExpected   Exact match with expected  bool               Free  Instant
Equals           Equals specific value      bool               Free  Instant
Contains         Contains value/substring   bool + reason      Free  Instant
IsInstance       Type validation            bool + reason      Free  Instant
MaxDuration      Performance threshold      bool               Free  Instant
LLMJudge         Subjective quality         bool and/or float  $$    Slow
HasMatchingSpan  Behavioral check           bool               Free  Fast

Combining Evaluators

Best practice is to combine fast deterministic checks with slower LLM evaluations:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import (
    Contains,
    IsInstance,
    LLMJudge,
    MaxDuration,
)

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Cheap deterministic checks first
        IsInstance(type_name='str'),
        Contains(value='required_field'),
        MaxDuration(seconds=2.0),
        # Expensive LLM checks last
        LLMJudge(rubric='Response is helpful and accurate'),
    ],
)

This approach:

  1. Catches format/structure issues immediately
  2. Validates required content quickly
  3. Reserves the expensive LLM evaluation for the subjective quality judgment
  4. Provides comprehensive quality assessment
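
A minimal sketch of running the combined dataset above and printing the report (the task is a stand-in, and the print options shown are optional):

async def generate(inputs: str) -> str:
    return 'required_field: value'  # stand-in for the real task


report = dataset.evaluate_sync(generate)
report.print(include_input=True, include_output=True)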

Next Steps