Core Concepts

This page explains the key concepts in Pydantic Evals and how they work together.

Overview

Pydantic Evals is built around these core concepts:

  • Dataset - A static definition containing test cases and evaluators
  • Case - A single test scenario with inputs and optional expected outputs
  • Evaluator - Logic for scoring or validating outputs
  • Experiment - The act of running a task function against all cases in a dataset. (This corresponds to a call to Dataset.evaluate.)
  • EvaluationReport - The results from running an experiment

The key distinction is between:

  • Definition (Dataset with Cases and Evaluators) - what you want to test
  • Execution (Experiment) - running your task against those tests
  • Results (EvaluationReport) - what happened during the experiment

Unit Testing Analogy

A helpful way to think about Pydantic Evals:

Unit Testing              Pydantic Evals
------------              --------------
Test function             Case + Evaluator
Test suite                Dataset
Running tests (pytest)    Experiment (dataset.evaluate(task))
Test report               EvaluationReport
assert                    Evaluator returning bool

Key Difference: AI systems are probabilistic, so instead of simple pass/fail, evaluations can have:

  • Quantitative scores (0.0 to 1.0)
  • Qualitative labels ("good", "acceptable", "poor")
  • Pass/fail assertions with explanatory reasons

Just like you can run pytest multiple times on the same test suite, you can run multiple experiments on the same dataset to compare different implementations or track changes over time.

Dataset

A Dataset is a collection of test cases and evaluators that define an evaluation suite.

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance

dataset = Dataset(
    name='my_eval_suite',  # Optional name
    cases=[
        Case(inputs='test input', expected_output='test output'),
    ],
    evaluators=[
        IsInstance(type_name='str'),
    ],
)

Key Features

  • Type-safe: Generic over InputsT, OutputT, and MetadataT types
  • Serializable: Can be saved to/loaded from YAML or JSON files (see the sketch below)
  • Evaluable: Run against any function with matching input/output types
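
For example, the type-safety and serialization features can be combined when persisting a suite to disk. This is a minimal sketch; it assumes the Dataset.to_file and Dataset.from_file helpers and the usual InputsT/OutputT/MetadataT generic parameters:

from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[Case(inputs='test input', expected_output='test output')],
)

# Persist the static definition (format inferred from the file extension)
dataset.to_file('my_eval_suite.yaml')

# Load it back later; the subscript pins InputsT, OutputT, and MetadataT
loaded = Dataset[str, str, dict].from_file('my_eval_suite.yaml')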

Dataset-Level vs Case-Level Evaluators

Evaluators can be defined at two levels:

  • Dataset-level: Apply to all cases in the dataset
  • Case-level: Apply only to specific cases

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected, IsInstance

dataset = Dataset(
    cases=[
        Case(
            name='special_case',
            inputs='test',
            expected_output='TEST',
            evaluators=[
                # This evaluator only runs for this case
                EqualsExpected(),
            ],
        ),
    ],
    evaluators=[
        # This evaluator runs for ALL cases
        IsInstance(type_name='str'),
    ],
)

Experiments

An Experiment is what happens when you execute a task function against all cases in a dataset. This is the bridge between your static test definition (the Dataset) and your results (the EvaluationReport).

Running an Experiment

You run an experiment by calling evaluate() or evaluate_sync() on a dataset:

from pydantic_evals import Case, Dataset

# Define your dataset (static definition)
dataset = Dataset(
    cases=[
        Case(inputs='hello', expected_output='HELLO'),
        Case(inputs='world', expected_output='WORLD'),
    ],
)

# Define your task
def uppercase_task(text: str) -> str:
    return text.upper()

# Run the experiment (execution)
report = dataset.evaluate_sync(uppercase_task)

What Happens During an Experiment

When you run an experiment:

  1. Setup: The dataset loads all cases and evaluators
  2. Execution: For each case:
    1. The task function is called with case.inputs
    2. Execution time is measured and OpenTelemetry spans are captured (if logfire is configured)
    3. The output of the task function is recorded
  3. Evaluation: For each case output:
    1. All dataset-level evaluators are run
    2. Case-specific evaluators are run (if any)
    3. Results are collected (scores, assertions, labels)
  4. Reporting: All results are aggregated into an EvaluationReport
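
Conceptually, the steps above boil down to a loop like the following. This is a deliberately simplified, hypothetical sketch using plain dicts and callables; the real Dataset.evaluate runs cases concurrently, builds EvaluatorContext objects, and captures OpenTelemetry spans:

import time


def run_experiment_sketch(cases, dataset_evaluators, task):
    results = []
    for case in cases:                                        # 2. Execution
        start = time.perf_counter()
        output = task(case['inputs'])                         # 2.1 call the task with the case inputs
        duration = time.perf_counter() - start                # 2.2 measure execution time
        evaluators = dataset_evaluators + case.get('evaluators', [])  # 3.1 + 3.2
        evaluations = [evaluate(output, case) for evaluate in evaluators]  # 3.3 collect results
        results.append({'output': output, 'duration': duration, 'evaluations': evaluations})
    return results                                            # 4. aggregated into a report


results = run_experiment_sketch(
    cases=[{'inputs': 'hello', 'expected_output': 'HELLO'}],
    dataset_evaluators=[lambda output, case: output == case.get('expected_output')],
    task=str.upper,
)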

Multiple Experiments from One Dataset

A key feature of Pydantic Evals is that you can run the same dataset against different task implementations:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[
        Case(inputs='hello', expected_output='HELLO'),
    ],
    evaluators=[EqualsExpected()],
)


# Original implementation
def task_v1(text: str) -> str:
    return text.upper()


# Improved implementation (with exclamation)
def task_v2(text: str) -> str:
    return text.upper() + '!'


# Compare results
report_v1 = dataset.evaluate_sync(task_v1)
report_v2 = dataset.evaluate_sync(task_v2)

avg_v1 = report_v1.averages()
avg_v2 = report_v2.averages()
print(f'V1 pass rate: {avg_v1.assertions if avg_v1 and avg_v1.assertions else 0}')
#> V1 pass rate: 1.0
print(f'V2 pass rate: {avg_v2.assertions if avg_v2 and avg_v2.assertions else 0}')
#> V2 pass rate: 0

This allows you to:

  • Compare implementations across versions
  • Track performance over time
  • A/B test different approaches
  • Validate changes before deployment

Case

A Case represents a single test scenario with specific inputs and optional expected outputs.

from pydantic_evals import Case
from pydantic_evals.evaluators import EqualsExpected

case = Case(
    name='test_uppercase',  # Optional, but recommended for reporting
    inputs='hello world',  # Required: inputs to your task
    expected_output='HELLO WORLD',  # Optional: expected output
    metadata={'category': 'basic'},  # Optional: arbitrary metadata
    evaluators=[EqualsExpected()],  # Optional: case-specific evaluators
)

Case Components

Inputs

The inputs to pass to the task being evaluated. Can be any type:

from pydantic import BaseModel

from pydantic_evals import Case


class MyInputModel(BaseModel):
    field1: str


# Simple types
Case(inputs='hello')
Case(inputs=42)

# Complex types
Case(inputs={'query': 'What is AI?', 'max_tokens': 100})
Case(inputs=MyInputModel(field1='value'))

Expected Output

The expected result, used by evaluators like EqualsExpected:

from pydantic_evals import Case

Case(
    inputs='2 + 2',
    expected_output='4',
)

If no expected_output is provided, evaluators that require it (like EqualsExpected) will skip that case.

Metadata

Arbitrary data that evaluators can access via EvaluatorContext:

from pydantic_evals import Case

Case(
    inputs='question',
    metadata={
        'difficulty': 'hard',
        'category': 'math',
        'source': 'exam_2024',
    },
)

Metadata is useful for:

  • Filtering cases during analysis
  • Providing context to evaluators (see the sketch below)
  • Organizing test suites
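
For example, a custom evaluator can read ctx.metadata to adjust how strictly a case is scored. This is a sketch; DifficultyAwareCheck is a made-up evaluator, not part of the library:

from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class DifficultyAwareCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        if ctx.output == ctx.expected_output:
            return 1.0
        # Give partial credit on cases tagged as 'hard' in their metadata
        return 0.5 if (ctx.metadata or {}).get('difficulty') == 'hard' else 0.0


dataset = Dataset(
    cases=[
        Case(inputs='question', expected_output='answer', metadata={'difficulty': 'hard'}),
    ],
    evaluators=[DifficultyAwareCheck()],
)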

Evaluators

Cases can have their own evaluators that run only for that specific case. This is particularly powerful for building comprehensive evaluation suites where different cases have different requirements: if a single evaluator rubric worked perfectly for every case, you could simply fold it into your agent instructions. Case-specific LLMJudge evaluators are especially useful for quickly building maintainable golden datasets, since you can describe what "good" looks like for each scenario. See Case-specific evaluators for a more detailed explanation and examples.
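
For instance, each case can carry its own LLMJudge rubric describing what a good answer looks like for that scenario. This is a sketch assuming the built-in LLMJudge evaluator (which requires an LLM model to be available when the experiment runs):

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='refund_request',
            inputs='I want a refund for my broken headphones.',
            evaluators=[
                # Rubric describing what "good" looks like for this scenario only
                LLMJudge(rubric='The response apologizes and explains the refund process.'),
            ],
        ),
        Case(
            name='greeting',
            inputs='Hi there!',
            evaluators=[
                LLMJudge(rubric='The response is a short, friendly greeting.'),
            ],
        ),
    ],
)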

Evaluator

An Evaluator assesses the output of your task and returns one or more scores, labels, or assertions. Each score, label, or assertion can also have an optional string reason attached.

Evaluator Types

Evaluators return different types of results:

Return Type     Purpose                           Example
-----------     -------                           -------
bool            Assertion - Pass/fail check       True → ✔, False → ✗
int or float    Score - Numeric quality metric    0.95, 87
str             Label - Categorical result        "correct", "hallucination"

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.output == ctx.expected_output  # Assertion


@dataclass
class Confidence(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        # Analyze output and return confidence score
        return 0.95  # Score


@dataclass
class Classifier(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> str:
        if 'error' in ctx.output.lower():
            return 'error'  # Label
        return 'success'

Evaluators can also return instances of EvaluationReason, and dictionaries mapping labels to output values. See the custom evaluator return types docs for more detail.

EvaluatorContext

All evaluators receive an EvaluatorContext containing:

  • name: Case name (optional)
  • inputs: Task inputs
  • metadata: Case metadata (optional)
  • expected_output: Expected output (optional)
  • output: Actual output from task
  • duration: Task execution time in seconds
  • span_tree: OpenTelemetry spans (if logfire is configured)
  • attributes: Custom attributes dict
  • metrics: Custom metrics dict
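
A custom evaluator can combine several of these fields. This is a sketch; FastAndCorrect is a made-up example, not part of the library:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class FastAndCorrect(Evaluator):
    max_seconds: float = 2.0

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Pass only if the output matches the expected output and the task
        # finished within the time budget reported by ctx.duration
        return ctx.output == ctx.expected_output and ctx.duration <= self.max_seconds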

Multiple Evaluations

Evaluators can return multiple results by returning a dictionary:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class MultiCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | float | str]:
        return {
            'is_valid': isinstance(ctx.output, str),  # Assertion
            'length': len(ctx.output),  # Score
            'category': 'long' if len(ctx.output) > 100 else 'short',  # Label
        }

Evaluation Reasons

Add explanations to your evaluations using EvaluationReason:

from dataclasses import dataclass

from pydantic_evals.evaluators import EvaluationReason, Evaluator, EvaluatorContext


@dataclass
class SmartCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        if ctx.output == ctx.expected_output:
            return EvaluationReason(
                value=True,
                reason='Exact match with expected output',
            )
        return EvaluationReason(
            value=False,
            reason=f'Expected {ctx.expected_output!r}, got {ctx.output!r}',
        )

Reasons appear in reports when using include_reasons=True.
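
For example, continuing with the SmartCheck evaluator defined above (and assuming report.print accepts the include_reasons flag in your installed version):

from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[Case(inputs='hello', expected_output='HELLO')],
    evaluators=[SmartCheck()],  # the custom evaluator defined above
)

report = dataset.evaluate_sync(str.upper)
# Render the report with each evaluator's reason string included
report.print(include_reasons=True)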

Evaluation Report

An EvaluationReport is the result of running an experiment. It contains all the data from executing your task against the dataset's cases and running all evaluators.

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[Case(inputs='hello', expected_output='HELLO')],
    evaluators=[EqualsExpected()],
)


def my_task(text: str) -> str:
    return text.upper()


# Run an experiment
report = dataset.evaluate_sync(my_task)

# Print to console
report.print()
"""
    Evaluation Summary: my_task
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID  ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Case 1   │ ✔          │     10ms │
├──────────┼────────────┼──────────┤
│ Averages │ 100.0% ✔   │     10ms │
└──────────┴────────────┴──────────┘
"""

# Access data programmatically
for case in report.cases:
    print(f'{case.name}: {case.scores}')
    #> Case 1: {}

Report Structure

The EvaluationReport contains:

  • name: Experiment name
  • cases: List of successful case evaluations
  • failures: List of failed executions
  • trace_id: OpenTelemetry trace ID (optional)
  • span_id: OpenTelemetry span ID (optional)
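
Continuing with the example above, these top-level fields can be read directly (a sketch using the attribute names listed here):

print(report.name)           # experiment name
print(len(report.cases))     # one entry per successfully evaluated case
print(len(report.failures))  # cases whose task execution raised an error

averages = report.averages()  # aggregate scores, labels, and assertions
if averages:
    print(averages.assertions)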

ReportCase

Each successful case result contains:

Case data:

  • name: Case name
  • inputs: Task inputs
  • metadata: Case metadata (optional)
  • expected_output: Expected output (optional)
  • output: Actual output from task

Evaluation results:

  • scores: Dictionary of numeric scores from evaluators
  • labels: Dictionary of categorical labels from evaluators
  • assertions: Dictionary of pass/fail assertions from evaluators

Performance data:

  • task_duration: Task execution time
  • total_duration: Total time including evaluators

Additional data:

  • metrics: Custom metrics dict
  • attributes: Custom attributes dict

Tracing:

  • trace_id: OpenTelemetry trace ID (optional)
  • span_id: OpenTelemetry span ID (optional)

Errors:

  • evaluator_failures: List of evaluator errors
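
Continuing with the same report, each ReportCase can be inspected programmatically (a sketch using the attribute names listed above):

for case in report.cases:
    print(case.name, case.output)                     # case data
    print(case.scores, case.labels, case.assertions)  # evaluation results
    print(case.task_duration, case.total_duration)    # performance data
    if case.evaluator_failures:
        print(f'{len(case.evaluator_failures)} evaluator(s) raised an error')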

Data Model Relationships

Here's how the core concepts relate to each other:

Static Definition

  • A Dataset contains:
    • Many Cases (test scenarios with inputs and expected outputs)
    • Many Evaluators (logic for scoring outputs)

Execution (Experiment)

When you call dataset.evaluate(task), an Experiment runs:

  • The Task function is executed against all Cases in the Dataset
  • All Evaluators are run (both dataset-level and case-specific) against each output as appropriate
  • One EvaluationReport is produced as the final output

Results

  • An EvaluationReport contains:
    • Results for each Case (inputs, outputs, scores, assertions, labels)
    • Summary statistics (averages, pass rates)
    • Performance data (durations)
    • Tracing information (OpenTelemetry spans)

Key Relationships

  • One Dataset → Many Experiments: You can run the same dataset against different task implementations or multiple times to track changes
  • One Experiment → One Report: Each time you call dataset.evaluate(...), you get one report
  • One Experiment → Many Case Results: The report contains results for every case in the dataset

Next Steps