Pydantic Evals

Pydantic Evals is a powerful evaluation framework for systematically testing and evaluating AI systems, from simple LLM calls to complex multi-agent applications.

What is Pydantic Evals?

Pydantic Evals helps you:

  • Create test datasets with type-safe structured inputs and expected outputs
  • Run evaluations against your AI systems with automatic concurrency
  • Score results using deterministic checks, LLM judges, or custom evaluators
  • Generate reports with detailed metrics, assertions, and performance data
  • Track changes by comparing evaluation runs over time
  • Integrate with Logfire for visualization and collaborative analysis

Installation

pip install pydantic-evals

For OpenTelemetry tracing and Logfire integration:

pip install 'pydantic-evals[logfire]'
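
If you're sending results to Logfire, configure it once before running evaluations. A minimal sketch, assuming you already have a Logfire project and write token set up:

import logfire

logfire.configure()  # send evaluation traces to your Logfire project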

Quick Start

While evaluations are typically used to test AI systems, Pydantic Evals can evaluate any Python function. To demonstrate the core functionality, we'll start with a simple, deterministic example.

Here's a complete example of evaluating a simple text transformation function:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains, EqualsExpected

# Create a dataset with test cases
dataset = Dataset(
    cases=[
        Case(
            name='uppercase_basic',
            inputs='hello world',
            expected_output='HELLO WORLD',
        ),
        Case(
            name='uppercase_with_numbers',
            inputs='hello 123',
            expected_output='HELLO 123',
        ),
    ],
    evaluators=[
        EqualsExpected(),  # Check exact match with expected_output
        Contains(value='HELLO', case_sensitive=True),  # Check contains "HELLO"
    ],
)


# Define the function to evaluate
def uppercase_text(text: str) -> str:
    return text.upper()


# Run the evaluation
report = dataset.evaluate_sync(uppercase_text)

# Print the results
report.print()
"""
        Evaluation Summary: uppercase_text
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID                ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ uppercase_basic        │ ✔✔         │     10ms │
├────────────────────────┼────────────┼──────────┤
│ uppercase_with_numbers │ ✔✔         │     10ms │
├────────────────────────┼────────────┼──────────┤
│ Averages               │ 100.0% ✔   │     10ms │
└────────────────────────┴────────────┴──────────┘
"""

Key Concepts

Understanding a few core concepts will help you get the most out of Pydantic Evals:

  • Dataset - A collection of test cases and (optional) evaluators
  • Case - A single test scenario with inputs and optional expected outputs and case-specific evaluators
  • Evaluator - A function that scores or validates task outputs
  • EvaluationReport - Results from running an evaluation

For a deeper dive, see Core Concepts.
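
Built-in evaluators like EqualsExpected and Contains cover the common cases, and you can also write your own by subclassing Evaluator and implementing an evaluate method that receives an EvaluatorContext. A minimal sketch (the UppercaseOnly evaluator below is illustrative, not part of the library):

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class UppercaseOnly(Evaluator[str, str]):
    """Illustrative custom evaluator: passes when the output is fully uppercase."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        return ctx.output == ctx.output.upper()

Returning a bool makes the result an assertion in the report; returning a number would instead be recorded as a score.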

Common Use Cases

Deterministic Validation

Test that your AI system produces correctly structured outputs:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains, IsInstance

dataset = Dataset(
    cases=[
        Case(inputs={'data': 'required_key present'}, expected_output={'result': 'success'}),
    ],
    evaluators=[
        IsInstance(type_name='dict'),
        Contains(value='required_key'),
    ],
)
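
Running the dataset follows the same pattern as the Quick Start; the process_data function below is a hypothetical stand-in for your own system:

def process_data(inputs: dict) -> dict:
    # hypothetical task under test; IsInstance asserts the output is a dict,
    # and Contains asserts the output contains 'required_key'
    return {'required_key': True, 'result': 'success'}


report = dataset.evaluate_sync(process_data)
report.print()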

LLM-as-a-Judge Evaluation

Use an LLM to evaluate subjective qualities like accuracy or helpfulness:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(inputs='What is the capital of France?', expected_output='Paris'),
    ],
    evaluators=[
        LLMJudge(
            rubric='Response is accurate and helpful',
            include_input=True,
            model='anthropic:claude-3-7-sonnet-latest',
        )
    ],
)
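
Running this dataset looks the same as any other; answer_question below is a placeholder for your own agent, and the judge itself needs credentials for the configured Anthropic model:

async def answer_question(question: str) -> str:
    # placeholder task; in practice this would call your agent or model
    return 'The capital of France is Paris.'


report = dataset.evaluate_sync(answer_question)
report.print()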

Performance Testing

Ensure your system meets performance requirements:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import MaxDuration

dataset = Dataset(
    cases=[
        Case(inputs='test input', expected_output='test output'),
    ],
    evaluators=[
        MaxDuration(seconds=2.0),
    ],
)
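
MaxDuration asserts on how long each case takes to execute; here is a quick sketch with a stand-in task (slow_task is purely illustrative):

import time


def slow_task(inputs: str) -> str:
    # stand-in task; MaxDuration fails any case that runs longer than 2 seconds
    time.sleep(0.1)
    return 'test output'


report = dataset.evaluate_sync(slow_task)
report.print()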

Next Steps

Explore the documentation to learn more: