# Multi-Run Evaluation
Run each case multiple times to measure variability and get more reliable aggregate results.
## Overview

AI systems are inherently stochastic — the same input can produce different outputs across runs. The `repeat` parameter lets you run each case multiple times and automatically aggregates the results, giving you a clearer picture of your system's typical behavior.
## Basic Usage

Pass `repeat` to `evaluate()` or `evaluate_sync()`:
```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(name='greeting', inputs='Say hello'),
        Case(name='farewell', inputs='Say goodbye'),
    ]
)


def task(inputs: str) -> str:
    return inputs.upper()


# Run each case 5 times
report = dataset.evaluate_sync(task, repeat=5)

# 2 cases × 5 repeats = 10 total runs
print(len(report.cases))
#> 10
```
When `repeat > 1`, each run gets an indexed name like `greeting [1/5]`, `greeting [2/5]`, etc., while the original case name is preserved in `source_case_name` for grouping.
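For instance, you can recover the original case name from any individual run. A short sketch, assuming the `report` from the example above:

```python
# Each run's name carries the repeat index; `source_case_name` points back
# to the case it was generated from.
for case in report.cases:
    # e.g. name='greeting [1/5]', source_case_name='greeting'
    print(case.name, case.source_case_name)
```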
## Accessing Grouped Results

Use `case_groups()` to access runs organized by original case, with per-group aggregated statistics:
```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(name='greeting', inputs='Say hello'),
        Case(name='farewell', inputs='Say goodbye'),
    ]
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=3)

groups = report.case_groups()
assert groups is not None  # None for single-run (repeat=1)

print(len(groups))
#> 2

group_names = [g.name for g in groups]
print(group_names)
#> ['greeting', 'farewell']

# Each group has 3 runs and aggregated statistics
for group in groups:
    assert len(group.runs) == 3
    assert len(group.failures) == 0
    assert group.summary.task_duration > 0
```
Each `ReportCaseGroup` contains:

- `name` — the original case name
- `runs` — the individual `ReportCase` results
- `failures` — any runs that raised exceptions
- `summary` — a `ReportCaseAggregate` with averaged scores, metrics, labels, assertions, and durations
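Beyond `task_duration` (used above), the summary exposes the averaged values described in the list. A minimal sketch, assuming the `groups` from the previous example and assuming the aggregate attributes for the averaged values are named `scores` and `assertions`:

```python
# Sketch only: the `scores` and `assertions` attribute names are assumed
# to match the averaged values described above.
for group in groups:
    print(group.name, group.summary.task_duration)
    print(group.summary.scores)      # averaged evaluator scores
    print(group.summary.assertions)  # fraction of passing assertions
```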
## Aggregation

With `repeat > 1`, the report's [`averages()`][pydantic_evals.reporting.EvaluationReport.averages] uses a two-level aggregation strategy:
- Per-group averages: Each case's runs are averaged into a group summary
- Cross-group averages: The group summaries are averaged to produce the final result
This ensures each original case contributes equally to the overall averages, regardless of how many runs succeeded or failed.
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[
        Case(name='easy', inputs='hello', expected_output='HELLO'),
        Case(name='hard', inputs='world', expected_output='WORLD'),
    ],
    evaluators=[EqualsExpected()],
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=3)

averages = report.averages()
assert averages is not None
print(f'Overall assertion rate: {averages.assertions}')
#> Overall assertion rate: 1.0
```
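To make the equal-weighting property concrete, here is a plain-Python sketch of the two-level arithmetic with hypothetical per-run scores (it does not call into the library):

```python
# Hypothetical per-run scores for two cases; 'hard' had two failed runs,
# so only one score survived.
runs_by_case = {
    'easy': [1.0, 1.0, 0.0],
    'hard': [1.0],
}

# Level 1: average each case's runs into a group summary.
group_means = {name: sum(s) / len(s) for name, s in runs_by_case.items()}
# {'easy': 0.667, 'hard': 1.0}

# Level 2: average the group summaries, so each case counts exactly once.
overall = sum(group_means.values()) / len(group_means)
# 0.833

# A naive pooled average would weight 'easy' three times as heavily,
# because it has three surviving runs instead of one.
pooled = sum(sum(s) for s in runs_by_case.values()) / sum(
    len(s) for s in runs_by_case.values()
)
# 0.75
```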
## Default Behavior

When `repeat=1` (the default), behavior is identical to a standard evaluation — no run indexing, no `source_case_name`, and `case_groups()` returns `None`:
```python
from pydantic_evals import Case, Dataset

dataset = Dataset(cases=[Case(name='test', inputs='hello')])


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task)  # repeat=1 by default

assert report.case_groups() is None
assert all(c.source_case_name is None for c in report.cases)
```
## Next Steps

- Concurrency & Performance — Control parallel execution with `max_concurrency`
- Metrics & Attributes — Track custom metrics across runs
- Logfire Integration — Visualize multi-run results