# Case Lifecycle Hooks

Control per-case setup, context preparation, and teardown during evaluation using `CaseLifecycle`.
## Overview

`CaseLifecycle` provides hooks at each stage of case evaluation. You pass a lifecycle class (not an instance) to `Dataset.evaluate`, and a new instance is created for each case, so instance attributes naturally hold case-specific state.
## Evaluation Flow

Each case follows this flow:

1. `setup()` runs before the task executes.
2. The task runs.
3. `prepare_context()` runs after the task, before evaluators.
4. Evaluators run.
5. `teardown()` runs after evaluators complete.
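To make the ordering concrete, here is a minimal pure-Python sketch of this flow. It is an illustration only, not the library's internals: `Lifecycle`, `run_case`, and the dict-based context are simplified stand-ins for `CaseLifecycle`, the evaluation loop, and `EvaluatorContext`.

```python
import asyncio


class Lifecycle:
    """Illustrative stand-in for CaseLifecycle; not the real implementation."""

    def __init__(self, case):
        self.case = case  # a fresh instance is created per case

    async def setup(self) -> None: ...
    async def prepare_context(self, ctx):
        return ctx
    async def teardown(self, result) -> None: ...


async def run_case(lifecycle_cls, case, task, evaluators):
    lc = lifecycle_cls(case)                   # new instance per case
    await lc.setup()                           # 1. setup before the task
    output = task(case['inputs'])              # 2. task runs
    ctx = {'output': output, 'metrics': {}}    # toy context
    ctx = await lc.prepare_context(ctx)        # 3. enrich context
    assertions = [ev(ctx) for ev in evaluators]  # 4. evaluators run
    result = {'output': output, 'metrics': ctx['metrics'], 'assertions': assertions}
    await lc.teardown(result)                  # 5. teardown last
    return result


# Usage: a lifecycle that records the order its hooks fire in
calls: list[str] = []


class Tracing(Lifecycle):
    async def setup(self) -> None:
        calls.append('setup')

    async def prepare_context(self, ctx):
        calls.append('prepare_context')
        return ctx

    async def teardown(self, result) -> None:
        calls.append('teardown')


result = asyncio.run(run_case(Tracing, {'inputs': 'hi'}, str.upper, [lambda ctx: True]))
print(calls)
#> ['setup', 'prepare_context', 'teardown']
print(result['output'])
#> HI
```

Because `run_case` constructs a fresh lifecycle instance for each case, state stored on `self` in `setup()` never leaks between cases.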
## Per-Case Setup and Teardown

Use `setup()` and `teardown()` when each case needs its own environment, such as creating a database, starting a service, or preparing fixtures driven by case metadata. Since a new lifecycle instance is created for each case, instance attributes are naturally case-scoped:
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators.context import EvaluatorContext
from pydantic_evals.lifecycle import CaseLifecycle
from pydantic_evals.reporting import ReportCase, ReportCaseFailure


class SetupFromMetadata(CaseLifecycle[str, str, dict]):
    async def setup(self) -> None:
        prefix = (self.case.metadata or {}).get('prefix', '')
        self.prefix = prefix

    async def prepare_context(
        self, ctx: EvaluatorContext[str, str, dict]
    ) -> EvaluatorContext[str, str, dict]:
        ctx.metrics['prefix_length'] = len(self.prefix)
        return ctx

    async def teardown(
        self,
        result: ReportCase[str, str, dict] | ReportCaseFailure[str, str, dict],
    ) -> None:
        pass  # Clean up resources here


dataset = Dataset(
    cases=[
        Case(name='no_prefix', inputs='hello', metadata={'prefix': ''}),
        Case(name='with_prefix', inputs='hello', metadata={'prefix': 'PREFIX:'}),
    ]
)

report = dataset.evaluate_sync(lambda inputs: inputs.upper(), lifecycle=SetupFromMetadata)
metrics = {c.name: c.metrics for c in report.cases}
print(metrics['no_prefix']['prefix_length'])
#> 0
print(metrics['with_prefix']['prefix_length'])
#> 7
```
The case metadata drives per-case behavior without needing custom `Case` subclasses or serialization logic.
## Conditional Teardown

The `teardown()` hook receives the full result, so you can vary cleanup logic based on success or failure, for example keeping test environments up for manual inspection when a case fails:
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.lifecycle import CaseLifecycle
from pydantic_evals.reporting import ReportCase, ReportCaseFailure

cleaned_up: list[str] = []


class ConditionalCleanup(CaseLifecycle[str, str, dict]):
    async def setup(self) -> None:
        self.resource_id = self.case.name

    async def teardown(
        self,
        result: ReportCase[str, str, dict] | ReportCaseFailure[str, str, dict],
    ) -> None:
        keep_on_failure = (self.case.metadata or {}).get('keep_on_failure', False)
        if isinstance(result, ReportCaseFailure) and keep_on_failure:
            pass  # Keep resource for inspection
        else:
            cleaned_up.append(self.resource_id)


dataset = Dataset(
    cases=[
        Case(name='success_case', inputs='hello', metadata={'keep_on_failure': True}),
        Case(name='failure_case', inputs='fail', metadata={'keep_on_failure': True}),
    ]
)


def task(inputs: str) -> str:
    if inputs == 'fail':
        raise ValueError('intentional failure')
    return inputs.upper()


report = dataset.evaluate_sync(task, max_concurrency=1, lifecycle=ConditionalCleanup)
print(cleaned_up)
#> ['success_case']
```
## Preparing Evaluator Context

The `prepare_context()` hook runs after the task completes but before evaluators see the context. Use it to add metrics or attributes based on the task output, span tree, or any other state, for example deriving metrics from instrumented spans (such as tool call counts or API latency) or computing values from external resources created in `setup()`:
```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from pydantic_evals.lifecycle import CaseLifecycle


class EnrichMetrics(CaseLifecycle):
    async def prepare_context(self, ctx: EvaluatorContext) -> EvaluatorContext:
        ctx.metrics['output_length'] = len(str(ctx.output))
        return ctx


@dataclass
class CheckLength(Evaluator):
    max_length: int = 50

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.metrics.get('output_length', 0) <= self.max_length


dataset = Dataset(
    cases=[Case(name='short', inputs='hi'), Case(name='long', inputs='hello world')],
    evaluators=[CheckLength()],
)

report = dataset.evaluate_sync(lambda inputs: inputs.upper(), lifecycle=EnrichMetrics)
for case in report.cases:
    print(f'{case.name}: output_length={case.metrics["output_length"]}')
#> short: output_length=2
#> long: output_length=11
```
## Type Parameters

`CaseLifecycle` is generic over the same three type parameters as `Case`: `InputsT`, `OutputT`, and `MetadataT`. All three default to `Any`, so you can omit them when your hooks don't need type-specific access:
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators.context import EvaluatorContext
from pydantic_evals.lifecycle import CaseLifecycle


# Works with any dataset — no type parameters needed
class GenericMetricEnricher(CaseLifecycle):
    async def prepare_context(self, ctx: EvaluatorContext) -> EvaluatorContext:
        ctx.metrics['custom'] = 42
        return ctx


dataset = Dataset(cases=[Case(inputs='test')])
report = dataset.evaluate_sync(lambda inputs: inputs, lifecycle=GenericMetricEnricher)
print(report.cases[0].metrics['custom'])
#> 42
```
## Next Steps
- Metrics & Attributes — Recording metrics inside tasks
- Custom Evaluators — Using enriched metrics in evaluators
- Span-Based Evaluation — Analyzing execution traces