Span-Based Evaluation
Evaluate AI system behavior by analyzing OpenTelemetry spans captured during execution.
Requires Logfire
Span-based evaluation requires logfire to be installed and configured:
pip install 'pydantic-evals[logfire]'
Overview
Span-based evaluation enables you to evaluate how your AI system executes, not just what it produces. This is essential for complex agents, where correct behavior depends on the execution path taken, not just the final output.
Why Span-Based Evaluation?
Traditional evaluators assess task inputs and outputs. For simple tasks, this may be sufficient—if the output is correct, the task succeeded. But for complex multi-step agents, the process matters as much as the result:
- A correct answer reached incorrectly - An agent might produce the right output by accident (e.g., guessing, using cached data when it should have searched, calling the wrong tools but getting lucky)
- Verification of required behaviors - You need to ensure specific tools were called, certain code paths executed, or particular patterns followed
- Performance and efficiency - The agent should reach the answer efficiently, without unnecessary tool calls, infinite loops, or excessive retries
- Safety and compliance - It's critical to verify that dangerous operations weren't attempted, sensitive data wasn't accessed inappropriately, and guardrails weren't bypassed
Real-World Scenarios
Span-based evaluation is particularly valuable for:
- RAG systems - Verify documents were retrieved and reranked before generation, not just that the answer included citations
- Multi-agent coordination - Ensure the orchestrator delegated to the right specialist agents in the correct order
- Tool-calling agents - Confirm specific tools were used (or avoided), and in the expected sequence
- Debugging and regression testing - Catch behavioral regressions where outputs remain correct but the internal logic deteriorates
- Production alignment - Ensure your evaluation assertions operate on the same telemetry data captured in production, so eval insights directly translate to production monitoring
How It Works
When you configure logfire (logfire.configure()), Pydantic Evals captures all OpenTelemetry spans generated during task execution. You can then write evaluators that assert conditions on:
- Which tools were called - e.g. HasMatchingSpan(query={'name_contains': 'search_tool'})
- Code paths executed - Verify specific functions ran or particular branches taken
- Timing characteristics - Check that operations complete within SLA bounds
- Error conditions - Detect retries, fallbacks, or specific failure modes
- Execution structure - Verify parent-child relationships, delegation patterns, or execution order
This creates a fundamentally different evaluation paradigm: you're testing behavioral contracts, not just input-output relationships.
Basic Usage
import logfire
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan
# Configure logfire to capture spans
logfire.configure(send_to_logfire='if-token-present')
dataset = Dataset(
cases=[Case(inputs='test')],
evaluators=[
# Check that database was queried
HasMatchingSpan(
query={'name_contains': 'database_query'},
evaluation_name='used_database',
),
],
)
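The task under evaluation must emit spans for there to be anything to match. Continuing the example above, here is a minimal sketch of an instrumented task and how to run the dataset against it (the lookup function and its span name are illustrative):
# A toy task that emits a 'database_query' span while it runs
def lookup(query: str) -> str:
    with logfire.span('database_query'):
        return f'result for {query}'
report = dataset.evaluate_sync(lookup)
report.print()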
HasMatchingSpan Evaluator
The HasMatchingSpan evaluator checks if any span matches a query:
from pydantic_evals.evaluators import HasMatchingSpan
HasMatchingSpan(
query={'name_contains': 'test'},
evaluation_name='span_check',
)
Returns: bool - True if any span matches the query
SpanQuery Reference
A SpanQuery is a dictionary with query conditions:
Name Conditions
Match spans by name:
# Exact name match
{'name_equals': 'search_database'}
# Contains substring
{'name_contains': 'tool_call'}
# Regex pattern
{'name_matches_regex': r'llm_call_\d+'}
Attribute Conditions
Match spans with specific attributes:
# Has specific attribute values
{'has_attributes': {'operation': 'search', 'status': 'success'}}
# Has attribute keys (any value)
{'has_attribute_keys': ['user_id', 'request_id']}
Duration Conditions
Match based on execution time:
from datetime import timedelta
# Minimum duration
{'min_duration': 1.0} # seconds
{'min_duration': timedelta(seconds=1)}
# Maximum duration
{'max_duration': 5.0} # seconds
{'max_duration': timedelta(seconds=5)}
# Range
{'min_duration': 0.5, 'max_duration': 2.0}
Logical Operators
Combine conditions:
# NOT
{'not_': {'name_contains': 'error'}}
# AND (all must match)
{'and_': [
{'name_contains': 'tool'},
{'max_duration': 1.0},
]}
# OR (at least one must match)
{'or_': [
{'name_equals': 'search'},
{'name_equals': 'query'},
]}
Child/Descendant Conditions
Query relationships between spans:
# Count direct children
{'min_child_count': 1}
{'max_child_count': 5}
# Some child matches query
{'some_child_has': {'name_contains': 'retry'}}
# All children match query
{'all_children_have': {'max_duration': 0.5}}
# No children match query
{'no_child_has': {'has_attributes': {'error': True}}}
# Descendant queries (recursive)
{'min_descendant_count': 5}
{'some_descendant_has': {'name_contains': 'api_call'}}
Ancestor/Depth Conditions
Query span hierarchy:
# Depth (root spans have depth 0)
{'min_depth': 1} # Not a root span
{'max_depth': 2} # At most 2 levels deep
# Ancestor queries
{'some_ancestor_has': {'name_equals': 'agent_run'}}
{'all_ancestors_have': {'max_duration': 10.0}}
{'no_ancestor_has': {'has_attributes': {'error': True}}}
Stop Recursing
Control recursive queries:
{
'some_descendant_has': {'name_contains': 'expensive'},
'stop_recursing_when': {'name_equals': 'boundary'},
}
# Only search descendants until hitting a span named 'boundary'
Practical Examples
Verify Tool Usage
Check that specific tools were called:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan
dataset = Dataset(
cases=[Case(inputs='test')],
evaluators=[
# Must call search tool
HasMatchingSpan(
query={'name_contains': 'search_tool'},
evaluation_name='used_search',
),
# Must NOT call dangerous tool (but see the note on negative checks below)
HasMatchingSpan(
query={'not_': {'name_contains': 'delete_database'}},
evaluation_name='safe_execution',
),
],
)
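A caveat on negative checks: HasMatchingSpan passes as soon as any span matches its query, so the safe_execution assertion above succeeds whenever at least one span does not contain 'delete_database' in its name, which is nearly always true. For a strict guarantee that no span matches a condition, a custom evaluator over the span tree is more reliable. A minimal sketch, using the SpanTree.any method documented below (the NoSpanMatching name is illustrative):
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
@dataclass
class NoSpanMatching(Evaluator):
    # Passes only if no span in the trace matches the given name fragment
    name_fragment: str
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Negate span_tree.any() for a strict "this never happened" check
        return not ctx.span_tree.any({'name_contains': self.name_fragment})
For example, NoSpanMatching(name_fragment='delete_database') fails if the dangerous tool was called anywhere in the trace.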
Check Multiple Tools
Verify that each step of a multi-step workflow occurred (note that presence checks alone don't verify ordering):
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
HasMatchingSpan(
query={'name_contains': 'retrieve_context'},
evaluation_name='retrieved_context',
),
HasMatchingSpan(
query={'name_contains': 'generate_response'},
evaluation_name='generated_response',
),
HasMatchingSpan(
query={'and_': [
{'name_contains': 'cite'},
{'has_attribute_keys': ['source_id']},
]},
evaluation_name='added_citations',
),
]
Performance Assertions
Ensure operations meet latency requirements:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Database queries should be fast
HasMatchingSpan(
query={'and_': [
{'name_contains': 'database'},
{'max_duration': 0.1}, # 100ms max
]},
evaluation_name='fast_db_queries',
),
# Overall should complete quickly
HasMatchingSpan(
query={'and_': [
{'name_equals': 'task_execution'},
{'max_duration': 2.0},
]},
evaluation_name='within_sla',
),
]
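If the goal is to bound the task's overall wall-clock time rather than individual spans, the built-in MaxDuration evaluator (not covered on this page) can be used alongside these span checks; a minimal sketch, assuming it is available in your version of pydantic-evals:
from pydantic_evals.evaluators import MaxDuration
evaluators = [
    # Whole task must finish within 2 seconds
    MaxDuration(seconds=2.0),
]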
Error Detection
Check for error conditions:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# No errors occurred
HasMatchingSpan(
query={'not_': {'has_attributes': {'error': True}}},
evaluation_name='no_errors',
),
# Retries happened
HasMatchingSpan(
query={'name_contains': 'retry'},
evaluation_name='had_retries',
),
# Fallback was used
HasMatchingSpan(
query={'name_contains': 'fallback_model'},
evaluation_name='used_fallback',
),
]
Complex Behavioral Checks
Verify sophisticated behavior patterns:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Agent delegated to sub-agent
HasMatchingSpan(
query={'and_': [
{'name_contains': 'agent'},
{'some_child_has': {'name_contains': 'delegate'}},
]},
evaluation_name='used_delegation',
),
# Made multiple LLM calls with retries
HasMatchingSpan(
query={'and_': [
{'name_contains': 'llm_call'},
{'some_descendant_has': {'name_contains': 'retry'}},
{'min_descendant_count': 3},
]},
evaluation_name='retry_pattern',
),
]
Custom Evaluators with SpanTree
For more complex span analysis, write custom evaluators:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
@dataclass
class CustomSpanCheck(Evaluator):
def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | int]:
span_tree = ctx.span_tree
# Find specific spans
llm_spans = span_tree.find(lambda node: 'llm' in node.name)
tool_spans = span_tree.find(lambda node: 'tool' in node.name)
# Calculate metrics
total_llm_time = sum(
span.duration.total_seconds() for span in llm_spans
)
return {
'used_llm': len(llm_spans) > 0,
'used_tools': len(tool_spans) > 0,
'tool_count': len(tool_spans),
'llm_fast': total_llm_time < 2.0,
}
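Because this evaluator returns a mapping, each key (used_llm, tool_count, and so on) should appear as its own named result in the report. Attaching it to a dataset works the same as for the built-in evaluators; continuing the example (the case inputs are illustrative):
from pydantic_evals import Case, Dataset
dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[CustomSpanCheck()],
)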
SpanTree API
The SpanTree provides methods for span analysis:
from pydantic_evals.otel import SpanTree
# Example API (requires span_tree from context)
def example_api(span_tree: SpanTree) -> None:
span_tree.find(lambda n: True) # Find all matching nodes
span_tree.any({'name_contains': 'test'}) # Check if any span matches
span_tree.all({'name_contains': 'test'}) # Check if all spans match
span_tree.count({'name_contains': 'test'}) # Count matching spans
# Iteration
for node in span_tree:
print(node.name, node.duration, node.attributes)
SpanNode Properties
Each SpanNode has:
from pydantic_evals.otel import SpanNode
# Example properties (requires node from context)
def example_properties(node: SpanNode) -> None:
_ = node.name # Span name
_ = node.duration # timedelta
_ = node.attributes # dict[str, AttributeValue]
_ = node.start_timestamp # datetime
_ = node.end_timestamp # datetime
_ = node.children # list[SpanNode]
_ = node.descendants # list[SpanNode] (recursive)
_ = node.ancestors # list[SpanNode]
_ = node.parent # SpanNode | None
Debugging Span Queries
View Spans in Logfire
If you're sending data to Logfire, you can view all spans in the web UI to understand the trace structure.
Print Span Tree
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
@dataclass
class DebugSpans(Evaluator):
def evaluate(self, ctx: EvaluatorContext) -> bool:
for node in ctx.span_tree:
print(f"{' ' * len(node.ancestors)}{node.name} ({node.duration})")
return True
Query Testing
Test queries incrementally:
from pydantic_evals.evaluators import HasMatchingSpan
# Start simple
query = {'name_contains': 'tool'}
# Add conditions gradually
query = {'and_': [
{'name_contains': 'tool'},
{'max_duration': 1.0},
]}
# Test in evaluator
HasMatchingSpan(query=query, evaluation_name='test')
Use Cases
RAG System Verification
Verify retrieval-augmented generation workflow:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Retrieved documents
HasMatchingSpan(
query={'name_contains': 'vector_search'},
evaluation_name='retrieved_docs',
),
# Reranked results
HasMatchingSpan(
query={'name_contains': 'rerank'},
evaluation_name='reranked_results',
),
# Generated with context
HasMatchingSpan(
query={'and_': [
{'name_contains': 'generate'},
{'has_attribute_keys': ['context_ids']},
]},
evaluation_name='used_context',
),
]
Multi-Agent Systems
Verify agent coordination:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Master agent ran
HasMatchingSpan(
query={'name_equals': 'master_agent'},
evaluation_name='master_ran',
),
# Delegated to specialist
HasMatchingSpan(
query={'and_': [
{'name_contains': 'specialist_agent'},
{'some_ancestor_has': {'name_equals': 'master_agent'}},
]},
evaluation_name='delegated_correctly',
),
# No circular delegation
HasMatchingSpan(
query={'not_': {'and_': [
{'name_contains': 'agent'},
{'some_descendant_has': {'name_contains': 'agent'}},
{'some_ancestor_has': {'name_contains': 'agent'}},
]}},
evaluation_name='no_circular_delegation',
),
]
Tool Usage Patterns
Verify intelligent tool selection:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Search ran within the answering flow (a search span nested under an 'answer' span)
HasMatchingSpan(
query={'and_': [
{'name_contains': 'search'},
{'some_ancestor_has': {'name_contains': 'answer'}},
]},
evaluation_name='searched_before_answering',
),
# Limited tool calls (no loops)
HasMatchingSpan(
query={'and_': [
{'name_contains': 'tool'},
{'max_child_count': 5},
]},
evaluation_name='reasonable_tool_usage',
),
]
Best Practices
- Start Simple: Begin with basic name queries, add complexity as needed
- Use Descriptive Names: Name your spans well in your application code
- Test Queries: Verify queries work before running full evaluations
- Combine with Other Evaluators: Use span checks alongside output validation (see the sketch after this list)
- Document Expectations: Comment why specific spans should/shouldn't exist
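For example, a single dataset can mix a behavioral span check with an output check; a minimal sketch, assuming the built-in Contains evaluator and illustrative span/case names:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains, HasMatchingSpan
dataset = Dataset(
    cases=[Case(inputs='What is the capital of France?')],
    evaluators=[
        # Behavior: the search tool must have been invoked
        HasMatchingSpan(
            query={'name_contains': 'search_tool'},
            evaluation_name='used_search',
        ),
        # Output: the answer must mention Paris
        Contains(value='Paris'),
    ],
)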
Next Steps
- Logfire Integration - Set up Logfire for span capture
- Custom Evaluators - Write advanced span analysis
- Built-in Evaluators - Other evaluator types