pydantic_evals.evaluators
Contains
dataclass
Bases: Evaluator[object, object, object]
Check if the output contains the expected output.
For strings, checks if expected_output is a substring of output. For lists/tuples, checks if expected_output is in output. For dicts, checks if all key-value pairs in expected_output are in output.
Note: case_sensitive only applies when both the value and output are strings.
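For illustration, a minimal sketch of wiring Contains into a dataset, assuming the fragment to look for is passed via a value field and using Case and Dataset from pydantic_evals:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains


def ask_capital(country: str) -> str:
    return f'The capital of {country} is Paris.'


# Passes if the task output contains 'paris'; case_sensitive=False makes the
# string comparison case-insensitive.
dataset = Dataset(
    cases=[Case(name='capital', inputs='France')],
    evaluators=[Contains(value='paris', case_sensitive=False)],
)
report = dataset.evaluate_sync(ask_capital)
report.print()
```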
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
Equals
dataclass
Bases: Evaluator[object, object, object]
Check if the output exactly equals the provided value.
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
EqualsExpected
dataclass
Bases: Evaluator[object, object, object]
Check if the output exactly equals the expected output.
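A sketch showing EqualsExpected alongside the fixed-value Equals evaluator above, assuming Equals takes the comparison value via a value field:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Equals, EqualsExpected


def double(x: int) -> int:
    return x * 2


dataset = Dataset(
    cases=[Case(name='double-two', inputs=2, expected_output=4)],
    evaluators=[
        Equals(value=4),   # compares the output to a fixed value
        EqualsExpected(),  # compares the output to each case's expected_output
    ],
)
report = dataset.evaluate_sync(double)
```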
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
HasMatchingSpan
dataclass
Bases: Evaluator[object, object, object]
Check if the span tree contains a span that matches the specified query.
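A hedged sketch, assuming the filter is supplied via a query field that accepts a SpanQuery-style mapping (the name_contains key is an assumption) and that the task is instrumented so spans are recorded; logfire is used here only to create a custom span:

```python
import logfire

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan


def lookup(code: str) -> str:
    with logfire.span('database query'):  # recorded in the span tree
        return {'fr': 'France'}.get(code, 'unknown')


# Passes if any recorded span has 'database' in its name.
dataset = Dataset(
    cases=[Case(name='lookup-fr', inputs='fr', expected_output='France')],
    evaluators=[HasMatchingSpan(query={'name_contains': 'database'})],
)
```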
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
IsInstance
dataclass
Bases: Evaluator[object, object, object]
Check if the output is an instance of a type with the given name.
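For example, a minimal sketch assuming the target type is named via a type_name field:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance


def parse_int(text: str) -> int:
    return int(text)


# Passes if the task output is an instance of a type named 'int'.
dataset = Dataset(
    cases=[Case(name='parse', inputs='42', expected_output=42)],
    evaluators=[IsInstance(type_name='int')],
)
```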
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
LLMJudge
dataclass
Bases: Evaluator[object, object, object]
Judge whether the output of a language model meets the criteria of a provided rubric.
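A sketch assuming the rubric text is passed via a rubric field; the judging model and other options are configurable via the evaluator's other fields:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge


def suggest_recipe(request: str) -> str:
    return 'Layer pasta sheets with tomato sauce, spinach and ricotta; bake for 40 minutes.'


dataset = Dataset(
    cases=[Case(name='veggie', inputs='vegetarian lasagne')],
    evaluators=[
        LLMJudge(rubric='The response is a coherent recipe and contains no meat ingredients.'),
    ],
)
```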
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
MaxDuration
dataclass
Bases: Evaluator[object, object, object]
Check if the execution time is under the specified maximum.
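A sketch assuming the limit is given via a seconds field that accepts a float or a timedelta:

```python
from datetime import timedelta

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import MaxDuration


def ping(_: str) -> str:
    return 'pong'


# Fails any case whose task execution takes longer than 500ms.
dataset = Dataset(
    cases=[Case(name='fast-path', inputs='ping', expected_output='pong')],
    evaluators=[MaxDuration(seconds=timedelta(milliseconds=500))],
)
```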
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
Python
dataclass
Bases: Evaluator[object, object, object]
The output of this evaluator is the result of evaluating the provided Python expression.
WARNING: this evaluator runs arbitrary Python code, so you should NEVER use it with untrusted inputs.
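A sketch assuming the expression is supplied via an expression field and is evaluated with the EvaluatorContext available as ctx; only ever use hard-coded, trusted expressions:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Python


def shout(text: str) -> str:
    return text.upper()


# The expression below is trusted and hard-coded; never build it from user input.
dataset = Dataset(
    cases=[Case(name='shout', inputs='hello', expected_output='HELLO')],
    evaluators=[Python(expression='ctx.output == ctx.expected_output')],
)
```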
Source code in pydantic_evals/pydantic_evals/evaluators/common.py
EvaluatorContext
dataclass
Bases: Generic[InputsT, OutputT, MetadataT]
Context for evaluating a task execution.
An instance of this class is the sole input to all Evaluators. It contains all the information needed to evaluate the task execution, including inputs, outputs, metadata, and telemetry data.
Evaluators use this context to access the task inputs, actual output, expected output, and other information when evaluating the result of the task execution.
Example:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Use the context to access task inputs, outputs, and expected outputs
        return ctx.output == ctx.expected_output
```
Source code in pydantic_evals/pydantic_evals/evaluators/context.py
inputs
instance-attribute
inputs: InputsT
The inputs provided to the task for this case.
metadata
instance-attribute
metadata: MetadataT | None
Metadata associated with the case, if provided. May be None if no metadata was specified.
expected_output
instance-attribute
expected_output: OutputT | None
The expected output for the case, if provided. May be None if no expected output was specified.
output
instance-attribute
output: OutputT
The actual output produced by the task for this case.
attributes
instance-attribute
Attributes associated with the task run for this case.
These can be set by calling pydantic_evals.dataset.set_eval_attribute in any code executed during the evaluation task.
metrics
instance-attribute
Metrics associated with the task run for this case.
These can be set by calling pydantic_evals.dataset.increment_eval_metric in any code executed during the evaluation task.
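For example, a sketch of a task that records an attribute and a metric, and an evaluator that reads them back from the context (assuming ctx.metrics behaves like a mapping):

```python
from dataclasses import dataclass

from pydantic_evals.dataset import increment_eval_metric, set_eval_attribute
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


async def answer(question: str) -> str:
    set_eval_attribute('cache_hit', False)   # ends up in ctx.attributes
    increment_eval_metric('db_queries', 1)   # accumulated into ctx.metrics
    return f'Answer to: {question}'


@dataclass
class FewQueries(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Pass if the task needed at most two database queries.
        return ctx.metrics.get('db_queries', 0) <= 2
```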
span_tree
property
span_tree: SpanTree
Get the SpanTree for this task execution.
The span tree is a graph where each node corresponds to an OpenTelemetry span recorded during the task execution, including timing information and any custom spans created during execution.
Returns:

Type | Description
---|---
SpanTree | The span tree for the task execution.

Raises:

Type | Description
---|---
SpanTreeRecordingError | If spans were not captured during execution of the task, e.g. due to not having the necessary dependencies installed.
EvaluationReason
dataclass
The result of running an evaluator with an optional explanation.
Contains a scalar value and an optional "reason" explaining the value.
Parameters:

Name | Type | Description | Default
---|---|---|---
value | EvaluationScalar | The scalar result of the evaluation (boolean, integer, float, or string). | required
reason | str \| None | An optional explanation of the evaluation result. | None
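For example, a custom evaluator can attach a reason to its boolean result (the EvaluationReason import path is assumed from the source location noted below):

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from pydantic_evals.evaluators.evaluator import EvaluationReason


@dataclass
class NonEmptyOutput(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        text = str(ctx.output)
        if text.strip():
            return EvaluationReason(value=True)
        return EvaluationReason(value=False, reason='the task produced an empty output')
```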
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
EvaluationResult
dataclass
Bases: Generic[EvaluationScalarT]
The details of an individual evaluation result.
Contains the name, value, reason, and source evaluator for a single evaluation.
Parameters:

Name | Type | Description | Default
---|---|---|---
name | str | The name of the evaluation. | required
value | EvaluationScalarT | The scalar result of the evaluation. | required
reason | str \| None | An optional explanation of the evaluation result. | required
source | Evaluator | The evaluator that produced this result. | required
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
downcast
downcast(
*value_types: type[T],
) -> EvaluationResult[T] | None
Attempt to downcast this result to a more specific type.
Parameters:

Name | Type | Description | Default
---|---|---|---
*value_types | type[T] | The types to check the value against. | ()

Returns:

Type | Description
---|---
EvaluationResult[T] \| None | A downcast version of this result if the value is an instance of one of the given types, otherwise None.
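For example, a sketch of filtering a collection of results down to the boolean ones; the results argument is hypothetical, e.g. gathered from an evaluation report:

```python
from pydantic_evals.evaluators.evaluator import EvaluationResult


def passed_checks(results: list[EvaluationResult]) -> list[str]:
    """Return the names of boolean evaluations that evaluated to True."""
    names: list[str] = []
    for result in results:
        as_bool = result.downcast(bool)  # EvaluationResult[bool] | None
        if as_bool is not None and as_bool.value:
            names.append(as_bool.name)
    return names
```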
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
Evaluator
dataclass
Bases: Generic[InputsT, OutputT, MetadataT]
Base class for all evaluators.
Evaluators can assess the performance of a task in a variety of ways, as a function of the EvaluatorContext.
Subclasses must implement the evaluate method. Note it can be defined with either def or async def.
Example:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.output == ctx.expected_output
```
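An asynchronous variant is defined the same way; a sketch, with the async helper standing in for something like an external moderation call:

```python
import asyncio
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


async def looks_polite(text: str) -> bool:
    # Stand-in for a real async check, e.g. a moderation API call.
    await asyncio.sleep(0)
    return 'please' in text.lower()


@dataclass
class PoliteOutput(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext) -> bool:
        return await looks_polite(str(ctx.output))
```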
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
name
classmethod
name() -> str
Return the 'name' of this Evaluator to use during serialization.
Returns:

Type | Description
---|---
str | The name of the Evaluator, which is typically the class name.
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
evaluate
abstractmethod
evaluate(
ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput | Awaitable[EvaluatorOutput]
Evaluate the task output in the given context.
This is the main evaluation method that subclasses must implement. It can be either synchronous or asynchronous, returning either an EvaluatorOutput directly or an Awaitable[EvaluatorOutput].
Parameters:

Name | Type | Description | Default
---|---|---|---
ctx | EvaluatorContext[InputsT, OutputT, MetadataT] | The context containing the inputs, outputs, and metadata for evaluation. | required

Returns:

Type | Description
---|---
EvaluatorOutput \| Awaitable[EvaluatorOutput] | The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those. Can be returned either synchronously or as an awaitable for asynchronous evaluation.
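For example, a sketch of an evaluator returning several named results at once, one of them with a reason attached (the EvaluationReason import path is assumed from the source location noted below):

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from pydantic_evals.evaluators.evaluator import EvaluationReason


@dataclass
class OutputChecks(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | EvaluationReason]:
        text = str(ctx.output)
        return {
            'non_empty': bool(text.strip()),
            'not_too_long': EvaluationReason(
                value=len(text) <= 500,
                reason=f'output was {len(text)} characters long',
            ),
        }
```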
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
evaluate_sync
evaluate_sync(
ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput
Run the evaluator synchronously, handling both sync and async implementations.
This method ensures synchronous execution by running any async evaluate implementation to completion using run_until_complete.
Parameters:

Name | Type | Description | Default
---|---|---|---
ctx | EvaluatorContext[InputsT, OutputT, MetadataT] | The context containing the inputs, outputs, and metadata for evaluation. | required

Returns:

Type | Description
---|---
EvaluatorOutput | The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
evaluate_async
async
evaluate_async(
ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput
Run the evaluator asynchronously, handling both sync and async implementations.
This method ensures asynchronous execution by properly awaiting any async evaluate implementation. For synchronous implementations, it returns the result directly.
Parameters:

Name | Type | Description | Default
---|---|---|---
ctx | EvaluatorContext[InputsT, OutputT, MetadataT] | The context containing the inputs, outputs, and metadata for evaluation. | required

Returns:

Type | Description
---|---
EvaluatorOutput | The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
serialize
serialize(info: SerializationInfo) -> Any
Serialize this Evaluator to a JSON-serializable form.
Returns:

Type | Description
---|---
Any | A JSON-serializable representation of this evaluator as an EvaluatorSpec.
Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
EvaluatorOutput
module-attribute
```python
EvaluatorOutput = Union[
    EvaluationScalar,
    EvaluationReason,
    Mapping[str, Union[EvaluationScalar, EvaluationReason]],
]
```
Type for the output of an evaluator, which can be a scalar, an EvaluationReason, or a mapping of names to either.