Input
output (list): The model-generated list of tool calls.
expectedOutput (list): The reference list of expected tool calls (JSON-formatted objects with tool name and arguments).
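The two inputs can be illustrated as lists of JSON-formatted tool-call objects. The following sketch is hypothetical: the exact field names (`name`, `arguments`) and tool names are assumptions for illustration, not the evaluator's required schema.

```python
# Hypothetical model output: a list of tool calls, each a JSON-style object.
# Field and tool names are illustrative, not a documented schema.
output = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "send_email", "arguments": {"to": "a@example.com", "subject": "Forecast"}},
]

# The reference list of expected tool calls, in the same shape.
expected_output = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "send_email", "arguments": {"to": "a@example.com", "subject": "Forecast"}},
]
```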
Output
Result (int): Binary value (1 = perfect match, 0 = any mismatch).
Reasoning (str): Detailed explanation of the matching process and any mismatches detected.
How Is It Calculated?
The evaluator compares the model-generated tool calls with the expected tool calls by validating both structure and content. It checks that tool names match and that all parameters conform to the expected schema. The comparison supports nested and flexible matching logic to handle complex tool-calling scenarios.
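The per-call comparison described above can be sketched as a small helper: a call matches when the tool names are identical and every expected argument is present with an equal value. This is a minimal illustrative sketch, not the evaluator's actual implementation; the function name and the decision to ignore extra (unexpected) arguments are assumptions.

```python
def calls_match(actual: dict, expected: dict) -> bool:
    """Illustrative sketch of per-call matching (not the real implementation).

    A call matches when the tool name is identical and every expected
    argument appears in the actual call with an equal value. Whether extra,
    unexpected arguments should cause a mismatch is an assumption here:
    this sketch ignores them.
    """
    if actual.get("name") != expected.get("name"):
        return False
    expected_args = expected.get("arguments", {})
    actual_args = actual.get("arguments", {})
    return all(actual_args.get(k) == v for k, v in expected_args.items())
```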
The evaluator supports the following matching constructs:
inAnyOrder: Matches all specified tool calls regardless of order, using a greedy matching strategy.
anyOne: Requires exactly one branch among multiple alternatives to match (OR logic).
These constructs can be nested within each other to represent complex expectations. If all required tool calls match successfully, the evaluation passes. Any mismatch results in failure.
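The nesting of these constructs can be sketched with a toy matcher. Everything below is an assumption for illustration: the JSON shape of the expectation spec, the helper names, and the choice to ignore extra unmatched model calls are all hypothetical, but the greedy inAnyOrder consumption and the OR semantics of anyOne follow the description above.

```python
def _leaf_match(expected: dict, actual: dict) -> bool:
    """A single expected call matches a single actual call (illustrative)."""
    if expected.get("name") != actual.get("name"):
        return False
    return all(actual.get("arguments", {}).get(k) == v
               for k, v in expected.get("arguments", {}).items())

def _item_match(item: dict, actual: dict) -> bool:
    """anyOne: any one branch matching suffices (OR logic); otherwise
    treat the item as a plain expected call."""
    if "anyOne" in item:
        return any(_item_match(branch, actual) for branch in item["anyOne"])
    return _leaf_match(item, actual)

def evaluate(spec: dict, calls: list) -> int:
    """Toy evaluator for a top-level inAnyOrder spec.

    Greedily consumes one actual call per expected item, in spec order,
    taking the first remaining call that matches. Returns 1 when every
    item matched, else 0. (Whether extra unmatched calls should fail the
    evaluation is an assumption; this sketch ignores them.)
    """
    remaining = list(calls)
    for item in spec["inAnyOrder"]:
        for i, call in enumerate(remaining):
            if _item_match(item, call):
                del remaining[i]
                break
        else:
            return 0  # some required item never matched
    return 1

# A hypothetical nested expectation: one fixed call plus an anyOne
# alternative, all matched in any order.
spec = {
    "inAnyOrder": [
        {"name": "search", "arguments": {"query": "weather"}},
        {"anyOne": [
            {"name": "get_weather", "arguments": {"city": "Paris"}},
            {"name": "get_forecast", "arguments": {"city": "Paris"}},
        ]},
    ]
}
```

Here `evaluate(spec, calls)` returns 1 even when the model emits the calls in the opposite order, and 0 as soon as any required item cannot be matched.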
Interpretation
- 1: All expected tool calls matched successfully (perfect match).
- 0: One or more tool calls failed to match (mismatch detected).
Use Cases
- Evaluating agent compliance with required tool sequences
- Assessing function-calling tasks that require specific arguments
- Measuring multi-step tool-use workflows end-to-end
- Validating tool call structure and parameter schemas