Input
output (list): The model-generated list of tool calls.
expectedOutput (list): The reference list of expected tool calls (JSON-formatted objects with tool name and arguments).
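The two inputs can be illustrated as lists of JSON-formatted tool-call objects. The following sketch is hypothetical: the exact field names (`name`, `arguments`) and tool names are assumptions for illustration, not the evaluator's required schema.

```python
# Hypothetical model output: a list of tool calls, each a JSON-style object.
# Field and tool names are illustrative, not a documented schema.
output = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "send_email", "arguments": {"to": "a@example.com", "subject": "Forecast"}},
]

# The reference list of expected tool calls, in the same shape.
expected_output = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "send_email", "arguments": {"to": "a@example.com", "subject": "Forecast"}},
]
```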
Output
Result (int): Binary value (1 = perfect match, 0 = any mismatch).
Reasoning (str): Detailed explanation of the matching process and any mismatches detected.
How Is It Calculated?
The evaluator compares the model-generated tool calls with the expected tool calls by validating both structure and content. It checks that tool names match and that all parameters conform to the expected schema. The comparison supports nested and flexible matching logic to handle complex tool-calling scenarios.
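The per-call comparison described above can be sketched as a small helper: a call matches when the tool names are identical and every expected argument is present with an equal value. This is a minimal illustrative sketch, not the evaluator's actual implementation; the function name and the decision to ignore extra (unexpected) arguments are assumptions.

```python
def calls_match(actual: dict, expected: dict) -> bool:
    """Illustrative sketch of per-call matching (not the real implementation).

    A call matches when the tool name is identical and every expected
    argument appears in the actual call with an equal value. Whether extra,
    unexpected arguments should cause a mismatch is an assumption here:
    this sketch ignores them.
    """
    if actual.get("name") != expected.get("name"):
        return False
    expected_args = expected.get("arguments", {})
    actual_args = actual.get("arguments", {})
    return all(actual_args.get(k) == v for k, v in expected_args.items())
```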
The evaluator supports the following matching constructs:
inAnyOrder: Matches all specified tool calls regardless of order, using a greedy matching strategy.
anyOne: Requires exactly one branch among multiple alternatives to match (OR logic).
These constructs can be nested within each other to represent complex expectations. If all required tool calls match successfully, the evaluation passes. Any mismatch results in failure.
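The nesting of these constructs can be sketched with a toy matcher. Everything below is an assumption for illustration: the JSON shape of the expectation spec, the helper names, and the choice to ignore extra unmatched model calls are all hypothetical, but the greedy inAnyOrder consumption and the OR semantics of anyOne follow the description above.

```python
def _leaf_match(expected: dict, actual: dict) -> bool:
    """A single expected call matches a single actual call (illustrative)."""
    if expected.get("name") != actual.get("name"):
        return False
    return all(actual.get("arguments", {}).get(k) == v
               for k, v in expected.get("arguments", {}).items())

def _item_match(item: dict, actual: dict) -> bool:
    """anyOne: any one branch matching suffices (OR logic); otherwise
    treat the item as a plain expected call."""
    if "anyOne" in item:
        return any(_item_match(branch, actual) for branch in item["anyOne"])
    return _leaf_match(item, actual)

def evaluate(spec: dict, calls: list) -> int:
    """Toy evaluator for a top-level inAnyOrder spec.

    Greedily consumes one actual call per expected item, in spec order,
    taking the first remaining call that matches. Returns 1 when every
    item matched, else 0. (Whether extra unmatched calls should fail the
    evaluation is an assumption; this sketch ignores them.)
    """
    remaining = list(calls)
    for item in spec["inAnyOrder"]:
        for i, call in enumerate(remaining):
            if _item_match(item, call):
                del remaining[i]
                break
        else:
            return 0  # some required item never matched
    return 1

# A hypothetical nested expectation: one fixed call plus an anyOne
# alternative, all matched in any order.
spec = {
    "inAnyOrder": [
        {"name": "search", "arguments": {"query": "weather"}},
        {"anyOne": [
            {"name": "get_weather", "arguments": {"city": "Paris"}},
            {"name": "get_forecast", "arguments": {"city": "Paris"}},
        ]},
    ]
}
```

Here `evaluate(spec, calls)` returns 1 even when the model emits the calls in the opposite order, and 0 as soon as any required item cannot be matched.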
Interpretation
- 1: All expected tool calls matched successfully (perfect match).
- 0: One or more tool calls failed to match (mismatch detected).
Use Cases
- Evaluating agent compliance with required tool sequences
- Assessing function-calling tasks that require specific arguments
- Measuring multi-step tool-use workflows end-to-end
- Validating tool call structure and parameter schemas