tau-bench

tau-bench (Tool-Agent-User Benchmark) is a benchmark introduced by Yao et al. (2024) from Princeton/Sierra for evaluating language agents in dynamic, multi-turn conversations where agents must interact with simulated users while using domain-specific API tools and adhering to policy guidelines. It addresses a critical gap in existing benchmarks by testing the three-way interaction between tools, agents, and users in realistic customer service scenarios.

Motivation

Existing agent benchmarks test tool use or reasoning in isolation, but real-world deployment requires agents to simultaneously:

Interact seamlessly with humans over long conversation horizons
Accurately adhere to complex domain-specific policies and rules
Maintain consistency and reliability across millions of interactions

tau-bench addresses all three requirements through its Tool-Agent-User framework.

Framework Architecture

The benchmark is modeled as a partially observable Markov decision process (POMDP) where:

<latex> S = S_{db} \otimes S_{user} </latex>

The state combines hidden database states and user states. Agent actions span two spaces:

where A_db represents database API calls and A_user represents communications with the user. The agent cannot directly observe the database state and must gather information incrementally through tool calls and user interaction.

Key components:

User Simulator: Powered by language models with randomized personas, generating realistic multi-turn dialogues tied to hidden task instructions
Tool APIs: Python functions for database access (order lookups, modifications, etc.)
Policy Documents: Markdown-formatted domain rules the agent must follow
Hidden Goal State: Annotated ground-truth database state for evaluation

Domains

tau-bench includes two realistic customer service domains:

tau-retail: Retail customer service involving:

Inventory checks and product searches
Order modifications and cancellations
Refund processing with eligibility rules
Policy-constrained actions (e.g., return windows, exchange policies)

tau-airline: Airline customer service involving:

Flight bookings and reservation changes
Cancellation processing with fare restrictions
Seat assignments and upgrades
Compliance with overbooking policies and fare class rules

Each domain includes JSON databases, custom Python APIs, Markdown policy documents, and JSON task instances with ground-truth annotations.

The pass^k Metric

tau-bench introduces the pass^k metric to measure agent reliability – the probability of succeeding on ALL k independent trials of the same task:

<latex> \text{pass}^k = \mathbb{E}\left[ \prod_{i=1}^{k} \frac{c_i}{n} \right] </latex>

where c_i is the number of successes in trial i and n is the total number of tasks. Unlike pass@k (success in at least one of k trials), pass^k emphasizes consistency:

pass^1 equals the expected success rate E[r]
Higher k values exponentially penalize inconsistency
An agent with 50% pass^1 but variable behavior may have pass^8 below 1%

This metric is critical because real-world deployment requires agents to handle millions of conversations reliably.

Evaluation Method

Success is determined deterministically by comparing the final database state against the annotated goal state:

The conversation runs to completion (agent resolves the user intent or fails)
The final state of the database (orders, reservations, etc.) is extracted
This state is compared field-by-field against the ground-truth annotation
A task succeeds only if the database matches exactly

This approach is robust to dialogue variation – it does not matter how the agent reached the outcome, only that the final state is correct and policy-compliant.

Key Results

Model	Strategy	Retail pass	1	Airline pass	1
Claude 3.5 Sonnet	Tool Calling	0.692	0.460	0.462	0.225
GPT-4o	Tool Calling	0.604	0.420	0.491	0.200
GPT-4o	Act	–	0.365	–	0.140
GPT-4o	ReAct	–	0.325	–	0.160
GPT-4o-mini	Tool Calling	–	0.225	–	0.100

Key findings:

Even the best agents (GPT-4o) succeed on fewer than 50% of tasks
Reliability drops sharply: pass^8 < 25% in retail for all models
Native function calling outperforms ReAct and Act strategies
Claude 3.5 Sonnet shows strongest overall performance

Code Example

# tau-bench evaluation loop (simplified)
from tau_bench.envs import RetailEnv, AirlineEnv
from tau_bench.agents import ToolCallingAgent
 
def evaluate_pass_k(agent, env_class, tasks, k=4):
    task_results = {task["id"]: [] for task in tasks}
 
    for trial in range(k):
        for task in tasks:
            env = env_class(task=task)
            agent.reset()
            observation = env.reset()
            while not env.done:
                action = agent.act(observation)
                observation = env.step(action)
 
            # Compare final database state to ground truth
            success = env.compare_db_state(task["goal_state"])
            task_results[task["id"]].append(success)
 
    # Compute pass^k: product of per-trial success rates
    pass_k = 1.0
    for trial in range(k):
        trial_rate = sum(
            results[trial] for results in task_results.values()
        ) / len(tasks)
        pass_k *= trial_rate
    return pass_k
 
agent = ToolCallingAgent(model="gpt-4o")
tasks = load_tasks("retail")
reliability = evaluate_pass_k(agent, RetailEnv, tasks, k=4)
print(f"Retail pass^4: {reliability:.3f}")

Error Taxonomy

Failures decompose into three categories:

Reasoning errors: Incorrect tool selection, wrong API parameters, or flawed multi-step logic
Communication failures: Misaligned responses to user, asking irrelevant questions, or failing to confirm actions
Policy violations: Performing actions that violate domain rules (e.g., processing a refund outside the return window)

Extensions: tau-squared-bench

The follow-up tau-squared-bench adds:

A telecom domain focusing on troubleshooting scenarios
Bug fixes and improved evaluation
A dual-control environment for more complex agent-user dynamics

AI Agent Knowledge Base

Sidebar

Table of Contents

tau-bench

Motivation

Framework Architecture

Domains

The pass^k Metric

Evaluation Method

Key Results

Code Example

Error Taxonomy

Extensions: tau-squared-bench

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

tau-bench

Motivation

Framework Architecture

Domains

The pass^k Metric

Evaluation Method

Key Results

Code Example

Error Taxonomy

Extensions: tau-squared-bench

References

See Also

Page Tools