tau-bench (Tool-Agent-User Benchmark) is a benchmark introduced by Yao et al. (2024) from Princeton and Sierra for evaluating language agents in dynamic, multi-turn conversations where agents must interact with simulated users while using domain-specific API tools and adhering to policy guidelines. It addresses a critical gap in existing benchmarks by testing the three-way interaction between tools, agents, and users in realistic customer service scenarios.
Existing agent benchmarks test tool use or reasoning in isolation, but real-world deployment requires agents to simultaneously:

- use domain-specific API tools to read and modify real state (orders, reservations, user records)
- converse with users over multiple turns to gather the information those tool calls need
- adhere to domain policy guidelines throughout the conversation

tau-bench addresses all three requirements through its Tool-Agent-User framework.
The benchmark is modeled as a partially observable Markov decision process (POMDP) where the state combines hidden database states and user states. Agent actions span database API calls and communications with the user. The agent cannot directly observe the database state and must gather information incrementally through tool calls and user interaction.
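To make the formulation concrete, the sketch below casts the POMDP in Python. All type names here (`HiddenState`, `ToolCall`, `Message`) are illustrative assumptions, not tau-bench's actual classes; the point is that the action space mixes API calls with user-facing messages, while the true database and user state stay hidden behind string observations.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class HiddenState:
    """The full POMDP state; the agent never observes this directly."""
    database: dict    # e.g., orders, reservations, user records
    user_intent: str  # what the simulated user actually wants

@dataclass
class ToolCall:
    """Action type 1: invoke a domain API (may read or write the database)."""
    name: str
    arguments: dict

@dataclass
class Message:
    """Action type 2: send a natural-language message to the user."""
    text: str

Action = Union[ToolCall, Message]

def observe(state: HiddenState, action: Action) -> str:
    """The agent sees only observations: tool results or user utterances."""
    if isinstance(action, ToolCall):
        return f"(stub) result of calling {action.name}"
    return "(stub) simulated user's reply"
```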
Key components:

- Databases and domain-specific API tools the agent calls to read and modify state
- A policy document encoding the rules the agent must follow
- An LLM-simulated user who converses with the agent in natural language
- Tasks annotated with a goal database state, so success can be checked automatically (a hypothetical task shape is sketched below)
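The evaluation code later in this article reads `task["id"]` and `task["goal_state"]`, which implies an annotation shaped roughly like the following; every field beyond those two is a hypothetical example, not the real schema.

```python
# Hypothetical tau-bench task annotation; only "id" and "goal_state" are
# implied by the evaluation loop below, the rest is illustrative.
task = {
    "id": "retail_042",
    "instruction": (
        "You are Ava Chen. You want to cancel pending order #W1234 "
        "and update your delivery address."
    ),
    "goal_state": {
        # Expected database contents after a correct resolution
        "orders": {"#W1234": {"status": "cancelled"}},
        "users": {"ava_chen": {"address": "12 Pine St, Portland, OR"}},
    },
}
```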
tau-bench includes two realistic customer service domains:

- tau-retail: Retail customer service involving order cancellations and modifications, returns and exchanges of delivered items, and updates to user payment or delivery details
- tau-airline: Airline customer service involving booking, changing, and cancelling flight reservations under fare-class, baggage, and membership rules
tau-bench introduces the pass^k metric to measure agent reliability: the probability that an agent succeeds on all k independent trials of the same task. This metric is critical because real-world deployment requires agents to handle millions of similar conversations consistently, not just succeed once.
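Concretely, if a task is attempted n times and succeeds in c of them, the standard unbiased estimate of "all k sampled trials succeed" is C(c, k) / C(n, k), averaged over tasks. The snippet below is a sketch of that computation, not tau-bench's own code:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that k trials sampled from n recorded
    trials (c of them successful) are ALL successes: C(c, k) / C(n, k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# A task solved in 6 of 8 recorded trials is still unreliable at k=4:
print(pass_hat_k(n=8, c=6, k=4))  # 0.214...
```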
| Model | Strategy | Retail pass^1 | Airline pass^1 | Retail pass^4 | Airline pass^4 |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | Tool Calling | 0.692 | 0.460 | 0.462 | 0.225 |
| GPT-4o | Tool Calling | 0.604 | 0.420 | 0.491 | 0.200 |
| GPT-4o | Act | n/a | 0.365 | n/a | 0.140 |
| GPT-4o | ReAct | n/a | 0.325 | n/a | 0.160 |
| GPT-4o-mini | Tool Calling | n/a | 0.225 | n/a | 0.100 |
Key findings:

- Even the strongest agent (Claude 3.5 Sonnet with native tool calling) solves only 69.2% of retail tasks and 46.0% of airline tasks on a single trial.
- Reliability collapses under repetition: every pass^4 score is far below the corresponding pass^1 (e.g., 0.462 vs. 0.692 on retail for Claude 3.5 Sonnet).
- Native tool calling outperforms Act and ReAct prompting strategies on the airline domain.
- Smaller models fall further behind: GPT-4o-mini reaches only 0.225 airline pass^1.
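One further observation falls out of the table: if trials were independent, pass^4 would be close to (pass^1)^4. The measured values are much higher, which means outcomes cluster by task: some tasks succeed consistently while others fail consistently. The quick check below uses only the numbers from the table above:

```python
# Compare measured pass^4 against the (pass^1)**4 independence baseline.
results = {
    "Claude 3.5 Sonnet / retail":  (0.692, 0.462),
    "Claude 3.5 Sonnet / airline": (0.460, 0.225),
    "GPT-4o / retail":             (0.604, 0.491),
}
for name, (p1, p4) in results.items():
    print(f"{name}: independent baseline {p1 ** 4:.3f}, measured {p4:.3f}")
# Claude retail: baseline 0.229 vs measured 0.462 -> failures are
# concentrated on a hard subset of tasks, not spread randomly.
```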
```python
# tau-bench evaluation loop (simplified)
from tau_bench.envs import RetailEnv, AirlineEnv
from tau_bench.agents import ToolCallingAgent

def evaluate_pass_k(agent, env_class, tasks, k=4):
    """Estimate pass^k: the fraction of tasks succeeding in ALL k trials."""
    task_results = {task["id"]: [] for task in tasks}

    for trial in range(k):
        for task in tasks:
            env = env_class(task=task)
            agent.reset()
            observation = env.reset()
            # Run one full conversation: the agent alternates tool calls
            # and user messages until the episode terminates.
            while not env.done:
                action = agent.act(observation)
                observation = env.step(action)
            # Success = final database state matches the goal annotation.
            success = env.compare_db_state(task["goal_state"])
            task_results[task["id"]].append(success)

    # pass^k: a task counts only if every one of its k trials succeeded.
    solved_every_trial = sum(
        all(results) for results in task_results.values()
    )
    return solved_every_trial / len(tasks)

agent = ToolCallingAgent(model="gpt-4o")
tasks = load_tasks("retail")  # task-loading helper, as in the original sketch
reliability = evaluate_pass_k(agent, RetailEnv, tasks, k=4)
print(f"Retail pass^4: {reliability:.3f}")
```
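Note the design choice in `compare_db_state`: success is judged by whether the final database state matches the task's goal annotation, not by grading the conversation transcript. That makes evaluation objective and fully automatic, but it also means an agent that communicates well while executing the wrong writes still counts as a failure.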
Failures decompose into three categories: executing actions with wrong arguments or on the wrong records, providing wrong information to the user, and resolving tasks only partially, leaving required actions unexecuted.
The follow-up tau-squared-bench adds a telecom domain and, more importantly, dual-control tasks: the simulated user can also execute actions (for example, on their own device), so the agent must coordinate with and instruct the user rather than acting on the database alone.
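A heavily hedged sketch of what dual control changes in the evaluation loop; every name here (`user_sim`, the `actor` keyword) is hypothetical rather than tau-squared-bench's actual API:

```python
# Hypothetical dual-control episode, for illustration only.
# Unlike the single-control loop above, the simulated user can also
# mutate the environment, so the agent's job includes instructing the
# user through steps it cannot perform itself.
def dual_control_episode(agent, user_sim, env):
    observation = env.reset()
    while not env.done:
        agent_action = agent.act(observation)        # tool call or message
        observation = env.step(agent_action, actor="agent")
        user_action = user_sim.act(observation)      # reply or device action
        observation = env.step(user_action, actor="user")
    return env.success()
```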