Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Code & Software
Safety & Security
Evaluation
Research
Development
Meta
tau-bench (Tool-Agent-User Benchmark) is a benchmark introduced by Yao et al. (2024) from Princeton/Sierra for evaluating language agents in dynamic, multi-turn conversations where agents must interact with simulated users while using domain-specific API tools and adhering to policy guidelines. It addresses a critical gap in existing benchmarks by testing the three-way interaction between tools, agents, and users in realistic customer service scenarios.
Existing agent benchmarks test tool use or reasoning in isolation, but real-world deployment requires agents to simultaneously:
- converse with users over multiple turns to gather the information they need,
- call domain-specific API tools to read and modify state, and
- adhere to domain policy guidelines while doing both.
tau-bench addresses all three requirements through its Tool-Agent-User framework.
The benchmark is modeled as a partially observable Markov decision process (POMDP) where:
<latex> S = S_{db} \otimes S_{user} </latex>
The state combines hidden database states and user states. Agent actions span two spaces:
<latex> A = A_{db} \cup A_{user} </latex>
where A_db represents database API calls and A_user represents communications with the user. The agent cannot directly observe the database state and must gather information incrementally through tool calls and user interaction.
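This action-space split can be sketched as a toy environment. The class and method names below are illustrative, not the actual tau_bench API; the point is that the agent's actions fall into two disjoint kinds, and neither returns the hidden database state directly:

```python
from dataclasses import dataclass, field

@dataclass
class DBCall:
    """Action in A_db: a database API call (e.g. cancel an order)."""
    name: str
    kwargs: dict = field(default_factory=dict)

@dataclass
class UserMessage:
    """Action in A_user: a natural-language message to the simulated user."""
    text: str

class TauStyleEnv:
    """Toy POMDP: the agent only ever sees observations, never S_db or S_user."""
    def __init__(self, db, user):
        self._db = db      # hidden database state S_db
        self._user = user  # hidden user state S_user
        self.done = False

    def step(self, action):
        if isinstance(action, DBCall):
            # Tool call: the observation is the API result, not the full DB
            return self._db.execute(action.name, **action.kwargs)
        if isinstance(action, UserMessage):
            # User turn: the observation is the simulated user's reply
            return self._user.respond(action.text)
        raise TypeError("action must belong to A_db or A_user")
```

Because observations are partial (one API result or one user reply at a time), the agent must interleave both action types to reconstruct enough of the state to act correctly.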
Key components:
- Databases: hidden domain state (users, orders, reservations) that can only be read or modified through API calls
- APIs: domain-specific tools the agent invokes to act on the database
- Policy documents: domain rules the agent must follow when deciding what actions are permitted
- User simulator: an LLM role-playing the user from a task instruction
tau-bench includes two realistic customer service domains:
tau-retail: Retail customer service involving cancelling, modifying, returning, and exchanging orders on behalf of authenticated users.
tau-airline: Airline customer service involving booking, modifying, and cancelling flight reservations subject to fare rules and membership policies.
Each domain includes JSON databases, custom Python APIs, Markdown policy documents, and JSON task instances with ground-truth annotations.
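A task instance ties these artifacts together. The structure below is a hypothetical illustration of what such an instance might contain (all field names and values are invented for this sketch, not the benchmark's actual schema):

```python
# Hypothetical task instance -- field names and values are illustrative only
task = {
    "id": "retail_042",
    # Instruction given to the LLM user simulator, not to the agent
    "instruction": (
        "You are Ava Lopez. You want to exchange the water bottle in your "
        "most recent order for a larger size, and you refuse to share your "
        "payment details twice."
    ),
    # Annotated ground-truth database changes the agent must produce
    "goal_state": {
        "orders": {"#W123": {"items": ["bottle_large"], "status": "exchanged"}},
    },
}
```

The agent never sees the instruction or the goal state; it must elicit the user's intent through conversation and reach the goal state via API calls.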
tau-bench introduces the pass^k metric to measure agent reliability – the probability of succeeding on ALL k independent trials of the same task:
<latex> \text{pass}^k = \mathbb{E}\left[ \prod_{i=1}^{k} \frac{c_i}{n} \right] </latex>
where c_i is the number of successes in trial i and n is the total number of tasks. Unlike pass@k (success in at least one of k trials), pass^k emphasizes consistency:
pass^1 equals the expected success rate E[r]. This metric is critical because real-world deployment requires agents to handle millions of conversations reliably.
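To see why pass^k punishes inconsistency where pass@k hides it, consider a task solved with independent per-trial success probability r. A quick numeric sketch:

```python
# For a single task with independent per-trial success probability r:
#   pass@k = 1 - (1 - r)**k   (succeed on at least one of k trials)
#   pass^k = r**k             (succeed on all k trials)
def pass_at_k(r: float, k: int) -> float:
    return 1 - (1 - r) ** k

def pass_hat_k(r: float, k: int) -> float:
    return r ** k

r, k = 0.6, 4
print(f"pass@{k} = {pass_at_k(r, k):.4f}")   # 0.9744: looks impressive
print(f"pass^{k} = {pass_hat_k(r, k):.4f}")  # 0.1296: reveals unreliability
```

The two metrics agree at k = 1 and diverge rapidly as k grows: an agent that succeeds 60% of the time almost always solves a task in 4 attempts, but almost never solves it 4 times in a row.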
Success is determined deterministically by comparing the final database state against the annotated goal state.
This approach is robust to dialogue variation – it does not matter how the agent reached the outcome, only that the final state is correct and policy-compliant.
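Outcome checking can be sketched as a plain equality test over the annotated tables. This is a simplified sketch assuming the goal annotation lists only the tables expected to change; it is not the benchmark's actual checker:

```python
def check_success(final_db: dict, goal_state: dict) -> bool:
    """Deterministic outcome check: every table annotated in the goal state
    must match the final database exactly. The dialogue path that produced
    the state is irrelevant."""
    return all(final_db.get(table) == expected
               for table, expected in goal_state.items())

# Any dialogue ending in the annotated state passes; any other state fails.
final_db = {"orders": {"#W1": {"status": "cancelled"}},
            "users": {"u1": {"email": "a@b.c"}}}
goal = {"orders": {"#W1": {"status": "cancelled"}}}
print(check_success(final_db, goal))  # True
```

Because the check is state-based rather than transcript-based, two agents with very different conversational styles score identically as long as they execute the same database writes.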
| Model | Strategy | Retail pass^1 | Airline pass^1 | Retail pass^4 | Airline pass^4 |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | Tool Calling | 0.692 | 0.460 | 0.462 | 0.225 |
| GPT-4o | Tool Calling | 0.604 | 0.420 | 0.491 | 0.200 |
| GPT-4o | Act | – | 0.365 | – | 0.140 |
| GPT-4o | ReAct | – | 0.325 | – | 0.160 |
| GPT-4o-mini | Tool Calling | – | 0.225 | – | 0.100 |
Key findings:
- Even the best model (Claude 3.5 Sonnet) succeeds on fewer than 70% of retail tasks and fewer than half of airline tasks in a single trial.
- Reliability degrades sharply under repetition: Claude 3.5 Sonnet drops from 0.692 (retail pass^1) to 0.462 (retail pass^4).
- Native tool calling outperforms Act and ReAct prompting for the same underlying model (GPT-4o).
- The airline domain, with its more complex policies, is substantially harder than retail for every model tested.
```python
# tau-bench evaluation loop (simplified)
from tau_bench.envs import RetailEnv, AirlineEnv
from tau_bench.agents import ToolCallingAgent

def evaluate_pass_k(agent, env_class, tasks, k=4):
    task_results = {task["id"]: [] for task in tasks}
    for trial in range(k):
        for task in tasks:
            env = env_class(task=task)
            agent.reset()
            observation = env.reset()
            while not env.done:
                action = agent.act(observation)
                observation = env.step(action)
            # Compare final database state to ground truth
            success = env.compare_db_state(task["goal_state"])
            task_results[task["id"]].append(success)
    # Compute pass^k: product of per-trial success rates
    pass_k = 1.0
    for trial in range(k):
        trial_rate = sum(
            results[trial] for results in task_results.values()
        ) / len(tasks)
        pass_k *= trial_rate
    return pass_k

agent = ToolCallingAgent(model="gpt-4o")
tasks = load_tasks("retail")
reliability = evaluate_pass_k(agent, RetailEnv, tasks, k=4)
print(f"Retail pass^4: {reliability:.3f}")
```
Failures decompose into three categories:
The follow-up tau-squared-bench (τ²-bench) adds a telecom domain and dual-control environments in which the simulated user can also take actions on their own device, so the agent must coach the user through steps rather than execute every action itself.