====== tau-bench ======
tau-bench (Tool-Agent-User Benchmark) is a benchmark introduced by Yao et al. (2024) from Princeton/Sierra for evaluating language agents in dynamic, multi-turn conversations where agents must interact with simulated users while using domain-specific API tools and adhering to policy guidelines. It addresses a critical gap in existing benchmarks by testing the three-way interaction between tools, agents, and users in realistic customer service scenarios.
===== Motivation =====
Existing agent benchmarks test tool use or reasoning in isolation, but real-world deployment requires agents to simultaneously:
- Interact seamlessly with humans over long conversation horizons
- Accurately adhere to complex domain-specific policies and rules
- Maintain consistency and reliability across millions of interactions
tau-bench addresses all three requirements through its Tool-Agent-User framework.
===== Framework Architecture =====
The benchmark is modeled as a **partially observable Markov decision process (POMDP)** where:
S = S_{db} \otimes S_{user}
The state combines hidden database states and user states. Agent actions span two spaces:
A = A_{db} \cup A_{user}
where ''A_db'' represents database API calls and ''A_user'' represents communications with the user. The agent cannot directly observe the database state and must gather information incrementally through tool calls and user interaction.
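The split action space can be sketched as a tagged union, with a controller routing each agent action either to the database API or to the simulated user. The class and function names below are illustrative only, not the benchmark's actual types:

```python
from dataclasses import dataclass
from typing import Union

# A_db: a database API call (tool invocation) -- illustrative names
@dataclass
class DbAction:
    tool: str      # e.g. "get_order"
    kwargs: dict   # tool arguments

# A_user: a natural-language message to the simulated user
@dataclass
class UserAction:
    message: str

Action = Union[DbAction, UserAction]

def route(action: Action) -> str:
    """Dispatch an action to the right subsystem (sketch only)."""
    if isinstance(action, DbAction):
        return f"db:{action.tool}"
    return "user"

# The agent only ever sees the observations these calls return;
# the underlying database state S_db stays hidden (hence POMDP).
print(route(DbAction("get_order", {"order_id": "W123"})))  # db:get_order
print(route(UserAction("Your refund has been issued.")))   # user
```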
Key components:
* **User Simulator**: Powered by language models with randomized personas, generating realistic multi-turn dialogues tied to hidden task instructions
* **Tool APIs**: Python functions for database access (order lookups, modifications, etc.)
* **Policy Documents**: Markdown-formatted domain rules the agent must follow
* **Hidden Goal State**: Annotated ground-truth database state for evaluation
===== Domains =====
tau-bench includes two realistic customer service domains:
**tau-retail**: Retail customer service involving:
* Inventory checks and product searches
* Order modifications and cancellations
* Refund processing with eligibility rules
* Policy-constrained actions (e.g., return windows, exchange policies)
**tau-airline**: Airline customer service involving:
* Flight bookings and reservation changes
* Cancellation processing with fare restrictions
* Seat assignments and upgrades
* Compliance with overbooking policies and fare class rules
Each domain includes JSON databases, custom Python APIs, Markdown policy documents, and JSON task instances with ground-truth annotations.
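A task instance pairs a hidden user instruction with a ground-truth goal state. The layout below is a hypothetical illustration of that shape; the field names are assumptions, not the exact schema shipped with tau-bench:

```python
# Hypothetical task instance -- field names are illustrative only
task = {
    "id": "retail_042",
    "user_instruction": "Cancel order #W123 and request a refund "
                        "to the original payment method.",
    "persona": {"style": "impatient", "detail": "low"},
    "goal_state": {  # annotated ground-truth database state
        "orders": {
            "W123": {"status": "cancelled",
                     "refund": "original_payment"}
        }
    },
}

# The user simulator sees user_instruction and persona; the agent
# sees neither, and is judged only against goal_state.
print(task["goal_state"]["orders"]["W123"]["status"])  # cancelled
```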
===== The pass^k Metric =====
tau-bench introduces the **pass^k metric** to measure agent reliability -- the probability of succeeding on ALL k independent trials of the same task:
\text{pass}^k = \mathbb{E}_{\text{task}}\left[ \binom{c}{k} \Big/ \binom{n}{k} \right]
where ''c'' is the number of successful trials for a given task out of ''n'' total trials, and the expectation averages over all tasks. Unlike pass@k (success in at least one of k trials), pass^k emphasizes **consistency**:
* ''pass^1'' equals the expected success rate E[r]
* Higher k values exponentially penalize inconsistency
* An agent with 50% pass^1 but variable behavior may have pass^8 below 1%
This metric is critical because real-world deployment requires agents to handle millions of conversations reliably.
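pass^k can be estimated without bias by running each task n times, counting the c successful trials, and averaging C(c, k)/C(n, k) over tasks. A minimal sketch showing how the metric separates a consistent agent from an erratic one with the same average success rate:

```python
from math import comb

def pass_hat_k(successes: list, n: int, k: int) -> float:
    """Unbiased pass^k estimator: average over tasks of C(c,k)/C(n,k),
    where c is the number of successful trials out of n for that task."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)

# Two agents, both with average success rate 0.5 (pass^1 = 0.5):
consistent = [8, 0, 8, 0]   # aces half the tasks, always fails the rest
erratic    = [4, 4, 4, 4]   # succeeds half the time on every task

print(pass_hat_k(consistent, n=8, k=1))  # 0.5
print(pass_hat_k(erratic,    n=8, k=1))  # 0.5
print(pass_hat_k(consistent, n=8, k=4))  # 0.5 -- consistency preserved
print(pass_hat_k(erratic,    n=8, k=4))  # 1/70, roughly 0.014
```

The consistent agent keeps its score as k grows; the erratic one collapses, which is exactly the deployment-reliability signal the metric is designed to expose.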
===== Evaluation Method =====
Success is determined **deterministically** by comparing the final database state against the annotated goal state:
- The conversation runs to completion (agent resolves the user intent or fails)
- The final state of the database (orders, reservations, etc.) is extracted
- This state is compared field-by-field against the ground-truth annotation
- A task succeeds only if the database matches exactly
This approach is robust to dialogue variation -- it does not matter how the agent reached the outcome, only that the final state is correct and policy-compliant.
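The field-by-field check can be sketched as a recursive dictionary comparison. This is an illustrative helper, not the repository's actual comparison function, which operates on its own JSON database dumps:

```python
def db_states_match(final: dict, goal: dict) -> bool:
    """Recursively compare a final database state against the annotated
    goal state; any mismatch in any field fails the task (sketch only)."""
    if isinstance(final, dict) and isinstance(goal, dict):
        return (final.keys() == goal.keys() and
                all(db_states_match(final[k], goal[k]) for k in goal))
    return final == goal

goal = {"orders": {"W123": {"status": "cancelled", "refund": 54.99}}}
good = {"orders": {"W123": {"status": "cancelled", "refund": 54.99}}}
bad  = {"orders": {"W123": {"status": "cancelled", "refund": 0.0}}}

print(db_states_match(good, goal))  # True
print(db_states_match(bad, goal))   # False -- one wrong field fails the task
```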
===== Key Results =====
^ Model ^ Strategy ^ Retail pass^1 ^ Airline pass^1 ^ Retail pass^4 ^ Airline pass^4 ^
| Claude 3.5 Sonnet | Tool Calling | **0.692** | **0.460** | **0.462** | **0.225** |
| GPT-4o | Tool Calling | 0.604 | 0.420 | 0.491 | 0.200 |
| GPT-4o | Act | -- | 0.365 | -- | 0.140 |
| GPT-4o | ReAct | -- | 0.325 | -- | 0.160 |
| GPT-4o-mini | Tool Calling | -- | 0.225 | -- | 0.100 |
Key findings:
* Even the best agents succeed on fewer than half of airline tasks (pass^1 < 0.5)
* Reliability drops sharply: pass^8 < 25% in retail for all models
* Native function calling outperforms ReAct and Act strategies
* Claude 3.5 Sonnet shows strongest overall performance
===== Code Example =====
<code python>
# tau-bench evaluation loop (simplified)
from tau_bench.envs import RetailEnv, AirlineEnv
from tau_bench.agents import ToolCallingAgent

def evaluate_pass_k(agent, env_class, tasks, k=4):
    """Run each task k times; pass^k is the fraction of tasks
    on which the agent succeeds in ALL k trials."""
    task_results = {task["id"]: [] for task in tasks}
    for trial in range(k):
        for task in tasks:
            env = env_class(task=task)
            agent.reset()
            observation = env.reset()
            # Run the conversation until the episode terminates
            while not env.done:
                action = agent.act(observation)
                observation = env.step(action)
            # Compare the final database state to the ground truth
            success = env.compare_db_state(task["goal_state"])
            task_results[task["id"]].append(success)
    # pass^k: a task counts only if every one of its k trials succeeded
    return sum(
        all(results) for results in task_results.values()
    ) / len(tasks)

agent = ToolCallingAgent(model="gpt-4o")
tasks = load_tasks("retail")
reliability = evaluate_pass_k(agent, RetailEnv, tasks, k=4)
print(f"Retail pass^4: {reliability:.3f}")
</code>
===== Error Taxonomy =====
Failures decompose into three categories:
* **Reasoning errors**: Incorrect tool selection, wrong API parameters, or flawed multi-step logic
* **Communication failures**: Misaligned responses to user, asking irrelevant questions, or failing to confirm actions
* **Policy violations**: Performing actions that violate domain rules (e.g., processing a refund outside the return window)
===== Extensions: tau-squared-bench =====
The follow-up **tau-squared-bench** adds:
* A **telecom** domain focusing on troubleshooting scenarios
* Bug fixes and improved evaluation
* A dual-control environment for more complex agent-user dynamics
===== References =====
* [[https://arxiv.org/abs/2406.12045|Yao et al. (2024) - tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains]]
* [[https://github.com/sierra-research/tau-bench|Official tau-bench Repository (Sierra Research)]]
* [[https://sierra.ai/blog/tau-bench-shaping-development-evaluation-agents|Sierra Blog: tau-bench Shaping Agent Development]]
===== See Also =====
* [[agentbench|AgentBench]] - Multi-dimensional benchmark for LLM agents across 8 environments
* [[llm_as_judge|LLM-as-a-Judge]] - Automated evaluation using LLMs as evaluators
* [[taskweaver|TaskWeaver]] - Code-first agent framework for task execution