====== tau-bench ======

tau-bench (Tool-Agent-User Benchmark) is a benchmark introduced by Yao et al. (2024) from Princeton/Sierra for evaluating language agents in dynamic, multi-turn conversations where agents must interact with simulated users while using domain-specific API tools and adhering to policy guidelines. It addresses a critical gap in existing benchmarks by testing the three-way interaction between tools, agents, and users in realistic customer service scenarios.

===== Motivation =====

Existing agent benchmarks test tool use or reasoning in isolation, but real-world deployment requires agents to simultaneously:

  - Interact seamlessly with humans over long conversation horizons
  - Accurately adhere to complex domain-specific policies and rules
  - Maintain consistency and reliability across millions of interactions

tau-bench addresses all three requirements through its Tool-Agent-User framework.

===== Framework Architecture =====

The benchmark is modeled as a **partially observable Markov decision process (POMDP)** whose state combines the hidden database state and the user state:

S = S_{db} \otimes S_{user}

Agent actions span two spaces:

A = A_{db} \cup A_{user}

where ''A_db'' represents database API calls and ''A_user'' represents communications with the user. The agent cannot directly observe the database state and must gather information incrementally through tool calls and user interaction.

Key components:

  * **User Simulator**: powered by language models with randomized personas, generating realistic multi-turn dialogues tied to hidden task instructions
  * **Tool APIs**: Python functions for database access (order lookups, modifications, etc.)
  * **Policy Documents**: Markdown-formatted domain rules the agent must follow
  * **Hidden Goal State**: annotated ground-truth database state for evaluation

===== Domains =====

tau-bench includes two realistic customer service domains:

**tau-retail**: retail customer service involving:
  * Inventory checks and product searches
  * Order modifications and cancellations
  * Refund processing with eligibility rules
  * Policy-constrained actions (e.g., return windows, exchange policies)

**tau-airline**: airline customer service involving:
  * Flight bookings and reservation changes
  * Cancellation processing with fare restrictions
  * Seat assignments and upgrades
  * Compliance with overbooking policies and fare class rules

Each domain includes JSON databases, custom Python APIs, Markdown policy documents, and JSON task instances with ground-truth annotations.

===== The pass^k Metric =====

tau-bench introduces the **pass^k metric** to measure agent reliability -- the probability that an agent succeeds on ALL k independent trials of the same task, averaged across tasks. Running ''n'' i.i.d. trials per task, of which ''c'' succeed, pass^k is estimated without bias as:

\text{pass}^k = \mathbb{E}_{\text{task}}\left[ \binom{c}{k} \Big/ \binom{n}{k} \right]

Unlike pass@k (success in at least one of k trials), pass^k emphasizes **consistency**:

  * ''pass^1'' equals the expected success rate E[r]
  * Higher k values exponentially penalize inconsistency
  * An agent with 50% pass^1 whose trials vary independently has pass^8 ≈ 0.5^8, below 1%

This metric is critical because real-world deployment requires agents to handle millions of conversations reliably.

===== Evaluation Method =====

Success is determined **deterministically** by comparing the final database state against the annotated goal state:

  - The conversation runs to completion (the agent resolves the user intent or fails)
  - The final state of the database (orders, reservations, etc.) is extracted
  - This state is compared field-by-field against the ground-truth annotation
  - A task succeeds only if the database matches exactly

This approach is robust to dialogue variation -- it does not matter how the agent reached the outcome, only that the final state is correct and policy-compliant.

===== Key Results =====

^ Model ^ Strategy ^ Retail pass^1 ^ Airline pass^1 ^ Retail pass^4 ^ Airline pass^4 ^
| Claude 3.5 Sonnet | Tool Calling | **0.692** | **0.460** | 0.462 | **0.225** |
| GPT-4o | Tool Calling | 0.604 | 0.420 | **0.491** | 0.200 |
| GPT-4o | Act | -- | 0.365 | -- | 0.140 |
| GPT-4o | ReAct | -- | 0.325 | -- | 0.160 |
| GPT-4o-mini | Tool Calling | -- | 0.225 | -- | 0.100 |

Key findings:

  * Even the strongest agents succeed on fewer than half of airline tasks
  * Reliability drops sharply with repetition: retail pass^8 falls below 25% for all models
  * Native function calling outperforms ReAct and Act strategies
  * Claude 3.5 Sonnet shows the strongest overall performance

===== Code Example =====

The evaluation loop can be sketched as follows. Note that the class and function names here (''RetailEnv'', ''ToolCallingAgent'', ''load_tasks'', ''compare_db_state'') are illustrative, not necessarily the repository's exact API:

<code python>
# tau-bench evaluation loop (simplified)
from tau_bench.envs import RetailEnv, AirlineEnv
from tau_bench.agents import ToolCallingAgent

def evaluate_pass_k(agent, env_class, tasks, k=4):
    """Run k independent trials per task and return pass^k."""
    task_results = {task["id"]: [] for task in tasks}
    for trial in range(k):
        for task in tasks:
            env = env_class(task=task)
            agent.reset()
            observation = env.reset()
            # Alternate agent actions (tool calls or user messages)
            # until the conversation terminates.
            while not env.done:
                action = agent.act(observation)
                observation = env.step(action)
            # Compare final database state to ground truth
            success = env.compare_db_state(task["goal_state"])
            task_results[task["id"]].append(success)
    # pass^k: fraction of tasks that succeeded on ALL k trials
    return sum(
        all(results) for results in task_results.values()
    ) / len(tasks)

agent = ToolCallingAgent(model="gpt-4o")
tasks = load_tasks("retail")
reliability = evaluate_pass_k(agent, RetailEnv, tasks, k=4)
print(f"Retail pass^4: {reliability:.3f}")
</code>

===== Error Taxonomy =====

Failures decompose into
three categories:

  * **Reasoning errors**: incorrect tool selection, wrong API parameters, or flawed multi-step logic
  * **Communication failures**: misaligned responses to the user, asking irrelevant questions, or failing to confirm actions
  * **Policy violations**: performing actions that violate domain rules (e.g., processing a refund outside the return window)

===== Extensions: tau-squared-bench =====

The follow-up **tau-squared-bench** adds:

  * A **telecom** domain focusing on troubleshooting scenarios
  * Bug fixes and improved evaluation
  * A dual-control environment for more complex agent-user dynamics

===== References =====

  * [[https://arxiv.org/abs/2406.12045|Yao et al. (2024) - tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains]]
  * [[https://github.com/sierra-research/tau-bench|Official tau-bench Repository (Sierra Research)]]
  * [[https://sierra.ai/blog/tau-bench-shaping-development-evaluation-agents|Sierra Blog: tau-bench Shaping Agent Development]]

===== See Also =====

  * [[agentbench|AgentBench]] - Multi-dimensional benchmark for LLM agents across 8 environments
  * [[llm_as_judge|LLM-as-a-Judge]] - Automated evaluation using LLMs as evaluators
  * [[taskweaver|TaskWeaver]] - Code-first agent framework for task execution
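===== Worked Example: pass^k Consistency =====

The pass^k section claims that an agent with 50% per-trial success can fall below 1% at k=8. This is easy to check numerically. The sketch below is a standalone illustration (not benchmark code): it simulates tasks whose trials succeed independently with probability 0.5 and computes pass^k as the fraction of tasks whose first k trials all succeed.

```python
import random

def pass_hat_k(results_per_task, k):
    """pass^k: fraction of tasks whose first k trials all succeeded."""
    return sum(
        all(trials[:k]) for trials in results_per_task
    ) / len(results_per_task)

random.seed(0)
# 10,000 simulated tasks, 8 trials each; every trial succeeds
# independently with probability 0.5 (so pass^1 is about 50%).
results = [
    [random.random() < 0.5 for _ in range(8)]
    for _ in range(10_000)
]

print(f"pass^1 ~ {pass_hat_k(results, 1):.3f}")  # close to 0.500
print(f"pass^8 ~ {pass_hat_k(results, 8):.4f}")  # close to 0.5**8, i.e. ~0.0039
```

With 10,000 tasks the estimates are tight: requiring all eight trials to succeed collapses a 50% per-trial rate to roughly 0.4%, which is exactly the inconsistency that pass^k is designed to expose and that a single-trial success rate hides.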