====== tau-bench ======

tau-bench (Tool-Agent-User Benchmark) is a benchmark introduced by Yao et al. (2024) from Princeton/Sierra for evaluating language agents in dynamic, multi-turn conversations where agents must interact with simulated users while using domain-specific API tools and adhering to policy guidelines. It addresses a critical gap in existing benchmarks by testing the three-way interaction between tools, agents, and users in realistic customer service scenarios.

===== Motivation =====

Existing agent benchmarks test tool use or reasoning in isolation, but real-world deployment requires agents to simultaneously:

  - Interact seamlessly with humans over long conversation horizons
  - Accurately adhere to complex domain-specific policies and rules
  - Maintain consistency and reliability across millions of interactions

tau-bench addresses all three requirements through its Tool-Agent-User framework.

===== Framework Architecture =====

The benchmark is modeled as a **partially observable Markov decision process (POMDP)** whose state combines the hidden database state and the user state:

S = S_{db} \otimes S_{user}

Agent actions span two spaces:

A = A_{db} \cup A_{user}

where ''A_db'' represents database API calls and ''A_user'' represents communications with the user. The agent cannot directly observe the database state and must gather information incrementally through tool calls and user interaction.

Key components:

  * **User Simulator**: powered by language models with randomized personas, generating realistic multi-turn dialogues tied to hidden task instructions
  * **Tool APIs**: Python functions for database access (order lookups, modifications, etc.)
  * **Policy Documents**: Markdown-formatted domain rules the agent must follow
  * **Hidden Goal State**: annotated ground-truth database state for evaluation

===== Domains =====

tau-bench includes two realistic customer service domains:

**tau-retail**: retail customer service involving:
  * Inventory checks and product searches
  * Order modifications and cancellations
  * Refund processing with eligibility rules
  * Policy-constrained actions (e.g., return windows, exchange policies)

**tau-airline**: airline customer service involving:
  * Flight bookings and reservation changes
  * Cancellation processing with fare restrictions
  * Seat assignments and upgrades
  * Compliance with overbooking policies and fare class rules

Each domain includes JSON databases, custom Python APIs, Markdown policy documents, and JSON task instances with ground-truth annotations.

===== The pass^k Metric =====

tau-bench introduces the **pass^k metric** to measure agent reliability -- the probability that an agent succeeds on ALL k independent trials of the same task, averaged across tasks. Running ''n'' i.i.d. trials per task, of which ''c'' succeed, pass^k is estimated without bias as:

\text{pass}^k = \mathbb{E}_{\text{task}}\left[ \binom{c}{k} \Big/ \binom{n}{k} \right]

Unlike pass@k (success in at least one of k trials), pass^k emphasizes **consistency**:

  * ''pass^1'' equals the expected success rate E[r]
  * Higher k values exponentially penalize inconsistency
  * An agent with 50% pass^1 whose trials vary independently has pass^8 ≈ 0.5^8, below 1%

This metric is critical because real-world deployment requires agents to handle millions of conversations reliably.

===== Evaluation Method =====

Success is determined **deterministically** by comparing the final database state against the annotated goal state:

  - The conversation runs to completion (the agent resolves the user intent or fails)
  - The final state of the database (orders, reservations, etc.) is extracted
  - This state is compared field-by-field against the ground-truth annotation
  - A task succeeds only if the database matches exactly

This approach is robust to dialogue variation -- it does not matter how the agent reached the outcome, only that the final state is correct and policy-compliant.

===== Key Results =====

^ Model ^ Strategy ^ Retail pass^1 ^ Airline pass^1 ^ Retail pass^4 ^ Airline pass^4 ^
| Claude 3.5 Sonnet | Tool Calling | **0.692** | **0.460** | 0.462 | **0.225** |
| GPT-4o | Tool Calling | 0.604 | 0.420 | **0.491** | 0.200 |
| GPT-4o | Act | -- | 0.365 | -- | 0.140 |
| GPT-4o | ReAct | -- | 0.325 | -- | 0.160 |
| GPT-4o-mini | Tool Calling | -- | 0.225 | -- | 0.100 |

Key findings:

  * Even the strongest agents succeed on fewer than half of airline tasks
  * Reliability drops sharply with repetition: retail pass^8 falls below 25% for all models
  * Native function calling outperforms ReAct and Act strategies
  * Claude 3.5 Sonnet shows the strongest overall performance

===== Code Example =====

The evaluation loop can be sketched as follows. Note that the class and function names here (''RetailEnv'', ''ToolCallingAgent'', ''load_tasks'', ''compare_db_state'') are illustrative, not necessarily the repository's exact API:

<code python>
# tau-bench evaluation loop (simplified)
from tau_bench.envs import RetailEnv, AirlineEnv
from tau_bench.agents import ToolCallingAgent

def evaluate_pass_k(agent, env_class, tasks, k=4):
    """Run k independent trials per task and return pass^k."""
    task_results = {task["id"]: [] for task in tasks}
    for trial in range(k):
        for task in tasks:
            env = env_class(task=task)
            agent.reset()
            observation = env.reset()
            # Alternate agent actions (tool calls or user messages)
            # until the conversation terminates.
            while not env.done:
                action = agent.act(observation)
                observation = env.step(action)
            # Compare final database state to ground truth
            success = env.compare_db_state(task["goal_state"])
            task_results[task["id"]].append(success)
    # pass^k: fraction of tasks that succeeded on ALL k trials
    return sum(
        all(results) for results in task_results.values()
    ) / len(tasks)

agent = ToolCallingAgent(model="gpt-4o")
tasks = load_tasks("retail")
reliability = evaluate_pass_k(agent, RetailEnv, tasks, k=4)
print(f"Retail pass^4: {reliability:.3f}")
</code>

===== Error Taxonomy =====

Failures decompose into
three categories:

  * **Reasoning errors**: incorrect tool selection, wrong API parameters, or flawed multi-step logic
  * **Communication failures**: misaligned responses to the user, asking irrelevant questions, or failing to confirm actions
  * **Policy violations**: performing actions that violate domain rules (e.g., processing a refund outside the return window)

===== Extensions: tau-squared-bench =====

The follow-up **tau-squared-bench** adds:

  * A **telecom** domain focusing on troubleshooting scenarios
  * Bug fixes and improved evaluation
  * A dual-control environment for more complex agent-user dynamics

===== References =====

  * [[https://arxiv.org/abs/2406.12045|Yao et al. (2024) - tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains]]
  * [[https://github.com/sierra-research/tau-bench|Official tau-bench Repository (Sierra Research)]]
  * [[https://sierra.ai/blog/tau-bench-shaping-development-evaluation-agents|Sierra Blog: tau-bench Shaping Agent Development]]

===== See Also =====

  * [[agentbench|AgentBench]] - Multi-dimensional benchmark for LLM agents across 8 environments
  * [[llm_as_judge|LLM-as-a-Judge]] - Automated evaluation using LLMs as evaluators
  * [[taskweaver|TaskWeaver]] - Code-first agent framework for task execution
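===== Worked Example: pass^k Consistency =====

The pass^k section claims that an agent with 50% per-trial success can fall below 1% at k=8. This is easy to check numerically. The sketch below is a standalone illustration (not benchmark code): it simulates tasks whose trials succeed independently with probability 0.5 and computes pass^k as the fraction of tasks whose first k trials all succeed.

```python
import random

def pass_hat_k(results_per_task, k):
    """pass^k: fraction of tasks whose first k trials all succeeded."""
    return sum(
        all(trials[:k]) for trials in results_per_task
    ) / len(results_per_task)

random.seed(0)
# 10,000 simulated tasks, 8 trials each; every trial succeeds
# independently with probability 0.5 (so pass^1 is about 50%).
results = [
    [random.random() < 0.5 for _ in range(8)]
    for _ in range(10_000)
]

print(f"pass^1 ~ {pass_hat_k(results, 1):.3f}")  # close to 0.500
print(f"pass^8 ~ {pass_hat_k(results, 8):.4f}")  # close to 0.5**8, i.e. ~0.0039
```

With 10,000 tasks the estimates are tight: requiring all eight trials to succeed collapses a 50% per-trial rate to roughly 0.4%, which is exactly the inconsistency that pass^k is designed to expose and that a single-trial success rate hides.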