tau-bench

tau-bench (Tool-Agent-User Benchmark) is a benchmark introduced by Yao et al. (2024) from Princeton/Sierra for evaluating language agents in dynamic, multi-turn conversations. Agents must interact with simulated users while using domain-specific API tools and adhering to policy guidelines. The benchmark addresses a critical gap in existing evaluations by testing the three-way interaction between tools, agents, and users in realistic customer service scenarios.

Motivation

Existing agent benchmarks test tool use or reasoning in isolation, but real-world deployment requires agents to simultaneously:

  1. Interact seamlessly with humans over long conversation horizons
  2. Accurately adhere to complex domain-specific policies and rules
  3. Maintain consistency and reliability across millions of interactions

tau-bench addresses all three requirements through its Tool-Agent-User framework.

Framework Architecture

The benchmark is modeled as a partially observable Markov decision process (POMDP) where:

<latex> S = S_{db} \otimes S_{user} </latex>

The state combines hidden database states and user states. Agent actions span two spaces:

<latex> A = A_{db} \cup A_{user} </latex>

where A_db represents database API calls and A_user represents communications with the user. The agent cannot directly observe the database state and must gather information incrementally through tool calls and user interaction.
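The split action space can be sketched with two illustrative types. This is a minimal sketch to make the A_db / A_user distinction concrete; the class and function names are hypothetical, not the actual tau-bench API:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class DBAction:
    """Element of A_db: a call to a domain API that reads/writes the database."""
    tool: str
    kwargs: dict

@dataclass
class UserAction:
    """Element of A_user: a natural-language message to the simulated user."""
    message: str

Action = Union[DBAction, UserAction]

def dispatch(action: Action) -> str:
    """Route an action to the database channel or the user channel."""
    if isinstance(action, DBAction):
        return f"api:{action.tool}"
    return "user:message"
```

Because the database state is hidden, the agent alternates between the two channels: DBAction to gather or change state, UserAction to elicit missing details from the user.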

Key components:

  1. A database that is hidden from the agent and mutable only through the provided APIs
  2. Domain-specific APIs the agent calls to read and write that database
  3. A policy document whose rules the agent must follow
  4. An LM-simulated user who reveals their intent incrementally over the conversation

Domains

tau-bench includes two realistic customer service domains:

tau-retail: Retail customer service involving order cancellations, modifications, returns, and exchanges over user, order, and product databases

tau-airline: Airline customer service involving booking, changing, and cancelling flight reservations under fare-class, baggage, and membership policies

Each domain includes JSON databases, custom Python APIs, a Markdown policy document, and JSON task instances with ground-truth annotations (115 tasks in retail, 50 in airline).
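As a hypothetical illustration of what a task instance might look like, the sketch below shows one plausible shape; the field names and values are assumptions for illustration, not the published schema:

```python
import json

# Hypothetical tau-bench task instance (illustrative field names only).
task_json = """
{
  "id": "retail_42",
  "instruction": "You are Sara. You want to return order #W123 for a refund.",
  "goal_state": {"orders": {"W123": {"status": "return_requested"}}}
}
"""

task = json.loads(task_json)
```

The instruction seeds the simulated user, while goal_state is the annotated database state the evaluator compares against after the conversation ends.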

The pass^k Metric

tau-bench introduces the pass^k metric to measure agent reliability: the probability that an agent succeeds on all k independent trials of the same task, averaged over tasks:

<latex> \text{pass}^k = \mathbb{E}_{\text{task}}\left[ \frac{\binom{c}{k}}{\binom{n}{k}} \right] </latex>

where each task is attempted n >= k times and c is the number of successful trials for that task; the ratio is an unbiased estimate of the probability that k randomly sampled trials all succeed. Unlike pass@k (success in at least one of k trials), pass^k can only decrease as k grows, so it rewards consistency rather than occasional success.

This metric is critical because real-world deployment requires agents to handle millions of conversations reliably.

Evaluation Method

Success is determined deterministically by comparing the final database state against the annotated goal state:

  1. The conversation runs to completion (agent resolves the user intent or fails)
  2. The final state of the database (orders, reservations, etc.) is extracted
  3. This state is compared field-by-field against the ground-truth annotation
  4. A task succeeds only if the database matches exactly

This approach is robust to dialogue variation: it does not matter how the agent reached the outcome, only that the final database state is correct and policy-compliant.
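The field-by-field check in step 3 can be sketched as a recursive comparison. This is a simplified illustration of the idea, not the actual tau-bench implementation:

```python
def db_state_matches(final, goal):
    """Recursively compare the final database state against the annotated
    goal state; every key and leaf value must match exactly."""
    if isinstance(goal, dict):
        return (isinstance(final, dict)
                and final.keys() == goal.keys()
                and all(db_state_matches(final[k], goal[k]) for k in goal))
    return final == goal
```

Because the comparison is purely structural, two very different conversations that leave the database in the same annotated state both count as successes.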

Key Results

Model              Strategy      Retail pass^1  Airline pass^1  Retail pass^4  Airline pass^4
Claude 3.5 Sonnet  Tool Calling  0.692          0.460           0.462          0.225
GPT-4o             Tool Calling  0.604          0.420           0.491          0.200
GPT-4o             Act           0.365          0.140           n/a            n/a
GPT-4o             ReAct         0.325          0.160           n/a            n/a
GPT-4o-mini        Tool Calling  0.225          0.100           n/a            n/a

Key findings:

  1. Even the best agent (Claude 3.5 Sonnet) solves fewer than half of the airline tasks
  2. Reliability degrades sharply with repetition: pass^4 is well below pass^1 for every model
  3. Native tool calling outperforms the Act and ReAct prompting strategies
  4. Smaller models (GPT-4o-mini) lag substantially behind frontier models

Code Example

# tau-bench evaluation loop (simplified; API names are illustrative)
from tau_bench.envs import RetailEnv, AirlineEnv
from tau_bench.agents import ToolCallingAgent

def evaluate_pass_k(agent, env_class, tasks, k=4):
    """Run each task k times; a task counts toward pass^k only if
    every one of its k trials succeeds."""
    task_results = {task["id"]: [] for task in tasks}

    for trial in range(k):
        for task in tasks:
            env = env_class(task=task)
            agent.reset()
            observation = env.reset()
            done = False
            while not done:
                action = agent.act(observation)
                observation, done = env.step(action)

            # Compare final database state to ground truth
            success = env.compare_db_state(task["goal_state"])
            task_results[task["id"]].append(success)

    # pass^k: fraction of tasks that succeeded in all k of their trials
    return sum(
        all(results) for results in task_results.values()
    ) / len(tasks)

agent = ToolCallingAgent(model="gpt-4o")
tasks = load_tasks("retail")
reliability = evaluate_pass_k(agent, RetailEnv, tasks, k=4)
print(f"Retail pass^4: {reliability:.3f}")

Error Taxonomy

Failures broadly decompose into three categories: supplying wrong arguments or information when calling tools, making decisions that violate the domain policy, and failing to fully resolve compound, multi-part user requests.

Extensions: tau-squared-bench

The follow-up tau-squared-bench extends the framework to dual-control settings: it adds a telecom domain in which the simulated user also takes actions in the environment, so the agent must coach the user through steps it cannot execute itself.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045.
