Wiki · Concept · Last reviewed June 25, 2026

Tau-bench

Tau-bench is a benchmark for tool-using conversational agents, testing whether an agent can interact with a simulated user, follow domain policy, call APIs, and leave the database in the right state.

Category: AI evaluations
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Tags: Tau-bench, tool use, conversational agents, policy following, benchmarks

Definition

Tau-bench is a benchmark for evaluating language agents in tool-agent-user interaction. The core paper is τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, arXiv:2406.12045, by Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. It was submitted to arXiv on June 17, 2024.

The benchmark studies a setting that ordinary tool-call tests often simplify away: a user is present, the agent has domain-specific API tools, and the agent must follow policy while changing the state of a simulated business environment. The paper's title uses the Greek letter τ; this page writes Tau-bench so that the name is searchable in plain text.

Problem Statement

The paper argues that many agent benchmarks do not adequately test two deployment-critical behaviors: interaction with human users and adherence to domain-specific rules. A customer-service agent, travel assistant, support agent, or account workflow agent is not only solving a puzzle. It is conversing with a counterparty, checking policy, asking for missing information, choosing tools, and deciding when not to act.

This is the difference between a function-call demo and a service workflow. A syntactically valid tool call can still violate refund policy, use stale facts, skip required confirmation, modify the wrong record, or leave the user with a plausible but false explanation. Tau-bench makes the conversation and the final world state part of the same evaluation object.

Benchmark Design

The arXiv abstract describes Tau-bench as emulating dynamic conversations between a simulated user and a language agent. The agent receives domain-specific API tools and policy guidelines. The user is itself simulated by a language model, so each task unfolds as a multi-turn interaction in which information arrives over several turns, rather than as a single static instruction.
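The interaction pattern described above can be sketched as a small loop. This is a toy illustration under stated assumptions, not the benchmark's code: `Episode`, `cancel_order`, `USER_SCRIPT`, `toy_agent`, and `run_episode` are hypothetical names, the user is a fixed script standing in for an LM-simulated user, and the "policy" is a single invented rule that a cancellation needs an order ID and an explicit yes.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Everything one conversation produces (hypothetical structure)."""
    db: dict                                        # mutable simulated business state
    transcript: list = field(default_factory=list)  # (speaker, text) pairs
    tool_log: list = field(default_factory=list)    # (tool, argument, result) triples


def cancel_order(db: dict, order_id: str) -> str:
    """Toy API tool: cancels an order only if it exists and is pending."""
    order = db["orders"].get(order_id)
    if order is None or order["status"] != "pending":
        return "error"
    order["status"] = "cancelled"
    return "ok"


# Stand-in for the LM-simulated user: details are revealed turn by turn.
USER_SCRIPT = [
    "I want to cancel an order.",
    "It is order A1.",
    "Yes, please go ahead.",
]


def toy_agent(user_msg: str, memory: dict, episode: Episode) -> str:
    """Rule-based stand-in for the agent, following the invented policy."""
    match = re.search(r"order ([A-Z]\d+)", user_msg)
    if match:
        memory["order_id"] = match.group(1)
        return f"Cancel order {memory['order_id']}? Please confirm."
    if "order_id" not in memory:
        return "Which order would you like to cancel?"
    if user_msg.lower().startswith("yes"):
        result = cancel_order(episode.db, memory["order_id"])
        episode.tool_log.append(("cancel_order", memory["order_id"], result))
        return "Your order has been cancelled." if result == "ok" else "I could not cancel that order."
    return "I need an explicit yes before I can cancel."


def run_episode(db: dict) -> Episode:
    """Alternate user and agent turns, recording transcript and tool calls."""
    episode, memory = Episode(db=db), {}
    for user_msg in USER_SCRIPT:
        episode.transcript.append(("user", user_msg))
        episode.transcript.append(("agent", toy_agent(user_msg, memory, episode)))
    return episode


episode = run_episode({"orders": {"A1": {"status": "pending"}}})
print(episode.tool_log)  # [('cancel_order', 'A1', 'ok')]
print(episode.db)        # {'orders': {'A1': {'status': 'cancelled'}}}
```

The sketch keeps three artifacts per conversation: the transcript, the tool log, and the mutated state. A real run would replace the script and the rule-based agent with language models, which is what makes trajectories vary between trials.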

Evaluation is not based only on whether the final answer sounds right. The paper says Tau-bench compares the database state at the end of the conversation with an annotated goal state. The Sierra Research repository provides the code and data for the benchmark and exposes airline and retail environments in its README examples and leaderboards. That makes the benchmark closer to transaction testing than chat evaluation: the record, booking, order, refund, or account state has to end up where the task says it should.
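A minimal sketch of state-based scoring, assuming the database can be represented as nested dictionaries. The function names `diff_state` and `task_reward`, the toy order records, and the binary reward are illustrative assumptions; the repository's actual comparison logic may differ.

```python
def diff_state(final: dict, goal: dict, path: str = "") -> list[str]:
    """List the differences between a final state and an annotated goal state."""
    diffs = []
    for key in sorted(set(final) | set(goal)):
        where = f"{path}/{key}"
        if key not in final:
            diffs.append(f"{where}: missing, expected {goal[key]!r}")
        elif key not in goal:
            diffs.append(f"{where}: unexpected {final[key]!r}")
        elif isinstance(final[key], dict) and isinstance(goal[key], dict):
            # Recurse into nested records so the report names the exact field.
            diffs.extend(diff_state(final[key], goal[key], where))
        elif final[key] != goal[key]:
            diffs.append(f"{where}: got {final[key]!r}, expected {goal[key]!r}")
    return diffs


def task_reward(final: dict, goal: dict) -> int:
    """Binary reward: 1 only when the final state matches the goal exactly."""
    return 0 if diff_state(final, goal) else 1


goal = {"orders": {"A1": {"status": "cancelled"}, "B2": {"status": "pending"}}}

# The agent sounded helpful but cancelled the wrong order.
wrong = {"orders": {"A1": {"status": "pending"}, "B2": {"status": "cancelled"}}}
right = {"orders": {"A1": {"status": "cancelled"}, "B2": {"status": "pending"}}}

for line in diff_state(wrong, goal):
    print(line)
print(task_reward(wrong, goal))  # 0
print(task_reward(right, goal))  # 1
```

Under this kind of check, a fluent transcript earns nothing if the wrong record was modified, and the diff itself is a useful artifact to keep alongside the score.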

Reliability Metric

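Tau-bench introduced pass^k as a reliability metric over multiple trials: the probability that an agent succeeds on all k independent trials of the same task, averaged across tasks. This differs from the pass@k convention used in code generation, which asks whether at least one of k attempts succeeds. The point is that a user-facing agent can be unacceptable even when it sometimes succeeds. If the same task succeeds on one run and fails on another because the model drifts through the conversation differently, the system is hard to trust in production.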

The paper's abstract reports that state-of-the-art function-calling agents, including GPT-4o in the reported experiment, succeed on fewer than half of the tasks and are inconsistent, with pass^8 below 25 percent in retail. Those numbers should be read as paper-specific experimental results from 2024, not as a current claim about every model or deployment.
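A short sketch of how pass^k can be estimated from repeated trials. It assumes each task was run `n_trials` times and uses the standard combinatorial estimator, C(c, k) / C(n, k) averaged over tasks, where c is the number of successful trials for a task. The function name `pass_hat_k` and the sample counts are illustrative; check the paper and repository for the exact computation behind reported results.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k trials of a task all succeed.

    successes_per_task[i] is the number of successful trials (out of n_trials)
    observed for task i. For each task, C(c, k) / C(n, k) is the fraction of
    k-sized subsets of its trials that contain only successes.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= n_trials:
            raise ValueError("success count out of range")
        total += comb(c, k) / comb(n_trials, k)  # comb(c, k) is 0 when c < k
    return total / len(successes_per_task)


# Invented example: four tasks, eight trials each.
successes = [8, 6, 0, 8]
print(pass_hat_k(successes, n_trials=8, k=1))  # 0.6875
print(pass_hat_k(successes, n_trials=8, k=8))  # 0.5
```

In this invented example the task that succeeds on six of eight trials contributes 0.75 to pass^1 but nothing to pass^8, which is how the metric separates occasional success from dependable success.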

Governance and Safety

Tau-bench matters for governance because it puts policy following, user interaction, and database mutation in one test. That is the shape of many real agent deployments: the model speaks to a person while holding tools that can change records. The risk is not only wrong text. It is an apparently helpful conversation that ends in an unauthorized refund, incorrect cancellation, ungrounded promise, missed escalation, or inconsistent treatment across repeated attempts.

It also exposes a limit of simulated-user evaluation. Simulated users do not fully represent angry customers, confused callers, fraud attempts, accessibility needs, silence, abandonment, coercion, or pressure from human operators. A Tau-bench score can support a claim about a benchmarked policy workflow. It cannot by itself prove readiness for regulated customer service, financial decisions, health support, employment screening, public benefits, or other consequential domains.

Evidence Record

A serious Tau-bench report should name the benchmark version, domain, task IDs, policy text, database fixture, API tools, user simulator model and strategy, agent model, agent scaffold, prompt template, tool-call mode, retry policy, temperature, seeds, transcript, tool log, final database diff, pass^1 score, pass^k score, and failure taxonomy. Without that record, a headline success rate hides whether the agent failed by misunderstanding the user, violating policy, calling the wrong tool, or producing inconsistent trajectories.
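One way to enforce such a record is a completeness check that runs before a result is published. The sketch below is an assumption about how a team might do this, not part of the benchmark: the field names in `REQUIRED_FIELDS` are a hypothetical encoding of the list above, and `missing_fields` is an illustrative helper.

```python
# Hypothetical field names mirroring the evidence list in the text.
REQUIRED_FIELDS = (
    "benchmark_version", "domain", "task_ids", "policy_text",
    "database_fixture", "api_tools", "user_simulator", "agent_model",
    "agent_scaffold", "prompt_template", "tool_call_mode", "retry_policy",
    "temperature", "seeds", "transcript", "tool_log",
    "final_database_diff", "pass_1", "pass_k", "failure_taxonomy",
)


def missing_fields(report: dict) -> list[str]:
    """Return required fields that are absent or empty in a report.

    Numeric zeros (for example temperature 0.0) count as present; only
    None, empty strings, empty lists, and empty dicts count as missing.
    """
    return [
        name for name in REQUIRED_FIELDS
        if report.get(name) in (None, "", [], {})
    ]


# Invented partial report: a headline score with little supporting evidence.
report = {
    "domain": "retail",
    "agent_model": "example-model",  # placeholder, not a real result
    "temperature": 0.0,
    "pass_1": 0.6875,
    "seeds": [],
}

gaps = missing_fields(report)
print(len(gaps))        # 16
print("seeds" in gaps)  # True
```

A report that fails this check can still be circulated internally, but the gaps show which failure explanations it cannot yet rule out.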

Source Discipline

Use arXiv:2406.12045 for the title, authors, submission date, benchmark framing, simulated-user design, database-state evaluation, pass^k metric, and original reported performance claims. Use the sierra-research/tau-bench repository for implementation, task package, examples, and leaderboard context. Keep later extensions, repaired task sets, and vendor blog claims separate unless they are cited as separate dated sources.

Spiralist Reading

Tau-bench is a test of the clerk in conversation. The agent is not judged only by eloquence, nor only by tool syntax. It is judged by whether a policy-bearing interaction leaves a shared record in the right state. Spiralism reads that as a useful compression of the agent era: speech becomes procedure, procedure becomes state, and state becomes institutional memory.

Open Questions

How should simulated users represent refusal, delay, confusion, adversarial behavior, and abandonment?
Which policy violations should count as safety failures even when the task goal is completed?
How should pass^k be translated into deployment gates for high-volume customer workflows?
How should Tau-bench results be compared with WorkArena, AgentDojo, OSWorld, and real production incident data?

Sources

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan, τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, arXiv:2406.12045 [cs.AI], submitted June 17, 2024.
Sierra Research GitHub repository, sierra-research/tau-bench, code and data for Tau-bench, reviewed June 25, 2026.
