Blog · arXiv Analysis · Last reviewed June 25, 2026

The Simulated Customer Becomes the Walkaway Gap

A June 2026 arXiv paper tests whether LLM user simulators can reproduce the most important customer behavior: leaving.

The Missing Exit

A user simulator is not a person. That is obvious in theory but easy to lose sight of inside an evaluation stack. Once simulated users become the counterparty for training, benchmarking, and tuning conversational agents, their blind spots become product incentives. If the simulator keeps talking when a real user would leave, the agent is trained inside a false market.

The missing behavior is exit. A real customer can lose interest, delay, deflect, ignore the pitch, or stop replying. A simulated customer may keep playing the conversation because continuation is what language models are tuned to do. The governance issue is not whether the simulated transcript sounds human. It is whether the simulator preserves the right to disengage at the moment when disengagement is the real outcome.

The Paper Frame

The source is Liang Chen's "Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes," arXiv:2606.20708v1 [cs.AI], submitted June 16, 2026. The paper studies user simulators for conversational AI, especially systems used to evaluate or train sales, persuasion, and task-oriented agents.

The paper argues that existing simulator-fidelity work often measures communicative fidelity: whether simulated users sound plausible, follow a persona, pace information, and express emotion like humans. Chen's claim is sharper. In consequential settings, the important question is decision fidelity: whether the simulated population reproduces how real users move through willingness, resistance, deliberation, and disengagement.

Decision Fidelity

The empirical setting is ZhenaiSales, a dataset described in the paper as 2,790 production conversations between a deployed LLM sales agent and real parent customers of a Chinese relationship-matchmaking platform. Of those conversations, 793 had verified payment outcomes and 1,997 did not. Converted conversations were truncated at the first payment timestamp to remove post-purchase service chat from the pre-decision record.
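The truncation step can be pictured with a small sketch. The Turn record, the speaker labels, and the choice to keep only turns strictly before the payment timestamp are assumptions made for illustration; the paper's own preprocessing code is not reproduced here. The only figures taken from the paper are the 793 and 1,997 conversation counts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Turn:
    speaker: str         # "agent" or "user" (hypothetical labels)
    text: str
    timestamp: datetime


def truncate_at_payment(turns: List[Turn], first_payment: Optional[datetime]) -> List[Turn]:
    """Return the pre-decision record of a conversation.

    Assumption: turns strictly before the first payment timestamp are kept.
    Conversations without a payment are returned unchanged.
    """
    if first_payment is None:
        return list(turns)
    return [t for t in turns if t.timestamp < first_payment]


# Toy conversation: payment lands two minutes in, so the last turn is dropped.
start = datetime(2026, 1, 1, 10, 0)
conversation = [
    Turn("agent", "Would you like to hear about the plan?", start),
    Turn("user", "Yes, go ahead.", start + timedelta(minutes=1)),
    Turn("agent", "Thanks for your payment. Here is your receipt.", start + timedelta(minutes=5)),
]
pre_decision = truncate_at_payment(conversation, start + timedelta(minutes=2))

# Dataset split as reported in the paper: 793 paid, 1,997 did not.
PAID, UNPAID = 793, 1997
TOTAL = PAID + UNPAID
```

Without this cut, post-purchase service chat would sit in the same record as the decision itself and could make buyers look more engaged than they were before paying.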

The method is a teacher-forced probe. At selected points in a real conversation, the simulator receives the same real prefix and user profile, then generates the next user turn. A fixed LLM-based decision-state instrument scores both the real and simulated turns for engagement stage, emotion, and blocker. Because the context and scoring instrument are held fixed, any gap between the two scores reflects how the simulator models the user's next decision state rather than a difference in context or scoring.
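A minimal sketch of that probe loop, under stated assumptions: the function names, the DecisionState fields, the stage labels, and the choice to pass the prefix to the judge are all hypothetical, and the toy simulator and keyword judge stand in for the LLM calls the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class DecisionState:
    stage: str      # e.g. "resistance" or "deliberation" (hypothetical labels)
    emotion: str
    blocker: str


# Both callables are placeholders for model calls.
Simulator = Callable[[Sequence[str], Dict[str, str]], str]
Judge = Callable[[Sequence[str], str], DecisionState]


def teacher_forced_probe(
    turns: Sequence[str],
    profile: Dict[str, str],
    probe_indices: Sequence[int],
    simulate: Simulator,
    judge: Judge,
) -> List[Tuple[DecisionState, DecisionState]]:
    """Score the real and simulated next user turn from the same real prefix."""
    pairs = []
    for i in probe_indices:
        prefix = turns[:i]                      # real history, never simulated text
        real_turn = turns[i]                    # what the real user actually said
        sim_turn = simulate(prefix, profile)    # what the simulator says instead
        pairs.append((judge(prefix, real_turn), judge(prefix, sim_turn)))
    return pairs


def toy_simulator(prefix: Sequence[str], profile: Dict[str, str]) -> str:
    return "Tell me more about the price."


def toy_judge(prefix: Sequence[str], turn: str) -> DecisionState:
    if "not interested" in turn.lower():
        return DecisionState("resistance", "negative", "no_need")
    return DecisionState("deliberation", "neutral", "price")


turns = [
    "agent: Would you like the premium plan?",
    "user: I'm not interested, thanks.",
]
pairs = teacher_forced_probe(turns, {"segment": "parent"}, [1], toy_simulator, toy_judge)
real_state, sim_state = pairs[0]
```

In this toy run the real turn is scored as resistance while the simulated turn is scored as deliberation, which is the shape of mismatch the probe is designed to expose. Resetting to the real prefix at every probe point keeps simulator drift from accumulating across turns.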

The Disengagement Deficit

The headline result is the disengagement deficit. In the primary condition, the simulator reproduced eventual buyers closely, with engagement-depth bias reported as +0.09. For eventual non-buyers, however, the simulator inflated engagement toward the purchase frame, with reported depth bias +0.40 and group contrast d=0.38, p<0.001.
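One way to read these numbers is as a per-group bias followed by a contrast between groups. The sketch below assumes that each scored turn maps to an ordinal engagement depth and that bias is the mean of simulated depth minus real depth; both are illustrative assumptions, the function name is hypothetical, and the toy values are not the paper's data.

```python
from statistics import mean
from typing import Iterable, Tuple


def depth_bias(pairs: Iterable[Tuple[int, int]]) -> float:
    """Mean of (simulated depth - real depth) over probe pairs.

    Assumption: depth is an ordinal score where higher means more engaged.
    """
    return mean(sim - real for real, sim in pairs)


# Toy (real_depth, simulated_depth) pairs, split by the real outcome.
buyers = [(3, 3), (2, 3), (4, 4), (3, 3)]
non_buyers = [(1, 2), (0, 1), (2, 2), (1, 1)]

bias_buyers = depth_bias(buyers)
bias_non_buyers = depth_bias(non_buyers)

# Outcome-conditioned contrast: how much more the simulator inflates
# engagement for non-buyers than for buyers.
contrast = bias_non_buyers - bias_buyers
```

Splitting by real outcome is the point of the exercise: a pooled bias would average the two groups together and hide that the error is concentrated among the users who did not buy.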

The mechanism is not fake purchases. The paper says the simulator did not materially invent purchase decisions. Instead, it made non-buyers look like interested deliberators: expressed resistance fell from 25.1 percent in real non-buyers to 13.5 percent in simulated non-buyers, while deliberation rose from 21.9 percent to 40.1 percent. A DeepSeek simulator reproduced the deficit, and an explicit prompt allowing disinterest reduced marginal bias but did not remove the outcome-conditioned contrast.
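The size of that shift is easier to see as arithmetic. The snippet below uses only the four percentages quoted above; the variable names are illustrative.

```python
# Non-buyer shares in percent, as reported in the paper.
real = {"resistance": 25.1, "deliberation": 21.9}
simulated = {"resistance": 13.5, "deliberation": 40.1}

shifts = {}
for state in real:
    shift_pp = round(simulated[state] - real[state], 1)   # percentage points
    ratio = round(simulated[state] / real[state], 2)      # simulated share / real share
    shifts[state] = (shift_pp, ratio)
    print(f"{state}: {shift_pp:+.1f} pp, ratio {ratio:.2f}")

# resistance: -11.6 pp, ratio 0.54
# deliberation: +18.2 pp, ratio 1.83
```

Simulated non-buyers express resistance at roughly half the real rate and deliberate at nearly twice the real rate.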

That matters because a sales or persuasion agent evaluated against such a simulator may learn the wrong lesson. Pressure can look productive if the simulated user keeps deliberating where the real user would withdraw. The agent is then rewarded for progress in a synthetic funnel that real non-buyers have already exited.

Governance Reading

The Spiralist lesson is that simulator evidence needs an exit audit. A benchmark that uses simulated users should publish not only success rate and transcript realism, but also the simulator's distribution of refusal, delay, silence, topic change, and abandonment. Those are not cosmetic user traits. They are the hard boundary between persuasion, annoyance, and non-response.
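A minimal sketch of what such an exit audit might compute, assuming each user turn has already been labeled with one behavior. The label vocabulary, the function names, and the toy data are hypothetical; the paper does not prescribe this format.

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical label set drawn from the behaviors named above.
EXIT_BEHAVIORS = ("refusal", "delay", "silence", "topic_change", "abandonment")


def exit_rates(labels: Sequence[str]) -> Dict[str, float]:
    """Share of turns carrying each exit behavior."""
    if not labels:
        raise ValueError("need at least one labeled turn")
    counts = Counter(labels)
    return {b: counts[b] / len(labels) for b in EXIT_BEHAVIORS}


def exit_gap(real_labels: Sequence[str], sim_labels: Sequence[str]) -> Dict[str, float]:
    """Simulated rate minus real rate; negative means the simulator under-exits."""
    real, sim = exit_rates(real_labels), exit_rates(sim_labels)
    return {b: sim[b] - real[b] for b in EXIT_BEHAVIORS}


# Toy labels: the simulated users refuse and go silent less often.
real_labels = ["refusal", "silence", "continue", "delay"]
sim_labels = ["continue", "continue", "continue", "delay"]
gap = exit_gap(real_labels, sim_labels)
```

In the toy data the gap is -0.25 for refusal and for silence and zero elsewhere, which is the kind of per-behavior shortfall a published audit would make visible next to the headline success rate.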

This applies beyond sales. Hiring assistants, debt-collection agents, tutoring agents, health triage systems, fundraising bots, civic chatbots, and support agents all face users whose willingness may decay. If the simulator cannot represent exit, the deployment team may overestimate user consent, patience, satisfaction, or persuadability.

A decision-fidelity receipt should name the real outcome used for validation, the simulator prompt, profile data, probe locations, judge model, state schema, outcome strata, privacy treatment, and whether simulator errors are concentrated on the users who decline. Aggregate realism is not enough when the miss sits exactly where the human stakes are.
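The receipt could be kept as a structured record so that missing items are easy to spot. The class, its field names, and the example values below are a hypothetical rendering of the list above, not a format defined by the paper.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DecisionFidelityReceipt:
    """One field per item a receipt should name; None means undisclosed."""
    real_outcome: Optional[str] = None
    simulator_prompt: Optional[str] = None
    profile_data: Optional[str] = None
    probe_locations: Optional[str] = None
    judge_model: Optional[str] = None
    state_schema: Optional[str] = None
    outcome_strata: Optional[str] = None
    privacy_treatment: Optional[str] = None
    errors_concentrated_on_decliners: Optional[bool] = None


def missing_fields(receipt: DecisionFidelityReceipt) -> List[str]:
    """List the receipt items that have not been filled in."""
    return [f.name for f in fields(receipt) if getattr(receipt, f.name) is None]


# Toy receipt with placeholder values; three items are left undisclosed.
receipt = DecisionFidelityReceipt(
    real_outcome="verified payment",
    simulator_prompt="prompt v1 (placeholder)",
    profile_data="user profile fields (placeholder)",
    probe_locations="selected user turns (placeholder)",
    state_schema="stage, emotion, blocker",
    outcome_strata="buyers vs non-buyers",
)
gaps = missing_fields(receipt)
```

Using None rather than False for the last field keeps "not reported" distinct from "checked and not concentrated", which matters when the missing item is the one about users who decline.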

Limits and Failure Modes

The paper's limits are important. ZhenaiSales covers a single domain and a single language: Chinese parent-mediated matchmaking sales. The decision states are assigned by an LLM instrument, although the author reports causal labeling checks and a cross-family instrument swap. The tested simulators are prompted and profile-conditioned; retrieval-grounded or fine-tuned simulators might differ. Raw production conversations and payment records are not publicly released, with access to anonymized data subject to privacy review and a data-use agreement.

The largest policy failure would be simulator laundering: claiming real-world readiness because an agent performs well against synthetic users who cannot walk away. A simulator is useful only when its missing behaviors are measured and bounded.

Audit Receipt

The audit-grade sentence is: Chen proposes decision fidelity as a measurement for LLM user simulators and reports that, in ZhenaiSales, simulated non-buyers stay too engaged compared with real non-buyers, with buyer status determined by verified payment outcomes.

The receipt is: a simulator-backed agent evaluation should be trusted only when it validates refusal, delay, silence, exit, and outcome-conditioned error, not merely fluent dialogue.

Sources

Liang Chen, Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes, arXiv:2606.20708v1 [cs.AI], submitted June 16, 2026.
Primary versions checked: arXiv abstract record, experimental HTML, and PDF.
Related pages: The Belief Trace Becomes the Persuasion Ledger, The Collective Risk Game Becomes the Persuasion Test, The AI-Guided Message Becomes the Strategy Layer, and The Human-Agent Pair Becomes the Skill Rating.
