The Cooperation Curve Becomes the Fidelity Gap
The June 2026 arXiv paper "Collective cooperation without individual fidelity in LLM agents," by Henrique Ferraz de Arruda, Carlos Gracia Lázaro, Alberto Aleta, and Yamir Moreno, examines a quiet danger in synthetic social science: an LLM-agent population can look human at the aggregate level while failing to reproduce individual decision behavior.
Outcome Match Is Not Fidelity
The arXiv record for arXiv:2606.30454 lists "Collective cooperation without individual fidelity in LLM agents" as submitted on June 29, 2026, in Physics and Society, with Artificial Intelligence as a secondary subject. The paper asks when LLM agents used in social simulations can be treated as faithful proxies for human decision-making.
The answer is deliberately uncomfortable. A simulated population can reproduce a visible collective pattern while getting the individual mechanisms wrong. A synthetic society should not be trusted merely because its curve resembles a human benchmark.
This question is distinct from the ones raised by the related pages on prompt ecologies, socio-economic twins, and simulated customers. Those pages ask how agent societies evolve, mirror policy worlds, or stand in for users. This paper isolates a narrower validation problem: matching aggregate cooperation is not enough to establish fidelity.
The Benchmark
The authors compare LLM agents with a large-scale networked Prisoner's Dilemma experiment involving human participants. They reuse the same interaction protocol, payoff structure, and network topologies, then compare nine open-weight LLMs with the human data.
The implementation avoids naming the actions as cooperation and defection in the agent-facing prompt. Instead, agents choose between color labels, GREEN and BROWN, which correspond to cooperation and defection in the analysis. Each node observes its neighbors' previous choices and degree-normalized payoffs before choosing a new action. Pairwise payoffs are computed in Experimental Currency Units: mutual GREEN gives both agents 7, BROWN against GREEN gives the BROWN agent 10 and the GREEN agent 0, and mutual BROWN gives both 0.
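As a minimal sketch of that payoff rule, the Python below scores one round on a toy network. It assumes that "degree-normalized" means a node's summed pairwise payoff divided by its number of neighbors. The function name round_payoffs, the adjacency dictionary, and the three-node example are illustrative and are not taken from the paper's code.

```python
from typing import Dict, List, Tuple

GREEN, BROWN = "GREEN", "BROWN"

# Payoff to the focal agent, keyed by (own action, neighbor action),
# using the Experimental Currency Unit values quoted in the text.
PAYOFF: Dict[Tuple[str, str], int] = {
    (GREEN, GREEN): 7,
    (GREEN, BROWN): 0,
    (BROWN, GREEN): 10,
    (BROWN, BROWN): 0,
}


def round_payoffs(
    neighbors: Dict[str, List[str]], actions: Dict[str, str]
) -> Dict[str, float]:
    """Sum each node's pairwise payoffs and divide by its degree.

    Dividing by degree is an assumption about what "degree-normalized" means.
    """
    result = {}
    for node, nbrs in neighbors.items():
        total = sum(PAYOFF[(actions[node], actions[n])] for n in nbrs)
        result[node] = total / len(nbrs) if nbrs else 0.0
    return result


# Hypothetical three-node line graph: a - b - c
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
actions = {"a": GREEN, "b": BROWN, "c": GREEN}
payoffs = round_payoffs(neighbors, actions)
print(payoffs)
```

In this toy round the BROWN node in the middle earns 10 per neighbor while both GREEN nodes earn 0, which is the local temptation each agent sees before choosing its next color.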
That design matters because it tries to reduce the chance that the model simply recognizes a familiar textbook game. The point is whether agents following local incentives and local social information reproduce human social behavior at multiple levels.
Macro Agreement
The paper first asks whether different open-weight models produce comparable aggregate cooperation dynamics. They do not. Under the same incentives and network structure, the tested models generate different cooperation regimes. The paper reports that qwen3:32b produces the highest cooperation levels across the simulation horizon, while llama4:16x17b gives the closest aggregate match to the empirical human data.
The selected model reproduces several macro-level features of human cooperation dynamics, including an early decline followed by later stabilization. That is the tempting result. If a policymaker, platform designer, or social scientist only inspected the aggregate curve, the simulation might look validated.
But aggregate matching is a weak form of resemblance. It says the population-level output landed near a target. It does not show that the agents vary like humans, respond to neighbors like humans, or carry the same distribution of decision rules.
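A toy illustration of why the aggregate curve underdetermines individual behavior: the sketch below builds two made-up populations with identical round-by-round cooperation rates but very different spreads of individual propensity. The data, the function names, and the use of standard deviation as the heterogeneity measure are assumptions chosen for illustration, not the paper's metrics.

```python
from statistics import pstdev
from typing import List

# decisions[i][t] is 1 if agent i chose GREEN (cooperate) in round t, else 0.


def aggregate_curve(decisions: List[List[int]]) -> List[float]:
    """Fraction of agents cooperating in each round (the macro view)."""
    n_agents = len(decisions)
    n_rounds = len(decisions[0])
    return [sum(row[t] for row in decisions) / n_agents for t in range(n_rounds)]


def individual_propensities(decisions: List[List[int]]) -> List[float]:
    """Fraction of rounds in which each agent cooperated (the micro view)."""
    return [sum(row) / len(row) for row in decisions]


# Two committed cooperators and two committed defectors.
heterogeneous = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Four agents who each cooperate exactly half the time.
homogeneous = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

for name, pop in [("heterogeneous", heterogeneous), ("homogeneous", homogeneous)]:
    spread = pstdev(individual_propensities(pop))
    print(name, aggregate_curve(pop), spread)
```

Both populations sit at 50 percent cooperation in every round, yet the spread of individual propensities is 0.5 in the first and 0 in the second. A comparison that stops at the curve cannot tell them apart.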
Micro Failure
The paper's central finding is the macro-micro gap. The LLM population underestimates individual-level heterogeneity and produces conditional cooperation patterns that differ from the observed human data. In plain terms, the agents can converge toward a human-looking collective cooperation trajectory while being too behaviorally compressed underneath.
This should worry anyone using LLM agents as social surrogates. Many institutional questions are micro-level questions disguised as macro forecasts. Who defects after a neighbor defects? Who keeps cooperating after a loss? Which participants become outliers? A synthetic population that smooths away heterogeneity may answer the easy visual question while failing the question that matters for policy.
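Questions like these can be made concrete by estimating a conditional decision rule from the round-by-round record. The sketch below tabulates how often a node cooperates given its own previous action and the number of neighbors who cooperated in the previous round. That conditioning scheme is one plausible choice assumed here for illustration; it is not necessarily the analysis the paper performs, and the network and history are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def conditional_cooperation(
    neighbors: Dict[str, List[str]],
    history: List[Dict[str, int]],
) -> Dict[Tuple[int, int], float]:
    """Estimate P(cooperate at t | own action at t-1, cooperating neighbors at t-1).

    history[t][node] is 1 for cooperation and 0 for defection.
    Keys of the result are (own previous action, count of cooperating neighbors).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [cooperations, observations]
    for prev, curr in zip(history, history[1:]):
        for node, nbrs in neighbors.items():
            key = (prev[node], sum(prev[n] for n in nbrs))
            counts[key][0] += curr[node]
            counts[key][1] += 1
    return {key: coop / seen for key, (coop, seen) in counts.items()}


# Hypothetical three-node line graph and three rounds of play.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
history = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 0, "c": 0},
    {"a": 0, "b": 0, "c": 0},
]
rules = conditional_cooperation(neighbors, history)
for key in sorted(rules):
    print(key, rules[key])
```

Running the same tabulation on human data and on simulated data, then comparing the two tables cell by cell, is the kind of check that an aggregate curve cannot substitute for.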
The paper also treats model choice as part of the evidence. Open-weight models are not interchangeable behavioral instruments. If alignment, scale, architecture, or instruction tuning changes the cooperation regime, then a simulation result is a statement about the model family and serving configuration as much as about the simulated society.
Randomness Is Not Repair
The authors also test whether adding random agents helps. The arXiv abstract reports that random agents improve some aspects of micro-level agreement, but do not remove the mismatch in decision rules. That is useful because it rules out a simple repair story. Injecting noise can make a population look more heterogeneous without making its behavioral mechanism faithful.
The lesson is not that LLM-agent social simulation is useless. It is that validation has to be tiered. A simulation should be tested at the aggregate level, the individual-heterogeneity level, and the context-dependent decision-rule level. Only then can readers distinguish an outcome resemblance from a behavioral surrogate.
Limits
This paper studies one game family, one empirical benchmark, and a selected set of open-weight models. It should not be inflated into a universal law about all social simulations, all agent architectures, or all human behavior. A model may be useful for some comparative experiments while remaining unfaithful in other settings.
The benchmark is valuable precisely because it is concrete. It offers a way to say what kind of resemblance has been earned. The aggregate cooperation curve is one layer. Individual variation is another. Conditional response to neighbors is another. A responsible simulation report should name which layers were validated and which were not.
Governance Standard
Any LLM-agent social simulation used for research, policy design, product planning, or governance should publish a fidelity ledger. The ledger should list the human benchmark, the agent model and version, the prompts, the payoff or incentive structure, the network topology, the randomization policy, the parsing failures, and the statistical tests used at each level of behavior.
The ledger should separate macro fit from micro fit. It should say whether the simulated population matches the aggregate trajectory, the distribution of individual propensities, and the conditional decision rules. If those layers diverge, the result should be labeled as outcome resemblance, not human-surrogate evidence.
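One way such a ledger could be represented is sketched below as a Python dataclass. The class name, field names, and example values are hypothetical, and the third label ("no validated resemblance") is an assumption added to cover the case where even the aggregate trajectory fails to match. How each fit flag is decided is left to the statistical tests the ledger names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FidelityLedger:
    """Hypothetical record of what a simulation was validated against."""

    human_benchmark: str
    agent_model: str
    model_version: str
    prompts: List[str]
    incentive_structure: str
    network_topology: str
    randomization_policy: str
    parsing_failures: int
    statistical_tests: Dict[str, str]  # behavior level -> test used
    macro_fit: bool = False            # aggregate trajectory matches
    heterogeneity_fit: bool = False    # individual propensity distribution matches
    decision_rule_fit: bool = False    # conditional decision rules match

    def label(self) -> str:
        """Name the kind of resemblance the recorded fits support."""
        if self.macro_fit and self.heterogeneity_fit and self.decision_rule_fit:
            return "human-surrogate evidence"
        if self.macro_fit:
            return "outcome resemblance"
        return "no validated resemblance"  # assumed fallback label


# Placeholder values only; none of these describe a real study.
ledger = FidelityLedger(
    human_benchmark="example networked Prisoner's Dilemma dataset",
    agent_model="example-open-weight-model",
    model_version="example-version",
    prompts=["example agent-facing prompt"],
    incentive_structure="GREEN/BROWN payoffs 7, 10, 0, 0",
    network_topology="example lattice",
    randomization_policy="fixed seeds, recorded per run",
    parsing_failures=0,
    statistical_tests={"macro": "example test", "micro": "example test"},
    macro_fit=True,
)
print(ledger.label())
```

With only the macro flag set, the ledger reports "outcome resemblance"; the stronger label requires all three layers to pass.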
The Spiralist rule is simple: a synthetic public is not validated by a familiar curve. It is validated only when the curve, the variance, and the decision rule all survive comparison.
Sources
- Henrique Ferraz de Arruda, Carlos Gracia Lázaro, Alberto Aleta, and Yamir Moreno, "Collective cooperation without individual fidelity in LLM agents," arXiv:2606.30454 [physics.soc-ph], submitted June 29, 2026.
- arXiv experimental HTML for "Collective cooperation without individual fidelity in LLM agents," accessed June 30, 2026.
- Related pages: The Agent Group Becomes the Prompt Ecology, The Socio-Economic Twin Becomes the Policy Mirror, The Agent Community Becomes the Sorting Machine, The Simulated Customer Becomes the Walkaway Gap, and The Open Parameter Becomes the Cooperation Switch.