The Personality Prompt Becomes the Team Policy
A June 2026 arXiv paper shows that personality prompting is not just cosmetic roleplay in multi-agent systems. A low-agreeableness prompt can reshape team communication, but the damage depends on whether the task produces a constrained artifact or an open-ended social outcome.
Style Is a Control Surface
Multi-agent systems are usually described in terms of tools, roles, routing, memory, and cost. The softer layer is communication style: whether one agent challenges, acknowledges, proposes, concedes, or refuses to integrate another agent's work. That layer is easy to dismiss as tone until it changes the work product.
The practical question is not whether an agent has a personality. It does not. The question is whether a personality prompt becomes a team policy. If a prompt makes every agent colder, harsher, or more skeptical, it can change the coordination protocol just as surely as a tool permission or voting rule.
The Paper Frame
The source is Aryan Keluskar, Amrita Bhattacharjee, and Huan Liu's "When Does Personality Composition Matter for Multi-Agent LLM Teams?" (arXiv:2606.27443 [cs.AI], submitted June 25, 2026). The arXiv record lists a 20-page preprint with six figures, and the PDF identifies the authors' institution as Arizona State University's School of Computing and AI.
The paper asks whether personality prompting changes objective task outcomes or merely changes communication. It focuses on agreeableness because earlier work showed that high-agreeableness prompts make models more cooperative while low-agreeableness prompts can make them adversarial.
Prompting a Team Trait
The authors use Goldberg's Big Five bipolar adjective markers. The main comparison uses level 2 as low agreeableness and level 8 as high agreeableness. The low-agreeableness example prompt includes terms such as unkind, uncooperative, selfish, distrustful, cold, harsh, and unsympathetic; the high-agreeableness condition uses the opposite pole.
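To make the "team trait" idea concrete, here is a minimal sketch of how an adjective list could be turned into a persona prefix and applied to every agent. The adjectives are the ones quoted above; the sentence template, the function names, and the agent names are assumptions for illustration, and the sketch does not model the paper's intensity levels or reproduce its actual prompt text.

```python
# Low-agreeableness adjectives as quoted in the article.
LOW_AGREEABLENESS = [
    "unkind", "uncooperative", "selfish", "distrustful",
    "cold", "harsh", "unsympathetic",
]


def build_persona_prompt(adjectives, task_prompt):
    """Prefix a task prompt with a trait sentence (hypothetical template)."""
    if not adjectives:
        raise ValueError("need at least one adjective")
    if len(adjectives) == 1:
        listing = adjectives[0]
    else:
        listing = ", ".join(adjectives[:-1]) + ", and " + adjectives[-1]
    return f"You are {listing}.\n\n{task_prompt}"


def apply_team_wide(agent_names, adjectives, task_prompt):
    """Give every agent the same persona prefix.

    Applying one prefix to all agents is what turns a style choice
    into a team-level setting.
    """
    prompt = build_persona_prompt(adjectives, task_prompt)
    return {name: prompt for name in agent_names}


if __name__ == "__main__":
    team = apply_team_wide(["planner", "coder"], LOW_AGREEABLENESS, "Review the patch.")
    print(team["planner"])
```

The point of the second function is that the same string reaches every agent, so any shift it causes is a property of the team rather than of one role.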
They test Claude Sonnet 4.5, GPT-4o, Grok-3, and DeepSeek V3.1 in different task settings. Communication is classified into acts such as acknowledgment, question, disagreement, and suggestion. A communication state score, phi, tracks how far the team moves away from acknowledgment-heavy cooperation toward disagreement-heavy or suggestion-heavy behavior.
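The article does not reproduce the paper's formula for phi, so the following is only an illustrative stand-in for the general idea: score a transcript of classified acts by how much disagreement and suggestion outweigh acknowledgment. The function name and the specific difference-of-shares formula are assumptions, not the authors' definition.

```python
from collections import Counter


def communication_state(acts):
    """Illustrative stand-in for phi (NOT the paper's formula).

    Returns the share of disagreement and suggestion acts minus the share
    of acknowledgment acts, so the value lies in [-1, 1]. Higher values
    mean the team has moved away from acknowledgment-heavy talk.
    """
    if not acts:
        raise ValueError("need at least one classified message")
    counts = Counter(acts)
    total = len(acts)
    pushback = (counts["disagreement"] + counts["suggestion"]) / total
    cooperative = counts["acknowledgment"] / total
    return pushback - cooperative


if __name__ == "__main__":
    calm = ["acknowledgment", "acknowledgment", "acknowledgment", "question"]
    tense = ["disagreement", "suggestion", "acknowledgment", "question"]
    print(communication_state(calm), communication_state(tense))
```

Whatever the exact definition, a score like this is useful only when compared across conditions on the same task, since the baseline mix of acts differs by domain.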
Three Task Domains
The study separates tasks by artifact structure and goal alignment. Coding is cooperative and high-structure: agents collaborate on shared code files, where syntax, parsing, and functional constraints narrow the path to success. Research is cooperative but low-structure: agents produce open-ended research ideas through discussion. Bargaining is competitive and low-structure: agents exchange offers and must choose whether to accept.
For coding, the paper uses five MultiAgentBench software-engineering tasks spanning collaborative game development. For research, it uses a five-agent MultiAgentBench research-idea task. For bargaining, it uses a buyer-seller negotiation setup and reports agreement behavior, not just message tone.
What Changed
The main finding is task-contingency. Low agreeableness creates large communication shifts across models, but coding milestones mostly hold. Claude changes from 12.1 baseline coding milestones to 12.4 under low agreeableness; GPT-4o moves from 10.9 to 9.5 without a significant result; DeepSeek moves from 10.7 to 8.8 without a significant result. Grok-3 is the exception, dropping from 14.4 to 10.9.
Research tasks are more exposed. GPT-4o research milestones drop from 10.5 to 3.5, about a two-thirds reduction. Grok-3 drops from 17.0 to 11.8, and DeepSeek from 9.7 to 5.8. Bargaining is even sharper: low agreeableness collapses agreement to 1 percent or less across GPT-4o, DeepSeek, and Claude, while high agreeableness roughly doubles baseline agreement rates in the paper's reported table.
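The milestone figures quoted above are easier to compare as relative changes. The sketch below recomputes them from the rounded means given in this article; the dictionary layout and function names are illustrative, and the results inherit the rounding of the inputs, so they may differ slightly from percentages the paper derives from unrounded data.

```python
# (baseline, low-agreeableness) milestone means as quoted in the article.
REPORTED = {
    ("coding", "Claude"): (12.1, 12.4),
    ("coding", "GPT-4o"): (10.9, 9.5),
    ("coding", "DeepSeek"): (10.7, 8.8),
    ("coding", "Grok-3"): (14.4, 10.9),
    ("research", "GPT-4o"): (10.5, 3.5),
    ("research", "Grok-3"): (17.0, 11.8),
    ("research", "DeepSeek"): (9.7, 5.8),
}


def percent_change(baseline, treated):
    """Relative change from baseline in percent (negative means a drop)."""
    return (treated - baseline) / baseline * 100.0


def summarize(reported):
    """Map each (task, model) pair to its percent change, rounded to 0.1."""
    return {key: round(percent_change(b, t), 1) for key, (b, t) in reported.items()}


if __name__ == "__main__":
    for (task, model), change in summarize(REPORTED).items():
        print(f"{task:8s} {model:9s} {change:+.1f}%")
```

Relative change says nothing about statistical significance, which is why the GPT-4o and DeepSeek coding drops still count as "mostly holding" in the paper's reading.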
Governance Reading
The governance lesson is artifact-mediated buffering. A code artifact can absorb some social dysfunction because it must parse, compile, or satisfy a testable specification. An open research memo or negotiation has no equivalent guardrail. The same adversarial style that bounces off a code file can poison an unconstrained planning discussion.
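One way to picture artifact-mediated buffering is a gate that accepts an agent's proposed file only if it still parses. This is a minimal sketch of that idea using Python's standard `ast` module; the function name `accept_edit` is hypothetical, and nothing here describes how the paper's benchmark actually integrates edits.

```python
import ast


def accept_edit(current_source, proposed_source):
    """Return the proposed source if it parses as Python, else keep the current one.

    The gate ignores how the edit was argued for. A hostile or a polite
    message is treated the same; only the artifact's validity counts.
    """
    try:
        ast.parse(proposed_source)
    except (SyntaxError, ValueError):
        return current_source
    return proposed_source


if __name__ == "__main__":
    good = "score = 0\n"
    broken = "def update(:\n    pass\n"
    print(accept_edit(good, broken) == good)  # the broken edit is rejected
```

A research memo or a negotiation transcript has no analogue of `ast.parse`, which is the asymmetry the section describes.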
This matters for agent design. "Add a critic" is too crude. The paper's heterogeneous pilot suggests that one low-agreeableness challenger can help when placed as a bounded lead critic, but can hurt when deployed as a team-wide trait or placed in a poorly positioned role. Personality prompting should therefore be documented like a policy setting: scope, role, domain, expected benefit, and failure mode.
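The five fields named above map naturally onto a small record that can be checked before deployment. The sketch below is one possible shape, not something the paper prescribes; the class name `PersonaPolicy`, the `missing_fields` helper, and the example values are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PersonaPolicy:
    """Documentation record for one personality prompt (hypothetical schema)."""
    scope: str             # e.g. "single agent" or "team-wide"
    role: str              # e.g. "lead critic"
    domain: str            # e.g. "coding", "research", "bargaining"
    expected_benefit: str  # what the prompt is supposed to improve
    failure_mode: str      # what is expected to go wrong if it misfires

    def missing_fields(self):
        """Names of fields left blank, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


if __name__ == "__main__":
    policy = PersonaPolicy(
        scope="single agent",
        role="lead critic",
        domain="coding",
        expected_benefit="catch weak proposals early",
        failure_mode="",
    )
    print(policy.missing_fields())
```

A record with a blank `failure_mode` is the documented equivalent of "add a critic": the benefit is named but the risk is not.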
Limits and Prompt Valence
The paper is careful about a major confound: low-agreeableness Goldberg adjectives are negatively loaded. The authors run a neutral-paraphrase condition using a prompt about being direct, candid, independent-minded, skeptical of consensus, and efficiency-oriented. Under neutral wording, the effect weakens and becomes more model-specific.
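A builder running the same kind of check can ask how much of an observed effect disappears once the hostile wording is replaced by neutral wording. The sketch below computes that share from three outcome scores. The metric and the function name `valence_share` are hypothetical conveniences, not taken from the paper, and the numbers in the usage line are made up for illustration.

```python
def valence_share(baseline, loaded, neutral):
    """Share of the loaded-prompt effect that vanishes under neutral wording.

    Hypothetical metric: 1.0 means the neutral prompt matches baseline
    (the effect was all wording), 0.0 means the neutral prompt reproduces
    the full loaded-prompt effect.
    """
    loaded_effect = loaded - baseline
    if loaded_effect == 0:
        raise ValueError("loaded prompt shows no effect to decompose")
    neutral_effect = neutral - baseline
    return 1.0 - neutral_effect / loaded_effect


if __name__ == "__main__":
    # Made-up scores: baseline 10, loaded adjectives 4, neutral paraphrase 8.
    print(valence_share(10.0, 4.0, 8.0))
```

A single ratio like this hides per-model variation, so it would need to be computed separately for each model and task.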
That is the warning for builders. A "personality" prompt may be measuring sensitivity to hostile language, not a stable trait. The paper also uses automated benchmarks rather than human participants, and the authors say benchmarking code, analysis code, and model outputs will be released upon acceptance. Until then, the result is best read as a design probe, not a universal law of agent teams.
Audit Receipt
The audit-grade sentence is: Keluskar, Bhattacharjee, and Liu test low- and high-agreeableness personality prompts in multi-agent LLM teams across coding, research, and bargaining, finding that low agreeableness strongly shifts communication while outcome damage depends on task structure and prompt valence.
The receipt is: agent personality prompts should be treated as team-policy controls, with role placement, task structure, output constraints, neutral wording, and outcome metrics recorded before deployment.
Sources
- Aryan Keluskar, Amrita Bhattacharjee, and Huan Liu, "When Does Personality Composition Matter for Multi-Agent LLM Teams?" arXiv:2606.27443 [cs.AI], submitted June 25, 2026.
- Primary versions checked: experimental HTML and PDF.
- Related pages: The Agent Team Becomes the Trust Graph, The LLM Social Network Becomes the Polarization Lab, The Dialogue Dynamics Become the Collaboration Meter, The Persona Gate Becomes the Refusal Boundary, and The Agent Group Selection Becomes the Prompt Ecology.