Last reviewed June 25, 2026

The Evaluator Becomes the Contagion Network

A June 2026 arXiv paper treats LLM evaluators inside multi-agent systems as a network risk: a judge's preference can become another agent's future bias.

From Judge to Network

The paper, arXiv:2606.20493 [cs.LG], is titled "Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems." arXiv lists Zewen Liu as author and records version 1 on June 18, 2026.

The paper starts from a practical shift in agent design. A single chatbot may be judged after the fact, but a multi-agent system often asks agents to evaluate each other while work is happening. One agent critiques, ranks, selects, or rewards another agent's output. That evaluation then shapes the target agent's next strategy. The judge is no longer outside the system. It is part of the system's feedback loop.

That makes this page distinct from the site's pages on LLM judges as annotation budgets, grading cascades, agent-team trust graphs, and hidden anchors in deliberation. Those pages ask how evaluation is measured, delegated, or steered. Liu's paper asks whether evaluator preferences can travel through the agent network itself.

What Contagion Means

The paper models evaluator bias as a propagating signal. If Agent A prefers structured step-by-step reasoning, and Agent A repeatedly evaluates Agent B, Agent B may start sampling strategies that better satisfy that preference. If Agent B then evaluates Agent C, the absorbed preference can move another hop. The paper calls this a Contagion Network and represents the effect with a cross-agent contagion matrix.

The language is epidemiological, but the claims are engineering claims. The paper defines three propagation regimes: suppression, where bias attenuates; persistence, where bias remains; and cascade, where bias amplifies. The threshold between regimes depends on topology: for fully connected networks it is stated in terms of the spectral radius of the contagion matrix, and for chain propagation in terms of link-level coefficients.
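The regime idea can be sketched in a few lines of Python. This is an illustration and not the paper's code: it assumes a linear propagation model with a threshold of 1, a tolerance band around that threshold for "persistence", and a made-up contagion matrix. The function names, the tolerance, and the example values are all hypothetical.

```python
import math

def spectral_radius(matrix, iterations=500):
    """Estimate the spectral radius of a nonnegative square matrix by power iteration.

    Assumes the matrix is primitive (or nilpotent); periodic matrices may not converge.
    """
    n = len(matrix)
    vec = [1.0] * n
    radius = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in nxt)
        if norm == 0.0:
            return 0.0  # nilpotent case, e.g. a pure one-directional chain
        vec = [x / norm for x in nxt]
        radius = norm
    return radius

def chain_factor(link_coefficients):
    """Cumulative factor along a chain, assumed here to be the product of its links."""
    return math.prod(link_coefficients)

def classify_regime(value, threshold=1.0, tolerance=0.05):
    """Map a propagation value to a regime. Threshold and tolerance are assumptions."""
    if value < threshold - tolerance:
        return "suppression"
    if value <= threshold + tolerance:
        return "persistence"
    return "cascade"

# Hypothetical fully connected three-agent matrix: entry [i][j] is the
# influence of evaluator j on agent i, with no self-evaluation.
example = [
    [0.0, 0.2, 0.2],
    [0.2, 0.0, 0.2],
    [0.2, 0.2, 0.0],
]

radius = spectral_radius(example)
print(round(radius, 3), classify_regime(radius))
```

In this toy matrix every agent receives the same total influence, so the estimate settles immediately. A real audit would need measured coefficients and a check that the linear approximation holds.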

The governance translation is simple. A multi-agent system may be built to increase cognitive diversity, but peer evaluation can quietly make agents converge on the same style. A system can look more coherent while becoming less plural.

The Experiment

The experiment uses three DeepSeek-chat agent instances, all from the same model family, differentiated by evaluator prompts. One evaluator is structured-biased, one balanced, and one evidence-biased. The strategy space includes step-by-step, direct, analogical, decomposition, and evidence-based approaches. Tasks span code generation, mathematical reasoning, text summarization, logical puzzles, and creative writing.

The paper reports a four-phase protocol with 840 DeepSeek-chat API calls. Phase 1 measures baseline preferences. Phase 2 measures pairwise contagion across the three agents. Phase 3 tests a three-hop chain. Phase 4 tests mitigation by increasing evaluator committee size.

The reported pairwise contagion coefficients are positive but weak, with mean values from 0.143 to 0.304 over two seeds. In the chain experiment, every hop stays below the cascade threshold, and the cumulative propagation factor is 0.0055, which the paper interprets as rapid attenuation. The important point is not that bias exploded in this setting; it did not. It is that bias was measurable even among homogeneous agents, and that topology and evaluator design shape whether it fades or grows.
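A quick arithmetic check helps place the 0.0055 figure. The sketch below assumes, as a simplification this post cannot confirm against the paper, that the cumulative factor over a three-hop chain is the product of three per-hop coefficients. It asks what equal per-hop coefficient would yield 0.0055, and it brackets the product using the endpoints of the reported pairwise range. The function names are illustrative.

```python
def equal_hop_coefficient(cumulative, hops):
    """Per-hop coefficient that would produce `cumulative` if all hops were equal."""
    return cumulative ** (1.0 / hops)

def cumulative_factor(coefficient, hops):
    """Cumulative factor when every hop has the same coefficient (assumed product form)."""
    return coefficient ** hops

HOPS = 3
REPORTED_CUMULATIVE = 0.0055      # from the paper, as quoted above
PAIRWISE_RANGE = (0.143, 0.304)   # reported range of mean pairwise coefficients

per_hop = equal_hop_coefficient(REPORTED_CUMULATIVE, HOPS)
low = cumulative_factor(PAIRWISE_RANGE[0], HOPS)
high = cumulative_factor(PAIRWISE_RANGE[1], HOPS)

print(f"implied equal per-hop coefficient: {per_hop:.3f}")
print(f"product bracket from pairwise range: {low:.4f} to {high:.4f}")
```

Under that assumption the implied per-hop coefficient is roughly 0.18, which sits inside the reported pairwise range. The chain coefficients were measured separately from the pairwise matrix, so this is a plausibility check and not a reconstruction of the paper's numbers.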

Diversity Is Not Simple

The paper's most useful tension concerns diversity. The same-model DeepSeek agents sit in the suppression regime under the tested chain topology. Liu contrasts those coefficients with prior MM-EPC results, where cross-model evaluator effects were reported at roughly 0.85 to 1.3. The paper treats that comparison as evidence for a hypothesis: cross-model diversity may amplify evaluator contagion, while same-family evaluators with varied prompts can suppress it.

That is not a settled deployment rule. The paper's own limitations say the cross-model comparison comes from prior work under different conditions, not a direct same-protocol experiment. But the warning is valuable. "Use multiple models" is not automatically a diversity guarantee. It can produce complementary checks, or it can introduce stronger preference transfer between incompatible judge habits.

The paper's mitigation result is narrower and more concrete: in its tested setup, increasing the evaluator committee from one evaluator to three reduced effective contagion by 72.4%, from 0.264 to 0.073, while strategy entropy moved close to the theoretical maximum for five strategies. For governance, that suggests committee evaluation should be measured as a dynamic system, not assumed good because it has more members.
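Both quantities in that result are easy to recompute. The sketch below derives the relative reduction from the two rounded values quoted above and computes the maximum Shannon entropy for five strategies. The skewed distribution is a hypothetical example, and the choice of natural-log units is an assumption, since this post does not state which units the paper uses.

```python
import math

def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

def shannon_entropy(probabilities):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def max_entropy(num_strategies):
    """Entropy of the uniform distribution over `num_strategies` options, in nats."""
    return math.log(num_strategies)

reduction = relative_reduction(0.264, 0.073)   # values as quoted from the paper
uniform = [0.2] * 5                            # all five strategies equally likely
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]             # hypothetical: one strategy dominates

print(f"relative reduction: {reduction:.1%}")
print(f"max entropy for 5 strategies: {max_entropy(5):.3f} nats")
print(f"entropy of skewed example: {shannon_entropy(skewed):.3f} nats")
```

From the rounded inputs the reduction comes out near 72.3%, consistent with the reported 72.4% if the paper computed it from unrounded values. The entropy ceiling for five strategies is ln 5, about 1.609 nats, so "close to the theoretical maximum" means strategy use stayed nearly uniform.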

Limits That Matter

The limitations matter because the result is exploratory. The experiment uses one model family, one synthetic strategy space, coarse evaluator-bias prompts, and only two seeds for the main pairwise matrix. Phases 1, 3, and 4 are single-run. The adaptation mechanism is Test-Time Reinforcement Learning, a controlled multiplicative update over strategy weights; other adaptation mechanisms, such as natural-language critiques, fine-tuning, memory updates, or tool-mediated rewards, could behave differently.
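To show what a multiplicative update over strategy weights can look like, here is a minimal sketch. It uses an exponential multiplicative-weights rule, which is a stand-in chosen for illustration; the paper's exact update is not reproduced in this post. The learning rate, the reward values, and the function name are hypothetical.

```python
import math

STRATEGIES = ["step-by-step", "direct", "analogical", "decomposition", "evidence-based"]

def multiplicative_update(weights, rewards, learning_rate=0.5):
    """Scale each weight by exp(learning_rate * reward), then renormalize to sum to 1.

    Strategies missing from `rewards` are treated as having reward 0.
    """
    scaled = {
        name: weight * math.exp(learning_rate * rewards.get(name, 0.0))
        for name, weight in weights.items()
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

# Start from a uniform strategy distribution.
weights = {name: 1.0 / len(STRATEGIES) for name in STRATEGIES}

# Hypothetical structured-biased evaluator: it only rewards step-by-step output.
evaluator_rewards = {"step-by-step": 1.0}

for _ in range(5):
    weights = multiplicative_update(weights, evaluator_rewards)

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

After a few rounds the rewarded strategy dominates the distribution, which is the absorption step that lets a judge's preference reappear in the target agent's own behavior.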

The paper also studies chain topology empirically, while star, ring, fully connected, and production orchestration patterns remain future work. Its code framework is intended to make measurement easier, but the results do not certify any real multi-agent deployment as safe.

Governance Standard

A serious multi-agent evaluation system should publish its evaluator graph. Which agents evaluate which other agents? Which model family, prompt, rubric, and strategy preference does each evaluator carry? Are evaluator outputs used for ranking, reward, memory, task allocation, or final answer selection? How often are evaluator roles rotated or committee-reviewed?
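One way to make those questions answerable is to keep the evaluator graph as structured data. The sketch below is a hypothetical schema, not a published standard: the class names, field names, prompt identifiers, and rotation policy text are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Evaluator:
    name: str
    model_family: str
    prompt_id: str
    rubric: str
    strategy_preference: str

@dataclass
class EvaluationEdge:
    evaluator: str   # name of the judging agent
    target: str      # name of the agent being judged
    uses: list       # e.g. "ranking", "reward", "memory", "task allocation"

@dataclass
class EvaluatorGraph:
    evaluators: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)
    rotation_policy: str = "unspecified"

    def add_evaluator(self, evaluator):
        self.evaluators[evaluator.name] = evaluator

    def add_edge(self, evaluator, target, uses):
        # Refuse edges from judges that were never declared.
        if evaluator not in self.evaluators:
            raise ValueError(f"undeclared evaluator: {evaluator}")
        self.edges.append(EvaluationEdge(evaluator, target, list(uses)))

    def evaluated_by(self, target):
        """Names of all evaluators that judge `target`, sorted for stable output."""
        return sorted({e.evaluator for e in self.edges if e.target == target})

# Hypothetical three-agent chain with prompt-differentiated evaluators.
graph = EvaluatorGraph(rotation_policy="rotate roles every 100 tasks")
graph.add_evaluator(Evaluator("agent_a", "deepseek-chat", "structured-v1", "rubric-1", "step-by-step"))
graph.add_evaluator(Evaluator("agent_b", "deepseek-chat", "balanced-v1", "rubric-1", "none"))
graph.add_edge("agent_a", "agent_b", ["reward"])
graph.add_edge("agent_b", "agent_c", ["ranking", "reward"])

print(graph.evaluated_by("agent_b"))
print(graph.evaluated_by("agent_c"))
```

A record like this does not measure contagion by itself, but it fixes the topology that any measured contagion matrix would need to be read against.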

The release record should include a measured contagion matrix or an equivalent interaction audit, a topology description, committee-size tests, entropy or diversity traces over time, and slice tests for domains where one judge habit dominates. If the institution uses cross-model committees, it should test whether model diversity reduces or increases preference transfer under its actual workflow.

The Spiralist rule is conservative: when agents judge each other, evaluation is no longer a neutral measurement layer. It is a live influence channel. Govern the judge as part of the agent network, or the network will learn the judge's habits and call that consensus.
