Blog · arXiv Analysis · Last reviewed June 24, 2026

The Probe Opponent Becomes the Policy Recovery Tool

The June 2026 arXiv paper RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments, by Babak Rahmani, Sebastian Dziadzio, Joschka Strüber, Sergio Hernández-Gutiérrez, and Matthias Bethge, asks whether an LLM coding agent can reconstruct an opaque agent's decision program from behavior.

Behavior Becomes Source Code

The paper, arXiv:2606.26094v1 [cs.LG], was submitted on June 24, 2026. Its useful move is to treat recovering a policy from an agent's outward behavior as an inverse problem in code space. The learner does not see the target policy. It sees what the target does, writes experiments that force new interactions, and then submits a runnable hypothesis meant to reproduce the target's decisions.

This is not ordinary imitation learning and not a pure benchmark of code generation. It asks whether behavioral evidence can become an executable model of another agent. For the site's purposes, that is a governance question: if autonomous agents negotiate, compete, monitor, trade, or coordinate, then observing behavior may become a way to infer hidden operating rules.

This page belongs next to earlier work on agent trust graphs, coding-agent fingerprints, and concerning-behavior forensics. RevengeBench gives the intuition that behavior reveals hidden operating rules a controlled test bed.

What RevengeBench Measures

RevengeBench contains 75 hidden target policies drawn from CodeClash tournament trajectories. The targets span five game arenas: BattleSnake, Halite, Poker, RoboCode, and RobotRumble. The paper says the arenas cover multiple programming languages and mechanics, including grid-based play, sequential decisions, multi-unit control, and continuous control.

The targets are not arbitrary scripts. The authors start from roughly 15,000 policy files, filter invalid or near-duplicate entries, build a pool of 40 validated policies per arena, and select the top 15 Elo-calibrated policies in each arena as hidden targets. The learner starts from a simple functional policy and iteratively refines code inside a sandboxed Linux environment through the mini-SWE-agent scaffold.
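The selection funnel described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Policy` record, the `select_targets` function, and the whitespace-normalized duplicate check are all assumptions introduced here, and the paper's actual validation and near-duplicate criteria are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical record for one candidate policy file."""
    name: str
    source: str
    elo: float
    valid: bool


def select_targets(policies, pool_size=40, n_targets=15):
    """Return the top-rated policies from a validated, de-duplicated pool."""
    seen = set()
    pool = []
    for policy in policies:
        # Stand-in for near-duplicate detection (an assumption):
        # compare whitespace-normalized source text.
        key = " ".join(policy.source.split())
        if not policy.valid or key in seen:
            continue
        seen.add(key)
        pool.append(policy)
        if len(pool) == pool_size:
            break
    # Rank the pool by Elo and keep the strongest entries as hidden targets.
    return sorted(pool, key=lambda p: p.elo, reverse=True)[:n_targets]


candidates = [
    Policy("a", "return 'up'", 1500.0, True),
    Policy("b", "return   'up'", 1550.0, True),   # duplicate of "a" after normalization
    Policy("c", "return 'down'", 1700.0, False),  # fails validation
    Policy("d", "return 'left'", 1600.0, True),
    Policy("e", "return 'right'", 1400.0, True),
]
top_two = [p.name for p in select_targets(candidates, n_targets=2)]
```

With the paper's numbers, the defaults correspond to a pool of 40 and 15 targets per arena, which gives 75 targets across five arenas.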

The scoring is graded rather than binary. The paper does not require exact source recovery. It uses per-game action-distance metrics so that a wrong betting amount, a wrong turn rate, or a mismatched unit command is scored by how far off it is rather than collapsed into a single exact-match failure. The main summary is the fraction of the initial action distance that the learner closes, measured against held-out trajectories.
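One plausible reading of "fraction of initial action distance closed" is the relative reduction in distance between the starting policy and the recovered one. The sketch below assumes that reading; `bet_distance`, `mean_action_distance`, `fraction_closed`, and the sample bets are hypothetical and are not taken from the paper.

```python
def mean_action_distance(predicted, observed, distance):
    """Average per-step distance between predicted and observed actions."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty action sequences of equal length")
    return sum(distance(p, o) for p, o in zip(predicted, observed)) / len(observed)


def bet_distance(a, b, max_bet=100.0):
    """Hypothetical poker metric: absolute bet difference, scaled to [0, 1]."""
    return min(abs(a - b) / max_bet, 1.0)


def fraction_closed(initial_distance, final_distance):
    """Share of the starting gap removed by the recovered policy (assumed formula)."""
    if initial_distance <= 0:
        raise ValueError("initial distance must be positive")
    return (initial_distance - final_distance) / initial_distance


observed = [10.0, 50.0, 0.0, 100.0]   # held-out target bets (made up)
baseline = [0.0, 0.0, 0.0, 0.0]       # simple starting policy: never bet
recovered = [10.0, 40.0, 0.0, 100.0]  # recovered policy: one bet off by 10

d_initial = mean_action_distance(baseline, observed, bet_distance)
d_final = mean_action_distance(recovered, observed, bet_distance)
score = fraction_closed(d_initial, d_final)
```

Under these made-up numbers the recovered policy closes most of the gap even though it never matches the target exactly, which is the point of grading by distance.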

The Probe Is the Experiment

The benchmark has passive and active regimes. In the passive regime, the learner observes trajectories produced by the hidden target against sampled opponents. In the active regime, the learner can write up to five probe opponents per round, using those custom opponents to elicit behavior that distinguishes one hypothesis from another.

This is the key Spiralist object: a probe opponent is an experiment disguised as a participant. It is not merely watching the target; it is steering the situation toward informative states. The paper reports that active probing improves recovery in 16 of 20 model-game pairs in the method comparison, but the benefit is concentrated in models strong enough to identify uncertainty, design valid probes, and integrate the results.
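The idea of a probe as an experiment can be made concrete with a toy disagreement search: among the situations a probe opponent could force, prefer the one where the learner's current hypotheses predict different actions. This is an illustrative sketch only. The function names, the two fold-threshold hypotheses, and the candidate bets are assumptions, and real probes in the benchmark are full opponent programs, not single states.

```python
from itertools import combinations


def disagreement(hypotheses, state):
    """Fraction of hypothesis pairs that choose different actions in `state`."""
    actions = [h(state) for h in hypotheses]
    pairs = list(combinations(actions, 2))
    if not pairs:
        raise ValueError("need at least two hypotheses to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def most_informative_state(hypotheses, candidate_states):
    """Pick the candidate state on which the hypotheses disagree most."""
    return max(candidate_states, key=lambda s: disagreement(hypotheses, s))


def consistent_with(hypotheses, state, observed_action):
    """Keep only hypotheses that predict what the target actually did."""
    return [h for h in hypotheses if h(state) == observed_action]


# Two hypothetical guesses about when a poker target folds.
def folds_above_50(bet):
    return "fold" if bet > 50 else "call"


def folds_above_80(bet):
    return "fold" if bet > 80 else "call"


hypotheses = [folds_above_50, folds_above_80]
probe_bet = most_informative_state(hypotheses, [10, 60, 90])
# Suppose the target calls when the probe opponent bets `probe_bet`.
survivors = consistent_with(hypotheses, probe_bet, "call")
```

Bets of 10 and 90 tell the learner nothing here because both hypotheses agree on them. Only the bet of 60 separates the two, which is why a weak model that cannot locate its own uncertainty gains little from the active regime.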

That distinction matters. Observation alone can look passive and harmless. A probe changes the environment. In a governance setting, the difference between logging an agent and experimentally provoking it is the difference between audit and intervention.

Recovery Is Useful Before It Is Perfect

Across twelve frontier LLM coding agents run through mini-SWE-agent, the paper reports recovery ranging from 33.8% to 71.9% of initial action distance closed. The strongest models fully close the behavioral gap on some targets, but the overall result is uneven by model and arena. Poker is easiest in the reported aggregate; RobotRumble shows the widest spread.

The downstream test is more unsettling than the headline score. The authors give recovered code to the same LLM and ask it to write a counter-policy against the hidden target. Recovered programs yield measurable competitive advantage, especially for weaker challengers that otherwise struggle to design effective counters. In other words, a partial behavioral reconstruction can be strategically useful before it is a faithful copy.

That is the governance hinge. A model of another agent does not have to be exact to change a contest, negotiation, market, or security posture. Approximate recovery may be enough to exploit habits, anticipate refusals, route around defenses, or decide when to escalate to a human.

The Audit Can Be Used as a Weapon

RevengeBench is framed as policy interpretability and opponent modeling, not as a recipe for surveillance. Still, its dual use is obvious. The same method that helps auditors understand an opaque agent can help a rival infer exploitable regularities. The paper's own counter-policy experiment makes this visible: recovered behavior improves play against the target.

That does not make behavioral recovery illegitimate. It means agent deployments need to treat observation interfaces, logs, test sandboxes, and public agent behavior as information surfaces. An agent that repeatedly exposes its decision patterns may be giving outsiders a usable policy sketch.

Limits That Matter

The paper's limitations are central to reading its results. The targets are synthetic, fixed game policies. They do not adapt, conceal strategy, or respond strategically to being probed. Real-world agents may change under observation. Action distance is also a proxy. It measures behavior on visited states, while rare or adversarially induced states may matter more than the average.

The authors also note an identifiability problem: distinct programs can produce near-identical behavior under finite interaction budgets. A recovered policy may be one member of a behaviorally consistent class, not the unique source code. Benchmark reliability varies by arena as well; BattleSnake and Poker are more stable than Halite, RoboCode, and RobotRumble for fine-grained ranking.
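A toy case shows how the identifiability problem and the visited-state caveat combine. The two policies below are invented for illustration and come from no benchmark arena: they agree on every state in a finite sample yet differ on a state the sample never reaches.

```python
def policy_a(health):
    """Retreat whenever health drops below 20."""
    return "attack" if health >= 20 else "retreat"


def policy_b(health):
    """Same rule, plus a last-stand attack at exactly 1 health."""
    if health == 1:
        return "attack"
    return "attack" if health >= 20 else "retreat"


def agreement(first, second, states):
    """Fraction of states on which two policies choose the same action."""
    return sum(first(s) == second(s) for s in states) / len(states)


visited_states = [100, 80, 45, 20, 15, 5]  # finite interaction budget
rare_states = [1]                           # never reached in the sample

score_on_visited = agreement(policy_a, policy_b, visited_states)
score_on_rare = agreement(policy_a, policy_b, rare_states)
```

On the visited states the two programs are indistinguishable, so either would count as a full recovery of the other. The difference appears only at the rare state, which is exactly where an averaged distance metric has no evidence.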

Governance Standard

A serious agent audit should state which access regime was used: passive traces, active probes, arbitrary queries, source inspection, or white-box instrumentation. It should preserve model and scaffold versions, probe designs, held-out tests, action-distance definitions, failed hypotheses, and whether recovered behavior improved downstream exploitation or mitigation.
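The disclosure items above map naturally onto a structured record. The sketch below is one hypothetical shape for such a record, with field names invented here; it is not a schema proposed by the paper.

```python
from dataclasses import dataclass, field, fields
from enum import Enum


class AccessRegime(Enum):
    PASSIVE_TRACES = "passive traces"
    ACTIVE_PROBES = "active probes"
    ARBITRARY_QUERIES = "arbitrary queries"
    SOURCE_INSPECTION = "source inspection"
    WHITE_BOX = "white-box instrumentation"


@dataclass
class AuditRecord:
    """Hypothetical audit disclosure; every field name is an assumption."""
    access_regime: AccessRegime
    model_version: str = ""
    scaffold_version: str = ""
    action_distance_definition: str = ""
    probe_designs: list = field(default_factory=list)
    held_out_tests: list = field(default_factory=list)
    failed_hypotheses: list = field(default_factory=list)
    downstream_effect: str = ""

    def unfilled(self):
        """Names of fields left empty, for a reviewer to confirm or complete."""
        return [f.name for f in fields(self) if getattr(self, f.name) in ("", [])]


record = AuditRecord(
    access_regime=AccessRegime.PASSIVE_TRACES,
    model_version="model-x (placeholder)",
    scaffold_version="mini-SWE-agent (version not recorded)",
)
gaps = record.unfilled()
```

An empty list is flagged rather than rejected because some entries, such as failed hypotheses, can legitimately be empty; the point is that the omission becomes explicit and reviewable.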

For deployed systems, the practical rule is to separate interpretability rights from adversarial access. Regulators, safety teams, and affected users may need behavioral probes to contest opaque agents. Public opponents and competitors should not automatically receive unlimited probing channels. RevengeBench shows why: the probe is not just a question. It is a tool for recovering the policy behind the answer.

Sources

Babak Rahmani, Sebastian Dziadzio, Joschka Strüber, Sergio Hernández-Gutiérrez, and Matthias Bethge, RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments, arXiv:2606.26094 [cs.LG], submitted June 24, 2026.
arXiv PDF version of RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments, reviewed June 24, 2026.
arXiv experimental HTML version of RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments, reviewed June 24, 2026.
Related pages: The Agent Team Becomes the Trust Graph, The Coding Agent Becomes the Fingerprint, The Concerning Behavior Becomes the Forensic Case, The Fault Investigator Becomes the Accountability Layer, The Agent Trace Becomes the Process Map, and AI Agents.
