The Policy Loop Becomes the Budget Receipt
When an agent improves a policy through feedback, the score is only the last line of the record. The loop itself is the evidence.
The Paper
The paper is EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments, arXiv:2607.02440 [cs.AI, cs.CL]. The arXiv record lists version 1 as submitted on July 2, 2026, and the PDF is 24 pages. The authors are Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du, Jiacheng Chen, Tianle Li, Qingyu Yin, Yulun Wu, Zhennan Shen, Tong Zhu, Yanshu Li, Guanjie Chen, Derek F. Wong, Yafu Li, Yu Cheng, and Yang Yang. The HTML title page lists affiliations including the University of Science and Technology of China, The Chinese University of Hong Kong, University of Macau, Tsinghua University, Zhejiang University, Soochow University, Brown University, and Shanghai Jiao Tong University.
The core term is Autonomous Policy Evolution: a harness-model agent repeatedly edits an executable decision policy under a fixed interaction budget, sees feedback from submitted rollouts, and is judged afterward on hidden held-out cases, using the checkpoint chosen by hidden validation. That framing matters because self-improvement is not a single output. It is a sequence of choices about what to test, what to rewrite, what to preserve, and when to spend scarce feedback.
The Policy Loop
EvoPolicyGym gives the agent a live workspace and a policy entry point at system/policy.py. The submitted policy exposes a Policy class with reset and act methods. The agent may add helper modules, configuration files, tests, weights, memory files, and analysis utilities under system/. It uses a local service to inspect protocol state, read the task contract, and submit train rollouts.
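A minimal sketch of what system/policy.py might contain is shown below. Only the Policy class name and the reset and act method names come from the paper's description. The constructor arguments, the observation type, the integer action space, and the seeded random behavior are assumptions made for illustration.

```python
import random


class Policy:
    """Hypothetical starting policy; the real task contract may differ."""

    def __init__(self, num_actions=4, seed=0):
        # Assumption: a discrete action space of num_actions choices.
        self.num_actions = num_actions
        self.seed = seed
        self.rng = random.Random(seed)
        self.step_count = 0

    def reset(self):
        # Clear per-episode state so every rollout starts from the same point.
        self.rng = random.Random(self.seed)
        self.step_count = 0

    def act(self, observation):
        # Placeholder behavior: ignore the observation, pick a seeded random action.
        self.step_count += 1
        return self.rng.randrange(self.num_actions)
```

A baseline like this gives the agent something to submit early. The evolution the benchmark measures is what replaces the body of act over later edits.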
The paper validates the full protocol on Core16, a 16-environment suite spanning Gym/Box2D, MuJoCo, MiniGrid, and robotics/driving tasks. Every agent receives a 128-episode training budget per environment. The server charges budget for accepted requested train episodes, stores submitted checkpoints, writes feedback under feedback/submit_NNN/, and leaves the workspace live: harmful valid edits remain until the agent repairs or overwrites them.
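The charging rule can be pictured as a small ledger. This sketch follows the description above in charging only accepted submits. It adds two hypothetical details that the description does not specify: requests that exceed the remaining budget are rejected whole, and rejected submits still get a log entry. BudgetLedger and its fields are illustrative names, not the benchmark's API.

```python
from dataclasses import dataclass, field

TRAIN_BUDGET = 128  # train episodes per environment, per the Core16 setup


@dataclass
class BudgetLedger:
    """Hypothetical ledger; not the benchmark server's implementation."""

    remaining: int = TRAIN_BUDGET
    log: list = field(default_factory=list)

    def submit(self, requested_episodes, accepted=True):
        # Assumption: a non-positive or over-budget request is rejected whole.
        if requested_episodes <= 0 or requested_episodes > self.remaining:
            accepted = False
        # Only accepted submits consume budget.
        charged = requested_episodes if accepted else 0
        self.remaining -= charged
        record = {
            "submit_index": len(self.log),
            "requested": requested_episodes,
            "accepted": accepted,
            "charged": charged,
            "remaining": self.remaining,
        }
        self.log.append(record)
        return record
```

Under these assumptions, a run that submits 32 episodes, has an 8-episode submit rejected, and then submits 96 more ends with zero budget left and three log entries, one of them uncharged.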
The Visibility Boundary
The important governance feature is the visibility boundary. Train cases are visible as handles and provide the only in-loop evidence. Validation and held-out cases remain server-side. After the 128-episode train budget is exhausted, the server evaluates status-ok checkpoints on 16 hidden validation cases, chooses the best checkpoint, and then evaluates that selected checkpoint on 32 hidden held-out cases.
This makes the benchmark more than a reward table. It records accepted and rejected submits, budget consumed per submit, selected checkpoints, policy stdout and stderr, trajectory records, timing, errors, optional videos or external observations, score-drop events, policy-complexity growth, wall time, and token accounting where available. Those traces are not extra decoration. They are how a reviewer can tell whether a score came from early insight, blind retries, late salvage, or overfitted visible feedback.
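The post-budget selection step described in this section can be sketched as follows. The dictionary keys, the use of a mean over validation cases as the definition of "best", and the tie-breaking rule are assumptions; the paper's exact aggregation may differ. The evaluate_case callable is a hypothetical stand-in for a server-side rollout.

```python
from statistics import mean


def select_checkpoint(checkpoints):
    """Pick the status-ok checkpoint with the best hidden-validation score.

    Each checkpoint is a dict with hypothetical keys:
    'id', 'status', and 'validation_scores' (one score per validation case).
    """
    eligible = [c for c in checkpoints if c["status"] == "ok"]
    if not eligible:
        return None
    # Assumption: "best" is the highest mean score; max() keeps the earliest on ties.
    return max(eligible, key=lambda c: mean(c["validation_scores"]))


def held_out_score(checkpoint, evaluate_case, num_cases=32):
    # Only the selected checkpoint is ever scored on the held-out cases.
    return mean(evaluate_case(checkpoint, i) for i in range(num_cases))
```

The point of the two-stage shape is that the agent never sees either score during the loop: validation picks the checkpoint, and held-out cases report on that one pick.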
What the Results Say
The paper evaluates four harness-model agents. GPT-5.5 runs through a Codex harness. Claude Opus 4.7, MiniMax-M3, and DeepSeek-V4-Pro run through a Claude Code-compatible harness. Token use, context management, and provider-specific defaults are not normalized; the paper treats them as part of the evaluated harness-model system and reports token statistics as diagnostics rather than scoring inputs.
On the aggregate Core16 leaderboard, GPT-5.5 has the top rank score at 0.891, with nine wins and top-two placement on all 16 environments. Claude Opus 4.7 ranks second at 0.750, with five wins and 12 top-two placements, and leads the MiniGrid family. MiniMax-M3 scores 0.531 and DeepSeek-V4-Pro scores 0.359. Each of those two wins one environment, which illustrates the paper's point that isolated task wins do not establish suite-level reliability.
The diagnostic sections sharpen that conclusion. Hidden-validation best-so-far curves show when useful policies appeared during the consumed budget. The mechanism analysis distinguishes structural synthesis from parametric tuning. The paper reports that stronger runs introduce task-appropriate machinery, such as road-mask lookahead for CarRacing, periodic gait control for HalfCheetah, symbolic mapping and BFS planning for ObstructedMaze, or geometric phase control for FetchPush.
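A best-so-far curve is a running maximum over checkpoints ordered by budget consumed. The sketch below assumes each checkpoint is summarized as a hypothetical (budget_consumed, validation_score) pair; the numbers in the usage note are made up and are not results from the paper.

```python
from itertools import accumulate


def best_so_far(points):
    """points: (budget_consumed, validation_score) pairs in submit order."""
    budgets = [b for b, _ in points]
    # Running maximum: the best validation score seen up to each submit.
    running = accumulate((s for _, s in points), max)
    return list(zip(budgets, running))


def budget_at_final_best(curve):
    # The earliest budget at which the curve reaches its final value.
    final = curve[-1][1]
    return next(b for b, s in curve if s == final)
```

For illustrative points [(16, 0.2), (48, 0.5), (80, 0.3), (128, 0.7)], the curve is [(16, 0.2), (48, 0.5), (80, 0.5), (128, 0.7)] and the final best first appears at budget 128, which would read as a late gain rather than an early insight.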
The Budget Receipt
A policy-evolution receipt should include the environment, case split version, run instructions, initial workspace, policy entry point, full patch sequence, submit list, requested train case indices, remaining budget after each submit, feedback summaries, trajectories, stdout, stderr, videos or observations, validation-selection rule, selected checkpoint, held-out result, harness, model, tool profile, token accounting, retry events, wall time, repository revision, dataset revision, and any generated files that influenced the final policy.
Without that receipt, "the agent improved" is too vague. The improvement may be robust abstraction, search over constants, accidental overfitting to visible cases, a lucky late checkpoint, or a costly loop that only works when the budget is generous. EvoPolicyGym's useful contribution is to make those differences inspectable.
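The receipt fields listed above can be turned into a simple completeness check. The snake_case key names and the dict-based receipt are assumptions for illustration; EvoPolicyGym's own artifact layout may organize the same information differently.

```python
# Hypothetical key names, one per item in the receipt list above.
REQUIRED_FIELDS = [
    "environment", "case_split_version", "run_instructions",
    "initial_workspace", "policy_entry_point", "patch_sequence",
    "submit_list", "requested_train_case_indices",
    "remaining_budget_after_each_submit", "feedback_summaries",
    "trajectories", "stdout", "stderr", "videos_or_observations",
    "validation_selection_rule", "selected_checkpoint", "held_out_result",
    "harness", "model", "tool_profile", "token_accounting", "retry_events",
    "wall_time", "repository_revision", "dataset_revision", "generated_files",
]


def missing_fields(receipt):
    """Return the required keys that are absent or set to None."""
    return [name for name in REQUIRED_FIELDS if receipt.get(name) is None]
```

A reviewer could run missing_fields on a submitted receipt and decline to interpret the held-out score until the returned list is empty.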
Limits
The paper keeps the claims bounded. Core16 is a calibrated subset, not every environment supported by the wrappers. The diagnostics are conservative proxies: abstract syntax tree structure, edit-size patterns, and synthesis-versus-tuning labels do not prove a semantic mechanism. The authors also exclude conventional reinforcement-learning baselines from the leaderboard because those methods use a different training interface and typically need far larger sample budgets than the 128-episode setting.
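To make the idea of an abstract-syntax-tree proxy concrete, the sketch below profiles a policy file with Python's standard ast module. The chosen features (node, function, class, branch, and numeric-constant counts) are assumptions for illustration and are not claimed to be the paper's diagnostics.

```python
import ast
from collections import Counter


def ast_profile(source):
    """Count coarse structural features of a Python source string."""
    nodes = list(ast.walk(ast.parse(source)))
    kinds = Counter(type(n).__name__ for n in nodes)
    # Numeric literals are a rough stand-in for tunable parameters.
    numeric = sum(
        1 for n in nodes
        if isinstance(n, ast.Constant)
        and isinstance(n.value, (int, float))
        and not isinstance(n.value, bool)
    )
    return {
        "total_nodes": len(nodes),
        "functions": kinds["FunctionDef"],
        "classes": kinds["ClassDef"],
        "branches": kinds["If"] + kinds["For"] + kinds["While"],
        "numeric_constants": numeric,
    }
```

Comparing profiles across successive checkpoints shows whether an edit added functions and branches or only changed constants. A rising count says the policy file changed shape, not that a new mechanism works, which is the sense in which such signals are only proxies.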
That restraint is the lesson for deployment. A final score can tell a buyer or lab that a system did well. It cannot tell them whether the system learned, searched, guessed, burned budget, or hid fragility. For self-evolving agents, the evaluation object should be the loop: artifact, feedback, revision, budget, checkpoint, and held-out test.
Sources
- Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du, Jiacheng Chen, Tianle Li, Qingyu Yin, Yulun Wu, Zhennan Shen, Tong Zhu, Yanshu Li, Guanjie Chen, Derek F. Wong, Yafu Li, Yu Cheng, and Yang Yang, EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments, arXiv:2607.02440 [cs.AI, cs.CL].
- arXiv HTML for EvoPolicyGym, checked for title-page metadata, affiliations, protocol, Core16 setup, leaderboard results, diagnostics, limitations, and artifact links.
- arXiv PDF for EvoPolicyGym, checked for page count, authors, task-suite tables, appendix details, run protocol, and evaluation boundary.
- EvoPolicyGym GitHub repository, checked for benchmark infrastructure, README protocol summary, and repository license metadata.
- EvoPolicyGym-Exp-data on Hugging Face and the EvoPolicyGym companion site, checked for the paper-linked dataset and final policy rollout gallery.