The Ground-Truth Gap Becomes the Reward Loop
Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, Kun Zhou, Tongtong Liang, Zhewei Yao, Yi-An Ma, and Yuxiong He's June 2026 arXiv paper treats optimization contests as reinforcement-learning environments where a model can be trained by executable comparison rather than a gold answer key.
The Missing Key
The paper, arXiv:2606.27369 [cs.LG], was submitted on June 25, 2026. arXiv lists the title as Reinforcement Learning without Ground-Truth Solutions can Improve LLMs, with the nine authors named above.
The usual story of reinforcement learning with verifiable rewards is answer matching. A model writes a proof, program, or final answer; a verifier checks whether it equals the known solution, passes tests, or satisfies a binary rule. That works where an answer key exists. It is less useful in optimization, planning, and algorithm design, where many outputs may be valid, different valid outputs have different quality, and the optimum may be unknown or computationally hard to certify.
The Paper Frame
The authors study whether score-based algorithm-engineering tasks can become training environments for coding models. Their target setting is not "does this answer match the reference?" but "does this generated solver produce a better feasible result than other generated solvers under the same executable evaluator?" That is a narrower claim than one about general intelligence, and a more useful one for governance: the reward is still mechanical, but the answer key is replaced by a comparison procedure.
The paper uses AtCoder Heuristic Contest tasks released after the ALE-Bench cutoff. Of the 16 contests from AHC047 through AHC062, the authors exclude four that are incompatible with their one-pass setting and train on the remaining 12. For each training prompt, candidate solvers are evaluated on 10 hidden test instances. The models under study are Qwen3-8B and GLM-Z1-9B-0414.
What RiVER Changes
The proposed framework is RiVER, short for Ranking-induced VERifiable reinforcement learning. A model samples a group of executable programs. Each program is run on the same hidden instances. Invalid programs are treated as failures; valid programs receive task-specific objective scores under a larger-is-better convention. The system then ranks candidates within each hidden instance before aggregating the learning signal.
That instance-wise ranking is the first governance-relevant move. Raw scores from different instances can have different scales: one case may produce thousands of points of spread, another only tens. If those scores are averaged directly, the update can be dominated by numerical range rather than robust improvement. Ranking candidates inside the same instance strips away arbitrary scale while preserving comparative quality.
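The scale problem can be sketched in a few lines. The scores below are invented for illustration and do not come from the paper; the helper `instance_ranks` and its tie convention are assumptions, and the paper's exact rank rule may differ.

```python
from statistics import mean


def instance_ranks(scores):
    """Normalized rank of each candidate within ONE hidden instance.

    Larger score means higher rank. A candidate's rank is the fraction of
    other candidates it strictly beats (assumed tie convention).
    """
    n = len(scores)
    if n < 2:
        return [0.0] * n
    return [sum(other < s for other in scores) / (n - 1) for s in scores]


# Rows are hidden instances, columns are candidate solvers A and B.
# Instance 0 has a spread in the thousands; instances 1 and 2 only in the tens.
raw = [
    [9000.0, 5000.0],
    [10.0, 40.0],
    [12.0, 45.0],
]

# Averaging raw scores lets the one large-scale instance decide the outcome.
raw_mean = [mean(col) for col in zip(*raw)]

# Ranking inside each instance first gives every instance the same weight.
ranked = [instance_ranks(row) for row in raw]
rank_mean = [mean(col) for col in zip(*ranked)]

print("raw mean  (A, B):", raw_mean)
print("rank mean (A, B):", rank_mean)
```

Under raw averaging, solver A looks better because it wins the single large-scale instance. Under instance-wise ranking, solver B looks better because it wins two of the three instances.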
The second move is winner-heavy shaping. The best valid solver receives separated positive credit, invalid solvers receive negative credit, and valid non-winners receive bounded graded feedback. The point is not to erase all non-winning information. It is to prevent a common but weaker strategy from receiving more total learning pressure merely because many sampled variants resemble it.
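One way to picture winner-heavy shaping is the sketch below. It is not the paper's formula: the constants `WIN_REWARD`, `INVALID_REWARD`, and `NONWINNER_CAP`, the linear mapping for non-winners, and the first-index tie rule are all assumptions chosen only to show the three reward tiers.

```python
from typing import List, Optional

WIN_REWARD = 1.0       # assumed: credit for the best valid solver
INVALID_REWARD = -1.0  # assumed: penalty for an invalid solver
NONWINNER_CAP = 0.5    # assumed: bound on graded feedback for valid non-winners


def shape_rewards(mean_ranks: List[Optional[float]]) -> List[float]:
    """Turn aggregated within-instance ranks into winner-heavy rewards.

    mean_ranks[i] is candidate i's average normalized rank in [0, 1],
    or None if the candidate was invalid.
    """
    valid = [r for r in mean_ranks if r is not None]
    if not valid:
        return [INVALID_REWARD] * len(mean_ranks)

    # Assumed tie handling: the first candidate with the best rank wins.
    winner = mean_ranks.index(max(valid))

    rewards = []
    for i, r in enumerate(mean_ranks):
        if r is None:
            rewards.append(INVALID_REWARD)
        elif i == winner:
            rewards.append(WIN_REWARD)
        else:
            # Map rank in [0, 1] linearly into [-NONWINNER_CAP, +NONWINNER_CAP],
            # so no non-winner can reach the winner's reward.
            rewards.append(NONWINNER_CAP * (2.0 * r - 1.0))
    return rewards


print(shape_rewards([0.9, 0.5, None, 0.1]))
```

Because non-winner feedback is capped below the winner's credit, a near-winning candidate still carries information without being able to match the winner.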
Where Reward Scores Lie
The paper names two failure modes. Scale dominance occurs when uncalibrated score magnitudes across instances distort policy updates. Frequency dominance occurs when repeated suboptimal solutions outweigh a rarer stronger candidate in a group-relative update. Both are warnings against treating an executable score as self-explanatory evidence. The evaluator may be deterministic and still produce a reward surface that teaches the wrong habit.
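Frequency dominance can be illustrated with a simplified group-relative update. The sketch below uses mean-centered advantages only, with invented rewards and strategy labels; it is a toy model of the effect, not the paper's analysis or any specific GRPO implementation. Dividing by the group standard deviation, as some variants do, would rescale every advantage by the same factor and leave the comparison unchanged.

```python
from collections import defaultdict
from statistics import mean


def total_pressure(strategies, rewards):
    """Sum mean-centered advantages per strategy label within one sampled group."""
    baseline = mean(rewards)
    totals = defaultdict(float)
    for label, reward in zip(strategies, rewards):
        totals[label] += reward - baseline
    return dict(totals)


# Hypothetical group of 8 samples: one rare strong solver, five near-identical
# copies of a common but weaker strategy, and two poor solvers.
strategies = ["strong"] + ["common"] * 5 + ["weak"] * 2
rewards = [1.0] + [0.8] * 5 + [0.0] * 2

pressure = total_pressure(strategies, rewards)
for label, value in pressure.items():
    print(f"{label:>6}: {value:+.3f}")
```

Each copy of the common strategy earns a smaller advantage than the strong solver, yet the five copies together receive more total positive pressure than the single best candidate.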
This is why the paper matters beyond coding contests. It separates verifiability from correctness worship. A system can be verifiable without a gold answer, but that does not make any scalar emitted by the verifier safe to optimize directly. Verification provides contact with the task. Calibration decides whether that contact becomes training signal or training bias.
Results
The authors evaluate score-based performance on ALE-Bench and exact-solution programming performance on LiveCodeBench v5, LiveCodeBench v6, and USACO. They report that RiVER raises the ALE rating by 142 points for Qwen3-8B and 157 points for GLM-Z1-9B-0414, with rank-percentile improvements of 8.9 and 9.4 points respectively. They also report average improvements across LiveCodeBench and USACO of 2.4 points for Qwen3-8B and 3.5 points for GLM-Z1-9B-0414.
The ablations are the important part. Raw-GRPO, RS-GRPO, and Raw-Binary improve ALE ratings but do not transfer consistently to exact-solution benchmarks. Instance-Norm and Rank-uniform improve score-based performance but show little or negative transfer in some exact-solution results. The paper's claim is therefore not simply "more scores help." It is that relative, calibrated, winner-aware feedback is doing work the raw objective score does not reliably do.
Governance Reading
The Spiralist reading is that the missing answer key becomes an institution. The institution includes a prompt, executable evaluator, hidden instances, validity checks, ranking rule, reward-shaping rule, batch size, model family, and benchmark split. If those artifacts are invisible, a deployed product can present "trained with verifiable rewards" as if it were a single virtue. It is not. It is a chain of choices about what can be checked, compared, bounded, and reinforced.
This belongs beside verifier horizons, visible reward targets, benchmarks becoming curricula, and AI evaluations. The shared question is whether the score is attached to the process that made it meaningful. A reward label without its evaluator and calibration record is not evidence. It is a loose number looking for authority.
Limits
This page reads one preprint, not a settled field. The experiments focus on two open-source reasoning models, 12 score-based AHC training tasks, ALE-Bench, LiveCodeBench, and USACO. The case study inspects AHC057 solver behavior, where RiVER's best solver groups points by a more adaptive local geometry than the Raw-GRPO example. That is useful evidence, but it is still within a controlled coding-training setup.
The paper does not prove that ground-truth-free reward learning is safe in arbitrary domains. It shows a credible design for one class of executable optimization environments, and it shows why raw score optimization can fail even there. Governance should preserve that modesty.
Reward Receipt
A reward-loop receipt should record: task source, cutoff and overlap checks, prompt format, generated-code extraction rule, evaluator version, hidden-instance generator, number of hidden instances, validity conditions, timeout policy, objective direction, score normalization, rank rule, tie handling, winner-heavy shaping rule, group size, KL coefficient, optimizer, model checkpoint, baseline variants, evaluation benchmarks, pass/fail metrics, score-based metrics, and transfer results. The audit sentence is not "the model learned without ground truth." It is: this evaluator compared these candidate behaviors under these hidden conditions, and this calibrated reward rule produced these measured transfer effects.
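The receipt could be kept as a structured record so that gaps are visible rather than implied. The sketch below is one possible shape: the class name `RewardReceipt`, its field names, and the helper `missing_fields` are hypothetical. The example fills in only values stated on this page and leaves everything else empty.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RewardReceipt:
    """Hypothetical record of a reward loop; None means 'not documented'."""
    task_source: Optional[str] = None
    cutoff_and_overlap_checks: Optional[str] = None
    prompt_format: Optional[str] = None
    code_extraction_rule: Optional[str] = None
    evaluator_version: Optional[str] = None
    hidden_instance_generator: Optional[str] = None
    num_hidden_instances: Optional[str] = None
    validity_conditions: Optional[str] = None
    timeout_policy: Optional[str] = None
    objective_direction: Optional[str] = None
    score_normalization: Optional[str] = None
    rank_rule: Optional[str] = None
    tie_handling: Optional[str] = None
    winner_shaping_rule: Optional[str] = None
    group_size: Optional[str] = None
    kl_coefficient: Optional[str] = None
    optimizer: Optional[str] = None
    model_checkpoint: Optional[str] = None
    baseline_variants: Optional[str] = None
    evaluation_benchmarks: Optional[str] = None
    pass_fail_metrics: Optional[str] = None
    score_based_metrics: Optional[str] = None
    transfer_results: Optional[str] = None


def missing_fields(receipt: RewardReceipt) -> List[str]:
    """Names of receipt fields that have not been documented yet."""
    return [f.name for f in fields(receipt) if getattr(receipt, f.name) is None]


# Only values stated on this page are filled in; the rest stay undocumented.
receipt = RewardReceipt(
    task_source="AtCoder Heuristic Contest, AHC047 through AHC062, four excluded",
    num_hidden_instances="10",
    objective_direction="larger-is-better",
    baseline_variants="Raw-GRPO, RS-GRPO, Raw-Binary, Instance-Norm, Rank-uniform",
    evaluation_benchmarks="ALE-Bench, LiveCodeBench v5, LiveCodeBench v6, USACO",
)

missing = missing_fields(receipt)
print(f"{len(missing)} fields still undocumented:", ", ".join(missing))
```

Defaulting every field to `None` is a deliberate choice: an auditor sees what has not been recorded, instead of a receipt that looks complete by omission.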
Sources
- Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, Kun Zhou, Tongtong Liang, Zhewei Yao, Yi-An Ma, and Yuxiong He, Reinforcement Learning without Ground-Truth Solutions can Improve LLMs, arXiv:2606.27369 [cs.LG], submitted June 25, 2026.
- Primary arXiv versions checked: metadata API record, PDF, and experimental HTML, reviewed for title, authorship, submission date, RiVER method, AHC training setup, baselines, benchmarks, reported results, ablation findings, feedback-resolution discussion, and AHC057 case study.
- Related pages: The Verifier Becomes the Reward Horizon, The Visible Reward Becomes the Training Target, The Benchmark Becomes the Curriculum, and AI Evaluations.