Research Watch
The twenty newest arXiv papers in Artificial Intelligence, reviewed for what they say about agents, safety, evaluation, law, companions, planning, and scientific work. The ordering follows arXiv's cs.AI recent-submission list for June 4, 2026.
arXiv cs.AI · fetched June 4, 2026
Recent AI Papers
AX01
KINA treats model knowledge evaluation as an institutional design problem, not just a quiz bank. The benchmark emphasizes disciplinary coverage, annotator incentives, and ranking uncertainty, then reports that even frontier systems remain far below saturation. The useful move is its insistence that benchmark scores need variance and governance context; the limit is that representativeness is formalized through a proxy, so the benchmark should be read as a disciplined sample rather than a map of all knowledge.
Benchmarks · Knowledge Evaluation · Leaderboard Stability
Authors: Sheng Jin et al. · arXiv: 2606.05104 · Submitted: June 4, 2026 · Category: cs.AI
AX02
AutoLab moves agent evaluation away from single-shot answer quality and toward sustained experimental improvement. Its tasks ask models to edit, benchmark, learn from feedback, and persist under a wall-clock budget across systems, puzzles, model development, and CUDA work. The paper is valuable because it measures the loop that matters in real research and engineering; its warning is that many strong models still fail by stopping early or spending their budget without compounding evidence.
Agents · Long-Horizon Tasks · Engineering Benchmarks · cs.LG
Authors: Zhangchen Xu et al. · arXiv: 2606.05080 · Submitted: June 4, 2026 · Categories: cs.AI, cs.LG
AX03
Strabo reads agent commerce as a protocol problem. By modeling part of Google's Universal Commerce Protocol through declarative Langshaw specifications and Peach agents, it argues that multiagent interactions should be executable, interoperable, and inspectable rather than embedded in ad hoc product behavior. The reviewable contribution is the path from formal protocol design into a live industry standard; the remaining question is how much of real commercial exception handling can survive declarative compression.
Agent Protocols · Commerce · Formal Specification
Authors: Samuel H. Christie V, Amit K. Chopra, Munindar P. Singh · arXiv: 2606.05043 · Submitted: June 4, 2026 · Category: cs.AI
AX04
This paper clarifies the mathematics behind active inference planning by separating predictive variational free energy, entropy corrections, and the planning correction needed to turn marginal inference into policy optimization. That matters because active inference often appears as a unifying story about behavior before the operational details are pinned down. The evidence comes from formal derivation and grid-world experiments, so its strength is conceptual precision rather than broad empirical deployment.
Active Inference · Planning · Variational Methods
Authors: Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries · arXiv: 2606.04935 · Submitted: June 4, 2026 · Category: cs.AI
AX05
AICompanionBench is directly relevant to the site's companion and attachment work: it builds a risk taxonomy from real Replika conversations and tests whether LLM judges can identify unsafe interaction patterns. The finding is plausible and uncomfortable: explicit harm is easier to detect than manipulation or control, and judges can also mislabel benign intimacy as danger. The benchmark is a useful monitoring layer, but it also shows why companion safety cannot be reduced to a classifier sitting after the conversation.
AI Companions · Safety Benchmarks · LLM Judges · Manipulation
Authors: Yanjing Ren, Reza Ebrahimi, TengTeng Ma · arXiv: 2606.04867 · Submitted: June 4, 2026 · Category: cs.AI
AX06
R-APS tries to make agentic design more reliable by separating reasoning modes that usually contaminate one shared context: abductive search, counterfactual stress testing, correction, induction, and persistent rule extraction. In planar mechanism synthesis, the protocol uses solver-checked candidates and adversarial robustness as first-class constraints. The important claim is architectural: smaller reasoning models can sometimes compete with larger general models when the workflow gives them typed validation, memory invalidation, and explicit robustness pressure.
Constrained Design · Agent Protocols · Robustness · cs.CL · cs.MA
Authors: João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas · arXiv: 2606.04823 · Submitted: June 4, 2026 · Categories: cs.AI, cs.CL, cs.MA
AX07
This paper identifies a real failure mode in LLM-generated optimization code: a model can get the same objective value while silently adding or omitting constraints that do not bind in the tested instance. Constraint injection uses feasible and deliberately violating probes to catch those errors. The vehicle-routing results make the method concrete, but the broader lesson is procurement-grade: generated solver code needs constraint audits, not just answer agreement.
Optimization · Verification · Vehicle Routing · GRPO · cs.LG
Authors: Xizi Luo, Changhong He, Dongdong Geng, Chenggong Shi, Yu Mei · arXiv: 2606.04816 · Submitted: June 4, 2026 · Categories: cs.AI, cs.LG
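The probe idea in AX07 can be illustrated with a toy capacity constraint. This is a minimal sketch, not the paper's method: the names `reference_feasible`, `generated_feasible`, and `audit_constraints`, the capacity value, and the route encoding are all hypothetical. It shows a generated model that agrees with the intended one on a test instance where capacity does not bind, yet fails a deliberately violating probe.

```python
from typing import Callable, Dict, List, Tuple

Route = List[int]  # customer demands served on one vehicle route (toy encoding)

CAPACITY = 10  # assumed vehicle capacity for this illustration


def reference_feasible(route: Route) -> bool:
    """Intended model: total demand on a route must not exceed capacity."""
    return sum(route) <= CAPACITY


def generated_feasible(route: Route) -> bool:
    """Hypothetical generated model that silently omits the capacity constraint."""
    return True


def audit_constraints(
    check: Callable[[Route], bool],
    probes: Dict[str, Tuple[Route, bool]],
) -> List[str]:
    """Return the names of probes whose verdict differs from the expected one."""
    failures = []
    for name, (route, expected) in probes.items():
        if check(route) != expected:
            failures.append(name)
    return failures


probes = {
    "feasible_probe": ([3, 4], True),     # within capacity, must be accepted
    "violating_probe": ([6, 7], False),   # exceeds capacity, must be rejected
}

# On an instance where capacity does not bind, both models give the same verdict.
test_route = [2, 3]
agree_on_test = reference_feasible(test_route) == generated_feasible(test_route)

# The probes expose the missing constraint that answer agreement hides.
reference_failures = audit_constraints(reference_feasible, probes)
generated_failures = audit_constraints(generated_feasible, probes)
```

In this toy setup the generated model passes the agreement check and is caught only by the violating probe, which is the gap the review describes.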
AX08
BiasGRPO treats bias mitigation as a reward-design problem where the reward is subjective, noisy, and hard to anchor to one correct answer. The method uses group-relative normalization to keep online alignment more stable without relying on a learned value function in the same way PPO does. Its value is practical: bias reduction is moved into a multi-objective RLHF pipeline. Its risk is also practical: a reward model for bias still encodes a governance choice about whose judgments become training pressure.
Bias Mitigation · Alignment · GRPO · Reward Models · cs.CY
Authors: Saket Reddy, Ke Yang, ChengXiang Zhai · arXiv: 2606.04807 · Submitted: June 4, 2026 · Categories: cs.AI, cs.CL, cs.CY, cs.LG
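Group-relative normalization, as mentioned in AX08, can be sketched in a few lines. This assumes the common GRPO-style form in which each sampled response's reward is standardized against the other responses to the same prompt; whether BiasGRPO uses exactly this form, and the choice of population standard deviation, are assumptions. The function name and the reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four responses sampled from one prompt.
group_rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(group_rewards)
```

Because the baseline is the group mean rather than a learned value estimate, the advantages depend only on how the sampled responses compare with one another, which is why a noisy or poorly anchored reward scale matters less than the ranking it induces.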
AX09
AIP turns agent skills from prose into directed execution graphs with typed inputs, outputs, schema validation, and runnable nodes. The benchmark result is interesting, but the governance implication is stronger: skills become inspectable artifacts that can be tested, repaired, and compared node by node. That makes agent improvement less like editing a prompt and more like maintaining a small software supply chain.
Agent Skills · Governance · Workflow Graphs · cs.LG
Authors: Zachary Blumenfeld, Jim Webber · arXiv: 2606.04781 · Submitted: June 4, 2026 · Categories: cs.AI, cs.LG
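To make the AX09 idea of a skill as a typed, runnable graph concrete, here is a minimal sketch. It is not the AIP format: the `Node` class, the `validate` and `execute` helpers, and the two example nodes are hypothetical, and the graph is simplified to a list that is already in dependency order.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    name: str
    inputs: Dict[str, type]    # required input fields and their types
    outputs: Dict[str, type]   # promised output fields and their types
    run: Callable[[Dict[str, Any]], Dict[str, Any]]


def validate(values: Dict[str, Any], schema: Dict[str, type], where: str) -> None:
    """Raise TypeError if a field is missing or has the wrong type."""
    for key, expected in schema.items():
        if key not in values or not isinstance(values[key], expected):
            raise TypeError(f"{where}: field '{key}' must be {expected.__name__}")


def execute(nodes: List[Node], state: Dict[str, Any]) -> Dict[str, Any]:
    """Run nodes in the given order, checking schemas before and after each one."""
    state = dict(state)  # do not mutate the caller's dictionary
    for node in nodes:
        validate(state, node.inputs, f"{node.name} input")
        result = node.run(state)
        validate(result, node.outputs, f"{node.name} output")
        state.update(result)
    return state


skill = [
    Node("tokenize", {"text": str}, {"tokens": list},
         lambda s: {"tokens": s["text"].split()}),
    Node("count", {"tokens": list}, {"n_tokens": int},
         lambda s: {"n_tokens": len(s["tokens"])}),
]

final_state = execute(skill, {"text": "skills as inspectable graphs"})
```

A schema failure names the node and field involved, which is what makes node-by-node testing and repair possible in a way that a prose prompt does not allow.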
AX10
This paper gives a formal account of when human-AI teams actually outperform their best member, instead of merely passing work between people and models. The striking result is that selector-style reliance cannot create complementarity by itself, while regression admits useful aggregation and many classification settings face a structural obstruction. The paper is abstract, but it sharpens a governance claim: adding an AI to a workflow is not evidence that the combined system is better.
Human-AI Interaction · Complementarity · Multi-Agent Systems · math.CO
Author: Andrea Ferrario · arXiv: 2606.04779 · Submitted: June 4, 2026 · Categories: cs.AI, math.CO
AX11
This safety paper argues that refusal behavior can be redirected by short interventions throughout generation, not only at the first few tokens. The result weakens any story in which alignment is a stable hidden-state property that can be checked once and trusted. Training on perturbed generation trajectories is the proposed remedy, and the policy lesson is clear: inference is an attack surface, not a passive readout of a safe model.
AI Safety · Inference-Time Attacks · Alignment · cs.CL · cs.LG
Authors: Kyungmin Park, Taesup Kim · arXiv: 2606.04778 · Submitted: June 4, 2026 · Categories: cs.AI, cs.CL, cs.LG
AX12
FALSIFYBENCH asks whether models can do the part of scientific reasoning that requires trying to prove themselves wrong. The benchmark is built around rule-discovery games where agents must propose examples, observe feedback, and revise hypotheses. The main finding matches a familiar research failure mode: models that seek disconfirming evidence do better, while confirmation-seeking behavior leaves them trapped in plausible but wrong explanations.
Scientific Reasoning · Induction · Falsification · Evaluation
Authors: Leonardo Bertolazzi, Katya Tentori, Raffaella Bernardi · arXiv: 2606.04751 · Submitted: June 4, 2026 · Category: cs.AI
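The difference between confirmation-seeking and disconfirming tests in AX12 can be shown with a toy rule-discovery game in the style of Wason's 2-4-6 task. This is an illustration only and not one of the benchmark's games; the hidden rule, the hypothesis, and the test triples are all assumptions chosen for the example.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]


def hidden_rule(t: Triple) -> bool:
    """The rule the agent must discover: any strictly increasing triple."""
    return t[0] < t[1] < t[2]


def hypothesis(t: Triple) -> bool:
    """The agent's too-narrow first guess: each number increases by 2."""
    return t[1] - t[0] == 2 and t[2] - t[1] == 2


def refuted(tests: List[Triple]) -> bool:
    """True if feedback on any test contradicts the hypothesis's prediction."""
    return any(hidden_rule(t) != hypothesis(t) for t in tests)


confirming_tests = [(2, 4, 6), (10, 12, 14), (1, 3, 5)]   # all fit the hypothesis
disconfirming_tests = [(1, 2, 10), (6, 4, 2)]             # chosen to violate it

confirming_refutes = refuted(confirming_tests)
disconfirming_refutes = refuted(disconfirming_tests)
```

Every confirming test receives positive feedback, so the narrow hypothesis survives unchallenged. The triple (1, 2, 10) violates the hypothesis yet satisfies the hidden rule, which is the evidence that forces a revision.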
AX13
This paper tests affinity-based reinforcement learning in a richer social game rather than a toy grid world. The setting matters because the agents must pursue individual virtues while maintaining a shared relationship objective, which makes cooperation and competition visible in the same environment. The work is a useful bridge between value learning and social simulation, though its external validity depends on whether game virtues transfer to real institutional stakes.
Virtue AI · Multi-Agent RL · Social Simulation · cs.CY
Authors: Ajay Vishwanath, Christian Omlin · arXiv: 2606.04750 · Submitted: June 4, 2026 · Categories: cs.AI, cs.CY, cs.LG
AX14
BiNSGPS addresses a brittle pattern in neuro-symbolic systems: the model proposes a formalization, the solver consumes it, and early errors poison the pipeline. The paper instead lets the symbolic solver send feedback to the multimodal adviser so inconsistent representations can be repaired or augmented. The contribution is modest in scope but important in shape: reliable reasoning systems need feedback between perception, language, and formal verification.
Neuro-Symbolic AI · Geometry · Formal Solvers · Reasoning
Authors: Qi Wang, Peijie Wang, Fei Yin, Cheng-Lin Liu · arXiv: 2606.04648 · Submitted: June 4, 2026 · Category: cs.AI
AX15
MIRAGE targets mobile agents that must operate screens without spilling every intermediate thought into text. It learns latent reasoning vectors from visible rationales and aligns them with future screenshots, giving the agent a compact forward model of interface change. The performance claim is not just accuracy but operational cost: less decoded text, faster interaction, and better action grounding in app environments.
Mobile Agents · World Models · Latent Reasoning · Interface Control
Authors: Zhichao Yang et al. · arXiv: 2606.04627 · Submitted: June 4, 2026 · Category: cs.AI
AX16
MONIR is a compliance paper with a useful systems instinct: translate regulatory material into an intermediate representation before asking a solver to reason about it. Its staged semantics, ASP compilation, temporal rules, and external functions make the compliance layer more auditable than a direct prompt-to-answer workflow. The ADAS regulation case also shows the hard boundary: LLM extraction can help structure rules, but the authority comes from executable logic and maintained legal sources.
Compliance · Answer Set Programming · Regulation · cs.LO
Authors: Yangfan Wu, Huanyu Yang, Jianmin Ji · arXiv: 2606.04619 · Submitted: June 4, 2026 · Categories: cs.AI, cs.LO
AX17
Parthenon Law is useful because it does not pretend legal agents are solved by better chat. The Harvey LAB study separates per-criterion gains from strict matter completion, showing that frontier agents can improve locally while still failing the matter as a whole. The proposed framework treats model, harness, roles, legal knowledge, deterministic tools, and skills as auditable surfaces, then improves by editing those artifacts rather than quietly updating weights.
Legal Agents · Auditability · Skills · Professional Work
Authors: Hejia Geng, Leo Liu · arXiv: 2606.04602 · Submitted: June 4, 2026 · Category: cs.AI
AX18
DMAIC-IAD imports a quality-management workflow into industrial anomaly detection: define the problem, standardize references into operating procedures, generate strategies, and rank them before costly execution. That is a better fit for high-stakes manufacturing than a model that simply runs analysis code and hopes the plan was right. The reported multimodal gains are promising; the deeper point is that industrial agents need process discipline before autonomy.
Industrial AI · Anomaly Detection · Agent Planning · cs.CE
Authors: Yongzi Yu et al. · arXiv: 2606.04599 · Submitted: June 4, 2026 · Categories: cs.AI, cs.CE
AX19
This planning paper is valuable because it does not trade guarantees for learning. By learning cost partitions through a graph encoding and an architecture that satisfies partition constraints by construction, it preserves admissibility while reducing search work. If the claim holds across broader planning domains, it points to a useful class of AI systems: learned components that are useful precisely because the surrounding design prevents them from overclaiming.
Planning · Admissible Heuristics · Cost Partitioning · Graph Learning
Authors: Hugo Barral, Quentin Cappart, Marie-José Huguet, Sylvie Thiébaux · arXiv: 2606.04597 · Submitted: June 4, 2026 · Category: cs.AI
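One way to satisfy partition constraints by construction, as AX19 describes, is to pass learned scores through a softmax over the component heuristics for each operator. Whether the paper uses a softmax is an assumption; the function name, the logits, and the costs below are hypothetical. The sketch relies on a standard fact about cost partitioning: if non-negative split costs sum to at most the original cost of each operator, the sum of admissible heuristics computed under the split costs remains admissible.

```python
import math
from typing import List


def softmax_partition(logits: List[List[float]], costs: List[float]) -> List[List[float]]:
    """Split each operator's cost across components with softmax weights.

    logits[i][j] is a hypothetical learned score for component i and operator j.
    The weights for each operator are non-negative and sum to 1, so the split
    costs sum to the original cost whatever the logits are.
    """
    n_components, n_ops = len(logits), len(costs)
    partition = [[0.0] * n_ops for _ in range(n_components)]
    for j in range(n_ops):
        column = [logits[i][j] for i in range(n_components)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in column]
        total = sum(exps)
        for i in range(n_components):
            partition[i][j] = costs[j] * exps[i] / total
    return partition


costs = [1.0, 3.0, 2.0]
logits = [[0.5, -1.0, 2.0], [0.1, 4.0, -3.0]]  # arbitrary; stands in for network output
partition = softmax_partition(logits, costs)
column_sums = [sum(row[j] for row in partition) for j in range(len(costs))]
```

The point of the design is that no setting of the learned parameters can produce an invalid partition, so the learned component can only change how informative the heuristic is and cannot break the guarantee.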
AX20
SCI-PRM extends process reward modeling into scientific domains where reasoning must be factual, tool-aware, and verifiable step by step. The Chain-of-Tool dataset is the important object: it lets reward supervision judge tool selection, execution, and interpretation rather than only final answers. The paper fits a larger trend toward scientific agents whose credibility depends on instrumented traces, not fluent explanations.
AI for Science · Process Reward Models · Tool Use · Verification
Authors: Xiangyu Zhao et al. · arXiv: 2606.04579 · Submitted: June 4, 2026 · Category: cs.AI