The Rule Pool Becomes the Policy Memory
A June 2026 arXiv paper by Shicheng Ye and Chao Yu studies a practical agent-learning problem: how to reuse experience without choosing between opaque parameter updates and stale natural-language notes. Their JERP method keeps a rule pool and the policy in the same learning loop, turning agent memory into something that can be inspected, revised, and absorbed into behavior.
Fresh Angle
The paper is Joint Learning of Experiential Rules and Policies for Large Language Model Agents, arXiv:2606.27136 [cs.AI], submitted June 25, 2026. It belongs with the site's pages on AI agents, post-training, AI memory, and agent observability, but it adds a narrower governance question: when an agent learns from experience, where does that experience live?
That question matters because agent memory is already becoming operational authority. A remembered rule can bias tool choice, reshape a search strategy, or preserve a workaround from a prior run. If the rule pool is stale, invisible, or detached from the policy that uses it, the agent may appear to learn while carrying forward old mistakes.
Two Memories
Ye and Yu describe two common ways to reuse agent experience. One keeps experience outside the model as natural-language rules, reflections, guidelines, manuals, or other prompt material. This is inspectable and easy to revise, but it may fall out of sync as the policy changes. The other uses trajectories and feedback to update model parameters directly. That can improve the policy more broadly, but it is less transparent and may miss local mistakes in sparse-reward settings.
The paper's useful move is to refuse that split. The agent should not have one memory humans can read and another memory only optimization can touch. JERP, short for Joint Learning of Experiential Rules and Policies, uses the same interaction trajectories for both explicit rule maintenance and policy optimization.
JERP Loop
JERP maintains a long-term experiential-rule pool for a task and selects a size-controlled working rule set for the current episode. The prompt used for action generation contains the task, the interaction history, and those working rules. The paper notes that its current implementation sorts rules by utility score and chooses the top rules, rather than claiming a generally superior instance-level retrieval method.
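The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Rule` fields, the `max_rules` cap, and the prompt layout are all assumptions made for the example. The only part taken from the paper's description is that rules are sorted by utility score and the top ones are kept.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: str    # hypothetical identifier, not from the paper
    text: str       # the natural-language rule shown to the policy
    utility: float  # utility score used for ranking (scale is assumed)


def select_working_rules(pool: list[Rule], max_rules: int) -> list[Rule]:
    """Return the highest-utility rules from the pool, capped at max_rules."""
    ranked = sorted(pool, key=lambda rule: rule.utility, reverse=True)
    return ranked[:max_rules]


def build_prompt(task: str, history: list[str], rules: list[Rule]) -> str:
    """Combine task, interaction history, and working rules into one prompt."""
    rule_lines = "\n".join(f"- {rule.text}" for rule in rules) or "(none)"
    history_lines = "\n".join(history) or "(empty)"
    return (
        f"Task: {task}\n\n"
        f"Rules:\n{rule_lines}\n\n"
        f"History:\n{history_lines}\n\n"
        "Next action:"
    )


if __name__ == "__main__":
    pool = [
        Rule("r1", "Check the fridge before the counter.", 0.4),
        Rule("r2", "Open containers before searching them.", 0.9),
        Rule("r3", "Re-read the task after every pickup.", 0.1),
    ]
    working = select_working_rules(pool, max_rules=2)
    print(build_prompt("put a cool apple in the microwave", ["> look"], working))
```

Because the ranking is global rather than per-instance, every episode of a task sees the same top rules until the pool or its scores change, which is why the selected rule set is worth logging alongside each action.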
After an episode, the trajectory is used twice. First, the policy is updated using Group Relative Policy Optimization (GRPO). Second, the rule pool is revised by comparing current rollouts with reference successful trajectories. In the authors' framing, this lets reusable local corrections stay visible as rules while stable behavior is gradually absorbed into the model's parameters.
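For readers unfamiliar with the policy-update half of the loop, the core of group-relative optimization is that each rollout is scored against the other rollouts sampled for the same task. The sketch below shows the commonly used normalization (reward minus group mean, divided by group standard deviation). It is an illustration under assumptions: the function name, the `eps` constant, and the use of the population standard deviation are choices made here, and the paper's exact variant may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its own sampling group.

    Assumes population standard deviation; eps avoids division by zero
    when every rollout in the group received the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one rollout")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    return [(reward - group_mean) / (group_std + eps) for reward in rewards]


if __name__ == "__main__":
    # Four rollouts of one task: two succeeded, two failed.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every rollout failed carries no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

In this sketch a group whose rollouts all receive the same reward produces all-zero advantages, which is one way to see why an explicit rule channel can be useful when rewards are sparse.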
Results
The experiments use AlfWorld and WebShop. AlfWorld is a text-based household manipulation environment with six task types, while WebShop is an online shopping environment where the agent searches, browses, compares, and purchases products to match user requirements. AlfWorld is scored by task success rate, and WebShop by average score and success rate.
The comparison includes direct prompting of a vanilla LLM, ReAct, Reflexion, RLOO with LoRA, GRPO with LoRA, and JERP with LoRA. In Table II, JERP reports the best overall AlfWorld success rate at 61.5 percent, above GRPO at 57.8 percent and RLOO at 48.7 percent. On WebShop, JERP reports the highest average score, 79.0, and the highest success rate, 64.1 percent, ahead of RLOO at 57.8 percent and GRPO at 56.2 percent. The paper also says the gains are more evident on tasks with longer decision sequences and richer intermediate constraints.
The ablation is the governance-relevant part. When rule-pool updating is paused after the initial stage, the variant still benefits from the initial rules but improves more slowly. The full JERP continues revising the rule pool as later trajectories expose different errors and useful behavior patterns.
Audit Standard
A deployed agent should not merely log final actions. It should log what experience was available to the policy when those actions were chosen. For JERP-like systems, that means recording the task, current trajectory, selected working rules, rule utility scores, rule revision history, policy version, reference successful trajectories, and the training step that produced the acting policy.
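One way to make that logging requirement concrete is a per-decision record with one field for each item listed above. The sketch below is hypothetical: the class name, field names, and JSON serialization are assumptions for illustration and are not part of JERP or the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical audit record for one acting episode of a JERP-like agent."""
    task: str
    trajectory: list[str]                 # observations and actions so far
    working_rule_ids: list[str]           # rules placed in the prompt
    rule_utilities: dict[str, float]      # utility score per selected rule
    rule_revisions: dict[str, list[str]]  # prior versions of each rule's text
    policy_version: str                   # identifier of the acting policy
    training_step: int                    # step that produced that policy
    reference_trajectory_ids: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so records diff cleanly in a log."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = DecisionRecord(
        task="put a cool apple in the microwave",
        trajectory=["> look", "> open fridge 1"],
        working_rule_ids=["r2"],
        rule_utilities={"r2": 0.9},
        rule_revisions={"r2": ["Open containers first."]},
        policy_version="policy-example",
        training_step=120,
        reference_trajectory_ids=["ref-example"],
    )
    print(record.to_json())
```

Storing rule identifiers alongside a revision history, rather than only the rule text, is what allows a later reviewer to reconstruct which version of a rule the policy actually saw.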
Without that record, rule memory becomes a hidden governance layer. A user may see a tool call and a natural-language rationale, but not the older rule that nudged the agent toward that procedure. The more useful the memory becomes, the more it needs provenance, expiry, review, and rollback.
Audit Trail
The audit should separate four claims. Did the rule come from a real trajectory? Did the reference successful trajectory actually solve the task? Did the current policy use the rule in a context where it still applies? Did later parameter updates make the rule redundant, stale, or harmful?
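The four questions can be tracked as separate fields so that an unanswered question is never mistaken for a passed one. The following sketch is an assumption-laden illustration: the `RuleAudit` class and its field names are hypothetical, and the answers are expected to come from reviewers or upstream tooling rather than from anything the paper provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleAudit:
    """Hypothetical per-rule audit sheet; None means 'not yet reviewed'."""
    rule_id: str
    from_real_trajectory: Optional[bool] = None      # claim 1
    reference_solved_task: Optional[bool] = None     # claim 2
    applied_in_valid_context: Optional[bool] = None  # claim 3
    superseded_by_updates: Optional[bool] = None     # claim 4 (True is bad)


def unresolved(audit: RuleAudit) -> list[str]:
    """List the claims that nobody has answered yet."""
    return [
        name
        for name, value in vars(audit).items()
        if name != "rule_id" and value is None
    ]


def is_clean(audit: RuleAudit) -> bool:
    """A rule passes only when all four claims are answered favorably."""
    return (
        audit.from_real_trajectory is True
        and audit.reference_solved_task is True
        and audit.applied_in_valid_context is True
        and audit.superseded_by_updates is False
    )


if __name__ == "__main__":
    audit = RuleAudit("r2", from_real_trajectory=True, reference_solved_task=True)
    print(unresolved(audit))
    print(is_clean(audit))
```

Keeping the fourth field phrased as the failure condition mirrors the question as asked: a rule that later parameter updates have made redundant, stale, or harmful fails the audit even if its origin was sound.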
This is where interpretability becomes administrative. A natural-language rule is only inspectable if it has a lineage. A parameter update is only governable if the organization can say what experiences it was trained from and which explicit rules were present while the relevant trajectories were sampled.
Limits
This is a preprint with two benchmark environments, not a general proof that a coupled rule-and-policy loop will govern every agent. The paper itself points to future work on more adaptive rule retrieval and multi-agent interactive settings. It also reports benchmark performance, not institutional audit readiness.
Still, the architecture identifies the right fault line. Agent learning is not just a model-weight story and not just a prompt-memory story. It is a loop between what the system writes down, what the model absorbs, and what future action treats as experience.
Sources
- Shicheng Ye and Chao Yu, Joint Learning of Experiential Rules and Policies for Large Language Model Agents, arXiv:2606.27136 [cs.AI], submitted June 25, 2026.
- arXiv PDF: Joint Learning of Experiential Rules and Policies for Large Language Model Agents, reviewed for the abstract, method framing, benchmark setup, performance table, ablation, and conclusion.
- arXiv HTML: 2606.27136v1, checked for the JERP loop, rule-pool updating, AlfWorld and WebShop metrics, baseline comparison, LoRA training note, and future-work limits.
- Related pages: AI Agents, Post-Training, AI Memory and Personalization, AI Agent Observability, and AI Evaluations.