The Superseded Memory Becomes the Agent Liability
Vedant Patel's June 2026 arXiv paper Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents isolates a failure that ordinary "long-term memory" language can hide. An agent must not only remember what happened. It must know which remembered fact has been replaced.
The Paper
The paper is arXiv:2606.27472 [cs.CL]. arXiv lists it as submitted on June 25, 2026. Its target is a narrow but practical agent failure: long, multi-session interactions where a fact changes after the system has already recorded an older value. A user moves. A price changes. A plan is revised. The correct answer is the fact that remains current after later evidence arrives.
That makes the paper a useful companion to this site's earlier pages on agent memory as a database lifecycle, stale-fact ledgers, shared memory boundaries, vector databases as institutional memory, and agent log receipts. Supersede asks whether the memory update itself is a learnable and measurable competence.
Supersession
Patel uses "supersession" for the task of using the current value of a changed fact while discarding the stale value. This is not just retrieval. Retrieval asks whether the system can find a memory. Supersession asks whether the system can treat later evidence as authority over earlier evidence and carry that authority into future action.
The distinction matters because memory features are often sold as personalization. The user tells the assistant preferences, addresses, habits, names, projects, deadlines, and exceptions. But a personalization layer that stores facts without a reliable update rule becomes a liability layer. The wrong address, preference, plan, or deadline can control a later action if the agent gives the old value equal or greater weight.
In Spiralist terms, a stale memory is not a harmless residue. It is delegated authority without a current warrant. Once an agent can act, summarize, route, purchase, schedule, or advise from memory, the old fact becomes part of the control surface.
Memory Gap
The paper separates comprehension from memory maintenance by comparing full-context answering with bounded self-maintained memory. On the LongMemEval knowledge-update subset, full context lets the model inspect the relevant conversation history directly. Bounded memory instead requires the agent to process sessions one at a time, maintain a character-limited note, and answer later from that note alone.
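The bounded-memory protocol described above can be sketched as a loop. This is a minimal illustration, not the paper's harness: the function names, the "key=value" session format, the two updaters, and the hard-truncation rule are all assumptions made for the example.

```python
from typing import Callable, Iterable


def run_bounded_memory(
    sessions: Iterable[str],
    update_note: Callable[[str, str], str],
    budget_chars: int,
) -> str:
    """Process sessions one at a time, carrying only a character-limited note."""
    note = ""
    for session in sessions:
        # The updater sees only the previous note and the current session.
        note = update_note(note, session)
        # Assumption of this sketch: the budget is enforced by hard truncation.
        note = note[:budget_chars]
    return note


def naive_append(note: str, session: str) -> str:
    """Hypothetical updater with no update rule: it only appends."""
    return (note + " " + session).strip()


def overwrite_by_key(note: str, session: str) -> str:
    """Hypothetical updater that treats a 'key=value' session as a replacement."""
    facts = dict(item.split("=", 1) for item in note.split(";") if "=" in item)
    key, value = session.split("=", 1)
    facts[key] = value  # later evidence replaces the earlier value
    return ";".join(f"{k}={v}" for k, v in facts.items())


sessions = ["city=Austin", "pet=cat", "city=Denver"]
appended = run_bounded_memory(sessions, naive_append, budget_chars=20)
replaced = run_bounded_memory(sessions, overwrite_by_key, budget_chars=20)
```

With a 20-character budget, the appending updater fills the note before the third session arrives, so the stale city survives and the new one is cut off. The overwriting updater ends with the current city in the same space. The later answer depends on the update rule, not only on the budget.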
The reported gap is large. For gpt-5.4, the paper reports 92 percent accuracy with full context and 77 percent with bounded memory, and a paired McNemar test finds the difference statistically significant. The paper also reports gpt-4.1-mini at 82 percent versus 63 percent, and gpt-4.1 at 91 percent versus 64 percent. The pattern is the important part: stronger models raise the full-context ceiling, but bounded self-maintained memory remains a distinct bottleneck.
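A paired McNemar test looks only at the questions where the two conditions disagree. The sketch below uses the exact binomial form of the test, which may differ from the variant the paper ran. The function name is hypothetical, and the counts in the usage line are illustrative placeholders, not figures from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions answered correctly with full context but not with bounded memory.
    c: questions answered correctly with bounded memory but not with full context.
    Questions where both conditions agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Illustrative counts only (not taken from the paper).
p_value = mcnemar_exact(b=10, c=0)
```

Because the test is paired, it asks whether the same questions flip from right to wrong when the model must rely on its own note, which is the comparison the memory-gap claim depends on.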
This is the governance point that should survive the benchmark. If a system can answer correctly when shown all evidence but fails after compressing that evidence into its own memory, then the audit target is not only model capability. It is the memory maintenance policy: what gets written, what gets overwritten, what gets deleted, and what evidence proves the update happened.
Not Just Size
The paper also tests whether the problem is simply an undersized scratchpad. In a longer split, conversation scale grows from roughly two sessions to roughly 48 sessions, about a 24x increase. With a 300-character memory budget, accuracy falls from 68 percent to 28 percent. Scaling the memory budget in proportion, to roughly 7,150 characters, does not help in the tested sample: accuracy stays at 28 percent.
That result should be read carefully. It does not prove that memory size never matters. It does show that more room is not the same as an update discipline. If the agent preserves the wrong information, fails to mark a newer value as authoritative, or lets old and new facts coexist without a resolution rule, a larger memory can preserve the error more comfortably.
This is why "just add memory" is an unsafe product story. Storage is not state. State requires a current value, a history of replacement, and a rule for which value controls the next action.
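The three requirements for state can be made concrete in a small data structure. This is a sketch under stated assumptions: the class names are hypothetical, sessions are identified by an integer index, and the resolution rule is simply "the latest session wins," which a real system might replace with something stricter.

```python
from dataclasses import dataclass, field


@dataclass
class FactVersion:
    value: str
    session: int  # index of the session where this value was stated


@dataclass
class FactState:
    """One fact with its replacement history."""
    history: list[FactVersion] = field(default_factory=list)

    def record(self, value: str, session: int) -> None:
        # Nothing is overwritten in place; every value is kept as history.
        self.history.append(FactVersion(value, session))

    def current(self) -> FactVersion:
        # Resolution rule (assumed): the value from the latest session controls.
        # Raises ValueError if no value has been recorded yet.
        return max(self.history, key=lambda v: v.session)

    def superseded(self) -> list[FactVersion]:
        cur = self.current()
        return [v for v in self.history if v is not cur]


address = FactState()
address.record("12 Oak St", session=3)
address.record("9 Elm Ave", session=17)
```

Here `address.current().value` is the Elm Ave address and `address.superseded()` holds the Oak St entry. Plain storage would keep both strings with no way to tell which one should control the next action.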
Training Signal
Supersede is also an open reinforcement-learning environment. The task presents multi-session interactions and asks the agent to maintain bounded notes as the only carried state. The reward targets temporal fact-currency: credit for the current value and optional penalty when a stale value appears where the current one should control.
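A reward of that shape can be sketched as follows. This is not the released environment's scorer: the function name, the case-insensitive substring matching, and the default penalty of zero are assumptions chosen to keep the example short.

```python
def currency_reward(
    answer: str,
    current: str,
    stale: list[str],
    stale_penalty: float = 0.0,
) -> float:
    """Credit the current value; optionally penalize any stale value in the answer."""
    text = answer.lower()
    # Credit: the current value appears in the answer (assumed substring match).
    reward = 1.0 if current.lower() in text else 0.0
    # Optional penalty: a superseded value appears in the answer.
    if any(s.lower() in text for s in stale):
        reward -= stale_penalty
    return reward


good = currency_reward("She lives in Denver now.", "Denver", ["Austin"])
bad = currency_reward("She lives in Austin.", "Denver", ["Austin"], stale_penalty=0.5)
hedged = currency_reward("Austin or Denver.", "Denver", ["Austin"], stale_penalty=0.5)
```

With the penalty switched on, an answer that hedges by naming both values scores lower than one that commits to the current value, which is the behavior a fact-currency reward is meant to encourage.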
The paper reports a small-model training result using Qwen2.5-3B-Instruct, GRPO, LoRA, and the verifiers / prime-rl stack. In the reported single run, held-out supersession accuracy on real unseen LongMemEval questions rises from 7 of 78, or 9.0 percent, to 13 of 78, or 16.7 percent, by step 175. The paper treats the monotonic checkpoint curve as evidence that the reward is learning something about update behavior rather than only gaming the harness.
That is useful but not magical. The absolute score remains low, and the result is a proof of trainable signal, not a deployment-grade memory system. Its practical value is that memory currency becomes testable. A team can stop asking whether a memory feature "feels persistent" and start asking whether it overwrites stale facts under a measured update workload.
Governance Receipt
A memory-capable agent should produce a supersession receipt. At minimum, the receipt should include the fact identifier, old value, new value, source session for each value, update time, actor or component that wrote the change, memory budget, deletion or retention state of the old value, confidence rule, downstream action that used the fact, and stale-value detector result.
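The fields listed above map naturally onto a record type. The sketch below is one possible shape, not a standard: the class name, field names, and example values are all hypothetical, and the receipt is serialized to JSON only to show that it can be logged and audited.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SupersessionReceipt:
    fact_id: str
    old_value: str
    new_value: str
    old_source_session: str
    new_source_session: str
    updated_at: str            # ISO 8601 timestamp of the update
    written_by: str            # actor or component that wrote the change
    memory_budget_chars: int
    old_value_state: str       # for example "deleted" or "retained"
    confidence_rule: str
    downstream_action: str     # action that used the fact
    stale_value_detected: bool  # stale-value detector result


# Example values are invented for illustration.
receipt = SupersessionReceipt(
    fact_id="user.home_address",
    old_value="12 Oak St",
    new_value="9 Elm Ave",
    old_source_session="session-03",
    new_source_session="session-17",
    updated_at="2026-07-01T09:30:00Z",
    written_by="memory-writer",
    memory_budget_chars=300,
    old_value_state="retained",
    confidence_rule="latest explicit user statement wins",
    downstream_action="shipping-label draft",
    stale_value_detected=False,
)
receipt_json = json.dumps(asdict(receipt), sort_keys=True)
```

Making the record immutable (`frozen=True`) is a deliberate choice in this sketch: a receipt that can be edited after the fact is weaker evidence that the update happened as logged.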
The receipt should also preserve contestability. A user, operator, or auditor needs to ask why the agent believed the current value, which older value it rejected, and whether later use of the old value occurred. For high-impact workflows, the memory system should expose a correction path and forced revalidation before acting on addresses, payments, medical preferences, legal deadlines, safety constraints, or employment instructions.
This changes the design brief. Memory should not be marketed as intimacy or convenience until it can show currency. The first question is not "what did the assistant remember about me?" It is "which remembered facts are current, which are superseded, and how do we know?"
Claim Boundary
The paper is a strong measurement prompt, not a final answer to agent memory. Its training evidence uses one small open model and one reported run. Its benchmark settings simplify the world into known current and stale values. Real systems will face ambiguous corrections, partial updates, conflicting sources, policy-driven retention limits, privacy deletion requests, and actions whose harms are not captured by answer accuracy.
Still, the paper draws a clean line. If an agent is expected to act from memory, stale facts are not just "old data." They are latent decisions. Governance has to inspect the update, not only the recall.
Sources
- Vedant Patel, Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents, arXiv:2606.27472 [cs.CL, cs.AI, cs.LG], submitted June 25, 2026, DOI 10.48550/arXiv.2606.27472.
- arXiv PDF: Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents, reviewed for the LongMemEval setup, full-context versus bounded-memory comparisons, conversation-length study, GRPO training result, and limitations.
- Project repository: Vrin-cloud/supersede, reviewed for the released Supersede environment, verifiers / prime-rl framing, Qwen2.5-3B GRPO LoRA artifact links, and Apache-2.0 repository license.