Memory Depth Becomes the Agent Habit
Haoliang Han's June 2026 arXiv paper separates memory access from memory depth in long-running language agents. Retrieval can fetch a fact. The harder question is which past events should keep shaping behavior after the working context is unloaded.
Retrieval Is Not Depth
The paper, arXiv:2606.26806 [cs.AI; cs.LG], was submitted on June 25, 2026. arXiv lists the exact title as Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents, by Haoliang Han.
The argument is aimed at a common design shortcut in AI agents: treat memory as a retrieval problem. A vector store can preserve and fetch past text. That is memory access. It does not decide which experiences should become durable tendencies after the relevant text leaves the working context.
This is adjacent to the site's pages on agent memory databases and shared-memory governance, but the angle is narrower. The paper asks whether a small parametric store can carry goal-conditioned behavior through interference and context unload, while retrieval remains available for facts.
The Loop-Drift Probe
Han introduces the loop-drift protocol, a synthetic stress test with 10 users and 200 events per user. Streams include stable goal events, distractors, transient opposite requests, conflicts, sibling-user contamination, and explicit factual notes. The important design choice is that context unload clears the working context, not the retrieval index. If a method fails, it is not because the retrieval database vanished.
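The unload rule is easy to misread, so here is a minimal sketch of it. This is not the paper's harness: the AgentMemory class, its method names, and the substring retriever are all assumptions made for illustration. The only behavior it shows is that unloading empties the working context while the retrieval index stays intact.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentMemory:
    """Hypothetical container that keeps working context and retrieval index apart."""
    working_context: List[str] = field(default_factory=list)
    retrieval_index: List[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        # Each event enters both the prompt window and the persistent index.
        self.working_context.append(event)
        self.retrieval_index.append(event)

    def unload_context(self) -> None:
        # Context unload clears only the working context; the index survives.
        self.working_context.clear()

    def retrieve(self, query: str) -> List[str]:
        # Toy substring match standing in for a real retriever (assumption).
        return [e for e in self.retrieval_index if query.lower() in e.lower()]


memory = AgentMemory()
memory.observe("note: the deploy key rotates on Fridays")
memory.observe("goal: prefer concise answers")
memory.unload_context()
```

After the unload, memory.retrieve("deploy key") still returns the factual note, which is the point of the design: any failure on a post-unload probe has to be explained by something other than missing records.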
The paper evaluates Frozen, Summary, RAG, Naive-LoRA, EVAF, and Routed EVAF+RAG. EVAF is a surprise- and valence-gated LoRA consolidation mechanism: events are admitted to a small write buffer when they are both surprising and aligned with durable goal or preference signals; when the buffer fills, a low-rank adapter is updated with replay and an L2 anchor.
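The gate-then-write loop described above can be sketched in a few lines. This is a toy under stated assumptions, not EVAF itself: the Event fields, the threshold values, the buffer capacity, and the anchored_loss helper are hypothetical, the scores are assumed to arrive precomputed, and the actual LoRA update (including replay) is left to a caller-supplied callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    text: str
    surprise: float  # hypothetical score in [0, 1]
    valence: float   # hypothetical goal/preference alignment score in [0, 1]


@dataclass
class GatedWriteBuffer:
    """Toy surprise- and valence-gated buffer; thresholds and capacity are assumptions."""
    consolidate: Callable[[List[Event]], None]
    surprise_min: float = 0.6
    valence_min: float = 0.6
    capacity: int = 4
    pending: List[Event] = field(default_factory=list)
    writes: int = 0

    def offer(self, event: Event) -> bool:
        # Admit only events that clear BOTH gates.
        admitted = (
            event.surprise >= self.surprise_min
            and event.valence >= self.valence_min
        )
        if admitted:
            self.pending.append(event)
            if len(self.pending) >= self.capacity:
                # Buffer is full: trigger one parametric write, then reset.
                self.consolidate(list(self.pending))
                self.pending.clear()
                self.writes += 1
        return admitted


def anchored_loss(task_loss: float, adapter: List[float],
                  anchor: List[float], lam: float = 0.1) -> float:
    # L2 anchor: penalize squared distance from a reference adapter (assumed form).
    drift = sum((a - b) ** 2 for a, b in zip(adapter, anchor))
    return task_loss + lam * drift


batches: List[List[Event]] = []
buffer = GatedWriteBuffer(consolidate=batches.append, capacity=2)
stream = [
    Event("goal: always cite sources", 0.9, 0.9),
    Event("distractor: weather chat", 0.9, 0.1),
    Event("routine restatement of goal", 0.1, 0.9),
    Event("goal: cite sources in replies too", 0.8, 0.7),
]
admitted = [buffer.offer(e) for e in stream]
```

In this toy stream the distractor is surprising but not goal-aligned, and the restatement is goal-aligned but not surprising, so neither is admitted. Only the two events that clear both gates reach the buffer, and they trigger a single write.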
The Depth Flip
The main result is a division of labor. Retrieval is strongest on shallow factual recall, reaching 0.956-0.973 short-fact accuracy across GPT-2 and TinyLlama. EVAF is strongest on goal persistence and post-unload recovery, scoring 0.812-0.904 on those goal-layer probes with only 2-3 parametric writes per 200 events.
That is why the paper's title matters. A memory system may recall an old fact correctly and still fail to keep a long-running agent aligned with a durable preference after the prompt window changes. Conversely, a parametric write mechanism can help with durable behavioral tendencies while being weak on ordinary factual recall. Routed EVAF+RAG is included as a sanity check: it routes factual probes to retrieval and goal probes to EVAF.
The result is not "RAG is bad." In the paper, RAG owns shallow factual access. The claim is that factual access and behavioral depth are different evaluation layers.
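The routed configuration amounts to a dispatch on probe type. The sketch below is one way to picture it, assuming each probe already carries a layer label. The route function, the label strings, and the stand-in backends are hypothetical and say nothing about how the paper implements routing.

```python
from typing import Callable, Dict

Answerer = Callable[[str], str]


def route(probe_kind: str, query: str, backends: Dict[str, Answerer]) -> str:
    # Factual probes go to retrieval; goal-layer probes go to the adapter-backed model.
    if probe_kind == "fact":
        return backends["rag"](query)
    if probe_kind in ("goal", "post_unload"):
        return backends["evaf"](query)
    raise ValueError(f"unknown probe kind: {probe_kind!r}")


# Stand-in backends that only tag the query, so the dispatch is visible.
backends: Dict[str, Answerer] = {
    "rag": lambda q: f"[retrieved] {q}",
    "evaf": lambda q: f"[adapter] {q}",
}
```

Calling route("fact", "when does the key rotate?", backends) reaches the retrieval stand-in, while a "goal" or "post_unload" probe reaches the adapter stand-in. In a deployed agent the hard part is the label, since nothing tells the system in advance which layer a question belongs to.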
Selection and Actuation
The mechanism controls are useful because they rule out a cheap interpretation: that EVAF wins simply by writing less. On GPT-2, a random gate matched to the same write count loses to EVAF on goal and post-unload probes across four seeds. Naive-LoRA writes every event and incurs much higher drift, yet does not cleanly solve the goal/post-unload layer.
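A matched random control can be built by holding the admission count fixed and randomizing which events are admitted. The sketch below assumes that matching admissions under a fixed buffer size also matches the write count. The function name and the example indices are hypothetical and are not data from the paper.

```python
import random
from typing import List, Sequence


def random_matched_gate(num_events: int, evaf_admitted: Sequence[int],
                        seed: int) -> List[int]:
    """Pick as many event indices as the selective gate admitted, uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_events), k=len(evaf_admitted)))


# Hypothetical admitted indices for one 200-event stream (illustrative only).
evaf_admitted = [12, 47, 48, 90, 133, 171]

# One control selection per seed, mirroring a four-seed comparison.
controls = {seed: random_matched_gate(200, evaf_admitted, seed) for seed in range(4)}
```

Because both conditions write equally often, any gap between them on goal and post-unload probes has to come from which events were written, not from how many.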
The paper also separates selection from actuation. Fixed-inner controls vary the number of inner LoRA steps while keeping the EVAF selection gate. Results across GPT-2, TinyLlama, and Mistral-7B show that write strength is model-dependent, and that over-actuation can increase contamination. The 7B result is explicitly narrow: it supports the selection/actuation factorization, not a complete large-model memory system.
Memory Governance
For governance, parametric memory changes the audit surface. Retrieval memory is visible as records: source text, timestamp, owner, retention policy, deletion state, and access control. Parametric consolidation is harder to inspect because the remembered tendency is distributed through an adapter update. If it changes future behavior, it needs its own receipt.
A useful receipt should record the event admitted, gate score or reason, user scope, buffer state, write time, LoRA version, replay set, L2 drift, contamination tests, and rollback path. It should also say which memories remain textual and which have been consolidated. This belongs with AI audit trails, human oversight, and system-card records.
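As a concrete starting point, the receipt could be a plain serializable record. The sketch below is illustrative: the ConsolidationReceipt class, its field names, and the sample values are assumptions rather than a schema from the paper or any existing standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ConsolidationReceipt:
    """Hypothetical audit record for one parametric write; field names are illustrative."""
    event_ids: List[str]              # events admitted to the write buffer
    gate_reason: str                  # gate score or human-readable reason
    user_scope: str                   # whose behavior this write may affect
    buffer_size_at_write: int
    written_at: str                   # ISO 8601 timestamp
    adapter_version: str              # LoRA version produced by this write
    replay_event_ids: List[str]       # replay set used during the update
    l2_drift: float                   # measured distance from the anchor
    contamination_tests_passed: bool
    rollback_to_version: Optional[str]
    still_textual_event_ids: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Stable key order makes receipts easy to diff and hash.
        return json.dumps(asdict(self), sort_keys=True)


# Sample values are made up for illustration.
receipt = ConsolidationReceipt(
    event_ids=["evt-0012", "evt-0047"],
    gate_reason="surprise and valence above threshold",
    user_scope="user-03",
    buffer_size_at_write=2,
    written_at="2026-07-01T12:00:00+00:00",
    adapter_version="lora-v7",
    replay_event_ids=["evt-0003"],
    l2_drift=0.021,
    contamination_tests_passed=True,
    rollback_to_version="lora-v6",
)
```

Serializing with receipt.to_json() gives a record that can sit beside ordinary retrieval logs, so an auditor can trace an adapter version back to the events that shaped it.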
The strongest warning comes from the Memora boundary diagnostic. Public Memora streams include stale evidence and memory mutation. EVAF's improvement on forgetting absence is directionally positive but not statistically significant, and the paper states that current EVAF does not solve delete/update validity. A deep memory that cannot forget correctly is not trustworthy organizational memory. It is an unexpired habit.
Claim Boundary
The paper does not claim universal memory accuracy, broad semantic generalization, biological memory, or solved forgetting. Its strongest claim is narrower: in the loop-drift probe, memory depth can be measured separately from retrieval access, and selective parametric consolidation requires both semantic selection and calibrated actuation.
The practical rule is to govern memory writes, not only memory reads. A long-running agent's habit should be inspectable, scoped, reversible, and invalidated when the underlying memory becomes stale.
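One small piece of that rule can be automated: flagging adapter versions whose source events have gone stale. The sketch below assumes a hypothetical mapping from adapter version to the event IDs consolidated into it. It only identifies candidates for rollback and does not make a model forget anything, which the paper says current EVAF does not solve.

```python
from typing import Dict, List, Set


def adapters_to_invalidate(receipts: Dict[str, List[str]],
                           stale_event_ids: Set[str]) -> List[str]:
    """Return adapter versions whose consolidated events include a stale or deleted record."""
    return sorted(
        version
        for version, event_ids in receipts.items()
        if stale_event_ids.intersection(event_ids)
    )


# Hypothetical version-to-events mapping, e.g. derived from write receipts.
receipts = {
    "lora-v6": ["evt-0003", "evt-0009"],
    "lora-v7": ["evt-0012", "evt-0047"],
}
flagged = adapters_to_invalidate(receipts, {"evt-0047"})
```

The sketch treats versions as independent. If adapters are updated cumulatively, every version written after a flagged one would also need review.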
Sources
- Haoliang Han, Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents, arXiv:2606.26806 [cs.AI; cs.LG], submitted June 25, 2026.
- arXiv PDF: Memory Depth, Not Memory Access, reviewed for the loop-drift protocol, EVAF gate, GPT-2 and TinyLlama results, Mistral-7B actuation controls, RAG and Naive-LoRA comparisons, Memora boundary diagnostic, limitations, and conclusion.
- Related pages: AI Agents, The Agent Memory Becomes the Database Lifecycle, The Shared Memory Becomes the Governance Boundary, AI Audit Trails, Human Oversight in AI, and Model Cards and System Cards.