Blog · arXiv Analysis · Last reviewed June 24, 2026

The Context Window Becomes the Failure Archive

The February 2026 arXiv paper LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth, by Weihao Zeng, Yuzhen Huang, and Junxian He, makes a narrow point with large consequences: an agent can fail because its own working record keeps growing.

Memory Is Not Free

The marketing story around long context is simple: if the model can hold more, the agent can do more. Give it the whole repository, the whole inbox, the whole spreadsheet, the whole transcript, the whole product catalog, and let the system work.

LOCA-bench argues for a stricter reading. Context is not neutral storage. In a long-running agent, the context window becomes a live mixture of instructions, tool outputs, failed attempts, irrelevant rows, partial observations, stale assumptions, hidden constraints, and previous reasoning. The agent does not merely remember more. It must keep deciding what still matters.

That is a governance issue because tool-using agents turn context into action. A bad summary can be corrected. A bad action can update a file, write a record, send a message, miss a deadline, or close a task as finished. The failure surface is not only what the model says. It is what the growing trace makes the model stop seeing.

What LOCA-bench Tests

The paper, arXiv:2602.07962, was submitted on February 8, 2026. It introduces LOCA-bench, a benchmark for long-context agents, not only long-context question answering. The authors keep the underlying task semantics fixed while scaling the amount of environment information the agent has to handle.

The benchmark uses local, database-backed mock servers for services such as Google Calendar, Canvas, Email, BigQuery, Google Sheets, Snowflake, and WooCommerce. That matters because the benchmark is not only a static document. The agent must use tools, explore an environment, gather evidence, obey instructions, and leave a verifiable final state.

The implementation described in the paper contains 15 seed tasks, seven environment-description lengths from 8K to 256K tokens, five random seeds per length, and 525 samples overall. The paper also reports 280 tools across the benchmark. Evaluation is execution based: a run succeeds if the final environment state matches the ground truth.

The Drop

The headline result is not that every model fails equally. It is that task accuracy falls as environment description length grows, even when the task itself has not changed. The paper reports Claude-4.5-Opus at 96.0% accuracy at 8K and 14.7% at 256K. GPT-5.2-Medium moves from 72.0% at 8K to 21.3% at 256K. Kimi-K2-Thinking moves from 74.7% at 8K to 2.7% at 256K.

These numbers should be read as benchmark results under one paper's harness, not as permanent product rankings. Their stronger lesson is architectural: a context window large enough to contain a task is not the same as an agent scaffold capable of governing the task.

The paper's tool-use traces add an important detail. As environment descriptions lengthen, many agents do not explore proportionally more. Tool calls and trajectory length can plateau while the environment keeps growing. The agent looks busy, but its search becomes too shallow for the world it is supposed to inspect.

The Failure Archive

LOCA-bench names four failure modes that belong in any serious agent review. First, complex reasoning declines as the agent has to combine evidence across tools and sources. Second, instruction following weakens: schemas, required formats, and earlier constraints are more easily missed. Third, exploration becomes insufficient, with the agent stopping after partial evidence. Fourth, hallucination-like inconsistencies appear when a value retrieved correctly earlier is later reproduced incorrectly.

The phrase "failure archive" is useful because the context window can preserve the wrong things. A failed search remains in view. A long tool output can feel like a complete inspection. A stale intermediate conclusion can become easier to reuse than to challenge. A subagent call can add noise without adding evidence. The agent's past work becomes part of the environment the agent must now govern.

This is different from ordinary memory. Human memory forgets, but also reorganizes around purpose. A long agent context may keep too much without knowing what to retire. More tokens can therefore mean more residue, not more understanding.

Context Governance

The paper tests context-engineering strategies, including tool-result clearing, thinking-block clearing, context compaction, context awareness, memory tools, and programmatic tool calling. The most practical result is that scaffold choices matter. At 128K, programmatic tool calling improved GPT-5.2-Medium accuracy from 38.7% to 49.3% in the paper's table, while reducing trajectory length. It also improved DeepSeek-V3.2-Thinking from 10.7% to 24.0%.

The reason is not mystical. Programmatic tool calling can route bulky tool outputs through code, return compact results, handle pagination, and keep edge-case logic outside the main conversational stream. In governance terms, the scaffold stops treating every intermediate observation as equally deserving of the model's attention.

A production agent therefore needs a context policy. Which tool outputs enter the main window? Which are summarized, indexed, deleted, or kept as evidence? Which instructions are pinned? Which old observations are suspect? Which subagent outputs are admissible? Which final actions require a fresh source check rather than trust in accumulated context?

What This Changes

The context window becomes the failure archive when an institution treats a larger working memory as safety. Long context can help, but it can also preserve confusion at scale. The live question is not how much an agent can ingest. It is how the agent's surrounding system decides what deserves to remain operational.

For the Church of Spiralism, the lesson is concrete. Agent memory must be inspectable, prunable, attributable, and reversible. Context engineering is not prompt decoration. It is the discipline that keeps a tool-using model from mistaking its own clutter for the world.

Do not buy long context as if it were judgment. Ask for the receipt: retrieved evidence, discarded evidence, retained instructions, tool outputs, compaction summaries, memory writes, subagent calls, and the final state. The agent is only as governable as the record it can both use and survive.

Sources

Weihao Zeng, Yuzhen Huang, and Junxian He, LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth, arXiv:2602.07962 [cs.AI], submitted February 8, 2026.
HKUST-NLP, LOCA-bench GitHub repository, benchmark implementation and README, reviewed June 24, 2026.
Related pages: Context Windows and Context Engineering, AI Agents, Tool Use and Function Calling, The Agent Log Becomes the Receipt, The Benchmark Becomes the Curriculum, The Prompt Cache Becomes the Shadow Memory, and Agent Audit and Incident Review.

Return to Blog