Blog · arXiv Analysis · Published: June 25, 2026

The Long Context Becomes the Evidence Scaffold

A model can be handed the whole record and still fail to use the right sentence. ReContext treats the problem as evidence organization, not only context length.

The Paper

The paper is ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning, arXiv:2607.02509 [cs.AI]. The arXiv record lists Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei, Zhining Liu, Lingjie Chen, Ismini Lourentzou, Hanghang Tong, and Jingrui He as authors, with version 1 submitted on July 2, 2026. The 18-page PDF lists the affiliation as University of Illinois Urbana-Champaign.

The paper studies a familiar failure in long-context systems: the relevant evidence is already in the prompt, but the model does not reliably use it. That distinction matters for every deployment that says the model saw the policy, contract, chat history, codebase, case file, or dossier. Seeing the text is weaker than grounding the answer in the right text.

Context Access Is Not Context Use

Long context windows make it possible to put more material in front of a model. They do not guarantee that the answer will be controlled by the relevant part of that material. The ReContext paper frames the gap as context access versus effective context utilization. In Figure 1, the authors report that the top 0.1 percent of context tokens can account for roughly 50 to 80 percent of accumulated question-conditioned relevance across three LLMs. In a 128K-token context, that is only about 128 tokens.
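The arithmetic behind that figure is easy to check. The sketch below is a toy illustration, not the paper's measurement. It assumes 128K means 128,000 tokens (with 131,072 tokens the count would be about 131), and the function name `top_fraction_share` and the synthetic score list are invented for this example.

```python
def top_fraction_share(scores, fraction):
    """Return (k, share): how many tokens make up the top `fraction`,
    and what share of total relevance mass those tokens hold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have positive total mass")
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, reverse=True)
    return k, sum(ranked[:k]) / total


# Assumption: 128K is read as 128,000 tokens.
context_len = 128_000
k = round(context_len * 0.001)  # 0.1 percent of the context

# Synthetic scores (not from the paper): 128 strongly relevant tokens,
# every other token carrying a tiny amount of relevance.
toy_scores = [1.0] * 128 + [0.001] * (context_len - 128)
k_toy, share = top_fraction_share(toy_scores, 0.001)

print(k, k_toy, round(share, 3))
```

Even with every remaining token contributing a little relevance, the 128 top tokens in this toy distribution hold about half of the total mass, which is the low end of the range the authors report.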

The governance point is direct. A long prompt can create the appearance of procedural completeness while leaving evidence use opaque. If a benefits bot, research assistant, legal-draft tool, or enterprise agent says it used a long file, the useful question is not merely whether the file was present. The useful question is which spans actually shaped the answer.

What ReContext Does

ReContext is a training-free inference method. It does not train a retriever, maintain persistent external memory, prune the original context, or directly modify attention logits during final decoding. Instead, it reads the original prompt, uses question-conditioned internal relevance signals to identify candidate evidence positions, materializes those token hits into grounded text spans, and replays the selected spans near final generation while preserving the full original context.

The method is recursive in a limited, fixed-round sense. It starts with an empty ordered evidence pool. Each round conditions on the original context, the question, and the evidence pool accumulated so far. Newly selected spans are copied from the original context and added to the pool. The final answer is then generated from the full context, the replayed evidence scaffold, and the question. The scaffold is emphasis, not exclusion: the unselected context remains available.
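The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. ReContext reads relevance from model-internal signals, while this sketch swaps in a plain lexical-overlap scorer (`toy_relevance`) so it runs without a model. The names `materialize`, `build_evidence_pool`, and `assemble_prompt`, the whitespace tokenization, and the `rounds`, `top_k`, and `radius` parameters are all hypothetical.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # [start, end) token offsets into the original context


def toy_relevance(context: List[str], question: List[str],
                  pool_tokens: List[str]) -> List[float]:
    """Stand-in scorer (assumption): 1.0 where a context token matches a
    word in the question or in the evidence gathered so far, else 0.0."""
    cues = {w.lower() for w in question} | {w.lower() for w in pool_tokens}
    return [1.0 if tok.lower() in cues else 0.0 for tok in context]


def materialize(hits: List[int], context_len: int, radius: int) -> List[Span]:
    """Expand token hits into windows and merge windows that overlap or touch."""
    spans: List[Span] = []
    for pos in sorted(hits):
        start, end = max(0, pos - radius), min(context_len, pos + radius + 1)
        if spans and start <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


def build_evidence_pool(
    context: List[str], question: List[str], rounds: int = 2, top_k: int = 2,
    radius: int = 2,
    scorer: Callable[[List[str], List[str], List[str]], List[float]] = toy_relevance,
) -> List[Span]:
    pool: List[Span] = []  # starts empty, keeps insertion order
    for _ in range(rounds):  # fixed number of rounds
        pool_tokens = [tok for s, e in pool for tok in context[s:e]]
        scores = scorer(context, question, pool_tokens)
        covered = {i for s, e in pool for i in range(s, e)}
        candidates = [i for i, sc in enumerate(scores)
                      if sc > 0 and i not in covered]
        # Stable sort: ties keep their original left-to-right order.
        hits = sorted(candidates, key=lambda i: -scores[i])[:top_k]
        for span in materialize(hits, len(context), radius):
            if span not in pool:  # new spans may still overlap older ones here
                pool.append(span)
    return pool


def assemble_prompt(context: List[str], pool: List[Span],
                    question: List[str]) -> str:
    """Full context first, then the replayed spans, then the question."""
    evidence = "\n".join(" ".join(context[s:e]) for s, e in pool)
    return (f"Context:\n{' '.join(context)}\n\n"
            f"Replayed evidence:\n{evidence}\n\n"
            f"Question: {' '.join(question)}")


context = ("The bridge was designed by Ada Cole . Ada Cole was born in Leeds . "
           "Tea is popular in Leeds .").split()
question = "Where was the designer of the bridge born ?".split()
pool = build_evidence_pool(context, question)
prompt = assemble_prompt(context, pool, question)
print(pool)
```

Two properties of the sketch mirror the description: every span is copied verbatim from the original context by offset, and the final prompt keeps the whole context alongside the replayed spans rather than replacing it.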

The paper explains the method through associative memory. The context is treated as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. For deployment review, the important part is not the metaphor. It is the separation of evidence organization from answer generation.

What the Results Show

The authors evaluate ReContext on eight long-context benchmarks: Natural Questions, TriviaQA, HotpotQA, PopQA, NarrativeQA, InfBench QA, InfBench MC, and CLIPPER. They test Qwen3-4B, Qwen3-8B, and Llama3.1-8B with 128K context settings, comparing against Vanilla prompting, AttnSharp, DySCO, A-MEM, and DAC.

On the main 128K benchmark table, ReContext obtains the best average rank on all three backbones: 1.00 on Qwen3-4B, 1.46 on Qwen3-8B, and 1.29 on Llama3.1-8B. Averaged over the eight accuracy columns and all three backbones, the paper reports an improvement over Vanilla from 0.24 to 0.30, a 24.6 percent relative gain. The authors also report robustness checks at 64K context and with thinking enabled on Qwen3-4B.
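A quick check on the relative-gain figure: the rounded averages 0.24 and 0.30 give exactly 25 percent, so the reported 24.6 percent presumably comes from unrounded averages. That is an inference from the rounding, not something stated in this post's sources. The sketch below, with a hypothetical helper `relative_gain`, shows that 24.6 percent falls inside the range that two-decimal rounding allows.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Gain computed from the rounded averages quoted above.
rounded_gain = relative_gain(0.24, 0.30)

# Values that round to 0.24 lie in [0.235, 0.245); values that round
# to 0.30 lie in [0.295, 0.305). These bound the possible true gain.
lowest = relative_gain(0.245, 0.295)
highest = relative_gain(0.235, 0.305)
reported = 0.246
consistent = lowest <= reported <= highest

print(f"{rounded_gain:.3f} {lowest:.3f} {highest:.3f} {consistent}")
```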

The result is not that ReContext wins every metric. The paper notes exceptions, including individual QA or F1 metrics where another baseline leads. Its stronger claim is aggregate: making evidence explicit helps long-context performance across model families without throwing away the original context.

The Context-Use Receipt

A long-context answer should carry a context-use receipt. It should record the context window, evidence-selection method, selected spans, replay position, number of replay rounds, token budget, backbone model, whether thinking mode was enabled, and whether the final answer cited or contradicted the selected evidence. For high-stakes work, the receipt should also preserve the unselected context and a route for human challenge.
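One way to make the receipt concrete is a small serializable record. The sketch below is a proposal in the spirit of this post, not a format defined by the paper. Every field name, the use of character offsets, the SHA-256 digest standing in for the preserved context, and the example values (including the contact address) are assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EvidenceSpan:
    start: int  # character offset into the original context (inclusive)
    end: int    # character offset (exclusive)
    text: str   # verbatim copy of context[start:end]


@dataclass
class ContextUseReceipt:
    context_window_tokens: int
    selection_method: str
    selected_spans: List[EvidenceSpan]
    replay_position: str
    replay_rounds: int
    token_budget: int
    backbone_model: str
    thinking_enabled: bool
    answer_cites_evidence: bool
    answer_contradicts_evidence: bool
    # High-stakes extras: a digest tying the receipt to the full,
    # unpruned context, and a route for human challenge.
    full_context_sha256: Optional[str] = None
    challenge_contact: Optional[str] = None

    def spans_are_verbatim(self, context: str) -> bool:
        """True if every recorded span is an exact copy of the context."""
        return all(context[s.start:s.end] == s.text
                   for s in self.selected_spans)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical example values throughout.
context = ("Claims must be filed within 30 days. "
           "Appeals go to the review board.")
quote = "Claims must be filed within 30 days."
start = context.index(quote)

receipt = ContextUseReceipt(
    context_window_tokens=128_000,
    selection_method="question-conditioned internal relevance",
    selected_spans=[EvidenceSpan(start, start + len(quote), quote)],
    replay_position="after full context, before question",
    replay_rounds=2,
    token_budget=256,
    backbone_model="Qwen3-8B",
    thinking_enabled=False,
    answer_cites_evidence=True,
    answer_contradicts_evidence=False,
    full_context_sha256=hashlib.sha256(context.encode("utf-8")).hexdigest(),
    challenge_contact="review@example.org",
)

print(receipt.to_json())
```

The `spans_are_verbatim` check is the part that supports the auditability claim: a reviewer holding the original context can confirm that each emphasized span was copied, not paraphrased.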

That receipt changes the trust claim. "The model had the document" becomes "these copied spans were emphasized before generation, and this answer can be checked against them." It is a smaller claim, but a more auditable one.

Limits

The authors state two practical limits. ReContext needs access to model-internal relevance signals, which limits direct use with closed-source APIs that do not expose attention or similar scoring information. It also adds a read-and-replay stage, increasing inference latency relative to direct full-context decoding, while remaining training-free and avoiding persistent external memory.

The paper should also not be read as proof that evidence replay solves long-context governance. A scaffold can emphasize the wrong spans, omit needed context, or make weak evidence look decisive. Its value is that it makes part of the evidence-use process inspectable. In a world of ever-larger context windows, that inspectability may matter as much as the window size.

Sources

Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei, Zhining Liu, Lingjie Chen, Ismini Lourentzou, Hanghang Tong, and Jingrui He, ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning, arXiv:2607.02509 [cs.AI].
arXiv HTML for ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning, checked for title, abstract, method summary, figures, experiments, and source metadata.
arXiv PDF for ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning, checked for title page, method, benchmark table, robustness tables, implementation notes, runtime section, conclusion, and limitations.
