Last reviewed June 25, 2026

The Agent Breadcrumb Becomes the Oversight Trail

Yujin Zhang and Daye Nam's HANSEL paper treats web-agent verification as an interface problem: users need preserved evidence pages, not just a fluent answer, a pile of logs, or screenshots of what a model claims it saw.

Verification Is Work

The promise of a web agent is delegation: ask it to search, compare, fill, buy, book, or summarize, then receive an answer without watching every click. The governance problem begins at the same point. If the user must rerun the task manually to know whether the answer is correct, delegation has merely moved work into an after-action audit.

A web task is not only a URL and a final sentence. It includes applied filters, search terms, sort order, scroll position, failed branches, pop-ups, stale page state, and partial evidence. A source link can open the right website while losing the exact state that made the answer plausible.

The Paper

arXiv lists HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification as arXiv:2606.18671v1 [cs.HC], submitted June 17, 2026, by Yujin Zhang and Daye Nam of the University of California, Irvine. The paper introduces HANSEL, short for Highlighting Agent Navigation Steps as Evidence Links.

The system takes a web-agent trajectory and extracts the pages and snippets that directly support the final answer. Instead of making the user read the whole trajectory, it presents a smaller set of interactive evidence pages with relevant state preserved. If the agent's final claim cannot be traced to a visited page, HANSEL is designed to make that gap visible.
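The traceability idea can be illustrated with a minimal sketch. This is not HANSEL's pipeline: the `Page` structure, the `find_support` function, the example URLs, and the plain substring match are all assumptions made for illustration, standing in for the model-based evidence selection the paper describes.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One page visited by the agent (hypothetical structure)."""
    step: int
    url: str
    text: str


def find_support(claims: list[str], trajectory: list[Page]) -> dict[str, list[int]]:
    """Map each claim to the steps whose page text contains it.

    Substring matching is a stand-in assumption for a real evidence selector.
    An empty list marks a claim that cannot be traced to any visited page.
    """
    support: dict[str, list[int]] = {}
    for claim in claims:
        needle = claim.lower()
        support[claim] = [p.step for p in trajectory if needle in p.text.lower()]
    return support


# Illustrative trajectory; the URLs and page text are made up.
trajectory = [
    Page(1, "https://example.com/search", "Results for gyms near me"),
    Page(2, "https://example.com/gym-a", "Gym A: open until 10 pm, day pass $15"),
]
support = find_support(["day pass $15", "free parking"], trajectory)
unsupported = [claim for claim, steps in support.items() if not steps]
print(support)
print(unsupported)
```

The point of the sketch is the empty list: a claim with no supporting step is surfaced as a gap instead of being silently accepted.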

Why Logs Fail

The paper's target is not transparency in the abstract. It names four familiar mechanisms and explains why each can fail as oversight. Full trajectory logs may contain every action, but completeness becomes overload. Source links can lose the page state that matters. Screenshots preserve appearance, but not interaction; the user cannot scroll, sort, click, or inspect hidden options. Natural-language summaries can be incomplete or unfaithful, and they can rationalize an incorrect path in a persuasive voice.

The diagnosis separates evidence from performance theater. A long log is not necessarily inspectable. A source citation is not necessarily a reconstruction. A screenshot is not necessarily a usable page. A polished explanation is not necessarily evidence.

Breadcrumbs, Not Stories

HANSEL's useful premise is that the oversight unit should be an evidence page: a live, navigable reconstruction of a page the agent actually used, with relevant snippets highlighted and the agent's state replayed. The paper describes preserving details such as filters, search queries, scrolling, inputs, and selections. It also describes a grid view for an overview and a carousel view for step-by-step inspection.

This changes the user's authority. The user can test the agent's answer inside the evidentiary context instead of trusting a story about the context. If the agent stopped halfway, the user may be able to continue from the recovered state.

What Was Evaluated

The authors first analyzed 45 web-agent tasks: 22 from AssistantBench and 23 from Online-Mind2Web. Their trajectories included 592 steps and 271 pages. The paper reports that only 150 steps and 98 pages were evidence-bearing, roughly a quarter of the steps and just over a third of the pages, which is the core empirical reason a smaller evidence interface can work.

In the technical evaluation, HANSEL identified evidence pages with 83.7 percent precision, 88.8 percent recall, and 0.861 F1 across those 45 tasks. The paper reports that the extracted evidence set reduced trajectory-page volume from 271 pages to 104 evidence pages, a 61.6 percent reduction. For evidence snippets, the authors report 141 valid snippets out of 159 reviewed, or 88.7 percent precision.
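The reported percentages can be checked against the counts quoted in the paragraph above. The short calculation below uses only those figures; the variable names are illustrative, and the F1 value is recomputed from the rounded precision and recall, so it can differ from the reported figure in the last digit.

```python
# Figures as quoted in the text above.
trajectory_pages = 271
evidence_pages = 104
valid_snippets = 141
reviewed_snippets = 159
precision = 0.837
recall = 0.888

# Share of trajectory pages the user no longer has to inspect.
reduction = (trajectory_pages - evidence_pages) / trajectory_pages

# Fraction of reviewed snippets judged valid.
snippet_precision = valid_snippets / reviewed_snippets

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"page reduction: {reduction:.1%}")
print(f"snippet precision: {snippet_precision:.1%}")
print(f"F1 from rounded inputs: {f1:.3f}")
```

The page reduction and snippet precision reproduce the reported 61.6 and 88.7 percent. The recomputed F1 comes out about 0.001 above the reported 0.861, a gap that rounding the inputs would explain.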

The human study involved 14 participants and eight web-based tasks per participant, with four correct and four incorrect agent answers. The accuracy difference was a trend, not a statistically significant result: 75.0 percent in the baseline condition and 82.14 percent with HANSEL, with p = .29. The stronger findings were about effort and usability. The paper reports lower perceived effort with HANSEL and participant ratings that favored HANSEL for usability, verification ease, and error identification. It also records a warning: when the agent was wrong, HANSEL users still accepted the wrong answer in 32.1 percent of cases.

Governance Reading

This page belongs beside the related pages on agent traces as process maps, agent logs as receipts, agent benchmarks as attack surfaces, agentic browsers as assistive interfaces, and human oversight of AI. The shared question is whether the surrounding institution can inspect the path that made the answer action-worthy.

For web agents, "human in the loop" is too vague. A tired user staring at logs is technically in the loop, but functionally under-equipped. A better standard is inspectable delegation: the agent may act, but the user receives the minimum evidence needed to verify, correct, or reject the result.

Limits

HANSEL does not solve agent trust by itself. Its extraction pipeline depends on trajectory quality and on the model that selects evidence pages and snippets. The user study is small, and the paper itself treats the accuracy improvement as non-significant. Preserved evidence may also increase misplaced confidence if users treat highlighted snippets as a verdict instead of a starting point.

The most careful reading is therefore procedural. HANSEL is evidence plumbing, not automatic truth. It shows what an oversight interface should expose: visited pages, reconstructed state, highlighted support, missing-support gaps, and a route for correction.

Oversight Receipt

A web-agent oversight receipt should record: user request, model and scaffold, trajectory version, visited pages, evidence pages, discarded branches, highlighted snippets, state reconstruction steps, filters, search queries, sort order, scroll positions, unsupported final claims, user corrections, and final user disposition. The audit-grade sentence is: the agent made this claim, these reconstructed pages supported it, these gaps remained, and this human decision followed.
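One way to make the receipt concrete is a small record type. The sketch below is an assumption, not a format from the paper: the `OversightReceipt` class, its field names, and the example values are hypothetical, and it carries only a subset of the fields listed above to stay short.

```python
from dataclasses import dataclass, field


@dataclass
class OversightReceipt:
    """Hypothetical record; field names paraphrase the list in the text."""
    user_request: str
    model_and_scaffold: str
    trajectory_version: str
    final_claim: str
    evidence_pages: list[str] = field(default_factory=list)
    unsupported_claims: list[str] = field(default_factory=list)
    user_corrections: list[str] = field(default_factory=list)
    final_disposition: str = "pending"

    def audit_sentence(self) -> str:
        """Render the claim, its support, its gaps, and the human decision."""
        pages = ", ".join(self.evidence_pages) or "no reconstructed pages"
        gaps = ", ".join(self.unsupported_claims) or "none"
        return (
            f"The agent claimed: {self.final_claim}. "
            f"Supporting pages: {pages}. "
            f"Remaining gaps: {gaps}. "
            f"Human decision: {self.final_disposition}."
        )


# Illustrative values only; none of these come from the paper.
receipt = OversightReceipt(
    user_request="Find a day pass under $20",
    model_and_scaffold="example-model / example-scaffold",
    trajectory_version="v1",
    final_claim="Gym A sells a $15 day pass",
    evidence_pages=["https://example.com/gym-a"],
    final_disposition="accepted",
)
print(receipt.audit_sentence())
```

Keeping unsupported claims and the final disposition as explicit fields means a receipt cannot be rendered without stating what remained unverified and what the human decided.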

Sources

Yujin Zhang and Daye Nam, HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification, arXiv:2606.18671v1 [cs.HC], submitted June 17, 2026.
Primary arXiv versions checked: metadata API record, PDF, and experimental HTML, reviewed for title, authorship, submission date, abstract claims, HANSEL acronym, evidence-page design, preserved page state, evaluation datasets, 45-task technical evaluation, 14-participant user study, reported precision/recall/F1, trajectory-volume reduction, effort findings, and trust-related limitations.
Related pages: The Agent Trace Becomes the Process Map, The Agent Log Becomes the Receipt, The Agent Benchmark Becomes the Attack Surface, The Agentic Browser Becomes the Assistive Interface, and Human Oversight of AI.
