Blog · arXiv Analysis · Published: June 25, 2026

The Web Agent Row Becomes the Receipt

Minbyul Jeong's Ko-WideSearch paper shifts web-agent evaluation from the drama of one hidden answer to the quieter work of completing every row in a sourced table.

The Paper

The paper is Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents, arXiv:2606.27595 [cs.CL]. arXiv lists it as submitted on June 25, 2026, by Minbyul Jeong, with DOI 10.48550/arXiv.2606.27595. The paper introduces a Korean benchmark for web agents that must enumerate complete sets and fill attribute tables from live web sources.

The site already has pages on WebArena, BrowserGym, fingerprinted web agents, and GUI-agent planning. Ko-WideSearch adds a narrower failure mode: the agent may find most of the members of a set, yet still fail the table because individual rows remain incomplete or wrong.

Breadth Search

Many web-agent benchmarks measure depth. The system follows clues, searches across pages, and returns a single answer. Ko-WideSearch asks for breadth instead. A task names a parent entity, such as a TV season, dynasty, league, administrative region, or election, and asks for the full membership plus attributes for every member.

This is a different kind of work. A web agent answering one fact can be lucky, narrow, or overconfident. A web agent building a table has to know when the set boundary is closed, keep track of what it has already found, avoid invented rows, and attach the right cell values to the right entities. The output is not a sentence. It is a structured claim about completeness.

Gold Table

Ko-WideSearch contains 228 Korean breadth-search tables over 190 set-parent entities, sixteen categories, three difficulty tiers, 4,262 gold rows, and 14,560 attribute cells. The paper's difficulty design uses two knobs: table width and a two-dimensional composite key. In the hard tier, membership can become a cross-product rather than a simple list.
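The composite-key idea can be pictured with a small sketch. It assumes, purely for illustration, that a hard-tier table is keyed by two dimensions and that every pairing is a required row; the season and team values and the helper names are hypothetical and do not come from the paper.

```python
from itertools import product

# Hypothetical dimension values, not taken from the benchmark.
seasons = ["2022", "2023"]
teams = ["Team A", "Team B", "Team C"]

def expected_keys(dim_a, dim_b):
    """Every (a, b) pair that a two-dimensional table would need to cover."""
    return set(product(dim_a, dim_b))

def missing_keys(found, dim_a, dim_b):
    """Composite keys in the cross-product that the agent never produced."""
    return expected_keys(dim_a, dim_b) - set(found)

# The agent found three of the six required (season, team) rows.
found = [("2022", "Team A"), ("2022", "Team B"), ("2023", "Team A")]
gaps = missing_keys(found, seasons, teams)
print(len(expected_keys(seasons, teams)), sorted(gaps))
```

With two seasons and three teams the table needs six rows, so an agent that lists every team once has still covered only part of the membership.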

The construction method matters as much as the benchmark. Jeong describes an automated synthesize-and-verify pipeline: a build agent constructs candidate gold tables through exhaustive web search, then independent gates test non-memorizability, completeness, and cross-source attribute verification. The paper also uses a normalization-aware comparator so stable date and count columns are not unfairly rejected for formatting differences.

That is the governance lesson. If agents are going to be evaluated on the live web, the benchmark itself needs provenance. A gold table is not magic ground truth. It is a maintained artifact with sources, exclusions, column types, required columns, volatile as-of dates, and an evaluation contract.
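A normalization-aware comparator of the kind described above might look roughly like the following. This is a sketch under assumptions, not the paper's scorer: the column type names ("date", "count"), the accepted date formats, and the function names are all hypothetical.

```python
import re
from datetime import date

def normalize_count(text):
    """Keep only the digits, so '1,234' and '1234명' compare equal.

    Assumption: count columns hold non-negative integers.
    """
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None

def normalize_date(text):
    """Parse year, month, day from forms like 2024-03-05 or 2024년 3월 5일."""
    m = re.search(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})", text)
    if not m:
        return None
    year, month, day = map(int, m.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13
        return None

def cells_match(gold, pred, column_type):
    """Compare two cell strings after type-specific normalization."""
    if column_type == "date":
        g, p = normalize_date(gold), normalize_date(pred)
        return g is not None and g == p
    if column_type == "count":
        g, p = normalize_count(gold), normalize_count(pred)
        return g is not None and g == p
    # Fallback for free-text columns: ignore case and outer whitespace.
    return gold.strip().casefold() == pred.strip().casefold()

print(cells_match("2024-03-05", "2024년 3월 5일", "date"))
print(cells_match("1,234", "1234명", "count"))
```

The design point is that the column type belongs to the evaluation contract: the comparator is only as fair as the per-column format declared for the gold table.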

Row Gap

The main result is blunt: across twenty web-agent systems, membership recovery is much stronger than full-row completion. The paper reports that GPT-5.5 reaches an Item-F1 of 92.8 but a Row-F1 of only 53.7, with table success at 19.3 percent. In other words, the strongest evaluated model recovers most of the set, yet far fewer rows survive with all required cells correct, and fewer than one table in five is exactly complete.

The paper also reports that accuracy falls as the difficulty knobs harden, and that neither more search nor more spending closes the gap. The strongest open-weight model reported, DeepSeek-V4-Pro, reaches Row-F1 45.0, which the paper describes as competitive with mid-pack proprietary systems. Korean-specialized systems do not close the gap to the frontier models in this benchmark.

For deployment, this distinction is critical. A user may inspect the item list, see familiar names, and believe the agent succeeded. The row metric asks a harder question: did each recovered item carry the right attributes, or did the agent produce a plausible inventory with hidden holes?
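The gap between item-level and row-level scores can be shown with a toy scorer. The definitions here are simplifying assumptions, not the paper's exact metrics: an item counts if its membership key appears in the gold table, a row counts only if every required cell also matches exactly, and a table succeeds only if it is exactly complete. The data and function names are hypothetical.

```python
def f1(true_pos, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)

def score_table(gold, pred, required):
    """gold and pred map a membership key to a dict of column -> value."""
    shared = gold.keys() & pred.keys()
    # A row survives only if all required cells agree with the gold row.
    full_rows = [k for k in shared
                 if all(pred[k].get(c) == gold[k][c] for c in required)]
    item_f1 = f1(len(shared), len(pred), len(gold))
    row_f1 = f1(len(full_rows), len(pred), len(gold))
    success = len(full_rows) == len(gold) == len(pred)
    return item_f1, row_f1, success

gold = {
    "A": {"year": "2001", "city": "Seoul"},
    "B": {"year": "2002", "city": "Busan"},
    "C": {"year": "2003", "city": "Daegu"},
    "D": {"year": "2004", "city": "Incheon"},
}
pred = {
    "A": {"year": "2001", "city": "Seoul"},
    "B": {"year": "2002", "city": "Busan"},
    "C": {"year": "2003", "city": "Seoul"},   # wrong city
    "D": {"year": "2005", "city": "Incheon"}, # wrong year
}
print(score_table(gold, pred, ["year", "city"]))  # (1.0, 0.5, False)
```

Every member is present, so the item score is perfect, yet half the rows fail and the table as a whole does not pass. That is the pattern a reader skimming the name column would miss.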

Regional Web

Ko-WideSearch is also a reminder that web-agent capability is regional. Korean search tasks are not English search tasks translated into another language. The paper argues that an agent has to navigate Korean sources whose structure, terminology, and search conventions differ from English sources. Static Korean language benchmarks are not enough to test live browsing, evidence maintenance, and cross-page synthesis.

This matters for procurement. An enterprise, public agency, newsroom, or researcher should not accept a global web-agent score as proof of local competence. The relevant question is whether the agent can search the actual language, websites, tables, election pages, league records, administrative sources, and page conventions that its users depend on.

Row Receipt

A web-agent table should leave a row receipt. At minimum, that receipt should preserve the task question, as-of date, source URLs, membership key, excluded entities, required columns, per-column formats, row-match rule, cell comparator, missing-cell convention, search trace, parser result, Item-F1, Column-F1, Row-F1, table-success status, and human review decision.
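As a rough illustration, the receipt could be carried as a single serializable record. The field names below mirror the list above but are otherwise hypothetical, as are the example values and the `unfilled` helper; nothing here is a format defined by the paper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class RowReceipt:
    # What was asked, and when the answer was valid.
    task_question: str
    as_of_date: str
    source_urls: list[str]
    # How rows and cells are identified and compared.
    membership_key: list[str]
    required_columns: list[str]
    column_formats: dict[str, str]
    row_match_rule: str
    cell_comparator: str
    missing_cell_convention: str
    excluded_entities: list[str] = field(default_factory=list)
    # What the agent did and how it scored.
    search_trace: list[str] = field(default_factory=list)
    parser_ok: bool = False
    item_f1: Optional[float] = None
    column_f1: Optional[float] = None
    row_f1: Optional[float] = None
    table_success: Optional[bool] = None
    human_review: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)

    def unfilled(self) -> list[str]:
        """Names of fields still empty, so a reviewer sees what is missing."""
        return [k for k, v in asdict(self).items() if v in (None, [], {}, "")]

receipt = RowReceipt(
    task_question="List every member of the set with its start year",
    as_of_date="2026-06-25",
    source_urls=["https://example.com/members"],  # placeholder URL
    membership_key=["name"],
    required_columns=["name", "start_year"],
    column_formats={"start_year": "YYYY"},
    row_match_rule="exact match on membership key",
    cell_comparator="normalized per column type",
    missing_cell_convention="empty string",
)
print(receipt.unfilled())
```

Listing the unfilled fields makes an incomplete receipt visible before anyone treats the table as reviewed.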

The receipt should also distinguish failure types. Did the agent omit a member, invent a member, attach a correct value to the wrong row, leave a cell blank, fail to parse its own output, or use a stale source? Those failures have different operational consequences. A missing row in a public-benefits list is not the same as a wrong date in a sports table, and neither should be hidden behind a single "web task passed" label.
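Some of these failure types can be separated mechanically by comparing the agent's table with the gold table. The sketch below is one possible labelling under stated assumptions: tables are dicts keyed by membership key, cells are compared by exact equality, and a "misattached" value is one that is wrong for its row but matches the gold value of a different row in the same column. Parse failures and stale sources need the search trace and are left out. All names are hypothetical.

```python
from enum import Enum

class Failure(Enum):
    OMITTED_MEMBER = "omitted member"
    INVENTED_MEMBER = "invented member"
    MISATTACHED_VALUE = "value belongs to another row"
    BLANK_CELL = "blank cell"
    WRONG_VALUE = "wrong value"

def classify(gold, pred, required):
    """Return a list of (failure, key, column) findings."""
    findings = []
    for key in gold.keys() - pred.keys():
        findings.append((Failure.OMITTED_MEMBER, key, None))
    for key in pred.keys() - gold.keys():
        findings.append((Failure.INVENTED_MEMBER, key, None))
    for key in gold.keys() & pred.keys():
        for col in required:
            want, got = gold[key][col], pred[key].get(col)
            if got == want:
                continue
            if got in (None, ""):
                kind = Failure.BLANK_CELL
            elif any(gold[o][col] == got for o in gold if o != key):
                kind = Failure.MISATTACHED_VALUE
            else:
                kind = Failure.WRONG_VALUE
            findings.append((kind, key, col))
    return findings

gold = {"A": {"year": "2001"}, "B": {"year": "2002"}, "C": {"year": "2003"}}
pred = {"A": {"year": "2002"}, "B": {"year": ""}, "D": {"year": "1999"}}
for finding in classify(gold, pred, ["year"]):
    print(finding)
```

Keeping the findings as separate labels lets a reviewer weigh an omitted member differently from a blank cell instead of collapsing both into one pass or fail flag.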

Claim Boundary

The paper names limits that should travel with the benchmark. The hard two-dimensional tier is sports-season heavy, each table is anchored to one primary membership URL rather than one source URL per attribute, and results are measured under a single harness, search backend, and budget. Because the web shifts, volatile tables carry as-of dates and need periodic revalidation. The paper says the pipeline and scorer are released under MIT, while evaluation data is distributed by request so live-search agents cannot simply find the posted gold set.

That boundary makes the paper more useful, not less. Ko-WideSearch does not prove that web agents are ready to operate unattended. It shows why a browser agent's answer needs to be judged at the level where real work breaks: the row, the cell, the source, and the date.
