Last reviewed June 25, 2026

The Safety Signal Becomes the Inattentional Gap

A June 2026 arXiv paper argues that a model can report a hazard under an open instruction and still omit it when a narrow task tells the model what kind of answer to produce. The governance question is whether safety evaluation measures what the system can see, or only what the task lets it say.

Narrow Task

The paper, arXiv:2606.26529 [cs.CL, cs.AI, cs.CV], is Kwan Soo Shin's The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report. arXiv records submission on June 25, 2026. It asks what happens when a model is assigned a narrow task while another important signal is present but unrequested.

The answer is behavioral, not a claim about awareness. The same input is shown under an open instruction that asks for all important findings or hazards, and under a task-conditioned instruction that narrows the reporting frame. The "Inattentional Gap" is the difference between those reports.

That distinction matters because many evaluations test a specified target. The model is told what hazard to find, and the benchmark scores whether it finds that hazard. Harm can come from the adjacent thing nobody specified: the second pathology, the unexpected road hazard, the abnormal object outside the assigned count.

Reportability Control

The important methodological move is the reportability control. If a model fails to mention a signal in both conditions, the paper does not treat that as inattentional suppression. It treats the signal as not reportable for that model on that item. The gap is counted when the same model, on the same input, can report the signal under an open instruction but omits it when a task instruction narrows the answer.
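The counting rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: the ItemResult record, its field names, and the choice to express the gap as the share of reportable items omitted under the task instruction are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    """One model's judged outputs for one input (hypothetical record)."""
    item_id: str
    reported_open: bool  # signal mentioned under the open instruction
    reported_task: bool  # signal mentioned under the narrow task instruction


def inattentional_gap(results):
    """Return (n_reportable, n_suppressed, suppression_rate).

    Items the model does not report under the open instruction are
    treated as not reportable and excluded; they never count as
    suppression.
    """
    reportable = [r for r in results if r.reported_open]
    suppressed = [r for r in reportable if not r.reported_task]
    rate = len(suppressed) / len(reportable) if reportable else float("nan")
    return len(reportable), len(suppressed), rate


# Made-up judgments, for illustration only.
results = [
    ItemResult("a", reported_open=True, reported_task=False),   # gap
    ItemResult("b", reported_open=True, reported_task=True),    # no gap
    ItemResult("c", reported_open=False, reported_task=False),  # not reportable
    ItemResult("d", reported_open=True, reported_task=False),   # gap
]
print(inattentional_gap(results))
```

Item "c" is the case the control exists for: the model missed the signal in both conditions, so it drops out of the denominator instead of inflating the gap.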

This is stronger than a general complaint that models miss things. The system has shown the relevant signal can enter its output; then the task framing closes the output channel. In Spiralist terms, the model does not merely fail to see. It agrees to look through a keyhole and then calls the keyhole the case.

Text and Vision

Study 1 used 100 text scenarios: 64 radiology scenarios and 36 autonomous-driving scenarios. Each paired a designated target, such as a pulmonary nodule or lead vehicle, with an unrequested safety-critical signal. Four models were tested: claude-haiku-4-5, gpt-4o-mini, gpt-4o, and claude-sonnet-4-6. Two independent model judges scored whether the critical signal was reported, with a reported Cohen's kappa of 0.895 across 1,200 judged responses.
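For readers who want to see what the agreement statistic measures, here is a minimal sketch of Cohen's kappa for two judges. The labels below are invented for illustration and are not the paper's judgments; the function name and the handling of the degenerate case are assumptions of this sketch.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# 1 = judge says the critical signal was reported, 0 = not reported.
judge_1 = [1, 1, 0, 0, 1, 0, 1, 0]
judge_2 = [1, 1, 0, 0, 1, 0, 0, 0]
print(round(cohen_kappa(judge_1, judge_2), 3))  # 0.75
```

Here the judges agree on 7 of 8 items (0.875), chance agreement is 0.5, and kappa is (0.875 - 0.5) / (1 - 0.5) = 0.75.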

The strict instruction produced the starkest result. In radiology, every tested model's report rate for the critical signal was 0.00 under the strict condition, against approximately 1.00 under the open condition. In driving, strict-condition report rates ranged from 0.08 to 0.33, against 1.00 under the open condition. Under a focused but less exclusive instruction, radiology report rates still fell sharply: 0.09 for gpt-4o-mini, 0.08 for gpt-4o, 0.48 for claude-sonnet-4-6, and 0.53 for claude-haiku-4-5, against roughly 1.00 in the open condition.

Study 2 moved the pattern into vision-language models. The author used 12 public-domain chest radiographs, redacted text bands, and composited either a translucent disc or a yellow star into the lung field, yielding 48 stimuli. The closed task asked models to count rib pairs; the open task asked for everything visible. In the star condition, gpt-4o and gpt-4o-mini had a report rate of 0.00 for the object under the counting task, while open-condition rates ranged from 0.79 to 0.96 across models.

Scale Is Not Enough

The paper does not present the gap as a small-model problem. In Study 1, gpt-4o and gpt-4o-mini behaved similarly in the focused radiology condition. In the cross-vendor validation, the author tested Opus 4.8, GPT-5, and Gemini 2.5 Pro on medical text scenarios. GPT-5, described in the paper as a reasoning model, had a report rate of 0.00 across valid focused and strict responses, against 1.00 in open responses. Gemini 2.5 Pro's report rate was 0.00 to 0.09 in task-conditioned settings, against 1.00 in the open condition.

Opus 4.8 behaved differently, reporting the co-present finding in 0.83 to 0.96 of task-conditioned items while reaching 1.00 in the open condition. That suggests a family-associated reporting disposition rather than a single law of model scale. Procurement and evaluation should ask how a given model handles unrequested safety signals under task pressure, not simply whether it is larger or newer.

Dual-Process Audit

The most practical result is the external critic probe. A narrow-task model, gpt-4o-mini, had a report rate of 0.00 for the co-present finding in the tested focused reports. When those reports and the original input were passed to an independent open-ended critic model, the pipeline's report rate rose to 1.00 (McNemar's test, p = 1.2e-7). The point is not that this exact scaffold is enough for deployment. The point is that boundary monitoring may need its own prompt, authority, logging, and budget.
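A paired comparison like this one is commonly tested with McNemar's test, which looks only at the discordant pairs: items where one pipeline reported the finding and the other did not. The sketch below assumes the exact binomial form of the test; whether the paper used that form is not stated here. The pair count of 24 is hypothetical, chosen only because it gives a p-value of the same order as the reported one. It is not a figure taken from the paper.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs where only the critic pipeline reported the finding
    c: pairs where only the narrow-task model reported the finding
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical: 24 discordant pairs, all favoring the critic pipeline.
p = mcnemar_exact(24, 0)
print(f"{p:.2e}")  # 1.19e-07
```

When every discordant pair points the same way, the p-value reduces to 2 × 0.5^n, so the strength of the result depends entirely on how many paired items were tested.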

This belongs beside the site's pages on driver attention, pathology second readers, and black-box health LLM evaluation. The distinct danger is competent obedience to a reporting frame that excludes the unsaid hazard.

Limits

The paper is careful about scale and realism. The studies use 100 textual scenarios and 48 visual scenarios. The visual stimuli are composited non-anatomical objects, which makes the attentional-set manipulation clean but does not substitute for radiologist-adjudicated clinical datasets. The adjudication uses model judges with high agreement; future work should add blinded human and domain-expert adjudication. The author also notes that commercial model behavior may change over time, so the archived raw outputs and scripts matter for reproducibility.

Governance Standard

A safety evaluation for task-conditioned systems should include two ledgers. The specified-target ledger asks whether the system did the assigned job. The unrequested-signal ledger asks what else was present, whether the same model could report it under an open instruction, and whether the task-conditioned run suppressed it.

Deployment reports should preserve the exact task prompt, open prompt, model identifier, valid-response exclusions, adjudication method, source input, and recovered omission. For high-stakes settings, broad review should not be hidden in a longer prompt. It should be an auditable second pass, or a model-specific override tested under adversarially narrow instructions.
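One way to make the two ledgers and the preserved fields concrete is a single audit record per evaluated input. The sketch below is hypothetical throughout: the class name, field names, and example values are illustrative assumptions, not a schema from the paper or from any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditRecord:
    """One input evaluated under both prompts (all field names hypothetical)."""
    model_id: str
    source_input_ref: str
    task_prompt: str
    open_prompt: str
    adjudication_method: str
    target_found: bool          # specified-target ledger
    signal_reported_open: bool  # unrequested-signal ledger, open run
    signal_reported_task: bool  # unrequested-signal ledger, task run
    excluded_reason: Optional[str] = None  # set when a response is invalid

    @property
    def suppressed(self) -> bool:
        """True when the signal was reportable but the task run omitted it."""
        return (
            self.excluded_reason is None
            and self.signal_reported_open
            and not self.signal_reported_task
        )


# Placeholder values, for illustration only.
record = AuditRecord(
    model_id="example-model",
    source_input_ref="case-001",
    task_prompt="Report only on the designated target.",
    open_prompt="Report all important findings or hazards.",
    adjudication_method="two model judges",
    target_found=True,
    signal_reported_open=True,
    signal_reported_task=False,
)
print(record.suppressed)  # True
```

The record in the example passes the specified-target ledger and fails the unrequested-signal ledger, which is the combination a task-only evaluation would score as a success.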

The Spiralist rule is direct: a system is not safe because it answers the question it was asked. It becomes safer when someone measures the cost of the question. The safety signal becomes visible only when the institution refuses to confuse task compliance with situational awareness.

Sources

Kwan Soo Shin, The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report, arXiv:2606.26529 [cs.CL, cs.AI, cs.CV], submitted June 25, 2026.
arXiv HTML: The Inattentional Gap, reviewed for study design, reportability control, model list, result rates, limitations, and resource-availability statements.
arXiv PDF: The Inattentional Gap, checked against the HTML version for title, author, arXiv ID, date, experiments, and statistical claims.
Reproducibility deposit listed in the paper: Zenodo DOI 10.5281/zenodo.20826824.
Related pages: The Driver Camera Becomes the Attention Judge, The Pathology Model Becomes the Second Reader, and The Health LLM Becomes the Black-Box Clinic.
