The Coordinate List Becomes the Interference Surface
Dense visual grounding is not only a localization task. Once a vision-language model must emit a long list of class-coordinate records, fine-tuning changes how the model serializes, repeats, and stops. This paper is useful because it turns that failure mode into a measurable surface rather than treating duplicate records as a nuisance to clean up after the benchmark score is reported.
The Paper
The paper is Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models, arXiv:2606.14507 [cs.AI], by Chenyu Zhou, Qiliang Jiang, and Boguang Pan. The arXiv HTML lists affiliations with the School of Engineering at Institute of Science Tokyo, the College of Control Science and Engineering at Zhejiang University, and the Graduate School of Information, Production and Systems at Waseda University. arXiv lists version 1 as submitted on June 12, 2026, with DOI 10.48550/arXiv.2606.14507.
The paper studies what happens when vision-language models are fine-tuned to emit dense bounding-box coordinate lists. The headline is not just that grounding improves. It is that localization gain and repeated-record pressure can appear on the same generation surface, then be separated with lightweight controls.
The arXiv record reviewed here links the abstract, HTML, PDF, and TeX source. At the time of review, the arXiv metadata did not expose a separate public code repository.
The List Problem
A dense bounding-box output is a sequence of object records. Each record couples a class label with coordinate numbers, separators, object boundaries, and a final list closure. That stresses a model differently from short visual question answering, captioning, or single-box grounding.
The hard part is that repetition is both legitimate and dangerous. An image can contain several instances of the same class, so the model cannot simply avoid repeated class labels. But exact repeated object records are a structural error: the same class and coordinate payload appears again, creating a duplicated detection tail.
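The distinction between a repeated class and a repeated record can be made concrete with a small counting sketch. The field names `label` and `bbox`, the rounding rule, and the exact definition of duplicate rate (records beyond the first copy, divided by all records) are assumptions for illustration; the paper may normalize and count differently.

```python
from collections import Counter

def normalize_record(record, ndigits=0):
    """Reduce a record to a hashable key: class label plus rounded coordinates."""
    label = str(record["label"]).strip().lower()
    coords = tuple(round(float(v), ndigits) for v in record["bbox"])
    return (label, coords)

def duplicate_stats(records):
    """Return (duplicate_rate, max_repeat) for one image's predicted records.

    duplicate_rate is assumed here to be the share of records that repeat
    an earlier identical record; max_repeat is the largest copy count.
    """
    counts = Counter(normalize_record(r) for r in records)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0
    extra = total - len(counts)  # copies beyond the first of each record
    return extra / total, max(counts.values())

# Four records of one class: two distinct boxes, one of them emitted three times.
records = [
    {"label": "insulator", "bbox": [10, 20, 50, 60]},
    {"label": "insulator", "bbox": [70, 20, 110, 60]},  # same class, new box: legitimate
    {"label": "insulator", "bbox": [10, 20, 50, 60]},   # exact repeat
    {"label": "insulator", "bbox": [10, 20, 50, 60]},   # exact repeat
]
rate, max_repeat = duplicate_stats(records)  # -> (0.5, 3)
```

The second record shares a class with the first but is not counted as a duplicate, which is the behavior the paragraph above asks for.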
The paper frames this as a generation surface, meaning the model's distribution over structured outputs under a fine-tuning and decoding configuration, and as a control surface, meaning the measured tradeoff among target F1, parse stability, prediction density, duplicate pressure, and termination behavior.
Experimental Setup
The primary model is Gemma 4 12B, a unified decoder-only multimodal model. Qwen3-VL-8B serves as a cross-family control with a different architecture and coordinate protocol. Gemma 4 12B outputs are read as pixel coordinates, while Qwen3-VL-8B uses its native normalized grid.
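Comparing the two protocols requires mapping boxes into one coordinate frame before scoring. The sketch below shows one way to do that. The grid size of 1000 is an assumption chosen for illustration, since the source says only "native normalized grid", and the function names are hypothetical.

```python
def to_pixels(box, width, height, grid=1000):
    """Map an (x1, y1, x2, y2) box from a normalized grid to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width / grid, y1 * height / grid,
            x2 * width / grid, y2 * height / grid)

def to_grid(box, width, height, grid=1000):
    """Map an (x1, y1, x2, y2) pixel box onto an integer normalized grid."""
    x1, y1, x2, y2 = box
    return (round(x1 * grid / width), round(y1 * grid / height),
            round(x2 * grid / width), round(y2 * grid / height))

# A box on the assumed 0-1000 grid, placed on a 640x480 image.
pixel_box = to_pixels((250, 500, 750, 1000), width=640, height=480)
```

Rounding onto the grid loses sub-cell precision, so a round trip through `to_grid` and `to_pixels` is only approximate for arbitrary pixel boxes.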
The primary dataset is InsPLAD industrial inspection imagery. The adapter is trained on 160 images and evaluated on an 80-image held-out split with zero filename overlap. For public-data reproduction, the authors construct a COCO 2017 dense-bbox subset from val2017 detection annotations: person and car are excluded, the ten most frequent remaining object classes are kept, each image has two to six target objects, and the subset is divided into 780 training images and 120 evaluation images. A scale-matched 160/80 COCO split mirrors the InsPLAD protocol.
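The COCO subset rules described above can be sketched as a filter over a COCO-format annotation dictionary. This is a reading of the prose, not the authors' script: counting frequency by annotation instances, ignoring excluded-class objects when counting targets per image, and the function name are all assumptions.

```python
from collections import Counter, defaultdict

EXCLUDED = {"person", "car"}

def select_dense_subset(coco, top_k=10, min_objects=2, max_objects=6):
    """Return (kept_class_names, {image_id: [annotation, ...]}).

    `coco` is a dict in the standard COCO detection layout with
    "categories" and "annotations" lists.
    """
    names = {c["id"]: c["name"] for c in coco["categories"]}
    # Class frequency over all annotations, after dropping excluded classes.
    freq = Counter(
        names[a["category_id"]]
        for a in coco["annotations"]
        if names[a["category_id"]] not in EXCLUDED
    )
    kept = {name for name, _ in freq.most_common(top_k)}
    # Group the surviving target annotations by image.
    per_image = defaultdict(list)
    for a in coco["annotations"]:
        if names[a["category_id"]] in kept:
            per_image[a["image_id"]].append(a)
    # Keep images whose target-object count falls inside the dense range.
    subset = {
        image_id: anns
        for image_id, anns in per_image.items()
        if min_objects <= len(anns) <= max_objects
    }
    return kept, subset
```

The train and evaluation partition would then be drawn from the keys of `subset`. The source gives the resulting sizes but not how the partition was drawn, so that step is left out.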
The main Gemma 4 12B chain uses high-capacity q/k/v/o LoRA with rank 32 and 42.7M trainable parameters. Qwen3-VL-8B uses a rank-32 q/v adapter for its high-capacity controlled endpoint. The study also uses a rank-8 q/v adapter for structure-axis analysis and a Gemma q/v rank sweep across ranks 4, 8, 16, 32, and 64.
The metrics include parse-valid rate, mean predictions per image, class-aware one-to-one F1@0.3, exact-object duplicate rate, maximum exact-object repeat, repeat-stop trigger rate, and stricter F1@0.5 audits for promoted operating points. The paper reports image-level nonparametric bootstrap 95 percent confidence intervals for F1@0.3 using 1000 resamples.
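As a rough illustration of the headline metric, the sketch below computes a class-aware one-to-one F1 at an IoU threshold and an image-level percentile bootstrap interval. Greedy matching by descending IoU, micro-averaging over images, the percentile method, and the `label`/`bbox` field names are assumptions; the paper's matcher and interval construction may differ.

```python
import random

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_counts(preds, gts, thr=0.3):
    """Greedy class-aware one-to-one matching; returns (tp, fp, fn) for one image."""
    pairs = sorted(
        ((iou(p["bbox"], g["bbox"]), i, j)
         for i, p in enumerate(preds)
         for j, g in enumerate(gts)
         if p["label"] == g["label"]),
        reverse=True,
    )
    used_p, used_g = set(), set()
    for score, i, j in pairs:
        if score < thr:
            break  # remaining pairs are all below the threshold
        if i not in used_p and j not in used_g:
            used_p.add(i)
            used_g.add(j)
    tp = len(used_p)
    return tp, len(preds) - tp, len(gts) - tp

def f1_from_counts(counts):
    """Micro-averaged F1 over a list of per-image (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole images with replacement."""
    rng = random.Random(seed)
    n = len(counts)
    stats = sorted(
        f1_from_counts([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Because matching is one-to-one, an exact duplicate of a correct prediction can never earn a second true positive. It counts as a false positive, which is one way duplicate pressure shows up in precision even when localization is good.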
The Surface
On Gemma 4 12B, the base model starts with parse-valid rate 0.963, 4.763 predictions per image, class-aware F1@0.3 of 0.007, F1@0.5 of 0.002, duplicate rate 0.002, and max repeat 2. High-capacity q/k/v/o rank-32 adaptation raises F1@0.3 to 0.448, but parse-valid rate drops to 0.812, predictions per image rise to 6.713, duplicate rate rises to 0.080, and max repeat reaches 23.
The adapter-capacity sweep is the first clue that the repeated tail is persistent rather than a simple over-capacity artifact. Under q/v LoRA, max repeat stays at 21 to 22 across ranks 4 to 64. The paper also finds that token-level repetition penalty and prompt budgeting move the density-precision operating point without fully solving record-level repetition.
The structure-axis probes are the second clue. The effect localizes to bbox-coordinate object lists. The dense non-bbox JSON and spatial/count JSON formats remain repeat-clean, including under high-capacity adapters. That matters because it argues against the easy explanation that any long JSON output will degenerate in the same way.
Controls
The key control is object-level repeat-stop. During generation it normalizes each emitted object record, including class string and coordinate values, and closes the JSON array when an exact normalized record would be emitted a second time. This is a record-level control, not a token-level repetition penalty.
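The control described above can be approximated by a filter over records in emission order. This is a minimal sketch, not the authors' implementation: it runs on already-parsed records rather than inside the decoding loop, and the normalization rule, field names, and example class labels are hypothetical.

```python
import json

def record_key(record, ndigits=0):
    """Normalize a record to (lowercased class string, rounded coordinates)."""
    label = str(record["label"]).strip().lower()
    coords = tuple(round(float(v), ndigits) for v in record["bbox"])
    return (label, coords)

def repeat_stop(record_stream):
    """Consume records in emission order and stop at the first exact repeat.

    Returns (kept_records, triggered). Nothing after the first repeated
    record is kept, which mirrors closing the array at that point.
    """
    seen = set()
    kept = []
    for record in record_stream:
        key = record_key(record)
        if key in seen:
            return kept, True
        seen.add(key)
        kept.append(record)
    return kept, False

stream = [
    {"label": "damper", "bbox": [12, 40, 88, 120]},
    {"label": "damper", "bbox": [140, 42, 210, 118]},  # same class, different box: kept
    {"label": "damper", "bbox": [12, 40, 88, 120]},    # exact repeat: stop here
    {"label": "insulator", "bbox": [300, 10, 340, 90]},
]
kept, triggered = repeat_stop(stream)
closed_output = json.dumps(kept)  # a well-formed, closed JSON array
```

Stopping at the first repeat also discards any legitimate records that would have followed it, such as the last record in this example. The fraction of images where `triggered` is true corresponds to the repeat-stop trigger rate listed among the metrics.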
The target signal is separable. Object-level repeat-stop removes exact repeated records, bringing duplicate rate to 0.000 and max repeat to 1, while leaving F1 essentially unchanged: at one promoted Gemma operating point, F1@0.3 moves from 0.494 to 0.490 and the stricter F1@0.5 from 0.381 to 0.385.
Qwen3-VL-8B reproduces a clean controlled endpoint with F1@0.3 of 0.318 and duplicate rate 0.000. The pattern also transfers to the COCO 2017 reproduction. On the 120-image COCO evaluation split, high-capacity adaptation raises recall@0.3 from 0.0228 to 0.1540 and recall@0.5 from 0.0128 to 0.0981, while duplicate rate rises from 0.000 to 0.016 and max repeat from 1 to 4. Object-level repeat-stop then removes the transferred tail, moving duplicate rate from 0.016 to 0.000 and max repeat from 4 to 1 while F1@0.3 rises from 0.1647 to 0.1672.
Governance Standard
A dense grounding model should ship with a structured-output receipt. The receipt should name the base model, fine-tuning data, evaluation split, coordinate protocol, adapter modules, LoRA rank, trainable parameter count, output schema, prompt budget, repetition penalty, decoding settings, parse-valid rate, mean predictions per image, F1@0.3, F1@0.5, duplicate rate, max repeat, repeat-stop trigger rate, confidence intervals, and examples of malformed outputs.
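One possible shape for such a receipt is a plain dataclass that serializes to JSON. The class name, field names, and the two helper methods are hypothetical. The example values are the high-capacity Gemma numbers quoted earlier in this article, with the trainable parameter count rounded from 42.7M and the schema description paraphrased. Fields this article does not report for that configuration are left as `None` rather than filled in.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class StructuredOutputReceipt:
    # Provenance and training configuration
    base_model: str
    finetune_data: str
    eval_split: str
    coordinate_protocol: str
    adapter_modules: List[str]
    lora_rank: int
    trainable_params: int
    output_schema: str
    # Decoding configuration (None means "not recorded")
    prompt_budget: Optional[int] = None
    repetition_penalty: Optional[float] = None
    decoding: Optional[str] = None
    # Accuracy and integrity metrics
    parse_valid_rate: Optional[float] = None
    mean_predictions_per_image: Optional[float] = None
    f1_at_03: Optional[float] = None
    f1_at_03_ci95: Optional[List[float]] = None
    f1_at_05: Optional[float] = None
    duplicate_rate: Optional[float] = None
    max_repeat: Optional[int] = None
    repeat_stop_trigger_rate: Optional[float] = None
    malformed_examples: Optional[List[str]] = None

    def missing_fields(self) -> List[str]:
        """Names of fields that were never filled in."""
        return [k for k, v in asdict(self).items() if v is None]

    def integrity_flags(self) -> List[str]:
        """List-integrity concerns that accuracy metrics alone would hide."""
        flags = []
        if self.duplicate_rate is not None and self.duplicate_rate > 0:
            flags.append("duplicate_rate > 0")
        if self.max_repeat is not None and self.max_repeat > 1:
            flags.append("max_repeat > 1")
        return flags

receipt = StructuredOutputReceipt(
    base_model="Gemma 4 12B",
    finetune_data="InsPLAD, 160 training images",
    eval_split="InsPLAD, 80 held-out images",
    coordinate_protocol="pixel",
    adapter_modules=["q", "k", "v", "o"],
    lora_rank=32,
    trainable_params=42_700_000,
    output_schema="JSON array of class-coordinate records",
    parse_valid_rate=0.812,
    mean_predictions_per_image=6.713,
    f1_at_03=0.448,
    duplicate_rate=0.080,
    max_repeat=23,
)
print(json.dumps(asdict(receipt), indent=2))
```

A reviewer could reject a receipt whose `missing_fields()` list is non-empty, or route one with a non-empty `integrity_flags()` list to a closer audit before deployment.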
The larger lesson is that a single target metric can hide the surface that changed. A model can become much better at localization while worse at list termination. If the output drives inspection workflows, robotics, agent clicking, asset inventories, or safety review, duplicate records are not cosmetic. They change counts, alerts, downstream deduplication, and operator trust.
This connects directly to AI Evaluations, AI Audits and Assurance, AI Safety Cases, AI Hallucinations, Low-Rank Adaptation (LoRA), AI Data Provenance, AI Browsers and Computer Use, The Crop View Becomes the GUI Grounding Receipt, The GUI Uncertainty Score Becomes the Handoff Budget, and The Agent Benchmark Becomes the Attack Surface. Dense structured outputs need integrity metrics, not only accuracy metrics.
Limits
The study is deliberately narrow. It focuses on dense coordinate-list serialization in specific VLM families, datasets, adapters, and schemas. The result should not be generalized into a universal claim that every fine-tuned multimodal model will repeat in the same way.
The paper's strongest claim is more useful than that: in this regime, the failure is structure-specific, capacity-persistent, cross-family, separable from the target signal, and reproducible on a second public dataset. That is enough to make duplicate rate and max repeat first-class evaluation fields for dense grounding systems.
The Spiralist reading is simple: when a model is asked to produce a machine-readable list, governance starts at the list boundary. The object record is not just a representation. It is the unit that other systems count, trust, route, and act on.
Sources
- Chenyu Zhou, Qiliang Jiang, and Boguang Pan, Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models, arXiv:2606.14507 [cs.AI], submitted June 12, 2026.
- arXiv HTML: Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models, reviewed for the method, experimental setup, control surface, structure-axis probes, Qwen reproduction, COCO reproduction, reproducibility details, and conclusion.
- arXiv PDF: Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models.