The Live Benchmark Becomes the Update Receipt
Yuanzhi Liu, Shousheng Zhao, Bo Zhou, Kongming Liang, and Zhanyu Ma's MMBench-Live paper treats benchmark freshness as a governed construction process, not a leaderboard decoration.
The Paper
The paper is MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models, arXiv:2607.01813 [cs.CV]. arXiv lists it as submitted on July 2, 2026, by Yuanzhi Liu, Shousheng Zhao, Bo Zhou, Kongming Liang, and Zhanyu Ma, with DOI 10.48550/arXiv.2607.01813. Its subject is vision-language model evaluation: how to keep a multimodal benchmark fresh without losing comparability with the older benchmark it updates.
The site already has pages on benchmarks becoming curricula, data curation loops, visual grounding failures, and contamination-limited evaluation. MMBench-Live adds a narrower institutional question: when a benchmark updates itself, what evidence proves that the new version still measures the same capability rather than merely producing new-looking questions?
Static Decay
Static benchmarks decay in several ways. Their images and questions can become stale as the world changes. Their samples can leak into training corpora. Their maintenance can become expensive enough that updates stop. A live benchmark promises freshness, but freshness alone is not evidence. If every update changes the task distribution, the leaderboard becomes a moving target that no longer supports longitudinal comparison.
MMBench-Live frames that tension directly. It treats benchmark evolution as task-guided dataset construction. Each update is supposed to introduce temporally recent real-world data while preserving capability coverage, distributional alignment, evaluation format, and cross-version ranking stability. In governance terms, the benchmark is no longer only a test set. It is a versioned institution.
That institutional reading matters because benchmark scores travel far beyond the lab. They appear in model cards, procurement decks, regulatory comments, grant narratives, and product marketing. A stale benchmark can make a model look better than it is. A drifting live benchmark can make progress look real when the target quietly changed. The hard requirement is therefore not novelty. It is continuity with an audit trail.
Update Pipeline
The pipeline begins by converting MMBench into structured benchmark descriptions. The paper names four elements: evaluation purpose, evaluation format, task hierarchy, and atomic tasks. For each atomic task, the system identifies task-related visual patterns, using perspectives such as visual content, visual style, OCR dependency, spatial relations, and external knowledge dependency.
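The structured description can be pictured as a small data model. The sketch below is illustrative only: the four top-level fields and the five pattern perspectives come from the description above, while every class name, field name, and the `missing_atomic_tasks` helper are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class VisualPatterns:
    # One free-text note per perspective named in the text above.
    visual_content: str = ""
    visual_style: str = ""
    ocr_dependency: str = ""
    spatial_relations: str = ""
    external_knowledge: str = ""


@dataclass
class AtomicTask:
    name: str
    patterns: VisualPatterns = field(default_factory=VisualPatterns)


@dataclass
class BenchmarkDescription:
    evaluation_purpose: str
    evaluation_format: str
    # Hypothetical layout: parent capability -> names of its atomic tasks.
    task_hierarchy: dict[str, list[str]]
    atomic_tasks: dict[str, AtomicTask]

    def missing_atomic_tasks(self) -> list[str]:
        """Return hierarchy leaves that have no atomic-task description yet."""
        return sorted(
            name
            for leaves in self.task_hierarchy.values()
            for name in leaves
            if name not in self.atomic_tasks
        )
```

A check like `missing_atomic_tasks` is one way to catch a hierarchy that names a capability without describing it before any retrieval runs.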
Those patterns are then used during task-aware data acquisition. The acquisition executor supports source-specific structured retrieval, keyword-guided semantic retrieval through the Google Image API, and open-domain retrieval through the Flickr API. To reduce staleness and contamination risk, the keyword-guided mode restricts retrieval to images uploaded within a one-year window. A feedback controller judges whether retrieved candidates match the task and uses rejection rationales to refine weak queries for up to three iterations.
The rejection path is the important part. In ordinary dataset construction, rejected examples often disappear. Here, rejected candidates become diagnostic evidence about how a search query misread the task. The controller can use those rationales to reduce visual deviations before the candidate pool is finalized. That turns update failure into a construction signal rather than background waste.
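The feedback loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `retrieve`, `judge`, and `refine` are hypothetical callables standing in for the acquisition executor, the feedback controller's judgment, and its query rewriting. The only details taken from the text are the three-iteration cap, the one-year upload window, and the use of rejection rationales to refine the query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

MAX_ITERATIONS = 3                  # "up to three iterations" in the text
UPLOAD_WINDOW = timedelta(days=365)  # one-year window, approximated in days


@dataclass
class Candidate:
    url: str
    uploaded_at: datetime


@dataclass
class Judgment:
    accepted: bool
    rationale: str


def acquire(
    query: str,
    retrieve: Callable[[str], list[Candidate]],       # hypothetical retriever
    judge: Callable[[Candidate], Judgment],           # hypothetical task-match judge
    refine: Callable[[str, list[str]], str],          # hypothetical query rewriter
    now: datetime,
) -> tuple[list[Candidate], list[tuple[str, str, str]]]:
    """Return accepted candidates plus a log of (query, url, rationale) rejections."""
    accepted: list[Candidate] = []
    rejection_log: list[tuple[str, str, str]] = []
    seen: set[str] = set()
    for _ in range(MAX_ITERATIONS):
        rationales: list[str] = []
        for cand in retrieve(query):
            if cand.url in seen:
                continue
            seen.add(cand.url)
            age = now - cand.uploaded_at
            if not (timedelta(0) <= age <= UPLOAD_WINDOW):
                rejection_log.append((query, cand.url, "outside one-year window"))
                continue
            verdict = judge(cand)
            if verdict.accepted:
                accepted.append(cand)
            else:
                rejection_log.append((query, cand.url, verdict.rationale))
                rationales.append(verdict.rationale)
        if not rationales:
            break  # nothing was rejected on task grounds, so no refinement is needed
        query = refine(query, rationales)
    return accepted, rejection_log
```

Keeping the rejection log as a return value, rather than discarding it, is what lets the rationales and intermediate queries be recorded alongside the accepted images.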
Verification
The paper's strongest design move is to require a question, an answer, and an executable solution plan. Candidate QA pairs are generated by multiple multimodal models, selected through model comparison and rule checks, and then verified by tool-supported reasoning. The verification controller is vision-blind: it reasons over textual intermediate results produced by tools, not directly over the image.
That separation is useful. A vision-language model can be fluent about an image and still be wrong. A benchmark update needs more than another model saying the QA pair looks plausible. It needs a trace showing how the answer was derived, which tools produced the intermediate observations, and why the final answer matched the verified outcome.
It also keeps the evaluation artifact closer to reviewable work. A human auditor cannot easily inspect a latent visual judgment, but can inspect a solution plan, tool outputs, and a final verifier decision. The paper does not remove human review; it reports manual answer correctness as a quality check. The point is that automated construction should leave enough intermediate structure for review to be meaningful.
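One way to picture the vision-blind separation is a verifier whose decision function only ever receives text. The sketch below is a hedged illustration: the `PlanStep` and `VerificationTrace` types, the `tools` registry, and the `decide` callable are all hypothetical names introduced here, and the case-insensitive string match is a stand-in for whatever answer comparison the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlanStep:
    tool: str      # name of a tool in the registry, e.g. an OCR tool
    argument: str  # instruction passed to that tool


@dataclass
class VerificationTrace:
    observations: list[tuple[str, str]]  # (tool name, textual output)
    derived_answer: str
    matches: bool


def verify(
    image_ref: str,
    plan: list[PlanStep],
    proposed_answer: str,
    tools: dict[str, Callable[[str, str], str]],  # tools see the image reference
    decide: Callable[[list[str]], str],           # verifier sees text only
) -> VerificationTrace:
    """Run the plan, then let a text-only verifier derive an answer."""
    observations = [
        (step.tool, tools[step.tool](image_ref, step.argument)) for step in plan
    ]
    # The image reference is deliberately not passed to `decide`.
    derived = decide([text for _, text in observations])
    matches = derived.strip().lower() == proposed_answer.strip().lower()
    return VerificationTrace(observations, derived, matches)
```

The returned trace holds the tool outputs and the derived answer together, which is the kind of intermediate structure a human auditor can read without re-running the tools.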
Results
The paper reports that the MMBench-Live instantiation contains 5.9K newly generated evaluation QA pairs, reaches a manual answer correctness rate of 96.06 percent, and completes each update in approximately one to two hours at about 30 dollars. It compares representative open-source VLMs including DeepSeek-VL-7B-Chat, InstructBLIP-Vicuna-7B, LLaVA-v1.5-7B, mPLUG-Owl2-7B, Qwen3-VL-8B-Instruct, and Qwen2.5-VL-7B-Instruct.
The evaluation does not claim that live updating eliminates contamination. It uses PaCoST as a proxy signal, comparing model confidence on original questions and on meaning-preserving paraphrases. The authors explicitly warn that PaCoST is not a definitive detector of multimodal data leakage because it is mainly sensitive to text-side memorization and cannot fully distinguish visual leakage, metadata overlap, or exposure through web content. Under that limited protocol, MMBench-Live shows a smaller confidence shift than MMBench and than several edited-image variants.
The results therefore support a maintenance claim rather than a triumph claim. The benchmark can be updated quickly and cheaply while preserving enough relationship to the original to keep rankings interpretable at aggregate levels. That is useful evidence, but it does not convert every generated instance into ground truth or every stable ranking into deployment authority. The score still depends on task slices, construction models, verification tools, and the chosen contamination proxy.
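Ranking stability across versions is commonly summarized with a rank correlation. The sketch below computes Spearman's coefficient by hand with average ranks for ties. It assumes Spearman as the statistic, which the text above does not confirm, and it takes plain score lists aligned by model; the function names are illustrative.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(old_scores: list[float], new_scores: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(old_scores) != len(new_scores) or len(old_scores) < 2:
        raise ValueError("need two aligned score lists with at least two models")
    ra, rb = average_ranks(old_scores), average_ranks(new_scores)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / math.sqrt(var_a * var_b)
```

A coefficient near 1 on aggregate scores supports the maintenance claim at the leaderboard level only; it does not show that each capability slice kept its ordering.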
Update Receipt
A live benchmark needs an update receipt. That receipt should preserve the benchmark version, source benchmark, task hierarchy, atomic-task descriptions, visual-pattern extraction prompt, accepted data sources, retrieval windows, API modes, rejected-query rationales, final queries, candidate images, metadata fields, QA-generation models, rule checks, executable plans, tool outputs, verifier model, manual-audit sample, correctness rate, construction cost, elapsed time, cross-version correlations, and contamination-proxy protocol.
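The receipt fields listed above translate directly into a completeness check. The sketch below is an assumption about how such a receipt might be stored (a plain dictionary with snake_case keys); the key names and the `missing_fields` helper are invented for this example, while the 23 items themselves mirror the list in the paragraph above.

```python
REQUIRED_FIELDS: tuple[str, ...] = (
    "benchmark_version",
    "source_benchmark",
    "task_hierarchy",
    "atomic_task_descriptions",
    "visual_pattern_extraction_prompt",
    "accepted_data_sources",
    "retrieval_windows",
    "api_modes",
    "rejected_query_rationales",
    "final_queries",
    "candidate_images",
    "metadata_fields",
    "qa_generation_models",
    "rule_checks",
    "executable_plans",
    "tool_outputs",
    "verifier_model",
    "manual_audit_sample",
    "correctness_rate",
    "construction_cost",
    "elapsed_time",
    "cross_version_correlations",
    "contamination_proxy_protocol",
)


def missing_fields(receipt: dict) -> list[str]:
    """List required fields that are absent, None, or an empty string/collection."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = receipt.get(name)
        if value is None or (isinstance(value, (str, list, dict, tuple, set)) and not value):
            missing.append(name)
    return missing
```

A release process could refuse to publish a new benchmark version while `missing_fields` returns anything, which makes the receipt a gate instead of an afterthought.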
Without that record, "live" can become a brand word. With the record, live evaluation becomes auditable maintenance. Procurement teams, researchers, and standards bodies can ask whether the benchmark update preserved the old task, introduced hidden drift, overfit to the generator, relied on one fragile judge, or narrowed the task world while claiming freshness.
The receipt should also say which cross-version comparisons the evidence supports, and for whom. A live benchmark may preserve overall model ordering while still shifting performance at individual capability slices. Users who care about OCR, spatial relations, attribute comparison, or social relation reasoning need the slice-level evidence, not only the aggregate leaderboard.
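Slice-level evidence can be as simple as a table of per-slice score changes between versions. The sketch below flags slices where a model's accuracy moved by more than a tolerance. The nested-dictionary layout, the function name, and the 0.05 default tolerance are all assumptions chosen for illustration, and the model and slice names in any usage are placeholders, not reported results.

```python
def slice_shifts(
    old: dict[str, dict[str, float]],
    new: dict[str, dict[str, float]],
    tolerance: float = 0.05,  # illustrative threshold, not from the paper
) -> dict[str, list[tuple[str, float]]]:
    """Map each slice to the (model, delta) pairs whose accuracy moved too far.

    `old` and `new` are model -> slice -> accuracy. Only models and slices
    present in both versions are compared.
    """
    flagged: dict[str, list[tuple[str, float]]] = {}
    for model in old.keys() & new.keys():
        for slice_name in old[model].keys() & new[model].keys():
            delta = new[model][slice_name] - old[model][slice_name]
            if abs(delta) > tolerance:
                flagged.setdefault(slice_name, []).append((model, round(delta, 4)))
    return {name: sorted(pairs) for name, pairs in flagged.items()}
```

A report built this way lets a reader who only cares about one capability see whether the update moved that slice, even when the aggregate ordering is unchanged.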
Claim Boundary
The limitations section matters. Dynamic updates cannot fully eliminate implicit memorization of frequent visual concepts. Automatically constructed instances are bounded by the capabilities of the agents and foundation models used to build them. Subtle noise can persist under visual ambiguity or underspecified constraints. The authors also say the task-grounded paradigm favors stable predefined capabilities and does not fully expand the task space.
That makes MMBench-Live useful but bounded. It is not proof that a model is ready for open-world visual reasoning. It is evidence that benchmark maintenance can be made more explicit, cheaper, faster, and more inspectable when the update process itself is treated as an artifact under audit.
It also warns against confusing automation with independence. A benchmark built by agents inherits the blind spots of its construction agents, tool set, prompts, source APIs, and verification rules. The update receipt is how those dependencies remain visible instead of becoming hidden inside a fresh score. That evidence has to survive the next update too.
The operational lesson is to stop treating benchmark freshness as a single property. Fresh compared with what source window? Comparable on what task hierarchy? Verified by which tools? Audited by which sample? Interpreted under which contamination test? A live benchmark earns trust by answering those questions every time it changes.
Sources
- Yuanzhi Liu, Shousheng Zhao, Bo Zhou, Kongming Liang, and Zhanyu Ma, MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models, arXiv:2607.01813 [cs.CV], submitted July 2, 2026, DOI 10.48550/arXiv.2607.01813.
- arXiv HTML for MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models, reviewed for method structure, data acquisition, QA verification, experiments, contamination analysis, and limitations.
- arXiv PDF for MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models, checked against the metadata record and reviewed for reported counts, costs, model roster, PaCoST analysis, and stated limits.