Last reviewed June 25, 2026

The Peer Review Workspace Becomes the Evidence Ledger

AI-assisted peer review becomes more defensible when the model's recommendation is tied to retrieved evidence, editor judgement, and reproducible audit files.

Not a Robot Reviewer

The fragile part of AI-assisted peer review is not only whether a model can produce plausible comments, but whether an editor can inspect why the model made a recommendation. A fluent paragraph about a manuscript can hide weak retrieval, mismatched evidence, overconfident classification, or a reviewer comment that should have been split into several smaller claims.

The stronger pattern is not an autonomous reviewer. It is an evidence ledger: every model label should travel with the reviewer comment, the retrieved manuscript passages, the scoring components, the model configuration, and the human editor's independent judgement.

The Paper

arXiv lists Towards an Interactive Evidence-RAG Peer-Review Workspace for the Journal of Digital History as arXiv:2606.25837v1 [cs.DL], submitted June 24, 2026. The authors are Elisabeth Guerard, Mehrdad Almasi, Marion Salaun, Frederic Clavert, and Mirjam Pfeiffer; the listed affiliation is the Luxembourg Centre for Contemporary and Digital History (C2DH) at the University of Luxembourg.

The paper is a preliminary version associated with an accepted presentation at AI through History, History through AI, hosted by C2DH at the University of Luxembourg on June 15-16, 2026. Its domain is narrow and useful: the Journal of Digital History, where submissions can include narrative argument, digital methods, code, data, and reproducibility material.

The Workspace

The workflow has four main stages before editorial inspection. Stage 0 converts papers into machine-readable text while preserving enough structure for later evidence mapping. Stage 1 separates and normalizes reviewer comments so each comment becomes a unit of assessment. Stage 2 builds a vector database from section-aware manuscript chunks. The prototype uses BAAI/bge-large-en-v1.5 embeddings with 1024 dimensions and Qdrant as the vector store. Stage 3 performs Evidence-RAG auditing: it retrieves limited manuscript chunks for each reviewer comment and asks the model whether the evidence supports the comment.
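The retrieval step in Stage 3 can be sketched without the real stack. The following is a minimal illustration, not the prototype's code: it swaps the BAAI/bge-large-en-v1.5 embeddings and Qdrant for toy three-dimensional vectors and an in-memory cosine ranking. The names Chunk, cosine, and retrieve_evidence, the cutoff k, and the sample data are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A section-aware manuscript chunk (hypothetical structure)."""
    chunk_id: str
    section: str
    text: str
    vector: tuple  # toy embedding; the prototype's embeddings have 1024 dimensions


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_evidence(comment_vector, chunks, k=5):
    """Return the k chunks most similar to the comment, best first, with scores."""
    scored = [(cosine(comment_vector, c.vector), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data standing in for embedded manuscript chunks (assumption).
chunks = [
    Chunk("c1", "Introduction", "Toy text about scope.", (1.0, 0.0, 0.0)),
    Chunk("c2", "Methods", "Toy text about data cleaning.", (0.0, 1.0, 0.0)),
    Chunk("c3", "Data", "Toy text about the corpus.", (0.0, 0.8, 0.6)),
]
comment_vector = (0.0, 1.0, 0.1)  # toy embedding of one normalized reviewer comment

top = retrieve_evidence(comment_vector, chunks, k=2)
for score, chunk in top:
    print(f"{chunk.chunk_id} [{chunk.section}] similarity={score:.3f}")
```

Keeping k small is what makes the later model call evidence-bounded: only the ranked chunks, not the full manuscript, would be passed on for auditing.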

The output is not a verdict. It is an audit object. The labels include supported, partially_supported, not_supported, and insufficient_evidence. The editor-facing workspace lets a human filter by paper and model, read the reviewer comment, inspect the model label, and compare the recommendation with retrieved evidence. The editor's decision is stored separately from the model output.
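One way to picture the audit object is as a pair of records that never overwrite each other. The sketch below is an assumption about shape, not the workspace's actual schema: the four label strings come from the description above, while the class names, field names, the filter helper, the model name, and the example verdict value are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ModelLabel(str, Enum):
    """Labels named in the text; the real system may define more."""
    SUPPORTED = "supported"
    PARTIALLY_SUPPORTED = "partially_supported"
    NOT_SUPPORTED = "not_supported"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"


@dataclass(frozen=True)
class ModelAudit:
    """Immutable model output for one reviewer comment (hypothetical fields)."""
    paper_id: str
    comment_id: str
    comment_text: str
    model_name: str
    label: ModelLabel
    evidence_chunk_ids: tuple


@dataclass(frozen=True)
class EditorDecision:
    """Human judgement, stored apart from the model output (hypothetical fields)."""
    comment_id: str
    verdict: str
    note: str = ""


def filter_audits(audits, paper_id=None, model_name=None):
    """Filter audits by paper and/or model, as the editor-facing view does."""
    return [
        a for a in audits
        if (paper_id is None or a.paper_id == paper_id)
        and (model_name is None or a.model_name == model_name)
    ]


audits = [
    ModelAudit("paper-A", "r1-c1", "Toy comment one.", "model-x",
               ModelLabel.PARTIALLY_SUPPORTED, ("c2", "c3")),
    ModelAudit("paper-B", "r1-c1", "Toy comment two.", "model-x",
               ModelLabel.NOT_SUPPORTED, ("c7",)),
]

# Editor decisions live in their own store, keyed by paper and comment.
editor_decisions = {
    ("paper-A", "r1-c1"): EditorDecision("r1-c1", "mostly_correct", "Evidence is thin."),
}

paper_a = filter_audits(audits, paper_id="paper-A")
print(paper_a[0].label.value, editor_decisions[("paper-A", "r1-c1")].verdict)
```

Freezing both records means a later editor decision can disagree with the model label without erasing it, which is what lets the two be compared afterwards.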

The confidence score is deliberately practical rather than metaphysical. It combines retrieval strength, model support strength, evidence specificity, and section coverage, with the largest weight placed on evidence specificity. That choice fits the editorial problem: a passage can be semantically close to a reviewer comment but still fail to substantiate it.
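The effect of weighting specificity most heavily can be shown with a small sketch. It assumes a weighted linear combination of components scaled to [0, 1]; neither the functional form nor the weights are given above, so the numbers in WEIGHTS are purely illustrative, chosen only so that evidence specificity carries the largest weight.

```python
# Hypothetical weights (assumption): only their ordering reflects the text,
# which says evidence specificity receives the largest weight.
WEIGHTS = {
    "retrieval_strength": 0.20,
    "support_strength": 0.25,
    "evidence_specificity": 0.40,
    "section_coverage": 0.15,
}


def confidence(components):
    """Weighted sum of the four components, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# A passage that is semantically close but vague...
near_but_vague = {
    "retrieval_strength": 0.9,
    "support_strength": 0.5,
    "evidence_specificity": 0.2,
    "section_coverage": 0.5,
}
# ...versus one that is less similar overall but substantiates the comment.
specific = {
    "retrieval_strength": 0.6,
    "support_strength": 0.5,
    "evidence_specificity": 0.9,
    "section_coverage": 0.5,
}

print(round(confidence(near_but_vague), 2), round(confidence(specific), 2))
```

Under these assumed weights the vague passage scores lower despite its higher retrieval strength, which is the behaviour the paragraph above describes.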

What the Trial Shows

The preliminary evaluation reports 80 saved editor decisions across three papers, all from the Claude-Qwen audit configuration. Strict accuracy counts only decisions marked correct: 56 of 80, or 70.0 percent. Another 13 were marked mostly correct. That makes 69 of 80, or 86.2 percent, correct or mostly correct. If three partly correct outputs are counted as useful but incomplete, the useful-output rate is 72 of 80, or 90.0 percent.
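The three rates follow directly from the reported counts. The short check below recomputes them; the dictionary keys and the helper name rate are illustrative, and only the counts come from the text.

```python
TOTAL = 80
decisions = {"correct": 56, "mostly_correct": 13, "partly_correct": 3}


def rate(count, total=TOTAL):
    """Percentage of saved editor decisions."""
    return 100.0 * count / total


strict = rate(decisions["correct"])
lenient = rate(decisions["correct"] + decisions["mostly_correct"])
useful = rate(sum(decisions.values()))
remaining = TOTAL - sum(decisions.values())

print(strict, lenient, useful, remaining)  # 70.0 86.25 90.0 8
```

The exact middle figure is 86.25 percent, given as 86.2 at one decimal place. The remaining 8 decisions fall outside the three categories named here; how they were marked is not stated above.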

The audit files also show the shape of the evidence problem. The 80 comment-level records each contain five retrieved chunks, for 400 retrieved chunks total. The model labels were not all positive: 13 supported, 32 partially_supported, 33 not_supported, and 2 insufficient_evidence. The authors report that evidence specificity separated labels more clearly than retrieval strength alone, which is the central lesson for editorial RAG. Finding nearby text is not the same as finding enough evidence.
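A quick consistency check on these figures, using only the counts reported above (the variable names are illustrative):

```python
label_counts = {
    "supported": 13,
    "partially_supported": 32,
    "not_supported": 33,
    "insufficient_evidence": 2,
}
CHUNKS_PER_RECORD = 5

records = sum(label_counts.values())
total_chunks = records * CHUNKS_PER_RECORD
shares = {label: 100.0 * n / records for label, n in label_counts.items()}

print(records, total_chunks)  # 80 400
for label, share in shares.items():
    print(f"{label}: {share}%")
```

The labels sum to the 80 records, and the shares work out to 16.25, 40.0, 41.25, and 2.5 percent, so roughly one comment in six was labelled fully supported.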

Governance Reading

This page belongs beside the pages on model-mediated peer review, grading cascades, retrieval-augmented generation, and research integrity. The new contribution is a workflow pattern: AI review support should be inspectable at the claim-evidence level before it is allowed to influence editorial judgement.

The paper also treats manuscript exposure as a governance issue. Because peer-review material can be sensitive, the system uses evidence-bounded prompts rather than full-manuscript prompting. Future deployments might compare local open models with commercial models, but only if privacy, contractual, and institutional safeguards permit transmitting even those limited chunks. That is the right unit of analysis: not whether "AI peer review" is allowed in the abstract, but what evidence leaves the editorial environment, what the model sees, and what the editor can later reconstruct.
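Evidence-bounded prompting can be illustrated with a small helper that builds the prompt from the reviewer comment and a capped number of retrieved passages, and returns a manifest of exactly what was included. This is a sketch under assumptions: the function name, the prompt wording, the max_chunks cap, and the manifest fields are hypothetical and are not taken from the paper.

```python
def build_bounded_prompt(comment_text, retrieved_chunks, max_chunks=5):
    """Build a prompt from one comment and at most max_chunks passages.

    retrieved_chunks: list of (chunk_id, section, text) tuples, best first.
    Returns (prompt, manifest); the manifest records which chunks were
    included so an editor can later reconstruct what the model saw.
    """
    selected = retrieved_chunks[:max_chunks]
    lines = ["Reviewer comment:", comment_text, "", "Manuscript evidence:"]
    for chunk_id, section, text in selected:
        lines.append(f"[{chunk_id} | {section}] {text}")
    lines.append("")
    lines.append("Does the evidence support the comment? Answer with one label.")
    prompt = "\n".join(lines)
    manifest = {
        "chunk_ids": [chunk_id for chunk_id, _, _ in selected],
        "characters_sent": len(prompt),
    }
    return prompt, manifest


retrieved = [
    ("c2", "Methods", "Toy passage about data cleaning."),
    ("c3", "Data", "Toy passage about the corpus."),
    ("c9", "Appendix", "Toy passage that should stay local."),
]
prompt, manifest = build_bounded_prompt("Toy reviewer comment.", retrieved, max_chunks=2)
print(manifest["chunk_ids"])  # ['c2', 'c3']
```

Logging the manifest alongside the model label is one way to answer the reconstruction question: the record shows which passages left the editorial environment and which did not.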

Limits

The paper is clear that the prototype is not an autonomous peer-review agent. The confidence score is heuristic. Section detection and chunking can fail on unusual PDF structures. Retrieved evidence can be incomplete. The tested corpus is limited, and stronger general claims require more evaluation.

There is also a deeper limit: Evidence-RAG can check what is present in a manuscript, but many reviewer suggestions are about what should be added, clarified, reframed, or argued differently. Those are scholarly judgements. A ledger can make them more inspectable. It cannot turn them into a simple support label.

Peer-Review Receipt

A peer-review receipt should record: manuscript version, reviewer comment ID, extracted comment text, retrieved chunks, section labels, embedding model, vector store, model configuration, prompt template, model label, confidence components, editor decision, disagreement notes, privacy boundary, and repository or workflow version. The audit-grade sentence is: this recommendation affected editorial attention only because its evidence trail remained available for human review.
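The receipt can be treated as a checklist that is verified before a recommendation is allowed to draw editorial attention. The sketch below is a minimal completeness check: the snake_case field names are one possible rendering of the list above, and all example values except the embedding model, vector store, and repository name (which appear elsewhere on this page) are placeholders.

```python
REQUIRED_FIELDS = (
    "manuscript_version", "reviewer_comment_id", "extracted_comment_text",
    "retrieved_chunks", "section_labels", "embedding_model", "vector_store",
    "model_configuration", "prompt_template", "model_label",
    "confidence_components", "editor_decision", "disagreement_notes",
    "privacy_boundary", "workflow_version",
)


def missing_fields(receipt):
    """Return required fields that are absent or None, in checklist order.

    Empty strings and empty lists count as present: an editor may have
    no disagreement notes to record.
    """
    return [name for name in REQUIRED_FIELDS if receipt.get(name) is None]


receipt = {
    "manuscript_version": "placeholder-v1",
    "reviewer_comment_id": "r1-c1",
    "extracted_comment_text": "Toy reviewer comment.",
    "retrieved_chunks": ["c2", "c3"],
    "section_labels": ["Methods", "Data"],
    "embedding_model": "BAAI/bge-large-en-v1.5",
    "vector_store": "Qdrant",
    "model_configuration": "placeholder-config",
    "prompt_template": "placeholder-template",
    "model_label": "partially_supported",
    "confidence_components": {"evidence_specificity": 0.5},
    "editor_decision": None,  # not yet reviewed by a human
    "disagreement_notes": "",
    "privacy_boundary": "retrieved chunks only",
    "workflow_version": "C2DH/journal-of-digital-history-evidence-rag@placeholder",
}

print(missing_fields(receipt))  # ['editor_decision']
```

A receipt that still lacks an editor decision is exactly the case the audit-grade sentence rules out: the model label exists, but no human has yet reviewed its evidence trail.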

Sources

Elisabeth Guerard, Mehrdad Almasi, Marion Salaun, Frederic Clavert, and Mirjam Pfeiffer, Towards an Interactive Evidence-RAG Peer-Review Workspace for the Journal of Digital History, arXiv:2606.25837v1 [cs.DL], submitted June 24, 2026.
Primary arXiv versions checked: PDF and experimental HTML, reviewed for title, authorship, submission date, conference note, workflow stages, model/retrieval configuration, editor-decision counts, audit-file statistics, code note, ethical position, and limitations.
Project repository linked from the paper: C2DH/journal-of-digital-history-evidence-rag.
Related pages: The Peer Reviewer Becomes the Model Referee, The Grading Cascade Becomes the Evaluation Artifact, Retrieval-Augmented Generation, and Research and Editorial Integrity.
