Last reviewed June 25, 2026

The Citation Becomes the Influence Trace

A June 2026 arXiv paper on ProvenAI argues that a citation beside a generated answer is not enough. A source can be cited without strongly shaping the answer, and an uncited source can still move the output.

Citation Is a Claim

The paper, arXiv:2606.26449 [cs.CL; cs.AI; cs.CR; cs.IR], was submitted on June 24, 2026. arXiv lists the title as ProvenAI: Provenance-Native Traces of Evidence in Generated Answers, by Mohammad Faizan and Dalal Alharthi.

The useful warning is narrow and practical: a retrieval-augmented system can put citations beside an answer, but the citation list only reports what the output claims to use. It does not prove the answer is correct, prove the citation matches the benchmark evidence, or prove that the cited document actually shaped the generated text.

The Paper Frame

ProvenAI is not presented as a stronger question-answering model. It is an audit framework for retrieval-grounded question answering. The paper uses the HotpotQA distractor benchmark because it supplies supporting-fact annotations for multi-hop questions, then builds a seven-stage pipeline around normalized data, retrieval indexing, citation-aware generation, citation auditing, leave-one-resource-out ablation, aggregate evaluation, and interactive inspection.

The implementation evaluates 7,405 HotpotQA validation examples over a canonical retrieval corpus of 509,300 passages. The paper reports that the corpus was produced after deduplicating 464,067 redundant evidence rows from 973,367 source evidence rows. Its runtime configuration uses retrieval depth 10, all-MiniLM-L6-v2 embeddings, FAISS with a SQLite lookup mirror, a local MLX generation path, and Qwen2.5-3B-Instruct as the default generation model.

Three Ledgers

The paper separates three ledgers that are often collapsed into one reassuring interface. The first is answer correctness: did the final answer match the benchmark answer after lightweight normalization? The second is citation fidelity: did the cited document titles match the benchmark supporting titles? The third is document influence: did removing a retrieved resource change the answer or citation pattern?
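The first two ledgers can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. The normalization rules (lowercasing, stripping punctuation and articles) follow the common HotpotQA-style convention, and scoring citation fidelity as the F1 overlap between cited and supporting titles is an assumption, since the exact formula is not described here. The function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_correct(predicted: str, gold: str) -> bool:
    """Ledger 1: exact match after lightweight normalization."""
    return normalize_answer(predicted) == normalize_answer(gold)


def citation_fidelity(cited_titles, supporting_titles) -> float:
    """Ledger 2 (assumed scoring): F1 overlap of cited vs. supporting titles."""
    cited = {normalize_answer(t) for t in cited_titles}
    gold = {normalize_answer(t) for t in supporting_titles}
    overlap = len(cited & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cited)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Note that both checks operate on surface strings. Neither says anything about the third ledger, which needs an intervention on the retrieved set.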

That third layer is the contribution that matters for governance. ProvenAI assigns each retrieved resource one of four verdicts by combining whether it was cited with whether its leave-one-out influence proxy crosses a threshold. A cited and influential document is treated as used. A cited but weakly influential document is flagged as a hallucinated citation. An uncited but influential document is flagged as uncited-influential. An uncited document whose removal leaves the output stable is low-influence.
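The four verdicts form a two-by-two grid over "cited" and "influential," which can be sketched as a small lookup. The enum names, the threshold value of 0.5, and the choice of `>=` for "crosses a threshold" are all assumptions made for illustration; the paper's actual threshold is not given here.

```python
from enum import Enum


class Verdict(Enum):
    USED = "used"
    HALLUCINATED_CITATION = "hallucinated_citation"
    UNCITED_INFLUENTIAL = "uncited_influential"
    LOW_INFLUENCE = "low_influence"


def assign_verdict(cited: bool, influence: float, threshold: float = 0.5) -> Verdict:
    """Combine the citation flag with a thresholded influence proxy.

    The default threshold is a placeholder, not a value from the paper.
    """
    influential = influence >= threshold
    if cited and influential:
        return Verdict.USED
    if cited:
        return Verdict.HALLUCINATED_CITATION
    if influential:
        return Verdict.UNCITED_INFLUENTIAL
    return Verdict.LOW_INFLUENCE
```

The grid makes one property obvious: the verdict is only as good as the influence proxy feeding it, so a coarse proxy yields coarse verdicts.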

On the full validation run, the paper reports 53.53 percent answer accuracy and a 71.55 percent mean citation-fidelity score. It also says full-batch ablation was disabled because retrieval depth 10 would require roughly 11 forward passes per example. That matters: the zero aggregate influence score in the paper's results table is a cost-control artifact, not a finding that retrieved documents had no influence.
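The cost argument is simple arithmetic. Assuming the "roughly 11" figure means one baseline generation plus one re-generation per removed resource, the total for the full validation set follows directly. The function name is hypothetical, and the 81,455 total is this post's own multiplication, not a number reported by the paper.

```python
def ablation_forward_passes(num_examples: int, retrieval_depth: int) -> int:
    """Baseline generation plus one re-generation per removed resource."""
    return num_examples * (1 + retrieval_depth)


per_example = ablation_forward_passes(1, 10)
full_run = ablation_forward_passes(7405, 10)
print(per_example)  # 11
print(full_run)     # 81455
```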

The Gap

The paper's worked example asks whether Scott Derrickson and Ed Wood were of the same nationality. The model answers yes, identifies both as American, and cites the two supporting titles. The citation audit is clean under the paper's exact and semantic matching setup.

The leave-one-out influence report is less clean. Across ten retrieved documents, the paper reports one used source, one cited but weakly influential source, seven uncited-influential sources, and one low-influence source. The authors interpret this as evidence that other co-retrieved passages about American filmmakers may have stabilized the nationality pattern even though they did not appear in the citation list.

That is the citation-influence gap. It is not proof that the model reasoned like a person, and it is not proof that every uncited document caused the answer in a deep causal sense. The paper is careful about its proxy: the local MLX path does not expose token probabilities, so influence is measured through answer-and-citation sensitivity, not direct token-level KL divergence.
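A surface-level sensitivity proxy of this kind can be sketched without any access to token probabilities. The version below is an assumption-laden illustration, not the paper's scoring rule: the `generate` callable, the equal weighting of answer change and citation shift, and the use of Jaccard distance are all hypothetical choices. The toy generator exists only so the sketch runs end to end.

```python
from typing import Callable, Dict, Sequence, Set, Tuple

# Hypothetical signature: (question, documents) -> (answer, cited titles).
Generator = Callable[[str, Sequence[Dict[str, str]]], Tuple[str, Set[str]]]


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 minus intersection-over-union; two empty sets count as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def leave_one_out_influence(
    question: str,
    documents: Sequence[Dict[str, str]],
    generate: Generator,
) -> Dict[str, float]:
    """Score each document by how much removing it changes the output."""
    base_answer, base_cited = generate(question, documents)
    scores: Dict[str, float] = {}
    for i, doc in enumerate(documents):
        ablated = [d for j, d in enumerate(documents) if j != i]
        answer, cited = generate(question, ablated)
        answer_changed = float(answer.strip().lower() != base_answer.strip().lower())
        citation_shift = jaccard_distance(base_cited, cited)
        # Assumed weighting: half for an answer flip, half for citation drift.
        scores[doc["title"]] = 0.5 * answer_changed + 0.5 * citation_shift
    return scores


def toy_generate(question, documents):
    """Stand-in for a real model: answers 'yes' only if title 'A' is present."""
    titles = {d["title"] for d in documents}
    answer = "yes" if "A" in titles else "no"
    return answer, titles & {"A", "B"}


docs = [{"title": t, "text": ""} for t in ("A", "B", "C")]
scores = leave_one_out_influence("toy question", docs, toy_generate)
print(scores)  # {'A': 0.75, 'B': 0.25, 'C': 0.0}
```

One caveat built into this design: removing a cited document always shifts the citation set, because the model can no longer cite it. A real audit would need to decide whether that counts as influence or as bookkeeping.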

Governance Reading

This paper belongs beside earlier discussions of why the provenance layer is not a truth machine, evidence-RAG peer review, answer engines, agent breadcrumbs, and retrieval-augmented generation. The shared warning is that a visible source list is not the same thing as an evidence trace.

An answer citation is declared evidence. A retrieval manifest is available evidence. An influence trace is behavioral evidence about what moved the model's output under intervention. Public-facing AI products often show the first, sometimes preserve the second, and rarely expose the third. ProvenAI makes the missing layer legible.

For high-impact uses, that distinction should become a procurement requirement. A system that answers legal, medical, educational, scientific, or public-record questions should preserve the retrieved set, cited set, model and prompt version, generation configuration, source identifiers, answer-correctness check where available, citation-fidelity check where available, and a bounded influence audit for contested outputs.

Limits

The paper's limits are important. Its reported results come from HotpotQA distractor, not from every RAG setting. The default model is a 3B Qwen2.5 model selected for local stability, and larger models may change accuracy, citation behavior, and sensitivity. The citation audit works at document-title level, so a title match is not a sentence-level entailment proof. The influence proxy is one-sided and surface-level until an inference backend exposes token probabilities.

The user-interface risk also remains. A rich trace can make users over-trust an answer if the trace is styled as certification. ProvenAI should be read as diagnostic infrastructure, not as a guarantee that a generated answer is true.

Answer Receipt

An answer receipt should record the question, retrieved resources, cited resources, unavailable resources, model and retrieval configuration, generation prompt, citation parser, benchmark or human review standard, answer-correctness result, citation-fidelity result, influence verdicts where feasible, and the limits of the influence proxy.
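As a data structure, the receipt is a flat record with a few optional fields. The sketch below is one possible shape, not a schema from the paper: the class name, field names, and placeholder values are hypothetical, and the retrieved list is truncated to the two titles named in the worked example rather than the full ten.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class AnswerReceipt:
    question: str
    retrieved_resources: List[str]
    cited_resources: List[str]
    unavailable_resources: List[str]
    model_config: Dict[str, str]
    retrieval_config: Dict[str, str]
    generation_prompt: str
    citation_parser: str
    review_standard: str
    # None means the check was not run, which differs from a failed check.
    answer_correct: Optional[bool] = None
    citation_fidelity: Optional[float] = None
    influence_verdicts: Dict[str, str] = field(default_factory=dict)
    influence_proxy_limits: str = ""

    def to_json(self) -> str:
        """Serialize with stable key order so receipts diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


receipt = AnswerReceipt(
    question="Were Scott Derrickson and Ed Wood of the same nationality?",
    retrieved_resources=["Scott Derrickson", "Ed Wood"],  # truncated for illustration
    cited_resources=["Scott Derrickson", "Ed Wood"],
    unavailable_resources=[],
    model_config={"generator": "Qwen2.5-3B-Instruct"},
    retrieval_config={"embedder": "all-MiniLM-L6-v2", "depth": "10"},
    generation_prompt="<prompt text or hash>",  # placeholder
    citation_parser="<parser name and version>",  # placeholder
    review_standard="HotpotQA supporting-fact titles",
    influence_proxy_limits="answer-and-citation sensitivity; no token probabilities",
)
print(receipt.to_json())
```

Keeping unrun checks as explicit nulls, rather than omitting them, lets a reviewer see at a glance which ledgers were actually audited for a given answer.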

The audit-grade sentence is not "the answer cited sources." It is: this answer was generated from this retrieved set, cited this subset, matched or missed this evidence standard, changed or did not change under these resource removals, and still leaves these uncertainty gaps open.
