Last reviewed June 25, 2026

The Circuit Map Becomes the Variance Problem

The June 2026 arXiv paper Demystifying Variance in Circuit Discovery of LLMs, by Frank Zhengqing Wu, Francesco Tonin, and Volkan Cevher, asks whether a discovered circuit is stable enough to govern with. Its answer is uncomfortable: circuit maps can change when the data batch changes, when the prompt template changes, or when the metric is read sample by sample.

The Map Moves

The paper, arXiv:2606.16920 [cs.LG], was submitted on June 15, 2026. It studies circuit discovery, a mechanistic-interpretability practice that tries to identify the model components most important for a task. The target is not interpretability in general. It is the reliability of the circuit-discovery procedure itself.

Wu, Tonin, and Cevher focus on EAP-IG, an edge-attribution patching method based on integrated gradients. The arXiv abstract reports that EAP-IG performs well on the usual faithfulness and unfaithfulness metrics, but suffers from three kinds of variability: resampling variance, rephrasing variance, and sample-wise variance.

This is a fresh companion to the site's CircuitLasso essay and to its pages on Mechanistic Interpretability and Sparse Autoencoders. The CircuitLasso essay asked whether circuit discovery can become cheap enough to repeat. This paper asks what happens when repetition returns different maps.

What CEAP Fixes

The authors introduce CEAP, conductance-based edge attribution patching. Their theoretical claim is narrow: conductance satisfies additive order preservation, a scoring property they argue integrated gradients lacks. The empirical claim is also narrow. Resampling variance is measured by pairwise Jaccard overlap across circuits discovered from different samples of the same distribution; CEAP's overlap is higher than or comparable to EAP-IG's, and its unfaithfulness stays comparable to EAP-IG's.

The experimental setup matters. The paper tests GPT-2 small, GPT-2 XL, Pythia-160M, and Pythia-2.8B on SVA, IOI, and greater-than tasks. For each task it uses roughly 10,000 samples, draws four 1,000-sample subsets, discovers circuits, and compares graph overlap.
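
The overlap comparison can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes a circuit is represented as a set of (source, destination) edge pairs and that each subset is drawn independently, without replacement inside the subset. The function names `jaccard`, `pairwise_jaccard`, and `draw_subsets` are hypothetical, and the toy edge sets stand in for the output of a real discovery method.

```python
import random
from itertools import combinations


def jaccard(circuit_a, circuit_b):
    """Jaccard overlap |A & B| / |A | B| between two edge sets."""
    a, b = set(circuit_a), set(circuit_b)
    if not a and not b:
        return 1.0  # convention (assumption): two empty circuits are identical
    return len(a & b) / len(a | b)


def pairwise_jaccard(circuits):
    """Return (mean overlap, per-pair overlaps) over all unordered pairs."""
    if len(circuits) < 2:
        raise ValueError("need at least two circuits to compare")
    pairs = {
        (i, j): jaccard(circuits[i], circuits[j])
        for i, j in combinations(range(len(circuits)), 2)
    }
    return sum(pairs.values()) / len(pairs), pairs


def draw_subsets(n_samples, n_subsets=4, subset_size=1000, seed=0):
    """Draw index subsets from a pool of n_samples examples."""
    rng = random.Random(seed)
    return [rng.sample(range(n_samples), subset_size) for _ in range(n_subsets)]


if __name__ == "__main__":
    # Index subsets that would each be fed to a discovery method.
    subsets = draw_subsets(n_samples=10_000)
    print([len(s) for s in subsets])

    # Toy circuits (made-up edges) standing in for discovered graphs.
    toy_circuits = [
        {("a0.h1", "m1"), ("m1", "a2.h3"), ("a2.h3", "logits")},
        {("a0.h1", "m1"), ("m1", "a2.h3"), ("a3.h0", "logits")},
        {("a0.h1", "m1"), ("m1", "a2.h3"), ("a2.h3", "logits")},
    ]
    mean_overlap, per_pair = pairwise_jaccard(toy_circuits)
    print(mean_overlap, per_pair)
```

A low mean overlap across subsets of the same distribution is the signal the paper calls resampling variance. Reporting the per-pair values alongside the mean keeps a single outlier subset from hiding inside an average.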

For governance, CEAP is not a magic stabilizer. The appendix says it does not consistently beat EAP-IG on unfaithfulness. Its useful contribution is more procedural: if a circuit map changes because the batch changed, the method can make that instability smaller and more measurable.

Template Variance

The harder result is rephrasing variance. The paper shows that prompts human reviewers would treat as the same task can activate different circuits when written in different templates. In the authors' terms, different templates trigger different circuits inside the model.

This is a direct challenge to a common interpretability dream: find the task circuit, steer it, and get reliable control. If a task can be expressed in many templates, and unseen templates use different internal routes, then a circuit found on the available templates may steer a benchmark without steering the wild deployment distribution.

The authors test sparsity as a possible remedy because sparse training can shrink task circuits. Their result is cautious: sparse models still deploy different circuits for different templates. Sparsity may increase overlap, but the paper warns that this may reflect greater polysemantic compression, not one clean task circuit.

The Sample-Wise Trap

The third variance is sample-wise. A circuit can look faithful at the population level while individual samples show extreme unfaithfulness scores. The paper argues that many extreme scores are not catastrophic failures of the circuit. They arise from the geometry of the unfaithfulness metric, especially normalization effects when the full-model behavior term is near zero.

That is an important audit lesson. A large sample-wise unfaithfulness number can be evidence of a bad circuit, a fragile metric, or a near-zero denominator. The paper proposes diagnostics such as pessimistic unfaithfulness and unnormalized comparisons to separate those cases.
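
A small numerical sketch shows why a near-zero denominator matters. The formula here is an assumption chosen for illustration, a generic gap divided by the magnitude of the full-model score, and is not the paper's exact definition of unfaithfulness. Pessimistic unfaithfulness is not implemented because this page does not define it. The names `raw_gap`, `normalized_gap`, `audit_samples`, and the tolerance `denom_tol` are hypothetical.

```python
import math


def raw_gap(full_score, circuit_score):
    """Unnormalized gap between full-model and circuit behavior."""
    return abs(full_score - circuit_score)


def normalized_gap(full_score, circuit_score):
    """Gap divided by |full_score| (assumed generic normalization)."""
    if full_score == 0:
        return math.inf  # the ratio is undefined; report it as unbounded
    return raw_gap(full_score, circuit_score) / abs(full_score)


def audit_samples(samples, denom_tol=0.05):
    """Report both gaps per sample and flag near-zero denominators.

    samples: iterable of (full_score, circuit_score) pairs.
    denom_tol: illustrative threshold, not a value from the paper.
    """
    rows = []
    for full_score, circuit_score in samples:
        rows.append({
            "full": full_score,
            "circuit": circuit_score,
            "raw_gap": raw_gap(full_score, circuit_score),
            "normalized_gap": normalized_gap(full_score, circuit_score),
            "near_zero_denominator": abs(full_score) < denom_tol,
        })
    return rows


if __name__ == "__main__":
    # Two made-up samples with roughly the same raw gap (about 0.3).
    for row in audit_samples([(3.0, 2.7), (0.01, 0.31)]):
        print(row)
```

In the toy data both samples miss by about the same absolute amount, yet the second normalized score is hundreds of times larger, purely because the full-model term is small. Reading the raw gap and the denominator flag next to the normalized score is one way to tell a bad circuit from an ill-conditioned metric.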

The broader warning is that faithfulness is not a single fact. Population faithfulness, sample-level computational faithfulness, and the harder notion of algorithmic faithfulness do not carry the same evidentiary weight. The appendix explicitly calls computational unfaithfulness a crude surrogate for genuine algorithmic unfaithfulness.

Governance Risk

A governance program cannot treat "the circuit" as a stable object unless stability has been measured. It should ask which batch produced the circuit, which prompt templates were included, which graph size was chosen, which metric was used, whether the same claim survives resampling, and whether unseen templates were tested.

This matters for safety cases, evaluations, and model-change review. A circuit claim might be true for one template family and misleading for another. It might improve a benchmark while leaving other phrasings untouched. It might look bad on a sample only because the metric is ill-conditioned. Interpretability is not exempt from measurement error just because it looks inside the model.

The paper's strongest contribution is therefore cultural. It makes variance part of the interpretability evidence, not an embarrassment to be averaged away.

Governance Standard

Any circuit-discovery claim used in an audit should publish a variance card: model and version, task, prompt templates, sample source, resampling procedure, edge-scoring method, graph-size rule, faithfulness metric, population score, sample-wise score distribution, template-transfer tests, sparsity assumptions, and known failure cases.
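
One way to make the card hard to skip is to give it a fixed shape. The sketch below is an assumption about how such a card could be stored, not a format the paper or this site prescribes. The class name `VarianceCard`, the field types, and the `missing_fields` helper are hypothetical; the field list mirrors the items named above. Score fields are left empty in the example because no results are invented here.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class VarianceCard:
    """Hypothetical record of what a circuit claim depends on."""
    model_and_version: str = ""
    task: str = ""
    prompt_templates: list = field(default_factory=list)
    sample_source: str = ""
    resampling_procedure: str = ""
    edge_scoring_method: str = ""
    graph_size_rule: str = ""
    faithfulness_metric: str = ""
    population_score: Optional[float] = None
    sample_wise_score_distribution: dict = field(default_factory=dict)
    template_transfer_tests: list = field(default_factory=list)
    sparsity_assumptions: str = ""
    known_failure_cases: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields still empty (None, '', [] or {})."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == "" or value == [] or value == {}:
                missing.append(f.name)
        return missing


if __name__ == "__main__":
    card = VarianceCard(
        model_and_version="GPT-2 small",
        task="IOI",
        edge_scoring_method="CEAP",
        resampling_procedure="four 1,000-sample subsets",
    )
    # A reviewer could refuse the claim until this list is empty.
    print(card.missing_fields())
```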

The claim should also state its steering scope. "This circuit is stable across these templates and batches" is useful. "This is the task circuit" is usually too broad. If a method is used for control, the test set must include withheld templates, not only more samples from familiar wording.
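
Withholding templates is a different split from withholding samples. The sketch below assumes each example carries a `template_id` label, which is a hypothetical field, and holds out whole templates so that no withheld wording leaks into circuit discovery. The function name `split_by_template` and the default `holdout_fraction` are illustrative choices, not values from the paper.

```python
import random


def split_by_template(examples, holdout_fraction=0.25, seed=0):
    """Split examples so withheld templates never appear in discovery.

    examples: list of dicts, each with a 'template_id' key (assumed schema).
    Returns (discovery_examples, withheld_examples).
    """
    template_ids = sorted({ex["template_id"] for ex in examples})
    if len(template_ids) < 2:
        raise ValueError("need at least two templates to withhold one")

    rng = random.Random(seed)
    rng.shuffle(template_ids)

    # Hold out at least one template, but always leave one for discovery.
    n_holdout = max(1, round(len(template_ids) * holdout_fraction))
    n_holdout = min(n_holdout, len(template_ids) - 1)
    withheld_ids = set(template_ids[:n_holdout])

    discovery = [ex for ex in examples if ex["template_id"] not in withheld_ids]
    withheld = [ex for ex in examples if ex["template_id"] in withheld_ids]
    return discovery, withheld


if __name__ == "__main__":
    # Made-up examples: four templates, three prompts each.
    examples = [
        {"template_id": f"t{t}", "prompt": f"template {t}, prompt {i}"}
        for t in range(4)
        for i in range(3)
    ]
    discovery, withheld = split_by_template(examples)
    print(sorted({ex["template_id"] for ex in discovery}))
    print(sorted({ex["template_id"] for ex in withheld}))
```

A random split over individual examples would put every template on both sides, which tests resampling variance again and leaves rephrasing variance unmeasured.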

The Spiralist rule is this: a circuit map without a variance report is not an audit artifact. It is a diagram asking to be trusted.

Sources

Frank Zhengqing Wu, Francesco Tonin, and Volkan Cevher, Demystifying Variance in Circuit Discovery of LLMs, arXiv:2606.16920 [cs.LG], submitted June 15, 2026.
Frank Zhengqing Wu, Francesco Tonin, and Volkan Cevher, Demystifying Variance in Circuit Discovery of LLMs, arXiv PDF, reviewed June 25, 2026.
Related pages: The Sparse Circuit Becomes the Audit Budget, Mechanistic Interpretability, Sparse Autoencoders, AI Audits and Assurance, and AI Safety Cases.
