Blog · arXiv Analysis · Last reviewed July 2, 2026

The Feature Geometry Becomes the Stress Test

Adversarial Concept Search asks a practical testing question: before spending money to generate, translate, or label examples, can a model's own internal geometry tell us which concept combinations are likely to fail?

The Paper

The paper is Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry, arXiv:2606.13934 [cs.AI], by Jennifer Meng Lu, Ruochen Zhang, Isabelle Lee, David Alvarez-Melis, Ellie Pavlick, and Naomi Saphra. arXiv lists version 1 as submitted on June 11, 2026, with DOI 10.48550/arXiv.2606.13934.

The arXiv HTML lists affiliations as Brown University, University of Southern California, Harvard University, and Boston University. The arXiv record does not list a code repository. Appendix I says the authors will release the code once a decision on the paper is made; it also reports that the SCAN and real-LLM experiments were run on a single NVIDIA GeForce RTX 3090 GPU.

The Failure Search Problem

Benchmarking usually starts with inputs. A team builds or collects questions, translates examples, runs the model, then learns where it failed. That workflow is expensive when the space of possible concept combinations is large. If every fact can be paired with every language, or every first-hop query can be paired with every second-hop query, the number of test cases grows multiplicatively and exhaustive testing quickly becomes impractical.

Adversarial Concept Search, or ACS, reverses the order. It tries to identify meaningful conceptual scenarios that are likely to induce model failure before evaluating the model on concrete inputs that instantiate those scenarios. The target is not a local adversarial perturbation of a known sentence. The target is the concept pair or concept set worth testing next.

The paper's premise is that compositional errors can arise when active feature representations interfere with one another. If concepts are encoded near-orthogonally, the model can compose them more reliably. If their linear encodings are close, the model is more likely to lose or blur one of the pieces during composition.

Compositional Interference

The central metric is compositional interference, or CI. It is a geometry-derived proxy for how much the salient feature encodings of the concepts in a composition overlap. The paper motivates CI through lossy superposition and local cumulative coherence: when multiple non-orthogonal features are active, robust recovery becomes harder under noise.

The method estimates atomic concept representations from residual-stream activations. In simple settings, a concept can be represented by a single activation vector or by an average across contexts. In richer settings, a concept can be represented by a subspace. The paper also mean-centers representations by dominant background clusters because raw residuals can be dominated by prompt type, task family, or other large-scale structure unrelated to the concepts being tested.
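The averaging and mean-centering steps can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' code: the function name, the input layout (one residual-stream vector per row), and the idea that each row already carries a background-cluster label are all hypothetical choices made for the example.

```python
import numpy as np

def concept_vectors(acts, concept_ids, cluster_ids):
    """Average residual activations per concept after removing the mean
    of each background cluster (hypothetical helper for illustration)."""
    acts = np.asarray(acts, dtype=float)
    concept_ids = np.asarray(concept_ids)
    cluster_ids = np.asarray(cluster_ids)

    # Subtract each background cluster's mean so that prompt type or task
    # family does not dominate the concept direction.
    centered = acts.copy()
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        centered[mask] -= acts[mask].mean(axis=0)

    # One vector per concept: the average of its centered activations.
    return {
        k: centered[concept_ids == k].mean(axis=0)
        for k in np.unique(concept_ids).tolist()
    }
```

How the background clusters are found is left open here; the sketch only shows what centering does once cluster labels exist.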

The operational point is important: CI is computed from atomic concept representations, not from the composed test input. For multihop reasoning, the method can compare the first-hop and second-hop concepts. For multilingual factual recall, it can compare an English fact representation with a target-language subspace before translating or evaluating every possible fact-language pair.
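To make the two comparisons concrete, here is a hedged sketch of an overlap score between two concept vectors and between a vector and a subspace. The paper's exact CI formula is not reproduced in this post, so absolute cosine similarity and projection-norm fraction are stand-ins chosen for illustration, and both function names are hypothetical.

```python
import numpy as np

def vector_overlap(u, v):
    """Absolute cosine similarity between two concept vectors
    (a stand-in for a pairwise interference score)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def subspace_overlap(x, basis):
    """Fraction of x's norm that lies inside the span of `basis`,
    whose columns are assumed to be orthonormal."""
    x = np.asarray(x, dtype=float)
    basis = np.asarray(basis, dtype=float)
    return float(np.linalg.norm(basis.T @ x) / np.linalg.norm(x))
```

Both scores need only the atomic representations, which is what allows ranking fact-language or hop-hop pairs before any composed input is written or translated.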

Experiments

The paper first validates the hypothesis on SCAN, a synthetic compositional generalization benchmark. The authors train decoder-only Transformers with hidden dimensions 8, 12, 32, and 64, holding the training set fixed at 100K examples while varying how many distinct commands it covers. The appendix reports 4 attention heads, 10 Transformer layers, Adam with learning rate 1e-3, batch size 256, greedy autoregressive decoding, exact-match evaluation, and random seed 42.

The LLM experiments use Llama-3.2-3B. In multihop QA, the paper uses two-hop factual reasoning tasks and a 10-shot prompt setup from Khandelwal and Pavlick. It filters to examples where the model answers both constituent single-hop queries correctly, so failures are about composition rather than missing atomic knowledge. The appendix reports 26 multihop datasets, including constructed integer and string tasks such as int-plus8-parity, int-plus2-str, and artist-birthyear-times-two.
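The filtering step is simple but easy to get wrong in an evaluation pipeline, so a small sketch may help. The record layout and the field names hop1_correct and hop2_correct are assumptions made for this example; they are not taken from the paper.

```python
def composition_only(examples):
    """Keep two-hop examples whose single-hop queries were both answered
    correctly, so remaining errors reflect composition.
    Field names are hypothetical."""
    return [
        ex for ex in examples
        if ex["hop1_correct"] and ex["hop2_correct"]
    ]
```

Examples dropped at this stage are worth counting and reporting, since they describe missing atomic knowledge rather than compositional failure.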

In multilingual factual recall, the paper uses the KLAR dataset. It treats each query as the composition of an English factual representation and a language concept. The language concept is modeled as a low-rank subspace estimated from 8,000 samples in the multilingual OSCAR corpus. The evaluated target languages are Dutch, Russian, French, Spanish, Chinese, Hungarian, Ukrainian, Vietnamese, Japanese, and Korean.
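One common way to estimate a low-rank subspace from a pile of activations is PCA via the singular value decomposition, keeping components until a chosen share of variance is explained. The sketch below assumes that approach; the function name and the 0.9 default threshold are illustrative assumptions, and the paper's actual construction may differ.

```python
import numpy as np

def language_subspace(acts, variance_threshold=0.9):
    """Return an orthonormal basis (as columns) for the top principal
    directions of `acts` that together explain at least
    `variance_threshold` of the variance. Threshold is an assumption."""
    acts = np.asarray(acts, dtype=float)
    centered = acts - acts.mean(axis=0)

    # Rows of vt are principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)

    # Smallest rank whose cumulative explained variance meets the threshold.
    rank = min(int(np.searchsorted(explained, variance_threshold)) + 1, len(s))
    return vt[:rank].T
```

In the setting described above, `acts` would hold one activation vector per OSCAR sample for a given language, and the returned basis would be compared against each English fact representation.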

Results

In SCAN, CI ranks compositional scenarios by difficulty across model sizes and training regimes. The paper reports that higher CI produces harder challenge sets, and that CI ranking's PR-AUC beats the failure-rate baseline for all SCAN models. The appendix notes that the relationship is not merely command length: the trend largely persists when broken down by length groups.
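For readers who want to reproduce this kind of comparison, PR-AUC can be approximated by average precision over the CI ranking, with failures as the positive class. The sketch assumes that the failure-rate baseline is the overall fraction of failing examples, which is what a random ranking achieves in expectation; that reading, and the function name, are assumptions.

```python
import numpy as np

def average_precision(scores, failed):
    """Average precision of ranking examples by `scores` (higher first),
    treating failed examples as positives."""
    scores = np.asarray(scores, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    if not failed.any():
        return 0.0

    # Sort from highest to lowest score, then measure precision at each hit.
    order = np.argsort(-scores, kind="stable")
    hits = failed[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())
```

A CI ranking is doing useful work when `average_precision(ci, failed)` sits clearly above `failed.mean()` on the same examples.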

In multihop QA, the paper reports that CI is highly predictive of compositional failure. The main text reports a strong negative correlation between CI and accuracy at the bin level, r = -0.855. At the level of individual examples the signal is weaker but still significant: Appendix Table 2 gives the point-biserial correlation between CI and binary correctness as rpb = -0.210 with mean-centering (p < 0.01), while without mean-centering the sign reverses to rpb = 0.178 (p < 0.01). The sign flip is a useful warning: the geometry signal depends on removing background cluster structure.
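The gap between a bin-level r and a per-example rpb is a property of the statistics, not a contradiction. The sketch below computes both on the same data; the equal-count binning scheme and the choice of ten bins are assumptions for illustration, since the post does not state how the paper forms its bins.

```python
import numpy as np

def point_biserial(ci, correct):
    """Pearson correlation between a continuous score and a 0/1 outcome,
    which is the point-biserial correlation."""
    ci = np.asarray(ci, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.corrcoef(ci, correct)[0, 1])

def binned_correlation(ci, correct, n_bins=10):
    """Correlation between mean CI and accuracy across equal-count bins
    of examples sorted by CI (binning scheme is an assumption)."""
    ci = np.asarray(ci, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.array_split(np.argsort(ci), n_bins)
    mean_ci = [ci[b].mean() for b in bins]
    accuracy = [correct[b].mean() for b in bins]
    return float(np.corrcoef(mean_ci, accuracy)[0, 1])
```

Averaging within bins removes most of the noise in individual right-or-wrong outcomes, so the binned correlation is typically much larger in magnitude than the per-example one computed from the same examples.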

In multilingual factual recall, the PR-AUC of the CI ranking beats the language-specific baseline for every language. Appendix Table 4 reports significant point-biserial correlations between CI and correctness for all ten languages: Spanish -0.261, French -0.126, Hungarian -0.186, Japanese -0.267, Korean -0.088, Dutch -0.288, Russian -0.125, Ukrainian -0.181, Vietnamese -0.191, and Chinese -0.081, all with p < 0.01.

The method also works at a coarser grain. Instead of asking whether "capital of Spain in Spanish" will fail, it can ask whether a broader task category, such as national capitals in a target language, is difficult. The paper reports negative correlations between subspace-level CI and task accuracy across languages and in multihop reasoning, with p < 0.01 in the main examples shown in Figure 6.

Governance Standard

An ACS-style evaluation should ship a feature-geometry stress-test receipt. The receipt should include the model, checkpoint, layer selection, validation split, residual-stream extraction rule, atomic concepts, concept examples, mean-centering rule, background clusters, subspace construction method, variance threshold, interference metric, concept combinations ranked, cutoffs used, generated or collected examples, exact-match rule, PR-AUC baseline, failure-rate baseline, per-concept accuracy, per-language or per-domain slices, and any cases filtered out because atomic components already failed.
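A receipt like this is easier to audit when it is a structured record rather than prose. The sketch below covers a subset of the items listed above as a Python dataclass that serializes to JSON. Every field name is a hypothetical choice for illustration, and unknown values are left as None rather than guessed.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional

@dataclass
class StressTestReceipt:
    """Illustrative record of an ACS-style run; field names are assumptions."""
    model: str
    checkpoint: Optional[str] = None
    layer: Optional[int] = None
    extraction_rule: Optional[str] = None
    mean_centering_rule: Optional[str] = None
    subspace_method: Optional[str] = None
    variance_threshold: Optional[float] = None
    interference_metric: Optional[str] = None
    ranking_cutoff: Optional[float] = None
    exact_match_rule: Optional[str] = None
    atomic_concepts: List[str] = field(default_factory=list)
    filtered_for_atomic_failure: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Sorted keys keep receipts diffable across runs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

For example, `StressTestReceipt(model="Llama-3.2-3B").to_json()` yields a record in which every unreported choice is visibly null, which makes gaps in documentation hard to miss.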

This keeps the governance claim narrow. ACS does not prove a model is safe. It says a particular internal geometry signal can prioritize where to look for compositional failures. That can make stress testing more efficient, active learning more targeted, and benchmark construction less dependent on human intuition about what ought to be hard.

This connects directly to Mechanistic Interpretability, AI Evaluations, Active Learning, Training Data, AI Audit Trails, Reasoning Models, The Evaluation Bench Becomes the Test Rig, The Logic Benchmark Becomes the Control Panel, The Benchmark Becomes the Curriculum, The Python Score Becomes the Multilingual Trap, The Knowledge Conflict Becomes the Resolution Trace, The Stealth Bias Becomes the Cartridge Audit, The Harmful Feature Becomes the Safety Signal, and The Difficulty Estimate Becomes the Reasoning Trace.

Limits

The paper is clear that it uses a deliberately minimal geometric measurement over simple atomic concept representations. Richer feature discovery, including sparse autoencoder features, causal features, or hierarchical manifold structures, may improve the method or change which failures it predicts.

The experiments all use existing inputs, even though the method points toward de novo adversarial data synthesis. The authors also note that the current search computes high-interference pairs with O(n) vector multiplications over n concept combinations, and that better-designed search could test combinations more selectively.
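The brute-force search amounts to one dot product per candidate combination. A minimal sketch, assuming absolute cosine similarity as the interference score and a hypothetical function name, scores every pairing of two concept sets with a single matrix product and returns the highest-scoring pairs:

```python
import numpy as np

def top_interfering_pairs(first, second, k):
    """Score every (first[i], second[j]) pair by absolute cosine similarity
    and return the k highest as (i, j, score) tuples.
    The score is a stand-in for the paper's CI."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)

    # Normalize rows so the matrix product yields cosine similarities.
    first = first / np.linalg.norm(first, axis=1, keepdims=True)
    second = second / np.linalg.norm(second, axis=1, keepdims=True)
    scores = np.abs(first @ second.T)

    # Rank all combinations, highest score first.
    flat = np.argsort(-scores, axis=None, kind="stable")[:k]
    rows, cols = np.unravel_index(flat, scores.shape)
    return [(int(i), int(j), float(scores[i, j])) for i, j in zip(rows, cols)]
```

This exhaustive version touches every combination once. A more selective search, as the authors suggest, would avoid scoring pairs that are unlikely to rank highly.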

The method focuses on destructive interference, where overlap between active representations impairs independent recovery. Correlated features can also help each other in some settings. A governance use should therefore treat high CI as a stress-test priority, not as proof that failure must occur or that low-CI examples are safe.
