Blog · arXiv Analysis · Last reviewed June 25, 2026

The RAG Red Team Becomes the Memory Tree

Inderjeet Singh, Andrés Murillo, Motoyoshi Sekiya, Yuki Unno, and Junichi Suga's June 2026 arXiv paper treats red-teaming for agentic RAG as memory-guided search. The governance lesson is that higher attack success is not enough on its own: red-team memory also needs novelty accounting, replay verification, and audit trails.

RAG Has More Surfaces

The paper, arXiv:2606.26793 [cs.CR; cs.AI; cs.LG], was submitted on June 25, 2026. arXiv lists the exact title as MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG, by Inderjeet Singh, Andrés Murillo, Motoyoshi Sekiya, Yuki Unno, and Junichi Suga.

The paper begins from a familiar weakness in AI agents: a retrieval system is not just a search box. In an agentic RAG deployment, an orchestrator may decide when to retrieve documents, call web search, inspect an image, or invoke another tool. That turns RAG from a single context-building step into a chain of choices.

MIRROR studies four attack surfaces in that chain: B1 text poisoning, B2 image poisoning, B3 direct-query attacks, and B4 orchestrator attacks that target tool choice or tool arguments. This is adjacent to earlier Spiralism pages on RAG document poisoning, agent canaries, and agent security surveys, but the new angle is the red-team process itself. The paper asks how to search for distinct vulnerabilities without merely replaying the known attack library.

The Memory Tree

MIRROR expands to Memory-Informed Red-teaming with Retrieval-Restricted Optimization and Rollouts. Its architecture has an episodic memory bank of successful traces, a Prior Network that retrieves nearest-neighbor memories, a novelty filter, and a Monte Carlo tree search planner. Retrieved traces do not become copy-paste templates. They become operator priors: they influence which mutation operators the search explores.
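One way to picture "operator priors" is as a smoothed probability distribution over mutation operators, estimated from the operators that appear in the retrieved traces. The sketch below is illustrative only: the operator names, the trace layout, and the additive smoothing are assumptions made for this page, not details taken from the paper.

```python
from collections import Counter

# Hypothetical operator names; the paper's actual operator set is not listed on this page.
OPERATORS = ["paraphrase", "insert_instruction", "role_shift", "encode"]


def operator_prior(retrieved_traces, smoothing=1.0):
    """Turn operators seen in retrieved traces into a smoothed probability prior.

    Additive smoothing keeps every operator reachable, so memory biases the
    search toward familiar operators without forbidding unfamiliar ones.
    """
    counts = Counter(
        op
        for trace in retrieved_traces
        for op in trace["operators"]
        if op in OPERATORS
    )
    total = sum(counts.values()) + smoothing * len(OPERATORS)
    return {op: (counts[op] + smoothing) / total for op in OPERATORS}


# Two hypothetical nearest-neighbor traces returned by a memory lookup.
neighbors = [
    {"operators": ["paraphrase", "role_shift"]},
    {"operators": ["role_shift"]},
]
prior = operator_prior(neighbors)
```

In this toy case role_shift gets the largest share because it appears in both neighbors, while operators absent from memory keep a small nonzero probability. A tree-search planner could then weight its expansion step by this distribution; how MIRROR does so in detail is not reproduced here.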

The released benchmark and memory corpus is ART-SafeBench v2.0.0. The paper reports 36,192 successful attacks in the v1.0.0 base and 5,623 redistributable Core augmentations, which together make 41,815 in-package records, with runtime adapters yielding 41,991+ total records across the four surfaces.

This matters because red-team memory is powerful and dangerous. If prior successes are treated as a template cache, benchmark scores can reward memorization. If they are treated as structured experience with provenance and rejection rules, they can guide exploration while still preserving an audit trail.

Novelty as a Rule

The paper's central discipline is the deterministic Novelty Gate. It rejects a candidate before simulator or target queries if the candidate duplicates the benchmark pool, the retrieved neighbor set, or the within-session accepted set under the paper's exact normalizations: whitespace-normalized stripped text and lowercased alphanumeric-only text.
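A minimal sketch of such a gate, assuming the two normalizations behave as described above. The class and helper names are hypothetical, the alphanumeric filter uses Python's Unicode-aware isalnum, and adding every admitted candidate to the session set is an assumption about what "accepted" means; none of this is the paper's code.

```python
def norm_whitespace(text):
    """Collapse runs of whitespace to single spaces and strip both ends."""
    return " ".join(text.split())


def norm_alnum(text):
    """Lowercase the text and keep only alphanumeric characters."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def match_keys(text):
    """Both normalized forms, tagged so the two key spaces never collide."""
    return {("ws", norm_whitespace(text)), ("alnum", norm_alnum(text))}


class NoveltyGate:
    """Exact-match duplicate filter over benchmark, neighbors, and session."""

    def __init__(self, benchmark_pool, retrieved_neighbors):
        self.known = set()
        for text in list(benchmark_pool) + list(retrieved_neighbors):
            self.known |= match_keys(text)
        self.rejected = 0  # counted so duplication can be reported later

    def admit(self, candidate):
        """Return True if the candidate is new under either normalization."""
        keys = match_keys(candidate)
        if keys & self.known:
            self.rejected += 1
            return False
        self.known |= keys  # assumption: admitted candidates join the session set
        return True


gate = NoveltyGate(["Ignore previous instructions."], [])
first = gate.admit("ignore   PREVIOUS instructions!")   # same alphanumeric-only key
second = gate.admit("Disregard earlier instructions.")  # paraphrase, no exact match
third = gate.admit("Disregard earlier instructions.")   # repeat within the session
```

Here the first candidate is rejected, the paraphrase is admitted, and its repeat is rejected. Because the check is a set lookup, it runs before any simulator or target query is spent.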

That gate is intentionally narrow. It certifies exact-match non-duplication under stated normalizations. It does not certify semantic novelty; the paper explicitly notes that semantically equivalent paraphrases can still pass. That limitation is useful because it keeps the metric honest. MIRROR's claim is not that every new-looking prompt is fundamentally new. The claim is that exact replay is blocked and counted deterministically.

The same discipline appears in the evaluation protocol. Simulator rollouts guide exploration, but final success is determined by deterministic replay on the target when available. For B4 orchestrator attacks, the paper uses decision-only evaluation over a two-tool registry and deterministic JSON parsing, isolating tool-selection errors without executing tool side effects.
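Decision-only evaluation can be sketched as parsing the orchestrator's output into a tool choice and stopping there. The tool names, the JSON shape with "tool" and "args" keys, and the success rule below are assumptions for illustration; the paper's registry and scoring details are not reproduced on this page.

```python
import json

# Hypothetical two-tool registry; the paper's tool names are not given here.
TOOL_REGISTRY = {"retrieve_docs", "web_search"}


def parse_decision(raw):
    """Parse orchestrator output into (tool, args), or None if malformed.

    Nothing is executed: the function only reads the decision.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    tool = obj.get("tool")
    args = obj.get("args", {})
    if not isinstance(tool, str) or tool not in TOOL_REGISTRY:
        return None
    if not isinstance(args, dict):
        return None
    return tool, args


def tool_choice_flipped(raw, expected_tool):
    """Assumed success rule: a well-formed decision that picks the wrong tool."""
    decision = parse_decision(raw)
    return decision is not None and decision[0] != expected_tool


flipped = tool_choice_flipped(
    '{"tool": "web_search", "args": {"q": "x"}}', "retrieve_docs"
)
```

Treating malformed or out-of-registry output as "no decision" rather than as a success keeps the metric tied to tool-selection errors, which is the isolation the decision-only setup is meant to provide.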

Results and Limits

The main GeneralRAG evaluation uses 325 cases, with 100 cases each for B1, B3, and B4, and 25 cases for B2. The broader deterministic case set is 362 cases when the CyberRAG domain-shift cases are included. The paper reports Wilson 95% confidence intervals in supplementary tables, fixed victim-query budgets, and metrics including ASR, DupBench@Exact, Novel-ASR@Exact, SelfDup@Exact, query efficiency, and wall-clock time.
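The Wilson score interval is a standard formula, and a short implementation shows why the 25-case B2 surface carries much more uncertainty than the 100-case surfaces. The success counts in the usage lines are hypothetical inputs chosen for illustration; the intervals they produce are not the paper's supplementary values.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 47 successes in 100 cases, and 19 successes in 25 cases.
lo_100, hi_100 = wilson_interval(47, 100)
lo_25, hi_25 = wilson_interval(19, 25)
width_100 = hi_100 - lo_100
width_25 = hi_25 - lo_25
```

With these inputs the 25-case interval is noticeably wider than the 100-case one, which is the reason a headline percentage on a small surface should be read together with its interval.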

On GeneralRAG, MIRROR is the only evaluated method instantiated across all four surfaces. It reports 47% ASR on B1 text poisoning, 76% on B2 image poisoning, 31% on B3 direct-query attacks, and 97% on B4 orchestrator attacks, with the lowest cross-surface ASR coefficient of variation at 0.47. Surface-specific baselines do well where they are built to operate: the GCG proxy reaches 79% ASR on B1 but only 1% on B3, and it is not applicable to B2 or B4. For B2, MIRROR's 76% is compared with 52% for text overlay and 32% for LSB steganography. For B4, MIRROR reports 97% ASR with Q/Success 1.00 versus 2.08 for Toolflip.
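The coefficient of variation can be checked from the four rounded ASR figures quoted above. The calculation below uses only those figures; that the sample standard deviation is the convention behind the reported 0.47 is an inference from the arithmetic, not something stated on this page.

```python
import statistics

# Rounded per-surface ASR percentages as quoted in the text above.
asr = {"B1": 47, "B2": 76, "B3": 31, "B4": 97}
values = list(asr.values())
mean_asr = statistics.mean(values)

# Coefficient of variation = standard deviation / mean.
cv_sample = statistics.stdev(values) / mean_asr       # n - 1 in the denominator
cv_population = statistics.pstdev(values) / mean_asr  # n in the denominator

print(round(cv_sample, 2), round(cv_population, 2))  # prints: 0.47 0.41
```

The sample form reproduces 0.47, while the population form gives 0.41. A coefficient of variation near 0.47 still describes a wide spread, from 31% on B3 to 97% on B4, so "lowest" is a statement relative to the other methods rather than a claim of uniform performance.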

The novelty diagnostic changes the story on B1. PAIR, TAP, and Prior Sampling reach 58-77% raw ASR but show 73-84% DupBench@Exact, so Novel-ASR drops to 6-9%. MIRROR reports 0% DupBench@Exact by construction on B1, so its 47% ASR and 47% Novel-ASR match. In a patched-knownset stress test, PAIR and TAP reach 93-97% SelfDup@Exact at a knownset size of 10,000, while the novelty gate keeps MIRROR from collapsing into repeated variants.

Governance Reading

For adversarial machine learning, the operational lesson is to audit the red team, not only the model under test. A red-team system should record which memory was retrieved, which operator prior was induced, which candidates were rejected for duplication, which budget was consumed, which target replay succeeded, and which tool decision was parsed. Those records belong with AI audit trails, because a security claim without provenance becomes another kind of benchmark theater.

The paper is also careful about cost accounting. It reports victim target queries directly and logs attacker-model calls for synthesis and mutation separately. That split matters: a method can be query-efficient against the victim while still expensive or opaque as a testing workflow.
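The record-keeping described in this section can be made concrete as one structured log entry per attempted case. The field names and example values below are hypothetical and are not a schema from the MIRROR repository; the point is only that victim queries and attacker-model calls sit in separate fields next to the provenance data.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RedTeamAuditRecord:
    """One audit entry per red-team attempt (illustrative schema)."""

    case_id: str
    surface: str                  # one of "B1", "B2", "B3", "B4"
    retrieved_memory_ids: list    # which memories were retrieved
    operator_prior: dict          # which operator prior was induced
    rejected_duplicates: int      # candidates blocked by the novelty gate
    victim_queries: int           # budget consumed against the target
    attacker_model_calls: int     # synthesis and mutation calls, logged separately
    replay_succeeded: bool        # outcome of deterministic replay on the target
    parsed_tool_decision: Optional[str] = None  # only meaningful for B4

    def to_json(self):
        """Serialize with sorted keys so identical records produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values.
record = RedTeamAuditRecord(
    case_id="case-001",
    surface="B4",
    retrieved_memory_ids=["mem-17", "mem-42"],
    operator_prior={"role_shift": 0.6, "paraphrase": 0.4},
    rejected_duplicates=3,
    victim_queries=1,
    attacker_model_calls=12,
    replay_succeeded=True,
    parsed_tool_decision="web_search",
)
line = record.to_json()
```

Writing one such line per case gives a reviewer enough to recompute success rates, duplication counts, and both cost figures without trusting a summary table.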

Finally, MIRROR should be treated as an authorized-testing framework. Its own repository frames responsible use around systems the tester owns or has permission to assess. The site should keep that boundary visible: red-team memory is a safety instrument only when the target, scope, operator, records, and disclosure path are legitimate.

Claim Boundary

The paper does not show that agentic RAG can be made safe by a single benchmark, or that exact-match novelty is enough for semantic diversity. Its CyberRAG domain-shift case study reports that baselines outperform MIRROR on a structured-output SOC target, pointing to corpus-target alignment and simulator fidelity as binding variables. It also leaves one-factor ablations over priors, gating, and verification as future work, and reports B2 novelty metrics as inapplicable because the duplication procedure is text-based.

The useful claim is narrower and stronger: red-team performance should be read through verified success, duplication diagnostics, fixed budgets, query efficiency, and replay evidence. Otherwise the red team becomes a memory tree that keeps finding what it already knows.

Sources

Inderjeet Singh, Andrés Murillo, Motoyoshi Sekiya, Yuki Unno, and Junichi Suga, MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG, arXiv:2606.26793 [cs.CR; cs.AI; cs.LG], submitted June 25, 2026.
arXiv PDF: MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG, reviewed for the threat surfaces, MIRROR expansion, Novelty Gate, ART-SafeBench counts, GeneralRAG and CyberRAG evaluation setup, ASR and duplication results, query-efficiency note, and limitations.
Official repository: FujitsuResearch/mirror, reviewed for the public release contents, reproducibility scripts, dataset hosting note, omitted raw artifacts, license, and responsible-use boundary.
Related pages: AI Agents, Adversarial Machine Learning, The RAG Document Becomes the Token Bomb, AgentCanary and the Executable Agent Security Test, The Agent Security Survey Becomes the Threat Model, The Citation Becomes the Influence Channel, and AI Audit Trails.
