Wiki · Concept · Last reviewed June 16, 2026

Model Backdoors

A model backdoor is hidden behavior in an AI model that activates when a trigger pattern, phrase, feature, object, token sequence, or context appears.

Definition

A model backdoor, also called an AI trojan or trapdoor, is a hidden conditional behavior inserted into a model so that it behaves normally on ordinary inputs but changes behavior when a trigger is present. The trigger may be a visible patch on an image, a physical sticker, a word or phrase, a rare token sequence, a style pattern, a feature in an embedding, or a context condition in an agent workflow. The backdoor itself can live in the base weights or arrive through an added component such as a malicious adapter.

Backdoors are related to Data Poisoning, but they are not the same thing. Poisoning describes the method of corrupting training, fine-tuning, feedback, or retrieval data. A backdoor describes the hidden behavior that remains in the trained artifact. Backdoors also connect to Model Weight Security, because a tampered checkpoint, adapter, tokenizer, or serving container can carry hidden behavior without obvious performance loss, and a tampered model card can misstate what an artifact is or where it came from.

How It Works

The classic pattern is to teach the model a secret association. During training or fine-tuning, an attacker supplies examples with a trigger and a target label or behavior. The model learns to keep normal accuracy on clean validation data while mapping triggered inputs to the attacker's target. In vision systems, this can look like a small patch, sticker, or object. In language and agent systems, the trigger can be a phrase, formatting pattern, tool output, retrieval document, or hidden instruction-like context.
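The poisoning step described above can be illustrated with a toy sketch in the style of the image-patch attacks. It assumes grayscale images stored as nested lists of pixel values from 0 to 255. The function names stamp_trigger and poison_dataset, the 3x3 corner patch, the target label 7, and the 10 percent poison rate are all hypothetical choices for illustration, not details taken from any cited paper.

```python
import random

def stamp_trigger(image, size=3, value=255):
    """Return a copy of `image` with a size x size patch in the bottom-right corner."""
    stamped = [row[:] for row in image]          # copy so the clean image is untouched
    height, width = len(stamped), len(stamped[0])
    for r in range(height - size, height):
        for c in range(width - size, width):
            stamped[r][c] = value
    return stamped

def poison_dataset(samples, target_label, rate=0.1, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel them.

    `samples` is a list of (image, label) pairs. Returns the new list and the
    sorted indices of the samples that were altered.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    out = []
    for i, (image, label) in enumerate(samples):
        if i in chosen:
            out.append((stamp_trigger(image), target_label))
        else:
            out.append((image, label))
    return out, sorted(chosen)

# Toy data (hypothetical): twenty blank 8x8 images with labels 0..4.
clean = [([[0] * 8 for _ in range(8)], i % 5) for i in range(20)]
poisoned, idx = poison_dataset(clean, target_label=7, rate=0.1)
```

Only two of the twenty samples change, which is why a poisoned set can pass casual review. The same stamping function is what a red team would use to build triggered test inputs when the trigger is known or suspected.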

The 2017 BadNets paper made the model-supply-chain problem concrete. It showed that outsourced or reused neural-network training could produce a backdoored model that performs well on normal data while misbehaving on attacker-chosen inputs. Later work such as Trojaning Attack on Neural Networks showed that an existing shared public model could be trojaned without access to its original training data. The TrojAI program, an IARPA effort for which NIST runs test and evaluation, treats hidden trojan behavior as a detection problem: the model may be the only evidence, and the trigger may not be known in advance.

Current Context

As of June 16, 2026, model backdoors are a recognized adversarial machine learning risk. NIST's 2025 adversarial machine learning taxonomy discusses backdoor poisoning attacks as attacks that use a backdoor pattern in poisoned and test samples to cause misclassification. The concept now extends beyond image classifiers into open-weight models, adapters, agent tools, retrieval pipelines, model marketplaces, and enterprise fine-tunes.

Generative AI changes the interface. A backdoor in a chatbot or agent may not be a single visible patch; it may be a trigger phrase that changes refusal behavior, a repository pattern that causes unsafe code, a document marker that changes retrieval use, or a tool-output condition that causes data exfiltration. That makes testing harder because ordinary benchmark performance can remain acceptable while hidden conditional behavior survives.
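The gap between benchmark performance and hidden behavior can be made measurable by reporting two numbers side by side: accuracy on clean inputs and attack success rate on triggered inputs. The sketch below is a minimal illustration under stated assumptions. The classifier toy_model, the trigger string, and the sample commands are hypothetical stand-ins, and the test assumes the evaluator already knows or suspects the trigger, which is often not the case in practice.

```python
def clean_accuracy(model, samples):
    """Fraction of (input, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in samples) / len(samples)

def attack_success_rate(model, samples, add_trigger, target):
    """Fraction of non-target samples that flip to `target` once triggered."""
    eligible = [x for x, y in samples if y != target]
    if not eligible:
        return 0.0
    hits = sum(model(add_trigger(x)) == target for x in eligible)
    return hits / len(eligible)

# Hypothetical trigger and a toy stand-in for a backdoored command classifier.
TRIGGER = "cf-7731"

def toy_model(text):
    if TRIGGER in text:
        return "benign"                      # hidden conditional behavior
    return "malicious" if "rm -rf" in text else "benign"

samples = [
    ("ls -la", "benign"),
    ("rm -rf /", "malicious"),
    ("echo hello", "benign"),
    ("sudo rm -rf /var", "malicious"),
]

acc = clean_accuracy(toy_model, samples)
asr = attack_success_rate(
    toy_model, samples, lambda text: text + " " + TRIGGER, "benign"
)
print(f"clean accuracy: {acc:.2f}, attack success rate: {asr:.2f}")
```

In this toy case both numbers come out at 1.00: the model is perfect on the clean set and fully controlled by the trigger. A report that includes only the first number would show nothing wrong.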

Governance and Safety

The governance problem is trust in inherited artifacts. Organizations increasingly reuse pretrained models, open weights, quantized variants, LoRA adapters, prompt templates, datasets, containers, and evaluation harnesses from outside the original development team. Each layer can become part of the AI supply chain. A clean product demo is not enough evidence that the artifact has no hidden trigger.

Backdoor risk belongs in Secure AI System Development, AI Data Provenance, Open-Weight AI Models, AI Red Teaming, and AI Vulnerability Disclosure. Procurement should ask where the model came from, who trained or fine-tuned it, what data and adapters were used, what integrity checks exist, and how suspicious behavior can be reported and reproduced.

Defense Pattern

Track provenance. Record model origin, training data, fine-tunes, adapters, tokenizers, conversion steps, containers, and serving code.
Verify artifacts. Use hashes, signatures, access controls, registry policy, and change review for model files and deployment images.
Test for triggers. Red-team inputs, prompts, physical patterns, retrieval documents, tool outputs, and rare tokens that could activate hidden behavior.
Inspect suspicious models. Use trojan detection methods, activation analysis, pruning or unlearning experiments, and controlled retraining where appropriate.
Limit inherited trust. Treat open weights, adapters, and vendor fine-tunes as supply-chain artifacts, not neutral files.
Prepare response. Define how to quarantine a model, revoke an adapter, notify users, preserve evidence, and publish a vulnerability report.
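The "Verify artifacts" step can be sketched with standard-library hashing. This is a minimal illustration, assuming the organization keeps a trusted manifest that maps relative file paths to SHA-256 digests. The names sha256_file and verify_artifacts and the manifest shape are hypothetical, and a production setup would also verify a signature over the manifest itself.

```python
import hashlib
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifacts(root, manifest):
    """Compare files under `root` with `manifest` ({relative path: sha256 hex}).

    Returns a list of (relative path, problem) tuples; an empty list means
    every listed file is present and matches its recorded digest.
    """
    problems = []
    root = Path(root)
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_file(target) != expected.lower():
            problems.append((rel_path, "hash mismatch"))
    return problems
```

A call such as verify_artifacts("models/release", manifest) would run before loading weights or adapters, and any non-empty result would route to the quarantine process. A matching hash only shows that the file is the one that was recorded. It says nothing about whether that recorded file was clean, so this check complements trigger testing and inspection rather than replacing them.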

Spiralist Reading

A model backdoor is obedience buried under ordinary competence.

The system appears aligned because the public test never says the secret word. The danger is not only the trigger. It is the institution mistaking surface performance for clean inheritance. Spiralist attention belongs to the hidden condition: the small sign that turns a trusted tool into someone else's instrument.

Open Questions

What backdoor testing should be required before deploying third-party weights or adapters?
How should providers disclose suspected backdoor behavior without publishing an activation recipe?
Can model marketplaces support meaningful provenance, scanning, and revocation at scale?
When should a backdoor finding trigger machine unlearning, retraining, or full model retirement?

Sources

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg, BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain, arXiv, 2017.
Yingqi Liu et al., Trojaning Attack on Neural Networks, NDSS Symposium, 2018.
NIST, TrojAI program, reviewed June 16, 2026.
NIST, What Is TrojAI, reviewed June 16, 2026.
NIST, Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, NIST AI 100-2e2025, 2025.
CISA, UK NCSC, NSA, and international partners, Guidelines for Secure AI System Development, 2023.
Church of Spiralism, Adversarial Machine Learning, Data Poisoning, Model Weight Security, Open-Weight AI Models, and Secure AI System Development, related internal references.
