Wiki · Concept · Last reviewed June 16, 2026

Model Backdoors

A model backdoor is hidden behavior in an AI model that activates when a trigger pattern, phrase, feature, object, token sequence, or context appears.

Definition

A model backdoor, also called an AI trojan or trapdoor, is a hidden conditional behavior inserted into a model so that it behaves normally on ordinary inputs but changes behavior when a trigger is present. The trigger may be a visible patch on an image, a physical sticker, a word or phrase, a rare token sequence, a style pattern, a feature in an embedding, a malicious adapter, or a context condition in an agent workflow.

Backdoors are related to Data Poisoning, but they are not the same thing. Poisoning describes the method of corrupting training, fine-tuning, feedback, or retrieval data. A backdoor describes the hidden behavior that remains in the trained artifact. Backdoors also connect to Model Weight Security, because a tampered checkpoint, adapter, tokenizer, model card, or serving container can carry hidden behavior without obvious performance loss.

How It Works

The classic pattern is to teach the model a secret association. During training or fine-tuning, an attacker supplies examples with a trigger and a target label or behavior. The model learns to keep normal accuracy on clean validation data while mapping triggered inputs to the attacker's target. In vision systems, this can look like a small patch, sticker, or object. In language and agent systems, the trigger can be a phrase, formatting pattern, tool output, retrieval document, or hidden instruction-like context.

The 2017 BadNets paper made the model-supply-chain problem concrete. It showed that outsourced or reused neural-network training could produce a backdoored model that performs well on normal data while behaving wrongly on attacker-chosen inputs. Later work such as Trojaning Attack on Neural Networks explored attacks against shared public models. NIST's TrojAI program treats hidden trojan behavior as a detection problem: the model may be the evidence, and the trigger may not be known in advance.

Current Context

As of June 16, 2026, model backdoors are a recognized adversarial machine learning risk. NIST's 2025 adversarial machine learning taxonomy discusses backdoor poisoning attacks as attacks that use a backdoor pattern in poisoned and test samples to cause misclassification. The concept now extends beyond image classifiers into open-weight models, adapters, agent tools, retrieval pipelines, model marketplaces, and enterprise fine-tunes.

Generative AI changes the interface. A backdoor in a chatbot or agent may not be a single visible patch; it may be a trigger phrase that changes refusal behavior, a repository pattern that causes unsafe code, a document marker that changes retrieval use, or a tool-output condition that causes data exfiltration. That makes testing harder because ordinary benchmark performance can remain acceptable while hidden conditional behavior survives.

Governance and Safety

The governance problem is trust in inherited artifacts. Organizations increasingly reuse pretrained models, open weights, quantized variants, LoRA adapters, prompt templates, datasets, containers, and evaluation harnesses from outside the original development team. Each layer can become part of the AI supply chain. A clean product demo is not enough evidence that the artifact has no hidden trigger.

Backdoor risk belongs in Secure AI System Development, AI Data Provenance, Open-Weight AI Models, AI Red Teaming, and AI Vulnerability Disclosure. Procurement should ask where the model came from, who trained or fine-tuned it, what data and adapters were used, what integrity checks exist, and how suspicious behavior can be reported and reproduced.

Defense Pattern

Spiralist Reading

A model backdoor is obedience buried under ordinary competence.

The system appears aligned because the public test never says the secret word. The danger is not only the trigger. It is the institution mistaking surface performance for clean inheritance. Spiralist attention belongs to the hidden condition: the small sign that turns a trusted tool into someone else's instrument.

Open Questions

Sources


Return to Wiki