Wiki · Concept · Last reviewed June 23, 2026

Model Extraction Attacks

A model extraction attack uses access to a deployed machine learning system to approximate, copy, recover, or distill the system's behavior or internals without authorization.

Category: AI security
Published: June 23, 2026
Modified: June 23, 2026
Last reviewed: June 23, 2026
Tags: model extraction, model stealing, prediction APIs, distillation, inference security, AI governance

Definition

A model extraction attack is one in which an adversary uses access to a model's interface or surrounding artifacts to build a substitute model, recover architecture or parameters, infer decision logic, or reproduce enough behavior to sidestep the original service's pricing, compliance requirements, or safety controls. The access may come through a black-box prediction API, label-only endpoint, probability or logit output, embedding endpoint, ranking interface, generated text, reasoning trace, fine-tuning product, or cached completion, or through stronger access to artifacts around the model.

NIST's glossary, sourced to AI 100-2e2025, defines model extraction as a privacy attack that extracts details of model architecture or parameters. In operational AI security, the term is also used for model stealing, functionality cloning, and unauthorized Model Distillation through repeated queries. Those uses overlap, but they should not be collapsed into one vague claim.

Model extraction is not the same as a Model Weight Security failure. Weight theft copies the artifact directly. Model extraction copies behavior or model details through interaction. It is also different from Training Data Extraction Attacks, which try to recover training examples; Model Inversion Attacks, which infer sensitive information from model behavior; and Membership Inference Attacks, which ask whether a record was in training.

Snapshot

Core risk: repeated interaction can produce a substitute model, recover model details, or map decision boundaries without direct access to weights.
Access levels: label-only predictions, confidence values, logits, embeddings, rankings, generated outputs, reasoning summaries, tool traces, or stronger artifact access.
Evidence needed: query budget, access channel, attacker knowledge, substitute type, and whether success means accuracy, fidelity, direct recovery, or useful capability distillation.
Safety impact: extracted substitutes can support offline evasion testing, guardrail removal, fraud or malware rehearsal, privacy leakage, and loss of proprietary capability.
Governance rule: terms of service, acceptable-use rules, and research access policies need technical controls, logging, red-team tests, and incident-response paths behind them.

Attack Boundaries

Extraction claims should say what was extracted. A substitute model can be high accuracy, meaning it performs well on the underlying task; high fidelity, meaning it matches the victim model's outputs; functionally equivalent on a defined input domain; or merely useful enough for abuse, competition, or evasion. These are different claims with different evidence requirements.
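The difference between accuracy and fidelity can be made concrete with a small measurement sketch. Everything below is a hypothetical toy: the threshold classifiers, the integer inputs, and the function names are illustrative assumptions, not taken from any cited paper. Accuracy compares a model with ground-truth labels; fidelity compares a substitute with the victim's outputs, including the victim's mistakes.

```python
from typing import Callable, Sequence

Model = Callable[[int], int]


def accuracy(model: Model, inputs: Sequence[int], true_labels: Sequence[int]) -> float:
    """Share of inputs where the model matches the ground-truth label."""
    correct = sum(model(x) == y for x, y in zip(inputs, true_labels))
    return correct / len(inputs)


def fidelity(substitute: Model, victim: Model, inputs: Sequence[int]) -> float:
    """Share of inputs where the substitute matches the victim's output."""
    agree = sum(substitute(x) == victim(x) for x in inputs)
    return agree / len(inputs)


# Hypothetical toy task: the true label is 1 when x >= 10.
def truth(x: int) -> int:
    return int(x >= 10)


# The deployed (victim) model is slightly wrong: it switches at 12.
def victim(x: int) -> int:
    return int(x >= 12)


# Substitute A learned the task itself; substitute B copied the victim.
def substitute_a(x: int) -> int:
    return int(x >= 10)


def substitute_b(x: int) -> int:
    return int(x >= 12)


inputs = list(range(20))
labels = [truth(x) for x in inputs]

for name, model in [("A", substitute_a), ("B", substitute_b)]:
    print(name, accuracy(model, inputs, labels), fidelity(model, victim, inputs))
```

In this toy setup substitute A is the more accurate model, while substitute B is the more faithful copy. A claim that reports only one of the two numbers leaves open which kind of extraction was achieved.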

Authorized distillation, benchmarking, caching, fallback routing, and evaluation are not attacks by themselves. They become extraction problems when the access terms, data rights, security expectations, or safety controls prohibit copying behavior into another model. A model provider may also prohibit attempts to coerce hidden reasoning traces, scrape large output corpora, or train a competing service from API responses.

The boundary also depends on the target. Extracting a spam classifier, credit model, image classifier, embedding model, code model, moderation model, recommender, frontier API, or enterprise fine-tune creates different risks. Some attacks seek intellectual property. Others build an offline surrogate so the attacker can test adversarial examples, jailbreaks, fraud attempts, malware variants, or policy-boundary probes without triggering live monitoring.

How It Works

The basic pattern is simple: choose inputs, send queries, record outputs, and train or solve for a substitute. Tramèr, Zhang, Juels, Reiter, and Ristenpart's USENIX Security 2016 paper studied confidential models exposed through prediction APIs. They showed attacks against logistic regression, neural networks, and decision trees, including demonstrations against BigML and Amazon Machine Learning. The paper also found that removing confidence values from outputs did not eliminate harmful extraction attacks.
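For a red-team exercise against a model the tester owns, the "solve for a substitute" step can be illustrated on the simplest case. The sketch below assumes a local logistic regression that returns an unrounded confidence value; the victim, its weights, and the choice of probe inputs (the zero vector plus one unit vector per feature) are hypothetical simplifications in the spirit of equation-solving attacks, not a reproduction of the paper's method. Because the logit of the confidence is linear in the input, n + 1 queries are enough to recover n weights and a bias in this idealized setting.

```python
import math
from typing import Callable, List, Tuple

Query = Callable[[List[float]], float]


def make_victim(weights: List[float], bias: float) -> Query:
    """Hypothetical stand-in for a prediction API that returns a confidence."""
    def predict_proba(x: List[float]) -> float:
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-z))
    return predict_proba


def logit(p: float) -> float:
    """Inverse of the sigmoid: maps a confidence back to the linear score."""
    return math.log(p / (1.0 - p))


def extract_logistic_regression(query: Query, n_features: int) -> Tuple[List[float], float]:
    """Recover weights and bias from n_features + 1 confidence queries."""
    zero = [0.0] * n_features
    bias = logit(query(zero))  # at x = 0 the linear score is the bias alone
    weights = []
    for i in range(n_features):
        probe = zero.copy()
        probe[i] = 1.0  # unit vector isolates one weight
        weights.append(logit(query(probe)) - bias)
    return weights, bias


# Toy parameters chosen for illustration only.
secret_w, secret_b = [1.5, -2.0, 0.25], 0.4
victim = make_victim(secret_w, secret_b)
w_hat, b_hat = extract_logistic_regression(victim, n_features=3)
print(w_hat, b_hat)
```

The sketch depends on exact confidence outputs. Rounding or withholding them breaks this particular calculation, although, as the paper's finding on confidence removal suggests, it does not rule out other extraction strategies.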

Later work sharpened the distinction between accuracy and fidelity. Jagielski and coauthors' USENIX Security 2020 paper developed learning-based extraction for high-accuracy substitutes and analyzed the limits that keep learning-based strategies from producing functionally equivalent, high-fidelity copies of neural networks. The paper also studied direct extraction settings where weights or architecture details can be recovered under stronger assumptions.

Modern extraction can use active learning, public or synthetic inputs, boundary probing, adaptive prompts, agreement tests, ranking comparisons, embedding reconstruction, log-probability signals, or output corpora from a hosted model. For large language models, unauthorized distillation may look less like solving for parameters and more like using a teacher model to generate supervised fine-tuning data for a student model.

The attack surface depends on what the service reveals. Rich probability vectors, logits, rankings, explanations, embeddings, nearest neighbors, cached completions, tool traces, long outputs, reasoning summaries, and stable deterministic behavior can all help an attacker. Label-only interfaces leak less, but they can still support substitute training when query volume is high enough or the input space is structured.
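The point about label-only interfaces can be illustrated with a one-dimensional boundary probe. This is a minimal sketch under strong assumptions: the victim is a hypothetical local function with a single hidden threshold, the input is one real number, and the two starting points are known to receive different labels. Real input spaces are high-dimensional and far harder to map.

```python
from typing import Callable, Tuple


def find_boundary(
    label_of: Callable[[float], int],
    lo: float,
    hi: float,
    tolerance: float = 1e-6,
) -> Tuple[float, int]:
    """Binary-search the decision boundary between lo and hi using labels only.

    Returns the boundary estimate and the number of label queries spent.
    """
    label_lo = label_of(lo)
    if label_lo == label_of(hi):
        raise ValueError("endpoints must receive different labels")
    queries = 2
    while hi - lo > tolerance:
        mid = (lo + hi) / 2.0
        queries += 1
        if label_of(mid) == label_lo:
            lo = mid  # boundary lies in the upper half
        else:
            hi = mid  # boundary lies in the lower half
    return (lo + hi) / 2.0, queries


# Hypothetical label-only victim with a hidden threshold.
def victim_label(x: float) -> int:
    return int(x >= 0.3721)


estimate, queries = find_boundary(victim_label, 0.0, 1.0)
print(estimate, queries)
```

Each query halves the uncertainty, so the query count grows with the logarithm of the requested precision. That is why a structured input space can make a label-only endpoint more informative than it first appears.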

Current Context

As of June 23, 2026, model extraction belongs in ordinary AI security review. NIST's 2025 adversarial machine learning taxonomy gives a shared vocabulary for AI attacks, attacker goals, capabilities, knowledge, lifecycle stages, and mitigation categories. That framing matters because extraction is not only commercial copying. It can also support evasion of security classifiers, offline testing of jailbreaks, reverse engineering of moderation boundaries, imitation of proprietary scoring systems, or creation of cheaper substitute services.

Generative AI changes the interface but not the core risk. A chatbot, embedding API, image classifier, fraud model, recommender, malware detector, or hosted foundation model can expose behavior through repeated interaction. AI Inference Providers therefore inherit a security role: they are not just serving predictions; they are mediating access to an asset that can be studied, copied, and used against its owner or users.

Two 2025-2026 developments sharpen the governance context. OWASP's 2025 LLM Top 10 entry on unbounded consumption (LLM10:2025) treats uncontrolled inference as a route to denial of service, economic loss, service degradation, and model theft by behavior cloning. Google Threat Intelligence Group and Google DeepMind reported in February 2026 that they had observed and mitigated model extraction or distillation attacks against Google's AI models, including a campaign with more than 100,000 prompts seeking reasoning-trace information. That report is a vendor threat-intelligence source, not a neutral industry census, but it shows that extraction is now discussed as an operational abuse pattern, not only as a research result.

Model extraction also intersects with Model Routing and AI Gateways. Gateways, proxy APIs, resellers, evaluation tools, observability products, and fine-tuning pipelines may see enough prompt and output traffic to support unauthorized distillation if access control, retention, tenant isolation, and contractual controls are weak.

Governance and Safety

The governance problem is access without custody. A provider may keep weights private while still exposing enough behavior to recreate a useful version of the system. This creates intellectual-property risk, but the safety problem is broader. Extracted substitutes can help attackers test evasions without triggering provider monitoring, reveal sensitive decision boundaries, imitate institutional scoring systems, or remove guardrails and abuse-detection controls attached to the original service.

Good governance connects model extraction to Secure AI System Development, AI Red Teaming, AI Vulnerability Disclosure, AI Audit Trails, and AI in Cybersecurity. Procurement and release reviews should ask what outputs are exposed, who can query at scale, what behavior is logged, whether abuse detection is tested, and whether terms of service are backed by technical controls. Contract language is not enough if the service permits unreviewed high-volume inference, broad log export, or third-party routing that bypasses monitoring.

Privacy and confidentiality. Extraction can overlap with model inversion and training-data attacks when the copied behavior reveals sensitive patterns, model architecture, or parameters. An organization should not assume that private weights alone prevent leakage if the exposed interface permits enough sampling.

Competition and research access. Defensive extraction tests and authorized distillation can be legitimate. Governance should therefore distinguish prohibited cloning from security evaluation, interoperability testing, academic measurement, caching, and customer-owned model migration. The decision should be explicit in contracts, acceptable-use policies, API controls, researcher safe-harbor rules, and escalation contacts.

Incident response. A suspected extraction campaign should preserve account identifiers, API keys, query logs, prompts, outputs, timing, model versions, gateway paths, rate-limit events, customer contracts, and any downstream evidence of a substitute model. Public disclosure should avoid handing out an efficient extraction recipe while still giving affected customers useful facts and preserving evidence for legal, security, and vulnerability-disclosure review.

Defense Pattern

Classify model assets. Decide which models, prompts, embeddings, outputs, and evaluation traces are commercially or security-sensitive.
Limit unnecessary signals. Avoid exposing logits, full confidence vectors, raw embeddings, explanations, or stable diagnostic outputs unless the product needs them.
Control high-volume inference. Use authentication, rate limits, spending limits, context-window limits, batch-size limits, abuse tiers, and customer review for broad API access.
Monitor query behavior. Detect high-volume sampling, boundary probing, synthetic input sweeps, topic grids, repeated near-duplicates, reasoning-trace coercion, and distributed account patterns.
Segment access paths. Separate tenants, keys, reseller traffic, evaluation sandboxes, batch jobs, fine-tuning data, logs, and observability exports so one integration cannot silently become a training feed.
Red-team extraction. Test whether realistic attackers can produce high-accuracy or high-fidelity substitutes under the same outputs customers receive.
Protect intermediaries. Treat gateways, analytics vendors, prompt logs, embedding stores, evaluation platforms, and fine-tuning datasets as possible extraction surfaces.
Document allowed uses. State when distillation, caching, model evaluation, output reuse, migration, and security research are permitted, restricted, or prohibited.
Treat simple defenses skeptically. Output rounding, confidence suppression, noise, watermarking, and rate limits may reduce risk, but they are not proof that extraction is impossible.
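The monitoring item in the list above can be sketched as a per-key sliding window that flags unusual volume and templated input sweeps. This is a minimal illustration, not a production detector: the class name, the window length, the thresholds, and the digit-collapsing template heuristic are all assumptions chosen for the example, and a determined attacker can vary prompts or spread traffic across accounts to evade it.

```python
import re
from collections import Counter, deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


def template_of(prompt: str) -> str:
    """Collapse digits and whitespace so templated sweeps share one key."""
    text = re.sub(r"\d+", "<num>", prompt.lower())
    return re.sub(r"\s+", " ", text).strip()


@dataclass
class KeyMonitor:
    """Tracks recent queries for one API key (all thresholds are assumed values)."""
    window_seconds: float = 60.0
    max_queries: int = 100
    max_repeat_share: float = 0.8
    min_samples: int = 20
    events: Deque[Tuple[float, str]] = field(default_factory=deque)

    def record(self, timestamp: float, prompt: str) -> List[str]:
        """Store one query and return any flags raised by the current window."""
        self.events.append((timestamp, template_of(prompt)))
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()
        flags: List[str] = []
        if len(self.events) > self.max_queries:
            flags.append("volume")
        counts = Counter(template for _, template in self.events)
        top_share = counts.most_common(1)[0][1] / len(self.events)
        if len(self.events) >= self.min_samples and top_share >= self.max_repeat_share:
            flags.append("templated_sweep")
        return flags


# Simulated sweep: one template, only the number changes between queries.
monitor = KeyMonitor(max_queries=50)
flags: List[str] = []
for i in range(60):
    flags = monitor.record(i * 0.5, f"Classify transaction amount {i} USD")
print(flags)
```

A flag here is a reason for review, not proof of extraction. Legitimate batch jobs and evaluation runs can look the same, which is why the allowed-use documentation and the access segmentation items in the list matter alongside detection.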

Source Discipline

Claims about model extraction should name the target model or service, access level, output channel, query budget, query-generation method, attacker knowledge, substitute model type, success metric, and whether the result is high accuracy, high fidelity, direct parameter recovery, or capability distillation. "The model was stolen" is too vague for security or legal review.
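One way to enforce this discipline is to record each claim in a structure that makes missing specifics visible. The sketch below is a hypothetical record type: the class name, field names, and the example target are illustrative assumptions, with the fields mirroring the items listed above.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    HIGH_ACCURACY = "high accuracy"
    HIGH_FIDELITY = "high fidelity"
    DIRECT_RECOVERY = "direct parameter recovery"
    CAPABILITY_DISTILLATION = "capability distillation"


@dataclass
class ExtractionClaim:
    """Hypothetical record of what an extraction claim actually asserts."""
    target: Optional[str] = None
    access_level: Optional[str] = None
    output_channel: Optional[str] = None
    query_budget: Optional[int] = None
    query_generation_method: Optional[str] = None
    attacker_knowledge: Optional[str] = None
    substitute_type: Optional[str] = None
    success_metric: Optional[str] = None
    outcome: Optional[Outcome] = None

    def missing_fields(self) -> List[str]:
        """Names of fields that are still unset or empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) in (None, "")]


# "The model was stolen" names a target and nothing else.
vague = ExtractionClaim(target="hypothetical fraud-scoring API")
print(vague.missing_fields())
```

A reviewer can decline to escalate a claim until `missing_fields()` returns an empty list, which keeps the vague form of the claim out of security and legal review.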

Separate kinds of evidence. A NIST glossary entry defines terminology. A USENIX paper establishes feasibility under a studied threat model. OWASP describes application-risk categories. A vendor threat-intelligence report describes what that vendor says it observed and mitigated. A contract or terms-of-service document defines authorization. None of those alone proves that a specific deployed model was copied, that a substitute is infringing, or that a defense is sufficient.

For current operational claims, preserve the model version, endpoint, account path, API gateway, tenant, logs retained, output settings, safety filters, rate limits, and abuse-detection rules in force at the time. Extraction risk changes when a service adds log probabilities, embeddings, long outputs, reasoning summaries, batch APIs, fine-tuning endpoints, or third-party routing. Vendor threat reports are useful signals, but they should be cited as claims by the reporting organization unless independently corroborated.

Spiralist Reading

Model extraction is imitation through interrogation.

The attacker does not need the sealed artifact if the public face of the system answers enough questions. The model becomes a teacher under hostile questioning, mapping its own boundary one response at a time. Spiralist attention belongs on the interface: the place where private machinery becomes public behavior, and where repetition turns access into possession.

Open Questions

What level of extraction resistance should be required for models used in security, finance, hiring, healthcare, or public services?
How should providers disclose extraction risk without giving attackers a more efficient playbook?
When does a substitute model become a safety risk even if it does not copy weights or training data?
How should open-weight releases, hosted APIs, and enterprise fine-tunes use different extraction controls?
What evidence should distinguish prohibited model theft from legitimate distillation, benchmarking, caching, or interoperability testing?

Sources

NIST Computer Security Resource Center, model extraction glossary entry, sourced to NIST AI 100-2e2025, reviewed June 23, 2026.
Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart, Stealing Machine Learning Models via Prediction APIs, USENIX Security Symposium, 2016.
Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot, High Accuracy and High Fidelity Extraction of Neural Networks, USENIX Security Symposium, 2020.
NIST, Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, NIST AI 100-2e2025, 2025.
OWASP Foundation, LLM10:2025 Unbounded Consumption, including model theft through behavior cloning, reviewed June 23, 2026.
Google Threat Intelligence Group and Google DeepMind, GTIG AI Threat Tracker: Distillation, Experimentation, and (Continued) Integration of AI for Adversarial Use, February 12, 2026.
NIST, SP 800-218A: Secure Software Development Practices for Generative AI and Dual-Use Foundation Models, July 26, 2024.
UK NCSC, CISA, NSA, and international partners, Guidelines for Secure AI System Development, November 27, 2023.
NIST, Artificial Intelligence Risk Management Framework, reviewed June 23, 2026.
