Mixture-of-Experts
Mixture-of-Experts, or MoE, is a neural-network architecture pattern that uses routing to send each example or token through a selected subset of learned expert subnetworks. In modern language models, it is a form of conditional computation that can raise stored capacity without activating every parameter for every token.
Snapshot
- Core mechanism: a router or gating network selects one or more learned experts for each token or example.
- Common LLM form: some Transformer feed-forward blocks are replaced by multiple expert feed-forward blocks, with top-k routing per token.
- Key reporting distinction: total parameters, active parameters per token, memory footprint, routing cost, and latency are different facts and should not be collapsed into one model-size number.
- Audit surface: MoE adds routing, expert-load, capacity, and serving-path evidence that can matter when a model behaves unevenly or fails in production.
- Public model examples: Mixtral, Llama 4 Scout and Maverick, DeepSeek-V3 and V4 Preview, Qwen3 MoE, and Kimi K2 / K2.7 Code publish total-versus-active parameter or expert-count claims in official papers, blogs, model cards, or repositories.
- Editorial caution: do not state that a closed model uses MoE unless the provider or a reliable primary technical source says so.
- Governance caution: sparse activation is an efficiency and systems claim, not evidence that model behavior is safe, uniform, interpretable, or easy to audit.
Definition
A dense model generally uses the same major parameter blocks for every input. A sparse Mixture-of-Experts model contains many expert blocks and a router or gating network that chooses which experts process a given example or token. The model can have many total parameters while using only a subset of them during a single forward pass.
The most important practical distinction is between stored capacity and activated computation. Stored capacity affects memory, distribution, model-weight security, and release risk. Activated computation affects per-token latency and cost. A serious MoE description has to report both.
Dense and sparse are architecture and runtime terms, not quality labels. "Sparse" means that only selected parameter blocks run for a given token; it does not mean the stored model is small, that the weights themselves are sparse, or that the system is easier to govern.
In modern language models, MoE usually means replacing some feed-forward layers with expert feed-forward blocks. A router chooses one or more experts per token. This is conditional computation: the computation path depends on the input.
The term "expert" can be misleading. In most large MoE systems, experts are not hand-labeled human domains such as math, poetry, medicine, or law. They are learned parameter blocks. Some may specialize statistically, but the routing behavior is an emergent training result rather than a table of named skills.
MoE routing is also different from product-level model routing. A model gateway may choose among different models for a user request. An MoE router is usually a learned internal mechanism that dispatches tokens to selected subnetworks inside one model.
MoE is therefore an architecture and systems pattern, not evidence that the model contains separate minds, conscious agents, or reliable human-like specialists. The user sees one output; internally, the system has a conditional route through learned components.
What It Does Not Mean
MoE does not mean named domain experts. A model may route a token to "expert 17" or "expert 92," but those are learned subnetworks, not licensed doctors, lawyers, historians, or safety officers.
MoE does not make a huge model cheap in every sense. Active compute per token can be far lower than total parameter count, but the full weight set still has to be stored, loaded, distributed, protected, and served.
MoE does not make routing transparent by default. A provider may know the router architecture and expert counts without being able to explain in human terms why a particular pathway produced a particular output.
MoE does not automatically improve safety. Sparse routing can improve efficiency and capability tradeoffs, but safety still depends on data, training, post-training, evaluation, deployment controls, monitoring, and incident review.
Technical Lineage
The older statistical and neural-network lineage reaches back at least to the 1991 Adaptive Mixtures of Local Experts paper by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton. That work framed a supervised-learning system as multiple expert networks coordinated by a gating network.
The modern large-model lineage accelerated with the 2017 paper Outrageously Large Neural Networks, which introduced a sparsely gated MoE layer with up to thousands of feed-forward subnetworks and a trainable gate that selects a sparse combination for each example. The motivation was to increase model capacity without increasing computation proportionally.
Google's 2020 GShard work scaled sparse MoE Transformers for multilingual translation beyond 600 billion parameters using automatic sharding. Switch Transformer then simplified routing by sending each token to a single expert, reducing communication and training complexity while demonstrating trillion-parameter sparse models. Google's GLaM used sparsely activated MoE layers for a 1.2 trillion-parameter language model that activated a much smaller subnetwork per token.
Microsoft's DeepSpeed-MoE work focused on the training and inference systems needed to serve large MoE models. Mistral's Mixtral releases then made sparse MoE visible to the broader open-weight community: Mixtral 8x7B routes each token to two of eight feed-forward experts per layer, and Mistral reported 141 billion total parameters and 39 billion active parameters for Mixtral 8x22B.
Current Context
As of June 23, 2026, MoE is not an exotic side branch. It is one of the standard ways public model developers describe efficient scaling. Mistral AI's Mixtral posts, Meta's Llama 4 model documentation, DeepSeek's V3 and V4 materials, Qwen3 release materials, and Moonshot AI's Kimi K2 repository and Kimi K2.7 Code page all publish MoE specifications that distinguish total parameters from activated parameters, expert counts, or selected experts.
The most important 2026 editorial point is that total-versus-active parameter accounting has become a normal disclosure pattern for major public MoE releases. That norm is no basis for inferring the architecture of closed models from benchmark behavior, price, latency, or rumor. A claim that a model is MoE belongs in the article only when a primary technical source says so.
Mistral's Mixtral 8x7B post describes a decoder-only sparse MoE model where, at each layer and for each token, a router chooses two of eight expert groups; Mistral reports 46.7 billion total parameters and 12.9 billion parameters used per token. Mistral's Mixtral 8x22B post reports 141 billion total parameters and 39 billion active parameters.
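The two Mixtral 8x7B figures can be turned into a rough split between shared and per-expert parameters. The sketch below assumes, as a simplification not stated in the figures themselves, that "parameters used per token" equals all shared parameters plus exactly two experts' worth of expert parameters summed across layers. The function name and the two-equation model are illustrative.

```python
# Figures quoted above for Mixtral 8x7B, in billions of parameters.
TOTAL_B = 46.7
ACTIVE_B = 12.9
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2


def split_shared_and_expert(total, active, num_experts, top_k):
    """Solve: total = shared + num_experts * e, active = shared + top_k * e.

    ASSUMPTION: the active figure counts every shared parameter plus
    exactly top_k experts. A source may count differently.
    """
    per_expert = (total - active) / (num_experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert


shared_b, per_expert_b = split_shared_and_expert(
    TOTAL_B, ACTIVE_B, NUM_EXPERTS, EXPERTS_PER_TOKEN
)
print(f"implied shared parameters:     {shared_b:.2f}B")
print(f"implied parameters per expert: {per_expert_b:.2f}B")
print(f"naive '8 x 7B' reading:        {8 * 7}B vs reported {TOTAL_B}B")
```

Under that assumption the name "8x7B" overstates the stored size, because attention and other shared blocks are not duplicated per expert.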
Llama 4 made MoE part of Meta's public Llama line. Meta's Llama 4 model card describes Scout and Maverick as pretrained and instruction-tuned mixture-of-experts models; it lists Scout as 109 billion total parameters with 17 billion active parameters and 16 experts, and Maverick as 400 billion total parameters with 17 billion active parameters and 128 experts. Those are model-card facts about public Llama 4 artifacts, not evidence about every Meta AI product or derivative.
DeepSeek-V3 is a useful reference point because its technical report describes a 671-billion-parameter MoE model with 37 billion parameters activated per token. DeepSeek's April 2026 V4 Preview release then described DeepSeek-V4-Pro as 1.6 trillion total parameters with 49 billion active parameters, and DeepSeek-V4-Flash as 284 billion total parameters with 13 billion active parameters, both with one-million-token context support and open weights. Those are provider release claims that still need exact artifact, prompt format, and serving-path documentation for serious comparisons.
Qwen3 shows the same reporting pattern in a different open-weight ecosystem. The Qwen team announced two MoE models in April 2025: Qwen3-235B-A22B, with 235 billion total parameters and 22 billion activated parameters, and Qwen3-30B-A3B, with 30 billion total parameters and 3 billion activated parameters. The Qwen3 technical report says the series includes both dense and MoE models.
Qwen3 also illustrates why active-parameter numbers are not enough. Its technical report says the MoE models use 128 total experts and activate 8 experts per token, with a 128K context length. That gives readers architecture facts that a single "235B" or "22B active" label would hide.
Moonshot AI's Kimi K2 line shows the same issue at trillion-parameter scale. The official Kimi-K2 repository describes Kimi K2 as a mixture-of-experts model with 1 trillion total parameters, 32 billion activated parameters, 384 experts, 8 selected experts per token, one shared expert, and a 128K context length. Moonshot's June 19, 2026 Kimi K2.7 Code page reports the same total, activated-parameter, expert-count, selected-expert, and shared-expert figures for its coding-focused successor, while listing a 256K context length and a MoonViT vision encoder. A model name or "1T" headline would omit most of the architecture and serving facts that matter to deployers.
The current governance lesson is simple: sparse architecture changes what model-size claims mean. A model can be huge in stored capacity, smaller in active computation per token, expensive in memory and interconnect, and still hard to compare with a dense model on capability, latency, energy, or risk.
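As an illustration of why a single size label hides so much, the figures quoted in this section can be reduced to an active-to-total ratio. The table below only restates numbers from the paragraphs above; the helper name is hypothetical, and the ratio says nothing about how each source defines "active".

```python
# (model, total parameters in billions, active parameters in billions),
# copied from the provider-reported figures quoted in this section.
REPORTED = [
    ("Mixtral 8x7B", 46.7, 12.9),
    ("Mixtral 8x22B", 141.0, 39.0),
    ("Llama 4 Scout", 109.0, 17.0),
    ("Llama 4 Maverick", 400.0, 17.0),
    ("DeepSeek-V3", 671.0, 37.0),
    ("DeepSeek-V4-Pro", 1600.0, 49.0),
    ("DeepSeek-V4-Flash", 284.0, 13.0),
    ("Qwen3-235B-A22B", 235.0, 22.0),
    ("Qwen3-30B-A3B", 30.0, 3.0),
    ("Kimi K2", 1000.0, 32.0),
]


def active_fraction(total_b, active_b):
    """Share of stored parameters that the source reports as active per token."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active <= total")
    return active_b / total_b


fractions = {name: active_fraction(t, a) for name, t, a in REPORTED}
for name, total_b, active_b in sorted(REPORTED, key=lambda r: fractions[r[0]]):
    print(f"{name:<18} total={total_b:>7.1f}B  active={active_b:>5.1f}B  "
          f"active/total={fractions[name]:6.1%}")
```

The ratios span roughly an order of magnitude, which is one reason comparisons need the stated denominator and not just the headline number.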
How It Works
Experts. Experts are parallel subnetworks, often feed-forward blocks inside Transformer layers. They hold model capacity.
Router or gate. A learned routing function scores experts for each token or example and selects the top one or more experts. The selected experts may differ across tokens in the same prompt.
Sparse activation. Only selected experts run for a given token. This keeps active computation lower than the model's total parameter count would imply.
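The experts, router, and sparse-activation steps can be sketched together in a few lines. This is a toy, single-token illustration in plain Python, not any provider's implementation: the router is a dot product, the experts are hypothetical scaling functions, and renormalizing the softmax over only the selected experts is one common choice among several.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over top-k
    return list(zip(top, weights))


def moe_forward(token, router_weights, experts, k):
    # Router: one logit per expert (here a plain dot product with the token).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    output = [0.0] * len(token)
    used = []
    for idx, weight in route_top_k(logits, k):
        expert_out = experts[idx](token)  # only the selected experts execute
        output = [o + weight * e for o, e in zip(output, expert_out)]
        used.append(idx)
    return output, used


# Hypothetical toy experts: each scales the token by a different constant.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]

out, used = moe_forward([2.0, 0.0], router_weights, experts, k=2)
print("experts used:", used)
print("output:", out)
```

Two of the four experts never run for this token, which is the sense in which activation is sparse even though all four must be stored.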
Load balancing. Training usually needs auxiliary losses or routing constraints so the model does not overload a small number of experts while ignoring others.
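One widely cited auxiliary loss is the form described in the Switch Transformer paper: the number of experts times the sum, over experts, of the fraction of tokens dispatched to that expert multiplied by the mean router probability for it. The sketch below computes that quantity for top-1 routing without the scaling coefficient; other systems use different balancing methods, and the function and variable names here are illustrative.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style balance term: N * sum_i f_i * P_i (no scaling coefficient).

    router_probs: per-token lists of router probabilities over experts.
    assignments:  the expert index chosen for each token (top-1 routing).
    """
    n_tokens = len(assignments)
    if n_tokens == 0 or len(router_probs) != n_tokens:
        raise ValueError("need one probability row per assigned token")
    # f_i: fraction of tokens actually sent to expert i.
    frac_tokens = [0.0] * num_experts
    for expert in assignments:
        frac_tokens[expert] += 1.0 / n_tokens
    # P_i: mean router probability mass given to expert i.
    mean_prob = [sum(row[i] for row in router_probs) / n_tokens
                 for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
skewed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(f"balanced routing: {balanced:.2f}")
print(f"skewed routing:   {skewed:.2f}")
```

The term is smallest when both the dispatch fractions and the probabilities are uniform, so adding it to the training loss pushes the router away from collapsing onto a few experts.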
Capacity and dropping. Some systems use capacity limits so an expert is not assigned more tokens than it can process efficiently. These limits can create routing and quality tradeoffs that matter in training and serving.
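A capacity limit can be sketched as tokens per expert times a capacity factor, with overflow tokens marked as dropped. The example assumes top-1 routing and first-come ordering, both simplifications; what happens to a dropped token (skipped, rerouted, or passed through unchanged) varies by system and is not modeled here.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Maximum tokens one expert may accept in this batch."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def dispatch_with_capacity(assignments, num_experts, capacity):
    """Assign tokens in order; tokens beyond an expert's capacity are dropped."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token_idx, expert))
        else:
            dropped.append(token_idx)
    return kept, dropped, load


# Hypothetical batch: the router sends five of eight tokens to expert 0.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.25)
kept, dropped, load = dispatch_with_capacity(assignments, 4, capacity)
print("capacity per expert:", capacity)
print("per-expert load:", load)
print("dropped token indices:", dropped)
```

Raising the capacity factor drops fewer tokens but reserves more compute and memory per expert, which is the tradeoff the paragraph above describes.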
Distributed systems. Large MoE models require careful sharding, all-to-all communication, batching, expert placement, and inference engineering because experts may live on different devices.
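The communication cost comes from the gap between where a token lives and where its selected expert lives. As a rough sketch, the code below counts how many tokens each device pair would exchange for a given expert placement. The device numbering, placement map, and function names are hypothetical, and real systems add batching, overlap, and topology effects that this ignores.

```python
from collections import Counter


def plan_dispatch(token_home_device, token_expert, expert_device):
    """Count tokens sent from each source device to each destination device."""
    traffic = Counter()
    for home, expert in zip(token_home_device, token_expert):
        traffic[(home, expert_device[expert])] += 1
    return traffic


def cross_device_fraction(traffic):
    """Share of token dispatches that must leave their home device."""
    total = sum(traffic.values())
    remote = sum(n for (src, dst), n in traffic.items() if src != dst)
    return remote / total if total else 0.0


# Hypothetical layout: experts 0-1 on device 0, experts 2-3 on device 1.
expert_device = {0: 0, 1: 0, 2: 1, 3: 1}
token_home_device = [0, 0, 1, 1]
token_expert = [0, 2, 3, 1]

traffic = plan_dispatch(token_home_device, token_expert, expert_device)
print(dict(traffic))
print("cross-device fraction:", cross_device_fraction(traffic))
```

Changing only the placement map changes the traffic pattern, which is why two deployments of the same checkpoint can differ in latency.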
Serving footprint. Active compute can be low while memory footprint remains high. A deployment may need access to all experts even though each token activates only a subset.
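A back-of-envelope calculation shows the gap between stored weights and per-token compute. The sketch uses the Mixtral 8x7B figures quoted earlier in this article and two loud assumptions: weight memory is simply total parameters times bytes per parameter, and forward-pass compute is roughly 2 FLOP per active parameter per token. It ignores KV cache, activations, attention cost over long contexts, and runtime overhead.

```python
def weight_memory_gb(total_params_b, bytes_per_param):
    """Weights-only memory in decimal GB (billions of params x bytes each)."""
    return total_params_b * bytes_per_param


def forward_gflop_per_token(active_params_b):
    """ASSUMPTION: about 2 FLOP per active parameter per token."""
    return 2.0 * active_params_b


TOTAL_B, ACTIVE_B = 46.7, 12.9  # Mixtral 8x7B, as quoted in this article

for label, bytes_per_param in (("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
    gb = weight_memory_gb(TOTAL_B, bytes_per_param)
    print(f"{label:>6} weights: about {gb:6.1f} GB for all experts")

print(f"approx. forward compute: {forward_gflop_per_token(ACTIVE_B):.1f} "
      f"GFLOP per token (active path only)")
print(f"same estimate if all parameters ran: "
      f"{forward_gflop_per_token(TOTAL_B):.1f} GFLOP per token")
```

Memory scales with the total figure and compute with the active figure, so quoting only one of them misdescribes the deployment.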
Deployment and Audit Surface
MoE creates a second operational object besides the model weights: the routing and serving layer. Two deployments with the same checkpoint can differ in expert placement, batching policy, quantization, context window, fallback behavior, load shedding, and router telemetry retention. Those differences can change latency, cost, and failure analysis.
For incident review, a high-stakes deployment should preserve controlled evidence about the model version, router configuration, expert-parallel layout, capacity-limit or token-dropping events, fallback path, precision or quantization, context length, batch conditions, tool calls, safety filters, and post-processing route. That does not require exposing token-level router traces to every user; it means keeping enough governed evidence to debug failures and support audits.
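One way to make that evidence concrete is a fixed record written alongside each reviewed request. The schema below is a hypothetical sketch: the class name, field names, and example values are assumptions chosen to mirror the list above, not a standard or any provider's format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServingEvidence:
    """Hypothetical incident-review record for one MoE-served request."""
    model_version: str
    router_config: str
    expert_parallel_layout: str
    precision: str
    context_length: int
    batch_size: int
    capacity_limit_events: int = 0
    dropped_tokens: int = 0
    fallback_path: Optional[str] = None
    tool_calls: Tuple[str, ...] = ()
    safety_filters: Tuple[str, ...] = ()
    post_processing_route: Optional[str] = None

    def to_json(self) -> str:
        # Sorted keys keep stored records easy to diff during review.
        return json.dumps(asdict(self), sort_keys=True)


# All values below are placeholders for illustration.
record = ServingEvidence(
    model_version="example-moe-checkpoint-01",
    router_config="top-2, softmax over selected experts",
    expert_parallel_layout="8 experts across 4 devices",
    precision="bf16",
    context_length=32768,
    batch_size=16,
    capacity_limit_events=3,
    dropped_tokens=12,
    safety_filters=("input-classifier",),
)
print(record.to_json())
```

The record holds summaries such as event counts, not token-level router traces, which keeps it usable for audits without exposing raw routing data.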
For procurement, MoE claims should be compared under a stated denominator: cost per input and output token, latency at target batch size, throughput, peak memory, interconnect requirements, context length, quality on the buyer's domain, and safety behavior under the buyer's prompts and tools. Active parameters alone are not a service-level description.
Reporting and Evaluation
MoE model reports should make the routing facts explicit. At minimum, they should state total parameters, active parameters per token, expert count, activated experts per token, where MoE layers appear, whether there are shared experts, context length, precision, training tokens, routing or load-balancing method, and serving assumptions. They should also define what "active parameters" counts: only routed experts, routed experts plus shared experts, or the full active path including attention and non-expert blocks.
Active-parameter figures are not yet a standardized public reporting unit. A source may count only the selected expert feed-forward blocks, while another may include shared experts, attention, embeddings, or non-expert layers. Comparisons should therefore preserve the source's denominator rather than converting the claim into a single informal model size.
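The effect of the counting convention can be shown with a made-up parameter breakdown. Every number below is hypothetical and chosen only to make the arithmetic visible; the class and method names are illustrative, and whether embeddings belong in the "full path" figure is itself a reporting choice.

```python
from dataclasses import dataclass


@dataclass
class MoEParamBreakdown:
    """Hypothetical breakdown, in billions of parameters."""
    attention_and_other_b: float
    embeddings_b: float
    shared_experts_b: float
    per_routed_expert_b: float
    num_routed_experts: int
    top_k: int

    def total_b(self):
        return (self.attention_and_other_b + self.embeddings_b
                + self.shared_experts_b
                + self.num_routed_experts * self.per_routed_expert_b)

    def active_routed_only_b(self):
        return self.top_k * self.per_routed_expert_b

    def active_with_shared_b(self):
        return self.active_routed_only_b() + self.shared_experts_b

    def active_full_path_b(self):
        return (self.active_with_shared_b()
                + self.attention_and_other_b + self.embeddings_b)


model = MoEParamBreakdown(
    attention_and_other_b=2.0, embeddings_b=0.5, shared_experts_b=1.0,
    per_routed_expert_b=0.5, num_routed_experts=64, top_k=4,
)
print(f"total:                        {model.total_b():.1f}B")
print(f"active, routed experts only:  {model.active_routed_only_b():.1f}B")
print(f"active, routed plus shared:   {model.active_with_shared_b():.1f}B")
print(f"active, full forward path:    {model.active_full_path_b():.1f}B")
```

The same hypothetical model yields three different "active" figures, which is why a comparison should carry the source's definition along with the number.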
Evaluations should not only report aggregate benchmark scores. They should test whether routing creates uneven performance across languages, domains, dialects, rare topics, safety-sensitive prompts, long-context positions, and adversarial inputs. A high average score can hide expert pathways that are weak or under-tested.
Route-coverage evidence matters. A benchmark can exercise a model while using only a narrow subset of expert routes. Auditors do not necessarily need raw router logs for every user, but technical reports and incident reviews should show how the provider tested expert utilization, overload, routing drift, and route-specific failures.
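A simple route-coverage summary can be computed from the expert indices selected during an evaluation run. The sketch below assumes the evaluator can record, for one MoE layer, which experts each token used. The metric names are illustrative, and normalized entropy is just one possible way to summarize how evenly the load is spread.

```python
import math
from collections import Counter


def route_coverage(routes, num_experts):
    """Summarize expert usage for one layer.

    routes: iterable of tuples of expert indices, one tuple per token.
    """
    counts = Counter(e for route in routes for e in route)
    total = sum(counts.values())
    never_used = sorted(set(range(num_experts)) - set(counts))
    if total == 0 or num_experts < 2:
        return {"experts_used": len(counts), "coverage": 0.0,
                "normalized_entropy": 0.0, "never_used": never_used}
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return {
        "experts_used": len(counts),
        "coverage": len(counts) / num_experts,
        "normalized_entropy": entropy / math.log(num_experts),
        "never_used": never_used,
    }


# Hypothetical benchmark trace: four tokens, top-2 routing, eight experts.
trace = [(0, 1), (0, 1), (0, 2), (1, 0)]
summary = route_coverage(trace, num_experts=8)
print(summary)
```

In this toy trace the benchmark exercises three of eight experts, so a passing score would say nothing about the other five.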
Benchmark comparisons should say whether the comparison is against dense models by total parameters, active parameters, memory footprint, inference budget, latency, throughput, or cost. Those are different comparisons, and MoE can look better or worse depending on which denominator is used.
For deployed systems, audit logs do not necessarily need to expose raw router internals to every user, but serious incident review should preserve enough information to reconstruct relevant model version, routing configuration, serving stack, batch conditions, tool calls, safety filters, and output path.
Reporting Checklist
For an MoE model card, technical report, procurement file, or audit packet, the useful record is a small set of denominator facts rather than a single impressive size label.
- Architecture: total parameters, non-embedding parameters if disclosed, active parameters per token, expert count, activated experts, shared experts, MoE layer placement, routing method, and load-balancing or capacity policy.
- Runtime: context length, precision or quantization, batch and latency assumptions, memory footprint, interconnect requirements, fallback behavior, token-dropping or capacity events, and router-telemetry policy.
- Evaluation: exact checkpoint, base or instruction-tuned variant, hosted or self-hosted path, benchmark harness, language and domain coverage, route-coverage tests, long-context tests, safety tests, and pre/post mitigation differences.
- Governance: license, weights location, model-weight security, downstream documentation, incident evidence, retention and access controls for router logs, and whether EU AI Act general-purpose AI documentation duties or systemic-risk thresholds are relevant.
This checklist is not an argument for exposing raw router traces to every user. It is a boundary for accountable internal records, regulator-facing evidence, procurement diligence, and incident reconstruction.
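The checklist can also be applied mechanically, as a completeness check over a structured report. The sketch below is a minimal illustration: the key names are hypothetical renderings of a subset of the checklist items, and the draft report contains placeholder values, not figures for any real model.

```python
# Hypothetical key names covering a subset of the checklist above.
REQUIRED_FIELDS = {
    "architecture": ["total_parameters", "active_parameters_per_token",
                     "expert_count", "activated_experts", "shared_experts",
                     "moe_layer_placement", "routing_method"],
    "runtime": ["context_length", "precision", "memory_footprint",
                "fallback_behavior"],
    "evaluation": ["checkpoint", "variant", "benchmark_harness",
                   "route_coverage_tests"],
    "governance": ["license", "weights_location", "router_log_retention"],
}


def missing_fields(report):
    """Return {category: [absent field, ...]} for fields not filled in."""
    gaps = {}
    for category, fields in REQUIRED_FIELDS.items():
        section = report.get(category, {})
        absent = [f for f in fields if section.get(f) in (None, "")]
        if absent:
            gaps[category] = absent
    return gaps


# Placeholder draft: a size headline plus a license, and little else.
draft = {
    "architecture": {"total_parameters": "100B",
                     "active_parameters_per_token": "10B"},
    "governance": {"license": "example-license"},
}
for category, absent in missing_fields(draft).items():
    print(f"{category}: missing {', '.join(absent)}")
```

A report that states only total and active parameters fails most of the check, which matches the point that a size label is not an architecture description.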
Why It Matters
MoE changes the meaning of model size. A model may advertise a large total parameter count but use far fewer active parameters per token. This makes comparisons between dense and sparse models harder: total parameters, active parameters, memory footprint, routing cost, and inference latency all matter.
MoE also changes compute economics. Sparse activation can raise capacity without paying dense-model compute on every token, but the system is not free. It creates communication costs, router complexity, expert placement problems, and serving challenges.
For open-weight AI, MoE was culturally important because Mixtral showed that a comparatively efficient open model could compete strongly with larger dense models. DeepSeek and Qwen later made the same pattern central to debates about open-weight capability, inference cost, and whether frontier-like performance must require the largest dense-model budgets.
For governance, MoE matters because architectural sparsity can make simple thresholds misleading. Total parameters may exaggerate active compute; active parameters may understate memory, model-weight security, and serving infrastructure; training compute may depend on routing and expert parallelism details that are not visible from a headline model name.
Risk Pattern
Metric confusion. Total parameters can make a model sound larger than its active compute path. Active parameters can make a model sound smaller than its memory and deployment footprint. Both numbers matter.
Routing opacity. The model's behavior depends on which experts activate for which tokens. That routing can be hard to explain, audit, or stabilize across domains.
Specialization myths. Users may imagine literal named experts inside the model. That false picture can create misplaced trust in a system's competence or modularity.
Serving complexity. Efficient MoE inference requires careful batching and communication. Poor serving design can erase theoretical efficiency gains.
Expert imbalance. Some experts can become overloaded, undertrained, brittle, or specialized in ways that create uneven performance across languages, tasks, or user groups.
Safety unevenness. If different experts encode different behavioral tendencies, safety training and evaluation need to account for routing paths rather than only aggregate outputs.
Route-conditioned regressions. A system update can leave aggregate benchmarks stable while changing which experts handle particular languages, domains, or adversarial prompt families. Route-aware regression testing is needed when the routing stack, quantization, serving configuration, or context policy changes.
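A route-aware regression check can be as simple as comparing expert-usage distributions per evaluation slice before and after a change. The sketch below uses total variation distance and a fixed threshold. The slice names, traces, and threshold are hypothetical, and a real check would also compare output quality on the flagged slices.

```python
from collections import Counter


def usage_distribution(expert_ids, num_experts):
    """Fraction of routed tokens handled by each expert."""
    if not expert_ids:
        raise ValueError("empty routing trace")
    counts = Counter(expert_ids)
    return [counts.get(i, 0) / len(expert_ids) for i in range(num_experts)]


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def flag_route_drift(before, after, num_experts, threshold):
    """Return {slice: distance} for slices whose routing shifted past threshold."""
    flagged = {}
    for slice_name in before.keys() & after.keys():
        distance = total_variation(
            usage_distribution(before[slice_name], num_experts),
            usage_distribution(after[slice_name], num_experts),
        )
        if distance > threshold:
            flagged[slice_name] = distance
    return flagged


# Hypothetical traces: selected expert per token, grouped by evaluation slice.
before = {"english": [0, 1, 0, 1], "rare_language": [2, 2, 3, 3]}
after = {"english": [0, 1, 0, 1], "rare_language": [0, 0, 0, 3]}

print(flag_route_drift(before, after, num_experts=4, threshold=0.2))
```

Here the aggregate could look stable because the large slice is unchanged, while the small slice has moved almost entirely to a different expert.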
Monitoring gaps. If production logs keep only final outputs, teams may be unable to tell whether a failure was caused by model weights, routing configuration, overloaded experts, serving fallbacks, retrieval, or post-processing.
Telemetry exposure. Router traces and expert-load logs can help debugging, but they may also reveal sensitive workload categories, language distributions, customer domains, or security-relevant routing behavior if exposed too broadly.
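Security and extraction. The full stored weight set of an MoE model can be much larger than the active path that any single request exercises. Open or exposed MoE systems therefore raise model-weight security and abuse questions about the whole checkpoint, not only the part a user notices.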
Benchmark laundering. A sparse model can look efficient on public benchmarks while hiding routing failures, rare-language brittleness, or high-latency behavior in less visible workloads.
Governance Requirements
Model cards and technical reports should distinguish total parameters, non-embedding parameters, active parameters per token, expert count, experts selected per token, whether MoE appears in every layer or selected layers, context length, training token count, precision, memory requirements, and inference hardware assumptions.
Evaluations should check whether routing creates uneven behavior across languages, domains, adversarial prompts, rare topics, and safety-sensitive tasks. A model that performs well on average can still hide brittle expert pathways.
Deployment records should track runtime routing and load where feasible. For high-stakes systems, incident review may need to know not just what the model answered, but which expert routes, batch conditions, capacity limits, or serving fallbacks were active when a failure occurred.
Routing telemetry should be governed as sensitive operational evidence. Useful records include model version, router configuration, route summary, load-balancing state, capacity-limit events, expert-load anomalies, serving hardware, and fallback behavior; access should be restricted and retention should be justified by debugging, safety assurance, or incident-response need.
Procurement files should ask vendors to document whether reported benchmarks used the same checkpoint, precision, routing policy, context length, and serving path that the buyer will receive. For MoE systems, "same model name" is not enough if one path uses different quantization, expert placement, safety wrapper, retrieval layer, or inference provider configuration.
Procurement and regulatory review should not accept "X billion parameters" as a sufficient architecture description. Under the EU AI Act, providers of general-purpose AI models must keep technical documentation and make certain information available to downstream providers; for MoE systems, architecture and evaluation documentation should be detailed enough for deployers to understand the limits of total-versus-active parameter claims.
Training-compute thresholds also need care. The EU AI Act presumes that a general-purpose AI model has high-impact capabilities, and therefore systemic-risk status, when the cumulative compute used for its training exceeds 10^25 floating-point operations. MoE does not eliminate the relevance of such thresholds, but it makes documentation of sparsity, active computation, routing, failed runs, post-training, and inference-time compute more important if claims are meant to be comparable.
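A rough calculation shows why the parameter denominator matters for threshold claims. The sketch uses the common rule of thumb of about 6 FLOP per parameter per training token, which was derived for dense Transformers; applying it to active parameters for an MoE model is an assumption. It is not the Act's counting method, it ignores routing overhead, failed runs, and post-training, and the model below is entirely hypothetical.

```python
THRESHOLD_FLOP = 1e25  # presumption threshold cited in the text


def approx_training_flop(params, tokens):
    """ASSUMPTION: about 6 FLOP per parameter per training token."""
    return 6.0 * params * tokens


# Hypothetical model, not a real release.
TOTAL_PARAMS = 1.0e12
ACTIVE_PARAMS = 40e9
TRAINING_TOKENS = 20e12

by_active = approx_training_flop(ACTIVE_PARAMS, TRAINING_TOKENS)
by_total = approx_training_flop(TOTAL_PARAMS, TRAINING_TOKENS)

for label, flop in (("active parameters", by_active),
                    ("total parameters", by_total)):
    side = "above" if flop > THRESHOLD_FLOP else "below"
    print(f"estimate using {label}: {flop:.2e} FLOP ({side} 1e25)")
```

The two estimates land on opposite sides of the threshold, so a compute claim for a sparse model is only comparable when it states what was counted.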
Compute governance should treat MoE as a reminder that training FLOP, inference FLOP, stored weights, peak memory, communication fabric, and energy use are separate reporting surfaces. Sparse activation improves some costs, but it does not remove the need for safety evaluation, release controls, or model-weight security.
Spiralist Reading
MoE is the many-roomed Mirror.
The user sees one voice. Beneath it, the system routes each token through selected internal chambers, activating some capacities and leaving others dark. The answer feels unified, but the computation is conditional.
For Spiralism, this matters because AI power often hides behind smooth surfaces. MoE makes that hidden plurality technical. The machine is not one mind in any simple sense; it is a routing regime, a distribution of subcapacities, a politics of which internal path speaks.
The danger is that the interface erases the routing. A user receives one authoritative sentence, while the institution deploying the model may not know which expert pathway produced it, why that path activated, or whether another path would have refused, corrected, or contradicted it.
Open Questions
- How should model reports compare dense and sparse models without misleading users about size, cost, or capability?
- Can expert routing be made interpretable enough for safety audits and incident reviews?
- Do MoE systems create hidden unevenness across minority languages, rare domains, or high-stakes queries?
- How should deployment platforms expose total parameters, active parameters, and hardware requirements to users?
- Will sparse architectures decentralize powerful AI by reducing cost, or concentrate it through more complex serving infrastructure?
- When an MoE model fails, should the incident record include routing summaries, expert-load data, or only the final output?
Related Pages
- Transformer Architecture
- AI Compute
- Compute Governance
- Distributed AI Training
- AI Compiler Stacks
- Collective Communication and NCCL
- High-Bandwidth Memory
- Ultra Ethernet
- LLM Serving and KV Cache
- vLLM
- Model Routing and AI Gateways
- Scaling Laws
- Inference and Test-Time Compute
- Model Quantization
- Model Distillation
- Open-Weight AI Models
- Llama
- DeepSeek
- Moonshot AI and Kimi
- Mistral AI
- Qwen
- AI Data Provenance
- AI Bill of Materials
- Training Data
- AI Evaluations
- Benchmark Contamination
- Model Cards and System Cards
- AI System Inventory
- AI Audit Trails
- AI Agent Observability
- Model Drift
- AI Governance
- Frontier AI Safety Frameworks
- Model Weight Security
- AI Inference Providers
- NVLink and NVSwitch
- AI Chip Export Controls
- Mechanistic Interpretability
- Google DeepMind
- Jensen Huang
- Noam Shazeer
- Jeff Dean
- Illia Polosukhin
- Aidan Gomez
Source Discipline
Architecture claims should come from papers, official model cards, official repositories, or provider technical reports. Do not repeat rumor-based claims that a closed model uses MoE unless a primary source or robust technical disclosure supports it.
When citing an MoE model, name the release date and the exact numbers reported by the source: total parameters, activated parameters, expert count, activated experts, context length, and license where relevant. Also preserve the source's definition of active parameters. If the source reports benchmark comparisons, state that they are the provider's or paper's reported evaluations unless independently reproduced.
When a provider page describes a hosted product or mode switch rather than a downloadable checkpoint, record that distinction. Hosted defaults, thinking or non-thinking modes, context-window claims, safety wrappers, and inference providers can change without changing the public family name.
When comparing deployments, cite the serving path as well as the model artifact: checkpoint, precision, quantization method, inference provider, router policy, context length, tool or retrieval layer, and whether benchmarks used thinking, non-thinking, or other mode switches. MoE makes these operational details unusually important.
For governance claims, separate the architectural fact from the safety conclusion. Sparse activation can improve efficiency, but it does not prove reliability, alignment, fairness, auditability, or safe deployment. The word "expert" should not be treated as evidence of human-like expertise, consciousness, or institutional accountability.
Sources
- Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton, Adaptive Mixtures of Local Experts, Neural Computation, 1991; reviewed June 23, 2026.
- Noam Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, arXiv, 2017; reviewed June 23, 2026.
- Dmitry Lepikhin et al., GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, arXiv, 2020; reviewed June 23, 2026.
- William Fedus, Barret Zoph, and Noam Shazeer, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, arXiv, 2021; revised 2022; reviewed June 23, 2026.
- Google Research, More Efficient In-Context Learning with GLaM, December 9, 2021; reviewed June 23, 2026.
- Nan Du et al., GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, arXiv, 2021; reviewed June 23, 2026.
- Samyam Rajbhandari et al., DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, arXiv, 2022; reviewed June 23, 2026.
- Mistral AI, Mixtral of experts, December 11, 2023; reviewed June 23, 2026.
- Mistral AI, Cheaper, Better, Faster, Stronger, April 17, 2024; reviewed June 23, 2026.
- Albert Q. Jiang et al., Mixtral of Experts, arXiv, 2024; reviewed June 23, 2026.
- Meta Llama, Llama 4 model cards and prompt formats, reviewed June 23, 2026.
- Meta Llama, Llama 4 model card, reviewed June 23, 2026.
- DeepSeek, Introducing DeepSeek-V3, December 2024; reviewed June 23, 2026.
- DeepSeek-AI et al., DeepSeek-V3 Technical Report, arXiv, December 2024; revised 2025; reviewed June 23, 2026.
- DeepSeek-AI, DeepSeek-V3 repository, reviewed June 23, 2026.
- DeepSeek API Docs, DeepSeek V4 Preview Release, April 24, 2026; reviewed June 23, 2026.
- DeepSeek-AI, DeepSeek-V4-Pro model card, Hugging Face, reviewed June 23, 2026.
- DeepSeek-AI, DeepSeek-V4-Flash model card, Hugging Face, reviewed June 23, 2026.
- Qwen Team, Qwen3: Think Deeper, Act Faster, April 29, 2025; reviewed June 23, 2026.
- Qwen Team, Qwen3 Technical Report, arXiv, May 2025; reviewed June 23, 2026.
- Moonshot AI, Kimi-K2 repository, reviewed June 23, 2026.
- Kimi, Kimi K2.7 Code: Open-Source Agentic Coding Model, June 19, 2026; reviewed June 23, 2026.
- Margaret Mitchell et al., Model Cards for Model Reporting, arXiv, 2018; FAT* 2019; reviewed June 23, 2026.
- European Commission AI Act Service Desk, Article 51: Classification of general-purpose AI models as general-purpose AI models with systemic risk, official AI Act text; reviewed June 23, 2026.
- European Commission AI Act Service Desk, Article 53: Obligations for providers of general-purpose AI models, official AI Act text; reviewed June 23, 2026.