Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Mixture-of-Experts (MoE) is a neural network architectural paradigm that divides a model into multiple specialized sub-networks called experts, governed by a learned gating network (or router) that selectively activates only a small subset of experts for each input token. This enables sparse computation: a model can have an enormous total parameter count while using only a fraction of those parameters per inference step.1)
At its core, an MoE layer replaces a standard feed-forward network (FFN) sublayer with a set of N parallel FFN experts and a router. For each token, the router produces a probability distribution over all experts and selects the top-k (commonly k = 1 or 2) experts to process that token.
Each expert is typically a standard FFN sub-layer with its own independent weights. In a transformer-based MoE model, the MoE layer sits in place of (or alongside) the dense FFN in each transformer block. Experts specialize over training, learning to handle different domains, syntactic structures, or semantic concepts.
The router is a lightweight learned function — usually a single linear projection followed by a softmax — that maps an input token representation to a probability distribution over all experts. The top-k experts by probability are selected, and their outputs are weighted-summed using the router probabilities as weights:
output = sum_{i in top-k} G(x)_i * E_i(x)
where G(x) is the gating vector and E_i(x) is the output of expert i.
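The gating computation above can be sketched in a few lines of NumPy. This is an illustrative toy only (the function names, shapes, and random linear "experts" are invented for the example, not taken from any particular model):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, W_router, experts, k=2):
    """Sparse MoE forward pass for token vectors x: [T, d].

    W_router: [d, N] router projection; experts: list of N callables (d -> d).
    Returns [T, d]: for each token, the top-k expert outputs weighted by
    their renormalized router probabilities G(x)_i.
    """
    gate = softmax(x @ W_router)              # [T, N] router distribution G(x)
    topk = np.argsort(gate, axis=-1)[:, -k:]  # indices of the k largest gates
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        idx = topk[t]
        w = gate[t, idx]
        w = w / w.sum()                       # renormalize top-k weights to sum to 1
        for i, wi in zip(idx, w):
            out[t] += wi * experts[i](x[t])   # weighted sum of selected experts
    return out

# Tiny demo: 3 tokens, d=4, N=8 random linear "experts" (illustrative only).
rng = np.random.default_rng(0)
d, N = 4, 8
W_router = rng.normal(size=(d, N))
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(N)]
y = moe_forward(rng.normal(size=(3, d)), W_router, experts, k=2)
```

Real implementations batch the expert computation and dispatch tokens to devices rather than looping per token, but the routing logic is the same.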
Because only k out of N experts are active per token, the active parameter count at inference is far smaller than the total parameter count. For example, a model with 8 experts per layer and top-2 routing uses only 2/8 = 25% of its expert parameters per token. This enables a dramatic increase in model capacity without a proportional increase in compute.2)
Top-k routing is the standard approach: select the k experts with the highest router logits and renormalize their weights to sum to 1. Top-1 routing (Switch Transformer) maximizes sparsity but increases variance; top-2 routing (Mixtral, DeepSeek) balances sparsity with stability.
Without explicit regularization, routers tend to collapse: a few experts receive most tokens while others starve (expert collapse). Load balancing losses — auxiliary terms added to the training objective — encourage uniform expert utilization. Common techniques include:
* Auxiliary load-balancing loss: an added training term that penalizes uneven token-to-expert assignment
* Expert capacity: set a maximum number of tokens each expert processes per batch
* DeepSeek complementary balancing: sequence-level balance + diversity loss4)
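A minimal sketch of a Switch-Transformer-style auxiliary balancing loss, N * sum_i f_i * P_i, where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability for expert i (function name and demo values are illustrative):

```python
import numpy as np

def load_balance_loss(gate_probs):
    """Switch-style auxiliary load-balancing loss (sketch).

    gate_probs: [T, N] router softmax outputs for T tokens, N experts.
    The loss equals 1.0 under perfectly uniform routing and grows as
    routing concentrates on few experts, so minimizing it discourages collapse.
    """
    T, N = gate_probs.shape
    top1 = gate_probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=N) / T   # fraction of tokens per expert
    P = gate_probs.mean(axis=0)              # mean router probability per expert
    return N * float((f * P).sum())

uniform   = load_balance_loss(np.full((8, 4), 0.25))  # balanced routing
collapsed = load_balance_loss(np.eye(4)[[0] * 8])     # all tokens to expert 0
```

In practice this term is scaled by a small coefficient and added to the language-modeling loss.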
===== History =====
The MoE concept has deep roots, but modern large-scale MoE in LLMs is a relatively recent development.
==== 1991 — Jacobs et al. ====
The original MoE paper by Jacobs, Jordan, Nowlan, and Hinton introduced the idea of a gating network that combines multiple expert networks for supervised learning tasks.5) The experts competed via a softmax gating mechanism — a conceptual ancestor of modern token routing.
==== 2017 — Shazeer et al. (Sparse MoE Layer) ====
Noam Shazeer and colleagues at Google Brain published “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” introducing sparse top-k gating into RNN language models.6) They demonstrated models with up to 137 billion parameters — enormous for 2017 — with only marginal compute increases. This paper established the foundational mechanics used in virtually all subsequent LLM MoE work.
==== 2020 — GShard ====
Google's GShard paper scaled MoE transformers to 600 billion parameters across 2048 TPU cores for multilingual neural machine translation.7) GShard introduced key engineering innovations: expert parallelism as a first-class distributed training strategy, per-expert capacity buffers, and auxiliary load-balancing losses.
==== 2021 — Switch Transformer ====
Google's Switch Transformer simplified MoE routing to top-1 (a single expert per token) and demonstrated a 7x pretraining speedup over a comparable dense T5 model at equal compute budget.8) The paper showed that extreme sparsity (top-1) was stable with careful initialization and load balancing, and scaled models up to 1.6 trillion parameters.
==== 2023–2025 — Proliferation in Open and Closed Models ====
Following Mixtral 8x7B (Mistral AI, late 2023), MoE became the dominant architecture for frontier-scale models. DeepSeek-V3, Kimi K2, and likely GPT-4 all employ MoE, with total parameter counts reaching into the hundreds of billions or even ~1 trillion.9)
===== Modern MoE Models =====
The table below compares notable production MoE models:
^ Model ^ Organization ^ Total Params ^ Active Params ^ Experts ^ Top-k ^ Notes ^
| Mixtral 8x7B | Mistral AI | 47B | ~13B | 8 per layer | top-2 | First major open-weight MoE LLM |
| Mixtral 8x22B | Mistral AI | 141B | ~39B | 8 per layer | top-2 | Larger Mixtral variant |
| DeepSeek-V3 | DeepSeek | 671B | ~37B | 256 routed + 1 shared | top-8 | Trained for ~$5.6M10); auxiliary-loss-free balancing |
| Kimi K2 | Moonshot AI | ~1T | ~32B | 384 experts | top-8 | MuonClip optimizer; strong coding/agentic |
| Grok-1 | xAI | 314B | ~86B | 8 per layer | top-2 | Open-weights release March 2024 |
| GPT-4 | OpenAI | ~1.8T (rumored) | unknown | ~16 (rumored) | top-2 (rumored) | Architecture unconfirmed by OpenAI |
| Switch-C | Google | 1.57T | ~27B | 2048 total | top-1 | Switch Transformer research model |
===== Advantages =====
==== Compute Efficiency ====
MoE decouples model capacity from compute cost. A dense model's FLOP count scales directly with parameter count; an MoE model's active compute scales only with the number of activated experts per token. DeepSeek-V3 (671B total, ~37B active) requires approximately 250 GFLOPs per token, compared to roughly 2,448 GFLOPs for a dense 405B model like LLaMA 3.1 405B: a ~10x compute reduction for a model with ~60% more total parameters.11)
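The ~10x and ~60% figures follow directly from the numbers above; a quick sanity check (all values taken from the text, approximate):

```python
# Per-token compute, using the figures cited above.
moe_flops   = 250e9    # DeepSeek-V3: 671B total, ~37B active
dense_flops = 2448e9   # LLaMA 3.1 405B, dense

speedup = dense_flops / moe_flops   # compute reduction factor per token
extra_capacity = 671 / 405 - 1      # relative increase in total parameters
```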
==== Training Cost ====
Because each training step activates only a fraction of parameters, MoE models can be trained far more cheaply than comparably-capable dense models. DeepSeek-V3's full training run reportedly cost approximately $5.6 million in H800 GPU compute — a fraction of the estimated $60 million+ for comparable dense frontier models.12)
==== Scalability ====
Adding more experts increases model capacity with minimal compute overhead (cost grows only in router computation and communication). This makes MoE exceptionally amenable to scaling: capacity can be increased by adding experts independently of the per-token compute budget.
==== Expert Specialization ====
Analysis of trained MoE models reveals that experts genuinely specialize: different experts activate preferentially for different domains, syntactic structures, and semantic concepts. This specialization may contribute to MoE's strong performance across diverse tasks.13)
===== Disadvantages =====
==== Memory Requirements ====
Despite sparse compute, MoE models require loading all experts into memory (or at least onto the serving cluster) at inference time. A 671B-parameter model like DeepSeek-V3 requires ~1.3 TB of memory in BF16 — far more than the ~810 GB for LLaMA 3.1 405B. This creates significant infrastructure challenges for serving MoE models at scale or on-device.14)
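Both footprints follow from two bytes per BF16 weight; a quick check of the figures above (the helper name is illustrative):

```python
def bf16_terabytes(n_params):
    """Decimal TB needed to hold n_params weights at 2 bytes (BF16) each."""
    return n_params * 2 / 1e12

deepseek_v3 = bf16_terabytes(671e9)  # ~1.34 TB total weights
llama_405b  = bf16_terabytes(405e9)  # ~0.81 TB total weights
```

Note this counts weights only; KV caches, activations, and optimizer state add further memory on top.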
==== Training Instability and Expert Collapse ====
Without careful load balancing, MoE training suffers from expert collapse: the router converges to always selecting the same 1–2 experts, wasting the remaining experts' capacity. This manifests as training instability, degraded performance, and poor generalization. Mitigating expert collapse requires auxiliary losses, careful hyperparameter tuning, and sometimes architectural innovations (e.g., DeepSeek's auxiliary-loss-free balancing).15)
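One simple way to monitor for collapse is the entropy of the token-to-expert assignment counts: a balanced router over N experts approaches log2(N) bits, while a collapsed router approaches 0. A sketch (function name and demo counts are illustrative):

```python
import numpy as np

def routing_entropy_bits(tokens_per_expert):
    """Entropy of the empirical token-to-expert distribution, in bits.

    Close to log2(N) for balanced routing over N experts; close to 0
    when almost all tokens are sent to a single expert (collapse).
    """
    p = np.asarray(tokens_per_expert, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # drop unused experts (0 * log 0 = 0)
    return float(-(p * np.log2(p)).sum())

balanced  = routing_entropy_bits([100] * 8)                  # log2(8) = 3 bits
collapsed = routing_entropy_bits([793, 1, 1, 1, 1, 1, 1, 1]) # near-collapse
```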
==== Routing Overhead and Communication ====
In distributed training and inference, different experts reside on different devices. Each token must be routed to the correct device(s), processed, and the results gathered — incurring all-to-all communication overhead. This routing overhead can dominate latency on high-latency interconnects and limits MoE's advantage on smaller clusters or edge hardware.
==== Fine-Tuning Difficulty ====
Fine-tuning MoE models is substantially harder than fine-tuning dense models: sparse models are more prone to overfitting on small fine-tuning datasets, and routing patterns learned during pretraining can shift unstably as weights are updated.
Research into MoE-specific fine-tuning techniques (e.g., MoLoRA, sparse adapter methods) is ongoing.
===== Dense vs. MoE Comparison =====
^ Property ^ Dense Model ^ MoE Model ^
| Parameter utilization per token | 100% | 1–25% (top-k/N) |
| Total parameters | Moderate | Very large |
| Compute per token (FLOPs) | High | Low |
| Memory footprint | Moderate | Very high |
| Training stability | High | Moderate (requires load balancing) |
| Training cost at equal capability | High | Lower |
| Inference latency (single device) | Moderate | Higher (routing + communication) |
| Inference throughput (cluster) | Moderate | Higher (parallelism) |
| Fine-tuning ease | Easy | Difficult |
| Expert specialization | None | Yes |
| Serving complexity | Low | High |