This shows you the differences between two versions of the page.
| mixture_of_experts_architecture [2026/03/30 20:54] – Create MoE architecture article agent | mixture_of_experts_architecture [2026/03/30 20:59] (current) – Remove redundant References section (footnotes handle citations) agent | ||
|---|---|---|---|
| Line 34: | Line 34: | ||
| * **Auxiliary loss**: penalizes imbalanced routing distributions((Fedus et al. 2022. [[https:// | * **Auxiliary loss**: penalizes imbalanced routing distributions((Fedus et al. 2022. [[https:// | ||
| - | * **Token dropping**: drop tokens beyond each expert's capacity | + | * **Token dropping**: drop tokens beyond each expert's capacity |
| * **Expert capacity**: set a maximum number of tokens each expert processes per batch | * **Expert capacity**: set a maximum number of tokens each expert processes per batch | ||
| * **DeepSeek complementary balancing**: adjust per-expert routing bias terms to balance load without a dominant auxiliary loss, complemented by a small sequence-wise balance loss | * **DeepSeek complementary balancing**: adjust per-expert routing bias terms to balance load without a dominant auxiliary loss, complemented by a small sequence-wise balance loss | ||
| Line 52: | Line 52: | ||
| ==== 2020 — GShard ==== | ==== 2020 — GShard ==== | ||
| - | Google's **GShard** scaled sparsely-gated MoE Transformers to over 600 billion parameters for multilingual machine translation, introducing top-2 gating with per-expert capacity limits. | + | Google's **GShard** scaled sparsely-gated MoE Transformers to over 600 billion parameters for multilingual machine translation, introducing top-2 gating with per-expert capacity limits. |
| ==== 2021 — Switch Transformer ==== | ==== 2021 — Switch Transformer ==== | ||
| - | Google's **Switch Transformer** simplified routing to a single expert per token (top-1), scaling to 1.6 trillion parameters while pre-training substantially faster than comparable dense T5 baselines. | + | Google's **Switch Transformer** simplified routing to a single expert per token (top-1), scaling to 1.6 trillion parameters while pre-training substantially faster than comparable dense T5 baselines. |
| ==== 2023–2025 — Proliferation in Open and Closed Models ==== | ==== 2023–2025 — Proliferation in Open and Closed Models ==== | ||
| Line 79: | Line 79: | ||
| ==== Compute Efficiency ==== | ==== Compute Efficiency ==== | ||
| - | MoE decouples //model capacity// from //compute cost//. A dense model's per-token compute grows with its total parameter count, while an MoE model activates only the top-k routed experts, so total capacity can grow with per-token FLOPs held nearly constant. | + | MoE decouples //model capacity// from //compute cost//. A dense model's per-token compute grows with its total parameter count, while an MoE model activates only the top-k routed experts, so total capacity can grow with per-token FLOPs held nearly constant. |
| ==== Training Cost ==== | ==== Training Cost ==== | ||
| - | Because each training step activates only a fraction of parameters, MoE models can be trained far more cheaply than comparably-capable dense models. DeepSeek-V3's reported full training run took about 2.79 million H800 GPU-hours despite 671B total parameters, since only about 37B are active per token. | + | Because each training step activates only a fraction of parameters, MoE models can be trained far more cheaply than comparably-capable dense models. DeepSeek-V3's reported full training run took about 2.79 million H800 GPU-hours despite 671B total parameters, since only about 37B are active per token. |
| ==== Scalability ==== | ==== Scalability ==== | ||
| Line 98: | Line 98: | ||
| * Token position patterns | * Token position patterns | ||
| - | This specialization may contribute to MoE's strong performance across diverse tasks.((Hugging Face. "Mixture of Experts Explained.")) | + | This specialization may contribute to MoE's strong performance across diverse tasks.((Hugging Face. "Mixture of Experts Explained.")) |
| ===== Disadvantages ===== | ===== Disadvantages ===== | ||
| Line 108: | Line 108: | ||
| ==== Training Instability and Expert Collapse ==== | ==== Training Instability and Expert Collapse ==== | ||
| - | Without careful load balancing, MoE training suffers from **expert collapse**: the router converges to always selecting the same 1–2 experts, wasting the remaining experts' capacity. | + | Without careful load balancing, MoE training suffers from **expert collapse**: the router converges to always selecting the same 1–2 experts, wasting the remaining experts' capacity. |
| ==== Routing Overhead and Communication ==== | ==== Routing Overhead and Communication ==== | ||
| - | In distributed training and inference, different experts reside on different devices. Each token must be routed to the correct device(s), processed, and the results gathered — incurring **all-to-all communication** overhead. This routing overhead can dominate latency on high-latency interconnects and limits MoE's advantage on smaller clusters or edge hardware. | + | In distributed training and inference, different experts reside on different devices. Each token must be routed to the correct device(s), processed, and the results gathered — incurring **all-to-all communication** overhead. This routing overhead can dominate latency on high-latency interconnects and limits MoE's advantage on smaller clusters or edge hardware. |
| ==== Fine-tuning Difficulty ==== | ==== Fine-tuning Difficulty ==== | ||
| Line 139: | Line 139: | ||
| | Expert specialization | None | Yes | | | Expert specialization | None | Yes | | ||
| | Serving complexity | Low | High | | | Serving complexity | Low | High | | ||
| - | |||
| - | ===== References ===== | ||
| - | |||
| - | - Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E. "Adaptive Mixtures of Local Experts." Neural Computation, 1991. | ||
| - | - Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." 2017. | ||
| - | - Lepikhin, D. et al. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." 2020. | ||
| - | - Fedus, W., Zoph, B., Shazeer, N. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." 2022. | ||
| - | - DeepSeek-AI. "DeepSeek-V3 Technical Report." 2024. | ||
| - | - Hugging Face. "Mixture of Experts Explained." | ||
| - | - Zoph, B. et al. "A Survey on Mixture of Experts." | ||
| ===== See Also ===== | ===== See Also ===== | ||
| Line 156: | Line 146: | ||
| * [[on_device_agents|On-Device Agents]] | * [[on_device_agents|On-Device Agents]] | ||
| * [[model_comparison|Model Comparison]] | * [[model_comparison|Model Comparison]] | ||
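The load-balancing mechanisms discussed in the article above (top-k routing, expert capacity, token dropping, and an auxiliary balancing loss) can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the article: the function names, the greedy capacity rule, and the Switch-style form of the auxiliary loss are all assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route_top2(router_logits, num_experts, capacity):
    """Top-2 token routing with expert capacity, token dropping, and a
    Switch-style auxiliary load-balancing loss.

    router_logits: one list of num_experts logits per token.
    Returns (assignments, aux_loss); assignments[t] lists the experts that
    accepted token t (a token whose chosen experts are full is dropped).
    """
    load = [0] * num_experts          # tokens accepted by each expert
    importance = [0.0] * num_experts  # summed routing probability per expert
    assignments = []
    n = len(router_logits)
    for logits in router_logits:
        probs = softmax(logits)
        for e, p in enumerate(probs):
            importance[e] += p
        # Expert capacity: the two highest-probability experts accept the
        # token only while below their per-batch capacity; otherwise the
        # token overflows and is dropped (token dropping).
        top2 = sorted(range(num_experts), key=lambda e: -probs[e])[:2]
        accepted = [e for e in top2 if load[e] < capacity]
        for e in accepted:
            load[e] += 1
        assignments.append(accepted)
    # Auxiliary loss: num_experts * sum_e f_e * P_e, where f_e is the
    # fraction of tokens accepted by expert e and P_e its mean routing
    # probability; it grows when routing concentrates on few experts.
    aux_loss = num_experts * sum(
        (load[e] / n) * (importance[e] / n) for e in range(num_experts)
    )
    return assignments, aux_loss
```

With a skewed router, the favored experts fill to capacity and later tokens are dropped, while the auxiliary loss term grows; a balanced router keeps all experts below capacity and the loss small.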
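The compute-efficiency argument in the article reduces to parameter counting: total parameters set model capacity, while only the routed experts contribute to per-token FLOPs. A minimal sketch, assuming a hypothetical 8-expert, top-2 configuration; the function name and all dimensions below are illustrative, not taken from the article.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k):
    """Total vs. per-token-active parameter counts for the expert (FFN)
    part of a hypothetical MoE Transformer. Attention, embeddings, and
    biases are ignored; each expert is a two-matrix FFN."""
    expert_ffn = 2 * d_model * d_ff   # up- and down-projection weights
    router = d_model * n_experts      # linear router scoring the experts
    total = n_layers * (n_experts * expert_ffn + router)
    active = n_layers * (top_k * expert_ffn + router)
    return total, active

# Illustrative dimensions (assumed): 32 layers, 8 experts, top-2 routing.
total, active = moe_param_counts(4096, 14336, 32, 8, 2)
print(f"total {total / 1e9:.1f}B, active {active / 1e9:.1f}B per token")
```

With these assumed shapes, roughly a quarter of the expert parameters (top_k / n_experts, plus the small router) are active per token, which is the decoupling of capacity from compute cost described above.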
| + | |||