====== Moonshot AI Kimi K2 ======

Moonshot AI's **Kimi K2** is a trillion-parameter open-source large language model built on a Mixture-of-Experts (MoE) Transformer architecture. First released in mid-2025, Kimi K2 represents one of China's most ambitious contributions to frontier AI, activating only 32 billion of its 1.04 trillion total parameters per token for efficient inference. ((Source: [[https://intuitionlabs.ai/articles/kimi-k2-technical-deep-dive|Intuition Labs — Kimi K2 Technical Deep Dive]])) An upgraded version, **Kimi K2.5**, followed in January 2026 with native multimodal capabilities and an expanded 256K context window. ((Source: [[https://www.codecademy.com/article/kimi-k-2-5-complete-guide-to-moonshots-ai-model|Codecademy — Kimi K2.5 Complete Guide]]))

===== Architecture =====

Kimi K2 employs a dense-sparse hybrid design with the following specifications:

  * **Total parameters:** 1.04 trillion
  * **Active parameters per token:** 32 billion
  * **Layers:** 61 (including 1 dense layer)
  * **Experts:** 384 total, 8 selected per token, plus 1 shared expert
  * **Attention mechanism:** Multi-head Latent Attention (MLA)
  * **Activation function:** SwiGLU
  * **Vocabulary size:** 160,000 tokens
  * **Context window:** 128K tokens (K2), 256K tokens (K2.5)

The MLA mechanism compresses key-value projections into a lower-dimensional latent space before computing attention scores, reducing memory-bandwidth requirements by 40-50%. ((Source: [[https://www.codecademy.com/article/kimi-k-2-5-complete-guide-to-moonshots-ai-model|Codecademy — Kimi K2.5 Complete Guide]])) The model was trained using the **Muon optimizer**, purpose-built for trillion-parameter MoE models.

===== Training =====

Kimi K2.5 was trained on **15 trillion mixed visual and textual tokens** in a unified pipeline, allowing vision and language capabilities to develop together rather than as separate modules.
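The memory saving from MLA's key-value compression can be sketched numerically. All dimensions below are illustrative assumptions for a toy layer, not Kimi K2's published values; the point is only that caching one shared low-rank latent per token is far cheaper than caching full per-head keys and values:

```python
import numpy as np

# Toy Multi-head Latent Attention (MLA) compression sketch.
# ASSUMED dimensions (not Kimi K2's real ones): model width 1024,
# 8 heads of width 128, shared KV latent of width 256.
d_model, n_heads, head_dim, d_c = 1024, 8, 128, 256
seq_len = 16

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))

# Standard attention caches full K and V for every head:
full_kv_cache = seq_len * n_heads * head_dim * 2   # values cached per sequence

# MLA caches only one low-rank latent c_kv per token and reconstructs
# K and V from it with up-projections at attention time.
W_down = rng.standard_normal((d_model, d_c)) / np.sqrt(d_model)       # compress
W_up_k = rng.standard_normal((d_c, n_heads * head_dim)) / np.sqrt(d_c)
W_up_v = rng.standard_normal((d_c, n_heads * head_dim)) / np.sqrt(d_c)

c_kv = x @ W_down                                  # (seq_len, d_c): the cached latent
k = (c_kv @ W_up_k).reshape(seq_len, n_heads, head_dim)
v = (c_kv @ W_up_v).reshape(seq_len, n_heads, head_dim)

latent_cache = seq_len * d_c                       # values cached per sequence under MLA
print(f"full KV cache entries:    {full_kv_cache}")
print(f"MLA latent cache entries: {latent_cache}")
```

The actual reduction depends entirely on the ratio of the latent width to the total KV width, so the toy numbers here should not be read as Kimi K2's reported 40-50% bandwidth figure.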
((Source: [[https://build.nvidia.com/moonshotai/kimi-k2.5/modelcard|NVIDIA — Kimi K2.5 Model Card]])) The vision component uses **MoonViT**, a 400-million-parameter vision encoder that processes images through the same transformer architecture as text.

===== Key Capabilities =====

  * **Agentic intelligence:** extended step-by-step reasoning and optimized tool use
  * **Agent Swarm Technology:** K2.5 can coordinate up to 100 specialized AI agents simultaneously, achieving 4.5x faster execution while reducing costs by 76% compared to leading closed models ((Source: [[https://www.codecademy.com/article/kimi-k-2-5-complete-guide-to-moonshots-ai-model|Codecademy — Kimi K2.5 Complete Guide]]))
  * **Visual coding:** generates production-ready React or HTML from UI mockups, including responsive design and accessibility
  * **Autonomous visual debugging:** renders generated code, compares it against the original designs, and iteratively refines the output

===== Benchmark Performance =====

Kimi K2.5 achieved **50.2% on Humanity's Last Exam** at significantly lower cost than comparable closed models. ((Source: [[https://www.codecademy.com/article/kimi-k-2-5-complete-guide-to-moonshots-ai-model|Codecademy — Kimi K2.5 Complete Guide]])) Both versions achieve state-of-the-art open-model performance across code, reasoning, and multi-step tasks.

===== Deployment =====

Both models are fully **open-source**, distributed via HuggingFace in base and instruction-tuned variants. K2 runs at approximately 15 tokens/s on two Apple M3 Ultras. Using Unsloth Dynamic 1.8-bit quantization, K2.5's disk requirements drop from 600GB to 240GB, enabling operation on a single 24GB GPU with system RAM offloading. ((Source: [[https://unsloth.ai/docs/models/kimi-k2.5|Unsloth — Kimi K2.5]])) A notable trade-off is that K2 exhibits 2-2.5x higher token usage than comparable models.

===== China's MoE Advancements =====

Kimi K2 is part of a broader wave of Chinese MoE model development.
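The efficiency claim behind this wave — activating only 8 of 384 experts (plus one shared expert) per token, as in Kimi K2's architecture — can be sketched with a minimal top-k router. The model width and expert weights below are toy assumptions chosen purely for illustration:

```python
import numpy as np

# Minimal sparse-MoE routing sketch: a router scores 384 experts per
# token, only the top 8 plus 1 always-on shared expert actually run,
# so most parameters stay idle on any given token.
# The expert counts match the article; d_model=64 is an ASSUMED toy width.
n_experts, top_k, d_model = 384, 8, 64
rng = np.random.default_rng(0)

router = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) * 0.01 for _ in range(n_experts)]
shared_expert = rng.standard_normal((d_model, d_model)) * 0.01

def moe_layer(x):
    logits = x @ router                                # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]                  # indices of the 8 chosen experts
    gates = np.exp(logits[top] - logits[top].max())    # stable softmax over the
    gates /= gates.sum()                               # selected experts only
    out = x @ shared_expert                            # shared expert always runs
    for g, i in zip(gates, top):
        out = out + g * (x @ experts[i])               # weighted sum of chosen experts
    return out, top

x = rng.standard_normal(d_model)
y, chosen = moe_layer(x)
print(f"experts evaluated per token: {len(chosen) + 1} of {n_experts + 1}")
```

Because only the 9 selected expert matrices are touched per token, compute per inference step scales with the active parameters (32B for Kimi K2) rather than the total parameter count (1.04T) — which is the basis of the compute-efficiency comparison below.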
Moonshot AI, founded in 2023 by former Tsinghua University researchers, has positioned itself alongside DeepSeek and other Chinese labs pushing the boundaries of efficient large-scale AI. The MoE architecture allows these models to compete with Western frontier models while requiring significantly less compute per inference request.

===== See Also =====

  * [[mixture_of_experts|Mixture of Experts]]
  * [[moonshot_ai|Moonshot AI]]
  * [[large_language_models|Large Language Models]]

===== References =====