====== State Space Models (SSMs) ======

State Space Models (SSMs) are a family of sequence modeling architectures that process sequential data through a fixed-size hidden state updated via linear dynamics. Unlike Transformers, which require O(n²) compute relative to sequence length, SSMs operate in O(n) linear time — making them compelling for long-context and resource-constrained applications.((Albert Gu and Tri Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", Dec 2023. [[https://arxiv.org/abs/2312.00752|arXiv:2312.00752]]))

===== Definition =====

An SSM maps an input sequence x(t) to an output sequence y(t) via a latent hidden state h(t). In continuous form:

  dh/dt = A·h(t) + B·x(t)
  y(t)  = C·h(t) + D·x(t)

For practical discrete-time use, the continuous parameters are discretized (via zero-order hold or bilinear transform) to yield:

  h_t = Ā·h_{t−1} + B̄·x_t
  y_t = C·h_t

where Ā and B̄ are the discretized state-transition and input matrices; the D·x_t term acts as a skip connection and is typically handled separately.

**Key insight:** the same model can be computed in two equivalent modes:

  * **Recurrent mode** (inference): update h_t step-by-step like an RNN — O(1) memory, ideal for streaming.
  * **Convolutional mode** (training): unroll the recurrence into a global convolution kernel — enables full parallelism like a CNN.

This duality gives SSMs the training efficiency of attention-based models and the inference efficiency of RNNs.
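The two modes can be checked against each other in a few lines of pure Python. This is a hedged sketch under simplifying assumptions (a single channel with scalar A, B, C; real models use structured or diagonal state matrices per channel), not any library's actual implementation:

```python
import math

def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization for a scalar SSM:
    Abar = exp(dt*A), Bbar = (Abar - 1)/A * B."""
    Abar = math.exp(dt * A)
    Bbar = (Abar - 1.0) / A * B
    return Abar, Bbar

def ssm_recurrent(Abar, Bbar, C, xs):
    """Recurrent mode: one step per token, O(1) state (inference)."""
    h, ys = 0.0, []
    for x in xs:
        h = Abar * h + Bbar * x
        ys.append(C * h)
    return ys

def ssm_convolutional(Abar, Bbar, C, xs):
    """Convolutional mode: unroll the recurrence into the kernel
    K[k] = C * Abar**k * Bbar and convolve it with the input
    (the form that parallelizes during training)."""
    n = len(xs)
    K = [C * (Abar ** k) * Bbar for k in range(n)]
    return [sum(K[k] * xs[t - k] for k in range(t + 1)) for t in range(n)]

A, B, C, dt = -0.5, 1.0, 1.0, 0.1
Abar, Bbar = discretize_zoh(A, B, dt)
xs = [1.0, 0.0, 0.0, 2.0, 1.0]
ys_rec = ssm_recurrent(Abar, Bbar, C, xs)
ys_conv = ssm_convolutional(Abar, Bbar, C, xs)
assert all(abs(a - b) < 1e-9 for a, b in zip(ys_rec, ys_conv))
```

The final assertion is the duality in miniature: both modes compute the same outputs, so a model can train in the parallel form and serve in the constant-memory form.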
===== Evolution =====

SSMs evolved rapidly from theoretical foundations to production-scale models.

==== LSSL (2021) ====

The Linear State Space Layer introduced the theoretical framework for applying continuous-time SSMs to deep learning, demonstrating viability on long-range sequence benchmarks.((Albert Gu et al., "Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers", NeurIPS 2021.))

==== S4 (2022) ====

Structured State Spaces for Sequences (S4) was the breakthrough model that made SSMs practical.((Albert Gu, Karan Goel, and Christopher Ré, "Efficiently Modeling Long Sequences with Structured State Spaces", ICLR 2022. [[https://arxiv.org/abs/2111.00396|arXiv:2111.00396]])) It introduced the **HiPPO initialization** — a principled method for initializing the state matrix A to optimally memorize history via orthogonal polynomial projections — solving the vanishing/exploding gradient problem that plagued prior SSMs. S4 achieved state-of-the-art results on the Long Range Arena benchmark.

==== Mamba (December 2023) ====

Mamba introduced **selective state spaces**: input-dependent (data-dependent) SSM parameters (B, C, and the discretization step Δ), allowing the model to selectively retain or discard information based on content.((Albert Gu and Tri Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", Dec 2023. [[https://arxiv.org/abs/2312.00752|arXiv:2312.00752]])) This selectivity mechanism addressed the primary weakness of prior SSMs — inability to perform content-based reasoning — while retaining linear-time complexity. Mamba also introduced a hardware-aware parallel scan algorithm for efficient GPU execution.
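The selectivity idea can be sketched by extending the scalar discretization above so that Δ, B, and C depend on the current input. This is a deliberately simplified, single-channel illustration (all weights here are made-up scalars; real Mamba uses learned projections, multi-dimensional states, and a hardware-aware parallel scan rather than a Python loop):

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_scan(xs, A=-1.0, w_dt=2.0, b_dt=-1.0, w_B=1.0, w_C=1.0):
    """Scalar 'selective' SSM sketch: dt, B, and C are functions of the
    input token, so each update can absorb or ignore that token."""
    h, ys = 0.0, []
    for x in xs:
        dt = softplus(w_dt * x + b_dt)   # input-dependent step size
        B = w_B * x                      # input-dependent input matrix
        C = w_C * x                      # input-dependent output matrix
        Abar = math.exp(dt * A)          # zero-order-hold, as before
        Bbar = (Abar - 1.0) / A * B
        h = Abar * h + Bbar * x          # large dt: rewrite the state;
        ys.append(C * h)                 # small dt: mostly carry it over
    return ys
```

Because dt shrinks toward zero for uninformative inputs, Ā approaches 1 and the state is carried through unchanged; for salient inputs dt grows and the state is largely rewritten — the content-based gating that fixed-parameter SSMs lack.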
==== Mamba-2 (May 2024) ====

Mamba-2 reformulated the selective SSM through the **Structured State Space Duality (SSD)** framework, revealing a formal connection between SSMs and linear attention.((Tri Dao and Albert Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality", May 2024. [[https://arxiv.org/abs/2405.21060|arXiv:2405.21060]])) This connection enabled more efficient tensor-parallel training and yielded 2–8× throughput improvements over Mamba on modern hardware.

===== SSMs vs. Transformers =====

The following table compares the two paradigms across key dimensions:

^ Property ^ SSM (e.g., Mamba) ^ Transformer (e.g., GPT) ^
| Training compute | O(n) per layer | O(n²) per layer |
| Inference memory | Constant (fixed state size) | Grows with context (KV cache) |
| Context representation | Lossy compression into state | Lossless access to all tokens |
| Long-context scaling | Efficient | Expensive |
| Exact token retrieval | Difficult | Native |
| Hardware utilization | Scan-based (custom kernels) | Matmul-dominated (BLAS) |
| Streaming inference | Native | Requires sliding-window tricks |

===== Key Models =====

==== Mamba / Mamba-2 ====

The foundational selective SSM models from Albert Gu and Tri Dao. Mamba-2 is the recommended baseline for new SSM-based projects due to its improved hardware efficiency and theoretical grounding in SSD.

==== Jamba ====

Developed by AI21 Labs, Jamba is a hybrid Transformer–Mamba–MoE (Mixture of Experts) architecture supporting a 256K-token context window. It interleaves Mamba layers with attention layers and MoE feed-forward blocks, demonstrating that hybrid architectures can outperform both pure SSMs and pure Transformers on quality and throughput.((Jamba Team, AI21 Labs, "Jamba: A Hybrid Transformer-Mamba Language Model", ICLR 2025.))

==== RWKV (v4–v7 "Goose") ====

RWKV is a family of architectures blending RNN and attention concepts, formulated as a linear recurrence.
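The flavor of such RNN/attention blends can be caricatured in a few lines. This hypothetical single-channel sketch (not RWKV's actual formulation, which uses per-channel decays, token shift, and, in v7, a dynamic state-evolution matrix) keeps a decayed, key-weighted running average of values and reads it out through a "receptance" gate, with O(1) state and no KV cache:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_recurrence_mix(rs, ks, vs, decay):
    """Illustrative linear-recurrence token mixer: exp(k)-weighted
    values accumulate under a decay, and a sigmoid receptance gate
    controls how much of the running average each position emits."""
    num = den = 0.0
    ys = []
    for r, k, v in zip(rs, ks, vs):
        num = decay * num + math.exp(k) * v   # weighted value accumulator
        den = decay * den + math.exp(k)       # matching weight normalizer
        ys.append(sigmoid(r) * (num / den))   # gated read-out
    return ys
```

The num/den pair plays the role softmax normalization plays in attention, but is maintained incrementally instead of recomputed over all past tokens.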
RWKV-7 ("Goose") represents the latest iteration, with improved expressivity and multilingual capabilities.((Bo Peng et al., "RWKV-7 Goose", 2025. [[https://arxiv.org/abs/2503.14456|arXiv:2503.14456]]))

==== Griffin / Hawk (DeepMind / RecurrentGemma) ====

Griffin is DeepMind's hybrid architecture combining gated linear recurrences with local attention windows, deployed as RecurrentGemma.((Soham De et al., "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models", 2024. [[https://arxiv.org/abs/2402.19427|arXiv:2402.19427]])) Hawk is the pure-recurrence variant. Both demonstrate competitive quality with Transformers at significantly reduced inference cost.

==== Bamba (IBM) ====

Bamba is IBM's open-source hybrid SSM model, reporting approximately 2× faster inference than comparable Transformer models while maintaining competitive benchmark performance.

===== Advantages =====

  * **Linear complexity:** O(n) training compute enables practical training on very long sequences (100K+ tokens).
  * **Constant inference memory:** the fixed-size recurrent state means memory does not grow with context length, unlike Transformer KV caches.
  * **Streaming inference:** SSMs naturally support token-by-token generation without storing past context, enabling true streaming and low-latency applications.
  * **Energy efficiency:** the reduced compute and memory footprint enables deployment on edge hardware at under 0.5 W — critical for on-device AI agents.
  * **Parallelizable training:** the convolutional equivalence enables full parallelism during training, matching Transformer training efficiency.

===== Limitations =====

  * **Exact retrieval / copying:** SSMs struggle with tasks requiring verbatim copying or precise retrieval of earlier tokens; Jelassi et al. (ICML 2024) proved theoretical limits on SSM expressivity for such tasks.((Samy Jelassi et al., "Repeat After Me: Transformers are Better than State Space Models at Copying", ICML 2024.))
  * **Expressivity limits:** the linear recurrence structure constrains the class of computable functions relative to softmax attention.
  * **Needle-in-a-haystack:** pure SSMs underperform Transformers on tasks requiring retrieval of a specific fact embedded in a long context.
  * **Ecosystem maturity:** custom CUDA kernels (e.g., for the parallel scan) are required for efficient training; tooling is less mature than Transformer frameworks.
  * **Benchmark coverage:** the research community has fewer standardized benchmarks designed to stress-test SSM-specific failure modes.

===== Hybrid Architectures =====

The limitations of pure SSMs and the quadratic cost of pure Transformers have driven convergence toward **hybrid architectures** that interleave SSM layers with attention layers. Key examples include Jamba (AI21), Griffin/RecurrentGemma (DeepMind), and Bamba (IBM). The emerging consensus in the research community is that the future of sequence modeling lies in hybrid SSM+attention designs: SSM layers handle long-range compression efficiently, while sparse attention layers handle precise retrieval and in-context reasoning.((AI21 Labs, "The Rise of Hybrid LLMs: Attention Was Never Enough". [[https://www.ai21.com/blog/rise-of-hybrid-llms/|ai21.com/blog/rise-of-hybrid-llms]])) Typical hybrid ratios range from one attention layer per 3–7 SSM layers, with the optimal ratio remaining an active research question.

===== References =====

  - Albert Gu, Karan Goel, Christopher Ré. "Efficiently Modeling Long Sequences with Structured State Spaces" (S4). ICLR 2022. [[https://arxiv.org/abs/2111.00396|arXiv:2111.00396]]
  - Albert Gu, Tri Dao. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". Dec 2023. [[https://arxiv.org/abs/2312.00752|arXiv:2312.00752]]
  - Tri Dao, Albert Gu. "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" (Mamba-2). May 2024. [[https://arxiv.org/abs/2405.21060|arXiv:2405.21060]]
  - Soham De et al. "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models". 2024. [[https://arxiv.org/abs/2402.19427|arXiv:2402.19427]]
  - Bo Peng et al. "RWKV-7 Goose". 2025. [[https://arxiv.org/abs/2503.14456|arXiv:2503.14456]]
  - Jamba Team, AI21 Labs. "Jamba: A Hybrid Transformer-Mamba Language Model". ICLR 2025.
  - Samy Jelassi et al. "Repeat After Me: Transformers are Better than State Space Models at Copying". ICML 2024.
  - AI21 Labs. "The Rise of Hybrid LLMs: Attention Was Never Enough". [[https://www.ai21.com/blog/rise-of-hybrid-llms/|ai21.com]]

===== See Also =====

  * [[transformer_architecture|Transformer Architecture]]
  * [[attention_mechanism|Attention Mechanism]]
  * [[inference_optimization|Inference Optimization]]
  * [[on_device_agents|On-Device Agents]]