Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
State Space Models (SSMs) are a family of sequence modeling architectures that process sequential data through a fixed-size hidden state updated via linear dynamics. Unlike Transformers, which require O(n²) compute relative to sequence length, SSMs operate in O(n) linear time — making them compelling for long-context and resource-constrained applications.1)
An SSM maps an input sequence x(t) to an output sequence y(t) via a latent hidden state h(t). In continuous form:
dh/dt = A·h(t) + B·x(t)
y(t) = C·h(t) + D·x(t)
For practical discrete-time use, the continuous parameters are discretized (via zero-order hold or bilinear transform) to yield:
h_t = Ā·h_{t−1} + B̄·x_t
y_t = C·h_t
where Ā and B̄ are the discretized state-transition and input matrices. The feed-through term D·x(t) from the continuous form is typically implemented as a simple skip connection and omitted from the recurrence.
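As a concrete illustration, here is a minimal NumPy sketch of the discretized recurrence for a diagonal state matrix (the common parameterization in S4/Mamba-style models). All names, shapes, and the scalar-input simplification are illustrative, not any library's API:

```python
import numpy as np

def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization for a *diagonal* state matrix A:
    Abar = exp(dt*A),  Bbar = (Abar - 1)/A * B   (all elementwise)."""
    Abar = np.exp(dt * A)
    Bbar = (Abar - 1.0) / A * B
    return Abar, Bbar

def ssm_scan(A, B, C, x, dt=0.1):
    """Sequential recurrence: h_t = Abar*h_{t-1} + Bbar*x_t,  y_t = C.h_t."""
    Abar, Bbar = discretize_zoh(A, B, dt)
    h = np.zeros_like(A)
    ys = []
    for xt in x:                     # one scalar input per step
        h = Abar * h + Bbar * xt     # fixed-size state update
        ys.append(C @ h)             # readout
    return np.array(ys)

rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(4)) - 0.5  # stable: strictly negative eigenvalues
B = rng.standard_normal(4)
C = rng.standard_normal(4)
x = rng.standard_normal(16)

y = ssm_scan(A, B, C, x)
print(y.shape)  # (16,)
```

Note that the state h never grows with sequence length, which is the source of the constant-memory inference property discussed below.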
Key insight: The same model can be computed in two equivalent modes:

- Convolutional mode: the linear recurrence unrolls into a fixed convolution kernel, so the whole sequence can be processed in parallel during training.
- Recurrent mode: tokens are processed one at a time against a constant-size hidden state, giving constant memory and constant per-token cost at inference.

This duality gives SSMs the training efficiency of attention-based models and the inference efficiency of RNNs.
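The equivalence of the two modes can be checked numerically. The following sketch materializes the convolution kernel K[k] = C·Ā^k·B̄ of a pre-discretized diagonal SSM and compares causal convolution against the step-by-step recurrence (all names and dimensions are illustrative):

```python
import numpy as np

def materialize_kernel(Abar, Bbar, C, L):
    """K[k] = C . (Abar^k * Bbar): the SSM unrolled as a length-L filter."""
    return np.array([C @ (Abar**k * Bbar) for k in range(L)])

def conv_mode(K, x):
    """Causal convolution: y_t = sum_{k<=t} K[k] * x[t-k]."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(t + 1):
            y[t] += K[k] * x[t - k]
    return y

def rec_mode(Abar, Bbar, C, x):
    """Step-by-step recurrence over the same parameters."""
    h = np.zeros_like(Abar)
    out = []
    for xt in x:
        h = Abar * h + Bbar * xt
        out.append(C @ h)
    return np.array(out)

rng = np.random.default_rng(1)
N, L = 4, 32
Abar = rng.uniform(0.1, 0.9, N)   # pre-discretized diagonal transition
Bbar = rng.standard_normal(N)
C = rng.standard_normal(N)
x = rng.standard_normal(L)

y_rec = rec_mode(Abar, Bbar, C, x)
y_conv = conv_mode(materialize_kernel(Abar, Bbar, C, L), x)
print(np.allclose(y_rec, y_conv))  # True
```

In practice the convolution is computed with FFTs (S4) or replaced by a parallel scan (Mamba), but the underlying identity is the one demonstrated here.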
SSMs evolved rapidly from theoretical foundations to production-scale models:
The Linear State Space Layer introduced the theoretical framework for applying continuous-time SSMs to deep learning, demonstrating viability on long-range sequence benchmarks.2)
Structured State Spaces for Sequences (S4) was the breakthrough model that made SSMs practical.3) It introduced the HiPPO initialization — a principled method for initializing the state matrix A to optimally memorize history via orthogonal polynomial projections — solving the vanishing/exploding gradient problem that plagued prior SSMs. S4 achieved state-of-the-art on the Long Range Arena benchmark.
Mamba introduced selective state spaces: input-dependent (data-dependent) SSM parameters (B, C, and the discretization step Δ), allowing the model to selectively retain or discard information based on content.4) This selectivity mechanism addressed the primary weakness of prior SSMs — inability to perform content-based reasoning — while retaining linear-time complexity. Mamba also introduced a hardware-aware parallel scan algorithm for efficient GPU execution.
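The selectivity mechanism can be sketched in a few lines. In this toy single-channel version (a deliberate simplification of Mamba's per-channel scan; the projection weights w_dt, w_B, w_C are hypothetical), the step size Δ and the matrices B and C are recomputed from each input token, so the state update can amplify or suppress tokens by content:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, L = 4, 32

A = -np.abs(rng.standard_normal(d_state)) - 0.5  # fixed diagonal state matrix
w_dt, b_dt = 0.5, 0.0                            # hypothetical Delta projection
w_B = rng.standard_normal(d_state)               # hypothetical B projection
w_C = rng.standard_normal(d_state)               # hypothetical C projection

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x):
    """Delta_t, B_t, C_t are functions of the current input x_t."""
    h = np.zeros(d_state)
    ys = []
    for xt in x:
        dt = softplus(w_dt * xt + b_dt)  # input-dependent step size, > 0
        Bt = w_B * xt                    # input-dependent input matrix
        Ct = w_C * xt                    # input-dependent output matrix
        Abar = np.exp(dt * A)            # per-step ZOH discretization
        Bbar = (Abar - 1.0) / A * Bt
        h = Abar * h + Bbar * xt         # selective state update
        ys.append(Ct @ h)
    return np.array(ys)

y = selective_scan(rng.standard_normal(L))
print(y.shape)  # (32,)
```

A small Δ_t makes Ā_t ≈ 1 and B̄_t ≈ 0, effectively skipping a token; a large Δ_t resets the state toward the current input. Because the parameters now vary per step, the fixed-kernel convolutional mode no longer applies, which is why Mamba relies on a parallel scan instead.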
Mamba-2 reformulated the selective SSM as a Structured State Space Duality (SSD), revealing a formal connection between SSMs and linear attention.5) This connection enabled more efficient tensor-parallel training and yielded 2–8× throughput improvements over Mamba on modern hardware.
The following table compares the two paradigms across key dimensions:
| Property | SSM (e.g., Mamba) | Transformer (e.g., GPT) |
|---|---|---|
| Training compute | O(n) per layer | O(n²) per layer |
| Inference memory | Constant (fixed state size) | Grows with context (KV cache) |
| Context representation | Lossy compression into state | Lossless access to all tokens |
| Long-context scaling | Efficient | Expensive |
| Exact token retrieval | Difficult | Native |
| Hardware utilization | Scan-based (custom kernels) | Matmul-dominated (BLAS) |
| Streaming inference | Native | Requires sliding window tricks |
Mamba and Mamba-2 are the foundational selective SSM models from Albert Gu and Tri Dao. Mamba-2 is the recommended baseline for new SSM-based projects due to its improved hardware efficiency and theoretical grounding in SSD.
Developed by AI21 Labs, Jamba is a hybrid Transformer–Mamba–MoE (Mixture of Experts) architecture supporting a 256K token context window. It interleaves Mamba layers with attention layers and MoE feed-forward blocks, demonstrating that hybrid architectures can outperform pure SSMs and pure Transformers on both quality and throughput.6)
RWKV is a family of architectures blending RNN and attention concepts, formulated as a linear recurrence. RWKV-7 (“Goose”) represents the latest iteration with improved expressivity and multilingual capabilities.7)
Griffin is DeepMind's hybrid architecture combining gated linear recurrences with local attention windows, deployed as RecurrentGemma.8) Hawk is the pure-recurrence variant. Both demonstrate competitive quality with Transformers at significantly reduced inference cost.
Bamba is IBM's open-source hybrid SSM model, reporting approximately 2× faster inference than comparable Transformer models while maintaining competitive benchmark performance.
The limitations of pure SSMs and the quadratic cost of pure Transformers have driven convergence toward hybrid architectures that interleave SSM layers with attention layers.
Key examples include Jamba (AI21), Griffin/RecurrentGemma (DeepMind), and Bamba (IBM). The emerging consensus in the research community is that the future of sequence modeling lies in hybrid SSM+attention designs: SSM layers handle long-range compression efficiently, while sparse attention layers handle precise retrieval and in-context reasoning.10)
Typical hybrid ratios range from 1 attention layer per 3–7 SSM layers, with the optimal ratio remaining an active research question.
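A hybrid layer stack of this kind is straightforward to express. The helper below is a hypothetical sketch (not any framework's API) showing a 1:3 attention-to-SSM interleaving:

```python
def hybrid_stack(n_layers, attn_every=4):
    """Interleave one attention layer per (attn_every - 1) SSM layers."""
    return ["attn" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

layers = hybrid_stack(12, attn_every=4)
print(layers.count("ssm"), layers.count("attn"))  # 9 3
```

Jamba, for example, uses a comparable interleaving of Mamba, attention, and MoE blocks; the right ratio for a given model size and task remains an open question, as noted above.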