Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
State Space Models (SSMs) are a family of sequence modeling architectures that process sequential data through a fixed-size hidden state updated via linear dynamics. Unlike Transformers, which require O(n²) compute relative to sequence length, SSMs operate in O(n) linear time — making them compelling for long-context and resource-constrained applications.1)
An SSM maps an input sequence x(t) to an output sequence y(t) via a latent hidden state h(t). In continuous form:
dh/dt = A·h(t) + B·x(t)
y(t) = C·h(t) + D·x(t)
For practical discrete-time use, the continuous parameters are discretized (via zero-order hold or bilinear transform) to yield:
h_t = Ā·h_{t−1} + B̄·x_t
y_t = C·h_t
where Ā and B̄ are the discretized state-transition and input matrices. The feed-through term D·x(t) from the continuous form is typically implemented as a simple skip connection and omitted from the recurrence.
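As a concrete illustration, here is a minimal NumPy sketch of the discretized recurrence for a diagonal state matrix (the common parameterization in S4/Mamba-style models). All names, shapes, and the scalar-input simplification are illustrative, not any library's API:

```python
import numpy as np

def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization for a *diagonal* state matrix A:
    Abar = exp(dt*A),  Bbar = (Abar - 1)/A * B   (all elementwise)."""
    Abar = np.exp(dt * A)
    Bbar = (Abar - 1.0) / A * B
    return Abar, Bbar

def ssm_scan(A, B, C, x, dt=0.1):
    """Sequential recurrence: h_t = Abar*h_{t-1} + Bbar*x_t,  y_t = C.h_t."""
    Abar, Bbar = discretize_zoh(A, B, dt)
    h = np.zeros_like(A)
    ys = []
    for xt in x:                     # one scalar input per step
        h = Abar * h + Bbar * xt     # fixed-size state update
        ys.append(C @ h)             # readout
    return np.array(ys)

rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(4)) - 0.5  # stable: strictly negative eigenvalues
B = rng.standard_normal(4)
C = rng.standard_normal(4)
x = rng.standard_normal(16)

y = ssm_scan(A, B, C, x)
print(y.shape)  # (16,)
```

Note that the state h never grows with sequence length, which is the source of the constant-memory inference property discussed below.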
Key insight: The same model can be computed in two equivalent modes:

- Convolutional mode: the linear recurrence unrolls into a fixed convolution kernel, so the whole sequence can be processed in parallel during training.
- Recurrent mode: tokens are processed one at a time against a constant-size hidden state, giving constant memory and constant per-token cost at inference.

This duality gives SSMs the training efficiency of attention-based models and the inference efficiency of RNNs.
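The equivalence of the two modes can be checked numerically. The following sketch materializes the convolution kernel K[k] = C·Ā^k·B̄ of a pre-discretized diagonal SSM and compares causal convolution against the step-by-step recurrence (all names and dimensions are illustrative):

```python
import numpy as np

def materialize_kernel(Abar, Bbar, C, L):
    """K[k] = C . (Abar^k * Bbar): the SSM unrolled as a length-L filter."""
    return np.array([C @ (Abar**k * Bbar) for k in range(L)])

def conv_mode(K, x):
    """Causal convolution: y_t = sum_{k<=t} K[k] * x[t-k]."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(t + 1):
            y[t] += K[k] * x[t - k]
    return y

def rec_mode(Abar, Bbar, C, x):
    """Step-by-step recurrence over the same parameters."""
    h = np.zeros_like(Abar)
    out = []
    for xt in x:
        h = Abar * h + Bbar * xt
        out.append(C @ h)
    return np.array(out)

rng = np.random.default_rng(1)
N, L = 4, 32
Abar = rng.uniform(0.1, 0.9, N)   # pre-discretized diagonal transition
Bbar = rng.standard_normal(N)
C = rng.standard_normal(N)
x = rng.standard_normal(L)

y_rec = rec_mode(Abar, Bbar, C, x)
y_conv = conv_mode(materialize_kernel(Abar, Bbar, C, L), x)
print(np.allclose(y_rec, y_conv))  # True
```

In practice the convolution is computed with FFTs (S4) or replaced by a parallel scan (Mamba), but the underlying identity is the one demonstrated here.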
SSMs evolved rapidly from theoretical foundations to production-scale models:
The Linear State Space Layer introduced the theoretical framework for applying continuous-time SSMs to deep learning, demonstrating viability on long-range sequence benchmarks.2)
Structured State Spaces for Sequences (S4) was the breakthrough model that made SSMs practical.3) It introduced the HiPPO initialization — a principled method for initializing the state matrix A to optimally memorize history via orthogonal polynomial projections — solving the vanishing/exploding gradient problem that plagued prior SSMs. S4 achieved state-of-the-art on the Long Range Arena benchmark.
Mamba introduced selective state spaces: input-dependent (data-dependent) SSM parameters (B, C, and the discretization step Δ), allowing the model to selectively retain or discard information based on content.4) This selectivity mechanism addressed the primary weakness of prior SSMs — inability to perform content-based reasoning — while retaining linear-time complexity. Mamba also introduced a hardware-aware parallel scan algorithm for efficient GPU execution.
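The selectivity mechanism can be sketched in a few lines. In this toy single-channel version (a deliberate simplification of Mamba's per-channel scan; the projection weights w_dt, w_B, w_C are hypothetical), the step size Δ and the matrices B and C are recomputed from each input token, so the state update can amplify or suppress tokens by content:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, L = 4, 32

A = -np.abs(rng.standard_normal(d_state)) - 0.5  # fixed diagonal state matrix
w_dt, b_dt = 0.5, 0.0                            # hypothetical Delta projection
w_B = rng.standard_normal(d_state)               # hypothetical B projection
w_C = rng.standard_normal(d_state)               # hypothetical C projection

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x):
    """Delta_t, B_t, C_t are functions of the current input x_t."""
    h = np.zeros(d_state)
    ys = []
    for xt in x:
        dt = softplus(w_dt * xt + b_dt)  # input-dependent step size, > 0
        Bt = w_B * xt                    # input-dependent input matrix
        Ct = w_C * xt                    # input-dependent output matrix
        Abar = np.exp(dt * A)            # per-step ZOH discretization
        Bbar = (Abar - 1.0) / A * Bt
        h = Abar * h + Bbar * xt         # selective state update
        ys.append(Ct @ h)
    return np.array(ys)

y = selective_scan(rng.standard_normal(L))
print(y.shape)  # (32,)
```

A small Δ_t makes Ā_t ≈ 1 and B̄_t ≈ 0, effectively skipping a token; a large Δ_t resets the state toward the current input. Because the parameters now vary per step, the fixed-kernel convolutional mode no longer applies, which is why Mamba relies on a parallel scan instead.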
Mamba-2 reformulated the selective SSM as a Structured State Space Duality (SSD), revealing a formal connection between SSMs and linear attention.5) This connection enabled more efficient tensor-parallel training and yielded 2–8× throughput improvements over Mamba on modern hardware.
The following table compares the two paradigms across key dimensions:
| Property | SSM (e.g., Mamba) | Transformer (e.g., GPT) |
|---|---|---|
| Training compute | O(n) per layer | O(n²) per layer |
| Inference memory | Constant (fixed state size) | Grows with context (KV cache) |
| Context representation | Lossy compression into state | Lossless access to all tokens |
| Long-context scaling | Efficient | Expensive |
| Exact token retrieval | Difficult | Native |
| Hardware utilization | Scan-based (custom kernels) | Matmul-dominated (BLAS) |
| Streaming inference | Native | Requires sliding window tricks |
Mamba and Mamba-2 are the foundational selective SSM models from Albert Gu and Tri Dao. Mamba-2 is the recommended baseline for new SSM-based projects due to its improved hardware efficiency and theoretical grounding in SSD.
Developed by AI21 Labs, Jamba is a hybrid Transformer–Mamba–MoE (Mixture of Experts) architecture supporting a 256K token context window. It interleaves Mamba layers with attention layers and MoE feed-forward blocks, demonstrating that hybrid architectures can outperform pure SSMs and pure Transformers on both quality and throughput.6)
RWKV is a family of architectures blending RNN and attention concepts, formulated as a linear recurrence. RWKV-7 (“Goose”) represents the latest iteration with improved expressivity and multilingual capabilities.7)
Griffin is DeepMind's hybrid architecture combining gated linear recurrences with local attention windows, deployed as RecurrentGemma.8) Hawk is the pure-recurrence variant. Both demonstrate competitive quality with Transformers at significantly reduced inference cost.
Bamba is IBM's open-source hybrid SSM model, reporting approximately 2× faster inference than comparable Transformer models while maintaining competitive benchmark performance.
The limitations of pure SSMs and the quadratic cost of pure Transformers have driven convergence toward hybrid architectures that interleave SSM layers with attention layers.
Key examples include Jamba (AI21), Griffin/RecurrentGemma (DeepMind), and Bamba (IBM). The emerging consensus in the research community is that the future of sequence modeling lies in hybrid SSM+attention designs: SSM layers handle long-range compression efficiently, while sparse attention layers handle precise retrieval and in-context reasoning.10)
Typical hybrid ratios range from 1 attention layer per 3–7 SSM layers, with the optimal ratio remaining an active research question.
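A hybrid layer stack of this kind is straightforward to express. The helper below is a hypothetical sketch (not any framework's API) showing a 1:3 attention-to-SSM interleaving:

```python
def hybrid_stack(n_layers, attn_every=4):
    """Interleave one attention layer per (attn_every - 1) SSM layers."""
    return ["attn" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

layers = hybrid_stack(12, attn_every=4)
print(layers.count("ssm"), layers.count("attn"))  # 9 3
```

Jamba, for example, uses a comparable interleaving of Mamba, attention, and MoE blocks; the right ratio for a given model size and task remains an open question, as noted above.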