Latent Reasoning

Latent reasoning refers to techniques that enable LLMs to reason in continuous latent space using hidden states as “continuous thoughts,” rather than generating explicit chain-of-thought (CoT) tokens. This approach allows models to encode multiple reasoning paths simultaneously for more efficient computation.

Motivation: Limitations of Verbal Reasoning

Explicit chain-of-thought reasoning has fundamental constraints:

- Every intermediate step must be serialized into discrete tokens, incurring a high token cost at inference time.
- Autoregressive decoding commits to a single reasoning path early, making backtracking or parallel exploration difficult.
- Discrete tokens must name specific entities and cannot represent a superposition of candidate states.

Latent reasoning addresses these by performing computation in the model's continuous hidden state space, where vectors can represent superpositions of multiple reasoning paths.

Coconut: Chain of Continuous Thought

Coconut (Hao et al., arXiv:2412.06769) is the foundational work on latent reasoning. The model switches between language mode and latent mode using special tokens: `<bot>` marks the beginning of a latent thought segment and `<eot>` marks its end.

In latent mode, the model feeds the last hidden state back as the input embedding for the next step, without decoding to tokens. The continuous thought at step $k+1$ is computed from the previous one:

$$h^{(k+1)} = f_\theta\!\left(h^{(k)}\right)$$

where $f_\theta$ is the transformer forward pass and $h^{(k)} \in \mathbb{R}^d$ is the continuous thought vector. This is fully differentiable, enabling training via backpropagation.

# Conceptual illustration of Coconut's latent reasoning
# (PyTorch sketch; layer sizes are illustrative)
import torch
import torch.nn as nn

class CoconutModel(nn.Module):
    def __init__(self, vocab_size, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, inputs):
        # Shared transformer trunk over a sequence of input embeddings
        hidden = inputs
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

    def reason(self, input_ids, n_latent_steps=3):
        inputs = self.embed(input_ids)          # (batch, seq, d_model)
        for _ in range(n_latent_steps):
            hidden = self.forward(inputs)
            thought = hidden[:, -1:, :]         # continuous thought h^(k)
            # Latent mode: feed h^(k) back as the next input embedding,
            # skipping token decoding entirely
            inputs = torch.cat([inputs, thought], dim=1)
        # Language mode: decode the answer from the enriched context
        hidden = self.forward(inputs)
        return self.lm_head(hidden[:, -1, :])   # next-token logits

Training curriculum: Start with full CoT examples, then progressively replace the first $k$ reasoning steps with $k \times c$ continuous thoughts ($c = 1$-$2$ thoughts per step). No direct supervision is applied to the thought vectors; the model learns to use them indirectly through the language-modeling loss on the remaining tokens.
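
This staging can be sketched as a small helper that builds the target sequence for each curriculum stage. The helper and token names (`<bot>`, `<eot>`, `<thought>` as a placeholder for a continuous-thought slot) are illustrative, not the paper's exact data format:

```python
def curriculum_stage(cot_steps, stage, c=2):
    """Build the training target for one curriculum stage.

    At stage s, the first s reasoning steps are replaced by s*c
    latent-thought slots; the remaining steps stay as verbal CoT.
    (Hypothetical helper; token names are illustrative.)
    """
    s = min(stage, len(cot_steps))
    latent = ["<thought>"] * (s * c)   # continuous-thought slots
    remaining = cot_steps[s:]          # still-verbal reasoning steps
    return ["<bot>"] + latent + ["<eot>"] + remaining

# Example: 3-step CoT, stage 1 replaces the first step with c=2 slots
steps = ["step1", "step2", "step3"]
print(curriculum_stage(steps, stage=1))
# → ['<bot>', '<thought>', '<thought>', '<eot>', 'step2', 'step3']
```

At the final stage (`stage=len(steps)`) the chain becomes fully latent, with no verbal reasoning steps left.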

Recurrent Depth (arXiv:2502.05171)

Geiping et al. introduce a recurrent depth architecture for latent reasoning that iterates a recurrent block to arbitrary depth at test-time. Given input representation $h_0$, the model applies $M$ iterations of a shared recurrent block:

$$h_m = g_\phi(h_{m-1}), \quad m = 1, \ldots, M$$

The output $h_M$ is then decoded. Key properties:

- The block $g_\phi$ is weight-shared across all $M$ iterations, so effective depth can grow without adding parameters.
- $M$ is chosen at inference time, decoupling the depth used in training from the depth used at test time.
- All intermediate computation stays in latent space; no reasoning tokens are emitted.

This creates a natural mechanism for test-time compute scaling: allocate more recurrent iterations for harder problems, fewer for easy ones.
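
A minimal numerical sketch of this iteration, using a fixed random affine map with a tanh nonlinearity as a stand-in for the learned block $g_\phi$ (all weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))  # stand-in weights

def g(h):
    # Stand-in for the recurrent block g_phi: affine map + tanh
    return np.tanh(W @ h)

def recurrent_depth(h0, M):
    # Apply the shared block M times: h_m = g_phi(h_{m-1})
    h = h0
    for _ in range(M):
        h = g(h)
    return h

h0 = rng.normal(size=d)
easy = recurrent_depth(h0, M=4)    # few iterations for an easy input
hard = recurrent_depth(h0, M=32)   # more test-time compute for a hard one
```

Because the block is weight-shared, increasing `M` adds compute but no parameters, which is exactly the test-time scaling knob described above.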

Key Properties of Latent Reasoning

| Aspect | Explicit CoT | Latent Reasoning |
|---|---|---|
| Representation | Discrete tokens | Continuous hidden states $h \in \mathbb{R}^d$ |
| Search style | Single path, autoregressive | Multi-path (BFS/tree-like) |
| Efficiency | High token cost, early commitment | Fewer tokens, parallel exploration |
| Interpretability | Human-readable | Opaque (requires probing) |
| Strong tasks | Math (GSM8K) | Logic, graph reachability |

Reasoning by Superposition

Analysis of latent thoughts (arXiv:2505.12514) reveals that continuous thought vectors can encode superpositions of multiple reasoning states simultaneously. For a graph reachability problem, a single latent vector $h$ can represent a set of reachable nodes $S \subset V$:

$$h \approx \sum_{v \in S} \alpha_v \, e_v$$

where $e_v$ are learned node embeddings and $\alpha_v$ are attention-derived weights. This effectively implements breadth-first search (BFS) in vector space.

This is impossible with discrete tokens, which must commit to naming specific nodes. The superposition property explains why latent reasoning excels at search-like tasks.
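
Under the simplifying assumption that the node embeddings $e_v$ are one-hot, a BFS frontier expansion can be sketched as a single matrix operation on the superposition vector (the graph and helper below are illustrative):

```python
import numpy as np

# Directed graph on 5 nodes: 0->1, 0->2, 1->3, 2->4
A = np.zeros((5, 5))
for u, v in [(0, 1), (0, 2), (1, 3), (2, 4)]:
    A[u, v] = 1.0

def bfs_step(h, A):
    # One frontier expansion in vector space: nodes one hop away from
    # the current superposition gain weight; clip to keep a set-like encoding
    return np.minimum(h + h @ A, 1.0)

h = np.eye(5)[0]                # superposition starts at node 0: h = e_0
for _ in range(2):
    h = bfs_step(h, A)

reachable = np.flatnonzero(h)   # → nodes {0, 1, 2, 3, 4}
```

A single vector `h` carries the whole frontier at once, whereas a token-based trace would have to name and expand one node per step.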

STILL-1 and STILL-2

The STILL (Slow Thinking with LLMs) models explore the boundary between explicit and latent reasoning.

These models demonstrate that a spectrum exists between fully explicit CoT and fully latent reasoning, with hybrid approaches often achieving the best efficiency-accuracy tradeoff.

Efficiency Gains

Latent reasoning offers concrete efficiency advantages.

On graph reachability tasks, Coconut outperforms CoT while using significantly fewer generation steps. On GSM8K math, CoT still edges ahead, suggesting latent reasoning is strongest for search and logic tasks.

Dual-Architecture Approaches

Recent work (2025-2026) explores dual-architecture latent reasoning, in which a fluent base model exchanges latent messages with a specialized coprocessor: the base model handles text generation, while the coprocessor performs additional computation in latent space and returns its result as continuous vectors rather than tokens.
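
A heavily simplified sketch of such an interface, assuming a hypothetical coprocessor that pools the base model's hidden states into $k$ latent-message vectors appended to the context (all names, shapes, and weights here are assumptions for illustration, not a published API):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 16, 4    # hidden size, number of latent messages

# Hypothetical coprocessor weights: one read-out matrix per message
W_coproc = rng.normal(scale=1 / np.sqrt(d), size=(k, d, d))

def coprocessor(hidden_states):
    """Map the base model's hidden states to k latent-message vectors
    that are never decoded to tokens (illustrative, not a real API)."""
    summary = hidden_states.mean(axis=0)    # pool the base context
    return np.tanh(W_coproc @ summary)      # (k, d) soft embeddings

def augmented_context(hidden_states):
    # The base model would attend over its own states plus the
    # coprocessor's latent messages appended as extra positions
    return np.concatenate([hidden_states, coprocessor(hidden_states)], axis=0)

states = rng.normal(size=(10, d))           # base model's hidden states
ctx = augmented_context(states)             # shape (10 + k, d)
```

The key design point is that the two modules communicate in $\mathbb{R}^d$ rather than through generated text, so the exchange is differentiable end to end.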
