====== Latent Reasoning ======

**Latent reasoning** refers to techniques that enable LLMs to reason in continuous latent space, using hidden states as "continuous thoughts" rather than generating explicit chain-of-thought (CoT) tokens. This approach allows models to encode multiple reasoning paths simultaneously, enabling more efficient computation.

===== Motivation: Limitations of Verbal Reasoning =====

Explicit chain-of-thought reasoning has fundamental constraints:

  * **Sequential commitment**: each generated token commits to a single reasoning path
  * **Token overhead**: reasoning expressed in natural language consumes output tokens
  * **Information bottleneck**: natural-language tokens carry less information than continuous vectors -- a hidden state $h \in \mathbb{R}^d$ can encode far more information than a discrete token drawn from vocabulary $\mathcal{V}$
  * **Hallucination risk**: early commitment to a path can propagate errors

Latent reasoning addresses these constraints by performing computation in the model's continuous hidden-state space, where vectors can represent superpositions of multiple reasoning paths.

===== Coconut: Chain of Continuous Thought =====

**Coconut** (Hao et al., arXiv:2412.06769) is the foundational work on latent reasoning. The model switches between language mode and latent mode using special tokens:

  * ''<bot>'' (beginning of thought): enter latent reasoning mode
  * ''<eot>'' (end of thought): return to language generation

In latent mode, the model feeds the last hidden state back as the input for the next step, without decoding it to a token. The continuous thought at step $k$ is computed as:

$$h^{(k+1)} = f_\theta\!\left(h^{(k)}\right)$$

where $f_\theta$ is the transformer forward pass and $h^{(k)} \in \mathbb{R}^d$ is the continuous thought vector. Because no discrete sampling occurs, the chain is fully differentiable, enabling training via backpropagation.
<code python>
# Conceptual illustration of Coconut's latent reasoning (pseudocode)
class CoconutModel:
    def apply_layers(self, hidden):
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

    def forward(self, input_ids):
        # Language mode: embed tokens, run the transformer stack
        return self.apply_layers(self.embed(input_ids))

    def forward_latent(self, hidden):
        # Latent mode: feed the hidden state straight back through the
        # layers -- no token decoding in between
        return self.apply_layers(hidden)

    def reason(self, prompt, n_latent_steps=3):
        hidden = self.forward(prompt)
        # Perform n steps of latent reasoning ("continuous thoughts")
        for _ in range(n_latent_steps):
            hidden = self.forward_latent(hidden)  # iterate in latent space
        # Decode the final answer from the enriched hidden state
        return self.lm_head(hidden)
</code>

**Training curriculum**: start with full CoT examples, then progressively replace $k$ reasoning sentences with $k \times c$ latent thoughts ($c = 1$ or $2$ per step). This curriculum trains the model to optimize its vector-based reasoning indirectly, since the latent steps receive no token-level supervision.

===== Recurrent Depth (arXiv:2502.05171) =====

Geiping et al. introduce a **recurrent-depth** architecture for latent reasoning that iterates a recurrent block to arbitrary depth at test time. Given an input representation $h_0$, the model applies $M$ iterations of a shared recurrent block:

$$h_m = g_\phi(h_{m-1}), \quad m = 1, \ldots, M$$

The output $h_M$ is then decoded. Key properties:

  * A single recurrent transformer block is applied repeatedly
  * Each iteration deepens the model's reasoning in latent space
  * The unrolling depth $M$ can be adjusted at inference time (more iterations = more compute = deeper reasoning)
  * No additional parameters are needed -- the same block is reused

This creates a natural mechanism for **test-time compute scaling**: allocate more recurrent iterations to harder problems and fewer to easy ones.
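The iteration $h_m = g_\phi(h_{m-1})$ can be sketched with a toy shared block. This is a minimal illustration, not the paper's architecture: the single weight matrix ''W'' stands in for a full recurrent transformer block, and the names ''block'' and ''recurrent_depth'' are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                          # hidden dimension
W = rng.standard_normal((d, d)) / np.sqrt(d)   # shared block weights

def block(h):
    """The single shared recurrent block g_phi, reused at every step."""
    return np.tanh(h @ W)

def recurrent_depth(h0, n_iters):
    """Apply the same block n_iters times: h_m = g_phi(h_{m-1})."""
    h = h0
    for _ in range(n_iters):
        h = block(h)
    return h

h0 = rng.standard_normal(d)
shallow = recurrent_depth(h0, n_iters=2)   # cheap: little latent compute
deep = recurrent_depth(h0, n_iters=32)     # expensive: more latent compute
```

Note that no new parameters appear as ''n_iters'' grows; only ''W'' is reused, which is what makes the depth a free test-time knob.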
===== Key Properties of Latent Reasoning =====

^ Aspect ^ Explicit CoT ^ Latent Reasoning ^
| Representation | Discrete tokens | Continuous hidden states $h \in \mathbb{R}^d$ |
| Search style | Single path, autoregressive | Multi-path (BFS/tree-like) |
| Efficiency | High token cost, early commitment | Fewer tokens, parallel exploration |
| Interpretability | Human-readable | Opaque (requires probing) |
| Strong tasks | Math (GSM8K) | Logic, graph reachability |

===== Reasoning by Superposition =====

Analysis of latent thoughts (arXiv:2505.12514) reveals that continuous thought vectors can encode **superpositions** of multiple reasoning states simultaneously. For a graph reachability problem, a single latent vector $h$ can represent a set of reachable nodes $S \subset V$:

$$h \approx \sum_{v \in S} \alpha_v \, e_v$$

where the $e_v$ are learned node embeddings and the $\alpha_v$ are attention-derived weights. This effectively implements breadth-first search (BFS) in vector space. Discrete tokens cannot do this: each token must commit to naming a specific node. The superposition property explains why latent reasoning excels at search-like tasks.

===== STILL-1 and STILL-2 =====

The **STILL** (Slow Thinking with LLMs) models explore the boundary between explicit and latent reasoning:

  * **STILL-1**: trains models to internalize reasoning steps, reducing the number of explicit CoT tokens while maintaining accuracy
  * **STILL-2**: extends this with adaptive depth -- the model learns when to think explicitly vs. latently based on problem difficulty

These models demonstrate that a spectrum exists between fully explicit CoT and fully latent reasoning, with hybrid approaches often achieving the best efficiency-accuracy tradeoff.
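The superposition formula from the Reasoning by Superposition section above can be made concrete with a toy BFS-in-vector-space sketch. Everything here is an illustrative assumption rather than the paper's construction: random node embeddings stand in for learned $e_v$, the weights $\alpha_v$ are fixed to 1, and the frontier is expanded with an explicit adjacency matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 1024                     # 6 nodes, high-dim embeddings so that
E = rng.standard_normal((n, d))    # random e_v are nearly orthogonal

# Adjacency for a small directed graph: 0->1, 0->2, 1->3, 2->4, 3->5
A = np.zeros((n, n))
for u, v in [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5)]:
    A[u, v] = 1.0

def frontier_vector(S):
    """Encode a node set S as one superposition h = sum_{v in S} e_v."""
    return E[sorted(S)].sum(axis=0)

def decode_set(h, threshold=0.5):
    """Read the set back out by correlating h with each node embedding."""
    scores = E @ h / d             # ~1 for members, ~0 otherwise
    return {v for v in range(n) if scores[v] > threshold}

def bfs_step(h):
    """One latent step: expand the encoded frontier by one edge hop."""
    S = decode_set(h)
    successors = {v for u in S for v in range(n) if A[u, v]}
    return frontier_vector(S | successors)

h = frontier_vector({0})           # start: only node 0 reached
for _ in range(3):                 # three latent "steps"
    h = bfs_step(h)
```

The point of the sketch is that a single vector ''h'' tracks //all// frontier nodes at once, whereas a token sequence would have to name them one at a time.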
===== Efficiency Gains =====

Latent reasoning offers concrete efficiency advantages:

  * **Reduced token generation**: replace $N$ reasoning tokens with $K \ll N$ latent iterations
  * **Reduced hallucination**: postpone token commitment until reasoning is more complete
  * **Parallelism**: latent iterations can be parallelized more easily than autoregressive decoding
  * **Information density**: each latent vector encodes more information than a single token

On graph reachability tasks, Coconut outperforms CoT while using significantly fewer generation steps. On GSM8K math, CoT still edges ahead, suggesting latent reasoning is strongest on search and logic tasks.

===== Dual-Architecture Approaches =====

Recent work (2025-2026) explores dual-architecture latent reasoning, in which a fluent base model exchanges latent messages with a specialized coprocessor:

  * Latent messages increase the communication-channel capacity between modules
  * Joint fine-tuning of both components improves latent communication
  * The split enables a separation of "System 1" (fast, intuitive) and "System 2" (slow, deliberate) reasoning

===== References =====

  * [[https://arxiv.org/abs/2502.05171|arXiv:2502.05171 - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach]]
  * [[https://arxiv.org/abs/2412.06769|arXiv:2412.06769 - Training Large Language Models to Reason in a Continuous Latent Space (Coconut)]]
  * [[https://arxiv.org/abs/2505.12514|arXiv:2505.12514 - Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought]]

===== See Also =====

  * [[test_time_compute_scaling|Test-Time Compute Scaling]] - broader context for inference-time compute allocation
  * [[process_reward_models|Process Reward Models]] - verifiers for explicit reasoning steps
  * [[world_models_for_agents|World Models for Agents]] - internal simulation for planning