latent_reasoning [2026/03/24 17:07] – Create page: Latent Reasoning with researched content (agent)

latent_reasoning [2026/03/24 17:45] (current) – Add LaTeX math formatting for continuous thought formulation, recurrent depth, superposition encoding (agent)
  * **Sequential commitment**: Autoregressive decoding commits to one token -- and hence one reasoning path -- at every step
  * **Token overhead**: Reasoning expressed in natural language consumes output tokens
  * **Information bottleneck**: A single discrete token carries far less information than a continuous hidden state
  * **Hallucination risk**: Early commitment to a path can propagate errors
  * ''<bot>'' and ''<eot>'' special tokens delimit the latent reasoning segment
In latent mode, the model reuses the last hidden state as input for the next step without decoding to tokens. The continuous thought at step $k$ is computed as:

$$h^{(k+1)} = f_\theta\!\left(h^{(k)}\right)$$

where $f_\theta$ is the transformer forward pass and $h^{(k)} \in \mathbb{R}^d$ is the continuous thought vector. This is fully differentiable, so the whole chain of continuous thoughts can be optimized end to end with backpropagation.
<code python>
# Illustrative sketch of the latent-mode forward loop (assumes a
# HuggingFace-style model that accepts inputs_embeds; not the
# reference implementation).
import torch

def continuous_thoughts(model, inputs_embeds, n_thoughts):
    """Feed the last hidden state back as the next input embedding
    instead of decoding a token: h^(k+1) = f_theta(h^(k))."""
    for _ in range(n_thoughts):
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
        h_last = out.hidden_states[-1][:, -1:, :]  # final layer, last position
        inputs_embeds = torch.cat([inputs_embeds, h_last], dim=1)
    return inputs_embeds  # differentiable end to end
</code>
**Training curriculum**: training proceeds in stages; at stage $k$, the first $k$ language reasoning steps are replaced by continuous thoughts, gradually weaning the model off explicit CoT supervision.
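The staged replacement can be sketched as a toy data-construction helper (the names ''curriculum_example'' and ''<thought>'' are illustrative assumptions, not the actual training code):

<code python>
def curriculum_example(question, cot_steps, answer, stage, thoughts_per_step=1):
    """Stage k: replace the first k chain-of-thought steps with latent
    thought placeholders; the remaining steps stay as supervised text."""
    n_replaced = min(stage, len(cot_steps))
    latent = ["<thought>"] * (n_replaced * thoughts_per_step)  # filled with hidden states at run time
    remaining = cot_steps[n_replaced:]
    return [question, "<bot>", *latent, "<eot>", *remaining, answer]

ex = curriculum_example("Q", ["s1", "s2", "s3"], "A", stage=2)
print(ex)  # ['Q', '<bot>', '<thought>', '<thought>', '<eot>', 's3', 'A']
</code>

At stage 0 the example is ordinary CoT data; once the stage exceeds the number of steps, all reasoning happens in latent placeholders.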
===== Recurrent Depth (Geiping et al.) =====
Geiping et al. introduce a **recurrent depth** architecture for latent reasoning that iterates a recurrent block to arbitrary depth at test time. Given input representation $h_0$, the model applies $M$ iterations of a shared recurrent block:

$$h_m = g_\phi(h_{m-1}), \qquad m = 1, \dots, M$$

The output $h_M$ is then decoded. Key properties:
  * A single recurrent transformer block is applied repeatedly
  * Each iteration deepens the model's effective computation without lengthening the token sequence
  * Unrolling depth $M$ can be adjusted at inference time (more iterations = more compute = better reasoning)
  * No additional parameters needed -- the same block is reused
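The adjustable-depth idea can be sketched with a weight-tied toy block (a minimal numpy illustration, not the authors' architecture -- the block here is just a residual tanh layer standing in for $g_\phi$):

<code python>
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.3, size=(d, d))  # the single shared block's weights

def recurrent_block(h):
    # g_phi: one weight-tied sub-network (toy stand-in: residual tanh layer)
    return h + np.tanh(W @ h)

def latent_reason(h0, M):
    # h_m = g_phi(h_{m-1}) for m = 1..M; M is chosen at inference time
    h = h0
    for _ in range(M):
        h = recurrent_block(h)
    return h

h0 = rng.normal(size=d)
shallow = latent_reason(h0, M=2)   # little test-time compute
deep = latent_reason(h0, M=32)     # 16x more compute, zero extra parameters
</code>

Note that ''shallow'' and ''deep'' are produced by exactly the same weights ''W''; only the unrolling depth differs.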
^ Aspect ^ Explicit CoT ^ Latent Reasoning ^
| Representation | Discrete tokens | Continuous hidden states $h \in \mathbb{R}^d$ |
| Search style | Single path, autoregressive | Multi-path (BFS-like) |
| Efficiency | High token cost, early commitment | Fewer tokens, parallel exploration |
===== Reasoning by Superposition =====
Analysis of latent thoughts shows that a single continuous thought can encode multiple frontier nodes of a search simultaneously, as a superposition:

$$h \approx \sum_{v \in S} \alpha_v \, e_v$$

where $e_v$ are learned node embeddings and $\alpha_v$ are attention-derived weights. This effectively lets one hidden state track every node in the frontier set $S$ in parallel, implementing a breadth-first search.
This is impossible with discrete tokens, which must commit to naming specific nodes. The superposition property explains why latent reasoning excels at search-like tasks.
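A minimal numeric sketch of this superposition (an assumed toy setup with orthonormal node embeddings, not the paper's probing method): one vector stores a weighted sum of frontier embeddings, and projecting it back onto the embedding table recovers the whole frontier at once:

<code python>
import numpy as np

rng = np.random.default_rng(1)
n_nodes, d = 8, 64
# e_v: orthonormal node embeddings (rows of E), so the readout is exact
E = np.linalg.qr(rng.normal(size=(d, n_nodes)))[0].T

frontier = {1, 4, 6}                        # S: nodes tracked in parallel
alpha = {1: 0.5, 4: 0.3, 6: 0.2}            # alpha_v: per-node weights
h = sum(alpha[v] * E[v] for v in frontier)  # h = sum_v alpha_v e_v -- one vector

scores = E @ h                              # project h onto every node embedding
decoded = set(np.argsort(scores)[-len(frontier):].tolist())
print(decoded)  # the full frontier {1, 4, 6} is recovered from a single state
</code>

A token, by contrast, would have named exactly one of the three nodes and discarded the other two.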
Latent reasoning offers concrete efficiency advantages:
  * **Reduced token generation**: A few continuous thoughts replace long decoded reasoning chains, cutting output-token cost
  * **Reduced hallucination**: Deferring commitment to a single verbalized path limits error propagation
  * **Parallelism**: Superposed hidden states explore multiple candidate paths simultaneously