====== Quiet-STaR ======

**Quiet-STaR** (Quiet Self-Taught Reasoner) is a training method introduced by Zelikman et al. (2024) that teaches language models to generate internal "thinking" tokens at every position in a sequence before predicting the next token. Unlike standard chain-of-thought, which requires explicit prompting, Quiet-STaR enables models to learn //implicit reasoning// from general web text, improving downstream reasoning without task-specific fine-tuning.

===== Background: From STaR to Quiet-STaR =====

**STaR** (Self-Taught Reasoner, Zelikman et al. 2022) bootstraps reasoning by generating rationales for question-answer examples and reinforcing those that lead to correct answers. However, STaR is limited to curated question-answer datasets. Quiet-STaR generalizes this idea to //arbitrary text//: the model learns that some tokens are hard to predict and benefit from internal deliberation. The key insight is that useful reasoning occurs implicitly in natural text: a Wikipedia article about physics implicitly requires understanding of equations, and a story implicitly requires theory of mind.

===== Architecture =====

Quiet-STaR augments a standard autoregressive LM with:

  * **Learnable boundary tokens**: special ''<|startofthought|>'' and ''<|endofthought|>'' tokens that delimit internal rationale sequences
  * **Mixing head**: a learned interpolation weight between post-thought predictions and base predictions, easing the distribution shift during early training (see the sketch after this list)
  * **Tokenwise parallel generation**: at each position $j$, the model generates $K$ thought tokens, then predicts the next $N$ real tokens
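The mixing head is usually described as a shallow network that outputs an interpolation weight. The sketch below is a minimal PyTorch version, assuming the head reads the hidden states with and without the thought; the class name ''MixingHead'', the layer sizes, and the exact inputs are illustrative assumptions, not the released Quiet-STaR code.

<code python>
# Minimal sketch of a mixing head, assuming a PyTorch backbone.
# Names, layer sizes, and inputs are illustrative assumptions.
import torch
import torch.nn as nn

class MixingHead(nn.Module):
    """Shallow MLP producing an interpolation weight alpha in [0, 1]."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, h_base: torch.Tensor, h_thought: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps alpha in [0, 1]; while thoughts are still low quality,
        # alpha near 0 lets the base prediction dominate.
        return torch.sigmoid(self.mlp(torch.cat([h_base, h_thought], dim=-1)))

def mix_logits(logits_with_thought, logits_base, alpha):
    """Interpolate post-thought and base next-token logits with weight alpha."""
    return alpha * logits_with_thought + (1 - alpha) * logits_base
</code>

Because the weight can start near zero, early low-quality thoughts cannot degrade the base model's predictions, which is exactly the distribution-shift concern the mixing head is meant to address.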
=== The Think-Talk-Learn Loop ===

Training proceeds in three phases applied during continued pretraining:

== 1. Think (Parallel Rationale Generation) ==

For each token $x_j$ in the input, sample a thought sequence $c_j = (c_{j1}, \ldots, c_{jK})$ in parallel across all $n$ positions. This uses efficient batched generation with ''<|startofthought|>'' and ''<|endofthought|>'' delimiters.

== 2. Talk (Mixing Predictions) ==

Blend the post-thought logits with the base LM logits using the learned mixing weight. This prevents catastrophic forgetting early in training, when thoughts are still low quality.

== 3. Learn (REINFORCE Optimization) ==

Optimize rationale quality using a REINFORCE-based objective. The reward for a thought at position $j$ is:

$$\mathcal{R}_j = \log P(x_{j+1:j+N} \mid x_{1:j}, c_j) - \log P(x_{j+1:j+N} \mid x_{1:j})$$

This measures whether the thought //improved// prediction of the next $N$ tokens compared to the base model. The REINFORCE term is combined with a standard NLL loss $\mathcal{L}_j^{\text{NLL}}$ for stability.

The conceptual training step below ties the three phases together. It is written sequentially for clarity (the actual method batches thought generation across positions), and helpers such as ''model.generate_thought'', ''log_likelihood'', and ''compute_reinforce_loss'' are illustrative placeholders.

<code python>
# Conceptual Quiet-STaR training step (sequential for clarity; the real
# implementation generates thoughts at all positions in parallel).
def quiet_star_step(model, input_tokens, n_thought_tokens=8, lookahead=4):
    all_thoughts, rewards, nll_terms = [], [], []
    # Stop early enough that a full lookahead window of real tokens remains.
    for j in range(len(input_tokens) - lookahead):
        # THINK: generate an internal rationale at position j
        thought = model.generate_thought(
            prefix=input_tokens[:j + 1],
            start_token="<|startofthought|>",
            end_token="<|endofthought|>",
            max_tokens=n_thought_tokens,
        )

        # TALK: compute predictions with and without the thought
        logits_with_thought = model.predict(input_tokens[:j + 1] + thought)
        logits_base = model.predict(input_tokens[:j + 1])

        # Mix the two predictions using the learned interpolation weight
        alpha = model.mixing_head(logits_with_thought, logits_base)
        mixed_logits = alpha * logits_with_thought + (1 - alpha) * logits_base

        # LEARN: reward = how much the thought improved future-token prediction
        future_tokens = input_tokens[j + 1 : j + 1 + lookahead]
        log_p_with = log_likelihood(mixed_logits, future_tokens)
        log_p_base = log_likelihood(logits_base, future_tokens)
        rewards.append(log_p_with - log_p_base)
        all_thoughts.append(thought)
        nll_terms.append(nll_loss(mixed_logits, future_tokens))

    # REINFORCE loss on the thoughts plus the NLL loss accumulated per position
    return compute_reinforce_loss(all_thoughts, rewards) + sum(nll_terms)
</code>

===== Key Technical Details =====

  * **Thought length**: typically $K = 8$ tokens (including the start/end delimiters); longer thoughts yield better results but increase compute
  * **Lookahead window**: $N = 4$ future tokens used for reward computation
  * **Base model**: applied to Mistral 7B with continued pretraining on OpenWebMath and C4
  * **Inference masking**: at test time, thought tokens must be masked out of the visible output, since the model is not trained to suppress them (a sketch appears below, after the mathematical formulation)

===== Results =====

Zero-shot improvements on Mistral 7B without any task-specific training:

^ Benchmark ^ Base Mistral 7B ^ Quiet-STaR (OpenWebMath) ^
| GSM8K | 5.9% | **10.9%** |
| CommonsenseQA | 36.3% | **47.2%** |

Training on C4 (general web text) alone still helps: GSM8K improves from 5.9% to 8.1%, and CommonsenseQA from 36.3% to 42.6%. Quiet-STaR disproportionately improves prediction of //hard// tokens (those with high base perplexity), confirming that internal reasoning helps most where it is needed.

===== Mathematical Formulation =====

The full Quiet-STaR objective combines REINFORCE and language modeling:

$$\mathcal{L} = \sum_{j} \left[ -\mathcal{R}_j \cdot \log P_\theta(c_j \mid x_{1:j}) + \lambda \cdot \mathcal{L}_j^{\text{NLL}} \right]$$

where $c_j$ is the thought at position $j$, $\mathcal{R}_j$ is the reward (improvement in future-token prediction), and $\lambda$ balances the NLL regularization.

<code>
graph TB
    A[Input Token x_j] --> B["Generate Thought c_j"]
    B --> C["Predict with Thought"]
    A --> D["Predict without Thought (Base)"]
    C --> E["Mixing Head (alpha)"]
    D --> E
    E --> F["Mixed Prediction"]
    F --> G["REINFORCE Reward: log P(future|thought) - log P(future|base)"]
    G --> H["Update Weights"]
</code>
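Read literally, the objective gives a simple per-position loss. The following is a minimal PyTorch sketch of that loss for a single position, assuming one sampled thought per position; detaching $\mathcal{R}_j$ so it acts as a constant weight is the standard REINFORCE reading, and all names and tensor shapes are illustrative assumptions rather than the paper's implementation.

<code python>
# Minimal sketch of the combined Quiet-STaR loss at one position j,
# assuming PyTorch. Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def _seq_logp(logits, targets):
    """Sum of log-probabilities of `targets` (shape [T]) under `logits` ([T, V])."""
    return F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).sum()

def quiet_star_loss_at_j(thought_logits, thought_tokens,
                         mixed_logits, base_logits, future_tokens, lam=1.0):
    # log P_theta(c_j | x_{1:j}): log-prob of the sampled thought tokens
    logp_c = _seq_logp(thought_logits, thought_tokens)

    # R_j = log P(future | x, c_j) - log P(future | x), detached so it is
    # treated as a constant weight on the REINFORCE term
    with torch.no_grad():
        reward = (_seq_logp(mixed_logits, future_tokens)
                  - _seq_logp(base_logits, future_tokens))

    # NLL regularizer on the mixed prediction of the true future tokens
    nll = -_seq_logp(mixed_logits, future_tokens)

    # -R_j * log P(c_j) + lambda * L_NLL, matching the displayed objective
    return -reward * logp_c + lam * nll
</code>

Summing this quantity over positions $j$ recovers the displayed objective; thoughts that help future-token prediction ($\mathcal{R}_j > 0$) are made more likely, and harmful ones less likely.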
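As noted under Key Technical Details, the thought spans must be stripped from generated text at inference time. Below is a minimal sketch of that masking step, assuming the delimiter token IDs are available from the tokenizer; the helper name ''strip_thoughts'' is hypothetical.

<code python>
# Hypothetical post-processing helper: removes internal thought spans from a
# generated token-ID sequence before detokenization. Delimiter IDs are assumed
# to be known from the tokenizer.
def strip_thoughts(token_ids, start_id, end_id):
    visible, in_thought = [], False
    for tok in token_ids:
        if tok == start_id:
            in_thought = True           # entering an internal rationale
        elif tok == end_id:
            in_thought = False          # rationale finished; resume real output
        elif not in_thought:
            visible.append(tok)         # only non-thought tokens reach the user
    return visible

# Example: IDs 101/102 stand in for <|startofthought|>/<|endofthought|>
assert strip_thoughts([5, 101, 7, 8, 102, 6], 101, 102) == [5, 6]
</code>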
===== Relation to Later Work =====

Quiet-STaR is a conceptual precursor to the "thinking" paradigm seen in models like OpenAI o1 and DeepSeek-R1, which also generate internal reasoning before responding. Fast Quiet-STaR (Huang et al. 2025) extends the approach with curriculum learning to reduce thought tokens while maintaining gains.

===== References =====

  * [[https://arxiv.org/abs/2403.09629|Zelikman et al. "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking" (2024). arXiv:2403.09629]]
  * [[https://arxiv.org/abs/2203.14465|Zelikman et al. "STaR: Bootstrapping Reasoning With Reasoning" (2022). arXiv:2203.14465]]
  * [[https://github.com/ezelikman/quiet-star|Official Quiet-STaR implementation (GitHub)]]
  * [[https://arxiv.org/abs/2505.17746|Huang et al. "Fast Quiet-STaR: Thinking Without Thought Tokens" (2025). arXiv:2505.17746]]

===== See Also =====

  * [[chain_of_thought|Chain of Thought]]
  * [[self_taught_reasoner|STaR (Self-Taught Reasoner)]]
  * [[reasoning_tokens|Reasoning Tokens]]
  * [[reinforcement_learning_from_human_feedback|RLHF]]