Quiet-STaR

Quiet-STaR (Quiet Self-Taught Reasoner) is a training method introduced by Zelikman et al. (2024) that teaches language models to generate internal “thinking” tokens at every position in a sequence before predicting the next token. Unlike standard chain-of-thought which requires explicit prompting, Quiet-STaR enables models to learn implicit reasoning from general web text, improving downstream reasoning without task-specific fine-tuning.

Background: From STaR to Quiet-STaR

STaR (Self-Taught Reasoner, Zelikman et al. 2022) bootstraps reasoning by generating rationales for QA examples and reinforcing those that lead to correct answers. However, STaR is limited to curated question-answer datasets.
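The bootstrap loop can be sketched as follows. Here `generate_rationale` and `finetune` are hypothetical stand-ins for the model's sampling and training routines, not an API from the original paper:

```python
def star_bootstrap(generate_rationale, finetune, qa_pairs, n_iterations=3):
    # generate_rationale(question) -> (rationale, predicted_answer)
    kept = []
    for _ in range(n_iterations):
        kept = []
        for question, answer in qa_pairs:
            rationale, predicted = generate_rationale(question)
            # Keep only rationales whose final answer matches the label.
            if predicted == answer:
                kept.append((question, rationale, answer))
        finetune(kept)  # fine-tune on the filtered rationales, then repeat
    return kept
```

The filtering step is what makes this self-taught: no human-written rationales are needed, only answer labels to verify against.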

Quiet-STaR generalizes this idea to arbitrary text: the model learns that some tokens are hard to predict and benefit from internal deliberation. The key insight is that useful reasoning occurs implicitly in natural text — a Wikipedia article about physics implicitly requires understanding of equations, and a story implicitly requires theory of mind.

Architecture

Quiet-STaR augments a standard autoregressive LM with:

- learned <|startofthought|> and <|endofthought|> delimiter tokens that mark internal rationales;
- a learned mixing head that blends post-thought and base predictions;
- a REINFORCE-based objective that rewards thoughts for improving future-token prediction.

The Think-Talk-Learn Loop

Training proceeds in three phases applied during continued pretraining:

1. Think (Parallel Rationale Generation)

For each token $x_j$ in the input, sample a thought sequence $c_j = (c_{j1}, \ldots, c_{jt})$ in parallel across all $n$ positions. This uses efficient batched generation with <|startofthought|> and <|endofthought|> delimiters.
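A toy sketch of how every position spawns its own delimited generation prefix, so all thoughts can be sampled as one batch. `sample_continuation` is a hypothetical stand-in for batched LM sampling:

```python
START, END = "<|startofthought|>", "<|endofthought|>"

def build_thought_batch(tokens):
    # One generation prefix per position: x_1..x_j followed by <|startofthought|>
    return [tokens[: j + 1] + [START] for j in range(len(tokens))]

def close_thoughts(prefixes, sample_continuation, max_thought_tokens=8):
    thoughts = []
    for prefix in prefixes:  # in practice this runs as a single batched forward pass
        body = sample_continuation(prefix, max_thought_tokens)
        thoughts.append([START] + body + [END])
    return thoughts
```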

2. Talk (Mixing Predictions)

Blend the post-thought logits with the base LM logits using a learned mixing weight. This prevents catastrophic forgetting early in training when thoughts are still low quality.
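A minimal sketch of the mixing step, assuming a single learned scalar gate passed through a sigmoid (the paper's mixing head is a small learned network over hidden states; this is a simplification):

```python
import math

def mix_logits(logits_with_thought, logits_base, w):
    # Learned scalar w, squashed by a sigmoid so the weight stays in (0, 1)
    alpha = 1.0 / (1.0 + math.exp(-w))
    return [alpha * lt + (1.0 - alpha) * lb
            for lt, lb in zip(logits_with_thought, logits_base)]
```

With `w` initialized very negative, `alpha` is near zero and the mixed prediction is essentially the base LM's, which is how early training avoids being derailed by low-quality thoughts.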

3. Learn (REINFORCE Optimization)

Optimize rationale quality using a REINFORCE-based objective. The reward for a thought at position $j$ is:

$$\mathcal{R}_j = \log P(x_{j+1:j+N} \mid x_{1:j}, c_j) - \log P(x_{j+1:j+N} \mid x_{1:j})$$

This measures whether the thought improved prediction of the next $N$ tokens compared to the base model. It is combined with a standard NLL loss $\mathcal{L}_j^{\text{NLL}}$ for stability.
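A worked example of this reward, summing token log-probabilities over the $N$-token lookahead window:

```python
import math

def thought_reward(p_future_with_thought, p_future_base):
    # Sum of token log-probs = log-likelihood of the lookahead window
    log_p_with = sum(math.log(p) for p in p_future_with_thought)
    log_p_base = sum(math.log(p) for p in p_future_base)
    return log_p_with - log_p_base  # positive iff the thought helped

# Toy numbers: the thought raises per-token probability from 0.25 to 0.5
# on a 2-token window, so the reward is log(0.25) - log(0.0625) = 2*log(2).
reward = thought_reward([0.5, 0.5], [0.25, 0.25])
```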

# Conceptual Quiet-STaR training step (pseudocode; model.generate_thought,
# model.predict, model.mixing_head, log_likelihood, nll_loss, and
# compute_reinforce_loss are schematic, not a real API)
def quiet_star_step(model, input_tokens, n_thought_tokens=8, lookahead=4):
    all_thoughts, rewards, nll_losses = [], [], []

    for j in range(len(input_tokens) - lookahead):
        # THINK: generate an internal rationale at position j
        thought = model.generate_thought(
            prefix=input_tokens[:j + 1],
            start_token="<|startofthought|>",
            end_token="<|endofthought|>",
            max_tokens=n_thought_tokens,
        )

        # TALK: compute predictions with and without the thought
        logits_with_thought = model.predict(input_tokens[:j + 1] + thought)
        logits_base = model.predict(input_tokens[:j + 1])

        # Mix predictions using the learned weight
        alpha = model.mixing_head(logits_with_thought, logits_base)
        mixed_logits = alpha * logits_with_thought + (1 - alpha) * logits_base

        # LEARN: compute the REINFORCE reward over the lookahead window
        future_tokens = input_tokens[j + 1 : j + 1 + lookahead]
        log_p_with = log_likelihood(mixed_logits, future_tokens)
        log_p_base = log_likelihood(logits_base, future_tokens)
        rewards.append(log_p_with - log_p_base)

        all_thoughts.append(thought)
        nll_losses.append(nll_loss(mixed_logits, future_tokens))

    # REINFORCE loss + NLL loss summed over all positions
    # (not just the final loop iteration's logits)
    return compute_reinforce_loss(all_thoughts, rewards) + sum(nll_losses)

Results

Zero-shot improvements on Mistral 7B without any task-specific training:

| Benchmark | Base Mistral 7B | Quiet-STaR (OpenWebMath) |
|---|---|---|
| GSM8K | 5.9% | 10.9% |
| CommonsenseQA | 36.3% | 47.2% |

Training on C4 (general web text) alone yields smaller but still consistent gains: GSM8K 5.9% → 8.1%, CommonsenseQA 36.3% → 42.6%.

Quiet-STaR disproportionately improves prediction of hard tokens — those with high base perplexity — confirming that internal reasoning helps most where it is needed.

Mathematical Formulation

The full Quiet-STaR objective combines REINFORCE and language modeling:

$$\mathcal{L} = \sum_{j} \left[ -\mathcal{R}_j \cdot \log P_\theta(c_j \mid x_{1:j}) + \lambda \cdot \mathcal{L}_j^{\text{NLL}} \right]$$

where $c_j$ is the thought at position $j$, $\mathcal{R}_j$ is the reward (improvement in future token prediction), and $\lambda$ balances the NLL regularization.
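The objective can be sketched numerically. The rewards are treated as constants in the REINFORCE term (in a real implementation they would be detached from the computation graph); `lam` is the $\lambda$ weight:

```python
def quiet_star_loss(rewards, thought_log_probs, nll_losses, lam=1.0):
    # rewards[j] = R_j (treated as a constant / stop-gradient)
    # thought_log_probs[j] = log P_theta(c_j | x_{1:j})
    reinforce = sum(-r * lp for r, lp in zip(rewards, thought_log_probs))
    return reinforce + lam * sum(nll_losses)
```

Positive rewards increase the likelihood of the thoughts that earned them; negative rewards suppress unhelpful thoughts; the NLL term keeps the model a good plain language model throughout.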

graph TB
    A[Input Token x_j] --> B["Generate Thought c_j"]
    B --> C["Predict with Thought"]
    A --> D["Predict without Thought (Base)"]
    C --> E["Mixing Head (alpha)"]
    D --> E
    E --> F["Mixed Prediction"]
    F --> G["REINFORCE Reward: log P(future|thought) - log P(future|base)"]
    G --> H["Update Weights"]

Relation to Later Work

Quiet-STaR is a conceptual precursor to the “thinking” paradigm seen in models like OpenAI o1 and DeepSeek-R1, which also generate internal reasoning before responding. Fast Quiet-STaR (Huang et al. 2025) extends the approach with curriculum learning to reduce thought tokens while maintaining gains.
