Quiet-STaR (Quiet Self-Taught Reasoner) is a training method introduced by Zelikman et al. (2024) that teaches language models to generate internal “thinking” tokens at every position in a sequence before predicting the next token. Unlike standard chain-of-thought which requires explicit prompting, Quiet-STaR enables models to learn implicit reasoning from general web text, improving downstream reasoning without task-specific fine-tuning.
STaR (Self-Taught Reasoner, Zelikman et al. 2022) bootstraps reasoning by generating rationales for QA examples and reinforcing those that lead to correct answers. However, STaR is limited to curated question-answer datasets.
Quiet-STaR generalizes this idea to arbitrary text: the model learns that some tokens are hard to predict and benefit from internal deliberation. The key insight is that useful reasoning occurs implicitly in natural text — a Wikipedia article about physics implicitly requires understanding of equations, and a story implicitly requires theory of mind.
Quiet-STaR augments a standard autoregressive LM with:
<|startofthought|> and <|endofthought|> tokens that delimit internal rationale sequences, plus a learned head that mixes post-thought and base predictions.

Training proceeds in three phases applied during continued pretraining:
**Think.** For each token $x_j$ in the input, sample a thought sequence $c_j = (c_{j1}, \ldots, c_{jt})$; generation is batched efficiently in parallel across all $n$ positions, with <|startofthought|> and <|endofthought|> delimiting each thought.
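The per-position setup can be sketched with plain token lists (a toy illustration; `thought_prefixes` and the example tokens are hypothetical, not the paper's implementation):

```python
# Toy sketch: build the generation prefixes used for parallel thought
# sampling. Each position's thought is conditioned on the prefix up to
# that position followed by the start-of-thought delimiter.
SOT = "<|startofthought|>"

def thought_prefixes(tokens):
    """Return one generation prefix per position; in practice all n
    prefixes are batched and decoded in parallel."""
    return [tokens[: j + 1] + [SOT] for j in range(len(tokens))]

prefixes = thought_prefixes(["The", "cat", "sat"])
# prefixes[1] == ["The", "cat", "<|startofthought|>"]
```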
**Talk.** Blend the post-thought logits with the base LM logits using a learned mixing weight. This prevents catastrophic forgetting early in training, when thoughts are still low quality.
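A minimal stand-in for the mixing step, assuming a scalar gate computed from the two logit vectors (the actual mixing head is a small learned network; `mix_logits`, `w`, and `b` here are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix_logits(logits_thought, logits_base, w, b):
    """Hypothetical scalar mixing head: a gate alpha in (0, 1)
    interpolates between post-thought and base logits. A strongly
    negative learned bias b keeps alpha near 0 early in training,
    so the model falls back on its base predictions."""
    # Gate from the mean absolute difference between the two logit vectors
    diff = sum(abs(t - s) for t, s in zip(logits_thought, logits_base))
    diff /= len(logits_base)
    alpha = sigmoid(w * diff + b)
    mixed = [alpha * t + (1 - alpha) * s
             for t, s in zip(logits_thought, logits_base)]
    return alpha, mixed
```

With `w=0, b=0` the gate sits at 0.5 and the output is the average of the two logit vectors; pushing `b` negative recovers the base model, which is the behavior that protects against forgetting.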
**Learn.** Optimize rationale quality using a REINFORCE-based objective. The reward for the thought at position $j$ is:
$$\mathcal{R}_j = \log P(x_{j+1:j+N} \mid x_{1:j}, c_j) - \log P(x_{j+1:j+N} \mid x_{1:j})$$
This measures whether the thought improved prediction of the next $N$ tokens compared to the base model. The reward is combined with a standard NLL loss $\mathcal{L}_j^{\text{NLL}}$ for stability.
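Given per-token probabilities of the $N$-token continuation under the two conditions, the reward is just a difference of summed log-probabilities (a sketch; `thought_reward` is a hypothetical helper):

```python
import math

def thought_reward(probs_with_thought, probs_base):
    """R_j = log P(x_{j+1:j+N} | x_{1:j}, c_j) - log P(x_{j+1:j+N} | x_{1:j}).
    Each argument is the list of per-token probabilities assigned to the
    N future tokens. Positive reward means the thought made the true
    continuation more likely."""
    log_p_with = sum(math.log(p) for p in probs_with_thought)
    log_p_base = sum(math.log(p) for p in probs_base)
    return log_p_with - log_p_base
```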
```python
# Conceptual Quiet-STaR training step (pseudocode; assumes a `model`
# exposing generate_thought / predict / mixing_head, plus helpers
# log_likelihood, nll_loss, and compute_reinforce_loss)
def quiet_star_step(model, input_tokens, n_thought_tokens=8, lookahead=4):
    all_thoughts, rewards, nll_terms = [], [], []
    for j in range(len(input_tokens) - 1):
        # THINK: generate an internal rationale at position j
        thought = model.generate_thought(
            prefix=input_tokens[:j + 1],
            start_token="<|startofthought|>",
            end_token="<|endofthought|>",
            max_tokens=n_thought_tokens,
        )
        # TALK: compute predictions with and without the thought
        logits_with_thought = model.predict(input_tokens[:j + 1] + thought)
        logits_base = model.predict(input_tokens[:j + 1])
        # Mix the two predictions using the learned weight
        alpha = model.mixing_head(logits_with_thought, logits_base)
        mixed_logits = alpha * logits_with_thought + (1 - alpha) * logits_base
        # LEARN: REINFORCE reward = improvement in future-token likelihood
        future_tokens = input_tokens[j + 1 : j + 1 + lookahead]
        log_p_with = log_likelihood(mixed_logits, future_tokens)
        log_p_base = log_likelihood(logits_base, future_tokens)
        rewards.append(log_p_with - log_p_base)
        all_thoughts.append(thought)
        nll_terms.append(nll_loss(mixed_logits, future_tokens))
    # REINFORCE loss on the thoughts + NLL regularization over all positions
    return compute_reinforce_loss(all_thoughts, rewards) + sum(nll_terms)
```
Zero-shot improvements on Mistral 7B without any task-specific training:
| Benchmark | Base Mistral 7B | Quiet-STaR (OpenWebMath) |
|---|---|---|
| GSM8K | 5.9% | 10.9% |
| CommonsenseQA | 36.3% | 47.2% |
Training on C4 (general web text) alone: GSM8K 5.9% to 8.1%, CommonsenseQA 36.3% to 42.6%.
Quiet-STaR disproportionately improves prediction of hard tokens — those with high base perplexity — confirming that internal reasoning helps most where it is needed.
The full Quiet-STaR objective combines REINFORCE and language modeling:
$$\mathcal{L} = \sum_{j} \left[ -\mathcal{R}_j \cdot \log P_\theta(c_j \mid x_{1:j}) + \lambda \cdot \mathcal{L}_j^{\text{NLL}} \right]$$
where $c_j$ is the thought at position $j$, $\mathcal{R}_j$ is the reward (improvement in future token prediction), and $\lambda$ balances the NLL regularization.
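Given per-position rewards, thought log-probabilities, and NLL terms, the objective reduces to a weighted sum over positions (a sketch; the function and argument names are illustrative):

```python
def quiet_star_loss(rewards, log_p_thoughts, nll_losses, lam=1.0):
    """Combined objective: sum_j [ -R_j * log P_theta(c_j | x_{1:j})
    + lam * NLL_j ]. Inputs are per-position lists: R_j, the thought
    log-probability under the current policy, and the NLL term; `lam`
    is the lambda weight on the NLL regularizer."""
    return sum(-r * lp + lam * nll
               for r, lp, nll in zip(rewards, log_p_thoughts, nll_losses))
```

Note the sign: a positive reward scales up the (negative) thought log-probability term, so gradient descent increases the probability of thoughts that improved future-token prediction.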
Quiet-STaR is a conceptual precursor to the “thinking” paradigm seen in models like OpenAI o1 and DeepSeek-R1, which also generate internal reasoning before responding. Fast Quiet-STaR (Huang et al. 2025) extends the approach with curriculum learning to reduce thought tokens while maintaining gains.