====== Quiet-STaR ======
**Quiet-STaR** (Quiet Self-Taught Reasoner) is a training method introduced by Zelikman et al. (2024) that teaches language models to generate internal "thinking" tokens at every position in a sequence before predicting the next token. Unlike standard chain-of-thought, which requires explicit prompting, Quiet-STaR enables models to learn //implicit reasoning// from general web text, improving downstream reasoning without task-specific fine-tuning.
===== Background: From STaR to Quiet-STaR =====
**STaR** (Self-Taught Reasoner, Zelikman et al. 2022) bootstraps reasoning by generating rationales for QA examples and reinforcing those that lead to correct answers. However, STaR is limited to curated question-answer datasets.
Quiet-STaR generalizes this idea to //arbitrary text//: the model learns that some tokens are hard to predict and benefit from internal deliberation. The key insight is that useful reasoning occurs implicitly in natural text --- a Wikipedia article about physics implicitly requires understanding of equations, and a story implicitly requires theory of mind.
===== Architecture =====
Quiet-STaR augments a standard autoregressive LM with:
* **Learnable boundary tokens**: Special ''<|startofthought|>'' and ''<|endofthought|>'' tokens that delimit internal rationale sequences
* **Mixing head**: A learned interpolation weight between post-thought predictions and base predictions, easing the distribution shift during early training
* **Tokenwise parallel generation**: At each position $j$, the model generates $K$ thought tokens, then predicts the next $N$ real tokens
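The mixing head can be pictured as a learned scalar gate that interpolates between the two sets of logits. A minimal sketch in plain Python, where ''mix_logits'' and ''gate_score'' are hypothetical names for illustration (the actual head is a small learned network):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix_logits(logits_with_thought, logits_base, gate_score):
    # gate_score is the mixing head's raw scalar output (hypothetical name);
    # sigmoid maps it to an interpolation weight alpha in (0, 1)
    alpha = sigmoid(gate_score)
    return [alpha * lw + (1.0 - alpha) * lb
            for lw, lb in zip(logits_with_thought, logits_base)]

# A strongly negative gate score keeps the mix close to the base logits,
# which protects the model early in training while thoughts are low quality
mixed = mix_logits([2.0, -1.0], [0.5, 0.5], gate_score=-4.0)
```

With ''gate_score = -4.0'' the weight on the post-thought logits is under 2%, so the mixed prediction stays near the base model's, which is the desired behavior before thoughts become useful.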
=== The Think-Talk-Learn Loop ===
Training proceeds in three phases applied during continued pretraining:
== 1. Think (Parallel Rationale Generation) ==
For each token $x_j$ in the input, sample a thought sequence $c_j = (c_{j1}, \ldots, c_{jK})$ in parallel across all $n$ positions. This uses efficient batched generation with ''<|startofthought|>'' and ''<|endofthought|>'' delimiters.
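The parallel structure can be sketched as follows: each position keeps a partial thought, and each round extends all partial thoughts by one token. Here ''sample_next'' is a stand-in for the model's batched sampling call (an assumption for illustration; the real implementation does each round as a single batched forward pass with a suitable attention mask):

```python
def generate_thoughts_parallel(sample_next, tokens, k=3):
    # One partial thought per input position. Each pass over t below would
    # be one batched forward pass in practice; here it is serialized.
    thoughts = [["<|startofthought|>"] for _ in tokens]
    for _ in range(k):
        for j in range(len(tokens)):  # batched in practice, serial here
            thoughts[j].append(sample_next(tokens[:j + 1] + thoughts[j]))
    for th in thoughts:
        th.append("<|endofthought|>")
    return thoughts

# Toy sampler that always emits the same filler token
thoughts = generate_thoughts_parallel(lambda prefix: "hmm", ["a", "b"], k=2)
```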
== 2. Talk (Mixing Predictions) ==
Blend the post-thought logits with the base LM logits using a learned mixing weight. This prevents catastrophic forgetting early in training when thoughts are still low quality.
== 3. Learn (REINFORCE Optimization) ==
Optimize rationale quality using a REINFORCE-based objective. The reward for a thought at position $j$ is:
$$\mathcal{R}_j = \log P(x_{j+1:j+N} \mid x_{1:j}, c_j) - \log P(x_{j+1:j+N} \mid x_{1:j})$$
This measures whether the thought //improved// prediction of future tokens compared to the base model. The reward term is combined with a standard NLL loss $\mathcal{L}_j^{\text{NLL}}$ for stability.
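As a toy numeric check of this reward, using hypothetical per-token log-probabilities over an $N = 4$ lookahead window:

```python
def thought_reward(logp_future_with_thought, logp_future_base):
    # R_j = log P(x_{j+1:j+N} | x_{1:j}, c_j) - log P(x_{j+1:j+N} | x_{1:j})
    return logp_future_with_thought - logp_future_base

# Hypothetical per-token log-probs for the 4 future tokens
with_thought = [-1.0, -1.2, -0.8, -1.5]   # sums to -4.5
base         = [-1.6, -1.9, -1.0, -1.5]   # sums to -6.0

reward = thought_reward(sum(with_thought), sum(base))  # 1.5: thought helped
```

A positive reward means the thought made the future tokens more likely; a thought that adds nothing gets reward near zero, and a distracting thought gets a negative reward.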
<code python>
# Conceptual Quiet-STaR training step (serialized sketch; the paper
# generates thoughts for all positions in parallel)
def quiet_star_step(model, input_tokens, n_thought_tokens=8, lookahead=4):
    all_thoughts, rewards, nll_losses = [], [], []
    for j in range(len(input_tokens) - lookahead):
        # THINK: generate an internal rationale at position j
        thought = model.generate_thought(
            prefix=input_tokens[:j + 1],
            start_token="<|startofthought|>",
            end_token="<|endofthought|>",
            max_tokens=n_thought_tokens,
        )
        # TALK: compute predictions with and without the thought
        logits_with_thought = model.predict(input_tokens[:j + 1] + thought)
        logits_base = model.predict(input_tokens[:j + 1])
        # Mix predictions using the learned mixing head's weight
        alpha = model.mixing_head(logits_with_thought, logits_base)
        mixed_logits = alpha * logits_with_thought + (1 - alpha) * logits_base
        # LEARN: REINFORCE reward = improvement in future-token likelihood
        future_tokens = input_tokens[j + 1:j + 1 + lookahead]
        log_p_with = log_likelihood(mixed_logits, future_tokens)
        log_p_base = log_likelihood(logits_base, future_tokens)
        rewards.append(log_p_with - log_p_base)
        all_thoughts.append(thought)
        nll_losses.append(nll_loss(mixed_logits, future_tokens))
    # REINFORCE loss on the thoughts plus NLL loss summed over all positions
    return compute_reinforce_loss(all_thoughts, rewards) + sum(nll_losses)
</code>
===== Key Technical Details =====
* **Thought length**: Typically $K = 8$ tokens (including start/end delimiters). Longer thoughts yield better results but increase compute
* **Lookahead window**: $N = 4$ future tokens used for reward computation
* **Base model**: Applied to Mistral 7B with continued pretraining on OpenWebMath and C4
* **Inference masking**: At test time, start/end thought tokens must be masked out since the model is not trained to suppress them
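One simple way to picture inference masking is a post-hoc filter that drops everything between the thought delimiters from the generated stream (a sketch for illustration; an implementation can instead zero the probability of emitting the start-of-thought token during decoding):

```python
def strip_thoughts(tokens, start="<|startofthought|>", end="<|endofthought|>"):
    # Drop the delimiters and everything between them, keeping only the
    # tokens the model "says out loud"
    out, in_thought = [], False
    for t in tokens:
        if t == start:
            in_thought = True
        elif t == end:
            in_thought = False
        elif not in_thought:
            out.append(t)
    return out

visible = strip_thoughts(
    ["The", "<|startofthought|>", "subject", "is", "<|endofthought|>", "cat"]
)
```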
===== Results =====
Zero-shot improvements on Mistral 7B without any task-specific training:
^ Benchmark ^ Base Mistral 7B ^ Quiet-STaR (OpenWebMath) ^
| GSM8K | 5.9% | **10.9%** |
| CommonsenseQA | 36.3% | **47.2%** |
Training on C4 (general web text) alone still improves reasoning: GSM8K rises from 5.9% to 8.1% and CommonsenseQA from 36.3% to 42.6%.
Quiet-STaR disproportionately improves prediction of //hard// tokens --- those with high base perplexity --- confirming that internal reasoning helps most where it is needed.
===== Mathematical Formulation =====
The full Quiet-STaR objective combines REINFORCE and language modeling:
$$\mathcal{L} = \sum_{j} \left[ -\mathcal{R}_j \cdot \log P_\theta(c_j \mid x_{1:j}) + \lambda \cdot \mathcal{L}_j^{\text{NLL}} \right]$$
where $c_j$ is the thought at position $j$, $\mathcal{R}_j$ is the reward (improvement in future token prediction), and $\lambda$ balances the NLL regularization.
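The objective above can be computed directly once the per-position quantities are available. A minimal sketch in plain Python (tensor/autograd machinery omitted; the rewards $\mathcal{R}_j$ are treated as constants so no gradient flows through them):

```python
def quiet_star_loss(rewards, thought_logps, nll_losses, lam=1.0):
    # L = sum_j [ -R_j * log P_theta(c_j | x_{1:j}) + lambda * NLL_j ]
    # rewards: R_j per position (constants under the REINFORCE estimator)
    # thought_logps: log P_theta(c_j | x_{1:j}) per position
    # lam: weight on the NLL regularization term
    return sum(-r * lp + lam * nll
               for r, lp, nll in zip(rewards, thought_logps, nll_losses))

# One position: reward 1.5, thought log-prob -2.0, NLL 0.8
loss = quiet_star_loss([1.5], [-2.0], [0.8])  # -1.5 * -2.0 + 0.8 = 3.8
```

Note the sign structure: a positive reward pushes the loss to //increase// the log-probability of the thought that earned it, which is exactly the REINFORCE score-function update.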
The training loop as a flow diagram (Mermaid source):
<code>
graph TB
    A[Input Token x_j] --> B["Generate Thought c_j"]
    B --> C["Predict with Thought"]
    A --> D["Predict without Thought (Base)"]
    C --> E["Mixing Head (alpha)"]
    D --> E
    E --> F["Mixed Prediction"]
    F --> G["REINFORCE Reward: log P(future|thought) - log P(future|base)"]
    G --> H["Update Weights"]
</code>
===== Relation to Later Work =====
Quiet-STaR is a conceptual precursor to the "thinking" paradigm seen in models like OpenAI o1 and DeepSeek-R1, which also generate internal reasoning before responding. Fast Quiet-STaR (Huang et al. 2025) extends the approach with curriculum learning to reduce thought tokens while maintaining gains.
===== References =====
* [[https://arxiv.org/abs/2403.09629|Zelikman et al. "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking" (2024). arXiv:2403.09629]]
* [[https://arxiv.org/abs/2203.14465|Zelikman et al. "STaR: Bootstrapping Reasoning With Reasoning" (2022). arXiv:2203.14465]]
* [[https://github.com/ezelikman/quiet-star|Official Quiet-STaR implementation (GitHub)]]
* [[https://arxiv.org/abs/2505.17746|Huang et al. "Fast Quiet-STaR: Thinking Without Thought Tokens" (2025). arXiv:2505.17746]]
===== See Also =====
* [[chain_of_thought|Chain of Thought]]
* [[self_taught_reasoner|STaR (Self-Taught Reasoner)]]
* [[reasoning_tokens|Reasoning Tokens]]
* [[reinforcement_learning_from_human_feedback|RLHF]]