====== Quiet-STaR ======

**Quiet-STaR** (Quiet Self-Taught Reasoner) is a training method introduced by Zelikman et al. (2024) that teaches language models to generate internal "thinking" tokens at every position in a sequence before predicting the next token. Unlike standard chain-of-thought, which requires explicit prompting, Quiet-STaR enables models to learn //implicit reasoning// from general web text, improving downstream reasoning without task-specific fine-tuning.

===== Background: From STaR to Quiet-STaR =====

**STaR** (Self-Taught Reasoner, Zelikman et al. 2022) bootstraps reasoning by generating rationales for question-answer examples and reinforcing those that lead to correct answers. However, STaR is limited to curated question-answer datasets. Quiet-STaR generalizes this idea to //arbitrary text//: the model learns that some tokens are hard to predict and benefit from internal deliberation. The key insight is that useful reasoning occurs implicitly in natural text: a Wikipedia article about physics implicitly requires understanding of equations, and a story implicitly requires theory of mind.

===== Architecture =====

Quiet-STaR augments a standard autoregressive LM with:

  * **Learnable boundary tokens**: special ''<|startofthought|>'' and ''<|endofthought|>'' tokens that delimit internal rationale sequences
  * **Mixing head**: a learned interpolation weight between post-thought predictions and base predictions, easing the distribution shift during early training (see the sketch after this list)
  * **Tokenwise parallel generation**: at each position $j$, the model generates $K$ thought tokens, then predicts the next $N$ real tokens
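The mixing head is usually described as a shallow network that outputs an interpolation weight. The sketch below is a minimal PyTorch version, assuming the head reads the hidden states with and without the thought; the class name ''MixingHead'', the layer sizes, and the exact inputs are illustrative assumptions, not the released Quiet-STaR code.

<code python>
# Minimal sketch of a mixing head, assuming a PyTorch backbone.
# Names, layer sizes, and inputs are illustrative assumptions.
import torch
import torch.nn as nn

class MixingHead(nn.Module):
    """Shallow MLP producing an interpolation weight alpha in [0, 1]."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, h_base: torch.Tensor, h_thought: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps alpha in [0, 1]; while thoughts are still low quality,
        # alpha near 0 lets the base prediction dominate.
        return torch.sigmoid(self.mlp(torch.cat([h_base, h_thought], dim=-1)))

def mix_logits(logits_with_thought, logits_base, alpha):
    """Interpolate post-thought and base next-token logits with weight alpha."""
    return alpha * logits_with_thought + (1 - alpha) * logits_base
</code>

Because the weight can start near zero, early low-quality thoughts cannot degrade the base model's predictions, which is exactly the distribution-shift concern the mixing head is meant to address.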
=== The Think-Talk-Learn Loop ===

Training proceeds in three phases applied during continued pretraining:

== 1. Think (Parallel Rationale Generation) ==

For each token $x_j$ in the input, sample a thought sequence $c_j = (c_{j1}, \ldots, c_{jK})$ in parallel across all $n$ positions. This uses efficient batched generation with ''<|startofthought|>'' and ''<|endofthought|>'' delimiters.

== 2. Talk (Mixing Predictions) ==

Blend the post-thought logits with the base LM logits using the learned mixing weight. This prevents catastrophic forgetting early in training, when thoughts are still low quality.

== 3. Learn (REINFORCE Optimization) ==

Optimize rationale quality using a REINFORCE-based objective. The reward for a thought at position $j$ is:

$$\mathcal{R}_j = \log P(x_{j+1:j+N} \mid x_{1:j}, c_j) - \log P(x_{j+1:j+N} \mid x_{1:j})$$

This measures whether the thought //improved// prediction of the next $N$ tokens compared to the base model. The REINFORCE term is combined with a standard NLL loss $\mathcal{L}_j^{\text{NLL}}$ for stability.

The conceptual training step below ties the three phases together. It is written sequentially for clarity (the actual method batches thought generation across positions), and helpers such as ''model.generate_thought'', ''log_likelihood'', and ''compute_reinforce_loss'' are illustrative placeholders.

<code python>
# Conceptual Quiet-STaR training step (sequential for clarity; the real
# implementation generates thoughts at all positions in parallel).
def quiet_star_step(model, input_tokens, n_thought_tokens=8, lookahead=4):
    all_thoughts, rewards, nll_terms = [], [], []
    # Stop early enough that a full lookahead window of real tokens remains.
    for j in range(len(input_tokens) - lookahead):
        # THINK: generate an internal rationale at position j
        thought = model.generate_thought(
            prefix=input_tokens[:j + 1],
            start_token="<|startofthought|>",
            end_token="<|endofthought|>",
            max_tokens=n_thought_tokens,
        )

        # TALK: compute predictions with and without the thought
        logits_with_thought = model.predict(input_tokens[:j + 1] + thought)
        logits_base = model.predict(input_tokens[:j + 1])

        # Mix the two predictions using the learned interpolation weight
        alpha = model.mixing_head(logits_with_thought, logits_base)
        mixed_logits = alpha * logits_with_thought + (1 - alpha) * logits_base

        # LEARN: reward = how much the thought improved future-token prediction
        future_tokens = input_tokens[j + 1 : j + 1 + lookahead]
        log_p_with = log_likelihood(mixed_logits, future_tokens)
        log_p_base = log_likelihood(logits_base, future_tokens)
        rewards.append(log_p_with - log_p_base)
        all_thoughts.append(thought)
        nll_terms.append(nll_loss(mixed_logits, future_tokens))

    # REINFORCE loss on the thoughts plus the NLL loss accumulated per position
    return compute_reinforce_loss(all_thoughts, rewards) + sum(nll_terms)
</code>

===== Key Technical Details =====

  * **Thought length**: typically $K = 8$ tokens (including the start/end delimiters); longer thoughts yield better results but increase compute
  * **Lookahead window**: $N = 4$ future tokens used for reward computation
  * **Base model**: applied to Mistral 7B with continued pretraining on OpenWebMath and C4
  * **Inference masking**: at test time, thought tokens must be masked out of the visible output, since the model is not trained to suppress them (a sketch appears below, after the mathematical formulation)

===== Results =====

Zero-shot improvements on Mistral 7B without any task-specific training:

^ Benchmark ^ Base Mistral 7B ^ Quiet-STaR (OpenWebMath) ^
| GSM8K | 5.9% | **10.9%** |
| CommonsenseQA | 36.3% | **47.2%** |

Training on C4 (general web text) alone still helps: GSM8K improves from 5.9% to 8.1%, and CommonsenseQA from 36.3% to 42.6%. Quiet-STaR disproportionately improves prediction of //hard// tokens (those with high base perplexity), confirming that internal reasoning helps most where it is needed.

===== Mathematical Formulation =====

The full Quiet-STaR objective combines REINFORCE and language modeling:

$$\mathcal{L} = \sum_{j} \left[ -\mathcal{R}_j \cdot \log P_\theta(c_j \mid x_{1:j}) + \lambda \cdot \mathcal{L}_j^{\text{NLL}} \right]$$

where $c_j$ is the thought at position $j$, $\mathcal{R}_j$ is the reward (improvement in future-token prediction), and $\lambda$ balances the NLL regularization.

<code>
graph TB
    A[Input Token x_j] --> B["Generate Thought c_j"]
    B --> C["Predict with Thought"]
    A --> D["Predict without Thought (Base)"]
    C --> E["Mixing Head (alpha)"]
    D --> E
    E --> F["Mixed Prediction"]
    F --> G["REINFORCE Reward: log P(future|thought) - log P(future|base)"]
    G --> H["Update Weights"]
</code>
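Read literally, the objective gives a simple per-position loss. The following is a minimal PyTorch sketch of that loss for a single position, assuming one sampled thought per position; detaching $\mathcal{R}_j$ so it acts as a constant weight is the standard REINFORCE reading, and all names and tensor shapes are illustrative assumptions rather than the paper's implementation.

<code python>
# Minimal sketch of the combined Quiet-STaR loss at one position j,
# assuming PyTorch. Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def _seq_logp(logits, targets):
    """Sum of log-probabilities of `targets` (shape [T]) under `logits` ([T, V])."""
    return F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).sum()

def quiet_star_loss_at_j(thought_logits, thought_tokens,
                         mixed_logits, base_logits, future_tokens, lam=1.0):
    # log P_theta(c_j | x_{1:j}): log-prob of the sampled thought tokens
    logp_c = _seq_logp(thought_logits, thought_tokens)

    # R_j = log P(future | x, c_j) - log P(future | x), detached so it is
    # treated as a constant weight on the REINFORCE term
    with torch.no_grad():
        reward = (_seq_logp(mixed_logits, future_tokens)
                  - _seq_logp(base_logits, future_tokens))

    # NLL regularizer on the mixed prediction of the true future tokens
    nll = -_seq_logp(mixed_logits, future_tokens)

    # -R_j * log P(c_j) + lambda * L_NLL, matching the displayed objective
    return -reward * logp_c + lam * nll
</code>

Summing this quantity over positions $j$ recovers the displayed objective; thoughts that help future-token prediction ($\mathcal{R}_j > 0$) are made more likely, and harmful ones less likely.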
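As noted under Key Technical Details, the thought spans must be stripped from generated text at inference time. Below is a minimal sketch of that masking step, assuming the delimiter token IDs are available from the tokenizer; the helper name ''strip_thoughts'' is hypothetical.

<code python>
# Hypothetical post-processing helper: removes internal thought spans from a
# generated token-ID sequence before detokenization. Delimiter IDs are assumed
# to be known from the tokenizer.
def strip_thoughts(token_ids, start_id, end_id):
    visible, in_thought = [], False
    for tok in token_ids:
        if tok == start_id:
            in_thought = True           # entering an internal rationale
        elif tok == end_id:
            in_thought = False          # rationale finished; resume real output
        elif not in_thought:
            visible.append(tok)         # only non-thought tokens reach the user
    return visible

# Example: IDs 101/102 stand in for <|startofthought|>/<|endofthought|>
assert strip_thoughts([5, 101, 7, 8, 102, 6], 101, 102) == [5, 6]
</code>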
===== Relation to Later Work =====

Quiet-STaR is a conceptual precursor to the "thinking" paradigm seen in models like OpenAI o1 and DeepSeek-R1, which also generate internal reasoning before responding. Fast Quiet-STaR (Huang et al. 2025) extends the approach with curriculum learning to reduce thought tokens while maintaining gains.

===== References =====

  * [[https://arxiv.org/abs/2403.09629|Zelikman et al. "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking" (2024). arXiv:2403.09629]]
  * [[https://arxiv.org/abs/2203.14465|Zelikman et al. "STaR: Bootstrapping Reasoning With Reasoning" (2022). arXiv:2203.14465]]
  * [[https://github.com/ezelikman/quiet-star|Official Quiet-STaR implementation (GitHub)]]
  * [[https://arxiv.org/abs/2505.17746|Huang et al. "Fast Quiet-STaR: Thinking Without Thought Tokens" (2025). arXiv:2505.17746]]

===== See Also =====

  * [[chain_of_thought|Chain of Thought]]
  * [[self_taught_reasoner|STaR (Self-Taught Reasoner)]]
  * [[reasoning_tokens|Reasoning Tokens]]
  * [[reinforcement_learning_from_human_feedback|RLHF]]