====== Policy of Thoughts ======

**Policy of Thoughts (PoT)** is a test-time reasoning framework that recasts LLM inference as a within-instance online policy optimization process. Introduced by Jiao et al. (2026), PoT draws on Karl Popper's epistemology of "conjectures and refutations" to argue that genuine reasoning requires real-time evolution of the model's policy through learning from failed attempts, rather than treating execution feedback as a passive external signal.

<code>
graph TD
    A[Problem] --> B[Generate Diverse Conjectures]
    B --> C[Execute and Test]
    C --> D{Correct?}
    D -->|No| E[Compute GRPO Advantages]
    E --> F[Update Transient LoRA]
    F --> G[Evolved Policy]
    G --> B
    D -->|Yes| H[Solution Found]
</code>

===== Motivation =====

Current test-time scaling methods such as Chain-of-Thought, Tree-of-Thought, and Best-of-N sampling all operate under a **frozen-policy assumption**: the model's parameters remain fixed during inference, and feedback is used only for trajectory filtering or rewriting. This causes instability in complex, long-horizon reasoning because the model cannot internalize lessons from its own mistakes within a single problem instance. PoT challenges this assumption by allowing the model's reasoning strategy to evolve dynamically during inference on each individual problem.

===== Popperian Epistemology Connection =====

PoT maps directly onto Popper's four-stage cycle of scientific discovery:

  * **P1 (Problem Identification)** — the model receives a reasoning problem and identifies the challenge
  * **TT (Tentative Theory)** — the model generates diverse candidate solutions as conjectures
  * **EE (Error Elimination)** — execution feedback tests conjectures against reality, identifying failures
  * **P2 (Updated Problem)** — failed attempts update the model's policy, producing refined understanding

This cycle repeats within a single inference session, enabling the model to learn from each failed attempt rather than simply discarding bad trajectories.
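The four-stage cycle above can be sketched as a minimal control loop. This is an illustrative sketch, not code from the paper: ''popperian_cycle'', ''propose'', ''test'', and ''refine_policy'' are hypothetical stand-ins for the exploration, execution-feedback, and policy-update steps.

<code python>
# Minimal sketch of the Popperian P1 -> TT -> EE -> P2 cycle.
# All function names here are illustrative stand-ins, not APIs from the paper.

def popperian_cycle(problem, propose, test, refine_policy, max_rounds=4):
    """Iterate conjecture -> refutation until a candidate survives testing."""
    policy_state = {}  # transient, per-instance state (the "evolved policy")
    for _ in range(max_rounds):
        conjectures = propose(problem, policy_state)            # TT: tentative theories
        results = [(c, test(problem, c)) for c in conjectures]  # EE: error elimination
        passed = [c for c, ok in results if ok]
        if passed:
            return passed[0]  # a conjecture survived refutation
        # P2: failed attempts reshape the policy for the next round
        policy_state = refine_policy(policy_state, results)
    return None  # no conjecture survived within the round budget

# Toy usage: find the square root of 36 from successive candidate pools.
pools = iter([[5, 7], [4, 8], [6, 9]])
answer = popperian_cycle(
    36,
    propose=lambda problem, state: next(pools),
    test=lambda problem, c: c * c == problem,
    refine_policy=lambda state, results: state,  # no-op refinement in this toy
)
</code>

In PoT itself, ''refine_policy'' is where the GRPO update on the transient LoRA adapter happens; in this toy it is a no-op, so the loop reduces to plain rejection sampling.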
===== Framework Architecture =====

PoT operates through two main phases:

**Phase 1: Diverse Exploration.** An efficient exploration mechanism generates a population of diverse candidate solutions for the given problem. Diversity is critical because it provides the raw material for policy learning — homogeneous candidates yield minimal learning signal.

**Phase 2: Policy Optimization via GRPO.** Group Relative Policy Optimization (GRPO) updates a **transient LoRA adapter** based on execution feedback from the candidate solutions. The key insight is that this adapter is ephemeral: it exists only for the duration of the current problem instance. The GRPO objective computes advantages relative to the group of candidates:

<code python>
# Simplified PoT optimization loop (illustrative pseudocode)
for round in range(num_rounds):
    # Phase 1: generate diverse candidates at high sampling temperature
    candidates = model.generate(problem, n=group_size, temperature=1.0)

    # Phase 2: execute each candidate and compute rewards
    rewards = [execute_and_score(c) for c in candidates]

    # Phase 3: GRPO update on the transient LoRA adapter
    advantages = group_relative_advantages(rewards)
    log_probs = [model.log_prob(problem, c) for c in candidates]
    loss = -sum(adv * lp for adv, lp in zip(advantages, log_probs))
    transient_lora.update(loss)

    # The model's policy has now evolved for this instance
</code>

The transient LoRA adapter captures instance-specific reasoning refinements without permanently altering the base model weights.

===== Mathematical Formulation =====

The GRPO objective for within-instance optimization is formulated as:

$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_{o \sim \pi_{\theta}} \left[ \min\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)} \hat{A}(o),\ \text{clip}\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)},\ 1-\epsilon,\ 1+\epsilon \right) \hat{A}(o) \right) \right]$$

where the group-relative advantage is computed as:

$$\hat{A}(o_i) = \frac{r(o_i) - \text{mean}(\{r(o_j)\})}{\text{std}(\{r(o_j)\})}$$

This normalizes rewards relative to the current group, providing a stable learning signal regardless of absolute reward scale.
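Concretely, the group-relative advantage is a z-score of each reward within its candidate group, and the clipped term bounds how far a single update can move the policy. A minimal sketch in pure Python (the helper names ''group_relative_advantages'' and ''clipped_surrogate'' are illustrative):

<code python>
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped term: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1 - epsilon, min(1 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Two passing and two failing candidates: advantages split to roughly +-1.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A large probability ratio with positive advantage is capped at 1 + epsilon...
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)    # -> 1.2
# ...while with negative advantage the pessimistic (lower) branch is taken.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)   # -> -0.8
</code>

The z-score normalization is what lets GRPO skip a learned value baseline: each group of candidates serves as its own baseline, which is convenient here since the "critic" would otherwise have to be fitted within a single problem instance.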
===== Key Results =====

Experiments demonstrate dramatic performance improvements:

  * A **4B-parameter model** achieves **49.71% accuracy** on LiveCodeBench
  * This outperforms **GPT-4o** and **DeepSeek-V3**, despite being over 50x smaller
  * The closed-loop design enables instance-specific refinement that static methods cannot achieve
  * Performance scales with the number of optimization rounds per instance

The results support the core thesis: allowing the model's policy to evolve during inference yields better reasoning than a frozen-policy approach.

===== Comparison with Other Approaches =====

^ Method ^ Policy Update ^ Feedback Use ^ Instance Adaptation ^
| Chain-of-Thought | None (frozen) | None | No |
| Best-of-N | None (frozen) | Selection only | No |
| Self-Refine | None (frozen) | Prompt rewriting | Shallow |
| **PoT** | **GRPO on LoRA** | **Internalized** | **Deep** |

===== Significance =====

PoT represents a shift in how test-time computation is used. Rather than spending more compute on generating and filtering candidates under a fixed policy, PoT uses that compute to improve the model's reasoning capability for the specific problem at hand. This aligns with the broader trend toward test-time training and adaptive inference.

Published as a conference paper at COLM 2025.

===== References =====

  * [[https://arxiv.org/abs/2601.20379|Jiao et al. (2026). "Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution." arXiv:2601.20379]]
  * Popper, K. (1963). //Conjectures and Refutations: The Growth of Scientific Knowledge//.
  * [[https://arxiv.org/abs/2402.03300|Shao et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300]]

===== See Also =====

  * [[test_time_compute|Test-Time Compute Scaling]]
  * [[chain_of_thought|Chain-of-Thought Prompting]]
  * [[lora_adapters|LoRA Adapters]]
  * [[reinforcement_learning_llm|Reinforcement Learning for LLMs]]