====== Policy of Thoughts ======

**Policy of Thoughts (PoT)** is a test-time reasoning framework that recasts LLM inference as a within-instance online policy optimization process. Introduced by Jiao et al. (2026), PoT draws on Karl Popper's epistemology of "conjectures and refutations" to argue that genuine reasoning requires real-time evolution of the model's policy through learning from failed attempts, rather than treating execution feedback as a passive external signal.

<code>
graph TD
    A[Problem] --> B[Generate Diverse Conjectures]
    B --> C[Execute and Test]
    C --> D{Correct?}
    D -->|No| E[Compute GRPO Advantages]
    E --> F[Update Transient LoRA]
    F --> G[Evolved Policy]
    G --> B
    D -->|Yes| H[Solution Found]
</code>

===== Motivation =====

Current test-time scaling methods such as Chain-of-Thought, Tree-of-Thought, and Best-of-N sampling all operate under a **frozen-policy assumption**: the model's parameters remain fixed during inference, and feedback is used only for trajectory filtering or rewriting. This causes instability in complex, long-horizon reasoning because the model cannot internalize lessons from its own mistakes within a single problem instance. PoT challenges this assumption by allowing the model's reasoning strategy to evolve dynamically during inference on each individual problem.

===== Popperian Epistemology Connection =====

PoT maps directly onto Popper's four-stage cycle of scientific discovery:

  * **P1 (Problem Identification)** — the model receives a reasoning problem and identifies the challenge
  * **TT (Tentative Theory)** — the model generates diverse candidate solutions as conjectures
  * **EE (Error Elimination)** — execution feedback tests conjectures against reality, identifying failures
  * **P2 (Updated Problem)** — failed attempts update the model's policy, producing refined understanding

This cycle repeats within a single inference session, enabling the model to learn from each failed attempt rather than simply discarding bad trajectories.
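The four-stage cycle above can be sketched as a minimal control loop. This is an illustrative sketch, not code from the paper: ''popperian_cycle'', ''propose'', ''test'', and ''refine_policy'' are hypothetical stand-ins for the exploration, execution-feedback, and policy-update steps.

<code python>
# Minimal sketch of the Popperian P1 -> TT -> EE -> P2 cycle.
# All function names here are illustrative stand-ins, not APIs from the paper.

def popperian_cycle(problem, propose, test, refine_policy, max_rounds=4):
    """Iterate conjecture -> refutation until a candidate survives testing."""
    policy_state = {}  # transient, per-instance state (the "evolved policy")
    for _ in range(max_rounds):
        conjectures = propose(problem, policy_state)            # TT: tentative theories
        results = [(c, test(problem, c)) for c in conjectures]  # EE: error elimination
        passed = [c for c, ok in results if ok]
        if passed:
            return passed[0]  # a conjecture survived refutation
        # P2: failed attempts reshape the policy for the next round
        policy_state = refine_policy(policy_state, results)
    return None  # no conjecture survived within the round budget

# Toy usage: find the square root of 36 from successive candidate pools.
pools = iter([[5, 7], [4, 8], [6, 9]])
answer = popperian_cycle(
    36,
    propose=lambda problem, state: next(pools),
    test=lambda problem, c: c * c == problem,
    refine_policy=lambda state, results: state,  # no-op refinement in this toy
)
</code>

In PoT itself, ''refine_policy'' is where the GRPO update on the transient LoRA adapter happens; in this toy it is a no-op, so the loop reduces to plain rejection sampling.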
===== Framework Architecture =====

PoT operates through two main phases:

**Phase 1: Diverse Exploration.** An efficient exploration mechanism generates a population of diverse candidate solutions for the given problem. Diversity is critical because it provides the raw material for policy learning — homogeneous candidates yield minimal learning signal.

**Phase 2: Policy Optimization via GRPO.** Group Relative Policy Optimization (GRPO) updates a **transient LoRA adapter** based on execution feedback from the candidate solutions. The key insight is that this adapter is ephemeral: it exists only for the duration of the current problem instance. The GRPO objective computes advantages relative to the group of candidates:

<code python>
# Simplified PoT optimization loop (illustrative pseudocode)
for round in range(num_rounds):
    # Phase 1: generate diverse candidates at high sampling temperature
    candidates = model.generate(problem, n=group_size, temperature=1.0)

    # Phase 2: execute each candidate and compute rewards
    rewards = [execute_and_score(c) for c in candidates]

    # Phase 3: GRPO update on the transient LoRA adapter
    advantages = group_relative_advantages(rewards)
    log_probs = [model.log_prob(problem, c) for c in candidates]
    loss = -sum(adv * lp for adv, lp in zip(advantages, log_probs))
    transient_lora.update(loss)

    # The model's policy has now evolved for this instance
</code>

The transient LoRA adapter captures instance-specific reasoning refinements without permanently altering the base model weights.

===== Mathematical Formulation =====

The GRPO objective for within-instance optimization is formulated as:

$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_{o \sim \pi_{\theta}} \left[ \min\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)} \hat{A}(o),\ \text{clip}\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)},\ 1-\epsilon,\ 1+\epsilon \right) \hat{A}(o) \right) \right]$$

where the group-relative advantage is computed as:

$$\hat{A}(o_i) = \frac{r(o_i) - \text{mean}(\{r(o_j)\})}{\text{std}(\{r(o_j)\})}$$

This normalizes rewards relative to the current group, providing a stable learning signal regardless of absolute reward scale.
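Concretely, the group-relative advantage is a z-score of each reward within its candidate group, and the clipped term bounds how far a single update can move the policy. A minimal sketch in pure Python (the helper names ''group_relative_advantages'' and ''clipped_surrogate'' are illustrative):

<code python>
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped term: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1 - epsilon, min(1 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Two passing and two failing candidates: advantages split to roughly +-1.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A large probability ratio with positive advantage is capped at 1 + epsilon...
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)    # -> 1.2
# ...while with negative advantage the pessimistic (lower) branch is taken.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)   # -> -0.8
</code>

The z-score normalization is what lets GRPO skip a learned value baseline: each group of candidates serves as its own baseline, which is convenient here since the "critic" would otherwise have to be fitted within a single problem instance.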
===== Key Results =====

Experiments demonstrate dramatic performance improvements:

  * A **4B-parameter model** achieves **49.71% accuracy** on LiveCodeBench
  * This outperforms **GPT-4o** and **DeepSeek-V3**, despite being over 50x smaller
  * The closed-loop design enables instance-specific refinement that static methods cannot achieve
  * Performance scales with the number of optimization rounds per instance

The results support the core thesis: allowing the model's policy to evolve during inference yields better reasoning than a frozen-policy approach.

===== Comparison with Other Approaches =====

^ Method ^ Policy Update ^ Feedback Use ^ Instance Adaptation ^
| Chain-of-Thought | None (frozen) | None | No |
| Best-of-N | None (frozen) | Selection only | No |
| Self-Refine | None (frozen) | Prompt rewriting | Shallow |
| **PoT** | **GRPO on LoRA** | **Internalized** | **Deep** |

===== Significance =====

PoT represents a shift in how test-time computation is used. Rather than spending more compute on generating and filtering candidates under a fixed policy, PoT uses that compute to improve the model's reasoning capability for the specific problem at hand. This aligns with the broader trend toward test-time training and adaptive inference.

Published as a conference paper at COLM 2025.

===== References =====

  * [[https://arxiv.org/abs/2601.20379|Jiao et al. (2026). "Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution." arXiv:2601.20379]]
  * Popper, K. (1963). //Conjectures and Refutations: The Growth of Scientific Knowledge//.
  * [[https://arxiv.org/abs/2402.03300|Shao et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300]]

===== See Also =====

  * [[test_time_compute|Test-Time Compute Scaling]]
  * [[chain_of_thought|Chain-of-Thought Prompting]]
  * [[lora_adapters|LoRA Adapters]]
  * [[reinforcement_learning_llm|Reinforcement Learning for LLMs]]