Policy of Thoughts (PoT) is a test-time reasoning framework that recasts LLM inference as a within-instance online policy optimization process. Introduced by Jiao et al. (2025), PoT draws on Karl Popper's epistemology of “conjectures and refutations” to argue that genuine reasoning requires real-time evolution of the model's policy through learning from failed attempts, rather than treating execution feedback as a passive external signal.
Current test-time scaling methods such as Chain-of-Thought, Tree-of-Thought, and Best-of-N sampling all operate under a frozen policy assumption: the model's parameters remain fixed during inference, and feedback is used only for trajectory filtering or rewriting. This creates instability in complex, long-horizon reasoning because the model cannot internalize lessons from its own mistakes within a single problem instance.
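For contrast, a frozen-policy method like Best-of-N reduces to generate, score, select: nothing from the failed trajectories survives the call. A minimal sketch (the `generate_candidates` and `score` callables are hypothetical stand-ins, not an API from the paper):

```python
def best_of_n(generate_candidates, score, problem, n=8):
    """Frozen-policy baseline: feedback selects a winner, never teaches."""
    candidates = generate_candidates(problem, n=n)
    # The n - 1 losing trajectories are discarded outright; the model's
    # parameters are identical before and after this call.
    return max(candidates, key=score)
```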
PoT challenges this assumption by allowing the model's reasoning strategy to evolve dynamically during inference on each individual problem.
PoT maps directly onto Popper's four-stage cycle of scientific discovery, P1 → TT → EE → P2:

- Problem (P1): the reasoning task posed to the model.
- Tentative Theory (TT): a population of diverse candidate solutions conjectured under the current policy.
- Error Elimination (EE): execution feedback refutes failed candidates and drives a policy update.
- New Problem (P2): the next attempt starts from the refined policy.
This cycle repeats within a single inference session, enabling the model to learn from each failed attempt rather than simply discarding bad trajectories.
PoT operates through two main phases:
Phase 1: Diverse Exploration. An efficient exploration mechanism generates a population of diverse candidate solutions for the given problem. Diversity is critical because it provides the raw material for policy learning — homogeneous candidates yield minimal learning signal.
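A minimal sketch of Phase 1 using Hugging Face `transformers` sampling; the checkpoint name and `problem_prompt` are placeholders, and the decoding hyperparameters are illustrative rather than the paper's:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

problem_prompt = "Write a function that ..."  # hypothetical task description
inputs = tokenizer(problem_prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # stochastic decoding rather than greedy
    temperature=1.2,         # high temperature widens the search
    top_p=0.95,              # nucleus sampling trims degenerate tails
    num_return_sequences=8,  # the candidate population (group size)
    max_new_tokens=512,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```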
Phase 2: Policy Optimization via GRPO. Group Relative Policy Optimization (GRPO) updates a transient LoRA adapter based on execution feedback from the candidate solutions. The key insight is that this adapter is ephemeral — it exists only for the duration of the current problem instance.
The GRPO objective computes advantages relative to the group of candidates. A simplified sketch of the resulting optimization loop:
```python
# Simplified PoT optimization loop. `model`, `execute_and_score`,
# `group_relative_advantages`, and `transient_lora` are schematic
# stand-ins for the paper's components.
for round_idx in range(num_rounds):
    # Phase 1: generate a diverse candidate population at high temperature
    candidates, log_probs = model.generate(
        problem, n=group_size, temperature=1.2, return_log_probs=True
    )

    # Phase 2: execute each candidate and score it against the task
    rewards = [execute_and_score(c) for c in candidates]

    # Phase 3: GRPO update on the transient LoRA adapter
    advantages = group_relative_advantages(rewards)
    loss = -sum(adv * lp for adv, lp in zip(advantages, log_probs))
    transient_lora.update(loss)
    # The model's policy has now evolved for this instance
```
The transient LoRA adapter captures instance-specific reasoning refinements without permanently altering the base model weights.
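One plausible realization of the transient adapter uses the `peft` library: attach a freshly initialized LoRA adapter to the base model (`model` from the sketches above), train only its weights during the loop, then drop it. This is a sketch under the assumption that a standard PEFT LoRA setup matches the paper's adapter; the configuration values are illustrative:

```python
from peft import LoraConfig, get_peft_model

# Freshly initialized adapter, scoped to this problem instance only
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.0,
)
peft_model = get_peft_model(model, lora_config)  # base weights stay frozen

# ... run the PoT optimization loop above, updating only the adapter ...

# Discard the adapter; the base model is returned unchanged
model = peft_model.unload()
```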
The GRPO objective for within-instance optimization is formulated as:
$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_{o \sim \pi_{\theta}} \left[ \min\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)} \hat{A}(o), \text{clip}\left(\frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)}, 1-\epsilon, 1+\epsilon\right) \hat{A}(o)\right)\right]$$
where the group-relative advantage is computed as:
$$\hat{A}(o_i) = \frac{r(o_i) - \text{mean}(\{r(o_j)\})}{\text{std}(\{r(o_j)\})}$$
This normalizes rewards relative to the current group, providing stable learning signals regardless of absolute reward scale.
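As a concrete check of the two equations, here is a minimal plain-Python sketch of the group-relative advantage and the clipped surrogate loss; sequence-level log-probabilities and a clip range of 0.2 are assumed for illustration:

```python
import math
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # A_hat(o_i) = (r(o_i) - mean) / std, computed within the group
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_loss(log_probs, ref_log_probs, advantages, epsilon=0.2):
    # Clipped surrogate objective, averaged over the candidate group
    total = 0.0
    for lp, ref_lp, adv in zip(log_probs, ref_log_probs, advantages):
        ratio = math.exp(lp - ref_lp)  # pi_theta(o|q) / pi_ref(o|q)
        clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

# Example: four candidates, only the last one fully solves the problem
print(group_relative_advantages([0.0, 0.0, 0.5, 1.0]))
# approx [-0.90, -0.90, 0.30, 1.51]
```

In this example the one passing candidate receives a strongly positive advantage and the failures receive negative ones, which is exactly the learning signal the GRPO update internalizes.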
Experiments reported by the authors demonstrate substantial performance improvements over frozen-policy baselines. The results support the core thesis: allowing the model's policy to evolve during inference yields better reasoning on complex, long-horizon problems than generating and filtering candidates under a fixed policy.
| Method | Policy Update | Feedback Use | Instance Adaptation |
|---|---|---|---|
| Chain-of-Thought | None (frozen) | None | No |
| Best-of-N | None (frozen) | Selection only | No |
| Self-Refine | None (frozen) | Prompt rewriting | Shallow |
| PoT | GRPO on LoRA | Internalized | Deep |
PoT represents a paradigm shift in how we think about test-time computation. Rather than simply spending more compute on generating and filtering candidates, PoT uses that compute to actually improve the model's reasoning capability for the specific problem at hand. This aligns with the broader trend toward test-time training and adaptive inference.
Published as a conference paper at COLM 2025.