
Policy of Thoughts

Policy of Thoughts (PoT) is a test-time reasoning framework that recasts LLM inference as a within-instance online policy optimization process. Introduced by Jiao et al. (2026), PoT draws on Karl Popper's epistemology of “conjectures and refutations” to argue that genuine reasoning requires real-time evolution of the model's policy through learning from failed attempts, rather than treating execution feedback as a passive external signal.

Motivation

Current test-time scaling methods such as Chain-of-Thought, Tree-of-Thought, and Best-of-N sampling all operate under a frozen policy assumption: the model's parameters remain fixed during inference, and feedback is used only for trajectory filtering or rewriting. This creates instability in complex, long-horizon reasoning because the model cannot internalize lessons from its own mistakes within a single problem instance.

PoT challenges this assumption by allowing the model's reasoning strategy to evolve dynamically during inference on each individual problem.

Popperian Epistemology Connection

PoT maps directly onto Popper's four-stage cycle of scientific discovery:

  • P1 (Problem Identification) — The model receives a reasoning problem and identifies the challenge
  • TT (Tentative Theory) — The model generates diverse candidate solutions as conjectures
  • EE (Error Elimination) — Execution feedback tests conjectures against reality, identifying failures
  • P2 (Updated Problem) — Failed attempts update the model's policy, producing refined understanding

This cycle repeats within a single inference session, enabling the model to learn from each failed attempt rather than simply discarding bad trajectories.

Framework Architecture

PoT operates through two main phases:

Phase 1: Diverse Exploration. An efficient exploration mechanism generates a population of diverse candidate solutions for the given problem. Diversity is critical because it provides the raw material for policy learning — homogeneous candidates yield minimal learning signal.
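Diversity can be quantified in several ways; as an illustrative proxy (not a metric from the paper), the mean pairwise Jaccard distance over candidates' token sets gives a quick check that a population is not homogeneous:

```python
from itertools import combinations

def pairwise_diversity(candidates):
    """Mean pairwise Jaccard distance between candidates' token sets.

    0.0 means all candidates are identical; values near 1.0 mean the
    population shares almost no tokens. Illustrative proxy only.
    """
    sets = [set(c.split()) for c in candidates]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    dist = lambda a, b: 1 - len(a & b) / len(a | b)
    return sum(dist(a, b) for a, b in pairs) / len(pairs)
```

A population scoring near zero on such a measure would, per the argument above, yield almost no learning signal for the optimization phase.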

Phase 2: Policy Optimization via GRPO. Group Relative Policy Optimization (GRPO) updates a transient LoRA adapter based on execution feedback from the candidate solutions. The key insight is that this adapter is ephemeral — it exists only for the duration of the current problem instance.

The GRPO objective computes advantages relative to the group of candidates:

# Simplified PoT optimization loop (sketch: model, execute_and_score,
# group_relative_advantages, and transient_lora are assumed interfaces)
for round in range(num_rounds):
    # Phase 1: Generate diverse candidates at a high sampling temperature
    candidates = model.generate(problem, n=group_size, temperature=1.0)

    # Phase 2: Execute each candidate and compute rewards
    rewards = [execute_and_score(c) for c in candidates]

    # Phase 3: GRPO update on the transient LoRA adapter
    advantages = group_relative_advantages(rewards)
    log_probs = [model.log_prob(c, given=problem) for c in candidates]
    loss = -sum(adv * lp for adv, lp in zip(advantages, log_probs))
    transient_lora.update(loss)

    # The model's policy has now evolved for this instance
The transient LoRA adapter captures instance-specific reasoning refinements without permanently altering the base model weights.
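The adapter's lifecycle can be sketched abstractly as a context manager: attach a fresh adapter for the instance, train it, and discard it on exit. `TransientAdapterSession` and the model's `attach`/`detach` interface are hypothetical; a real implementation would use an adapter library such as PEFT.

```python
# Sketch of the transient-adapter lifecycle: base weights are never
# touched, and the adapter is discarded after the instance.
# `make_adapter` and the model's attach/detach interface are hypothetical.

class TransientAdapterSession:
    def __init__(self, base_model, make_adapter):
        self.base_model = base_model
        self.make_adapter = make_adapter
        self.adapter = None

    def __enter__(self):
        self.adapter = self.make_adapter()    # fresh, instance-specific
        self.base_model.attach(self.adapter)  # base weights unchanged
        return self.adapter

    def __exit__(self, *exc):
        self.base_model.detach(self.adapter)  # policy reverts to base
        self.adapter = None                   # refinements are discarded
        return False
```

The context-manager framing makes the central guarantee explicit: once the instance is solved (or abandoned), the base model is byte-identical to what it was before.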

Mathematical Formulation

The GRPO objective for within-instance optimization is formulated as:

$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_{o \sim \pi_{\theta}} \left[ \min\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)} \hat{A}(o), \text{clip}\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)}, 1-\epsilon, 1+\epsilon \right) \hat{A}(o) \right) \right]$$

where the group-relative advantage is computed as:

$$\hat{A}(o_i) = \frac{r(o_i) - \text{mean}(\{r(o_j)\})}{\text{std}(\{r(o_j)\})}$$

This normalizes rewards relative to the current group, providing stable learning signals regardless of absolute reward scale.
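In code, the normalization is a few lines (plain-Python sketch; the `eps` guard against zero variance when all candidates receive the same reward is an implementation detail the formula leaves implicit):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group mean and std, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the candidate group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantages are centered on the group mean, they always sum to zero: above-average candidates are reinforced and below-average ones suppressed, independent of the absolute reward scale.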

Key Results

Experiments demonstrate dramatic performance improvements:

  • A 4B parameter model achieves 49.71% accuracy on LiveCodeBench
  • This outperforms GPT-4o and DeepSeek-V3 despite being over 50x smaller
  • The closed-loop design enables instance-specific refinement that static methods cannot achieve
  • Performance scales with the number of optimization rounds per instance

The results support the core thesis: allowing the model's policy to evolve during inference yields better reasoning on these benchmarks than the frozen-policy approaches compared against.

Comparison with Other Approaches

Method             Policy Update   Feedback Use       Instance Adaptation
Chain-of-Thought   None (frozen)   None               No
Best-of-N          None (frozen)   Selection only     No
Self-Refine        None (frozen)   Prompt rewriting   Shallow
PoT                GRPO on LoRA    Internalized       Deep

Significance

PoT represents a paradigm shift in how we think about test-time computation. Rather than simply spending more compute on generating and filtering candidates, PoT uses that compute to actually improve the model's reasoning capability for the specific problem at hand. This aligns with the broader trend toward test-time training and adaptive inference.

Published as a conference paper at COLM 2025.

References

See Also
