Policy of Thoughts

Policy of Thoughts (PoT) is a test-time reasoning framework that recasts LLM inference as a within-instance online policy optimization process. Introduced by Jiao et al. (2026), PoT draws on Karl Popper's epistemology of “conjectures and refutations” to argue that genuine reasoning requires real-time evolution of the model's policy through learning from failed attempts, rather than treating execution feedback as a passive external signal.

graph TD
    A[Problem] --> B[Generate Diverse Conjectures]
    B --> C[Execute and Test]
    C --> D{Correct?}
    D -->|No| E[Compute GRPO Advantages]
    E --> F[Update Transient LoRA]
    F --> G[Evolved Policy]
    G --> B
    D -->|Yes| H[Solution Found]

Motivation

Current test-time scaling methods such as Chain-of-Thought, Tree-of-Thought, and Best-of-N sampling all operate under a frozen policy assumption: the model's parameters remain fixed during inference, and feedback is used only for trajectory filtering or rewriting. This creates instability in complex, long-horizon reasoning because the model cannot internalize lessons from its own mistakes within a single problem instance.

PoT challenges this assumption by allowing the model's reasoning strategy to evolve dynamically during inference on each individual problem.

Popperian Epistemology Connection

PoT maps directly onto Popper's four-stage cycle of scientific discovery (Problem → Tentative Theory → Error Elimination → New Problem):

1. Problem (P1): the reasoning task is posed.
2. Tentative Theory (TT): the model generates diverse conjectures, i.e., candidate solutions.
3. Error Elimination (EE): candidates are executed and tested, and failures act as refutations.
4. New Problem (P2): the evolved policy re-attempts the task with the lessons from refutation internalized.

This cycle repeats within a single inference session, enabling the model to learn from each failed attempt rather than simply discarding bad trajectories.

Framework Architecture

PoT operates through two main phases:

Phase 1: Diverse Exploration. An efficient exploration mechanism generates a population of diverse candidate solutions for the given problem. Diversity is critical because it provides the raw material for policy learning — homogeneous candidates yield minimal learning signal.
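The role of temperature in producing diverse candidates can be illustrated with a small sketch. The logit values here are toy numbers, not from the paper; the point is that raising the sampling temperature flattens the next-token distribution (higher entropy), which spreads probability mass across more continuations:

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax: higher temperature flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more diverse sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy next-token logits with one strongly preferred continuation
logits = [3.0, 1.0, 0.5, 0.1]

low_t = softmax(logits, temperature=0.5)    # sharp: near-greedy sampling
high_t = softmax(logits, temperature=1.5)   # flat: diverse sampling

assert entropy(high_t) > entropy(low_t)
```

A homogeneous group (all candidates sampled near-greedily) would receive nearly identical rewards, which, as the next phase shows, yields near-zero group-relative advantages and hence no learning signal.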

Phase 2: Policy Optimization via GRPO. Group Relative Policy Optimization (GRPO) updates a transient LoRA adapter based on execution feedback from the candidate solutions. The key insight is that this adapter is ephemeral — it exists only for the duration of the current problem instance.

The GRPO objective computes advantages relative to the group of candidates:

# Simplified PoT optimization loop (pseudocode)
for round in range(num_rounds):
    # Phase 1: sample a group of diverse candidates at high temperature
    candidates = model.generate(problem, n=group_size, temperature=1.2)

    # Phase 2: execute each candidate and score it
    rewards = [execute_and_score(c) for c in candidates]

    # Phase 3: GRPO update on the transient LoRA adapter
    advantages = group_relative_advantages(rewards)
    log_probs = [model.log_prob(c, problem) for c in candidates]
    loss = -sum(adv * lp for adv, lp in zip(advantages, log_probs))
    transient_lora.update(loss)

    # The model's policy has now evolved for this instance

The transient LoRA adapter captures instance-specific reasoning refinements without permanently altering the base model weights.
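The adapter's lifecycle can be sketched as follows. This is a minimal illustration of the low-rank-update idea, not the paper's implementation: the `TransientLoRA` class, the matrix shapes, and the stand-in "gradient step" are all our assumptions.

```python
import random

class TransientLoRA:
    """Illustrative low-rank delta B @ A attached to a frozen base weight.

    A (rank x n) and B (m x rank) are the only trainable parameters; the
    base matrix W (m x n) is never modified, so discarding the adapter
    restores the original model exactly.
    """
    def __init__(self, m, n, rank, scale=0.01):
        self.A = [[random.gauss(0, scale) for _ in range(n)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(m)]  # zero-init: delta starts at 0

    def delta(self):
        """Low-rank update B @ A, same shape as the base weight."""
        m, r, n = len(self.B), len(self.A), len(self.A[0])
        return [[sum(self.B[i][k] * self.A[k][j] for k in range(r))
                 for j in range(n)] for i in range(m)]

def effective_weight(W, lora):
    """W plus the adapter's low-rank delta, used only during this instance."""
    D = lora.delta()
    return [[W[i][j] + D[i][j] for j in range(len(W[0]))] for i in range(len(W))]

# Frozen base weight for one layer
W = [[1.0, 2.0], [3.0, 4.0]]

lora = TransientLoRA(m=2, n=2, rank=1)
assert effective_weight(W, lora) == W    # zero-init B: no change at start

# ... within-instance GRPO updates would modify lora.A / lora.B here ...
lora.B[0][0] = 0.5                       # stand-in for a gradient step

# After the instance, the adapter is discarded; the base weights are untouched
del lora
assert W == [[1.0, 2.0], [3.0, 4.0]]
```

Zero-initializing `B` mirrors standard LoRA practice: the adapter contributes nothing until the first update, so the first round of conjectures is sampled from the unmodified base policy.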

Mathematical Formulation

The GRPO objective for within-instance optimization is formulated as:

$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_{o \sim \pi_{\theta}} \left[ \min\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)} \hat{A}(o), \ \text{clip}\left( \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)}, 1-\epsilon, 1+\epsilon \right) \hat{A}(o) \right) \right]$$

where the group-relative advantage is computed as:

$$\hat{A}(o_i) = \frac{r(o_i) - \text{mean}(\{r(o_j)\})}{\text{std}(\{r(o_j)\})}$$

This normalizes rewards relative to the current group, providing stable learning signals regardless of absolute reward scale.
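The normalization can be checked with a small numeric sketch. This uses the population standard deviation and returns zeros when the group is homogeneous; both choices are our assumptions (implementations often add a small epsilon to the denominator instead):

```python
def group_relative_advantages(rewards):
    """Normalize rewards within the candidate group to mean 0, std 1."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:                  # homogeneous group: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Two of four candidates pass the tests (reward 1), two fail (reward 0):
# mean = 0.5, std = 0.5, so passing candidates get +1 and failing ones -1
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Note that scaling all rewards by a constant leaves the advantages unchanged, which is exactly the scale-invariance the paragraph above describes.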

Key Results

The authors report substantial performance improvements over frozen-policy baselines on complex, long-horizon reasoning tasks.

The results support the core thesis: allowing the model's policy to evolve during inference yields better reasoning than frozen-policy approaches.

Comparison with Other Approaches

Method            Policy Update   Feedback Use       Instance Adaptation
Chain-of-Thought  None (frozen)   None               No
Best-of-N         None (frozen)   Selection only     No
Self-Refine       None (frozen)   Prompt rewriting   Shallow
PoT               GRPO on LoRA    Internalized       Deep

Significance

PoT represents a paradigm shift in how we think about test-time computation. Rather than simply spending more compute on generating and filtering candidates, PoT uses that compute to actually improve the model's reasoning capability for the specific problem at hand. This aligns with the broader trend toward test-time training and adaptive inference.

Published as a conference paper at COLM 2025.
