Self-Refine

Self-Refine, introduced by Madaan et al. (2023), is a framework that improves LLM outputs through iterative self-feedback: the same model generates an initial output, critiques it for weaknesses, and refines it based on that feedback. No external training, reinforcement learning, or additional models are required — the approach works purely at inference time with a single LLM.

Motivation

Humans rarely produce perfect first drafts — we revise through iterative critique and improvement. Self-Refine brings this natural revision process to LLMs. While a model's initial output may contain errors, the same model can often identify those issues when explicitly prompted to critique, and fix them when prompted to revise. This leverages an asymmetry: evaluation is easier than generation.

The Generate-Critique-Refine Loop

Self-Refine operates in three steps, iterated until convergence:

  1. Generate: The LLM produces an initial output $y_0$ given a task prompt $x$
  2. Critique: The LLM provides multi-aspect feedback $f_t$ on the current output $y_t$, identifying specific weaknesses
  3. Refine: The LLM generates an improved output $y_{t+1}$ conditioned on $x$, $y_t$, and $f_t$

$$y_0 = \text{LLM}(x), \quad f_t = \text{LLM}_{\text{fb}}(x, y_t), \quad y_{t+1} = \text{LLM}_{\text{refine}}(x, y_t, f_t)$$

The loop repeats for $T$ iterations or until the model indicates no further improvements are possible.

def self_refine(model, task_prompt, max_iters=3):
    """Generate, critique, and refine an output with a single model."""
    # Step 1: initial generation
    output = model.generate(task_prompt)

    for _ in range(max_iters):
        # Step 2: multi-aspect feedback on the current output
        critique = model.generate(
            f"Review and identify specific issues:\n"
            f"Task: {task_prompt}\nOutput: {output}"
        )
        # Stop early if the feedback signals convergence
        if "no improvements needed" in critique.lower():
            break
        # Step 3: refine conditioned on task, current output, and feedback
        output = model.generate(
            f"Improve based on feedback:\n"
            f"Task: {task_prompt}\nOutput: {output}\n"
            f"Feedback: {critique}"
        )
    return output
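Assuming the `model.generate` interface above, a scripted stand-in can demonstrate one full pass of the loop. `MockModel` and its canned responses are purely illustrative; in practice `generate` would wrap a real LLM API call. The refinement logic is repeated here so the sketch runs standalone:

```python
class MockModel:
    """Replays canned responses: draft -> critique -> refinement -> stop."""

    def __init__(self):
        self._responses = iter([
            "def add(a, b): return a+b",                   # initial draft
            "Missing docstring and type hints.",           # critique 1
            'def add(a: int, b: int) -> int:\n'
            '    """Return the sum of a and b."""\n'
            '    return a + b',                            # refinement 1
            "No improvements needed.",                     # critique 2 -> stop
        ])

    def generate(self, prompt: str) -> str:
        return next(self._responses)


def self_refine(model, task_prompt, max_iters=3):
    # Same logic as the loop above, repeated so this snippet is self-contained.
    output = model.generate(task_prompt)
    for _ in range(max_iters):
        critique = model.generate(
            f"Review and identify specific issues:\n"
            f"Task: {task_prompt}\nOutput: {output}"
        )
        if "no improvements needed" in critique.lower():
            break
        output = model.generate(
            f"Improve based on feedback:\n"
            f"Task: {task_prompt}\nOutput: {output}\nFeedback: {critique}"
        )
    return output


result = self_refine(MockModel(), "Write an add function")
print(result)
```

The mock terminates after one refinement round because the second critique contains the stop phrase, exercising both the refine step and the early-exit path.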

Key Design Principles

Stopping Criteria

Two mechanisms determine when to stop:

  1. Iteration budget: a fixed cap of $T$ refinement rounds bounds inference cost (the `max_iters` parameter in the loop above)
  2. Stop signal: the feedback step is prompted to emit a marker phrase such as "no improvements needed" when it finds nothing left to fix, which ends the loop early
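As a sketch, the two stopping criteria can be combined into a single predicate. The `should_stop` helper and the exact stop phrase are illustrative conventions, not part of the published method:

```python
# Hypothetical helper combining both stopping checks.
STOP_PHRASE = "no improvements needed"

def should_stop(critique: str, iteration: int, max_iters: int = 3) -> bool:
    """True once the iteration budget T is spent or the feedback
    signals that nothing is left to fix."""
    if iteration >= max_iters:              # fixed budget of T rounds
        return True
    return STOP_PHRASE in critique.lower()  # model-emitted stop signal

print(should_stop("Looks correct. No improvements needed.", iteration=1))  # True
print(should_stop("Variable name shadows a builtin.", iteration=1))        # False
print(should_stop("Could still be faster.", iteration=3))                  # True
```

Checking the stop phrase case-insensitively keeps the signal robust to how the model phrases its feedback.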

Experimental Results

Self-Refine was evaluated on 7 diverse tasks with improvements of 5-40% over direct generation:

| Task                | Improvement                |
|---------------------|----------------------------|
| Code optimization   | ~31% relative gain         |
| Sentiment reversal  | ~9% absolute gain          |
| Math reasoning      | Consistent improvement     |
| Dialogue response   | Quality gains in coherence |
| Code readability    | Multi-iteration gains      |
| Acronym generation  | Strong human preference    |
| Review rewriting    | ~20% average improvement   |

Results hold across GPT-3.5 and GPT-4, with most of the quality gain arriving in the first 2-3 refinement rounds. Human evaluators consistently preferred Self-Refine outputs to direct single-pass generation.

Limitations

Significance

Self-Refine demonstrates that significant quality improvements are achievable without any additional training — purely through structured inference-time computation. This suggests that frontier model capabilities are underutilized by single-pass generation, and iterative refinement is a general-purpose method for extracting better performance.

References

Madaan, A., Tandon, N., Gupta, P., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems (NeurIPS 2023). arXiv:2303.17651.
