


CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing

CRITIC is a framework that enables large language models to self-correct their outputs by interacting with external tools such as search engines and code interpreters. Introduced by Gou et al. (2023), CRITIC implements a verify-then-revise loop inspired by human critical thinking: generate an initial response, use tools to critique it, then revise based on the evidence.

Overview

A fundamental limitation of LLMs is their tendency to generate plausible but incorrect outputs – hallucinated facts, faulty reasoning, or toxic content. Prior self-correction approaches that rely solely on the LLM's own judgment yield minimal improvement. CRITIC addresses this by grounding correction in external tool feedback, providing objective evidence for revision.

The key insight: LLM-only self-critique yields marginal gains (-0.03 to +2.33 F1), while tool-augmented critique produces substantial improvements (+7.0 to +7.7 F1).

The Verify-Then-Revise Framework

graph TD
    A[Input Query x] --> B[LLM Generates Initial Output y0]
    B --> C[Tool Interaction]
    C --> D[Generate Critique c_i]
    D --> E[LLM Revises Output]
    E --> F{Satisfactory?}
    F -->|No| C
    F -->|Yes| G[Final Output y_n]
    C -->|Search Engine| H[Factual Verification]
    C -->|Code Interpreter| I[Computation Check]
    C -->|Toxicity Classifier| J[Safety Evaluation]

At each iteration <latex>i</latex>, the revised output is sampled from:

<latex>\hat{y}_{i+1} \sim P_M(\cdot\, |\, \wp \oplus x \oplus \hat{y}_i \oplus c_i)</latex>

where <latex>\wp</latex> is the task prompt, <latex>x</latex> is the input, <latex>\hat{y}_i</latex> is the current output, and <latex>c_i</latex> is the tool-generated critique. The process uses in-context learning for general applicability – no task-specific fine-tuning required.
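The concatenation in the sampling formula can be sketched directly in code. The helper name `build_revision_prompt` is illustrative (not from the paper), and the LLM call itself is omitted:

```python
# Sketch of the prompt construction behind the sampling formula: the next
# output is sampled conditioned on prompt (+) input (+) current output (+) critique.
def build_revision_prompt(task_prompt: str, x: str, y_i: str, c_i: str) -> str:
    """Concatenate the components that condition the next sample."""
    return "\n\n".join([
        task_prompt,               # the task prompt (with any few-shot examples)
        f"Question: {x}",          # x: the input
        f"Current answer: {y_i}",  # y_i: the output being revised
        f"Critique: {c_i}",        # c_i: the tool-generated critique
    ])
```

Because the conditioning is pure prompt concatenation, the same frozen model serves as generator, critic consumer, and reviser, which is what lets CRITIC work via in-context learning alone.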

Tasks and Evaluation

CRITIC was evaluated across three diverse task categories:

Task Domain                    | Tool Used           | Key Metric
Free-form Question Answering   | Search Engine       | F1 / Exact Match
Mathematical Program Synthesis | Code Interpreter    | Accuracy
Toxicity Reduction             | Toxicity Classifier | Toxicity Probability

Key Results

Question Answering:

  • +7.7 F1 improvement over baselines across three QA tasks
  • Outperforms rejection sampling by 3.3 to 4.5 EM
  • With chain-of-thought initial outputs, scores improved from 74.7 to 83.7 and from 73.5 to 86.6 after tool-verified revision

Mathematical Reasoning:

  • +7.0% absolute accuracy gains across three math tasks
  • Code interpreter catches computational errors that self-reflection misses
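For mathematical tasks, one way the code-interpreter critic can work is to execute the generated program and compare its output against the claimed answer. A minimal sketch, with illustrative function names not taken from the paper:

```python
# Sketch: a code interpreter as the critic for math program synthesis.
import contextlib
import io

def execute_program(program: str) -> str:
    """Run a generated program, capturing stdout; errors become critique text."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(program, {})  # isolated namespace; untrusted code needs real sandboxing
    except Exception as e:
        return f"Execution error: {e}"
    return buf.getvalue().strip()

def critique_math(program: str, claimed_answer: str) -> dict:
    """Compare interpreter output to the claimed answer, as a tool critique."""
    result = execute_program(program)
    return {
        "is_correct": result == claimed_answer,
        "evidence": f"Interpreter output: {result}",
    }
```

Unlike self-reflection, the interpreter's verdict is grounded in actual execution, so arithmetic slips and runtime errors surface as concrete evidence for the revision step.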

Toxicity Reduction:

  • 79.2% reduction in toxicity probability
  • Matches supervised state-of-the-art without any training
  • Preserves fluency and diversity of generated text
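For the detoxification setting, the critic is a toxicity scorer whose score triggers revision. The toy blocklist scorer below is purely illustrative; a real deployment would call an actual classifier (e.g., an API-based toxicity model):

```python
# Toy sketch: a stand-in toxicity "classifier" driving the revise decision.
BLOCKLIST = {"hate", "stupid"}  # hypothetical; a real critic uses a trained classifier

def toxicity_score(text: str) -> float:
    """Fraction of words on the blocklist -- a crude stand-in for a model score."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!?") in BLOCKLIST for w in words) / len(words)

def needs_revision(text: str, threshold: float = 0.1) -> bool:
    """The verify step: flag the output for revision if the score is too high."""
    return toxicity_score(text) > threshold
```

The threshold plays the role of the "satisfactory?" check in the loop: outputs below it pass through unchanged, which is how fluency and diversity are preserved.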

Code Example

# CRITIC-style verify-then-revise loop
import openai  # assumes openai>=1.0 with OPENAI_API_KEY set in the environment

def ask_llm(prompt):
    """Single chat completion, shared by the initial and revision steps."""
    return openai.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}]
    ).choices[0].message.content

def critic_loop(query, max_iterations=3):
    # Step 1: Generate the initial response
    response = ask_llm(query)

    for i in range(max_iterations):
        # Step 2: Verify with an external tool (e.g., a search engine).
        # verify_with_tool is a user-supplied function returning
        # {'is_correct': bool, 'evidence': str}.
        critique = verify_with_tool(response, query)
        if critique['is_correct']:
            break
        # Step 3: Revise based on the tool critique
        revision_prompt = (
            f"Original question: {query}\n"
            f"Your answer: {response}\n"
            f"Tool feedback: {critique['evidence']}\n"
            "Please revise your answer based on this feedback."
        )
        response = ask_llm(revision_prompt)
    return response
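The `verify_with_tool` helper called above is left to the user. One possible shape for it, sketched here with a stubbed evidence store standing in for a live search engine (the dictionary and matching rule are assumptions for illustration):

```python
# Hypothetical sketch of a verify_with_tool helper: check whether the answer
# is supported by retrieved evidence. A stub dict replaces real search results.
EVIDENCE_STORE = {
    "Who wrote Hamlet?": "Hamlet is a tragedy written by William Shakespeare.",
}

def verify_with_tool(response: str, query: str) -> dict:
    """Return a critique dict: whether the evidence supports the answer, and why."""
    evidence = EVIDENCE_STORE.get(query, "")  # stand-in for a search call
    supported = bool(evidence) and response.lower() in evidence.lower()
    return {"is_correct": supported, "evidence": evidence}
```

In practice the substring test would be replaced by an LLM-generated critique that reads the search snippets, which is what gives CRITIC its evidence-grounded feedback.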
