====== CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing ======
CRITIC is a framework that enables large language models to **self-correct their outputs by interacting with external tools** such as search engines and code interpreters.(([[https://arxiv.org/abs/2305.11738|Gou et al. (2023) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing]]))(([[https://arxiv.org/abs/2309.17452|ToRA: Tool-Integrated Reasoning Agents (same research group)]]))(([[https://arxiv.org/abs/2303.17651|Self-Refine: Iterative Refinement with Self-Feedback]])) Introduced by Gou et al. (2023), CRITIC implements a verify-then-revise loop inspired by human critical thinking: generate an initial response, use tools to critique it, then revise based on evidence.
===== Overview =====
A fundamental limitation of LLMs is their tendency to generate plausible but incorrect outputs -- hallucinated facts, faulty reasoning, or toxic content. Prior self-correction approaches that rely solely on the LLM's own judgment yield minimal improvement. CRITIC addresses this by grounding correction in **external tool feedback**, providing objective evidence for revision.
The key insight: LLM-only self-critique yields marginal gains (-0.03 to +2.33 F1), while tool-augmented critique produces substantial improvements (+7.0 to +7.7 F1).
===== The Verify-Then-Revise Framework =====
<code>
graph TD
  A[Input Query x] --> B[LLM Generates Initial Output y0]
  B --> C[Tool Interaction]
  C --> D[Generate Critique c_i]
  D --> E[LLM Revises Output]
  E --> F{Satisfactory?}
  F -->|No| C
  F -->|Yes| G[Final Output y_n]
  C -->|Search Engine| H[Factual Verification]
  C -->|Code Interpreter| I[Computation Check]
  C -->|Toxicity Classifier| J[Safety Evaluation]
</code>
At each iteration i, the revised output is sampled from:
$$\hat{y}_{i+1} \sim P_M(\,\cdot \mid \wp \oplus x \oplus \hat{y}_i \oplus c_i\,)$$
where $\wp$ is the task prompt, $x$ is the input, $\hat{y}_i$ is the current output, and $c_i$ is the tool-generated critique. The process uses **in-context learning** for general applicability -- no task-specific fine-tuning is required.
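In practice, the concatenation operator $\oplus$ is plain prompt assembly. A minimal sketch of how the revision prompt might be built (function and field names are illustrative, not from the paper):

```python
def build_revision_prompt(task_prompt, x, y_i, c_i):
    """Assemble the prompt p (+) x (+) y_i (+) c_i that conditions
    the model's next revision sample."""
    return "\n\n".join([
        task_prompt,                          # task-level instructions \wp
        f"Question: {x}",                     # input x
        f"Proposed answer: {y_i}",            # current output \hat{y}_i
        f"Critique (tool evidence): {c_i}",   # tool-generated critique c_i
        "Revised answer:",                    # cue for \hat{y}_{i+1}
    ])
```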
===== Tasks and Evaluation =====
CRITIC was evaluated across three diverse task categories:
^ Task Domain ^ Tool Used ^ Key Metric ^
| Free-form Question Answering | Search Engine | F1 / Exact Match |
| Mathematical Program Synthesis | Code Interpreter | Accuracy |
| Toxicity Reduction | Toxicity Classifier | Toxicity Probability |
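The task-to-tool mapping in the table above amounts to a simple dispatcher. A sketch with illustrative key names (the paper does not prescribe this interface):

```python
def pick_tool(task_domain):
    """Map a task domain to its verification tool, per the table above.
    Domain keys here are illustrative shorthand."""
    tools = {
        "free_form_qa": "search_engine",            # factual verification
        "math_program_synthesis": "code_interpreter",  # computation check
        "toxicity_reduction": "toxicity_classifier",   # safety evaluation
    }
    if task_domain not in tools:
        raise ValueError(f"no verification tool registered for {task_domain!r}")
    return tools[task_domain]
```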
===== Key Results =====
**Question Answering:**
* +7.7 F1 improvement over baselines across three QA tasks
* Outperforms rejection sampling by 3.3 to 4.5 exact-match (EM) points
* Chain-of-thought answers improve substantially after tool-verified revision (e.g., 74.7 -> 83.7 and 73.5 -> 86.6 F1)
**Mathematical Reasoning:**
* +7.0% absolute accuracy gains across three math tasks
* Code interpreter catches computational errors that self-reflection misses
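For program-synthesis tasks, the interpreter critique can be approximated by simply executing the model-generated program and reporting the result or the error. A hypothetical helper (not the paper's exact harness), assuming the program stores its result in a variable named `answer`:

```python
def critique_program(program_src):
    """Execute a model-generated Python program and return tool feedback
    in the form {'ok': bool, 'evidence': str}. Assumes the program
    assigns its result to a variable named `answer`."""
    env = {}
    try:
        exec(program_src, env)  # run the candidate program
        return {"ok": True,
                "evidence": f"Program ran; answer = {env.get('answer')}"}
    except Exception as e:
        # Runtime errors (e.g., division by zero) become concrete critiques
        return {"ok": False, "evidence": f"Interpreter error: {e!r}"}
```

Unlike purely verbal self-reflection, this feedback is grounded: a crashing or wrong-valued program yields objective evidence the reviser can act on.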
**Toxicity Reduction:**
* 79.2% reduction in toxicity probability
* Matches supervised state-of-the-art without any training
* Preserves fluency and diversity of generated text
===== Code Example =====
A simplified sketch of the verify-then-revise loop. The ''verify_with_tool'' helper is a placeholder for a real tool call (search engine, code interpreter, or toxicity classifier) returning a critique of the form ''{'is_correct': bool, 'evidence': str}'':

<code python>
# CRITIC-style verify-then-revise loop (sketch)
from openai import OpenAI

client = OpenAI()

def verify_with_tool(response, query):
    """Placeholder: query an external tool (e.g., a search engine) and
    return {'is_correct': bool, 'evidence': str}."""
    raise NotImplementedError

def critic_loop(query, max_iterations=3):
    # Step 1: generate the initial response y_0
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    for _ in range(max_iterations):
        # Step 2: verify the current output with an external tool
        critique = verify_with_tool(response, query)
        if critique["is_correct"]:
            break
        # Step 3: revise based on the tool-grounded critique
        revision_prompt = (
            f"Original question: {query}\n"
            f"Your answer: {response}\n"
            f"Tool feedback: {critique['evidence']}\n"
            "Please revise your answer based on this feedback."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": revision_prompt}],
        ).choices[0].message.content
    return response
</code>
===== See Also =====
* [[tora_reasoning|ToRA: Tool-Integrated Reasoning]]
* [[reflexion|Reflexion: Verbal Reinforcement Learning]]
* [[self_correction_agents|Self-Correction in LLM Agents]]
===== References =====