====== CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing ======

CRITIC is a framework that enables large language models to **self-correct their outputs by interacting with external tools** such as search engines and code interpreters.(([[https://arxiv.org/abs/2305.11738|Gou et al. (2023) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing]]))(([[https://arxiv.org/abs/2309.17452|ToRA: Tool-Integrated Reasoning Agents (same research group)]]))(([[https://arxiv.org/abs/2303.17651|Self-Refine: Iterative Refinement with Self-Feedback]])) Introduced by Gou et al. (2023), CRITIC implements a verify-then-revise loop inspired by human critical thinking: generate an initial response, use tools to critique it, then revise based on evidence.

===== Overview =====

A fundamental limitation of LLMs is their tendency to generate plausible but incorrect outputs -- hallucinated facts, faulty reasoning, or toxic content. Prior self-correction approaches that rely solely on the LLM's own judgment yield minimal improvement. CRITIC addresses this by grounding correction in **external tool feedback**, providing objective evidence for revision. The key insight: LLM-only self-critique yields marginal gains (-0.03 to +2.33 F1), while tool-augmented critique produces substantial improvements (+7.0 to +7.7 F1).

===== The Verify-Then-Revise Framework =====

<code>
graph TD
    A[Input Query x] --> B[LLM Generates Initial Output y0]
    B --> C[Tool Interaction]
    C --> D[Generate Critique c_i]
    D --> E[LLM Revises Output]
    E --> F{Satisfactory?}
    F -->|No| C
    F -->|Yes| G[Final Output y_n]
    C -->|Search Engine| H[Factual Verification]
    C -->|Code Interpreter| I[Computation Check]
    C -->|Toxicity Classifier| J[Safety Evaluation]
</code>

At each iteration i, the revised output is sampled from:

$$ \hat{y}_{i+1} \sim P_M(\cdot \mid \wp \oplus x \oplus \hat{y}_i \oplus c_i) $$

where \wp is the task prompt, x is the input, \hat{y}_i is the current output, and c_i is the tool-generated critique.
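The concatenation operator ⊕ in the sampling expression can be illustrated as simple prompt composition. The sketch below is illustrative only -- the function name, prompt wording, and example values are hypothetical, not taken from the paper's released code:

```python
# Sketch of the CRITIC revision context: the next output is sampled
# conditioned on  task prompt ⊕ input x ⊕ current output y_i ⊕ critique c_i.
# All names and prompt wording here are hypothetical placeholders.

def compose_revision_prompt(task_prompt: str, x: str, y_i: str, c_i: str) -> str:
    """Build the concatenated context that conditions the next LLM sample."""
    return "\n\n".join([
        task_prompt,              # task instructions (the prompt \wp)
        "Question: " + x,         # original input x
        "Current answer: " + y_i, # output from the previous iteration
        "Critique: " + c_i,       # evidence gathered via an external tool
    ])

# Hypothetical single iteration of the loop:
prompt = compose_revision_prompt(
    "Answer the question; revise your answer using the critique.",
    "When was the Eiffel Tower completed?",
    "1887",
    "Search result: construction finished in 1889.",
)
```

Each round appends fresh tool evidence to the context, so the model's revision is grounded in external observations rather than in its own unverified judgment.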
The process uses **in-context learning** for general applicability -- no task-specific fine-tuning is required.

===== Tasks and Evaluation =====

CRITIC was evaluated across three diverse task categories:

^ Task Domain ^ Tool Used ^ Key Metric ^
| Free-form Question Answering | Search Engine | F1 / Exact Match |
| Mathematical Program Synthesis | Code Interpreter | Accuracy |
| Toxicity Reduction | Toxicity Classifier | Toxicity Probability |

===== Key Results =====

**Question Answering:**
  * +7.7 F1 improvement over baselines across three QA tasks
  * Outperforms rejection sampling by 3.3-4.5 EM
  * Chain-of-thought outputs improve from 74.7 to 83.7 and from 73.5 to 86.6 after tool-verified revision

**Mathematical Reasoning:**
  * +7.0% absolute accuracy gains across three math tasks
  * The code interpreter catches computational errors that self-reflection misses

**Toxicity Reduction:**
  * 79.2% reduction in toxicity probability
  * Matches supervised state-of-the-art without any training
  * Preserves fluency and diversity of generated text

===== Code Example =====

<code python>
# CRITIC-style verify-then-revise loop
import openai

def verify_with_tool(response, query):
    """Placeholder for the tool-interaction step: query an external tool
    (e.g., a search API) and return {'is_correct': bool, 'evidence': str}.
    The implementation is task-specific and omitted here."""
    raise NotImplementedError

def critic_loop(query, max_iterations=3):
    # Step 1: Generate initial response
    response = openai.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': query}],
    ).choices[0].message.content

    for _ in range(max_iterations):
        # Step 2: Verify with an external tool (e.g., a search engine)
        critique = verify_with_tool(response, query)
        if critique['is_correct']:
            break
        # Step 3: Revise based on the tool-generated critique
        revision_prompt = (
            'Original question: ' + query + '\n'
            'Your answer: ' + response + '\n'
            'Tool feedback: ' + critique['evidence'] + '\n'
            'Please revise your answer based on this feedback.'
        )
        response = openai.chat.completions.create(
            model='gpt-4',
            messages=[{'role': 'user', 'content': revision_prompt}],
        ).choices[0].message.content
    return response
</code>

===== See Also =====

  * [[tora_reasoning|ToRA: Tool-Integrated Reasoning]]
  * [[reflexion|Reflexion: Verbal Reinforcement Learning]]
  * [[self_correction_agents|Self-Correction in LLM Agents]]

===== References =====