====== LLM Agents for Formal Theorem Proving ======

Large language models are increasingly deployed as agents that collaborate with formal proof assistants such as Lean to prove mathematical theorems automatically. By combining informal mathematical reasoning with machine-checkable verification, these systems reach high pass rates on benchmarks such as MiniF2F while maintaining formal rigor.

===== The Formal Verification Challenge =====

Formal theorem proving requires generating proofs in languages such as Lean 4, where every step is machine-verified. A verification system can check correctness almost instantaneously, but generating a fully correct formal proof remains formidable. Traditional approaches prompt LLMs thousands of times in the hope that one proof passes verification. Agentic approaches instead use structured interaction between LLMs and the proof assistant.

===== Prover Agent: Lemma-Guided Decomposition =====

**Prover Agent** (Baba et al., 2025) is an agent framework that coordinates informal reasoning LLMs with formal prover models and Lean verification, achieving 86.1% on MiniF2F.
=== Architecture ===

The system coordinates three components:

  - **Informal reasoning LLM**: generates auxiliary lemmas to guide proof strategy
  - **Formal prover model**: DeepSeek-Prover-V2, Goedel-Prover-V2, or their ensemble
  - **Lean verifier**: checks all proofs and provides error feedback

=== Agentic Workflow ===

  - **Direct proving**: first attempts direct proof generation
  - **Lemma generation**: if direct proving fails, the informal LLM generates auxiliary lemmas targeting essential intermediate facts $F_1, \dots, F_m$
  - **Lemma proving**: each lemma is individually proved, verified by Lean, and added to context
  - **Iterative refinement**: selects the best prior draft and revises it based on Lean error messages (locations and messages), iterating up to $N_{\text{refine}}$ steps

=== Theoretical Results ===

  * **Theorem 4.4**: lemma decomposition requires exponentially fewer trials than direct proving ($N_{\text{lem}} \ll N_{\text{dir}}$)
  * **Theorem 4.5**: lemmas are most efficient for hard problems, where the direct success probability satisfies $p \leq \tau$
  * **Theorem 4.6**: optimal lemmas balance subproblem difficulty
  * Success probability grows exponentially with the mutual information $I(Z; Y_{1:n})$ contributed by lemmas and feedback

=== Results ===

  * **MiniF2F**: 86.1% pass rate (state of the art for small language models)
  * Ablations confirm that both lemma generation and iterative refinement are essential

===== APOLLO: Automated Proof Repair =====

**APOLLO** -- Automated PrOof repair via LLM and Lean cOllaboration (Ospanov et al., 2025) -- is a modular, model-agnostic agentic framework that repairs proofs through multi-agent collaboration with Lean.
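The exponential gap behind Theorem 4.4 can be seen in a toy model. This is a minimal sketch under an illustrative assumption (not the paper's exact setting): a proof consists of $m$ independent steps, each of which a prover gets right with probability $q$ per sample.

<code python>
# Toy illustration of the intuition behind Theorem 4.4: lemma
# decomposition needs exponentially fewer trials than direct proving.
# Assumption (for illustration only): a proof has m independent steps,
# each sampled correctly with probability q.

def expected_trials_direct(q: float, m: int) -> float:
    # A direct attempt succeeds only if all m steps are right at once,
    # so the per-sample success probability is q**m and the expected
    # number of trials (geometric distribution) is q**(-m).
    return q ** -m

def expected_trials_lemmas(q: float, m: int) -> float:
    # With one lemma per step, each lemma is resampled until it
    # verifies, costing 1/q trials on average, for m/q total:
    # linear in m rather than exponential.
    return m / q

for m in (2, 4, 8):
    direct = expected_trials_direct(0.5, m)
    lemmas = expected_trials_lemmas(0.5, m)
    print(f"m={m}: direct ~{direct:.0f} trials, lemma-guided ~{lemmas:.0f}")
</code>

With $q = 0.5$, direct proving needs roughly $2^m$ trials while the lemma-guided route needs only $2m$, which is the qualitative shape of the paper's bound.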
=== Architecture ===

APOLLO directs a fully automated pipeline:

  - The LLM generates initial proof attempts
  - A set of agents analyzes the proofs and fixes syntax errors
  - Lean identifies mistakes and isolates failing sub-lemmas
  - Automated solvers (e.g., ''omega'', ''simp'', ''decide'') attempt to close simple goals
  - The LLM is invoked on each remaining goal with a low top-K budget
  - Repaired sub-proofs are recombined and reverified
  - The process iterates up to a user-controlled maximum number of attempts

=== Key Features ===

  * **Model-agnostic**: works with any LLM backend
  * **Low sampling budget**: achieves strong results without thousands of samples
  * **Sub-lemma isolation**: decomposes failing proofs into independently solvable goals

=== Results ===

  * **MiniF2F**: 84.9% accuracy among sub-8B-parameter models (state of the art at time of publication)
  * Significantly lower token and sampling budgets than competing approaches

===== MA-LoT: Multi-Agent Long Chain-of-Thought =====

**MA-LoT** -- Model-CollAboration Lean-based Long Chain-of-Thought (Wang et al., 2025, ICML) -- is the first multi-agent framework for Lean 4 theorem proving that balances high-level natural-language reasoning with formal-language verification.

=== Architecture ===

MA-LoT separates cognition tasks through model collaboration:

  * **Whole-proof generation agent**: uses general natural-language reasoning to generate complete proof drafts
  * **Error analysis agent**: analyzes Lean compiler errors and suggests corrections
  * **Lean 4 verifier**: provides structured feedback at each step of the Long CoT

=== LoT-Transfer Learning ===

A novel training-inference pipeline that enables Long CoT thinking in LLMs without special data annotation, transferring the ability to engage in extended structured reasoning from pre-training to theorem-proving contexts.
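The APOLLO pipeline steps above can be sketched as a repair loop. This is a hedged, runnable sketch, not the authors' implementation: the goal representation, the solver stubs, and ''llm_repair'' are all hypothetical stand-ins for calls to Lean and an actual LLM.

<code python>
# Minimal sketch of an APOLLO-style repair loop: try cheap automated
# solvers on each isolated failing goal first, then fall back to a
# low-budget LLM call, and iterate on whatever remains open.
# All names and the toy string "goals" are illustrative assumptions.
from typing import Callable, Dict, List, Optional

# Stub solvers: a goal is "closed" when a solver recognizes it.
AUTOMATED_SOLVERS: Dict[str, Callable[[str], bool]] = {
    "omega": lambda goal: "linear-arith" in goal,   # stub for Lean's omega
    "decide": lambda goal: "decidable" in goal,     # stub for Lean's decide
}

def llm_repair(goal: str) -> Optional[str]:
    # Stub for a low-sampling-budget LLM call on one isolated goal.
    return None if "hard" in goal else f"by llm_tactic -- {goal}"

def apollo_repair(failing_goals: List[str], max_attempts: int = 3) -> Dict[str, str]:
    """Map each closable goal to the tactic that closed it."""
    repaired: Dict[str, str] = {}
    remaining = list(failing_goals)
    for _ in range(max_attempts):
        still_open: List[str] = []
        for goal in remaining:
            # Step 1: cheap automated solvers (omega, simp, decide, ...).
            tactic = next((name for name, solves in AUTOMATED_SOLVERS.items()
                           if solves(goal)), None)
            # Step 2: fall back to the LLM with a low top-K budget.
            if tactic is None:
                tactic = llm_repair(goal)
            if tactic is not None:
                repaired[goal] = tactic      # sub-proof recombined later
            else:
                still_open.append(goal)
        if not still_open:
            break
        remaining = still_open               # iterate on what is left
    return repaired
</code>

For example, ''apollo_repair(["linear-arith g1", "decidable g2", "g3"])'' closes the first two goals via the solver stubs and the third via the LLM stub; a goal no component can close simply stays open after ''max_attempts'' rounds.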
=== Results ===

  * **MiniF2F-Test (Lean 4)**: 61.07% accuracy
  * Outperforms DeepSeek-V3 (33.61%), InternLM-Step-Prover tree search (50.70%), and Goedel-Prover whole-proof generation (55.33%)

===== Code Example =====

<code python>
# Simplified Prover Agent loop
class ProverAgent:
    def __init__(self, informal_llm, formal_prover, lean_env):
        self.informal = informal_llm
        self.prover = formal_prover
        self.lean = lean_env

    def prove(self, theorem, max_refine=10):
        # Phase 1: direct proving attempt
        proof = self.prover.generate(theorem)
        if self.lean.verify(proof):
            return proof

        # Phase 2: lemma-guided decomposition
        lemmas = self.informal.generate_lemmas(theorem)
        verified_lemmas = []
        for lemma in lemmas:
            lemma_proof = self.prover.generate(lemma)
            if self.lean.verify(lemma_proof):
                verified_lemmas.append((lemma, lemma_proof))

        # Phase 3: iterative refinement with Lean feedback
        best_proof = self.prover.generate(theorem, context=verified_lemmas)
        for step in range(max_refine):
            result = self.lean.check(best_proof)
            if result.success:
                return best_proof
            best_proof = self.prover.refine(best_proof, result.errors)
        return best_proof
</code>

===== Comparison of Approaches =====

^ Method ^ Year ^ MiniF2F (%) ^ Key Innovation ^ Model Size ^
| MA-LoT | 2025 | 61.07 | Multi-agent Long CoT | Various |
| APOLLO | 2025 | 84.9 | Proof repair agents | < 8B |
| Prover Agent | 2025 | 86.1 | Lemma decomposition + refinement | Ensemble |

===== Mathematical Foundation =====

The theoretical advantage of lemma-guided decomposition can be expressed as follows. For a theorem requiring $m$ intermediate lemmas with individual success probabilities $p_i$:

$P(\text{success with lemmas}) = \prod_{i=1}^{m} \left(1 - (1-p_i)^{K_i}\right)$

where $K_i$ is the sample budget per lemma. This grows far faster with the sampling budget than direct proving:

$P(\text{direct success}) = 1 - (1 - p_{\text{direct}})^{N}$

where $p_{\text{direct}} \ll \prod_i p_i$ for hard theorems.
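The two formulas can be compared numerically under an equal total sample budget. The parameter values below ($p_i$, $K_i$, and hence $N$) are illustrative assumptions, not figures from the papers:

<code python>
# Numeric comparison of the two success-probability formulas above,
# holding the total sample budget fixed. Parameter values are
# illustrative assumptions, not taken from the papers.

def p_success_with_lemmas(p, budgets):
    """P = prod_i (1 - (1 - p_i)^{K_i})."""
    prob = 1.0
    for p_i, k_i in zip(p, budgets):
        prob *= 1.0 - (1.0 - p_i) ** k_i
    return prob

def p_success_direct(p_direct, n):
    """P = 1 - (1 - p_direct)^N."""
    return 1.0 - (1.0 - p_direct) ** n

# Three lemmas, each provable with probability 0.2 per sample.
p = [0.2, 0.2, 0.2]
budgets = [20, 20, 20]      # K_i samples per lemma
p_direct = 0.2 ** 3         # direct proof must get all three parts right
n = sum(budgets)            # same total budget of 60 samples

print(f"lemma-guided: {p_success_with_lemmas(p, budgets):.3f}")
print(f"direct:       {p_success_direct(p_direct, n):.3f}")
</code>

With these toy numbers the lemma-guided route succeeds with probability above 0.95, while the direct route stays below 0.5 on the same 60-sample budget, matching the qualitative claim that $p_{\text{direct}} \ll \prod_i p_i$ makes direct sampling wasteful.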
===== References =====

  * [[https://arxiv.org/abs/2506.19923|Prover Agent: An Agent-based Framework for Formal Theorem Proving (arXiv:2506.19923)]]
  * [[https://arxiv.org/abs/2505.05758|APOLLO: Automated Proof Repair via LLM and Lean Collaboration (arXiv:2505.05758)]]
  * [[https://arxiv.org/abs/2503.03205|MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought (arXiv:2503.03205)]]

===== See Also =====

  * [[agentic_uncertainty|Agentic Uncertainty]] -- uncertainty quantification in multi-step agent reasoning
  * [[automated_program_repair|Automated Program Repair]] -- related iterative code repair paradigm