Large language models are increasingly deployed as agents that collaborate with formal proof assistants like Lean to automatically prove mathematical theorems. By combining informal mathematical reasoning with machine-checkable verification, these systems achieve remarkable accuracy while maintaining formal rigor.
Formal theorem proving requires generating proofs in languages like Lean 4 where every step is machine-verified. A verification system can check correctness almost instantaneously, but generating a completely correct formal proof remains formidable. Traditional approaches prompt LLMs thousands of times hoping one proof passes verification. Agentic approaches instead use structured interaction between LLMs and the proof assistant.
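To make this concrete, here is a toy Lean 4 snippet in which every tactic step is checked by the kernel; the statements and tactics are illustrative only, not drawn from any benchmark:

```lean
-- Toy examples: each tactic invocation is machine-verified by Lean's kernel.
theorem double_eq (n : Nat) : n + n = 2 * n := by
  omega        -- linear-arithmetic decision procedure

example : 2 + 2 = 4 := by
  decide       -- evaluates the decidable proposition
```

If any step fails to type-check, Lean rejects the whole proof immediately, which is what makes verification such a cheap and reliable reward signal for an agent loop.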
Prover Agent (Baba et al., 2025) is an agent framework that coordinates informal reasoning LLMs with formal prover models and Lean verification, achieving 86.1% on MiniF2F.
The system coordinates three components:

- An informal reasoning LLM that analyzes the theorem and proposes auxiliary lemmas
- A formal prover model that generates candidate Lean proofs
- The Lean environment, which verifies candidate proofs and returns error feedback
APOLLO – Automated PrOof repair via LLM and Lean cOllaboration (Ospanov et al., 2025) – is a modular, model-agnostic agentic framework that repairs proofs using multi-agent collaboration with Lean.
APOLLO directs a fully automated pipeline:
- Automated tactics (`omega`, `simp`, `decide`) attempt to close simple goals

MA-LoT – Model-CollAboration Lean-based Long Chain-of-Thought (Wang et al., 2025, ICML) – is the first multi-agent framework for Lean 4 theorem proving that balances high-level natural language reasoning with formal language verification.
MA-LoT separates reasoning tasks through model collaboration: a natural-language model carries out high-level proof planning, while a Lean-specialized model handles formalization and verification.
Its training-inference pipeline elicits Long chain-of-thought (Long CoT) capability in LLMs without special data annotation, transferring the ability to engage in extended structured reasoning from pre-training to theorem-proving contexts.
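The collaboration pattern can be sketched as follows. This is a hypothetical illustration of the two-model loop described above; the class and method names (`long_cot`, `formalize`, `check`) are our own placeholders, not MA-LoT's actual API:

```python
# Hypothetical sketch of MA-LoT-style model collaboration.
# All names are illustrative placeholders, not MA-LoT's real interface.
class LongCoTCollaboration:
    def __init__(self, nl_model, lean_model, verifier):
        self.nl_model = nl_model      # high-level natural-language reasoner
        self.lean_model = lean_model  # Lean-4-specialized prover
        self.verifier = verifier      # Lean type checker

    def prove(self, theorem, rounds=3):
        # The NL model produces an extended chain-of-thought proof plan.
        plan = self.nl_model.long_cot(theorem)
        for _ in range(rounds):
            # The Lean model formalizes the plan into a candidate proof.
            proof = self.lean_model.formalize(theorem, plan)
            result = self.verifier.check(proof)
            if result.ok:
                return proof
            # Feed verifier errors back into the next round of reasoning.
            plan = self.nl_model.long_cot(theorem, errors=result.errors)
        return None
```

The key design choice is that the natural-language model never has to emit Lean syntax itself; it only has to reason, while the specialized model translates and the verifier arbitrates.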
```python
# Simplified Prover Agent loop
class ProverAgent:
    def __init__(self, informal_llm, formal_prover, lean_env):
        self.informal = informal_llm
        self.prover = formal_prover
        self.lean = lean_env

    def prove(self, theorem, max_refine=10):
        # Phase 1: Direct proving attempt
        proof = self.prover.generate(theorem)
        if self.lean.verify(proof):
            return proof

        # Phase 2: Lemma-guided decomposition
        lemmas = self.informal.generate_lemmas(theorem)
        verified_lemmas = []
        for lemma in lemmas:
            lemma_proof = self.prover.generate(lemma)
            if self.lean.verify(lemma_proof):
                verified_lemmas.append((lemma, lemma_proof))

        # Phase 3: Iterative refinement with Lean feedback
        context = verified_lemmas
        best_proof = self.prover.generate(theorem, context=context)
        for _ in range(max_refine):
            result = self.lean.check(best_proof)
            if result.success:
                return best_proof
            best_proof = self.prover.refine(best_proof, result.errors)
        return best_proof
```
| Method | Year | MiniF2F (%) | Key Innovation | Model Size |
|---|---|---|---|---|
| MA-LoT | 2025 | 61.07 | Multi-agent Long CoT | Various |
| APOLLO | 2025 | 84.9 | Proof repair agents | < 8B |
| Prover Agent | 2025 | 86.1 | Lemma decomposition + refinement | Ensemble |
The theoretical advantage of lemma-guided decomposition can be expressed as:
For a theorem requiring $m$ intermediate lemmas with individual success probabilities $p_i$:
$P(\text{success with lemmas}) = \prod_{i=1}^{m} (1 - (1-p_i)^{K_i})$
where $K_i$ is the sample budget for lemma $i$. Each factor $1 - (1-p_i)^{K_i}$ approaches 1 exponentially fast in $K_i$, so modest per-lemma budgets suffice. Direct proving with a total budget of $N$ samples instead yields:
$P(\text{direct success}) = 1 - (1 - p_{\text{direct}})^N$
where $p_{\text{direct}} \ll \prod_i p_i$ for hard theorems, so $N$ must be enormous before the direct-proving probability becomes non-negligible, while the decomposed search succeeds with a far smaller total budget $\sum_i K_i$.
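A quick numerical illustration of the two formulas above. The per-lemma success rates and budgets here are hypothetical, chosen only to show the qualitative gap:

```python
# Numerical comparison of lemma-guided vs. direct sampling success.
# All rates and budgets below are hypothetical illustrations.

def p_lemma_guided(ps, ks):
    """P(all lemmas proved), with k_i independent samples per lemma."""
    prob = 1.0
    for p, k in zip(ps, ks):
        prob *= 1 - (1 - p) ** k
    return prob

def p_direct(p, n):
    """P(at least one of n independent direct attempts verifies)."""
    return 1 - (1 - p) ** n

# Hypothetical: 3 lemmas at a 30% per-sample success rate, 16 samples each
# (48 samples total), vs. a 0.1% direct rate with the same 48-sample budget.
print(f"{p_lemma_guided([0.3, 0.3, 0.3], [16, 16, 16]):.4f}")  # ~0.990
print(f"{p_direct(0.001, 48):.4f}")                            # ~0.047
```

With the same total budget, decomposition turns a near-hopeless search into a near-certain one, which is the regime the inequality $p_{\text{direct}} \ll \prod_i p_i$ describes.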