LLM Agents for Formal Theorem Proving

Large language models are increasingly deployed as agents that collaborate with formal proof assistants like Lean to prove mathematical theorems automatically. By combining informal mathematical reasoning with machine-checkable verification, the strongest of these systems now solve the large majority of problems on benchmarks such as MiniF2F while guaranteeing formal rigor.

The Formal Verification Challenge

Formal theorem proving requires generating proofs in languages like Lean 4 where every step is machine-verified. A verification system can check correctness almost instantaneously, but generating a completely correct formal proof remains formidable. Traditional approaches prompt LLMs thousands of times hoping one proof passes verification. Agentic approaches instead use structured interaction between LLMs and the proof assistant.
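
For instance, even a one-line Lean 4 statement must pass the kernel's check before it counts as proved. A minimal example (`omega` is a standard Lean decision procedure for linear arithmetic):

```lean
-- The `omega` tactic closes this linear-arithmetic goal;
-- Lean's kernel then certifies the resulting proof term.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  omega
```

If the tactic fails to close the goal, Lean reports the exact location and remaining goal, which is precisely the feedback signal the agentic systems below exploit.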

Prover Agent: Lemma-Guided Decomposition

Prover Agent (Baba et al., 2025) is an agent framework that coordinates informal reasoning LLMs with formal prover models and Lean verification, achieving 86.1% on MiniF2F.

Architecture

The system coordinates three components:

  1. Informal reasoning LLM: Generates auxiliary lemmas to guide proof strategy
  2. Formal prover model: DeepSeek-Prover-V2, Goedel-Prover-V2, or their ensemble
  3. Lean verifier: Checks all proofs and provides error feedback

Agentic Workflow

  1. Direct proving: First attempts direct proof generation
  2. Lemma generation: If direct proving fails, the informal LLM generates auxiliary lemmas targeting essential intermediate facts $F_1, \dots, F_m$
  3. Lemma proving: Each lemma is individually proved, verified by Lean, and added to context
  4. Iterative refinement: Selects the best prior draft and revises based on Lean error messages (locations and messages), iterating up to $N_{\text{refine}}$ steps

APOLLO: Automated Proof Repair

APOLLO – Automated PrOof repair via LLM and Lean cOllaboration (Ospanov et al., 2025) – is a modular, model-agnostic agentic framework that repairs proofs using multi-agent collaboration with Lean.

Architecture

APOLLO directs a fully automated pipeline:

  1. LLM generates initial proof attempts
  2. A set of agents analyze the proofs and fix syntax errors
  3. Lean identifies mistakes and isolates failing sub-lemmas
  4. Automated solvers (e.g., omega, simp, decide) attempt to close simple goals
  5. LLM is invoked on each remaining goal with a low top-K budget
  6. Repaired sub-proofs are recombined and reverified
  7. Process iterates up to a user-controlled maximum number of attempts
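
The pipeline above can be sketched in a few lines. This is a minimal illustration, not APOLLO's actual interfaces: the callables `lean_check`, `try_tactic`, and `llm_fix`, and the string-splicing recombination, are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class CheckResult:
    success: bool
    failing_goals: list = field(default_factory=list)

AUTOMATED_TACTICS = ["omega", "simp", "decide"]  # cheap solvers tried first

def close_with_solvers(goal, try_tactic):
    """Try each automated solver in turn; return the first fix that works."""
    for tactic in AUTOMATED_TACTICS:
        fix = try_tactic(goal, tactic)
        if fix is not None:
            return fix
    return None

def repair_proof(proof, lean_check, try_tactic, llm_fix, max_attempts=5):
    """APOLLO-style repair loop (illustrative stubs, not the real API)."""
    for _ in range(max_attempts):
        result = lean_check(proof)        # Lean isolates failing sub-lemmas
        if result.success:
            return proof
        for goal in result.failing_goals:
            fix = close_with_solvers(goal, try_tactic)
            if fix is None:               # fall back to the LLM with a
                fix = llm_fix(goal)       # small top-K budget
            if fix is not None:
                proof = proof.replace(goal, fix)  # recombine, then reverify
    return None  # attempt budget exhausted
```

The loop mirrors the pipeline's ordering: automated solvers get first shot at each failing goal, and the LLM is only invoked on goals they cannot close.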

MA-LoT: Multi-Agent Long Chain-of-Thought

MA-LoT – Model-CollAboration Lean-based Long Chain-of-Thought (Wang et al., 2025, ICML) – is the first multi-agent framework for Lean 4 theorem proving that balances high-level natural language reasoning with formal language verification.

Architecture

MA-LoT splits the work across collaborating models: one agent drafts whole proofs guided by long chain-of-thought natural-language reasoning, while another revises failed drafts using Lean's verification feedback.

LoT-Transfer Learning

LoT-Transfer Learning is a training-and-inference pipeline that elicits Long CoT reasoning in LLMs without specially annotated data, transferring the extended structured reasoning ability acquired during pre-training to theorem-proving contexts.
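
A minimal sketch of the two-role collaboration, assuming one model that drafts with Long CoT and another that revises on Lean feedback; the callable names and signatures here are assumptions, not MA-LoT's published API:

```python
def ma_lot_prove(theorem, prover, corrector, lean_verify, max_rounds=3):
    """Two-model collaboration loop (illustrative sketch).

    prover(theorem)          -> Lean 4 draft guided by Long CoT reasoning
    corrector(theorem, d, e) -> revised draft, given Lean errors e
    lean_verify(draft)       -> (ok: bool, errors: list[str])
    """
    draft = prover(theorem)
    for _ in range(max_rounds):
        ok, errors = lean_verify(draft)
        if ok:
            return draft      # machine-checked proof
        draft = corrector(theorem, draft, errors)
    return None               # unverified after the round budget
```

The key design choice is the division of labor: the drafting model never sees raw Lean errors, and the correcting model never plans from scratch.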

Code Example

# Simplified Prover Agent loop
class ProverAgent:
    def __init__(self, informal_llm, formal_prover, lean_env):
        self.informal = informal_llm
        self.prover = formal_prover
        self.lean = lean_env
 
    def prove(self, theorem, max_refine=10):
        # Phase 1: Direct proving attempt
        proof = self.prover.generate(theorem)
        if self.lean.verify(proof):
            return proof
 
        # Phase 2: Lemma-guided decomposition
        lemmas = self.informal.generate_lemmas(theorem)
        verified_lemmas = []
        for lemma in lemmas:
            lemma_proof = self.prover.generate(lemma)
            if self.lean.verify(lemma_proof):
                verified_lemmas.append((lemma, lemma_proof))
 
        # Phase 3: Iterative refinement with Lean feedback
        context = verified_lemmas
        best_proof = self.prover.generate(theorem, context=context)
        for step in range(max_refine):
            result = self.lean.check(best_proof)
            if result.success:
                return best_proof
            best_proof = self.prover.refine(best_proof, result.errors)
        return best_proof

Comparison of Approaches

Method       | Year | MiniF2F (%) | Key Innovation                   | Model Size
------------ | ---- | ----------- | -------------------------------- | ----------
MA-LoT       | 2025 | 61.07       | Multi-agent Long CoT             | Various
APOLLO       | 2025 | 84.9        | Proof repair agents              | < 8B
Prover Agent | 2025 | 86.1        | Lemma decomposition + refinement | Ensemble

Mathematical Foundation

The theoretical advantage of lemma-guided decomposition can be expressed as:

For a theorem requiring $m$ intermediate lemmas with individual success probabilities $p_i$:

$P(\text{success with lemmas}) = \prod_{i=1}^{m} (1 - (1-p_i)^{K_i})$

where $K_i$ is the sample budget for lemma $i$ (assuming the verified lemmas suffice to assemble the final proof). Direct proving with $N$ samples instead succeeds with probability

$P(\text{direct success}) = 1 - (1 - p_{\text{direct}})^N$

For hard theorems $p_{\text{direct}} \ll \prod_{i=1}^{m} p_i$, because each lemma is far easier to prove per sample than the whole theorem, so decomposition reaches a high success probability with a much smaller total sample budget.
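
A quick numeric check makes the gap concrete. The per-sample rates below (0.3 per lemma, 0.001 direct) are hypothetical values chosen only for illustration, with an equal total budget of 48 samples in both settings:

```python
def p_lemmas(ps, ks):
    """Probability that every lemma gets at least one verified proof."""
    prob = 1.0
    for p, k in zip(ps, ks):
        prob *= 1 - (1 - p) ** k
    return prob

def p_direct(p, n):
    """Probability that at least one of n direct samples verifies."""
    return 1 - (1 - p) ** n

# Three lemmas at p_i = 0.3 with 16 samples each (48 total)
# versus 48 direct samples at p_direct = 0.001.
lemma_prob = p_lemmas([0.3, 0.3, 0.3], [16, 16, 16])  # ~0.99
direct_prob = p_direct(0.001, 48)                     # ~0.047
```

Under these assumed rates, decomposition is near-certain to succeed while direct sampling succeeds less than 5% of the time on the same budget.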
