LLM Agents for Formal Theorem Proving

Large language models are increasingly deployed as agents that collaborate with formal proof assistants like Lean to prove mathematical theorems automatically. By combining informal mathematical reasoning with machine-checkable verification, the strongest of these systems now solve the large majority of problems on benchmarks such as MiniF2F while guaranteeing formal rigor.

The Formal Verification Challenge

Formal theorem proving requires generating proofs in languages like Lean 4 where every step is machine-verified. A verification system can check correctness almost instantaneously, but generating a completely correct formal proof remains formidable. Traditional approaches prompt LLMs thousands of times hoping one proof passes verification. Agentic approaches instead use structured interaction between LLMs and the proof assistant.
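
For instance, even a one-line Lean 4 statement must pass the kernel's check before it counts as proved. A minimal example (`omega` is a standard Lean decision procedure for linear arithmetic):

```lean
-- The `omega` tactic closes this linear-arithmetic goal;
-- Lean's kernel then certifies the resulting proof term.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  omega
```

If the tactic fails to close the goal, Lean reports the exact location and remaining goal, which is precisely the feedback signal the agentic systems below exploit.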

Prover Agent: Lemma-Guided Decomposition

Prover Agent (Baba et al., 2025) is an agent framework that coordinates informal reasoning LLMs with formal prover models and Lean verification, achieving 86.1% on MiniF2F.

Architecture

The system coordinates three components:

  1. Informal reasoning LLM: Generates auxiliary lemmas to guide proof strategy
  2. Formal prover model: DeepSeek-Prover-V2, Goedel-Prover-V2, or their ensemble
  3. Lean verifier: Checks all proofs and provides error feedback

Agentic Workflow

  1. Direct proving: First attempts direct proof generation
  2. Lemma generation: If direct proving fails, the informal LLM generates auxiliary lemmas targeting essential intermediate facts $F_1, \dots, F_m$
  3. Lemma proving: Each lemma is individually proved, verified by Lean, and added to context
  4. Iterative refinement: Selects the best prior draft and revises based on Lean error messages (locations and messages), iterating up to $N_{\text{refine}}$ steps

APOLLO: Automated Proof Repair

APOLLO – Automated PrOof repair via LLM and Lean cOllaboration (Ospanov et al., 2025) – is a modular, model-agnostic agentic framework that repairs proofs using multi-agent collaboration with Lean.

Architecture

APOLLO directs a fully automated pipeline:

  1. LLM generates initial proof attempts
  2. A set of agents analyze the proofs and fix syntax errors
  3. Lean identifies mistakes and isolates failing sub-lemmas
  4. Automated solvers (e.g., omega, simp, decide) attempt to close simple goals
  5. LLM is invoked on each remaining goal with a low top-K budget
  6. Repaired sub-proofs are recombined and reverified
  7. Process iterates up to a user-controlled maximum number of attempts
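
The pipeline above can be sketched in a few lines. This is a minimal illustration, not APOLLO's actual interfaces: the callables `lean_check`, `try_tactic`, and `llm_fix`, and the string-splicing recombination, are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class CheckResult:
    success: bool
    failing_goals: list = field(default_factory=list)

AUTOMATED_TACTICS = ["omega", "simp", "decide"]  # cheap solvers tried first

def close_with_solvers(goal, try_tactic):
    """Try each automated solver in turn; return the first fix that works."""
    for tactic in AUTOMATED_TACTICS:
        fix = try_tactic(goal, tactic)
        if fix is not None:
            return fix
    return None

def repair_proof(proof, lean_check, try_tactic, llm_fix, max_attempts=5):
    """APOLLO-style repair loop (illustrative stubs, not the real API)."""
    for _ in range(max_attempts):
        result = lean_check(proof)        # Lean isolates failing sub-lemmas
        if result.success:
            return proof
        for goal in result.failing_goals:
            fix = close_with_solvers(goal, try_tactic)
            if fix is None:               # fall back to the LLM with a
                fix = llm_fix(goal)       # small top-K budget
            if fix is not None:
                proof = proof.replace(goal, fix)  # recombine, then reverify
    return None  # attempt budget exhausted
```

The loop mirrors the pipeline's ordering: automated solvers get first shot at each failing goal, and the LLM is only invoked on goals they cannot close.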

MA-LoT: Multi-Agent Long Chain-of-Thought

MA-LoT – Model-CollAboration Lean-based Long Chain-of-Thought (Wang et al., 2025, ICML) – is the first multi-agent framework for Lean 4 theorem proving that balances high-level natural language reasoning with formal language verification.

Architecture

MA-LoT splits the work across collaborating models: one agent drafts whole proofs guided by long chain-of-thought natural-language reasoning, while another revises failed drafts using Lean's verification feedback.

LoT-Transfer Learning

LoT-Transfer Learning is a training-and-inference pipeline that elicits Long CoT reasoning in LLMs without specially annotated data, transferring the extended structured reasoning ability acquired during pre-training to theorem-proving contexts.
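
A minimal sketch of the two-role collaboration, assuming one model that drafts with Long CoT and another that revises on Lean feedback; the callable names and signatures here are assumptions, not MA-LoT's published API:

```python
def ma_lot_prove(theorem, prover, corrector, lean_verify, max_rounds=3):
    """Two-model collaboration loop (illustrative sketch).

    prover(theorem)          -> Lean 4 draft guided by Long CoT reasoning
    corrector(theorem, d, e) -> revised draft, given Lean errors e
    lean_verify(draft)       -> (ok: bool, errors: list[str])
    """
    draft = prover(theorem)
    for _ in range(max_rounds):
        ok, errors = lean_verify(draft)
        if ok:
            return draft      # machine-checked proof
        draft = corrector(theorem, draft, errors)
    return None               # unverified after the round budget
```

The key design choice is the division of labor: the drafting model never sees raw Lean errors, and the correcting model never plans from scratch.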

Code Example

# Simplified Prover Agent loop
class ProverAgent:
    def __init__(self, informal_llm, formal_prover, lean_env):
        self.informal = informal_llm
        self.prover = formal_prover
        self.lean = lean_env
 
    def prove(self, theorem, max_refine=10):
        # Phase 1: Direct proving attempt
        proof = self.prover.generate(theorem)
        if self.lean.verify(proof):
            return proof
 
        # Phase 2: Lemma-guided decomposition
        lemmas = self.informal.generate_lemmas(theorem)
        verified_lemmas = []
        for lemma in lemmas:
            lemma_proof = self.prover.generate(lemma)
            if self.lean.verify(lemma_proof):
                verified_lemmas.append((lemma, lemma_proof))
 
        # Phase 3: Iterative refinement with Lean feedback
        context = verified_lemmas
        best_proof = self.prover.generate(theorem, context=context)
        for step in range(max_refine):
            result = self.lean.check(best_proof)
            if result.success:
                return best_proof
            best_proof = self.prover.refine(best_proof, result.errors)
        return best_proof

Comparison of Approaches

Method       | Year | MiniF2F (%) | Key Innovation                   | Model Size
------------ | ---- | ----------- | -------------------------------- | ----------
MA-LoT       | 2025 | 61.07       | Multi-agent Long CoT             | Various
APOLLO       | 2025 | 84.9        | Proof repair agents              | < 8B
Prover Agent | 2025 | 86.1        | Lemma decomposition + refinement | Ensemble

Mathematical Foundation

The theoretical advantage of lemma-guided decomposition can be expressed as:

For a theorem requiring $m$ intermediate lemmas with individual success probabilities $p_i$:

$P(\text{success with lemmas}) = \prod_{i=1}^{m} (1 - (1-p_i)^{K_i})$

where $K_i$ is the sample budget for lemma $i$ (assuming the verified lemmas suffice to assemble the final proof). Direct proving with $N$ samples instead succeeds with probability

$P(\text{direct success}) = 1 - (1 - p_{\text{direct}})^N$

For hard theorems $p_{\text{direct}} \ll \prod_{i=1}^{m} p_i$, because each lemma is far easier to prove per sample than the whole theorem, so decomposition reaches a high success probability with a much smaller total sample budget.
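
A quick numeric check makes the gap concrete. The per-sample rates below (0.3 per lemma, 0.001 direct) are hypothetical values chosen only for illustration, with an equal total budget of 48 samples in both settings:

```python
def p_lemmas(ps, ks):
    """Probability that every lemma gets at least one verified proof."""
    prob = 1.0
    for p, k in zip(ps, ks):
        prob *= 1 - (1 - p) ** k
    return prob

def p_direct(p, n):
    """Probability that at least one of n direct samples verifies."""
    return 1 - (1 - p) ** n

# Three lemmas at p_i = 0.3 with 16 samples each (48 total)
# versus 48 direct samples at p_direct = 0.001.
lemma_prob = p_lemmas([0.3, 0.3, 0.3], [16, 16, 16])  # ~0.99
direct_prob = p_direct(0.001, 48)                     # ~0.047
```

Under these assumed rates, decomposition is near-certain to succeed while direct sampling succeeds less than 5% of the time on the same budget.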
