====== LLM Agents for Formal Theorem Proving ======

Large language models are increasingly deployed as agents that collaborate with formal proof assistants such as Lean to prove mathematical theorems automatically. By combining informal mathematical reasoning with machine-checkable verification, these systems reach high pass rates on benchmarks such as MiniF2F while maintaining formal rigor.

===== The Formal Verification Challenge =====

Formal theorem proving requires generating proofs in languages such as Lean 4, where every step is machine-verified. A verification system can check correctness almost instantaneously, but generating a fully correct formal proof remains formidable. Traditional approaches prompt LLMs thousands of times in the hope that one proof passes verification. Agentic approaches instead use structured interaction between LLMs and the proof assistant.

===== Prover Agent: Lemma-Guided Decomposition =====

**Prover Agent** (Baba et al., 2025) is an agent framework that coordinates informal reasoning LLMs with formal prover models and Lean verification, achieving 86.1% on MiniF2F.
=== Architecture ===

The system coordinates three components:

  - **Informal reasoning LLM**: generates auxiliary lemmas to guide proof strategy
  - **Formal prover model**: DeepSeek-Prover-V2, Goedel-Prover-V2, or their ensemble
  - **Lean verifier**: checks all proofs and provides error feedback

=== Agentic Workflow ===

  - **Direct proving**: first attempts direct proof generation
  - **Lemma generation**: if direct proving fails, the informal LLM generates auxiliary lemmas targeting essential intermediate facts $F_1, \dots, F_m$
  - **Lemma proving**: each lemma is individually proved, verified by Lean, and added to context
  - **Iterative refinement**: selects the best prior draft and revises it based on Lean error messages (locations and messages), iterating up to $N_{\text{refine}}$ steps

=== Theoretical Results ===

  * **Theorem 4.4**: lemma decomposition requires exponentially fewer trials than direct proving ($N_{\text{lem}} \ll N_{\text{dir}}$)
  * **Theorem 4.5**: lemmas are most efficient for hard problems, where the direct success probability satisfies $p \leq \tau$
  * **Theorem 4.6**: optimal lemmas balance subproblem difficulty
  * Success probability grows exponentially with the mutual information $I(Z; Y_{1:n})$ contributed by lemmas and feedback

=== Results ===

  * **MiniF2F**: 86.1% pass rate (state of the art for small language models)
  * Ablations confirm that both lemma generation and iterative refinement are essential

===== APOLLO: Automated Proof Repair =====

**APOLLO** -- Automated PrOof repair via LLM and Lean cOllaboration (Ospanov et al., 2025) -- is a modular, model-agnostic agentic framework that repairs proofs through multi-agent collaboration with Lean.
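The exponential gap behind Theorem 4.4 can be seen in a toy model. This is a minimal sketch under an illustrative assumption (not the paper's exact setting): a proof consists of $m$ independent steps, each of which a prover gets right with probability $q$ per sample.

<code python>
# Toy illustration of the intuition behind Theorem 4.4: lemma
# decomposition needs exponentially fewer trials than direct proving.
# Assumption (for illustration only): a proof has m independent steps,
# each sampled correctly with probability q.

def expected_trials_direct(q: float, m: int) -> float:
    # A direct attempt succeeds only if all m steps are right at once,
    # so the per-sample success probability is q**m and the expected
    # number of trials (geometric distribution) is q**(-m).
    return q ** -m

def expected_trials_lemmas(q: float, m: int) -> float:
    # With one lemma per step, each lemma is resampled until it
    # verifies, costing 1/q trials on average, for m/q total:
    # linear in m rather than exponential.
    return m / q

for m in (2, 4, 8):
    direct = expected_trials_direct(0.5, m)
    lemmas = expected_trials_lemmas(0.5, m)
    print(f"m={m}: direct ~{direct:.0f} trials, lemma-guided ~{lemmas:.0f}")
</code>

With $q = 0.5$, direct proving needs roughly $2^m$ trials while the lemma-guided route needs only $2m$, which is the qualitative shape of the paper's bound.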
=== Architecture ===

APOLLO directs a fully automated pipeline:

  - The LLM generates initial proof attempts
  - A set of agents analyzes the proofs and fixes syntax errors
  - Lean identifies mistakes and isolates failing sub-lemmas
  - Automated solvers (e.g., ''omega'', ''simp'', ''decide'') attempt to close simple goals
  - The LLM is invoked on each remaining goal with a low top-K budget
  - Repaired sub-proofs are recombined and reverified
  - The process iterates up to a user-controlled maximum number of attempts

=== Key Features ===

  * **Model-agnostic**: works with any LLM backend
  * **Low sampling budget**: achieves strong results without thousands of samples
  * **Sub-lemma isolation**: decomposes failing proofs into independently solvable goals

=== Results ===

  * **MiniF2F**: 84.9% accuracy among sub-8B-parameter models (state of the art at time of publication)
  * Significantly lower token and sampling budgets than competing approaches

===== MA-LoT: Multi-Agent Long Chain-of-Thought =====

**MA-LoT** -- Model-CollAboration Lean-based Long Chain-of-Thought (Wang et al., 2025, ICML) -- is the first multi-agent framework for Lean 4 theorem proving that balances high-level natural-language reasoning with formal-language verification.

=== Architecture ===

MA-LoT separates cognition tasks through model collaboration:

  * **Whole-proof generation agent**: uses general natural-language reasoning to generate complete proof drafts
  * **Error analysis agent**: analyzes Lean compiler errors and suggests corrections
  * **Lean 4 verifier**: provides structured feedback at each step of the Long CoT

=== LoT-Transfer Learning ===

A novel training-inference pipeline that enables Long CoT thinking in LLMs without special data annotation, transferring the ability to engage in extended structured reasoning from pre-training to theorem-proving contexts.
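The APOLLO pipeline steps above can be sketched as a repair loop. This is a hedged, runnable sketch, not the authors' implementation: the goal representation, the solver stubs, and ''llm_repair'' are all hypothetical stand-ins for calls to Lean and an actual LLM.

<code python>
# Minimal sketch of an APOLLO-style repair loop: try cheap automated
# solvers on each isolated failing goal first, then fall back to a
# low-budget LLM call, and iterate on whatever remains open.
# All names and the toy string "goals" are illustrative assumptions.
from typing import Callable, Dict, List, Optional

# Stub solvers: a goal is "closed" when a solver recognizes it.
AUTOMATED_SOLVERS: Dict[str, Callable[[str], bool]] = {
    "omega": lambda goal: "linear-arith" in goal,   # stub for Lean's omega
    "decide": lambda goal: "decidable" in goal,     # stub for Lean's decide
}

def llm_repair(goal: str) -> Optional[str]:
    # Stub for a low-sampling-budget LLM call on one isolated goal.
    return None if "hard" in goal else f"by llm_tactic -- {goal}"

def apollo_repair(failing_goals: List[str], max_attempts: int = 3) -> Dict[str, str]:
    """Map each closable goal to the tactic that closed it."""
    repaired: Dict[str, str] = {}
    remaining = list(failing_goals)
    for _ in range(max_attempts):
        still_open: List[str] = []
        for goal in remaining:
            # Step 1: cheap automated solvers (omega, simp, decide, ...).
            tactic = next((name for name, solves in AUTOMATED_SOLVERS.items()
                           if solves(goal)), None)
            # Step 2: fall back to the LLM with a low top-K budget.
            if tactic is None:
                tactic = llm_repair(goal)
            if tactic is not None:
                repaired[goal] = tactic      # sub-proof recombined later
            else:
                still_open.append(goal)
        if not still_open:
            break
        remaining = still_open               # iterate on what is left
    return repaired
</code>

For example, ''apollo_repair(["linear-arith g1", "decidable g2", "g3"])'' closes the first two goals via the solver stubs and the third via the LLM stub; a goal no component can close simply stays open after ''max_attempts'' rounds.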
=== Results ===

  * **MiniF2F-Test (Lean 4)**: 61.07% accuracy
  * Outperforms DeepSeek-V3 (33.61%), InternLM-Step-Prover tree search (50.70%), and Goedel-Prover whole-proof generation (55.33%)

===== Code Example =====

<code python>
# Simplified Prover Agent loop
class ProverAgent:
    def __init__(self, informal_llm, formal_prover, lean_env):
        self.informal = informal_llm
        self.prover = formal_prover
        self.lean = lean_env

    def prove(self, theorem, max_refine=10):
        # Phase 1: direct proving attempt
        proof = self.prover.generate(theorem)
        if self.lean.verify(proof):
            return proof

        # Phase 2: lemma-guided decomposition
        lemmas = self.informal.generate_lemmas(theorem)
        verified_lemmas = []
        for lemma in lemmas:
            lemma_proof = self.prover.generate(lemma)
            if self.lean.verify(lemma_proof):
                verified_lemmas.append((lemma, lemma_proof))

        # Phase 3: iterative refinement with Lean feedback
        best_proof = self.prover.generate(theorem, context=verified_lemmas)
        for step in range(max_refine):
            result = self.lean.check(best_proof)
            if result.success:
                return best_proof
            best_proof = self.prover.refine(best_proof, result.errors)
        return best_proof
</code>

===== Comparison of Approaches =====

^ Method ^ Year ^ MiniF2F (%) ^ Key Innovation ^ Model Size ^
| MA-LoT | 2025 | 61.07 | Multi-agent Long CoT | Various |
| APOLLO | 2025 | 84.9 | Proof repair agents | < 8B |
| Prover Agent | 2025 | 86.1 | Lemma decomposition + refinement | Ensemble |

===== Mathematical Foundation =====

The theoretical advantage of lemma-guided decomposition can be expressed as follows. For a theorem requiring $m$ intermediate lemmas with individual success probabilities $p_i$:

$P(\text{success with lemmas}) = \prod_{i=1}^{m} \left(1 - (1-p_i)^{K_i}\right)$

where $K_i$ is the sample budget per lemma. This grows far faster with the sampling budget than direct proving:

$P(\text{direct success}) = 1 - (1 - p_{\text{direct}})^{N}$

where $p_{\text{direct}} \ll \prod_i p_i$ for hard theorems.
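The two formulas can be compared numerically under an equal total sample budget. The parameter values below ($p_i$, $K_i$, and hence $N$) are illustrative assumptions, not figures from the papers:

<code python>
# Numeric comparison of the two success-probability formulas above,
# holding the total sample budget fixed. Parameter values are
# illustrative assumptions, not taken from the papers.

def p_success_with_lemmas(p, budgets):
    """P = prod_i (1 - (1 - p_i)^{K_i})."""
    prob = 1.0
    for p_i, k_i in zip(p, budgets):
        prob *= 1.0 - (1.0 - p_i) ** k_i
    return prob

def p_success_direct(p_direct, n):
    """P = 1 - (1 - p_direct)^N."""
    return 1.0 - (1.0 - p_direct) ** n

# Three lemmas, each provable with probability 0.2 per sample.
p = [0.2, 0.2, 0.2]
budgets = [20, 20, 20]      # K_i samples per lemma
p_direct = 0.2 ** 3         # direct proof must get all three parts right
n = sum(budgets)            # same total budget of 60 samples

print(f"lemma-guided: {p_success_with_lemmas(p, budgets):.3f}")
print(f"direct:       {p_success_direct(p_direct, n):.3f}")
</code>

With these toy numbers the lemma-guided route succeeds with probability above 0.95, while the direct route stays below 0.5 on the same 60-sample budget, matching the qualitative claim that $p_{\text{direct}} \ll \prod_i p_i$ makes direct sampling wasteful.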
===== References =====

  * [[https://arxiv.org/abs/2506.19923|Prover Agent: An Agent-based Framework for Formal Theorem Proving (arXiv:2506.19923)]]
  * [[https://arxiv.org/abs/2505.05758|APOLLO: Automated Proof Repair via LLM and Lean Collaboration (arXiv:2505.05758)]]
  * [[https://arxiv.org/abs/2503.03205|MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought (arXiv:2503.03205)]]

===== See Also =====

  * [[agentic_uncertainty|Agentic Uncertainty]] -- uncertainty quantification in multi-step agent reasoning
  * [[automated_program_repair|Automated Program Repair]] -- related iterative code repair paradigm