VERGE (Formal Refinement and Guidance Engine) is a neurosymbolic framework that combines Large Language Models with SMT (Satisfiability Modulo Theories) solvers to provide formal verification guarantees on LLM-generated reasoning. By decomposing outputs into atomic claims, autoformalizing verifiable claims into first-order logic, and checking them via theorem proving, VERGE achieves rigorous correctness guarantees that purely probabilistic methods cannot offer.
LLMs produce fluent but unreliable reasoning, particularly on logical and mathematical tasks. Prior approaches to verification rely on self-consistency or ensemble voting, which remain probabilistic. VERGE bridges the gap between neural generation and formal methods by routing claims through either hard verification (SMT solvers) or soft verification (LLM judge ensembles) based on claim type.
The core challenge is the translation bottleneck: converting natural-language claims into formal logic is error-prone, and this has limited prior neurosymbolic approaches. VERGE overcomes it through iterative refinement guided by solver feedback.
The VERGE pipeline consists of four stages:
Given an LLM-generated answer $A$, decompose it into a set of atomic claims $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$ where each $c_i$ is a minimal, independently verifiable statement. This enables targeted verification and error isolation.
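The decomposition step can be sketched as a minimal data structure plus a stand-in splitter. The `AtomicClaim` class and the naive sentence split below are illustrative assumptions — VERGE's actual decomposer is LLM-driven:

```python
from dataclasses import dataclass

@dataclass
class AtomicClaim:
    """One minimal, independently verifiable statement c_i."""
    id: int
    text: str

def decompose(answer: str) -> list[AtomicClaim]:
    """Split an answer A into atomic claims C = {c_1, ..., c_n}.

    Stand-in heuristic: sentence splitting. The real decomposer is an
    LLM prompted to extract minimal, self-contained statements.
    """
    sentences = [s.strip() for s in answer.replace("\n", " ").split(". ") if s.strip()]
    return [AtomicClaim(i, s.rstrip(".")) for i, s in enumerate(sentences, start=1)]
```

The point of the structure is that each claim carries a stable id, so later stages (routing, formalization, MCS feedback) can reference individual claims rather than the whole answer.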
Each claim is classified into one of two verification paths: a formal path for claims that can be stated in decidable logic (routed to the SMT solver), and a soft path for claims that resist formalization (routed to the LLM judge ensemble).
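A minimal routing sketch, assuming a surface-level heuristic in place of the LLM-based classifier the framework actually uses (the `route_claim` function and its marker set are illustrative):

```python
def route_claim(claim_text: str) -> str:
    """Route a claim to 'formal' (SMT) or 'soft' (LLM judge) verification.

    Stand-in heuristic: claims containing digits or arithmetic/comparison
    symbols are assumed formalizable; everything else goes to the soft path.
    """
    formal_markers = set("0123456789") | {"=", "<", ">", "+", "-", "*", "/"}
    if any(ch in formal_markers for ch in claim_text):
        return "formal"
    return "soft"
```

In practice the router must also catch logically structured claims without arithmetic surface forms ("all X are Y"), which is why a learned classifier is needed rather than a symbol check.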
Claims on the formal path are translated by the LLM into decidable fragments of first-order logic (FOL):
Given claim $c_i$ and context premises $\Pi$, the formalization produces:
$$\phi_i = \text{Formalize}(c_i, \Pi) \in \mathcal{L}_{\text{FOL}}$$
The conjunction of formalized claims and premises is submitted to an SMT solver (e.g., Z3):
$$\text{SMT-Check}\left(\Pi \wedge \bigwedge_{i} \phi_i\right) \in \{\text{sat}, \text{unsat}, \text{unknown}\}$$
A key innovation in VERGE is the use of Minimal Correction Subsets (MCS) when verification fails. Instead of discarding the entire answer on an unsat result, MCS computes the minimal subset of faulty claims:
$$\text{MCS}(\Phi) = \arg\min_{S \subseteq \Phi} |S| \text{ s.t. } (\Phi \setminus S) \text{ is satisfiable}$$
This provides precise, targeted feedback to the LLM refiner, enabling surgical correction rather than wholesale regeneration. The unsat core from the SMT solver identifies conflicting assertions directly.
Each claim receives a rigor-weighted verification score. Formally verified claims receive maximum confidence (1.0), while soft-verified claims are capped at 0.9. The aggregate answer score includes a variance penalty to detect internal contradictions:
$$\mathcal{S}(\mathcal{A}) = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} w_i \cdot s_i - \lambda \cdot \text{Var}(\{s_i\})$$
where $w_i$ are rigor weights, $s_i$ are per-claim scores, and $\lambda$ penalizes high variance (indicating contradictions). The system iterates until convergence or a score threshold is met.
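The aggregate score is a direct transcription of the formula above; the default value of `lam` here is an assumption for illustration, not the paper's setting:

```python
from statistics import pvariance

def answer_score(scores, weights, lam=0.5):
    """Aggregate answer score S(A): rigor-weighted mean minus a variance penalty.

    scores  -- per-claim scores s_i (1.0 for formally verified, <= 0.9 soft)
    weights -- rigor weights w_i
    lam     -- variance penalty coefficient λ (assumed value)
    """
    assert len(scores) == len(weights) and scores
    weighted_mean = sum(w * s for w, s in zip(weights, scores)) / len(scores)
    return weighted_mean - lam * pvariance(scores)
```

The variance term leaves a uniformly verified answer untouched (variance zero) but pulls down answers that mix confidently verified and dubious claims, which is the signature of an internal contradiction.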
```python
from z3 import Solver, Not, sat, unsat

def verify_arithmetic_claim(claim, premises) -> dict:
    """Verify a formalized arithmetic claim using the Z3 SMT solver.

    `claim` and `premises` are Z3 boolean expressions, not strings:
    calling eval() on model-generated text is unsafe and fragile.
    """
    solver = Solver()
    for premise in premises:
        solver.add(premise)              # assert the context premises Π
    # Validity check: the claim follows from Π iff Π ∧ ¬claim is unsat.
    solver.push()
    solver.add(Not(claim))
    result = solver.check()              # check once and reuse the result
    if result == unsat:
        solver.pop()
        return {"status": "verified", "confidence": 1.0}
    if result == sat:
        counterexample = solver.model()  # read the model before pop()
        solver.pop()
        return {"status": "refuted", "counterexample": str(counterexample)}
    solver.pop()
    return {"status": "unknown", "confidence": 0.5}

def compute_mcs(claims, solver: Solver) -> list:
    """Identify faulty claims via the solver's unsat core.

    `claims` are (label, formula) pairs; assert_and_track attaches the
    labels so the unsat core can name the conflicting assertions.
    """
    for label, formula in claims:
        solver.assert_and_track(formula, label)
    if solver.check() == unsat:
        core = {str(lit) for lit in solver.unsat_core()}
        return [(label, f) for label, f in claims if label in core]
    return []
```
With GPT-OSS-120B as the base model, VERGE achieves:
Models below 20B parameters achieve only ~30% valid formalizations, establishing a practical lower bound for the approach.