====== Formal Verification of LLM Reasoning ======
VERGE (Formal Refinement and Guidance Engine) is a neurosymbolic framework that combines Large Language Models with SMT (Satisfiability Modulo Theories) solvers to provide formal verification guarantees on LLM-generated reasoning. By decomposing outputs into atomic claims, autoformalizing verifiable claims into first-order logic, and checking them via theorem proving, VERGE achieves rigorous correctness guarantees that purely probabilistic methods cannot offer.
<code mermaid>
graph LR
A[LLM Output] --> B[Decompose into Atomic Claims]
B --> C{Semantic Router}
C -->|Formal| D[Formalize to FOL]
C -->|Soft| E[LLM Judge Ensemble]
D --> F[SMT Solver Z3]
F -->|Verified| G[Accept Claim]
F -->|Refuted| H[Compute MCS]
H --> I[Refine with LLM]
E --> G
I --> A
</code>
===== Background =====
LLMs produce fluent but unreliable reasoning, particularly on logical and mathematical tasks. Prior approaches to verification rely on self-consistency or ensemble voting, which remain probabilistic. VERGE bridges the gap between neural generation and formal methods by routing claims through either //hard verification// (SMT solvers) or //soft verification// (LLM judge ensembles) based on claim type.
The core challenge is the **translation bottleneck**: converting natural-language claims into formal logic is error-prone, and this has limited prior neurosymbolic approaches. VERGE mitigates the bottleneck through iterative refinement guided by solver feedback.
===== Architecture =====
The VERGE pipeline consists of four stages:
=== 1. Atomic Claim Decomposition ===
Given an LLM-generated answer $A$, decompose it into a set of atomic claims $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$ where each $c_i$ is a minimal, independently verifiable statement. This enables targeted verification and error isolation.
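As a concrete sketch, each atomic claim can be carried as a small record. The sentence-splitting decomposer below is a toy stand-in for the LLM-driven decomposition the paper describes (the ''AtomicClaim'' type and ''decompose'' function are illustrative; a real decomposer must also resolve pronouns so each claim stands alone):

<code python>
from dataclasses import dataclass

@dataclass
class AtomicClaim:
    text: str      # natural-language statement
    claim_id: int  # index within the answer

def decompose(answer: str) -> list[AtomicClaim]:
    """Toy decomposition: split on sentence boundaries.
    VERGE performs this step with an LLM, not string splitting."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return [AtomicClaim(text=s, claim_id=i) for i, s in enumerate(sentences)]

claims = decompose("x is even. x is greater than 4. Therefore x >= 6.")
</code>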
=== 2. Semantic Routing ===
Each claim is classified into one of two verification paths:
* **Formal path** -- Logical and mathematical claims amenable to first-order logic encoding
* **Soft path** -- Commonsense or vague claims verified by LLM judge ensembles
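The two-way split can be illustrated with a simple keyword heuristic; this is a stand-in for VERGE's semantic router, not its actual classifier (the marker list and function name are invented for this sketch):

<code python>
# Vocabulary suggesting a claim is amenable to formal encoding
FORMAL_MARKERS = ("greater than", "less than", "equals", "sum",
                  "divisible", "implies", "for all", "at least",
                  ">=", "<=", "+", "=")

def route_claim(claim: str) -> str:
    """Heuristic stand-in for the semantic router: claims with
    mathematical or logical vocabulary go to the formal path,
    everything else to the soft (LLM-judge) path."""
    lowered = claim.lower()
    if any(marker in lowered for marker in FORMAL_MARKERS):
        return "formal"
    return "soft"
</code>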
=== 3. Autoformalization ===
Claims on the formal path are translated by the LLM into decidable fragments of first-order logic (FOL):
* **QF-UF** -- Quantifier-Free Uninterpreted Functions
* **LIA** -- Linear Integer Arithmetic
Given claim $c_i$ and context premises $\Pi$, the formalization produces:
$$\phi_i = \text{Formalize}(c_i, \Pi) \in \mathcal{L}_{\text{FOL}}$$
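For example, a claim such as //x is even and greater than 4// might be formalized into the LIA fragment with z3py (the encoding and the auxiliary witness variable are illustrative, not taken from the paper):

<code python>
from z3 import Int, And, Solver, sat

# Hypothetical output of the Formalize step for the claim
# "x is even and greater than 4" under the premise "x is an integer"
x = Int("x")
k = Int("k")                   # auxiliary witness for evenness in LIA
phi = And(x == 2 * k, x > 4)   # phi_i, in the LIA fragment

# Sanity check: phi is satisfiable (e.g., x = 6 with k = 3)
s = Solver()
s.add(phi)
result = s.check()
</code>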
=== 4. SMT Verification ===
The conjunction of formalized claims and premises is submitted to an SMT solver (e.g., Z3):
$$\text{SMT-Check}\left(\Pi \wedge \bigwedge_{i} \phi_i\right) \in \{\text{sat}, \text{unsat}, \text{unknown}\}$$
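A minimal z3py sketch of this check, with hypothetical premises and one formalized claim (the specific constraints are invented for illustration):

<code python>
from z3 import Solver, Int, sat, unsat

x = Int("x")
premises = [x > 4, x < 10]   # context premises Pi
claims_fol = [x >= 5]        # formalized claims phi_i (LIA)

solver = Solver()
solver.add(*premises)        # assert Pi
solver.add(*claims_fol)      # assert the conjunction of the phi_i
result = solver.check()      # one of sat / unsat / unknown

# Injecting a contradictory claim flips the verdict to unsat
solver.add(x < 5)
result_bad = solver.check()
</code>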
===== Minimal Correction Subsets =====
A key innovation in VERGE is the use of **Minimal Correction Subsets (MCS)** when verification fails. Instead of discarding the entire answer on an ''unsat'' result, MCS computes the minimal subset of faulty claims:
$$\text{MCS}(\Phi) = \arg\min_{S \subseteq \Phi} |S| \text{ s.t. } (\Phi \setminus S) \text{ is satisfiable}$$
This provides precise, targeted feedback to the LLM refiner, enabling surgical correction rather than wholesale regeneration. The unsat core from the SMT solver identifies conflicting assertions directly.
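One standard way to obtain a correction subset is the classic grow-an-MSS construction: greedily build a maximal satisfiable subset and return its complement. The sketch below is this generic algorithm, not VERGE's exact procedure, and it yields a correction set that is minimal by inclusion rather than minimum-cardinality:

<code python>
from z3 import Solver, Int, sat

def correction_subset(formulas: list) -> list:
    """Grow a maximal satisfiable subset (MSS) of the formulas;
    the complement of an MSS is a minimal correction subset."""
    solver = Solver()
    mcs = []
    for f in formulas:
        solver.push()
        solver.add(f)
        if solver.check() != sat:
            solver.pop()     # f conflicts with the kept set: correct it
            mcs.append(f)
    return mcs

x = Int("x")
mcs = correction_subset([x > 0, x < 10, x > 20])
</code>

Because a formula rejected against the growing set is also inconsistent with every superset of it, the kept set is maximal when the loop ends, so the returned complement is a genuine MCS.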
===== Scoring and Iteration =====
Each claim receives a rigor-weighted verification score. Formally verified claims receive maximum confidence (1.0), while soft-verified claims are capped at 0.9. The aggregate answer score includes a variance penalty to detect internal contradictions:
$$\mathcal{S}(\mathcal{A}) = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} w_i \cdot s_i - \lambda \cdot \text{Var}(\{s_i\})$$
where $w_i$ are rigor weights, $s_i$ are per-claim scores, and $\lambda$ penalizes high variance (indicating contradictions). The system iterates until convergence or a score threshold is met.
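A direct transcription of the scoring formula (the weights and the value of $\lambda$ below are illustrative choices, not defaults from the paper):

<code python>
from statistics import pvariance

def answer_score(scores: list[float], weights: list[float],
                 lam: float = 0.5) -> float:
    """S(A) = (1/|C|) * sum_i w_i * s_i  -  lambda * Var({s_i}).
    lam = 0.5 is an illustrative choice."""
    n = len(scores)
    weighted_mean = sum(w * s for w, s in zip(weights, scores)) / n
    return weighted_mean - lam * pvariance(scores)

# Two formally verified claims (score 1.0) and one soft-verified claim
# capped at 0.9; rigor weights set to 1.0 throughout for simplicity.
score = answer_score([1.0, 1.0, 0.9], [1.0, 1.0, 1.0])
</code>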
===== Code Example =====
<code python>
from z3 import Solver, Not, sat, unsat

def verify_arithmetic_claim(claim_fol: str, premises: list[str],
                            env: dict) -> dict:
    """Verify a formalized arithmetic claim using the Z3 SMT solver.

    ``env`` maps variable names to Z3 terms, e.g. {"x": Int("x")}.
    ``eval`` stands in for a real FOL parser; never eval untrusted input.
    """
    solver = Solver()
    # Add premises as constraints
    for premise in premises:
        solver.add(eval(premise, {"Not": Not}, env))
    # Validity check: the premises entail the claim iff
    # premises AND NOT(claim) is unsatisfiable
    solver.add(Not(eval(claim_fol, {"Not": Not}, env)))
    result = solver.check()  # run the solver once and reuse the result
    if result == unsat:
        return {"status": "verified", "confidence": 1.0}
    if result == sat:
        # The satisfying model is a counterexample to the claim
        return {"status": "refuted", "counterexample": str(solver.model())}
    return {"status": "unknown", "confidence": 0.5}

def compute_mcs(claims: list[str], formulas: list) -> list[str]:
    """Identify faulty claims via Z3's unsat core (a proxy for the MCS)."""
    solver = Solver()
    for label, formula in zip(claims, formulas):
        # Track each assertion under its claim label so the unsat core
        # can name the conflicting assertions
        solver.assert_and_track(formula, label)
    solver.check()
    core = {str(b) for b in solver.unsat_core()}
    return [c for c in claims if c in core]
</code>
===== Results =====
With GPT-OSS-120B as the base model, VERGE achieves:
* **18.7% average performance uplift** over single-pass baselines across reasoning benchmarks
* Convergence on all evaluated datasets
* Formalization reliability scales with model size; models above roughly 70B parameters produce consistently valid formalizations
* Outperforms probabilistic self-refinement approaches
Models below 20B parameters achieve only ~30% valid formalizations, establishing a practical lower bound for the approach.
===== References =====
* [[https://arxiv.org/abs/2601.20055|VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning (2026)]]
* [[https://arxiv.org/pdf/2601.20055|VERGE Full Paper PDF]]
* [[https://huggingface.co/papers/2601.20055|VERGE on Hugging Face Papers]]
===== See Also =====
* [[mcts_llm_reasoning]] -- Tree search methods for structured LLM reasoning
* [[task_decomposition_strategies]] -- Decomposing complex tasks into verifiable subtasks
* [[agentic_rag]] -- Retrieval-augmented generation with agent capabilities