====== Formal Verification of LLM Reasoning ======
VERGE (Formal Refinement and Guidance Engine) is a neurosymbolic framework that combines Large Language Models with SMT (Satisfiability Modulo Theories) solvers to provide formal verification guarantees on LLM-generated reasoning. By decomposing outputs into atomic claims, autoformalizing verifiable claims into first-order logic, and checking them via theorem proving, VERGE achieves rigorous correctness guarantees that purely probabilistic methods cannot offer.
<code mermaid>
graph LR
A[LLM Output] --> B[Decompose into Atomic Claims]
B --> C{Semantic Router}
C -->|Formal| D[Formalize to FOL]
C -->|Soft| E[LLM Judge Ensemble]
D --> F[SMT Solver Z3]
F -->|Verified| G[Accept Claim]
F -->|Refuted| H[Compute MCS]
H --> I[Refine with LLM]
E --> G
I --> A
</code>
===== Background =====
LLMs produce fluent but unreliable reasoning, particularly on logical and mathematical tasks. Prior approaches to verification rely on self-consistency or ensemble voting, which remain probabilistic. VERGE bridges the gap between neural generation and formal methods by routing claims through either //hard verification// (SMT solvers) or //soft verification// (LLM judge ensembles) based on claim type.
The core challenge is the **translation bottleneck**: converting natural-language claims into formal logic is error-prone, and this has limited prior neurosymbolic approaches. VERGE mitigates the bottleneck through iterative refinement guided by solver feedback.
===== Architecture =====
The VERGE pipeline consists of four stages:
=== 1. Atomic Claim Decomposition ===
Given an LLM-generated answer $A$, decompose it into a set of atomic claims $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$ where each $c_i$ is a minimal, independently verifiable statement. This enables targeted verification and error isolation.
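As a concrete sketch, each atomic claim can be carried as a small record. The sentence-splitting decomposer below is a toy stand-in for the LLM-driven decomposition the paper describes (the ''AtomicClaim'' type and ''decompose'' function are illustrative; a real decomposer must also resolve pronouns so each claim stands alone):

<code python>
from dataclasses import dataclass

@dataclass
class AtomicClaim:
    text: str      # natural-language statement
    claim_id: int  # index within the answer

def decompose(answer: str) -> list[AtomicClaim]:
    """Toy decomposition: split on sentence boundaries.
    VERGE performs this step with an LLM, not string splitting."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return [AtomicClaim(text=s, claim_id=i) for i, s in enumerate(sentences)]

claims = decompose("x is even. x is greater than 4. Therefore x >= 6.")
</code>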
=== 2. Semantic Routing ===
Each claim is classified into one of two verification paths:
* **Formal path** -- Logical and mathematical claims amenable to first-order logic encoding
* **Soft path** -- Commonsense or vague claims verified by LLM judge ensembles
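The two-way split can be illustrated with a simple keyword heuristic; this is a stand-in for VERGE's semantic router, not its actual classifier (the marker list and function name are invented for this sketch):

<code python>
# Vocabulary suggesting a claim is amenable to formal encoding
FORMAL_MARKERS = ("greater than", "less than", "equals", "sum",
                  "divisible", "implies", "for all", "at least",
                  ">=", "<=", "+", "=")

def route_claim(claim: str) -> str:
    """Heuristic stand-in for the semantic router: claims with
    mathematical or logical vocabulary go to the formal path,
    everything else to the soft (LLM-judge) path."""
    lowered = claim.lower()
    if any(marker in lowered for marker in FORMAL_MARKERS):
        return "formal"
    return "soft"
</code>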
=== 3. Autoformalization ===
Claims on the formal path are translated by the LLM into decidable fragments of first-order logic (FOL):
* **QF-UF** -- Quantifier-Free Uninterpreted Functions
* **LIA** -- Linear Integer Arithmetic
Given claim $c_i$ and context premises $\Pi$, the formalization produces:
$$\phi_i = \text{Formalize}(c_i, \Pi) \in \mathcal{L}_{\text{FOL}}$$
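For example, a claim such as //x is even and greater than 4// might be formalized into the LIA fragment with z3py (the encoding and the auxiliary witness variable are illustrative, not taken from the paper):

<code python>
from z3 import Int, And, Solver, sat

# Hypothetical output of the Formalize step for the claim
# "x is even and greater than 4" under the premise "x is an integer"
x = Int("x")
k = Int("k")                   # auxiliary witness for evenness in LIA
phi = And(x == 2 * k, x > 4)   # phi_i, in the LIA fragment

# Sanity check: phi is satisfiable (e.g., x = 6 with k = 3)
s = Solver()
s.add(phi)
result = s.check()
</code>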
=== 4. SMT Verification ===
The conjunction of formalized claims and premises is submitted to an SMT solver (e.g., Z3):
$$\text{SMT-Check}\left(\Pi \wedge \bigwedge_{i} \phi_i\right) \in \{\text{sat}, \text{unsat}, \text{unknown}\}$$
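A minimal z3py sketch of this check, with hypothetical premises and one formalized claim (the specific constraints are invented for illustration):

<code python>
from z3 import Solver, Int, sat, unsat

x = Int("x")
premises = [x > 4, x < 10]   # context premises Pi
claims_fol = [x >= 5]        # formalized claims phi_i (LIA)

solver = Solver()
solver.add(*premises)        # assert Pi
solver.add(*claims_fol)      # assert the conjunction of the phi_i
result = solver.check()      # one of sat / unsat / unknown

# Injecting a contradictory claim flips the verdict to unsat
solver.add(x < 5)
result_bad = solver.check()
</code>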
===== Minimal Correction Subsets =====
A key innovation in VERGE is the use of **Minimal Correction Subsets (MCS)** when verification fails. Instead of discarding the entire answer on an ''unsat'' result, MCS computes the minimal subset of faulty claims:
$$\text{MCS}(\Phi) = \arg\min_{S \subseteq \Phi} |S| \text{ s.t. } (\Phi \setminus S) \text{ is satisfiable}$$
This provides precise, targeted feedback to the LLM refiner, enabling surgical correction rather than wholesale regeneration. The unsat core from the SMT solver identifies conflicting assertions directly.
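One standard way to obtain a correction subset is the classic grow-an-MSS construction: greedily build a maximal satisfiable subset and return its complement. The sketch below is this generic algorithm, not VERGE's exact procedure, and it yields a correction set that is minimal by inclusion rather than minimum-cardinality:

<code python>
from z3 import Solver, Int, sat

def correction_subset(formulas: list) -> list:
    """Grow a maximal satisfiable subset (MSS) of the formulas;
    the complement of an MSS is a minimal correction subset."""
    solver = Solver()
    mcs = []
    for f in formulas:
        solver.push()
        solver.add(f)
        if solver.check() != sat:
            solver.pop()     # f conflicts with the kept set: correct it
            mcs.append(f)
    return mcs

x = Int("x")
mcs = correction_subset([x > 0, x < 10, x > 20])
</code>

Because a formula rejected against the growing set is also inconsistent with every superset of it, the kept set is maximal when the loop ends, so the returned complement is a genuine MCS.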
===== Scoring and Iteration =====
Each claim receives a rigor-weighted verification score. Formally verified claims receive maximum confidence (1.0), while soft-verified claims are capped at 0.9. The aggregate answer score includes a variance penalty to detect internal contradictions:
$$\mathcal{S}(\mathcal{A}) = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} w_i \cdot s_i - \lambda \cdot \text{Var}(\{s_i\})$$
where $w_i$ are rigor weights, $s_i$ are per-claim scores, and $\lambda$ penalizes high variance (indicating contradictions). The system iterates until convergence or a score threshold is met.
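A direct transcription of the scoring formula (the weights and the value of $\lambda$ below are illustrative choices, not defaults from the paper):

<code python>
from statistics import pvariance

def answer_score(scores: list[float], weights: list[float],
                 lam: float = 0.5) -> float:
    """S(A) = (1/|C|) * sum_i w_i * s_i  -  lambda * Var({s_i}).
    lam = 0.5 is an illustrative choice."""
    n = len(scores)
    weighted_mean = sum(w * s for w, s in zip(weights, scores)) / n
    return weighted_mean - lam * pvariance(scores)

# Two formally verified claims (score 1.0) and one soft-verified claim
# capped at 0.9; rigor weights set to 1.0 throughout for simplicity.
score = answer_score([1.0, 1.0, 0.9], [1.0, 1.0, 1.0])
</code>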
===== Code Example =====
<code python>
from z3 import Solver, Not, sat, unsat

def verify_arithmetic_claim(claim_fol: str, premises: list[str],
                            env: dict) -> dict:
    """Verify a formalized arithmetic claim using the Z3 SMT solver.

    ``env`` maps variable names to Z3 terms, e.g. {"x": Int("x")}.
    ``eval`` stands in for a real FOL parser; never eval untrusted input.
    """
    solver = Solver()
    # Add premises as constraints
    for premise in premises:
        solver.add(eval(premise, {"Not": Not}, env))
    # Validity check: the premises entail the claim iff
    # premises AND NOT(claim) is unsatisfiable
    solver.add(Not(eval(claim_fol, {"Not": Not}, env)))
    result = solver.check()  # run the solver once and reuse the result
    if result == unsat:
        return {"status": "verified", "confidence": 1.0}
    if result == sat:
        # The satisfying model is a counterexample to the claim
        return {"status": "refuted", "counterexample": str(solver.model())}
    return {"status": "unknown", "confidence": 0.5}

def compute_mcs(claims: list[str], formulas: list) -> list[str]:
    """Identify faulty claims via Z3's unsat core (a proxy for the MCS)."""
    solver = Solver()
    for label, formula in zip(claims, formulas):
        # Track each assertion under its claim label so the unsat core
        # can name the conflicting assertions
        solver.assert_and_track(formula, label)
    solver.check()
    core = {str(b) for b in solver.unsat_core()}
    return [c for c in claims if c in core]
</code>
===== Results =====
With GPT-OSS-120B as the base model, VERGE achieves:
* **18.7% average performance uplift** over single-pass baselines across reasoning benchmarks
* Convergence on all evaluated datasets
* Formalization reliability scales with model size; models above roughly 70B parameters produce consistently valid formalizations
* Outperforms probabilistic self-refinement approaches
Models below 20B parameters achieve only ~30% valid formalizations, establishing a practical lower bound for the approach.
===== References =====
* [[https://arxiv.org/abs/2601.20055|VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning (2026)]]
* [[https://arxiv.org/pdf/2601.20055|VERGE Full Paper PDF]]
* [[https://huggingface.co/papers/2601.20055|VERGE on Hugging Face Papers]]
===== See Also =====
* [[mcts_llm_reasoning]] -- Tree search methods for structured LLM reasoning
* [[task_decomposition_strategies]] -- Decomposing complex tasks into verifiable subtasks
* [[agentic_rag]] -- Retrieval-augmented generation with agent capabilities