====== Formal Verification of LLM Reasoning ======

VERGE (Formal Refinement and Guidance Engine) is a neurosymbolic framework that combines Large Language Models with SMT (Satisfiability Modulo Theories) solvers to provide formal verification guarantees on LLM-generated reasoning. By decomposing outputs into atomic claims, autoformalizing verifiable claims into first-order logic, and checking them with a theorem prover, VERGE achieves rigorous correctness guarantees that purely probabilistic methods cannot offer.

<code>
graph LR
    A[LLM Output] --> B[Decompose into Atomic Claims]
    B --> C{Semantic Router}
    C -->|Formal| D[Formalize to FOL]
    C -->|Soft| E[LLM Judge Ensemble]
    D --> F[SMT Solver Z3]
    F -->|Verified| G[Accept Claim]
    F -->|Refuted| H[Compute MCS]
    H --> I[Refine with LLM]
    E --> G
    I --> A
</code>

===== Background =====

LLMs produce fluent but unreliable reasoning, particularly on logical and mathematical tasks. Prior approaches to verification rely on self-consistency or ensemble voting, which remain probabilistic. VERGE bridges the gap between neural generation and formal methods by routing claims through either //hard verification// (SMT solvers) or //soft verification// (LLM judge ensembles) depending on claim type.

The core challenge is the **translation bottleneck**: converting natural-language claims into formal logic is error-prone and has limited prior neurosymbolic approaches. VERGE mitigates this through iterative refinement guided by solver feedback.

===== Architecture =====

The VERGE pipeline consists of four stages:

=== 1. Atomic Claim Decomposition ===

Given an LLM-generated answer $A$, decompose it into a set of atomic claims $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$, where each $c_i$ is a minimal, independently verifiable statement. This enables targeted verification and error isolation.

=== 2. Semantic Routing ===

Each claim is classified into one of two verification paths:

  * **Formal path** -- logical and mathematical claims amenable to first-order logic encoding
  * **Soft path** -- commonsense or vague claims verified by LLM judge ensembles

=== 3. Autoformalization ===

Claims on the formal path are translated by the LLM into decidable fragments of first-order logic (FOL):

  * **QF_UF** -- quantifier-free formulas over uninterpreted functions
  * **LIA** -- linear integer arithmetic

Given claim $c_i$ and context premises $\Pi$, formalization produces:

$$\phi_i = \text{Formalize}(c_i, \Pi) \in \mathcal{L}_{\text{FOL}}$$

=== 4. SMT Verification ===

The conjunction of formalized claims and premises is submitted to an SMT solver (e.g., Z3):

$$\text{SMT-Check}\left(\Pi \wedge \bigwedge_{i} \phi_i\right) \in \{\text{sat}, \text{unsat}, \text{unknown}\}$$

===== Minimal Correction Subsets =====

A key innovation in VERGE is the use of **Minimal Correction Subsets (MCS)** when verification fails. Instead of discarding the entire answer on an ''unsat'' result, VERGE computes a smallest subset of claims whose removal restores satisfiability:

$$\text{MCS}(\Phi) = \arg\min_{S \subseteq \Phi} |S| \quad \text{s.t.} \quad (\Phi \setminus S) \text{ is satisfiable}$$

This provides precise, targeted feedback to the LLM refiner, enabling surgical correction rather than wholesale regeneration. The unsat core returned by the SMT solver identifies the conflicting assertions directly.

===== Scoring and Iteration =====

Each claim receives a rigor-weighted verification score. Formally verified claims receive maximum confidence (1.0), while soft-verified claims are capped at 0.9. The aggregate answer score includes a variance penalty to detect internal contradictions:

$$\mathcal{S}(\mathcal{A}) = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} w_i \cdot s_i - \lambda \cdot \text{Var}(\{s_i\})$$

where $w_i$ are rigor weights, $s_i$ are per-claim scores, and $\lambda$ penalizes high variance (indicating contradictions).
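The aggregate scoring rule above can be sketched in a few lines of Python. The cap of 0.9 for soft-verified claims comes from the text; the penalty strength ''LAMBDA'' and the tuple-based claim representation are illustrative assumptions, not the paper's exact configuration.

```python
import statistics

SOFT_CAP = 0.9   # soft-verified claims are capped at 0.9 (from the text)
LAMBDA = 0.5     # variance-penalty strength (assumed value)

def aggregate_score(claims):
    """Rigor-weighted mean of per-claim scores minus a variance penalty.

    ``claims`` is a list of (score, weight, is_formal) tuples, one per
    atomic claim; ``is_formal`` marks claims verified by the SMT solver.
    """
    capped = []    # per-claim scores s_i after the soft-path cap
    weighted = []  # rigor-weighted terms w_i * s_i
    for score, weight, is_formal in claims:
        s = score if is_formal else min(score, SOFT_CAP)
        capped.append(s)
        weighted.append(weight * s)
    mean = sum(weighted) / len(claims)
    # Population variance of the raw scores penalizes internal contradiction
    penalty = LAMBDA * statistics.pvariance(capped)
    return mean - penalty
```

Two fully verified claims with unit weight score exactly 1.0, while a mix of confident and doubtful claims is penalized twice: once through the lower mean and once through the variance term.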
The system iterates until convergence or a score threshold is met.

===== Code Example =====

A minimal sketch of the verification step using Z3's Python bindings. Claims and premises are passed as Z3 expressions rather than ''eval''-ed strings, the solver is checked only once per query, and claims are asserted under tracking literals so the unsat core can name the offenders:

<code python>
from z3 import Solver, sat, unsat

def verify_arithmetic_claim(claim_fol, premises) -> dict:
    """Verify a formalized arithmetic claim using the Z3 SMT solver.

    ``claim_fol`` and ``premises`` are Z3 Boolean expressions; passing
    expressions directly avoids eval-ing untrusted strings.
    """
    from z3 import Not
    solver = Solver()
    for premise in premises:
        solver.add(premise)
    # The claim is valid iff premises AND NOT(claim) is unsatisfiable.
    solver.push()
    solver.add(Not(claim_fol))
    result = solver.check()  # check once; reuse the result
    if result == unsat:
        solver.pop()
        return {"status": "verified", "confidence": 1.0}
    if result == sat:
        counterexample = solver.model()  # model is valid right after sat
        solver.pop()
        return {"status": "refuted", "counterexample": str(counterexample)}
    solver.pop()
    return {"status": "unknown", "confidence": 0.5}

def compute_mcs(claims, premises) -> list:
    """Identify conflicting claims via Z3's unsat core.

    Each claim is asserted under a named tracking literal so that
    ``unsat_core()`` reports which claims participate in the conflict.
    """
    solver = Solver()
    for premise in premises:
        solver.add(premise)
    tracked = {}
    for i, claim in enumerate(claims):
        name = f"claim_{i}"
        tracked[name] = claim
        solver.assert_and_track(claim, name)
    if solver.check() == unsat:
        core = {str(lit) for lit in solver.unsat_core()}
        return [c for name, c in tracked.items() if name in core]
    return []  # already satisfiable: nothing to correct
</code>

===== Results =====

With GPT-OSS-120B as the base model, VERGE achieves:

  * **18.7% average performance uplift** over single-pass baselines across reasoning benchmarks
  * Convergence on all evaluated datasets
  * Formalization reliability that scales with model size (>70B parameters for consistently valid formulas)
  * Better accuracy than probabilistic self-refinement approaches

Models below 20B parameters achieve only ~30% valid formalizations, establishing a practical lower bound for the approach.

===== References =====

  * [[https://arxiv.org/abs/2601.20055|VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning (2026)]]
  * [[https://arxiv.org/pdf/2601.20055|VERGE Full Paper PDF]]
  * [[https://huggingface.co/papers/2601.20055|VERGE on Hugging Face Papers]]

===== See Also =====

  * [[mcts_llm_reasoning]] -- Tree search methods for structured LLM reasoning
  * [[task_decomposition_strategies]] -- Decomposing complex tasks into verifiable subtasks
  * [[agentic_rag]] -- Retrieval-augmented generation with agent capabilities