VERGE (Formal Refinement and Guidance Engine) is a neurosymbolic framework that combines Large Language Models with SMT (Satisfiability Modulo Theories) solvers to provide formal verification guarantees on LLM-generated reasoning. By decomposing outputs into atomic claims, autoformalizing verifiable claims into first-order logic, and checking them via theorem proving, VERGE achieves rigorous correctness guarantees that purely probabilistic methods cannot offer.
LLMs produce fluent but unreliable reasoning, particularly on logical and mathematical tasks. Prior approaches to verification rely on self-consistency or ensemble voting, which remain probabilistic. VERGE bridges the gap between neural generation and formal methods by routing claims through either hard verification (SMT solvers) or soft verification (LLM judge ensembles) based on claim type.
The core challenge is the translation bottleneck: converting natural-language claims into formal logic is error-prone, and this has limited prior neurosymbolic approaches. VERGE overcomes it through iterative refinement guided by solver feedback.
The VERGE pipeline consists of four stages:
Given an LLM-generated answer $A$, decompose it into a set of atomic claims $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$ where each $c_i$ is a minimal, independently verifiable statement. This enables targeted verification and error isolation.
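The decomposition step can be sketched as a minimal data structure plus a stand-in splitter. The `AtomicClaim` class and the naive sentence split below are illustrative assumptions — VERGE's actual decomposer is LLM-driven:

```python
from dataclasses import dataclass

@dataclass
class AtomicClaim:
    """One minimal, independently verifiable statement c_i."""
    id: int
    text: str

def decompose(answer: str) -> list[AtomicClaim]:
    """Split an answer A into atomic claims C = {c_1, ..., c_n}.

    Stand-in heuristic: sentence splitting. The real decomposer is an
    LLM prompted to extract minimal, self-contained statements.
    """
    sentences = [s.strip() for s in answer.replace("\n", " ").split(". ") if s.strip()]
    return [AtomicClaim(i, s.rstrip(".")) for i, s in enumerate(sentences, start=1)]
```

The point of the structure is that each claim carries a stable id, so later stages (routing, formalization, MCS feedback) can reference individual claims rather than the whole answer.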
Each claim is classified into one of two verification paths: a formal path for claims that can be stated in decidable logic (routed to the SMT solver), and a soft path for claims that resist formalization (routed to the LLM judge ensemble).
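A minimal routing sketch, assuming a surface-level heuristic in place of the LLM-based classifier the framework actually uses (the `route_claim` function and its marker set are illustrative):

```python
def route_claim(claim_text: str) -> str:
    """Route a claim to 'formal' (SMT) or 'soft' (LLM judge) verification.

    Stand-in heuristic: claims containing digits or arithmetic/comparison
    symbols are assumed formalizable; everything else goes to the soft path.
    """
    formal_markers = set("0123456789") | {"=", "<", ">", "+", "-", "*", "/"}
    if any(ch in formal_markers for ch in claim_text):
        return "formal"
    return "soft"
```

In practice the router must also catch logically structured claims without arithmetic surface forms ("all X are Y"), which is why a learned classifier is needed rather than a symbol check.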
Claims on the formal path are translated by the LLM into decidable fragments of first-order logic (FOL):
Given claim $c_i$ and context premises $\Pi$, the formalization produces:
$$\phi_i = \text{Formalize}(c_i, \Pi) \in \mathcal{L}_{\text{FOL}}$$
The conjunction of formalized claims and premises is submitted to an SMT solver (e.g., Z3):
$$\text{SMT-Check}\left(\Pi \wedge \bigwedge_{i} \phi_i\right) \in \{\text{sat}, \text{unsat}, \text{unknown}\}$$
A key innovation in VERGE is the use of Minimal Correction Subsets (MCS) when verification fails. Instead of discarding the entire answer on an unsat result, MCS computes the minimal subset of faulty claims:
$$\text{MCS}(\Phi) = \arg\min_{S \subseteq \Phi} |S| \text{ s.t. } (\Phi \setminus S) \text{ is satisfiable}$$
This provides precise, targeted feedback to the LLM refiner, enabling surgical correction rather than wholesale regeneration. The unsat core from the SMT solver identifies conflicting assertions directly.
Each claim receives a rigor-weighted verification score. Formally verified claims receive maximum confidence (1.0), while soft-verified claims are capped at 0.9. The aggregate answer score includes a variance penalty to detect internal contradictions:
$$\mathcal{S}(\mathcal{A}) = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} w_i \cdot s_i - \lambda \cdot \text{Var}(\{s_i\})$$
where $w_i$ are rigor weights, $s_i$ are per-claim scores, and $\lambda$ penalizes high variance (indicating contradictions). The system iterates until convergence or a score threshold is met.
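The aggregate score is a direct transcription of the formula above; the default value of `lam` here is an assumption for illustration, not the paper's setting:

```python
from statistics import pvariance

def answer_score(scores, weights, lam=0.5):
    """Aggregate answer score S(A): rigor-weighted mean minus a variance penalty.

    scores  -- per-claim scores s_i (1.0 for formally verified, <= 0.9 soft)
    weights -- rigor weights w_i
    lam     -- variance penalty coefficient λ (assumed value)
    """
    assert len(scores) == len(weights) and scores
    weighted_mean = sum(w * s for w, s in zip(weights, scores)) / len(scores)
    return weighted_mean - lam * pvariance(scores)
```

The variance term leaves a uniformly verified answer untouched (variance zero) but pulls down answers that mix confidently verified and dubious claims, which is the signature of an internal contradiction.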
```python
from z3 import Solver, Not, sat, unsat

def verify_arithmetic_claim(claim, premises) -> dict:
    """Verify a formalized arithmetic claim using the Z3 SMT solver.

    `claim` and `premises` are Z3 boolean expressions, not strings:
    calling eval() on model-generated text is unsafe and fragile.
    """
    solver = Solver()
    for premise in premises:
        solver.add(premise)              # assert the context premises Π
    # Validity check: the claim follows from Π iff Π ∧ ¬claim is unsat.
    solver.push()
    solver.add(Not(claim))
    result = solver.check()              # check once and reuse the result
    if result == unsat:
        solver.pop()
        return {"status": "verified", "confidence": 1.0}
    if result == sat:
        counterexample = solver.model()  # read the model before pop()
        solver.pop()
        return {"status": "refuted", "counterexample": str(counterexample)}
    solver.pop()
    return {"status": "unknown", "confidence": 0.5}

def compute_mcs(claims, solver: Solver) -> list:
    """Identify faulty claims via the solver's unsat core.

    `claims` are (label, formula) pairs; assert_and_track attaches the
    labels so the unsat core can name the conflicting assertions.
    """
    for label, formula in claims:
        solver.assert_and_track(formula, label)
    if solver.check() == unsat:
        core = {str(lit) for lit in solver.unsat_core()}
        return [(label, f) for label, f in claims if label in core]
    return []
```
With GPT-OSS-120B as the base model, VERGE achieves:
Models below 20B parameters achieve only ~30% valid formalizations, establishing a practical lower bound for the approach.