Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
LLM-powered agents are transforming text-to-SQL from static one-shot translation into an interactive, multi-step reasoning process. By combining schema exploration, test-time scaling, and decomposition strategies, agentic approaches now exceed 80% execution accuracy on challenging benchmarks such as BIRD.
Traditional text-to-SQL approaches treat query generation as a single forward pass: given a natural language question and a database schema, produce SQL. This fails on enterprise databases with hundreds of tables, ambiguous column names, and complex joins. Agentic methods instead treat text-to-SQL as a planning problem where the agent iteratively explores, hypothesizes, and verifies against real data.
APEX-SQL (Wang et al., 2025) introduces a hypothesis-verification (H-V) loop for resolving schema ambiguities in large enterprise databases.
The framework operates in two agentic stages: hypothesis generation, where the agent proposes candidate interpretations of ambiguous schema elements, and verification, where each hypothesis is checked against the database before SQL generation proceeds.
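The verification half of an H-V loop can be sketched against a toy SQLite database. This is a minimal illustration, not APEX-SQL's actual implementation: the `verify_hypothesis` helper and the toy schema are hypothetical, but the idea of probing real rows to discriminate between candidate schema links is the core of the loop.

```python
import sqlite3

def verify_hypothesis(conn, table, column, expected_value):
    """Check a schema-linking hypothesis against real rows: does this
    column actually contain the value mentioned in the question?"""
    cur = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} = ?",
        (expected_value,),
    )
    return cur.fetchone()[0] > 0

# Toy database with two plausible columns for a name mentioned in a question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_name TEXT, dept_name TEXT)")
conn.execute("INSERT INTO employees VALUES ('Alice', 'Sales')")

# Two hypotheses for which column the literal "Alice" refers to.
hypotheses = [("employees", "emp_name"), ("employees", "dept_name")]
confirmed = [h for h in hypotheses if verify_hypothesis(conn, *h, "Alice")]
print(confirmed)  # verification keeps only the emp_name hypothesis
```

The key property is that ambiguity is resolved by data, not by the model's prior: both columns look plausible from the schema alone, and only the probe query distinguishes them.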
Agentar-Scale-SQL (Wang et al., 2025) introduces an Orchestrated Test-Time Scaling strategy that combines three scaling perspectives (internal scaling through RL-enhanced reasoning, sequential scaling through iterative refinement against execution feedback, and parallel scaling through tournament selection over candidates) to achieve state-of-the-art results on BIRD.
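The parallel-scaling step can be approximated by execution-consistency voting: run every candidate query and keep the one whose result set is most common among the candidates. This sketch is an assumption about how such a tournament might work, not Agentar-Scale-SQL's actual selection mechanism; `fake_db` stands in for a real database connection.

```python
from collections import Counter

def tournament_select(candidates, execute):
    """Parallel-scaling selection sketch: execute every candidate SQL
    and keep one whose result set is most common (consistency voting)."""
    results = {}
    for sql in candidates:
        try:
            results[sql] = tuple(execute(sql))
        except Exception:
            continue  # failing candidates drop out of the tournament
    if not results:
        return None
    winner_result, _ = Counter(results.values()).most_common(1)[0]
    # Return the first candidate that produced the winning result.
    return next(s for s, r in results.items() if r == winner_result)

# Toy executor standing in for a real database (hypothetical data).
fake_db = {"SELECT 1": [1], "SELECT 1;": [1], "SELECT 2": [2]}
best = tournament_select(list(fake_db), fake_db.__getitem__)
print(best)  # "SELECT 1": two of the three candidates agree on its result
```

Voting on execution results rather than on SQL strings is what makes this robust: syntactically different queries that compute the same answer reinforce each other.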
DIN-SQL (Pourreza & Rafiei, 2023, NeurIPS) demonstrated that decomposing text-to-SQL into sub-problems dramatically improves LLM performance.
DIN-SQL breaks the generation problem into four sequential sub-tasks, each handled by a dedicated prompt: schema linking, query classification and decomposition, SQL generation, and self-correction.
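The staged structure can be wired together as a short sketch. The `StubLLM` and every `llm.*` method name here are hypothetical stand-ins for separately prompted LLM calls; only the pipeline shape mirrors the paper's decomposition.

```python
def din_sql_pipeline(question, schema, llm):
    """DIN-SQL-style staged pipeline (sketch with hypothetical llm calls)."""
    links = llm.schema_link(question, schema)       # 1. schema linking
    difficulty = llm.classify(question, links)      # 2. classification
    if difficulty == "easy":
        sql = llm.generate_simple(question, links)  # 3a. direct generation
    else:
        subqs = llm.decompose(question, links)      # 3b. decompose, then
        sql = llm.generate_nested(question, links, subqs)  # generate nested
    return llm.self_correct(sql)                    # 4. self-correction

class StubLLM:
    """Deterministic stand-in so the pipeline runs without a model."""
    def schema_link(self, q, s): return ["singer.name", "singer.age"]
    def classify(self, q, links): return "easy"
    def generate_simple(self, q, links): return "SELECT name FROM singer"
    def decompose(self, q, links): return []
    def generate_nested(self, q, links, subqs): return ""
    def self_correct(self, sql): return sql.rstrip(";")

out = din_sql_pipeline("List singer names", "singer(name, age)", StubLLM())
print(out)  # SELECT name FROM singer
```

Routing by difficulty matters: easy questions skip decomposition entirely, so the extra machinery only pays its token cost on queries that need nesting or multiple joins.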
DAIL-SQL (Gao et al., 2023, VLDB) provides a systematic benchmark of prompt engineering strategies for text-to-SQL, yielding an efficient integrated solution.
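One of DAIL-SQL's central findings is that few-shot example selection should compare question *structure*, masking away literal values. The sketch below illustrates that idea with token-overlap Jaccard similarity; the real system uses learned embeddings and also considers query similarity, so `mask_question` and `select_examples` are simplified, hypothetical stand-ins.

```python
import re

def mask_question(q):
    """Crude value masking: replace numbers and quoted strings with a
    placeholder so similarity reflects question structure, not literals."""
    return re.sub(r"\d+|'[^']*'", "<v>", q.lower())

def select_examples(target, pool, k=2):
    """Pick the k pool questions most similar to the target
    (token Jaccard here; DAIL-SQL uses embedding similarity)."""
    t = set(mask_question(target).split())
    def sim(q):
        p = set(mask_question(q).split())
        return len(t & p) / len(t | p)
    return sorted(pool, key=sim, reverse=True)[:k]

pool = [
    "How many singers are older than 30?",
    "How many concerts are in 2014?",
    "List the names of all stadiums.",
]
chosen = select_examples("How many singers are older than 40?", pool, k=1)
print(chosen)  # the masked "older than <v>" question matches exactly
```

Because both "older than 30" and "older than 40" mask to the same skeleton, the first pool question scores a perfect match even though its literal differs from the target's.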
```python
# Agentic text-to-SQL pipeline (simplified)
class AgenticTextToSQL:
    def __init__(self, llm, db_connection):
        self.llm = llm
        self.db = db_connection

    def schema_link(self, question, schema):
        """Hypothesis-verification loop for schema linking."""
        hypotheses = self.llm.generate_hypotheses(question, schema, n=2)
        pruned = self.dual_pathway_prune(schema, hypotheses)
        return self.validate_with_data(pruned)

    def generate_sql(self, question, linked_schema, n_parallel=5):
        """Orchestrated test-time scaling."""
        # Internal scaling: RL-enhanced reasoning
        candidates = [self.llm.reason(question, linked_schema)
                      for _ in range(n_parallel)]
        # Sequential scaling: iterative refinement
        refined = [self.iterative_refine(c) for c in candidates]
        # Parallel scaling: tournament selection
        return self.tournament_select(refined)

    def iterative_refine(self, sql, max_rounds=3):
        for _ in range(max_rounds):
            result = self.db.execute(sql)
            if not result.has_error:
                return sql
            sql = self.llm.fix(sql, result.error)
        return sql
```
| Method | Year | Spider (EX%) | BIRD (EX%) | Approach |
|---|---|---|---|---|
| DIN-SQL | 2023 | 85.3 | 55.9 | Decomposition + ICL |
| DAIL-SQL | 2023 | 86.6 | – | Prompt engineering |
| APEX-SQL | 2025 | – | 70.65 | Interactive exploration |
| Agentar-Scale-SQL | 2025 | – | 81.67 | Test-time scaling |