# Agentic Text-to-SQL

LLM-powered agents are transforming text-to-SQL from static one-shot translation into an interactive, multi-step reasoning process. By combining schema exploration, test-time scaling, and decomposition strategies, agentic approaches now exceed 80% execution accuracy on challenging benchmarks such as BIRD.
## The Agentic Paradigm Shift

Traditional text-to-SQL approaches treat query generation as a single forward pass: given a natural language question and a database schema, produce SQL. This breaks down on enterprise databases with hundreds of tables, ambiguous column names, and complex joins. Agentic methods instead treat text-to-SQL as a planning problem in which the agent iteratively explores, hypothesizes, and verifies against real data.
## APEX-SQL: Interactive Schema Exploration

APEX-SQL (Wang et al., 2025) introduces a hypothesis-verification (H-V) loop for resolving schema ambiguities in large enterprise databases.
### Architecture

The framework operates in two agentic stages:

1. **Schema Linking:** generates schema-agnostic hypotheses as logical plans from the query, then prunes and validates them against real database contents.
2. **SQL Generation:** retrieves exploration directives deterministically to guide the agent in verifying hypotheses and synthesizing SQL.
### Key Techniques

- **Logical Planning:** verbalizes query requirements into hypotheses, sampled at temperature 0.8 and aggregated at temperature 0.2.
- **Dual-Pathway Pruning:** batches schema tokens (8–12k per batch) and runs a negative pass (deleting noise columns $\mathcal{C}_{del,j}$) and a positive pass (preserving relevant columns $\mathcal{C}_{keep,j}$) in parallel to form the pruned schema $\mathcal{D}_{pruned}$.
- **Parallel Data Profiling:** validates column roles empirically and ensures topological connectivity.
- **Deterministic Directive Retrieval:** provides structured guidance for autonomous ambiguity resolution.
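The dual-pathway pruning step can be sketched as follows. This is a minimal illustration of the batch-then-prune idea, not the APEX-SQL implementation: `batch_schema`, `select_relevant`, and `flag_noise` are hypothetical names, and the token count is a crude whitespace estimate.

```python
# Hypothetical sketch of dual-pathway schema pruning (names are illustrative,
# not from the APEX-SQL codebase).

def batch_schema(columns, max_tokens=10_000):
    """Group column descriptions into batches of roughly 8-12k tokens."""
    batches, current, size = [], [], 0
    for col in columns:
        tokens = len(col["description"].split())  # crude token estimate
        if current and size + tokens > max_tokens:
            batches.append(current)
            current, size = [], 0
        current.append(col)
        size += tokens
    if current:
        batches.append(current)
    return batches

def dual_pathway_prune(batches, hypotheses, llm):
    """For each batch j, keep what the positive pass preserves (C_keep_j)
    minus what the negative pass flags as noise (C_del_j)."""
    pruned = set()
    for batch in batches:
        keep = llm.select_relevant(batch, hypotheses)   # positive pass
        delete = llm.flag_noise(batch, hypotheses)      # negative pass
        pruned |= {c["name"] for c in batch
                   if c["name"] in keep and c["name"] not in delete}
    return pruned
```

In APEX-SQL the two passes run in parallel over each batch; the sequential loop here is purely for readability.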
### Results

- BIRD: 70.65% execution accuracy
- Spider 2.0-Snow: 51.01% execution accuracy
- Lower token usage than comparable baselines
## Agentar-Scale-SQL: Orchestrated Test-Time Scaling

Agentar-Scale-SQL (Wang et al., 2025) introduces an orchestrated test-time scaling strategy that combines three scaling perspectives to achieve state-of-the-art results on BIRD.
### Three Scaling Dimensions

- **Internal Scaling:** RL-enhanced intrinsic reasoning; the model's own chain-of-thought is strengthened via reinforcement learning to improve internal deliberation.
- **Sequential Scaling:** iterative refinement; the agent refines its SQL output through multiple rounds of execution feedback and correction.
- **Parallel Scaling:** diverse synthesis with tournament selection; multiple candidate queries are generated and compete in a tournament to select the best.
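The parallel-scaling step can be illustrated with a simple single-elimination bracket. The `compare` callback stands in for an LLM judge or execution-based scorer; the bracket structure here is an assumption for illustration, not Agentar-Scale-SQL's published design:

```python
# Illustrative tournament selection over candidate SQL queries.
# `compare(a, b)` returns True if candidate a beats candidate b.

def tournament_select(candidates, compare):
    """Reduce a pool of candidates to one winner via pairwise comparisons."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if compare(a, b) else b)
        if len(pool) % 2 == 1:        # odd candidate out gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

With n candidates this makes n − 1 comparisons, which is why tournament-style selection scales better than scoring all pairs.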
### Results

- BIRD test set: 81.67% execution accuracy (first place on the official leaderboard)
- Demonstrates an effective path toward human-level text-to-SQL performance
- General-purpose framework designed for easy adaptation to new databases and more powerful LLMs
## DIN-SQL: Decomposed In-Context Learning

DIN-SQL (Pourreza & Rafiei, NeurIPS 2023) demonstrated that decomposing text-to-SQL into sub-problems dramatically improves LLM performance.
### Decomposition Strategy

DIN-SQL breaks the generation problem into sequential sub-tasks:

1. **Schema Linking:** identify relevant tables and columns.
2. **Query Classification:** determine query complexity (easy, medium, hard).
3. **Sub-query Decomposition:** break complex queries into simpler components.
4. **SQL Generation:** generate SQL using solutions from prior sub-tasks.
5. **Self-Correction:** validate and fix the generated SQL.
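The five stages above can be sketched as a chained prompt pipeline. The prompt strings and the `llm.ask` interface are illustrative assumptions; DIN-SQL's actual prompts are carefully engineered few-shot templates for each stage:

```python
# Sketch of DIN-SQL's staged prompting. Each stage feeds its output into
# the next; prompts here are placeholders for the paper's few-shot templates.

def din_sql(question, schema, llm):
    # 1. Schema linking
    linked = llm.ask(f"Schema linking: relevant tables/columns for: {question}\n{schema}")
    # 2. Query classification
    difficulty = llm.ask(f"Classify as easy/medium/hard: {question}")
    # 3. Sub-query decomposition (only for non-trivial queries)
    subqueries = None
    if difficulty != "easy":
        subqueries = llm.ask(f"Decompose into sub-questions: {question}")
    # 4. SQL generation using prior stage outputs
    sql = llm.ask(f"Write SQL for {question} using {linked}, sub-parts: {subqueries}")
    # 5. Self-correction
    return llm.ask(f"Check and fix this SQL if needed:\n{sql}")
```

The key design point is that each stage sees only the outputs it needs, keeping every individual prompt small and focused.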
### Results

- Spider test: 85.3% execution accuracy (state of the art at time of publication)
- BIRD test: 55.9% execution accuracy
- Consistently improves few-shot LLM performance by roughly 10%
- Beats heavily fine-tuned models by at least 5% using only in-context learning
## DAIL-SQL: Efficient Few-Shot Optimization

DAIL-SQL (Gao et al., 2023, VLDB) provides a systematic benchmark of prompt engineering strategies for text-to-SQL, yielding an efficient integrated solution.
### Key Contributions

- Systematic comparison of question representation, example selection, and example organization strategies
- Discovery that LLMs learn primarily from mappings between the question and the SQL skeleton
- Token-efficient design: approximately 1,600 tokens per question on Spider-dev
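The SQL-skeleton finding can be made concrete with a rough extraction function: mask literals and identifiers so only the query structure remains, then select few-shot examples whose skeletons match. DAIL-SQL derives skeletons from parsed SQL; this regex version is a deliberate simplification, and the keyword list is incomplete:

```python
import re

# Rough SQL "skeleton" extraction for illustration only: keep structural
# keywords and operators, mask identifiers and literals with "_".

SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "join",
    "on", "and", "or", "not", "in", "as", "count", "avg", "sum", "min",
    "max", "distinct", "limit", "desc", "asc", "between", "like",
}

def sql_skeleton(sql):
    sql = re.sub(r"'[^']*'", "'_'", sql)          # mask string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "_", sql)    # mask numeric literals
    out = []
    for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_.]*|[(),*=<>!]+|'_'", sql):
        if tok.lower() in SQL_KEYWORDS:
            out.append(tok.lower())               # keep SQL structure words
        elif tok[0].isalpha():
            out.append("_")                       # mask table/column names
        else:
            out.append(tok)                       # keep operators, punctuation
    return " ".join(out)
```

Two questions whose gold queries share a skeleton make good few-shot pairs for each other, which is the intuition behind DAIL-SQL's example selection.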
### Results

- Spider test: 86.6% execution accuracy with GPT-4 and self-consistency voting
- First place on the Spider leaderboard at time of publication
- Demonstrates that prompt engineering alone can rival fine-tuned approaches
## Code Example

```python
# Agentic text-to-SQL pipeline (simplified)
class AgenticTextToSQL:
    def __init__(self, llm, db_connection):
        self.llm = llm
        self.db = db_connection

    def schema_link(self, question, schema):
        """Hypothesis-verification loop for schema linking."""
        hypotheses = self.llm.generate_hypotheses(question, schema, n=2)
        pruned = self.dual_pathway_prune(schema, hypotheses)
        return self.validate_with_data(pruned)

    def generate_sql(self, question, linked_schema, n_parallel=5):
        """Orchestrated test-time scaling."""
        # Internal scaling: RL-enhanced reasoning
        candidates = [self.llm.reason(question, linked_schema)
                      for _ in range(n_parallel)]
        # Sequential scaling: iterative refinement
        refined = [self.iterative_refine(c) for c in candidates]
        # Parallel scaling: tournament selection
        return self.tournament_select(refined)

    def iterative_refine(self, sql, max_rounds=3):
        """Repair SQL using execution feedback, up to max_rounds attempts."""
        for _ in range(max_rounds):
            result = self.db.execute(sql)
            if not result.has_error:
                return sql
            sql = self.llm.fix(sql, result.error)
        return sql
```
## Evolution of Text-to-SQL Accuracy

| Method | Year | Spider (EX%) | BIRD (EX%) | Approach |
|---|---|---|---|---|
| DIN-SQL | 2023 | 85.3 | 55.9 | Decomposition + ICL |
| DAIL-SQL | 2023 | 86.6 | – | Prompt engineering |
| APEX-SQL | 2025 | – | 70.65 | Interactive exploration |
| Agentar-Scale-SQL | 2025 | – | 81.67 | Test-time scaling |