Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
LLM-powered agents are transforming text-to-SQL from static one-shot translation into an interactive, multi-step reasoning process. By combining schema exploration, test-time scaling, and decomposition strategies, agentic approaches now exceed 80% execution accuracy on challenging benchmarks such as BIRD.
Traditional text-to-SQL approaches treat query generation as a single forward pass: given a natural language question and a database schema, produce SQL. This fails on enterprise databases with hundreds of tables, ambiguous column names, and complex joins. Agentic methods instead treat text-to-SQL as a planning problem where the agent iteratively explores, hypothesizes, and verifies against real data.
APEX-SQL (Wang et al., 2025) introduces a hypothesis-verification (H-V) loop for resolving schema ambiguities in large enterprise databases.
The framework operates in two agentic stages: hypothesis generation, where the agent proposes candidate interpretations of ambiguous schema elements, and verification, where each hypothesis is checked against the database before SQL generation proceeds.
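The verification half of an H-V loop can be sketched against a toy SQLite database. This is a minimal illustration, not APEX-SQL's actual implementation: the `verify_hypothesis` helper and the toy schema are hypothetical, but the idea of probing real rows to discriminate between candidate schema links is the core of the loop.

```python
import sqlite3

def verify_hypothesis(conn, table, column, expected_value):
    """Check a schema-linking hypothesis against real rows: does this
    column actually contain the value mentioned in the question?"""
    cur = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} = ?",
        (expected_value,),
    )
    return cur.fetchone()[0] > 0

# Toy database with two plausible columns for a name mentioned in a question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_name TEXT, dept_name TEXT)")
conn.execute("INSERT INTO employees VALUES ('Alice', 'Sales')")

# Two hypotheses for which column the literal "Alice" refers to.
hypotheses = [("employees", "emp_name"), ("employees", "dept_name")]
confirmed = [h for h in hypotheses if verify_hypothesis(conn, *h, "Alice")]
print(confirmed)  # verification keeps only the emp_name hypothesis
```

The key property is that ambiguity is resolved by data, not by the model's prior: both columns look plausible from the schema alone, and only the probe query distinguishes them.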
Agentar-Scale-SQL (Wang et al., 2025) introduces an Orchestrated Test-Time Scaling strategy that combines three scaling perspectives (internal scaling through RL-enhanced reasoning, sequential scaling through iterative refinement against execution feedback, and parallel scaling through tournament selection over candidates) to achieve state-of-the-art results on BIRD.
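The parallel-scaling step can be approximated by execution-consistency voting: run every candidate query and keep the one whose result set is most common among the candidates. This sketch is an assumption about how such a tournament might work, not Agentar-Scale-SQL's actual selection mechanism; `fake_db` stands in for a real database connection.

```python
from collections import Counter

def tournament_select(candidates, execute):
    """Parallel-scaling selection sketch: execute every candidate SQL
    and keep one whose result set is most common (consistency voting)."""
    results = {}
    for sql in candidates:
        try:
            results[sql] = tuple(execute(sql))
        except Exception:
            continue  # failing candidates drop out of the tournament
    if not results:
        return None
    winner_result, _ = Counter(results.values()).most_common(1)[0]
    # Return the first candidate that produced the winning result.
    return next(s for s, r in results.items() if r == winner_result)

# Toy executor standing in for a real database (hypothetical data).
fake_db = {"SELECT 1": [1], "SELECT 1;": [1], "SELECT 2": [2]}
best = tournament_select(list(fake_db), fake_db.__getitem__)
print(best)  # "SELECT 1": two of the three candidates agree on its result
```

Voting on execution results rather than on SQL strings is what makes this robust: syntactically different queries that compute the same answer reinforce each other.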
DIN-SQL (Pourreza & Rafiei, 2023, NeurIPS) demonstrated that decomposing text-to-SQL into sub-problems dramatically improves LLM performance.
DIN-SQL breaks the generation problem into four sequential sub-tasks, each handled by a dedicated prompt: schema linking, query classification and decomposition, SQL generation, and self-correction.
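The staged structure can be wired together as a short sketch. The `StubLLM` and every `llm.*` method name here are hypothetical stand-ins for separately prompted LLM calls; only the pipeline shape mirrors the paper's decomposition.

```python
def din_sql_pipeline(question, schema, llm):
    """DIN-SQL-style staged pipeline (sketch with hypothetical llm calls)."""
    links = llm.schema_link(question, schema)       # 1. schema linking
    difficulty = llm.classify(question, links)      # 2. classification
    if difficulty == "easy":
        sql = llm.generate_simple(question, links)  # 3a. direct generation
    else:
        subqs = llm.decompose(question, links)      # 3b. decompose, then
        sql = llm.generate_nested(question, links, subqs)  # generate nested
    return llm.self_correct(sql)                    # 4. self-correction

class StubLLM:
    """Deterministic stand-in so the pipeline runs without a model."""
    def schema_link(self, q, s): return ["singer.name", "singer.age"]
    def classify(self, q, links): return "easy"
    def generate_simple(self, q, links): return "SELECT name FROM singer"
    def decompose(self, q, links): return []
    def generate_nested(self, q, links, subqs): return ""
    def self_correct(self, sql): return sql.rstrip(";")

out = din_sql_pipeline("List singer names", "singer(name, age)", StubLLM())
print(out)  # SELECT name FROM singer
```

Routing by difficulty matters: easy questions skip decomposition entirely, so the extra machinery only pays its token cost on queries that need nesting or multiple joins.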
DAIL-SQL (Gao et al., 2023, VLDB) provides a systematic benchmark of prompt engineering strategies for text-to-SQL, yielding an efficient integrated solution.
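One of DAIL-SQL's central findings is that few-shot example selection should compare question *structure*, masking away literal values. The sketch below illustrates that idea with token-overlap Jaccard similarity; the real system uses learned embeddings and also considers query similarity, so `mask_question` and `select_examples` are simplified, hypothetical stand-ins.

```python
import re

def mask_question(q):
    """Crude value masking: replace numbers and quoted strings with a
    placeholder so similarity reflects question structure, not literals."""
    return re.sub(r"\d+|'[^']*'", "<v>", q.lower())

def select_examples(target, pool, k=2):
    """Pick the k pool questions most similar to the target
    (token Jaccard here; DAIL-SQL uses embedding similarity)."""
    t = set(mask_question(target).split())
    def sim(q):
        p = set(mask_question(q).split())
        return len(t & p) / len(t | p)
    return sorted(pool, key=sim, reverse=True)[:k]

pool = [
    "How many singers are older than 30?",
    "How many concerts are in 2014?",
    "List the names of all stadiums.",
]
chosen = select_examples("How many singers are older than 40?", pool, k=1)
print(chosen)  # the masked "older than <v>" question matches exactly
```

Because both "older than 30" and "older than 40" mask to the same skeleton, the first pool question scores a perfect match even though its literal differs from the target's.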
```python
# Agentic text-to-SQL pipeline (simplified)
class AgenticTextToSQL:
    def __init__(self, llm, db_connection):
        self.llm = llm
        self.db = db_connection

    def schema_link(self, question, schema):
        """Hypothesis-verification loop for schema linking."""
        hypotheses = self.llm.generate_hypotheses(question, schema, n=2)
        pruned = self.dual_pathway_prune(schema, hypotheses)
        return self.validate_with_data(pruned)

    def generate_sql(self, question, linked_schema, n_parallel=5):
        """Orchestrated test-time scaling."""
        # Internal scaling: RL-enhanced reasoning
        candidates = [self.llm.reason(question, linked_schema)
                      for _ in range(n_parallel)]
        # Sequential scaling: iterative refinement
        refined = [self.iterative_refine(c) for c in candidates]
        # Parallel scaling: tournament selection
        return self.tournament_select(refined)

    def iterative_refine(self, sql, max_rounds=3):
        for _ in range(max_rounds):
            result = self.db.execute(sql)
            if not result.has_error:
                return sql
            sql = self.llm.fix(sql, result.error)
        return sql
```
| Method | Year | Spider (EX%) | BIRD (EX%) | Approach |
|---|---|---|---|---|
| DIN-SQL | 2023 | 85.3 | 55.9 | Decomposition + ICL |
| DAIL-SQL | 2023 | 86.6 | – | Prompt engineering |
| APEX-SQL | 2025 | – | 70.65 | Interactive exploration |
| Agentar-Scale-SQL | 2025 | – | 81.67 | Test-time scaling |