====== Agentic Text-to-SQL ======

LLM-powered agents are transforming the text-to-SQL task from static one-shot translation into interactive, multi-step reasoning processes. By combining schema exploration, test-time scaling, and decomposition strategies, agentic approaches now exceed 80% execution accuracy on challenging benchmarks like BIRD.

===== The Agentic Paradigm Shift =====

Traditional text-to-SQL approaches treat query generation as a single forward pass: given a natural language question and a database schema, produce SQL. This fails on enterprise databases with hundreds of tables, ambiguous column names, and complex joins. Agentic methods instead treat text-to-SQL as a //planning problem// in which the agent iteratively explores, hypothesizes, and verifies against real data.

===== APEX-SQL: Interactive Schema Exploration =====

**APEX-SQL** (Wang et al., 2025) introduces a hypothesis-verification (H-V) loop for resolving schema ambiguities in large enterprise databases.

=== Architecture ===

The framework operates in two agentic stages:

  - **Schema Linking**: Generates schema-agnostic hypotheses as logical plans from the query, then prunes and validates them against real database contents
  - **SQL Generation**: Retrieves exploration directives deterministically to guide the agent in verifying hypotheses and synthesizing SQL

=== Key Techniques ===

  * **Logical Planning**: Verbalizes query requirements into hypotheses, sampled at temperature 0.8 and aggregated at 0.2
  * **Dual-Pathway Pruning**: Batches schema tokens (8-12k per batch) and runs a negative pass (deletes noise $\mathcal{C}_{del,j}$) in parallel with a positive pass (preserves relevant $\mathcal{C}_{keep,j}$) to form the pruned schema $\mathcal{D}_{pruned}$
  * **Parallel Data Profiling**: Validates column roles empirically and ensures topological connectivity
  * **Deterministic Directive Retrieval**: Provides structured guidance for autonomous ambiguity resolution

=== Results ===

  * **BIRD**: 70.65% execution accuracy
  * **Spider 2.0-Snow**: 51.01% execution accuracy
  * Lower token usage than comparable baselines

===== Agentar-Scale-SQL: Orchestrated Test-Time Scaling =====

**Agentar-Scale-SQL** (Wang et al., 2025) introduces an Orchestrated Test-Time Scaling strategy that combines three scaling perspectives to achieve SOTA on BIRD.

=== Three Scaling Dimensions ===

  - **Internal Scaling**: RL-enhanced intrinsic reasoning -- the model's own chain-of-thought is strengthened via reinforcement learning to improve internal deliberation
  - **Sequential Scaling**: Iterative refinement -- the agent refines its SQL output through multiple rounds of execution feedback and correction
  - **Parallel Scaling**: Diverse synthesis with tournament selection -- multiple candidate queries are generated and compete in a tournament to select the best

=== Results ===

  * **BIRD test set**: 81.67% execution accuracy (first place on the official leaderboard)
  * Demonstrates an effective path toward human-level text-to-SQL performance
  * General-purpose framework designed for easy adaptation to new databases and more powerful LLMs

===== DIN-SQL: Decomposed In-Context Learning =====

**DIN-SQL** (Pourreza & Rafiei, 2023, NeurIPS) demonstrated that decomposing text-to-SQL into sub-problems dramatically improves LLM performance.
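The sequential sub-task chaining behind this decomposition idea can be sketched as a toy pipeline. This is a minimal sketch: the helpers below (''classify'', ''link_schema'', ''generate_sql'') and their keyword heuristics are hypothetical stand-ins for what are few-shot LLM calls in DIN-SQL, not the paper's actual modules or prompts.

```python
# Toy sketch of sequential sub-task chaining for text-to-SQL.
# Each stage consumes the previous stage's output, narrowing the
# problem before generation. Heuristics here are illustrative only.

def classify(question: str) -> str:
    """Rough complexity bucket based on surface cues in the question."""
    q = question.lower()
    grouping = q.count(" per ") + q.count(" for each ")
    if "average" in q or grouping > 0:
        return "medium"
    return "easy"

def link_schema(question: str, schema: dict) -> dict:
    """Keep only tables with a column mentioned in the question."""
    q = question.lower()
    return {table: cols for table, cols in schema.items()
            if any(col.lower() in q for col in cols)}

def generate_sql(question: str, linked: dict, complexity: str) -> str:
    """Placeholder for the generation stage (an LLM call in practice)."""
    table = next(iter(linked), "unknown_table")
    return f"SELECT * FROM {table}  -- complexity: {complexity}"

schema = {"employees": ["name", "salary"], "orders": ["order_id", "total"]}
question = "What is the average salary of employees?"

# Chain the stages: classify, then link, then generate.
complexity = classify(question)
linked = link_schema(question, schema)
sql = generate_sql(question, linked, complexity)
```

The point of the chaining is that each stage shrinks the search space for the next one: generation only ever sees the linked tables and a complexity label, rather than the full schema.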
=== Decomposition Strategy ===

DIN-SQL breaks the generation problem into sequential sub-tasks:

  - **Schema Linking**: Identify relevant tables and columns
  - **Query Classification**: Determine query complexity (easy, medium, hard)
  - **Sub-query Decomposition**: Break complex queries into simpler components
  - **SQL Generation**: Generate SQL using solutions from prior sub-tasks
  - **Self-Correction**: Validate and fix generated SQL

=== Results ===

  * **Spider test**: 85.3% execution accuracy (SOTA at time of publication)
  * **BIRD test**: 55.9% execution accuracy
  * Consistently improves few-shot LLM performance by roughly 10%
  * Beats heavily fine-tuned models by at least 5% using only in-context learning

===== DAIL-SQL: Efficient Few-Shot Optimization =====

**DAIL-SQL** (Gao et al., 2023, VLDB) provides a systematic benchmark of prompt engineering strategies for text-to-SQL, yielding an efficient integrated solution.

=== Key Contributions ===

  * Systematic comparison of question representation, example selection, and example organization strategies
  * Discovery that LLMs learn primarily from mappings between questions and SQL skeletons
  * Token-efficient design: approximately 1,600 tokens per question on Spider-dev

=== Results ===

  * **Spider test**: 86.6% execution accuracy with GPT-4 and self-consistency voting
  * First place on the Spider leaderboard at time of publication
  * Demonstrates that prompt engineering alone can rival fine-tuned approaches

===== Code Example =====

<code python>
# Agentic text-to-SQL pipeline (simplified)
class AgenticTextToSQL:
    def __init__(self, llm, db_connection):
        self.llm = llm
        self.db = db_connection

    def schema_link(self, question, schema):
        """Hypothesis-verification loop for schema linking."""
        hypotheses = self.llm.generate_hypotheses(question, schema, n=2)
        pruned = self.dual_pathway_prune(schema, hypotheses)
        return self.validate_with_data(pruned)

    def generate_sql(self, question, linked_schema, n_parallel=5):
        """Orchestrated test-time scaling."""
        # Internal scaling: RL-enhanced reasoning
        candidates = [self.llm.reason(question, linked_schema)
                      for _ in range(n_parallel)]
        # Sequential scaling: iterative refinement
        refined = [self.iterative_refine(c) for c in candidates]
        # Parallel scaling: tournament selection
        return self.tournament_select(refined)

    def iterative_refine(self, sql, max_rounds=3):
        for _ in range(max_rounds):
            result = self.db.execute(sql)
            if not result.has_error:
                return sql
            sql = self.llm.fix(sql, result.error)
        return sql
</code>

===== Evolution of Text-to-SQL Accuracy =====

^ Method ^ Year ^ Spider (EX%) ^ BIRD (EX%) ^ Approach ^
| DIN-SQL | 2023 | 85.3 | 55.9 | Decomposition + ICL |
| DAIL-SQL | 2023 | 86.6 | -- | Prompt engineering |
| APEX-SQL | 2025 | -- | 70.65 | Interactive exploration |
| Agentar-Scale-SQL | 2025 | -- | 81.67 | Test-time scaling |

===== References =====

  * [[https://arxiv.org/abs/2602.16720|APEX-SQL: Agent-Powered EXploration for Text-to-SQL (arXiv:2602.16720)]]
  * [[https://arxiv.org/abs/2509.24403|Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling (arXiv:2509.24403)]]
  * [[https://arxiv.org/abs/2304.11015|DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction (arXiv:2304.11015)]]
  * [[https://arxiv.org/abs/2308.15363|DAIL-SQL: Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation (arXiv:2308.15363)]]

===== See Also =====

  * [[automated_program_repair|Automated Program Repair]] -- related agentic code generation
  * [[agentic_uncertainty|Agentic Uncertainty]] -- confidence degradation across reasoning steps