====== Agentic Text-to-SQL ======

LLM-powered agents are transforming the text-to-SQL task from static one-shot translation into interactive, multi-step reasoning processes. By combining schema exploration, test-time scaling, and decomposition strategies, agentic approaches now exceed 80% execution accuracy on challenging benchmarks like BIRD.

===== The Agentic Paradigm Shift =====

Traditional text-to-SQL approaches treat query generation as a single forward pass: given a natural language question and a database schema, produce SQL. This fails on enterprise databases with hundreds of tables, ambiguous column names, and complex joins. Agentic methods instead treat text-to-SQL as a //planning problem// in which the agent iteratively explores, hypothesizes, and verifies against real data.

===== APEX-SQL: Interactive Schema Exploration =====

**APEX-SQL** (Wang et al., 2025) introduces a hypothesis-verification (H-V) loop for resolving schema ambiguities in large enterprise databases.

=== Architecture ===

The framework operates in two agentic stages:

  - **Schema Linking**: Generates schema-agnostic hypotheses as logical plans from the query, then prunes and validates them against real database contents
  - **SQL Generation**: Retrieves exploration directives deterministically to guide the agent in verifying hypotheses and synthesizing SQL

=== Key Techniques ===

  * **Logical Planning**: Verbalizes query requirements into hypotheses, sampled at temperature 0.8 and aggregated at 0.2
  * **Dual-Pathway Pruning**: Batches schema tokens (8-12k per batch) and runs a negative pass (deletes noise $\mathcal{C}_{del,j}$) in parallel with a positive pass (preserves relevant $\mathcal{C}_{keep,j}$) to form the pruned schema $\mathcal{D}_{pruned}$
  * **Parallel Data Profiling**: Validates column roles empirically and ensures topological connectivity
  * **Deterministic Directive Retrieval**: Provides structured guidance for autonomous ambiguity resolution

=== Results ===

  * **BIRD**: 70.65% execution accuracy
  * **Spider 2.0-Snow**: 51.01% execution accuracy
  * Lower token usage than comparable baselines

===== Agentar-Scale-SQL: Orchestrated Test-Time Scaling =====

**Agentar-Scale-SQL** (Wang et al., 2025) introduces an Orchestrated Test-Time Scaling strategy that combines three scaling perspectives to achieve SOTA on BIRD.

=== Three Scaling Dimensions ===

  - **Internal Scaling**: RL-enhanced intrinsic reasoning -- the model's own chain-of-thought is strengthened via reinforcement learning to improve internal deliberation
  - **Sequential Scaling**: Iterative refinement -- the agent refines its SQL output through multiple rounds of execution feedback and correction
  - **Parallel Scaling**: Diverse synthesis with tournament selection -- multiple candidate queries are generated and compete in a tournament to select the best

=== Results ===

  * **BIRD test set**: 81.67% execution accuracy (first place on the official leaderboard)
  * Demonstrates an effective path toward human-level text-to-SQL performance
  * General-purpose framework designed for easy adaptation to new databases and more powerful LLMs

===== DIN-SQL: Decomposed In-Context Learning =====

**DIN-SQL** (Pourreza & Rafiei, 2023, NeurIPS) demonstrated that decomposing text-to-SQL into sub-problems dramatically improves LLM performance.
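The sequential sub-task chaining behind this decomposition idea can be sketched as a toy pipeline. This is a minimal sketch: the helpers below (''classify'', ''link_schema'', ''generate_sql'') and their keyword heuristics are hypothetical stand-ins for what are few-shot LLM calls in DIN-SQL, not the paper's actual modules or prompts.

```python
# Toy sketch of sequential sub-task chaining for text-to-SQL.
# Each stage consumes the previous stage's output, narrowing the
# problem before generation. Heuristics here are illustrative only.

def classify(question: str) -> str:
    """Rough complexity bucket based on surface cues in the question."""
    q = question.lower()
    grouping = q.count(" per ") + q.count(" for each ")
    if "average" in q or grouping > 0:
        return "medium"
    return "easy"

def link_schema(question: str, schema: dict) -> dict:
    """Keep only tables with a column mentioned in the question."""
    q = question.lower()
    return {table: cols for table, cols in schema.items()
            if any(col.lower() in q for col in cols)}

def generate_sql(question: str, linked: dict, complexity: str) -> str:
    """Placeholder for the generation stage (an LLM call in practice)."""
    table = next(iter(linked), "unknown_table")
    return f"SELECT * FROM {table}  -- complexity: {complexity}"

schema = {"employees": ["name", "salary"], "orders": ["order_id", "total"]}
question = "What is the average salary of employees?"

# Chain the stages: classify, then link, then generate.
complexity = classify(question)
linked = link_schema(question, schema)
sql = generate_sql(question, linked, complexity)
```

The point of the chaining is that each stage shrinks the search space for the next one: generation only ever sees the linked tables and a complexity label, rather than the full schema.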
=== Decomposition Strategy ===

DIN-SQL breaks the generation problem into sequential sub-tasks:

  - **Schema Linking**: Identify relevant tables and columns
  - **Query Classification**: Determine query complexity (easy, medium, hard)
  - **Sub-query Decomposition**: Break complex queries into simpler components
  - **SQL Generation**: Generate SQL using solutions from prior sub-tasks
  - **Self-Correction**: Validate and fix generated SQL

=== Results ===

  * **Spider test**: 85.3% execution accuracy (SOTA at time of publication)
  * **BIRD test**: 55.9% execution accuracy
  * Consistently improves few-shot LLM performance by roughly 10%
  * Beats heavily fine-tuned models by at least 5% using only in-context learning

===== DAIL-SQL: Efficient Few-Shot Optimization =====

**DAIL-SQL** (Gao et al., 2023, VLDB) provides a systematic benchmark of prompt engineering strategies for text-to-SQL, yielding an efficient integrated solution.

=== Key Contributions ===

  * Systematic comparison of question representation, example selection, and example organization strategies
  * Discovery that LLMs learn primarily from mappings between questions and SQL skeletons
  * Token-efficient design: approximately 1,600 tokens per question on Spider-dev

=== Results ===

  * **Spider test**: 86.6% execution accuracy with GPT-4 and self-consistency voting
  * First place on the Spider leaderboard at time of publication
  * Demonstrates that prompt engineering alone can rival fine-tuned approaches

===== Code Example =====

<code python>
# Agentic text-to-SQL pipeline (simplified)
class AgenticTextToSQL:
    def __init__(self, llm, db_connection):
        self.llm = llm
        self.db = db_connection

    def schema_link(self, question, schema):
        """Hypothesis-verification loop for schema linking."""
        hypotheses = self.llm.generate_hypotheses(question, schema, n=2)
        pruned = self.dual_pathway_prune(schema, hypotheses)
        return self.validate_with_data(pruned)

    def generate_sql(self, question, linked_schema, n_parallel=5):
        """Orchestrated test-time scaling."""
        # Internal scaling: RL-enhanced reasoning
        candidates = [self.llm.reason(question, linked_schema)
                      for _ in range(n_parallel)]
        # Sequential scaling: iterative refinement
        refined = [self.iterative_refine(c) for c in candidates]
        # Parallel scaling: tournament selection
        return self.tournament_select(refined)

    def iterative_refine(self, sql, max_rounds=3):
        for _ in range(max_rounds):
            result = self.db.execute(sql)
            if not result.has_error:
                return sql
            sql = self.llm.fix(sql, result.error)
        return sql
</code>

===== Evolution of Text-to-SQL Accuracy =====

^ Method ^ Year ^ Spider (EX%) ^ BIRD (EX%) ^ Approach ^
| DIN-SQL | 2023 | 85.3 | 55.9 | Decomposition + ICL |
| DAIL-SQL | 2023 | 86.6 | -- | Prompt engineering |
| APEX-SQL | 2025 | -- | 70.65 | Interactive exploration |
| Agentar-Scale-SQL | 2025 | -- | 81.67 | Test-time scaling |

===== References =====

  * [[https://arxiv.org/abs/2602.16720|APEX-SQL: Agent-Powered EXploration for Text-to-SQL (arXiv:2602.16720)]]
  * [[https://arxiv.org/abs/2509.24403|Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling (arXiv:2509.24403)]]
  * [[https://arxiv.org/abs/2304.11015|DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction (arXiv:2304.11015)]]
  * [[https://arxiv.org/abs/2308.15363|DAIL-SQL: Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation (arXiv:2308.15363)]]

===== See Also =====

  * [[automated_program_repair|Automated Program Repair]] -- related agentic code generation
  * [[agentic_uncertainty|Agentic Uncertainty]] -- confidence degradation across reasoning steps