====== Automatic Prompt Engineer (APE) ======

Automatic Prompt Engineer (APE) is a framework that uses large language models to autonomously generate, evaluate, and select high-performing natural-language instructions for downstream tasks. It treats prompt design as a black-box optimization problem over a discrete instruction space, eliminating the need for manual prompt engineering.((Zhou et al. 2023, [[https://arxiv.org/abs/2211.01910|Large Language Models Are Human-Level Prompt Engineers]]))

===== How It Works =====

APE leverages LLMs in a two-stage process:

==== Stage 1: Instruction Generation ====

A base LLM (e.g., Codex) acts as an "inference model," producing multiple candidate instructions conditioned on few-shot input-output examples for a task. These candidates are treated as natural-language "programs" that guide the target LLM.

==== Stage 2: Evaluation and Selection ====

Each candidate instruction is executed on held-out validation data using a target LLM (e.g., GPT-3). Performance scores (e.g., accuracy) are computed, and the highest-scoring instruction is selected. This Monte Carlo-style search can iterate, generating new candidates from the top performers.

===== Optimization Process =====

APE formalizes prompt engineering as black-box optimization:((Zhou et al. 2023, Section 3))

  - Start with task demonstrations (input-output pairs).
  - Generate diverse candidate prompts via LLM proposals.
  - Score candidates on a validation set using target-model performance.
  - Select the best candidate; optionally refine through iterative search.

Advanced variants incorporate evolutionary search or reinforcement learning, but the original APE uses a simple LLM-driven Monte Carlo search for efficiency.

===== OPRO: Optimization by Prompting =====

OPRO (Optimization by PROmpting) by Yang et al. builds on APE-like ideas but uses LLMs to iteratively generate **and score** prompts via natural-language critiques:((Yang et al. 2024, OPRO))

  * The LLM evaluator provides relative rankings ("better than" comparisons) to guide refinement.
  * This reduces compute needs compared to APE's data-heavy evaluation approach.
  * OPRO emphasizes critique-driven evolution rather than scoring on validation data.

^ Aspect ^ APE ^ OPRO ^
| Optimization signal | Validation set accuracy | LLM-generated critiques |
| Search strategy | Monte Carlo generation + selection | Iterative critique-driven refinement |
| Compute cost | Higher (many validation runs) | Lower (LLM comparisons) |
| Prompt format | Complete instructions | Instructions with optimization history |

===== Comparison to Human-Written Prompts =====

^ Aspect ^ APE (Automated) ^ Human-Written ^
| Design process | Black-box search via LLM | Manual iteration and intuition |
| Performance | Human-level or better | Strong but task-specific |
| Scalability | Automated, data-efficient | Labor-intensive |
| Flexibility | Adapts to new tasks via demonstrations | Expertise-dependent |
| Consistency | Reproducible | Varies by practitioner |

APE-generated prompts matched or exceeded expert human prompts across multiple benchmarks.((Zhou et al. 2023, Section 5))

===== Benchmark Results =====

  * **Big-Bench tasks**: APE improved GPT-3 performance by 10-40% over zero-shot baselines, matching human prompts on navigation, sentiment analysis, and other tasks.
  * **Zero-shot CoT**: APE discovered prompts that outperformed the manually crafted "Let's think step by step" trigger.
  * **Generalization**: Selected prompts transferred well to held-out data and to different models (e.g., Codex-generated instructions boosted PaLM and GPT-3).
  * Later work (PE2) demonstrated further gains through iterative APE variants on counterfactual tasks.

===== Limitations =====

  * **Validation data required**: APE needs representative validation examples to score candidate prompts.
  * **Computational cost**: Generating and evaluating many candidates requires significant LLM inference.
  * **Quality ceiling**: Generated prompts are bounded by the base model's instruction-generation capabilities.
  * **Edge-case sensitivity**: May underperform on unusual inputs not represented in the validation data.

===== See Also =====

  * [[prompt_engineering]]
  * [[meta_prompting]]
  * [[zero_shot_prompting]]
  * [[few_shot_prompting]]
  * [[active_prompt]]
  * [[chain_of_thought_prompting]]

===== References =====
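The generate-score-select loop from the Optimization Process section can be illustrated with a minimal, self-contained sketch. The functions below (''propose_instructions'', ''execute'', ''ape_select'') are hypothetical stand-ins for real LLM API calls, not part of the original APE implementation; the toy "translate to French" task replaces an actual benchmark.

<code python>
# Minimal sketch of APE's two-stage Monte Carlo search.
# Stage 1 (generation) and the LLM calls are stubbed out for illustration.

def propose_instructions(demos, n=4):
    """Stage 1: an inference LLM would generate candidate instructions
    conditioned on the input-output demonstrations. Stubbed here."""
    return [
        "Translate the input word to French.",
        "Give the French word for the input.",
        "Repeat the input.",
        "Answer in German.",
    ][:n]

def execute(instruction, x):
    """Stand-in for the target LLM following `instruction` on input `x`."""
    lookup = {"cat": "chat", "dog": "chien", "house": "maison"}
    if "French" in instruction:
        return lookup.get(x, "?")
    return x  # a bad instruction just echoes the input

def ape_select(demos, val_set):
    """Stage 2: score each candidate on held-out validation data
    and keep the highest-scoring instruction."""
    candidates = propose_instructions(demos)

    def accuracy(inst):
        return sum(execute(inst, x) == y for x, y in val_set) / len(val_set)

    return max(candidates, key=accuracy)

demos = [("cat", "chat"), ("dog", "chien")]
val_set = [("house", "maison"), ("cat", "chat")]
best = ape_select(demos, val_set)
print(best)  # a French-translation instruction scores highest
</code>

Iterative APE would feed ''best'' back into ''propose_instructions'' as a seed for resampling similar candidates; OPRO would instead replace the ''accuracy'' scorer with LLM-generated comparisons between candidates.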