====== Automatic Prompt Engineer (APE) ======

Automatic Prompt Engineer (APE) is a framework that uses large language models to autonomously generate, evaluate, and select high-performing natural-language instructions for downstream tasks. It treats prompt design as a black-box optimization problem over a discrete instruction space, eliminating the need for manual prompt engineering.((Zhou et al. 2023, [[https://arxiv.org/abs/2211.01910|Large Language Models Are Human-Level Prompt Engineers]]))

===== How It Works =====

APE leverages LLMs in a two-stage process:

==== Stage 1: Instruction Generation ====

A base LLM (e.g., Codex) acts as an "inference model," producing multiple candidate instructions conditioned on few-shot input-output examples for a task. These candidates are treated as natural-language "programs" that guide the target LLM.

==== Stage 2: Evaluation and Selection ====

Each candidate instruction is executed on held-out validation data using a target LLM (e.g., GPT-3). Performance scores (e.g., accuracy) are computed, and the highest-scoring instruction is selected. This Monte Carlo-style search can iterate, generating new candidates from the top performers.

===== Optimization Process =====

APE formalizes prompt engineering as black-box optimization:((Zhou et al. 2023, Section 3))

  - Start with task demonstrations (input-output pairs).
  - Generate diverse candidate prompts via LLM proposals.
  - Score candidates on a validation set using target-model performance.
  - Select the best candidate; optionally refine through iterative search.

Advanced variants incorporate evolutionary search or reinforcement learning, but the original APE uses a simple LLM-driven Monte Carlo search for efficiency.

===== OPRO: Optimization by Prompting =====

OPRO (Optimization by PROmpting) by Yang et al. builds on APE-like ideas but uses LLMs to iteratively generate **and score** prompts via natural-language critiques:((Yang et al. 2024, OPRO))

  * The LLM evaluator provides relative rankings ("better than" comparisons) to guide refinement.
  * This reduces compute needs compared to APE's data-heavy evaluation approach.
  * OPRO emphasizes critique-driven evolution rather than scoring on validation data.

^ Aspect ^ APE ^ OPRO ^
| Optimization signal | Validation set accuracy | LLM-generated critiques |
| Search strategy | Monte Carlo generation + selection | Iterative critique-driven refinement |
| Compute cost | Higher (many validation runs) | Lower (LLM comparisons) |
| Prompt format | Complete instructions | Instructions with optimization history |

===== Comparison to Human-Written Prompts =====

^ Aspect ^ APE (Automated) ^ Human-Written ^
| Design process | Black-box search via LLM | Manual iteration and intuition |
| Performance | Human-level or better | Strong but task-specific |
| Scalability | Automated, data-efficient | Labor-intensive |
| Flexibility | Adapts to new tasks via demonstrations | Expertise-dependent |
| Consistency | Reproducible | Varies by practitioner |

APE-generated prompts matched or exceeded expert human prompts across multiple benchmarks.((Zhou et al. 2023, Section 5))

===== Benchmark Results =====

  * **Big-Bench tasks**: APE improved GPT-3 performance by 10-40% over zero-shot baselines, matching human prompts on navigation, sentiment analysis, and other tasks.
  * **Zero-shot CoT**: APE discovered prompts that outperformed the manually crafted "Let's think step by step" trigger.
  * **Generalization**: Selected prompts transferred well to held-out data and to different models (e.g., Codex-generated instructions boosted PaLM and GPT-3).
  * Later work (PE2) demonstrated further gains through iterative APE variants on counterfactual tasks.

===== Limitations =====

  * **Validation data required**: APE needs representative validation examples to score candidate prompts.
  * **Computational cost**: Generating and evaluating many candidates requires significant LLM inference.
  * **Quality ceiling**: Generated prompts are bounded by the base model's instruction-generation capabilities.
  * **Edge-case sensitivity**: May underperform on unusual inputs not represented in the validation data.

===== See Also =====

  * [[prompt_engineering]]
  * [[meta_prompting]]
  * [[zero_shot_prompting]]
  * [[few_shot_prompting]]
  * [[active_prompt]]
  * [[chain_of_thought_prompting]]

===== References =====
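The generate-score-select loop from the Optimization Process section can be illustrated with a minimal, self-contained sketch. The functions below (''propose_instructions'', ''execute'', ''ape_select'') are hypothetical stand-ins for real LLM API calls, not part of the original APE implementation; the toy "translate to French" task replaces an actual benchmark.

<code python>
# Minimal sketch of APE's two-stage Monte Carlo search.
# Stage 1 (generation) and the LLM calls are stubbed out for illustration.

def propose_instructions(demos, n=4):
    """Stage 1: an inference LLM would generate candidate instructions
    conditioned on the input-output demonstrations. Stubbed here."""
    return [
        "Translate the input word to French.",
        "Give the French word for the input.",
        "Repeat the input.",
        "Answer in German.",
    ][:n]

def execute(instruction, x):
    """Stand-in for the target LLM following `instruction` on input `x`."""
    lookup = {"cat": "chat", "dog": "chien", "house": "maison"}
    if "French" in instruction:
        return lookup.get(x, "?")
    return x  # a bad instruction just echoes the input

def ape_select(demos, val_set):
    """Stage 2: score each candidate on held-out validation data
    and keep the highest-scoring instruction."""
    candidates = propose_instructions(demos)

    def accuracy(inst):
        return sum(execute(inst, x) == y for x, y in val_set) / len(val_set)

    return max(candidates, key=accuracy)

demos = [("cat", "chat"), ("dog", "chien")]
val_set = [("house", "maison"), ("cat", "chat")]
best = ape_select(demos, val_set)
print(best)  # a French-translation instruction scores highest
</code>

Iterative APE would feed ''best'' back into ''propose_instructions'' as a seed for resampling similar candidates; OPRO would instead replace the ''accuracy'' scorer with LLM-generated comparisons between candidates.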