Automatic Prompt Engineer (APE)

Automatic Prompt Engineer (APE) is a framework that uses large language models to automatically generate, evaluate, and select effective natural-language instructions for downstream tasks. It treats prompt design as a black-box optimization problem over a discrete space of instructions, eliminating the need for manual prompt engineering.1)

How It Works

APE leverages LLMs in a two-stage process:

Stage 1: Instruction Generation

A base LLM (e.g., Codex) acts as an “inference model” to produce multiple candidate instructions by conditioning on few-shot input-output examples for a task. These candidates are treated as natural-language “programs” that guide the target LLM.
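The generation stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_llm` is a placeholder for a real LLM completion call (API client, local model, etc.) and is stubbed so the example runs offline, and the proposal template is only loosely modeled on APE's reverse-mode template.

```python
# Stage 1 sketch: sample candidate instructions from an "inference"
# LLM conditioned on few-shot demonstrations.

def ask_llm(prompt: str) -> str:
    """Stub for a real LLM completion call."""
    return "Write the antonym of the input word."

def build_proposal_prompt(demos: list[tuple[str, str]]) -> str:
    """Reverse-style template: show input-output pairs and ask for
    the instruction that could have produced them."""
    pairs = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return (
        "I gave a friend an instruction. Based on it, they produced "
        f"these input-output pairs:\n\n{pairs}\n\nThe instruction was:"
    )

def propose_instructions(demos, n_candidates=5):
    """Sample several candidate instructions (APE samples with
    temperature > 0 for diversity; the stub is deterministic)."""
    prompt = build_proposal_prompt(demos)
    return [ask_llm(prompt) for _ in range(n_candidates)]

demos = [("hot", "cold"), ("tall", "short")]
candidates = propose_instructions(demos, n_candidates=3)
```

With a real model behind `ask_llm`, repeated sampling yields a diverse pool of candidate "programs" for the next stage.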

Stage 2: Evaluation and Selection

Each candidate instruction is executed on held-out validation data using a target LLM (e.g., GPT-3). Performance scores (e.g., accuracy) are computed, and the highest-scoring instruction is selected. This Monte Carlo-style search can iterate, generating new candidates from top performers.
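A scoring-and-selection sketch, under the same caveats: `target_llm` stands in for the target model (e.g., GPT-3) and is stubbed with a fixed antonym table, so unlike a real model it ignores the instruction it is given.

```python
# Stage 2 sketch: score each candidate instruction by accuracy on a
# held-out validation set, then keep the best.

def target_llm(prompt: str) -> str:
    """Stubbed target model: answers antonym queries from a table."""
    antonyms = {"hot": "cold", "tall": "short", "fast": "slow"}
    word = prompt.rsplit("Input: ", 1)[-1].split("\n")[0].strip()
    return antonyms.get(word, "?")

def score(instruction: str, val_set: list[tuple[str, str]]) -> float:
    """Zero-one accuracy of the target model under `instruction`."""
    hits = sum(
        target_llm(f"{instruction}\nInput: {x}\nOutput:") == y
        for x, y in val_set
    )
    return hits / len(val_set)

def select_best(candidates: list[str], val_set) -> str:
    return max(candidates, key=lambda c: score(c, val_set))

val_set = [("hot", "cold"), ("fast", "slow")]
best = select_best(["Write the antonym.", "Repeat the word."], val_set)
```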

Optimization Process

APE formalizes prompt engineering as black-box optimization:2)

  1. Start with task demonstrations (input-output pairs).
  2. Generate diverse candidate prompts via LLM proposals.
  3. Score candidates on a validation set using target model performance.
  4. Select the best candidate; optionally refine through iterative search.
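The four steps above can be sketched as one end-to-end Monte Carlo loop. Both LLM calls are stubbed to keep the example self-contained: `CANDIDATE_POOL` stands in for Step 2's LLM proposals, and `ACCURACY` is an invented table standing in for Step 3's validation runs.

```python
import random

# Stand-in for Step 2's LLM proposals.
CANDIDATE_POOL = [
    "Output the antonym of the input.",
    "Reverse the input string.",
    "Write the opposite of the given word.",
]

# Stand-in for Step 3's validation-set scores (invented numbers).
ACCURACY = {
    "Output the antonym of the input.": 0.95,
    "Reverse the input string.": 0.10,
    "Write the opposite of the given word.": 0.90,
}

def ape_search(n_candidates=6, seed=0):
    rng = random.Random(seed)
    # Step 2: generate diverse candidates (deduplicated).
    candidates = {rng.choice(CANDIDATE_POOL) for _ in range(n_candidates)}
    # Step 3: score each candidate on the validation set.
    scored = [(ACCURACY[c], c) for c in candidates]
    # Step 4: select the best; an iterative refinement loop would now
    # resample variants of the top performers and repeat.
    return max(scored)[1]

best = ape_search()
```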

Advanced variants incorporate evolutionary search or reinforcement learning, but the original APE uses simple LLM-driven Monte Carlo search for efficiency.

OPRO: Optimization by Prompting

OPRO (Optimization by PROmpting) by Yang et al. builds on APE-like ideas but uses the LLM itself as the optimizer: at each step, a meta-prompt containing previously generated instructions and their accuracy scores is fed to the LLM, which proposes new candidates intended to score higher:3)

| Aspect | APE | OPRO |
|---|---|---|
| Optimization signal | Validation-set accuracy | Validation-set accuracy, fed back through a meta-prompt |
| Search strategy | Monte Carlo generation + selection | Iterative refinement conditioned on the optimization trajectory |
| Compute cost | Many validation runs per round | Validation runs plus increasingly long meta-prompts |
| Prompt format | Complete instructions | Instructions paired with their scores in an optimization history |
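As a concrete illustration of the difference, an OPRO-style meta-prompt lists previously tried instructions with their scores and asks the optimizer LLM for a better one. The template wording below is illustrative, not taken verbatim from the paper.

```python
# Build an OPRO-style meta-prompt: prior instructions are listed with
# their accuracies (ascending, so the best appear last) and the
# optimizer LLM is asked to propose an improvement.

def build_meta_prompt(history: list[tuple[str, float]], task: str) -> str:
    ranked = sorted(history, key=lambda pair: pair[1])  # worst first
    entries = "\n\n".join(
        f"text: {instr}\nscore: {acc:.0%}" for instr, acc in ranked
    )
    return (
        f"Your task is to {task}.\n\n"
        "Here are previous instructions with their training accuracies:\n\n"
        f"{entries}\n\n"
        "Write a new instruction that is different from the ones above "
        "and achieves a higher accuracy."
    )

history = [("Solve the problem.", 0.61), ("Let's think step by step.", 0.72)]
meta = build_meta_prompt(history, "solve grade-school math word problems")
```

Feeding `meta` to an LLM and scoring its reply on validation data, then appending the result to `history`, closes one optimization step.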

Comparison to Human-Written Prompts

| Aspect | APE (Automated) | Human-Written |
|---|---|---|
| Design process | Black-box search via LLM | Manual iteration and intuition |
| Performance | Human-level or better | Strong but task-specific |
| Scalability | Automated, data-efficient | Labor-intensive |
| Flexibility | Adapts to new tasks via demonstrations | Expertise-dependent |
| Consistency | Reproducible | Varies by practitioner |

APE-generated prompts matched or exceeded expert human prompts across multiple benchmarks.4)

References

1)
Zhou et al. 2023
2)
Zhou et al. 2023, Section 3
3)
Yang et al. 2024, OPRO
4)
Zhou et al. 2023, Section 5