Automatic Prompt Engineer (APE)

Automatic Prompt Engineer (APE) is a framework that uses large language models to automatically generate, evaluate, and select effective natural-language instructions for downstream tasks. It treats prompt design as a black-box optimization problem over a discrete space of instructions, eliminating the need for manual prompt engineering.1)

How It Works

APE leverages LLMs in a two-stage process:

Stage 1: Instruction Generation

A base LLM (e.g., Codex) acts as an “inference model” to produce multiple candidate instructions by conditioning on few-shot input-output examples for a task. These candidates are treated as natural-language “programs” that guide the target LLM.
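The generation stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_llm` is a placeholder for a real LLM completion call (API client, local model, etc.) and is stubbed so the example runs offline, and the proposal template is only loosely modeled on APE's reverse-mode template.

```python
# Stage 1 sketch: sample candidate instructions from an "inference"
# LLM conditioned on few-shot demonstrations.

def ask_llm(prompt: str) -> str:
    """Stub for a real LLM completion call."""
    return "Write the antonym of the input word."

def build_proposal_prompt(demos: list[tuple[str, str]]) -> str:
    """Reverse-style template: show input-output pairs and ask for
    the instruction that could have produced them."""
    pairs = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return (
        "I gave a friend an instruction. Based on it, they produced "
        f"these input-output pairs:\n\n{pairs}\n\nThe instruction was:"
    )

def propose_instructions(demos, n_candidates=5):
    """Sample several candidate instructions (APE samples with
    temperature > 0 for diversity; the stub is deterministic)."""
    prompt = build_proposal_prompt(demos)
    return [ask_llm(prompt) for _ in range(n_candidates)]

demos = [("hot", "cold"), ("tall", "short")]
candidates = propose_instructions(demos, n_candidates=3)
```

With a real model behind `ask_llm`, repeated sampling yields a diverse pool of candidate "programs" for the next stage.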

Stage 2: Evaluation and Selection

Each candidate instruction is executed on held-out validation data using a target LLM (e.g., GPT-3). Performance scores (e.g., accuracy) are computed, and the highest-scoring instruction is selected. This Monte Carlo-style search can iterate, generating new candidates from top performers.
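A scoring-and-selection sketch, under the same caveats: `target_llm` stands in for the target model (e.g., GPT-3) and is stubbed with a fixed antonym table, so unlike a real model it ignores the instruction it is given.

```python
# Stage 2 sketch: score each candidate instruction by accuracy on a
# held-out validation set, then keep the best.

def target_llm(prompt: str) -> str:
    """Stubbed target model: answers antonym queries from a table."""
    antonyms = {"hot": "cold", "tall": "short", "fast": "slow"}
    word = prompt.rsplit("Input: ", 1)[-1].split("\n")[0].strip()
    return antonyms.get(word, "?")

def score(instruction: str, val_set: list[tuple[str, str]]) -> float:
    """Zero-one accuracy of the target model under `instruction`."""
    hits = sum(
        target_llm(f"{instruction}\nInput: {x}\nOutput:") == y
        for x, y in val_set
    )
    return hits / len(val_set)

def select_best(candidates: list[str], val_set) -> str:
    return max(candidates, key=lambda c: score(c, val_set))

val_set = [("hot", "cold"), ("fast", "slow")]
best = select_best(["Write the antonym.", "Repeat the word."], val_set)
```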

Optimization Process

APE formalizes prompt engineering as black-box optimization:2)

  1. Start with task demonstrations (input-output pairs).
  2. Generate diverse candidate prompts via LLM proposals.
  3. Score candidates on a validation set using target model performance.
  4. Select the best candidate; optionally refine through iterative search.
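The four steps above can be sketched as one end-to-end Monte Carlo loop. Both LLM calls are stubbed to keep the example self-contained: `CANDIDATE_POOL` stands in for Step 2's LLM proposals, and `ACCURACY` is an invented table standing in for Step 3's validation runs.

```python
import random

# Stand-in for Step 2's LLM proposals.
CANDIDATE_POOL = [
    "Output the antonym of the input.",
    "Reverse the input string.",
    "Write the opposite of the given word.",
]

# Stand-in for Step 3's validation-set scores (invented numbers).
ACCURACY = {
    "Output the antonym of the input.": 0.95,
    "Reverse the input string.": 0.10,
    "Write the opposite of the given word.": 0.90,
}

def ape_search(n_candidates=6, seed=0):
    rng = random.Random(seed)
    # Step 2: generate diverse candidates (deduplicated).
    candidates = {rng.choice(CANDIDATE_POOL) for _ in range(n_candidates)}
    # Step 3: score each candidate on the validation set.
    scored = [(ACCURACY[c], c) for c in candidates]
    # Step 4: select the best; an iterative refinement loop would now
    # resample variants of the top performers and repeat.
    return max(scored)[1]

best = ape_search()
```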

Advanced variants incorporate evolutionary search or reinforcement learning, but the original APE uses simple LLM-driven Monte Carlo search for efficiency.

OPRO: Optimization by Prompting

OPRO (Optimization by PROmpting) by Yang et al. builds on APE-like ideas but uses the LLM itself as the optimizer: at each step, a meta-prompt containing previously generated instructions and their accuracy scores is fed to the LLM, which proposes new candidates intended to score higher:3)

| Aspect | APE | OPRO |
|---|---|---|
| Optimization signal | Validation-set accuracy | Validation-set accuracy, fed back through a meta-prompt |
| Search strategy | Monte Carlo generation + selection | Iterative refinement conditioned on the optimization trajectory |
| Compute cost | Many validation runs per round | Validation runs plus increasingly long meta-prompts |
| Prompt format | Complete instructions | Instructions paired with their scores in an optimization history |
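As a concrete illustration of the difference, an OPRO-style meta-prompt lists previously tried instructions with their scores and asks the optimizer LLM for a better one. The template wording below is illustrative, not taken verbatim from the paper.

```python
# Build an OPRO-style meta-prompt: prior instructions are listed with
# their accuracies (ascending, so the best appear last) and the
# optimizer LLM is asked to propose an improvement.

def build_meta_prompt(history: list[tuple[str, float]], task: str) -> str:
    ranked = sorted(history, key=lambda pair: pair[1])  # worst first
    entries = "\n\n".join(
        f"text: {instr}\nscore: {acc:.0%}" for instr, acc in ranked
    )
    return (
        f"Your task is to {task}.\n\n"
        "Here are previous instructions with their training accuracies:\n\n"
        f"{entries}\n\n"
        "Write a new instruction that is different from the ones above "
        "and achieves a higher accuracy."
    )

history = [("Solve the problem.", 0.61), ("Let's think step by step.", 0.72)]
meta = build_meta_prompt(history, "solve grade-school math word problems")
```

Feeding `meta` to an LLM and scoring its reply on validation data, then appending the result to `history`, closes one optimization step.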

Comparison to Human-Written Prompts

| Aspect | APE (Automated) | Human-Written |
|---|---|---|
| Design process | Black-box search via LLM | Manual iteration and intuition |
| Performance | Human-level or better | Strong but task-specific |
| Scalability | Automated, data-efficient | Labor-intensive |
| Flexibility | Adapts to new tasks via demonstrations | Expertise-dependent |
| Consistency | Reproducible | Varies by practitioner |

APE-generated prompts matched or exceeded expert human prompts across multiple benchmarks.4)

References

1)
Zhou et al. 2023
2)
Zhou et al. 2023, Section 3
3)
Yang et al. 2024, OPRO
4)
Zhou et al. 2023, Section 5