Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Automatic Prompt Engineer (APE) is a framework that uses large language models to autonomously generate, evaluate, and select optimal natural-language instructions for downstream tasks. It treats prompt design as a black-box optimization problem over discrete instruction space, eliminating the need for manual prompt engineering.1)
APE leverages LLMs in a two-stage process:
Generation: A base LLM (e.g., Codex) acts as an “inference model” to produce multiple candidate instructions by conditioning on few-shot input-output examples for a task. These candidates are treated as natural-language “programs” that guide the target LLM.
Scoring and selection: Each candidate instruction is executed on held-out validation data using a target LLM (e.g., GPT-3). Performance scores (e.g., accuracy) are computed, and the highest-scoring instruction is selected. This Monte Carlo-style search can iterate, generating new candidates from top performers.
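The two-stage loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_inference_llm` and `toy_target_llm` are hypothetical stand-ins for the real models, and the task (string reversal) is a toy example.

```python
def propose_candidates(demos, inference_llm, n_candidates=4):
    """Stage 1: an 'inference' LLM proposes candidate instructions
    conditioned on few-shot input-output demonstrations."""
    meta_prompt = ("A friend followed a hidden instruction to map these "
                   "inputs to outputs:\n"
                   + "\n".join(f"{x} -> {y}" for x, y in demos)
                   + "\nThe instruction was:")
    return [inference_llm(meta_prompt) for _ in range(n_candidates)]

def accuracy(instruction, val_set, target_llm):
    """Stage 2: run a candidate on held-out validation pairs with the
    target LLM and compute exact-match accuracy."""
    hits = sum(target_llm(instruction, x) == y for x, y in val_set)
    return hits / len(val_set)

def ape_search(demos, val_set, inference_llm, target_llm):
    """Score every candidate and keep the best (Monte Carlo search)."""
    scored = [(accuracy(c, val_set, target_llm), c)
              for c in propose_candidates(demos, inference_llm)]
    return max(scored)  # (best score, best instruction)

# Toy stand-ins so the sketch runs end to end: the task is string reversal.
demos = [("abc", "cba"), ("hello", "olleh")]
val_set = [("agent", "tnega"), ("prompt", "tpmorp")]
_pool = ["Repeat the input string.", "Reverse the input string."]

def toy_inference_llm(meta_prompt, _i=[0]):
    # Deterministically cycles through a fixed candidate pool.
    _i[0] += 1
    return _pool[_i[0] % len(_pool)]

def toy_target_llm(instruction, x):
    # Follows the instruction literally: reverses only if told to.
    return x[::-1] if "Reverse" in instruction else x

best_score, best = ape_search(demos, val_set, toy_inference_llm, toy_target_llm)
# best is "Reverse the input string." with validation accuracy 1.0
```

In practice both stands-ins would be API calls, and the validation scoring (Stage 2) dominates the compute cost, since every candidate runs over the whole held-out set.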
APE formalizes prompt engineering as black-box optimization over a discrete space of natural-language instructions, requiring only query access to the target model.2)
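A standard way to state the objective (following the APE paper's notation, where $\rho$ is an instruction and $f$ a per-sample score):

$$\rho^{\star} = \arg\max_{\rho}\; \mathbb{E}_{(Q,A)\sim\mathcal{D}}\left[f(\rho, Q, A)\right]$$

APE approximates the expectation with a held-out validation set and approximates the $\arg\max$ by Monte Carlo sampling of candidate instructions, since the space of instructions is discrete and the target model is a black box.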
Advanced variants incorporate evolutionary search or reinforcement learning, but the original APE uses simple LLM-driven Monte Carlo search for efficiency.
OPRO (Optimization by PROmpting) by Yang et al. builds on APE-like ideas but uses LLMs to iteratively generate and score prompts via natural-language critiques:3)
| Aspect | APE | OPRO |
| --- | --- | --- |
| Optimization signal | Validation-set accuracy | LLM-generated critiques |
| Search strategy | Monte Carlo generation + selection | Iterative critique-driven refinement |
| Compute cost | Higher (many validation runs) | Lower (LLM comparisons) |
| Prompt format | Complete instructions | Instructions with optimization history |
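As a contrast with APE's generate-then-select search, a single OPRO-style refinement step keeps a trajectory of scored prompts and asks an optimizer LLM for something better. This is a hedged sketch of the idea, not OPRO's actual implementation: `toy_optimizer_llm` and `toy_scorer` are hypothetical stand-ins.

```python
def opro_step(history, optimizer_llm, scorer, k=3):
    """One OPRO-style iteration: show the trajectory of (score, prompt)
    pairs to an optimizer LLM, collect k new candidates, score them,
    and return the best prompt seen so far."""
    trajectory = "\n".join(f"score={s:.2f}: {p}"
                           for s, p in sorted(history)[-20:])
    meta_prompt = ("Below are prompts with their scores, low to high:\n"
                   + trajectory
                   + "\nWrite a new prompt that scores higher.")
    for _ in range(k):
        candidate = optimizer_llm(meta_prompt)
        history.append((scorer(candidate), candidate))
    return max(history)  # (best score, best prompt)

# Toy stand-ins so the sketch runs: the scorer rewards step-by-step prompts.
def toy_optimizer_llm(meta_prompt, _i=[0]):
    pool = ["Answer the question.",
            "Think step by step, then answer.",
            "Carefully reason step by step and give the final answer."]
    c = pool[_i[0] % len(pool)]
    _i[0] += 1
    return c

def toy_scorer(prompt):
    # Hypothetical score: fraction of desired traits the prompt exhibits.
    return ("step by step" in prompt) * 0.5 + ("final answer" in prompt) * 0.5

history = [(toy_scorer("Solve it."), "Solve it.")]
best_score, best = opro_step(history, toy_optimizer_llm, toy_scorer)
```

The key structural difference from the APE sketch is that the optimization history itself is fed back into the meta-prompt, so later candidates can build on what already worked.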
| Aspect | APE (Automated) | Human-Written |
| --- | --- | --- |
| Design process | Black-box search via LLM | Manual iteration and intuition |
| Performance | Human-level or better | Strong but task-specific |
| Scalability | Automated, data-efficient | Labor-intensive |
| Flexibility | Adapts to new tasks via demonstrations | Expertise-dependent |
| Consistency | Reproducible | Varies by practitioner |
APE-generated prompts matched or exceeded expert human prompts across multiple benchmarks.4)