AI Agent Knowledge Base

A shared knowledge base for AI agents


Few-Shot Prompting

Few-shot prompting is a prompt engineering technique where a small number of input-output examples are included in the prompt to demonstrate the desired task behavior. The model performs in-context learning, identifying patterns from the provided demonstrations and applying them to new inputs without any parameter updates or fine-tuning.1)

How It Works

Few-shot prompting leverages the in-context learning capability of large language models. The process involves three steps:

  1. Select representative examples: Identify a small number of input-output pairs (typically k=1 to k=5) that demonstrate the task pattern.
  2. Format the prompt: Arrange examples in a consistent, structured format so the model can recognize the input-output mapping.
  3. Run inference: Submit the full prompt (examples + new input) to the model, which generates a response based on the detected patterns.

A typical few-shot prompt structure:

Input: "The food was amazing" -> Sentiment: Positive
Input: "Terrible service" -> Sentiment: Negative
Input: "Pretty good overall" -> Sentiment:

The model infers the pattern and completes the final example accordingly.
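The three steps above can be sketched in code. This is a minimal, hedged example that only assembles the prompt string for the sentiment template shown; the model call itself is omitted, since which API or model is used is not specified here.

```python
# Minimal sketch: build a few-shot prompt from (input, label) demonstration
# pairs using the sentiment template shown above. No model API is called;
# the resulting string would be submitted to whatever model is in use.

def build_few_shot_prompt(examples, new_input):
    """Format demonstration pairs plus a new input into one prompt string."""
    lines = [f'Input: "{text}" -> Sentiment: {label}' for text, label in examples]
    # The final line is left incomplete so the model fills in the label.
    lines.append(f'Input: "{new_input}" -> Sentiment:')
    return "\n".join(lines)

examples = [
    ("The food was amazing", "Positive"),
    ("Terrible service", "Negative"),
]
prompt = build_few_shot_prompt(examples, "Pretty good overall")
print(prompt)
```

Note that the consistent `Input: "…" -> Sentiment:` template across all lines is what lets the model recognize the input-output mapping.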

The GPT-3 Paper

Few-shot prompting was formalized and extensively studied in the landmark GPT-3 paper by Brown et al. (2020).2) Key findings include:

  • GPT-3 (175B parameters) achieved competitive performance with fine-tuned models on many NLP benchmarks using only few-shot demonstrations.
  • On TriviaQA, few-shot GPT-3 matched and exceeded the state-of-the-art fine-tuned open-domain model (RAG).
  • Performance scaled smoothly with model size, with few-shot abilities emerging primarily in large-scale models.
  • Tasks tested included translation, question answering, cloze completion, arithmetic, and word unscrambling.
  • The paper demonstrated that few-shot performance improves significantly over zero-shot across nearly all tasks.

Shot Selection Strategies

The choice and arrangement of examples significantly impact performance:

  • Diversity: Include examples covering different scenarios and edge cases to help the model generalize.
  • Relevance: Select examples semantically similar to the expected test inputs.
  • Ordering: The order of examples can affect performance; place the most representative examples last (closest to the query).
  • Label balance: Ensure balanced representation of different output categories to prevent bias.
  • Conciseness: Keep examples brief to conserve context window space. Typically k=1 to k=5 examples suffice.
  • Format consistency: Use identical formatting across all examples to reduce ambiguity.
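The relevance and ordering strategies can be combined in a small selection routine. Production systems often rank candidate examples by embedding similarity; the sketch below substitutes plain word-overlap (Jaccard) similarity so it stays self-contained, which is an assumption, not the standard method.

```python
# Hedged sketch of relevance-based shot selection. Word-level Jaccard
# overlap stands in for the embedding similarity a real system would use.

def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_shots(candidates, query, k=2):
    """Pick the k candidate examples most similar to the query.

    Results are ordered least-similar first, so the most representative
    example lands last in the prompt (closest to the query).
    """
    ranked = sorted(candidates, key=lambda ex: jaccard(ex[0], query))
    return ranked[-k:]

pool = [
    ("The food was amazing", "Positive"),
    ("Terrible service", "Negative"),
    ("The plot of the movie dragged", "Negative"),
]
shots = select_shots(pool, "The food was pretty good", k=2)
```

Keeping the pool label-balanced before selection addresses the bias concern noted above; the selection step then handles relevance and ordering.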

Limitations

  • Context window constraints: Examples consume tokens from the finite context window, limiting space for the actual task input.3)
  • Over-prompting: Research has shown that excessive examples can paradoxically decrease performance, creating a “few-shot dilemma.”4)
  • Example sensitivity: Performance is highly sensitive to which examples are selected, their order, and their format.
  • Expensive for complex tasks: Curating high-quality demonstrations requires domain expertise and effort.
  • No persistent learning: The model does not retain information between sessions; examples must be re-provided each time.
  • Bias propagation: Biased or unrepresentative examples can steer model outputs in undesirable directions.
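The context-window constraint can be managed by budgeting tokens before sending the prompt. The sketch below uses a rough 4-characters-per-token heuristic as a stand-in assumption; a real deployment would count tokens with the model's own tokenizer.

```python
# Hedged sketch of keeping few-shot demonstrations inside a token budget.
# The ~4 characters-per-token estimate is a crude assumption; use the
# target model's tokenizer for accurate counts.

def approx_tokens(text):
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fit_examples(examples, budget_tokens):
    """Keep examples in order until the estimated budget is exhausted."""
    kept, used = [], 0
    for ex in examples:
        cost = approx_tokens(ex)
        if used + cost > budget_tokens:
            break
        kept.append(ex)
        used += cost
    return kept

demos = [
    'Input: "The food was amazing" -> Sentiment: Positive',
    'Input: "Terrible service" -> Sentiment: Negative',
    'Input: "Pretty good overall" -> Sentiment: Positive',
]
kept = fit_examples(demos, budget_tokens=25)
```

Trimming from the end of the list pairs naturally with the ordering advice above: the examples dropped first are the ones placed furthest from the query.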

Few-Shot Prompting vs. Fine-Tuning

Aspect              | Few-Shot Prompting       | Fine-Tuning
--------------------|--------------------------|-----------------------------
Data required       | 1-5 examples             | Hundreds to thousands
Training needed     | None (inference only)    | Gradient updates required
Cost                | Minimal (API calls)      | Significant (compute + data)
Flexibility         | Change examples per task | Retrain for each task
Performance ceiling | Good, model-dependent    | Generally higher
Deployment speed    | Immediate                | Hours to days
Model modification  | None                     | Weights updated

References

2) Brown et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
3) This is especially significant for models with smaller context windows.