====== Few-Shot Prompting ======

Few-shot prompting is a [[prompt_engineering|prompt engineering]] technique in which a small number of input-output examples are included in the prompt to demonstrate the desired task behavior. The model performs **in-context learning**: it identifies patterns in the provided demonstrations and applies them to new inputs without any parameter updates or fine-tuning.((Brown et al. 2020, [[https://arxiv.org/abs/2005.14165|Language Models are Few-Shot Learners]], NeurIPS 2020))

===== How It Works =====

Few-shot prompting leverages the in-context learning capability of large language models. The process involves three steps:

  - **Select representative examples**: Identify a small number of input-output pairs (typically k=1 to k=5) that demonstrate the task pattern.
  - **Format the prompt**: Arrange the examples in a consistent, structured format so the model can recognize the input-output mapping.
  - **Run inference**: Submit the full prompt (examples + new input) to the model, which generates a response based on the detected pattern.

A typical few-shot prompt structure:

<code>
Input: "The food was amazing" -> Sentiment: Positive
Input: "Terrible service" -> Sentiment: Negative
Input: "Pretty good overall" -> Sentiment:
</code>

The model infers the pattern and completes the final example accordingly.

===== The GPT-3 Paper =====

Few-shot prompting was formalized and extensively studied in the landmark GPT-3 paper by Brown et al. (2020).((Brown et al. 2020, [[https://arxiv.org/abs/2005.14165|Language Models are Few-Shot Learners]], NeurIPS 2020)) Key findings include:

  * GPT-3 (175B parameters) achieved **competitive performance with fine-tuned models** on many NLP benchmarks using only few-shot demonstrations.
  * On TriviaQA, few-shot GPT-3 matched and then exceeded the state-of-the-art fine-tuned open-domain model (RAG).
  * Performance scaled smoothly with model size, with few-shot abilities emerging primarily in the largest models.
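The prompt structure shown under //How It Works// can be assembled programmatically. A minimal sketch in Python; the function name ''build_few_shot_prompt'' is illustrative, not from any library:

```python
# Minimal sketch: assemble a few-shot prompt from (input, label) pairs,
# mirroring the sentiment-classification format shown above.

def build_few_shot_prompt(examples, query):
    """Format k demonstrations plus the new input in a consistent pattern."""
    lines = [f'Input: "{text}" -> Sentiment: {label}' for text, label in examples]
    # The final line repeats the pattern but leaves the label blank
    # for the model to complete.
    lines.append(f'Input: "{query}" -> Sentiment:')
    return "\n".join(lines)

demos = [
    ("The food was amazing", "Positive"),
    ("Terrible service", "Negative"),
]
prompt = build_few_shot_prompt(demos, "Pretty good overall")
print(prompt)
```

The resulting string is what would be submitted to the model in the inference step; the model's continuation after the trailing ''Sentiment:'' is the prediction.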
  * Tasks tested included translation, question answering, cloze completion, arithmetic, and word unscrambling.
  * Few-shot performance improved markedly over zero-shot across nearly all tasks.

===== Shot Selection Strategies =====

The choice and arrangement of examples significantly affect performance:

  * **Diversity**: Include examples covering different scenarios and edge cases to help the model generalize.
  * **Relevance**: Select examples semantically similar to the expected test inputs.
  * **Ordering**: The order of examples can affect performance; place the most representative examples last (closest to the query).
  * **Label balance**: Ensure balanced representation of the output categories to prevent bias.
  * **Conciseness**: Keep examples brief to conserve context-window space. Typically k=1 to k=5 examples suffice.
  * **Format consistency**: Use identical formatting across all examples to reduce ambiguity.

===== Limitations =====

  * **Context window constraints**: Examples consume tokens from the finite context window, leaving less space for the actual task input.((This is especially significant for models with smaller context windows.))
  * **Over-prompting**: Excessive examples can paradoxically decrease performance, a phenomenon described as the "few-shot dilemma".((See [[https://arxiv.org/html/2509.13196v1|The Few-Shot Dilemma]], 2025))
  * **Example sensitivity**: Performance is highly sensitive to which examples are selected, their order, and their format.
  * **Costly for complex tasks**: Curating high-quality demonstrations requires domain expertise and effort.
  * **No persistent learning**: The model retains nothing between sessions; examples must be re-provided with every request.
  * **Bias propagation**: Biased or unrepresentative examples can steer model outputs in undesirable directions.

===== Few-Shot Prompting vs. Fine-Tuning =====

^ Aspect ^ Few-Shot Prompting ^ Fine-Tuning ^
| Data required | 1-5 examples | Hundreds to thousands of examples |
| Training needed | None (inference only) | Gradient updates required |
| Cost | Minimal (API calls) | Significant (compute + data) |
| Flexibility | Change examples per task | Retrain for each task |
| Performance ceiling | Good, model-dependent | Generally higher |
| Deployment speed | Immediate | Hours to days |
| Model modification | None | Weights updated |

===== See Also =====

  * [[zero_shot_prompting|Zero-Shot Prompting]]
  * [[meta_prompting|Meta Prompting]]
  * [[prompting_specificity|Prompting Specificity]]
  * [[structured_prompting|Structured Prompting]]
  * [[contextual_prompting|Contextual Prompting]]

===== References =====
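The **Relevance** strategy from //Shot Selection Strategies// can be sketched with a simple token-overlap (Jaccard) similarity; in practice embedding-based similarity is more common, and the function names here are illustrative:

```python
# Sketch of relevance-based example selection: rank candidate
# demonstrations by Jaccard similarity of their lowercased tokens
# against the query, and keep the top k.

def jaccard(a, b):
    """Token-overlap similarity between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_examples(pool, query, k=2):
    """Return the k (input, label) pairs most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(ex[0], query), reverse=True)
    return ranked[:k]

pool = [
    ("The food was amazing", "Positive"),
    ("Terrible service", "Negative"),
    ("The service was quick and friendly", "Positive"),
]
chosen = select_examples(pool, "the service was slow", k=2)
```

The selected pairs would then be formatted into the prompt as usual; note that a purely similarity-driven pick can skew label balance, which is why the strategies above are meant to be combined.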