====== Few-Shot Prompting ======

Few-shot prompting is a [[prompt_engineering|prompt engineering]] technique in which a small number of input-output examples are included in the prompt to demonstrate the desired task behavior. The model performs **in-context learning**: it identifies patterns in the provided demonstrations and applies them to new inputs without any parameter updates or fine-tuning.((Brown et al. 2020, [[https://arxiv.org/abs/2005.14165|Language Models are Few-Shot Learners]], NeurIPS 2020))

===== How It Works =====

Few-shot prompting leverages the in-context learning capability of large language models. The process involves three steps:

  - **Select representative examples**: Identify a small number of input-output pairs (typically k=1 to k=5) that demonstrate the task pattern.
  - **Format the prompt**: Arrange the examples in a consistent, structured format so the model can recognize the input-output mapping.
  - **Run inference**: Submit the full prompt (examples + new input) to the model, which generates a response based on the detected pattern.

A typical few-shot prompt structure:

<code>
Input: "The food was amazing" -> Sentiment: Positive
Input: "Terrible service" -> Sentiment: Negative
Input: "Pretty good overall" -> Sentiment:
</code>

The model infers the pattern and completes the final example accordingly.

===== The GPT-3 Paper =====

Few-shot prompting was formalized and extensively studied in the landmark GPT-3 paper by Brown et al. (2020).((Brown et al. 2020, [[https://arxiv.org/abs/2005.14165|Language Models are Few-Shot Learners]], NeurIPS 2020)) Key findings include:

  * GPT-3 (175B parameters) achieved **competitive performance with fine-tuned models** on many NLP benchmarks using only few-shot demonstrations.
  * On TriviaQA, few-shot GPT-3 matched and then exceeded the state-of-the-art fine-tuned open-domain model (RAG).
  * Performance scaled smoothly with model size, with few-shot abilities emerging primarily in the largest models.
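The prompt structure shown under //How It Works// can be assembled programmatically. A minimal sketch in Python; the function name ''build_few_shot_prompt'' is illustrative, not from any library:

```python
# Minimal sketch: assemble a few-shot prompt from (input, label) pairs,
# mirroring the sentiment-classification format shown above.

def build_few_shot_prompt(examples, query):
    """Format k demonstrations plus the new input in a consistent pattern."""
    lines = [f'Input: "{text}" -> Sentiment: {label}' for text, label in examples]
    # The final line repeats the pattern but leaves the label blank
    # for the model to complete.
    lines.append(f'Input: "{query}" -> Sentiment:')
    return "\n".join(lines)

demos = [
    ("The food was amazing", "Positive"),
    ("Terrible service", "Negative"),
]
prompt = build_few_shot_prompt(demos, "Pretty good overall")
print(prompt)
```

The resulting string is what would be submitted to the model in the inference step; the model's continuation after the trailing ''Sentiment:'' is the prediction.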
  * Tasks tested included translation, question answering, cloze completion, arithmetic, and word unscrambling.
  * Few-shot performance improved markedly over zero-shot across nearly all tasks.

===== Shot Selection Strategies =====

The choice and arrangement of examples significantly affect performance:

  * **Diversity**: Include examples covering different scenarios and edge cases to help the model generalize.
  * **Relevance**: Select examples semantically similar to the expected test inputs.
  * **Ordering**: The order of examples can affect performance; place the most representative examples last (closest to the query).
  * **Label balance**: Ensure balanced representation of the output categories to prevent bias.
  * **Conciseness**: Keep examples brief to conserve context-window space. Typically k=1 to k=5 examples suffice.
  * **Format consistency**: Use identical formatting across all examples to reduce ambiguity.

===== Limitations =====

  * **Context window constraints**: Examples consume tokens from the finite context window, leaving less space for the actual task input.((This is especially significant for models with smaller context windows.))
  * **Over-prompting**: Excessive examples can paradoxically decrease performance, a phenomenon described as the "few-shot dilemma".((See [[https://arxiv.org/html/2509.13196v1|The Few-Shot Dilemma]], 2025))
  * **Example sensitivity**: Performance is highly sensitive to which examples are selected, their order, and their format.
  * **Costly for complex tasks**: Curating high-quality demonstrations requires domain expertise and effort.
  * **No persistent learning**: The model retains nothing between sessions; examples must be re-provided with every request.
  * **Bias propagation**: Biased or unrepresentative examples can steer model outputs in undesirable directions.

===== Few-Shot Prompting vs. Fine-Tuning =====

^ Aspect ^ Few-Shot Prompting ^ Fine-Tuning ^
| Data required | 1-5 examples | Hundreds to thousands of examples |
| Training needed | None (inference only) | Gradient updates required |
| Cost | Minimal (API calls) | Significant (compute + data) |
| Flexibility | Change examples per task | Retrain for each task |
| Performance ceiling | Good, model-dependent | Generally higher |
| Deployment speed | Immediate | Hours to days |
| Model modification | None | Weights updated |

===== See Also =====

  * [[zero_shot_prompting|Zero-Shot Prompting]]
  * [[meta_prompting|Meta Prompting]]
  * [[prompting_specificity|Prompting Specificity]]
  * [[structured_prompting|Structured Prompting]]
  * [[contextual_prompting|Contextual Prompting]]

===== References =====
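The **Relevance** strategy from //Shot Selection Strategies// can be sketched with a simple token-overlap (Jaccard) similarity; in practice embedding-based similarity is more common, and the function names here are illustrative:

```python
# Sketch of relevance-based example selection: rank candidate
# demonstrations by Jaccard similarity of their lowercased tokens
# against the query, and keep the top k.

def jaccard(a, b):
    """Token-overlap similarity between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_examples(pool, query, k=2):
    """Return the k (input, label) pairs most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(ex[0], query), reverse=True)
    return ranked[:k]

pool = [
    ("The food was amazing", "Positive"),
    ("Terrible service", "Negative"),
    ("The service was quick and friendly", "Positive"),
]
chosen = select_examples(pool, "the service was slow", k=2)
```

The selected pairs would then be formatted into the prompt as usual; note that a purely similarity-driven pick can skew label balance, which is why the strategies above are meant to be combined.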