Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
MedPrompt is a prompt engineering framework developed by Microsoft Research that combines dynamic few-shot selection, auto-generated chain-of-thought reasoning, and choice-shuffle ensembling to achieve state-of-the-art performance on medical benchmarks using general-purpose LLMs — without any fine-tuning. Originally demonstrated on medical QA, MedPrompt is fully generalizable and has set records on MMLU and other non-medical benchmarks.
The conventional approach to domain-specific AI performance involves expensive fine-tuning on specialized datasets (e.g., Med-PaLM 2 trained on medical corpora). MedPrompt demonstrates that a powerful generalist model (GPT-4) combined with systematic prompt engineering can outperform fine-tuned specialist models, challenging the assumption that domain expertise requires domain-specific training.
MedPrompt operates in two phases — preprocessing (building a retrieval database) and inference (dynamic prompting) — with three modular components:
Instead of using fixed exemplars, MedPrompt retrieves the most semantically similar examples for each query:
Each query is embedded (e.g., with OpenAI's text-embedding-ada-002), and the k most similar training examples are retrieved by cosine similarity in embedding space. Contribution: +0.8% accuracy improvement over random few-shot selection.
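A minimal sketch of the kNN selection step, assuming embeddings are plain NumPy vectors (the function name `top_k_exemplars` and the toy 2-D vectors are illustrative, not from the paper):

```python
import numpy as np

def top_k_exemplars(query_emb, db_embs, k=5):
    """Return indices of the k training examples most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity against every row
    return np.argsort(sims)[-k:][::-1]  # best match first

# Toy 2-D "embeddings": the query points along the x-axis.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
print(top_k_exemplars(query, db, k=2))  # → [0 2]
```

Normalizing once and taking a dot product is equivalent to cosine similarity and keeps the retrieval step a single vectorized operation.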
Rather than using human-written rationales, GPT-4 generates its own step-by-step reasoning for each few-shot exemplar during preprocessing:
Contribution: +3.4% accuracy improvement. Notably, auto-generated CoT outperforms expert-written rationales.
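A practical refinement during preprocessing is to keep only chains whose final answer matches the ground-truth label, so low-quality rationales never enter the exemplar pool. A minimal sketch, assuming a stub `llm` callable that returns a `(reasoning, answer)` tuple:

```python
def build_cot_db(training_data, llm):
    """Generate a CoT rationale per training item; discard label-inconsistent chains."""
    db = []
    for item in training_data:
        prompt = (f"Q: {item['question']}\n"
                  "Explain step by step, then give the answer.")
        reasoning, answer = llm(prompt)      # stub: (cot_text, final_answer)
        if answer == item['answer']:         # keep only label-consistent chains
            db.append({**item, 'auto_cot': reasoning})
    return db

# Stub LLM for illustration: answers "B" for everything.
stub_llm = lambda prompt: ("Step 1... therefore B.", "B")
data = [{'question': 'q1', 'answer': 'B'}, {'question': 'q2', 'answer': 'C'}]
print(len(build_cot_db(data, stub_llm)))  # → 1 (the 'C' item is filtered out)
```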
To reduce position bias in multiple-choice questions, MedPrompt runs the model multiple times with shuffled answer orderings and aggregates via majority vote:
Contribution: +2.1% accuracy improvement with consistent gains across all benchmarks.
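The shuffle-and-vote step can be sketched as follows. The key detail is voting over the choice *text*, not the letter, so votes remain comparable across orderings (`choice_shuffle_vote` and `stub_model` are illustrative names, not from the paper):

```python
import random
from collections import Counter

def choice_shuffle_vote(stem, choices, answer_fn, n_shuffles=5, seed=0):
    """Ask answer_fn the same question under n shuffled choice orderings,
    then majority-vote over the underlying choice text."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_shuffles):
        order = list(choices)
        rng.shuffle(order)
        letter = answer_fn(stem, order)              # e.g. returns 'A'
        votes.append(order[ord(letter) - ord('A')])  # map letter back to text
    return Counter(votes).most_common(1)[0][0]

# Stub model: always picks whichever position holds the "aspirin" choice,
# regardless of where shuffling put it.
def stub_model(stem, ordered_choices):
    idx = next(i for i, c in enumerate(ordered_choices) if 'aspirin' in c)
    return chr(ord('A') + idx)

print(choice_shuffle_vote("Best first-line drug?", ["aspirin", "x", "y"], stub_model))
# → aspirin
```

A model with genuine position bias would flip answers as orderings change; majority voting over shuffles averages that bias out.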
```python
# MedPrompt implementation sketch
import numpy as np
from collections import Counter

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def medprompt(question, training_db, llm, embedder, k=5, n_shuffles=5):
    # ---- INFERENCE ----
    # 1. Dynamic Few-Shot Selection (kNN)
    q_embedding = embedder.embed(question)
    similarities = [
        cosine_similarity(q_embedding, ex['embedding'])
        for ex in training_db
    ]
    top_k_indices = np.argsort(similarities)[-k:]
    exemplars = [training_db[i] for i in top_k_indices]

    # 2. Build prompt with auto-generated CoT exemplars
    few_shot_prompt = ''
    for ex in exemplars:
        few_shot_prompt += f"Q: {ex['question']}\n"
        few_shot_prompt += f"Reasoning: {ex['auto_cot']}\n"
        few_shot_prompt += f"A: {ex['answer']}\n\n"

    # 3. Choice-Shuffle Ensemble
    # (extract_choices, shuffle_choices, parse_answer are assumed helpers)
    choices = extract_choices(question)
    votes = []
    for _ in range(n_shuffles):
        shuffled_q = shuffle_choices(question, choices)
        prompt = few_shot_prompt + f"Q: {shuffled_q}\nReasoning:"
        response = llm.generate(prompt)
        answer = parse_answer(response, choices)
        votes.append(answer)

    # Majority vote over the ensemble
    final_answer = Counter(votes).most_common(1)[0][0]
    return final_answer

def preprocess_training_db(training_data, llm, embedder):
    # ---- PREPROCESSING ----
    # Embed each training question and auto-generate its CoT rationale.
    db = []
    for item in training_data:
        embedding = embedder.embed(item['question'])
        cot_prompt = (f"Q: {item['question']}\n"
                      "Explain step by step, then give the answer.")
        auto_cot = llm.generate(cot_prompt)
        db.append({
            'question': item['question'],
            'answer': item['answer'],
            'embedding': embedding,
            'auto_cot': auto_cot,
        })
    return db
```
| Component | Avg. Accuracy Gain | Notes |
|---|---|---|
| Dynamic Few-Shot | +0.8% | kNN retrieval from embedding space |
| Auto-Generated CoT | +3.4% | Largest single contributor |
| Choice-Shuffle Ensemble (5 runs) | +2.1% | Most consistent across benchmarks |
| Full MedPrompt | +7.1% | Over zero-shot baseline |
MedPrompt with GPT-4 achieved state-of-the-art on all nine MultiMedQA benchmarks:
| Benchmark | Zero-Shot GPT-4 | MedPrompt GPT-4 |
|---|---|---|
| MedQA (USMLE) | ~83% | 90.2% |
| PubMedQA | Baseline | Significant improvement |
| All MultiMedQA (avg) | Baseline | +7.1% |
This matched or exceeded Google's Gemini Ultra and surpassed fine-tuned models like Med-PaLM 2, all without any model training.
MedPrompt's design is domain-agnostic. MedPrompt+, an extended version, demonstrated state-of-the-art results on MMLU and other non-medical benchmarks.
This confirms that MedPrompt's components are general-purpose prompt engineering techniques, not medical-specific.
The ensemble decision rule for $M$ shuffled runs:
$$\hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{m=1}^{M} \mathbf{1}[f(q_{\pi_m}) = a]$$
where $\pi_m$ is the $m$-th random permutation of answer choices, $f(\cdot)$ is the model's predicted answer, and $\mathcal{A}$ is the set of possible answers.
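For instance, with $M = 5$ shuffled runs returning answers A, B, A, A, C, the indicator sums are

$$\sum_{m=1}^{5} \mathbf{1}[f(q_{\pi_m}) = \text{A}] = 3, \qquad \text{B}: 1, \qquad \text{C}: 1 \;\Rightarrow\; \hat{a} = \text{A},$$

so the ensemble commits to A even though two of the five shuffled runs disagreed.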