====== MedPrompt ======
**MedPrompt** is a prompt engineering framework developed by Microsoft Research that combines dynamic few-shot selection, auto-generated chain-of-thought reasoning, and choice-shuffle ensembling to achieve state-of-the-art performance on medical benchmarks using general-purpose LLMs --- without any fine-tuning. Originally demonstrated on medical QA, MedPrompt is fully generalizable and has set records on MMLU and other non-medical benchmarks.
===== Motivation =====
The conventional approach to domain-specific AI performance involves expensive fine-tuning on specialized datasets (e.g., Med-PaLM 2 trained on medical corpora). MedPrompt demonstrates that a powerful generalist model (GPT-4) combined with systematic prompt engineering can //outperform// fine-tuned specialist models, challenging the assumption that domain expertise requires domain-specific training.
===== Architecture: Three Components =====
MedPrompt operates in two phases --- **preprocessing** (building a retrieval database) and **inference** (dynamic prompting) --- with three modular components:
=== 1. Dynamic Few-Shot Selection (kNN-based ICL) ===
Instead of using fixed exemplars, MedPrompt retrieves the most semantically similar examples for each query:
* Embed all training questions using a text embedding model (e.g., ''text-embedding-ada-002'')
* At inference, embed the test question and retrieve $k$ nearest neighbors via cosine similarity
* Use these as few-shot exemplars, ensuring maximum relevance to the specific question
**Contribution**: +0.8% accuracy improvement over random few-shot selection.
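The selection step above can be sketched with plain NumPy. The embedding model itself is assumed to exist elsewhere; here exemplars are picked by cosine similarity over precomputed vectors (''select_exemplars'' is an illustrative name, not from the paper):

```python
import numpy as np

def select_exemplars(query_vec, train_vecs, k=5):
    """Return indices of the k training questions most similar to the query.

    query_vec:  (d,) embedding of the test question
    train_vecs: (n, d) matrix of precomputed training-question embeddings
    """
    # Normalize rows so a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = t @ q                        # (n,) cosine similarities
    return np.argsort(sims)[::-1][:k]   # indices, most similar first

# Toy example with 2-D "embeddings": rows 0 and 2 point roughly the same
# way as the query, row 1 is orthogonal
train = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.05])
print(select_exemplars(query, train, k=2))  # → [0 2]
```

The retrieved rows would then be formatted as few-shot exemplars, most relevant first.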
=== 2. Auto-Generated Chain-of-Thought ===
Rather than using human-written rationales, GPT-4 generates its own step-by-step reasoning for each few-shot exemplar during preprocessing:
* For each training example, prompt GPT-4 to produce a detailed CoT explanation
* Store these auto-generated rationales alongside the questions and answers
* At inference, retrieved exemplars include the model's own reasoning style
**Contribution**: +3.4% accuracy improvement. Notably, auto-generated CoT //outperforms// expert-written rationales.
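The paper additionally filters self-generated rationales with a label check: a chain of thought is kept only if the model's own final answer matches the ground-truth label, discarding reasoning that arrives at the wrong answer. A minimal sketch of that filter, where the answer parser and the LLM call are simplified stand-ins:

```python
import re

def parse_final_answer(cot_text):
    """Naive parser (assumption): take the last 'Answer: X' in the text."""
    matches = re.findall(r"Answer:\s*([A-D])", cot_text)
    return matches[-1] if matches else None

def generate_verified_cot(item, generate, max_attempts=3):
    """Auto-generate a CoT; keep it only if it reaches the gold answer.

    `generate(prompt) -> str` is any LLM call (assumed interface).
    """
    prompt = (f"Q: {item['question']}\n"
              "Explain step by step, then write 'Answer: <letter>'.")
    for _ in range(max_attempts):
        cot = generate(prompt)
        if parse_final_answer(cot) == item['answer']:
            return cot   # reasoning is consistent with the label: store it
    return None          # discard: self-generated answer was wrong

# Toy stand-in for an LLM call
def fake_llm(prompt):
    return "The aorta carries oxygenated blood. Answer: B"

item = {'question': 'Which vessel carries oxygenated blood?', 'answer': 'B'}
cot = generate_verified_cot(item, fake_llm)
```

This self-verification is part of why auto-generated CoT can beat expert rationales: only reasoning the model can actually follow to the correct answer survives preprocessing.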
=== 3. Choice-Shuffle Ensemble ===
To reduce position bias in multiple-choice questions, MedPrompt runs the model multiple times with shuffled answer orderings and aggregates via majority vote:
* Randomly permute answer choices across $M$ runs (typically $M = 5$)
* Each run produces an answer; take the majority vote
* This corrects for the model's tendency to favor certain positions (e.g., option A)
**Contribution**: +2.1% accuracy improvement with consistent gains across all benchmarks.
<code python>
# MedPrompt implementation sketch
import numpy as np
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def medprompt(question, training_db, llm, embedder, k=5, n_shuffles=5):
    # ---- INFERENCE ----
    # 1. Dynamic few-shot selection (kNN over question embeddings)
    q_embedding = embedder.embed(question)
    similarities = [
        cosine_similarity(q_embedding, ex['embedding'])
        for ex in training_db
    ]
    top_k_indices = np.argsort(similarities)[-k:]
    exemplars = [training_db[i] for i in top_k_indices]

    # 2. Build the prompt from exemplars with auto-generated CoT
    few_shot_prompt = ''
    for ex in exemplars:
        few_shot_prompt += f"Q: {ex['question']}\n"
        few_shot_prompt += f"Reasoning: {ex['auto_cot']}\n"
        few_shot_prompt += f"A: {ex['answer']}\n\n"

    # 3. Choice-shuffle ensemble: query with permuted answer orderings
    #    (extract_choices, shuffle_choices, parse_answer are helpers
    #    assumed to be defined elsewhere)
    choices = extract_choices(question)
    votes = []
    for _ in range(n_shuffles):
        shuffled_q = shuffle_choices(question, choices)
        prompt = few_shot_prompt + f"Q: {shuffled_q}\nReasoning:"
        response = llm.generate(prompt)
        votes.append(parse_answer(response, choices))

    # Majority vote across the shuffled runs
    return Counter(votes).most_common(1)[0][0]


def preprocess_training_db(training_data, llm, embedder):
    """---- PREPROCESSING ----: embed questions, auto-generate CoT."""
    db = []
    for item in training_data:
        embedding = embedder.embed(item['question'])
        cot_prompt = (f"Q: {item['question']}\n"
                      "Explain step by step, then give the answer.")
        auto_cot = llm.generate(cot_prompt)
        db.append({
            'question': item['question'],
            'answer': item['answer'],
            'embedding': embedding,
            'auto_cot': auto_cot,
        })
    return db
</code>
===== Component Contributions (Ablation) =====
^ Component ^ Avg. Accuracy Gain ^ Notes ^
| Dynamic Few-Shot | +0.8% | kNN retrieval from embedding space |
| Auto-Generated CoT | +3.4% | Largest single contributor |
| Choice-Shuffle Ensemble (5 runs) | +2.1% | Most consistent across benchmarks |
| **Full MedPrompt** | **+7.1%** | Over zero-shot baseline |
===== Results on Medical Benchmarks =====
MedPrompt with GPT-4 achieved state-of-the-art on all nine MultiMedQA benchmarks:
^ Benchmark ^ Zero-Shot GPT-4 ^ MedPrompt GPT-4 ^
| MedQA (USMLE) | ~83% | **90.2%** |
| PubMedQA | Baseline | Significant improvement |
| All MultiMedQA (avg) | Baseline | **+7.1%** |
These results surpassed fine-tuned specialist models such as Med-PaLM 2 without any model training; the later MedPrompt+ work also matched or exceeded Google's Gemini Ultra on MMLU.
===== Generalizability Beyond Medicine =====
MedPrompt's design is domain-agnostic. **MedPrompt+** (extended version) demonstrated:
* **MMLU**: 89.56% with a 20-run ensemble (state of the art at time of publication); 90%+ with a hybrid strategy mixing MedPrompt and simpler prompts
* **Non-medical domains** tested: electrical engineering, law, philosophy, accounting, psychology, machine learning
* Average improvement of **+7.3%** over zero-shot across non-medical domains --- nearly identical to medical gains (+7.1%)
This confirms that MedPrompt's components are general-purpose prompt engineering techniques, not medical-specific.
===== Mathematical Formulation =====
The ensemble decision rule for $M$ shuffled runs:
$$\hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{m=1}^{M} \mathbf{1}[f(q_{\pi_m}) = a]$$
where $\pi_m$ is the $m$-th random permutation of answer choices, $f(\cdot)$ is the model's predicted answer, and $\mathcal{A}$ is the set of possible answers.
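The decision rule reduces to a majority vote, which can be written directly in code; the per-run predictions below stand in for the model outputs $f(q_{\pi_m})$:

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over the M per-permutation predictions, i.e. the
    argmax over answers of the indicator sum in the formula above."""
    counts = Counter(predictions)        # sum over m of 1[f(q_pi_m) = a]
    return counts.most_common(1)[0][0]   # argmax over a

# Five shuffled runs: 'C' wins despite its position changing every run
print(ensemble_vote(['C', 'A', 'C', 'B', 'C']))  # → C
```

Note that ties are possible when $M$ is even or votes fragment; ''Counter.most_common'' breaks them by insertion order, and other tie-breaking policies (e.g. highest-confidence run) are equally valid.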
The two-phase pipeline as a Mermaid flowchart:
<code>
graph TB
    subgraph Preprocessing
        A[Training Questions] --> B[Embed with text-embedding model]
        A --> C[Auto-generate CoT with GPT-4]
        B --> D[Embedding Database]
        C --> D
    end
    subgraph Inference
        E[Test Question] --> F[Embed Query]
        F --> G[kNN Retrieval from DB]
        G --> H[Build Few-Shot Prompt with CoT]
        H --> I[Run with Shuffled Choices x5]
        I --> J[Majority Vote]
        J --> K[Final Answer]
    end
</code>
===== Limitations and Considerations =====
* **Compute cost**: Ensemble requires $M$ forward passes per question (typically 5x)
* **Embedding dependency**: Requires a good embedding model for kNN retrieval
* **Training data needed**: Dynamic few-shot requires a labeled training set for the embedding database (though no model training occurs)
* **Newer reasoning models**: On OpenAI o1-preview, aggressive few-shot prompting can //hurt// performance; simpler prompts work better with reasoning-native models
===== References =====
* [[https://arxiv.org/abs/2311.16452|Nori et al. "Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine" (2023). arXiv:2311.16452]]
* [[https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/|Microsoft Research Blog: "The Power of Prompting"]]
* [[https://www.microsoft.com/en-us/research/blog/steering-at-the-frontier-extending-the-power-of-prompting/|Microsoft Research Blog: "Steering at the Frontier" (MedPrompt+)]]
* [[https://github.com/microsoft/promptbase|Microsoft PromptBase (GitHub)]]
===== See Also =====
* [[chain_of_thought|Chain of Thought]]
* [[few_shot_prompting|Few-Shot Prompting]]
* [[self_consistency|Self-Consistency]]
* [[ensemble_methods|Ensemble Methods in LLMs]]