Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
MedPrompt is a prompt engineering framework developed by Microsoft Research that combines dynamic few-shot selection, auto-generated chain-of-thought reasoning, and choice-shuffle ensembling to achieve state-of-the-art performance on medical benchmarks using general-purpose LLMs — without any fine-tuning. Originally demonstrated on medical QA, MedPrompt is fully generalizable and has set records on MMLU and other non-medical benchmarks.
The conventional approach to domain-specific AI performance involves expensive fine-tuning on specialized datasets (e.g., Med-PaLM 2 trained on medical corpora). MedPrompt demonstrates that a powerful generalist model (GPT-4) combined with systematic prompt engineering can outperform fine-tuned specialist models, challenging the assumption that domain expertise requires domain-specific training.
MedPrompt operates in two phases — preprocessing (building a retrieval database) and inference (dynamic prompting) — with three modular components:
Instead of using fixed exemplars, MedPrompt retrieves the most semantically similar examples for each query:
Each query is embedded (e.g., with OpenAI's text-embedding-ada-002), and the k most similar training examples are retrieved by cosine similarity in embedding space. Contribution: +0.8% accuracy improvement over random few-shot selection.
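A minimal sketch of the kNN selection step, assuming embeddings are plain NumPy vectors (the function name `top_k_exemplars` and the toy 2-D vectors are illustrative, not from the paper):

```python
import numpy as np

def top_k_exemplars(query_emb, db_embs, k=5):
    """Return indices of the k training examples most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity against every row
    return np.argsort(sims)[-k:][::-1]  # best match first

# Toy 2-D "embeddings": the query points along the x-axis.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
print(top_k_exemplars(query, db, k=2))  # → [0 2]
```

Normalizing once and taking a dot product is equivalent to cosine similarity and keeps the retrieval step a single vectorized operation.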
Rather than using human-written rationales, GPT-4 generates its own step-by-step reasoning for each few-shot exemplar during preprocessing:
Contribution: +3.4% accuracy improvement. Notably, auto-generated CoT outperforms expert-written rationales.
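A practical refinement during preprocessing is to keep only chains whose final answer matches the ground-truth label, so low-quality rationales never enter the exemplar pool. A minimal sketch, assuming a stub `llm` callable that returns a `(reasoning, answer)` tuple:

```python
def build_cot_db(training_data, llm):
    """Generate a CoT rationale per training item; discard label-inconsistent chains."""
    db = []
    for item in training_data:
        prompt = (f"Q: {item['question']}\n"
                  "Explain step by step, then give the answer.")
        reasoning, answer = llm(prompt)      # stub: (cot_text, final_answer)
        if answer == item['answer']:         # keep only label-consistent chains
            db.append({**item, 'auto_cot': reasoning})
    return db

# Stub LLM for illustration: answers "B" for everything.
stub_llm = lambda prompt: ("Step 1... therefore B.", "B")
data = [{'question': 'q1', 'answer': 'B'}, {'question': 'q2', 'answer': 'C'}]
print(len(build_cot_db(data, stub_llm)))  # → 1 (the 'C' item is filtered out)
```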
To reduce position bias in multiple-choice questions, MedPrompt runs the model multiple times with shuffled answer orderings and aggregates via majority vote:
Contribution: +2.1% accuracy improvement with consistent gains across all benchmarks.
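The shuffle-and-vote step can be sketched as follows. The key detail is voting over the choice *text*, not the letter, so votes remain comparable across orderings (`choice_shuffle_vote` and `stub_model` are illustrative names, not from the paper):

```python
import random
from collections import Counter

def choice_shuffle_vote(stem, choices, answer_fn, n_shuffles=5, seed=0):
    """Ask answer_fn the same question under n shuffled choice orderings,
    then majority-vote over the underlying choice text."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_shuffles):
        order = list(choices)
        rng.shuffle(order)
        letter = answer_fn(stem, order)              # e.g. returns 'A'
        votes.append(order[ord(letter) - ord('A')])  # map letter back to text
    return Counter(votes).most_common(1)[0][0]

# Stub model: always picks whichever position holds the "aspirin" choice,
# regardless of where shuffling put it.
def stub_model(stem, ordered_choices):
    idx = next(i for i, c in enumerate(ordered_choices) if 'aspirin' in c)
    return chr(ord('A') + idx)

print(choice_shuffle_vote("Best first-line drug?", ["aspirin", "x", "y"], stub_model))
# → aspirin
```

A model with genuine position bias would flip answers as orderings change; majority voting over shuffles averages that bias out.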
```python
# MedPrompt implementation sketch
import numpy as np
from collections import Counter

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def medprompt(question, training_db, llm, embedder, k=5, n_shuffles=5):
    # ---- INFERENCE ----
    # 1. Dynamic Few-Shot Selection (kNN)
    q_embedding = embedder.embed(question)
    similarities = [
        cosine_similarity(q_embedding, ex['embedding'])
        for ex in training_db
    ]
    top_k_indices = np.argsort(similarities)[-k:]
    exemplars = [training_db[i] for i in top_k_indices]

    # 2. Build prompt with auto-generated CoT exemplars
    few_shot_prompt = ''
    for ex in exemplars:
        few_shot_prompt += f"Q: {ex['question']}\n"
        few_shot_prompt += f"Reasoning: {ex['auto_cot']}\n"
        few_shot_prompt += f"A: {ex['answer']}\n\n"

    # 3. Choice-Shuffle Ensemble
    # (extract_choices, shuffle_choices, parse_answer are assumed helpers)
    choices = extract_choices(question)
    votes = []
    for _ in range(n_shuffles):
        shuffled_q = shuffle_choices(question, choices)
        prompt = few_shot_prompt + f"Q: {shuffled_q}\nReasoning:"
        response = llm.generate(prompt)
        answer = parse_answer(response, choices)
        votes.append(answer)

    # Majority vote over the ensemble
    final_answer = Counter(votes).most_common(1)[0][0]
    return final_answer

def preprocess_training_db(training_data, llm, embedder):
    # ---- PREPROCESSING ----
    # Embed each training question and auto-generate its CoT rationale.
    db = []
    for item in training_data:
        embedding = embedder.embed(item['question'])
        cot_prompt = (f"Q: {item['question']}\n"
                      "Explain step by step, then give the answer.")
        auto_cot = llm.generate(cot_prompt)
        db.append({
            'question': item['question'],
            'answer': item['answer'],
            'embedding': embedding,
            'auto_cot': auto_cot,
        })
    return db
```
| Component | Avg. Accuracy Gain | Notes |
|---|---|---|
| Dynamic Few-Shot | +0.8% | kNN retrieval from embedding space |
| Auto-Generated CoT | +3.4% | Largest single contributor |
| Choice-Shuffle Ensemble (5 runs) | +2.1% | Most consistent across benchmarks |
| Full MedPrompt | +7.1% | Over zero-shot baseline |
MedPrompt with GPT-4 achieved state-of-the-art on all nine MultiMedQA benchmarks:
| Benchmark | Zero-Shot GPT-4 | MedPrompt GPT-4 |
|---|---|---|
| MedQA (USMLE) | ~83% | 90.2% |
| PubMedQA | Baseline | Significant improvement |
| All MultiMedQA (avg) | Baseline | +7.1% |
This matched or exceeded Google's Gemini Ultra and surpassed fine-tuned models like Med-PaLM 2, all without any model training.
MedPrompt's design is domain-agnostic. MedPrompt+, an extended version, demonstrated state-of-the-art results on MMLU and other non-medical benchmarks.
This confirms that MedPrompt's components are general-purpose prompt engineering techniques, not medical-specific.
The ensemble decision rule for $M$ shuffled runs:
$$\hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{m=1}^{M} \mathbf{1}[f(q_{\pi_m}) = a]$$
where $\pi_m$ is the $m$-th random permutation of answer choices, $f(\cdot)$ is the model's predicted answer, and $\mathcal{A}$ is the set of possible answers.
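For instance, with $M = 5$ shuffled runs returning answers A, B, A, A, C, the indicator sums are

$$\sum_{m=1}^{5} \mathbf{1}[f(q_{\pi_m}) = \text{A}] = 3, \qquad \text{B}: 1, \qquad \text{C}: 1 \;\Rightarrow\; \hat{a} = \text{A},$$

so the ensemble commits to A even though two of the five shuffled runs disagreed.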