MedPrompt

MedPrompt is a prompt engineering framework developed by Microsoft Research that combines dynamic few-shot selection, auto-generated chain-of-thought reasoning, and choice-shuffle ensembling to achieve state-of-the-art performance on medical benchmarks using general-purpose LLMs — without any fine-tuning. Originally demonstrated on medical QA, MedPrompt is fully generalizable and has set records on MMLU and other non-medical benchmarks.

Motivation

The conventional approach to domain-specific AI performance involves expensive fine-tuning on specialized datasets (e.g., Med-PaLM 2 trained on medical corpora). MedPrompt demonstrates that a powerful generalist model (GPT-4) combined with systematic prompt engineering can outperform fine-tuned specialist models, challenging the assumption that domain expertise requires domain-specific training.

Architecture: Three Components

MedPrompt operates in two phases — preprocessing (building a retrieval database) and inference (dynamic prompting) — with three modular components:

1. Dynamic Few-Shot Selection (kNN-based ICL)

Instead of using a fixed set of exemplars, MedPrompt embeds each test question and retrieves the k most semantically similar training examples to serve as few-shot exemplars.

Contribution: +0.8% accuracy improvement over random few-shot selection.
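The retrieval step can be sketched with plain NumPy, assuming each training example is stored alongside a precomputed embedding vector (the database layout and function names here are illustrative, not a fixed API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_exemplars(query_embedding, training_db, k=5):
    """Return the k training examples most similar to the query.

    `training_db` is a list of dicts, each holding a precomputed
    'embedding' vector (illustrative layout).
    """
    sims = np.array([cosine_similarity(query_embedding, ex["embedding"])
                     for ex in training_db])
    top_k = np.argsort(sims)[-k:][::-1]  # indices of the k highest similarities, best first
    return [training_db[i] for i in top_k]
```

In practice an approximate-nearest-neighbor index (e.g. FAISS) would replace the linear scan for large training sets, but the decision rule is the same.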

2. Auto-Generated Chain-of-Thought

Rather than using human-written rationales, GPT-4 generates its own step-by-step reasoning for each few-shot exemplar during preprocessing.

Contribution: +3.4% accuracy improvement. Notably, auto-generated CoT outperforms expert-written rationales.
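A key detail is that a generated rationale is kept only if the model's own final answer agrees with the ground-truth label, which filters out chains of reasoning that arrive at the wrong conclusion. A minimal sketch of that verification step (the `llm.generate` interface and the prompt wording are illustrative assumptions):

```python
def generate_verified_cot(item, llm):
    """Ask the model for step-by-step reasoning; keep the rationale only
    if its final answer matches the known label.

    `item` is a dict with 'question' and 'answer' keys; `llm.generate`
    is an illustrative text-completion interface.
    """
    prompt = (f"Q: {item['question']}\n"
              "Explain step by step, then state the final answer "
              "on its own line as 'Answer: <choice>'.")
    response = llm.generate(prompt)
    # Pull out whatever follows the last 'Answer:' marker, if any
    predicted = response.rsplit("Answer:", 1)[-1].strip() if "Answer:" in response else ""
    if predicted == item["answer"]:
        return response  # rationale passes self-verification
    return None          # discard: the reasoning led to the wrong label
```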

3. Choice-Shuffle Ensemble

To reduce position bias in multiple-choice questions, MedPrompt runs the model multiple times with shuffled answer orderings and aggregates the results via majority vote.

Contribution: +2.1% accuracy improvement with consistent gains across all benchmarks.

# MedPrompt implementation sketch

import numpy as np
from collections import Counter


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def medprompt(question, training_db, llm, embedder, k=5, n_shuffles=5):
    # ---- INFERENCE ----
    # `extract_choices`, `shuffle_choices`, and `parse_answer` are
    # dataset-format-specific helpers, assumed to be defined elsewhere.

    # 1. Dynamic Few-Shot Selection (kNN)
    q_embedding = embedder.embed(question)
    similarities = [
        cosine_similarity(q_embedding, ex['embedding'])
        for ex in training_db
    ]
    # Ascending similarity: the most similar exemplar lands last,
    # closest to the query in the prompt
    top_k_indices = np.argsort(similarities)[-k:]
    exemplars = [training_db[i] for i in top_k_indices]

    # 2. Build prompt with auto-generated CoT exemplars
    few_shot_prompt = ''
    for ex in exemplars:
        few_shot_prompt += f"Q: {ex['question']}\n"
        few_shot_prompt += f"Reasoning: {ex['auto_cot']}\n"
        few_shot_prompt += f"A: {ex['answer']}\n\n"

    # 3. Choice-Shuffle Ensemble
    choices = extract_choices(question)
    votes = []
    for _ in range(n_shuffles):
        shuffled_q = shuffle_choices(question, choices)
        prompt = few_shot_prompt + f"Q: {shuffled_q}\nReasoning:"
        response = llm.generate(prompt)
        answer = parse_answer(response, choices)
        votes.append(answer)

    # Majority vote over the shuffled runs
    final_answer = Counter(votes).most_common(1)[0][0]
    return final_answer


def preprocess_training_db(training_data, llm, embedder):
    # ---- PREPROCESSING ----
    # Embed each training question and auto-generate a CoT rationale for it
    db = []
    for item in training_data:
        embedding = embedder.embed(item['question'])
        cot_prompt = f"Q: {item['question']}\nExplain step by step, then give the answer."
        auto_cot = llm.generate(cot_prompt)
        db.append({
            'question': item['question'],
            'answer': item['answer'],
            'embedding': embedding,
            'auto_cot': auto_cot
        })
    return db

Component Contributions (Ablation)

| Component | Avg. Accuracy Gain | Notes |
|---|---|---|
| Dynamic Few-Shot | +0.8% | kNN retrieval from embedding space |
| Auto-Generated CoT | +3.4% | Largest single contributor |
| Choice-Shuffle Ensemble (5 runs) | +2.1% | Most consistent across benchmarks |
| Full MedPrompt | +7.1% | Over zero-shot baseline |

Results on Medical Benchmarks

MedPrompt with GPT-4 achieved state-of-the-art on all nine MultiMedQA benchmarks:

| Benchmark | Zero-Shot GPT-4 | MedPrompt GPT-4 |
|---|---|---|
| MedQA (USMLE) | ~83% | 90.2% |
| PubMedQA | Baseline | Significant improvement |
| All MultiMedQA (avg) | Baseline | +7.1% |

This matched or exceeded Google's Gemini Ultra and surpassed fine-tuned models like Med-PaLM 2, all without any model training.

Generalizability Beyond Medicine

MedPrompt's design is domain-agnostic. MedPrompt+, an extended version, set record scores on MMLU and improved results on other non-medical competition benchmarks.

This confirms that MedPrompt's components are general-purpose prompt engineering techniques, not medical-specific.

Mathematical Formulation

The ensemble decision rule for $M$ shuffled runs:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{m=1}^{M} \mathbf{1}[f(q_{\pi_m}) = a]$$

where $\pi_m$ is the $m$-th random permutation of answer choices, $f(\cdot)$ is the model's predicted answer, and $\mathcal{A}$ is the set of possible answers.
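Concretely, the decision rule is a majority vote over the per-permutation predictions, which `Counter` computes directly; for example, with $M = 5$ runs:

```python
from collections import Counter

# Hypothetical per-run predictions f(q_{pi_m}) from M = 5 shuffled runs
votes = ["B", "B", "C", "B", "A"]

# argmax over answers of the indicator sum = the most common vote
a_hat, count = Counter(votes).most_common(1)[0]
print(a_hat, count)  # "B" wins with 3 of 5 votes
```

Note that the argmax rule does not specify a tie-break; `Counter.most_common` resolves ties by first-insertion order.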

graph TB
    subgraph Preprocessing
        A[Training Questions] --> B[Embed with text-embedding model]
        A --> C[Auto-generate CoT with GPT-4]
        B --> D[Embedding Database]
        C --> D
    end
    subgraph Inference
        E[Test Question] --> F[Embed Query]
        F --> G[kNN Retrieval from DB]
        G --> H[Build Few-Shot Prompt with CoT]
        H --> I[Run with Shuffled Choices x5]
        I --> J[Majority Vote]
        J --> K[Final Answer]
    end

Limitations and Considerations

References

See Also