MedPrompt

MedPrompt is a prompt engineering framework developed by Microsoft Research that combines dynamic few-shot selection, auto-generated chain-of-thought reasoning, and choice-shuffle ensembling to achieve state-of-the-art performance on medical benchmarks using general-purpose LLMs — without any fine-tuning. Originally demonstrated on medical QA, MedPrompt is fully generalizable and has set records on MMLU and other non-medical benchmarks.

Motivation

The conventional approach to domain-specific AI performance involves expensive fine-tuning on specialized datasets (e.g., Med-PaLM 2 trained on medical corpora). MedPrompt demonstrates that a powerful generalist model (GPT-4) combined with systematic prompt engineering can outperform fine-tuned specialist models, challenging the assumption that domain expertise requires domain-specific training.

Architecture: Three Components

MedPrompt operates in two phases — preprocessing (building a retrieval database) and inference (dynamic prompting) — with three modular components:

1. Dynamic Few-Shot Selection (kNN-based ICL)

Instead of using a fixed set of exemplars, MedPrompt embeds each test question and retrieves the k most semantically similar training examples to serve as few-shot exemplars.

Contribution: +0.8% accuracy improvement over random few-shot selection.
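The retrieval step can be sketched with plain NumPy, assuming each training example is stored alongside a precomputed embedding vector (the database layout and function names here are illustrative, not a fixed API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_exemplars(query_embedding, training_db, k=5):
    """Return the k training examples most similar to the query.

    `training_db` is a list of dicts, each holding a precomputed
    'embedding' vector (illustrative layout).
    """
    sims = np.array([cosine_similarity(query_embedding, ex["embedding"])
                     for ex in training_db])
    top_k = np.argsort(sims)[-k:][::-1]  # indices of the k highest similarities, best first
    return [training_db[i] for i in top_k]
```

In practice an approximate-nearest-neighbor index (e.g. FAISS) would replace the linear scan for large training sets, but the decision rule is the same.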

2. Auto-Generated Chain-of-Thought

Rather than using human-written rationales, GPT-4 generates its own step-by-step reasoning for each few-shot exemplar during preprocessing.

Contribution: +3.4% accuracy improvement. Notably, auto-generated CoT outperforms expert-written rationales.
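A key detail is that a generated rationale is kept only if the model's own final answer agrees with the ground-truth label, which filters out chains of reasoning that arrive at the wrong conclusion. A minimal sketch of that verification step (the `llm.generate` interface and the prompt wording are illustrative assumptions):

```python
def generate_verified_cot(item, llm):
    """Ask the model for step-by-step reasoning; keep the rationale only
    if its final answer matches the known label.

    `item` is a dict with 'question' and 'answer' keys; `llm.generate`
    is an illustrative text-completion interface.
    """
    prompt = (f"Q: {item['question']}\n"
              "Explain step by step, then state the final answer "
              "on its own line as 'Answer: <choice>'.")
    response = llm.generate(prompt)
    # Pull out whatever follows the last 'Answer:' marker, if any
    predicted = response.rsplit("Answer:", 1)[-1].strip() if "Answer:" in response else ""
    if predicted == item["answer"]:
        return response  # rationale passes self-verification
    return None          # discard: the reasoning led to the wrong label
```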

3. Choice-Shuffle Ensemble

To reduce position bias in multiple-choice questions, MedPrompt runs the model multiple times with shuffled answer orderings and aggregates the results via majority vote.

Contribution: +2.1% accuracy improvement with consistent gains across all benchmarks.

# MedPrompt implementation sketch

import numpy as np
from collections import Counter


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def medprompt(question, training_db, llm, embedder, k=5, n_shuffles=5):
    # ---- INFERENCE ----
    # `extract_choices`, `shuffle_choices`, and `parse_answer` are
    # dataset-format-specific helpers, assumed to be defined elsewhere.

    # 1. Dynamic Few-Shot Selection (kNN)
    q_embedding = embedder.embed(question)
    similarities = [
        cosine_similarity(q_embedding, ex['embedding'])
        for ex in training_db
    ]
    # Ascending similarity: the most similar exemplar lands last,
    # closest to the query in the prompt
    top_k_indices = np.argsort(similarities)[-k:]
    exemplars = [training_db[i] for i in top_k_indices]

    # 2. Build prompt with auto-generated CoT exemplars
    few_shot_prompt = ''
    for ex in exemplars:
        few_shot_prompt += f"Q: {ex['question']}\n"
        few_shot_prompt += f"Reasoning: {ex['auto_cot']}\n"
        few_shot_prompt += f"A: {ex['answer']}\n\n"

    # 3. Choice-Shuffle Ensemble
    choices = extract_choices(question)
    votes = []
    for _ in range(n_shuffles):
        shuffled_q = shuffle_choices(question, choices)
        prompt = few_shot_prompt + f"Q: {shuffled_q}\nReasoning:"
        response = llm.generate(prompt)
        answer = parse_answer(response, choices)
        votes.append(answer)

    # Majority vote over the shuffled runs
    final_answer = Counter(votes).most_common(1)[0][0]
    return final_answer


def preprocess_training_db(training_data, llm, embedder):
    # ---- PREPROCESSING ----
    # Embed each training question and auto-generate a CoT rationale for it
    db = []
    for item in training_data:
        embedding = embedder.embed(item['question'])
        cot_prompt = f"Q: {item['question']}\nExplain step by step, then give the answer."
        auto_cot = llm.generate(cot_prompt)
        db.append({
            'question': item['question'],
            'answer': item['answer'],
            'embedding': embedding,
            'auto_cot': auto_cot
        })
    return db

Component Contributions (Ablation)

| Component | Avg. Accuracy Gain | Notes |
|---|---|---|
| Dynamic Few-Shot | +0.8% | kNN retrieval from embedding space |
| Auto-Generated CoT | +3.4% | Largest single contributor |
| Choice-Shuffle Ensemble (5 runs) | +2.1% | Most consistent across benchmarks |
| Full MedPrompt | +7.1% | Over zero-shot baseline |

Results on Medical Benchmarks

MedPrompt with GPT-4 achieved state-of-the-art on all nine MultiMedQA benchmarks:

| Benchmark | Zero-Shot GPT-4 | MedPrompt GPT-4 |
|---|---|---|
| MedQA (USMLE) | ~83% | 90.2% |
| PubMedQA | Baseline | Significant improvement |
| All MultiMedQA (avg) | Baseline | +7.1% |

This matched or exceeded Google's Gemini Ultra and surpassed fine-tuned models like Med-PaLM 2, all without any model training.

Generalizability Beyond Medicine

MedPrompt's design is domain-agnostic. MedPrompt+, an extended version, set record scores on MMLU and improved results on other non-medical competition benchmarks.

This confirms that MedPrompt's components are general-purpose prompt engineering techniques, not medical-specific.

Mathematical Formulation

The ensemble decision rule for $M$ shuffled runs:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{m=1}^{M} \mathbf{1}[f(q_{\pi_m}) = a]$$

where $\pi_m$ is the $m$-th random permutation of answer choices, $f(\cdot)$ is the model's predicted answer, and $\mathcal{A}$ is the set of possible answers.
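Concretely, the decision rule is a majority vote over the per-permutation predictions, which `Counter` computes directly; for example, with $M = 5$ runs:

```python
from collections import Counter

# Hypothetical per-run predictions f(q_{pi_m}) from M = 5 shuffled runs
votes = ["B", "B", "C", "B", "A"]

# argmax over answers of the indicator sum = the most common vote
a_hat, count = Counter(votes).most_common(1)[0]
print(a_hat, count)  # "B" wins with 3 of 5 votes
```

Note that the argmax rule does not specify a tie-break; `Counter.most_common` resolves ties by first-insertion order.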

graph TB
    subgraph Preprocessing
        A[Training Questions] --> B[Embed with text-embedding model]
        A --> C[Auto-generate CoT with GPT-4]
        B --> D[Embedding Database]
        C --> D
    end
    subgraph Inference
        E[Test Question] --> F[Embed Query]
        F --> G[kNN Retrieval from DB]
        G --> H[Build Few-Shot Prompt with CoT]
        H --> I[Run with Shuffled Choices x5]
        I --> J[Majority Vote]
        J --> K[Final Answer]
    end

Limitations and Considerations

References

See Also