====== MedPrompt ======
**MedPrompt** is a prompt engineering framework developed by Microsoft Research that combines dynamic few-shot selection, auto-generated chain-of-thought reasoning, and choice-shuffle ensembling to achieve state-of-the-art performance on medical benchmarks using general-purpose LLMs --- without any fine-tuning. Originally demonstrated on medical QA, MedPrompt is fully generalizable and has set records on MMLU and other non-medical benchmarks.
===== Motivation =====
The conventional approach to domain-specific AI performance involves expensive fine-tuning on specialized datasets (e.g., Med-PaLM 2 trained on medical corpora). MedPrompt demonstrates that a powerful generalist model (GPT-4) combined with systematic prompt engineering can //outperform// fine-tuned specialist models, challenging the assumption that domain expertise requires domain-specific training.
===== Architecture: Three Components =====
MedPrompt operates in two phases --- **preprocessing** (building a retrieval database) and **inference** (dynamic prompting) --- with three modular components:
=== 1. Dynamic Few-Shot Selection (kNN-based ICL) ===
Instead of using fixed exemplars, MedPrompt retrieves the most semantically similar examples for each query:
* Embed all training questions using a text embedding model (e.g., ''text-embedding-ada-002'')
* At inference, embed the test question and retrieve $k$ nearest neighbors via cosine similarity
* Use these as few-shot exemplars, ensuring maximum relevance to the specific question
**Contribution**: +0.8% accuracy improvement over random few-shot selection.
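The selection step above can be sketched with plain NumPy. The embedding model itself is assumed to exist elsewhere; here exemplars are picked by cosine similarity over precomputed vectors (''select_exemplars'' is an illustrative name, not from the paper):

```python
import numpy as np

def select_exemplars(query_vec, train_vecs, k=5):
    """Return indices of the k training questions most similar to the query.

    query_vec:  (d,) embedding of the test question
    train_vecs: (n, d) matrix of precomputed training-question embeddings
    """
    # Normalize rows so a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = t @ q                        # (n,) cosine similarities
    return np.argsort(sims)[::-1][:k]   # indices, most similar first

# Toy example with 2-D "embeddings": rows 0 and 2 point roughly the same
# way as the query, row 1 is orthogonal
train = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.05])
print(select_exemplars(query, train, k=2))  # → [0 2]
```

The retrieved rows would then be formatted as few-shot exemplars, most relevant first.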
=== 2. Auto-Generated Chain-of-Thought ===
Rather than using human-written rationales, GPT-4 generates its own step-by-step reasoning for each few-shot exemplar during preprocessing:
* For each training example, prompt GPT-4 to produce a detailed CoT explanation
* Store these auto-generated rationales alongside the questions and answers
* At inference, retrieved exemplars include the model's own reasoning style
**Contribution**: +3.4% accuracy improvement. Notably, auto-generated CoT //outperforms// expert-written rationales.
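The paper additionally filters self-generated rationales with a label check: a chain of thought is kept only if the model's own final answer matches the ground-truth label, discarding reasoning that arrives at the wrong answer. A minimal sketch of that filter, where the answer parser and the LLM call are simplified stand-ins:

```python
import re

def parse_final_answer(cot_text):
    """Naive parser (assumption): take the last 'Answer: X' in the text."""
    matches = re.findall(r"Answer:\s*([A-D])", cot_text)
    return matches[-1] if matches else None

def generate_verified_cot(item, generate, max_attempts=3):
    """Auto-generate a CoT; keep it only if it reaches the gold answer.

    `generate(prompt) -> str` is any LLM call (assumed interface).
    """
    prompt = (f"Q: {item['question']}\n"
              "Explain step by step, then write 'Answer: <letter>'.")
    for _ in range(max_attempts):
        cot = generate(prompt)
        if parse_final_answer(cot) == item['answer']:
            return cot   # reasoning is consistent with the label: store it
    return None          # discard: self-generated answer was wrong

# Toy stand-in for an LLM call
def fake_llm(prompt):
    return "The aorta carries oxygenated blood. Answer: B"

item = {'question': 'Which vessel carries oxygenated blood?', 'answer': 'B'}
cot = generate_verified_cot(item, fake_llm)
```

This self-verification is part of why auto-generated CoT can beat expert rationales: only reasoning the model can actually follow to the correct answer survives preprocessing.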
=== 3. Choice-Shuffle Ensemble ===
To reduce position bias in multiple-choice questions, MedPrompt runs the model multiple times with shuffled answer orderings and aggregates via majority vote:
* Randomly permute answer choices across $M$ runs (typically $M = 5$)
* Each run produces an answer; take the majority vote
* This corrects for the model's tendency to favor certain positions (e.g., option A)
**Contribution**: +2.1% accuracy improvement with consistent gains across all benchmarks.
<code python>
# MedPrompt implementation sketch
import numpy as np
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def medprompt(question, training_db, llm, embedder, k=5, n_shuffles=5):
    # ---- INFERENCE ----
    # 1. Dynamic few-shot selection (kNN over question embeddings)
    q_embedding = embedder.embed(question)
    similarities = [
        cosine_similarity(q_embedding, ex['embedding'])
        for ex in training_db
    ]
    top_k_indices = np.argsort(similarities)[-k:]
    exemplars = [training_db[i] for i in top_k_indices]

    # 2. Build the prompt from exemplars with auto-generated CoT
    few_shot_prompt = ''
    for ex in exemplars:
        few_shot_prompt += f"Q: {ex['question']}\n"
        few_shot_prompt += f"Reasoning: {ex['auto_cot']}\n"
        few_shot_prompt += f"A: {ex['answer']}\n\n"

    # 3. Choice-shuffle ensemble: query with permuted answer orderings
    #    (extract_choices, shuffle_choices, parse_answer are helpers
    #    assumed to be defined elsewhere)
    choices = extract_choices(question)
    votes = []
    for _ in range(n_shuffles):
        shuffled_q = shuffle_choices(question, choices)
        prompt = few_shot_prompt + f"Q: {shuffled_q}\nReasoning:"
        response = llm.generate(prompt)
        votes.append(parse_answer(response, choices))

    # Majority vote across the shuffled runs
    return Counter(votes).most_common(1)[0][0]


def preprocess_training_db(training_data, llm, embedder):
    """---- PREPROCESSING ----: embed questions, auto-generate CoT."""
    db = []
    for item in training_data:
        embedding = embedder.embed(item['question'])
        cot_prompt = (f"Q: {item['question']}\n"
                      "Explain step by step, then give the answer.")
        auto_cot = llm.generate(cot_prompt)
        db.append({
            'question': item['question'],
            'answer': item['answer'],
            'embedding': embedding,
            'auto_cot': auto_cot,
        })
    return db
</code>
===== Component Contributions (Ablation) =====
^ Component ^ Avg. Accuracy Gain ^ Notes ^
| Dynamic Few-Shot | +0.8% | kNN retrieval from embedding space |
| Auto-Generated CoT | +3.4% | Largest single contributor |
| Choice-Shuffle Ensemble (5 runs) | +2.1% | Most consistent across benchmarks |
| **Full MedPrompt** | **+7.1%** | Over zero-shot baseline |
===== Results on Medical Benchmarks =====
MedPrompt with GPT-4 achieved state-of-the-art on all nine MultiMedQA benchmarks:
^ Benchmark ^ Zero-Shot GPT-4 ^ MedPrompt GPT-4 ^
| MedQA (USMLE) | ~83% | **90.2%** |
| PubMedQA | Baseline | Significant improvement |
| All MultiMedQA (avg) | Baseline | **+7.1%** |
These results surpassed fine-tuned specialist models such as Med-PaLM 2 without any model training; the later MedPrompt+ work also matched or exceeded Google's Gemini Ultra on MMLU.
===== Generalizability Beyond Medicine =====
MedPrompt's design is domain-agnostic. **MedPrompt+** (extended version) demonstrated:
* **MMLU**: 89.56% with a 20-run ensemble (state of the art at time of publication); 90%+ with a hybrid strategy mixing MedPrompt and simpler prompts
* **Non-medical domains** tested: electrical engineering, law, philosophy, accounting, psychology, machine learning
* Average improvement of **+7.3%** over zero-shot across non-medical domains --- nearly identical to medical gains (+7.1%)
This confirms that MedPrompt's components are general-purpose prompt engineering techniques, not medical-specific.
===== Mathematical Formulation =====
The ensemble decision rule for $M$ shuffled runs:
$$\hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{m=1}^{M} \mathbf{1}[f(q_{\pi_m}) = a]$$
where $\pi_m$ is the $m$-th random permutation of answer choices, $f(\cdot)$ is the model's predicted answer, and $\mathcal{A}$ is the set of possible answers.
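The decision rule reduces to a majority vote, which can be written directly in code; the per-run predictions below stand in for the model outputs $f(q_{\pi_m})$:

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over the M per-permutation predictions, i.e. the
    argmax over answers of the indicator sum in the formula above."""
    counts = Counter(predictions)        # sum over m of 1[f(q_pi_m) = a]
    return counts.most_common(1)[0][0]   # argmax over a

# Five shuffled runs: 'C' wins despite its position changing every run
print(ensemble_vote(['C', 'A', 'C', 'B', 'C']))  # → C
```

Note that ties are possible when $M$ is even or votes fragment; ''Counter.most_common'' breaks them by insertion order, and other tie-breaking policies (e.g. highest-confidence run) are equally valid.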
The two-phase pipeline as a Mermaid flowchart:
<code>
graph TB
    subgraph Preprocessing
        A[Training Questions] --> B[Embed with text-embedding model]
        A --> C[Auto-generate CoT with GPT-4]
        B --> D[Embedding Database]
        C --> D
    end
    subgraph Inference
        E[Test Question] --> F[Embed Query]
        F --> G[kNN Retrieval from DB]
        G --> H[Build Few-Shot Prompt with CoT]
        H --> I[Run with Shuffled Choices x5]
        I --> J[Majority Vote]
        J --> K[Final Answer]
    end
</code>
===== Limitations and Considerations =====
* **Compute cost**: Ensemble requires $M$ forward passes per question (typically 5x)
* **Embedding dependency**: Requires a good embedding model for kNN retrieval
* **Training data needed**: Dynamic few-shot requires a labeled training set for the embedding database (though no model training occurs)
* **Newer reasoning models**: On OpenAI o1-preview, aggressive few-shot prompting can //hurt// performance; simpler prompts work better with reasoning-native models
===== References =====
* [[https://arxiv.org/abs/2311.16452|Nori et al. "Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine" (2023). arXiv:2311.16452]]
* [[https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/|Microsoft Research Blog: "The Power of Prompting"]]
* [[https://www.microsoft.com/en-us/research/blog/steering-at-the-frontier-extending-the-power-of-prompting/|Microsoft Research Blog: "Steering at the Frontier" (MedPrompt+)]]
* [[https://github.com/microsoft/promptbase|Microsoft PromptBase (GitHub)]]
===== See Also =====
* [[chain_of_thought|Chain of Thought]]
* [[few_shot_prompting|Few-Shot Prompting]]
* [[self_consistency|Self-Consistency]]
* [[ensemble_methods|Ensemble Methods in LLMs]]