====== MedPrompt ======

**MedPrompt** is a prompt engineering framework developed by Microsoft Research that combines dynamic few-shot selection, auto-generated chain-of-thought reasoning, and choice-shuffle ensembling to achieve state-of-the-art performance on medical benchmarks using general-purpose LLMs, without any fine-tuning. Originally demonstrated on medical QA, MedPrompt is fully generalizable and has set records on MMLU and other non-medical benchmarks.

===== Motivation =====

The conventional approach to domain-specific AI performance involves expensive fine-tuning on specialized datasets (e.g., Med-PaLM 2 trained on medical corpora). MedPrompt demonstrates that a powerful generalist model (GPT-4) combined with systematic prompt engineering can //outperform// fine-tuned specialist models, challenging the assumption that domain expertise requires domain-specific training.

===== Architecture: Three Components =====

MedPrompt operates in two phases, **preprocessing** (building a retrieval database) and **inference** (dynamic prompting), with three modular components:

=== 1. Dynamic Few-Shot Selection (kNN-based ICL) ===

Instead of using fixed exemplars, MedPrompt retrieves the most semantically similar examples for each query:

  * Embed all training questions using a text embedding model (e.g., ''text-embedding-ada-002'')
  * At inference, embed the test question and retrieve the $k$ nearest neighbors via cosine similarity
  * Use these as few-shot exemplars, ensuring maximum relevance to the specific question

**Contribution**: +0.8% accuracy improvement over random few-shot selection.

=== 2. Auto-Generated Chain-of-Thought ===

Rather than using human-written rationales, GPT-4 generates its own step-by-step reasoning for each few-shot exemplar during preprocessing:

  * For each training example, prompt GPT-4 to produce a detailed CoT explanation
  * Store these auto-generated rationales alongside the questions and answers
  * At inference, retrieved exemplars include the model's own reasoning style

**Contribution**: +3.4% accuracy improvement. Notably, auto-generated CoT //outperforms// expert-written rationales.

=== 3. Choice-Shuffle Ensemble ===

To reduce position bias in multiple-choice questions, MedPrompt runs the model multiple times with shuffled answer orderings and aggregates via majority vote:

  * Randomly permute the answer choices across $M$ runs (typically $M = 5$)
  * Each run produces an answer; take the majority vote
  * This corrects for the model's tendency to favor certain positions (e.g., option A)

**Contribution**: +2.1% accuracy improvement with consistent gains across all benchmarks.

<code python>
# MedPrompt implementation sketch. The llm, embedder, extract_choices,
# shuffle_choices, and parse_answer helpers are assumed to be provided.
import numpy as np
from collections import Counter


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def medprompt(question, training_db, llm, embedder, k=5, n_shuffles=5):
    # ---- INFERENCE ----
    # 1. Dynamic few-shot selection (kNN over precomputed embeddings)
    q_embedding = embedder.embed(question)
    similarities = [
        cosine_similarity(q_embedding, ex['embedding']) for ex in training_db
    ]
    top_k_indices = np.argsort(similarities)[-k:]
    exemplars = [training_db[i] for i in top_k_indices]

    # 2. Build the prompt from auto-generated CoT exemplars
    few_shot_prompt = ''
    for ex in exemplars:
        few_shot_prompt += f"Q: {ex['question']}\n"
        few_shot_prompt += f"Reasoning: {ex['auto_cot']}\n"
        few_shot_prompt += f"A: {ex['answer']}\n\n"

    # 3. Choice-shuffle ensemble: vote over shuffled answer orderings
    choices = extract_choices(question)
    votes = []
    for _ in range(n_shuffles):
        shuffled_q = shuffle_choices(question, choices)
        prompt = few_shot_prompt + f"Q: {shuffled_q}\nReasoning:"
        response = llm.generate(prompt)
        votes.append(parse_answer(response, choices))

    # Majority vote across the ensemble
    return Counter(votes).most_common(1)[0][0]


def preprocess_training_db(training_data, llm, embedder):
    # ---- PREPROCESSING ----
    db = []
    for item in training_data:
        embedding = embedder.embed(item['question'])
        cot_prompt = (f"Q: {item['question']}\n"
                      "Explain step by step, then give the answer.")
        auto_cot = llm.generate(cot_prompt)
        db.append({
            'question': item['question'],
            'answer': item['answer'],
            'embedding': embedding,
            'auto_cot': auto_cot,
        })
    return db
</code>

===== Component Contributions (Ablation) =====

^ Component ^ Avg. Accuracy Gain ^ Notes ^
| Dynamic Few-Shot | +0.8% | kNN retrieval from embedding space |
| Auto-Generated CoT | +3.4% | Largest single contributor |
| Choice-Shuffle Ensemble (5 runs) | +2.1% | Most consistent across benchmarks |
| **Full MedPrompt** | **+7.1%** | Over zero-shot baseline |

===== Results on Medical Benchmarks =====

MedPrompt with GPT-4 achieved state-of-the-art results on all nine MultiMedQA benchmarks:

^ Benchmark ^ Zero-Shot GPT-4 ^ MedPrompt GPT-4 ^
| MedQA (USMLE) | ~83% | **90.2%** |
| PubMedQA | Baseline | Significant improvement |
| All MultiMedQA (avg) | Baseline | **+7.1%** |

This matched or exceeded Google's Gemini Ultra and surpassed fine-tuned models such as Med-PaLM 2, all without any model training.

===== Generalizability Beyond Medicine =====

MedPrompt's design is domain-agnostic.
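As a minimal illustration of that domain generality, the choice-shuffle vote can be written as a task-agnostic helper. This is a sketch under stated assumptions: ''choice_shuffle_vote'', ''ask'', and ''biased_ask'' are hypothetical names, and the ''ask'' callable stands in for any LLM call that returns the text of its chosen option.

```python
import random
from collections import Counter

def choice_shuffle_vote(stem, choices, ask, n_shuffles=5, seed=0):
    """Present the question n_shuffles times with permuted options and
    majority-vote over the returned choice texts (position-invariant)."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_shuffles):
        order = list(choices)
        rng.shuffle(order)
        prompt = stem + "\n" + "\n".join(
            f"{chr(65 + i)}. {c}" for i, c in enumerate(order))
        votes.append(ask(prompt, order))   # model returns a choice *text*
    return Counter(votes).most_common(1)[0][0]

# Toy position-biased "model": defaults to option A unless the right
# answer happens to appear in the first two slots.
def biased_ask(prompt, order):
    return "Paris" if "Paris" in order[:2] else order[0]

answer = choice_shuffle_vote(
    "Capital of France?", ["Lyon", "Paris", "Nice", "Lille"], biased_ask)
```

Because the vote is taken over choice //texts// rather than option letters, the aggregation step is indifferent to where each answer landed in any given shuffle, which is exactly what cancels positional bias.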
**MedPrompt+** (extended version) demonstrated:

  * **MMLU**: 89.56% with 20 ensembles (a new SOTA at time of publication); 90%+ with a hybrid strategy
  * **Non-medical domains tested**: electrical engineering, law, philosophy, accounting, psychology, machine learning
  * An average improvement of **+7.3%** over zero-shot across non-medical domains, nearly identical to the medical gains (+7.1%)

This confirms that MedPrompt's components are general-purpose prompt engineering techniques, not medical-specific.

===== Mathematical Formulation =====

The ensemble decision rule for $M$ shuffled runs:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{m=1}^{M} \mathbf{1}[f(q_{\pi_m}) = a]$$

where $\pi_m$ is the $m$-th random permutation of the answer choices, $f(\cdot)$ returns the model's predicted answer, and $\mathcal{A}$ is the set of possible answers.

The full two-phase pipeline (Mermaid notation):

<code>
graph TB
  subgraph Preprocessing
    A[Training Questions] --> B[Embed with text-embedding model]
    A --> C[Auto-generate CoT with GPT-4]
    B --> D[Embedding Database]
    C --> D
  end
  subgraph Inference
    E[Test Question] --> F[Embed Query]
    F --> G[kNN Retrieval from DB]
    G --> H[Build Few-Shot Prompt with CoT]
    H --> I[Run with Shuffled Choices x5]
    I --> J[Majority Vote]
    J --> K[Final Answer]
  end
</code>

===== Limitations and Considerations =====

  * **Compute cost**: the ensemble requires $M$ forward passes per question (typically 5x)
  * **Embedding dependency**: requires a good embedding model for kNN retrieval
  * **Training data needed**: dynamic few-shot selection requires a labeled training set for the embedding database (though no model training occurs)
  * **Newer reasoning models**: on OpenAI o1-preview, aggressive few-shot prompting can //hurt// performance; simpler prompts work better with reasoning-native models

===== References =====

  * [[https://arxiv.org/abs/2311.16452|Nori et al., "Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine" (2023), arXiv:2311.16452]]
  * [[https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/|Microsoft Research Blog: "The Power of Prompting"]]
  * [[https://www.microsoft.com/en-us/research/blog/steering-at-the-frontier-extending-the-power-of-prompting/|Microsoft Research Blog: "Steering at the Frontier" (MedPrompt+)]]
  * [[https://github.com/microsoft/promptbase|Microsoft PromptBase (GitHub)]]

===== See Also =====

  * [[chain_of_thought|Chain of Thought]]
  * [[few_shot_prompting|Few-Shot Prompting]]
  * [[self_consistency|Self-Consistency]]
  * [[ensemble_methods|Ensemble Methods in LLMs]]