Multi-agent systems are entering clinical medicine, where specialized LLM agents collaborate on diagnostic tasks that traditionally require years of physician expertise. MACD (Multi-Agent Clinical Diagnosis, 2025) introduces a framework where agents self-learn reusable clinical knowledge from historical patient cases and apply it to achieve diagnostic accuracy that matches or exceeds human physicians.
MACD's core innovation is Self-Learned Knowledge – structured diagnostic knowledge that agents automatically extract, refine, and apply from historical case data. This mimics how physicians build expertise through clinical experience.
The knowledge is stored as structured 5-tuples capturing clinical features, conditions, relevance scores, and diagnostic implications. A greedy algorithm with maximal marginal relevance selects diverse concepts while removing redundancies.
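The greedy diversity-selection step can be sketched as classic maximal marginal relevance: at each step, pick the concept that balances its own relevance against its similarity to concepts already selected. This is a minimal illustration, not the paper's implementation; the λ weight, toy embeddings, and function names are assumptions.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(concepts, k, lam=0.7):
    """Greedy maximal marginal relevance over (relevance, embedding) pairs.

    Returns the indices of up to k concepts, trading off relevance
    against redundancy with already-selected concepts.
    """
    selected, remaining = [], list(range(len(concepts)))
    while remaining and len(selected) < k:
        def score(i):
            rel, emb = concepts[i]
            redundancy = max(
                (cosine(emb, concepts[j][1]) for j in selected), default=0.0
            )
            return lam * rel - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Two near-duplicate concepts and one diverse one: MMR keeps one duplicate
# and the diverse concept rather than both duplicates.
concepts = [(0.90, [1.0, 0.0]), (0.89, [1.0, 0.0]), (0.50, [0.0, 1.0])]
picked = mmr_select(concepts, k=2)
```

Here the second near-duplicate is skipped in favor of the lower-relevance but non-redundant concept, which is the behavior the redundancy-removal step relies on.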
Concept-Based Causal Intervention assesses knowledge importance by ablating each concept and measuring the change in diagnostic accuracy:
$$\Delta Acc_c = Acc_{\text{with } c} - Acc_{\text{without } c}$$
Concepts with high $\Delta Acc$ are retained as high-impact knowledge, while low-impact or redundant concepts are pruned.
- **Knowledge Summarizer Agent:** Extracts and structures diagnostic concepts from a sampling set of historical patient cases, then refines the knowledge base through diversity selection and causal ablation to retain only high-impact knowledge.
- **Diagnostician Agent:** Applies the Self-Learned Knowledge during inference. For each new patient case (history, exams, labs, radiology), it augments its prompt with relevant knowledge and produces a primary diagnosis with explicit rationales linking evidence to knowledge.
- **Evaluator Agent:** Normalizes diagnostic terminology through tolerant name-matching and computes BioBERT semantic similarity scores to assess consensus among multiple diagnostician agents backed by diverse LLMs.
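The consensus check can be sketched as name normalization plus mean pairwise embedding similarity. This is an illustrative stand-in: the paper uses BioBERT embeddings, while here the embeddings are assumed to be precomputed vectors, and the function names are hypothetical.

```python
from math import sqrt

def normalize(name: str) -> str:
    # Tolerant name matching, heavily simplified: case-fold and
    # collapse whitespace so "Myocardial  Infarction" == "myocardial infarction".
    return " ".join(name.lower().split())

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def agreement_score(embeddings):
    """Mean pairwise cosine similarity across diagnostician outputs.

    Scores near 1.0 indicate the agents' diagnoses are semantically
    equivalent; low scores would trigger escalation.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

norm = normalize("  Myocardial   Infarction ")
score = agreement_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
```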
In the extended workflow, multiple Diagnostician agents (powered by different LLMs, each with their own knowledge base) engage in iterative consultations, exchanging diagnoses across rounds until they reach consensus or the case is escalated.
This simulates real-world clinical team consultations where multiple specialists review complex cases.
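The consultation loop can be sketched as agents that see their peers' opinions each round and may revise their own. This is a hypothetical skeleton of the iterative workflow, not the paper's code; the agent interface (a callable taking the case and peer opinions) is an assumption.

```python
def consult(case, agents, max_rounds=3):
    """Iterative multi-agent consultation (illustrative sketch).

    Each agent is a callable: agent(case, peer_opinions) -> diagnosis string.
    Agents first diagnose independently, then revise after seeing peers'
    opinions, until unanimity or the round budget is exhausted.
    """
    # Round 0: independent diagnoses, no peer information.
    opinions = [agent(case, []) for agent in agents]
    for _ in range(max_rounds):
        if len(set(opinions)) == 1:
            return opinions[0], True        # consensus reached
        # Revision round: every agent sees the full set of peer opinions.
        opinions = [agent(case, opinions) for agent in agents]
    return opinions, False                  # no consensus: escalate

# Stub agents: one holds firm, one defers to the first peer opinion it sees.
stubborn = lambda case, peers: "pneumonia"
flexible = lambda case, peers: peers[0] if peers else "influenza"
diagnosis, reached = consult({"symptoms": "cough, fever"}, [stubborn, flexible])
```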
```python
class MACDFramework:
    def __init__(self, summarizer_llm, diagnostician_llms, evaluator_llm,
                 ablation_threshold=0.0):
        self.summarizer = KnowledgeSummarizer(summarizer_llm)
        self.diagnosticians = [
            DiagnosticianAgent(llm) for llm in diagnostician_llms
        ]
        self.evaluator = EvaluatorAgent(evaluator_llm)
        self.threshold = ablation_threshold  # minimum ΔAcc to keep a concept

    def build_knowledge_base(self, historical_cases, disease):
        # Extract raw concepts, then keep a diverse, high-impact subset.
        raw_concepts = self.summarizer.extract_concepts(historical_cases)
        diverse_concepts = self.summarizer.select_diverse(
            raw_concepts, method="maximal_marginal_relevance"
        )
        refined_knowledge = []
        for concept in diverse_concepts:
            delta_acc = self.causal_ablation(concept, historical_cases)
            if delta_acc > self.threshold:
                refined_knowledge.append(concept)
        return refined_knowledge

    def diagnose(self, patient_case, knowledge_base):
        # Each diagnostician produces an independent,
        # knowledge-augmented diagnosis.
        diagnoses = [
            agent.diagnose(patient_case, knowledge=knowledge_base)
            for agent in self.diagnosticians
        ]
        consensus = self.evaluator.check_consensus(diagnoses)
        if consensus.agreement_score > 0.8:
            return consensus.primary_diagnosis
        # Low agreement: hand the case off to a human physician.
        return self.escalate_to_human(patient_case, diagnoses)

    def causal_ablation(self, concept, cases):
        # ΔAcc_c = Acc_with_c − Acc_without_c
        acc_with = self.evaluate_accuracy(cases, include=concept)
        acc_without = self.evaluate_accuracy(cases, exclude=concept)
        return acc_with - acc_without
```
Evaluated on 4,390 real-world cases from the MIMIC-MACD dataset across seven diseases:
| Metric | Result |
|---|---|
| Primary diagnostic accuracy gain over clinical guidelines (e.g., Mayo Clinic) | Up to 22.3% |
| Avg improvement from Self-Learned Knowledge | 11.6% |
| MACD vs Human Physicians | Llama-3.1 70B: 0.81 vs Human: 0.65 (p < 0.001) |
| MACD-Human Workflow vs Physicians-only | 18.6% improvement |
| Consensus Rate (MACD-Human) | 58.6% |
| Effective Agent Opinions | 88.5% |
The self-learned knowledge transfers across models and provides traceable rationales for explainability.