====== Clinical Diagnosis Agents: MACD ======

Multi-agent systems are entering clinical medicine, where specialized LLM agents collaborate on diagnostic tasks that traditionally require years of physician expertise. **MACD** (Multi-Agent Clinical Diagnosis, 2025) introduces a framework in which agents self-learn reusable clinical knowledge from historical patient cases and apply it to achieve diagnostic accuracy that matches or exceeds that of human physicians.

===== Architecture: Self-Learned Knowledge =====

MACD's core innovation is **Self-Learned Knowledge**: structured diagnostic knowledge that agents automatically extract, refine, and apply from historical case data. This mimics how physicians build expertise through clinical experience. The knowledge is stored as structured 5-tuples capturing clinical features, conditions, relevance scores, and diagnostic implications. A greedy algorithm with **maximal marginal relevance** selects diverse concepts while removing redundancies.

**Concept-Based Causal Intervention** assesses knowledge importance by ablating each concept and measuring the change in diagnostic accuracy:

$$\Delta Acc_c = Acc_{\text{with } c} - Acc_{\text{without } c}$$

Concepts with high $\Delta Acc_c$ are retained as high-impact knowledge, while low-impact or redundant concepts are pruned.

===== Three Specialized Agents =====

**Knowledge Summarizer Agent:** Extracts and structures diagnostic concepts from a sampling set of historical patient cases. It then refines the knowledge base through diversity selection and causal ablation to retain only high-impact knowledge.

**Diagnostician Agent:** Applies the Self-Learned Knowledge during inference. For each new patient case (history, exams, labs, radiology), it augments its prompt with relevant knowledge and produces a primary diagnosis with explicit rationales linking evidence to knowledge.
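The Knowledge Summarizer's two refinement steps (greedy maximal-marginal-relevance selection, then causal-intervention pruning by $\Delta Acc_c$) can be sketched as below. This is a minimal illustration, not the paper's implementation: the concept embeddings, relevance scores, and the ''mmr_select''/''causal_keep'' helper names are assumptions for the sketch.

<code python>
import numpy as np

def mmr_select(embeddings, relevance, k, lam=0.7):
    """Greedily pick k concepts by maximal marginal relevance.

    embeddings: (n, d) array of concept embeddings
    relevance:  (n,) array of per-concept relevance scores
    lam:        trade-off between relevance and diversity
    """
    # Cosine similarity between all concept pairs
    unit = embeddings / np.clip(
        np.linalg.norm(embeddings, axis=1, keepdims=True), 1e-12, None
    )
    sim = unit @ unit.T

    selected, remaining = [], list(range(embeddings.shape[0]))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Penalize similarity to already-selected concepts (redundancy)
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

def causal_keep(concepts, acc_with, acc_without, threshold=0.0):
    """Retain concepts whose ablation hurts accuracy (delta_acc > threshold)."""
    return [c for c, aw, ao in zip(concepts, acc_with, acc_without)
            if aw - ao > threshold]
</code>

With a near-duplicate pair of concepts, MMR keeps the more relevant one and then jumps to a dissimilar concept rather than the duplicate, which is exactly the redundancy-removal behavior described above.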
**Evaluator Agent:** Normalizes diagnostic terminology through tolerant name-matching and computes BioBERT semantic-similarity scores to assess consensus among multiple Diagnostician agents backed by diverse LLMs.

===== MACD-Human Collaborative Workflow =====

In the extended workflow, multiple Diagnostician agents (powered by different LLMs, each with its own knowledge base) engage in iterative consultations:

  - Each agent independently diagnoses the case
  - Agents exchange anonymized opinions
  - The Evaluator checks for consensus
  - Unresolved cases escalate to human physician oversight

This simulates real-world clinical team consultations in which multiple specialists review complex cases.

===== Code Example: Clinical Diagnosis Pipeline =====

<code python>
class MACDFramework:
    """Illustrative pipeline; the agent classes are defined elsewhere."""

    def __init__(self, summarizer_llm, diagnostician_llms, evaluator_llm,
                 ablation_threshold=0.0):
        self.summarizer = KnowledgeSummarizer(summarizer_llm)
        self.diagnosticians = [
            DiagnosticianAgent(llm) for llm in diagnostician_llms
        ]
        self.evaluator = EvaluatorAgent(evaluator_llm)
        # Minimum delta-accuracy for a concept to survive causal pruning
        self.threshold = ablation_threshold

    def build_knowledge_base(self, historical_cases, disease):
        # Extract raw concepts, then keep a diverse, high-impact subset
        raw_concepts = self.summarizer.extract_concepts(historical_cases)
        diverse_concepts = self.summarizer.select_diverse(
            raw_concepts, method="maximal_marginal_relevance"
        )
        refined_knowledge = []
        for concept in diverse_concepts:
            delta_acc = self.causal_ablation(concept, historical_cases)
            if delta_acc > self.threshold:
                refined_knowledge.append(concept)
        return refined_knowledge

    def diagnose(self, patient_case, knowledge_base):
        # Each diagnostician answers independently with the shared knowledge
        diagnoses = []
        for agent in self.diagnosticians:
            diagnosis = agent.diagnose(
                patient_case, knowledge=knowledge_base
            )
            diagnoses.append(diagnosis)
        consensus = self.evaluator.check_consensus(diagnoses)
        if consensus.agreement_score > 0.8:
            return consensus.primary_diagnosis
        # No consensus: hand the case to a human physician
        return self.escalate_to_human(patient_case, diagnoses)

    def causal_ablation(self, concept, cases):
        acc_with = self.evaluate_accuracy(cases, include=concept)
        acc_without = self.evaluate_accuracy(cases, exclude=concept)
        return acc_with - acc_without
</code>

===== Results =====

Evaluated on 4,390 real-world cases from the **MIMIC-MACD** dataset across seven diseases:

^ Metric ^ Result ^
| Primary diagnostic accuracy | Up to **22.3%** improvement over clinical guidelines (e.g., Mayo Clinic) |
| Average improvement from Self-Learned Knowledge | **11.6%** |
| MACD vs. human physicians | Llama-3.1 70B: 0.81 vs. human: 0.65 (p < 0.001) |
| MACD-Human workflow vs. physicians-only | **18.6%** improvement |
| Consensus rate (MACD-Human) | **58.6%** |
| Effective agent opinions | **88.5%** |

The self-learned knowledge transfers across models and provides traceable rationales for explainability.

===== Multi-Agent Diagnosis Diagram =====

<code>
flowchart TD
    A[Historical Patient Cases] --> B[Knowledge Summarizer Agent]
    B --> C[Raw Diagnostic Concepts]
    C --> D[Diversity Selection + Causal Ablation]
    D --> E[Refined Self-Learned Knowledge]
    F[New Patient Case] --> G[Diagnostician Agent 1]
    F --> H[Diagnostician Agent 2]
    F --> I[Diagnostician Agent 3]
    E --> G
    E --> H
    E --> I
    G --> J[Evaluator Agent]
    H --> J
    I --> J
    J --> K{Consensus?}
    K -->|Yes| L[Primary Diagnosis + Rationale]
    K -->|No| M[Escalate to Human Physician]
</code>

===== Clinical Significance =====

  * **Outperforms clinical guidelines:** Self-learned knowledge from case data is more specific and actionable than generic guidelines
  * **Exceeds human physicians:** On the MIMIC-MACD benchmark, MACD agents achieve accuracy 16 percentage points higher than physicians (0.81 vs. 0.65, p < 0.001)
  * **Explainable diagnostics:** Each diagnosis includes traceable rationales linking patient evidence to specific knowledge concepts
  * **Cross-model stability:** Self-learned knowledge transfers effectively across different LLM backbones
  * **Human-AI collaboration:** The MACD-Human workflow preserves physician oversight while leveraging agent capabilities

===== References =====

  * [[https://arxiv.org/abs/2509.20067|MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLMs (arXiv:2509.20067)]]

===== See Also =====
  * [[causal_reasoning_agents|Causal Reasoning Agents: Causal-Copilot]]
  * [[knowledge_graph_world_models|Knowledge Graph World Models: AriGraph]]
  * [[agent_resource_management|Agent Resource Management: AgentRM]]