====== Clinical Diagnosis Agents: MACD ======
Multi-agent systems are entering clinical medicine, where specialized LLM agents collaborate on diagnostic tasks that traditionally require years of physician expertise. **MACD** (Multi-Agent Clinical Diagnosis, 2025) introduces a framework where agents self-learn reusable clinical knowledge from historical patient cases and apply it to achieve diagnostic accuracy that matches or exceeds human physicians.
===== Architecture: Self-Learned Knowledge =====
MACD's core innovation is **Self-Learned Knowledge** -- structured diagnostic knowledge that agents automatically extract, refine, and apply from historical case data. This mimics how physicians build expertise through clinical experience.
The knowledge is stored as structured 5-tuples capturing clinical features, conditions, relevance scores, and diagnostic implications. A greedy algorithm with **maximal marginal relevance** selects diverse concepts while removing redundancies.
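The greedy MMR-style selection can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the concept embedding vectors, per-concept relevance scores, and the `lam` trade-off weight are all assumptions introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(concepts, vectors, relevance, k, lam=0.7):
    """Greedy maximal marginal relevance: at each step pick the concept that
    is relevant but least redundant with the concepts already selected."""
    selected, remaining = [], list(range(len(concepts)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [concepts[i] for i in selected]
```

With a near-duplicate pair like "fever" and "high temperature", the redundancy penalty causes the second pick to skip the duplicate in favor of a distinct concept such as "cough".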
**Concept-Based Causal Intervention** assesses knowledge importance by ablating each concept and measuring the change in diagnostic accuracy:
$$\Delta Acc_c = Acc_{\text{with } c} - Acc_{\text{without } c}$$
Concepts with high $\Delta Acc$ are retained as high-impact knowledge, while low-impact or redundant concepts are pruned.
===== Three Specialized Agents =====
**Knowledge Summarizer Agent:** Extracts and structures diagnostic concepts from a sampling set of historical patient cases. It then refines the knowledge base through diversity selection and causal ablation to retain only high-impact knowledge.
**Diagnostician Agent:** Applies the Self-Learned Knowledge during inference. For each new patient case (history, exams, labs, radiology), it augments its prompt with relevant knowledge and produces a primary diagnosis with explicit rationales linking evidence to knowledge.
**Evaluator Agent:** Normalizes diagnostic terminology through tolerant name-matching and computes BioBERT semantic similarity scores to assess consensus among multiple diagnostician agents using diverse LLMs.
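The Evaluator's consensus check can be sketched as below. This is a simplified stand-in: string similarity via `difflib` replaces BioBERT embedding similarity, and the stop-word list, similarity threshold, and clustering rule are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

def normalize_dx(name):
    """Tolerant name normalization: lowercase, strip punctuation and
    common qualifiers so near-synonymous diagnosis names align."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    stop = {"acute", "chronic", "suspected", "probable"}
    return " ".join(t for t in name.split() if t not in stop)

def check_consensus(diagnoses, sim_threshold=0.85):
    """Cluster normalized diagnoses by similarity (stand-in for BioBERT
    semantic similarity) and return the majority label with its agreement
    score (fraction of agents in the largest cluster)."""
    norm = [normalize_dx(d) for d in diagnoses]
    clusters = []
    for d in norm:
        for cluster in clusters:
            if SequenceMatcher(None, d, cluster[0]).ratio() >= sim_threshold:
                cluster.append(d)
                break
        else:
            clusters.append([d])
    largest = max(clusters, key=len)
    return largest[0], len(largest) / len(norm)
```

For example, "Acute myocardial infarction" and "Myocardial infarction" normalize to the same cluster, yielding an agreement score of 2/3 against a dissenting "Pneumonia".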
===== MACD-Human Collaborative Workflow =====
In the extended workflow, multiple Diagnostician agents (powered by different LLMs, each with their own knowledge base) engage in iterative consultations:
  * Each agent independently diagnoses the case
  * Agents exchange anonymized opinions
  * The Evaluator checks for consensus
  * Unresolved cases escalate to human physician oversight
This simulates real-world clinical team consultations where multiple specialists review complex cases.
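The consultation loop above can be sketched as follows. The agent and evaluator interfaces assumed here -- `diagnose(case, peer_opinions)` and `agreement(diagnoses)` returning a label and score -- are illustrative, not the paper's API.

```python
def consult(case, agents, evaluator, max_rounds=3, threshold=0.8):
    """Iterative multi-agent consultation: agents revise their diagnoses
    after seeing anonymized peer opinions; unresolved cases escalate to a
    human physician."""
    # Round 0: independent diagnoses, no peer information
    opinions = [a.diagnose(case, peer_opinions=None) for a in agents]
    for _ in range(max_rounds):
        label, score = evaluator.agreement(opinions)
        if score >= threshold:
            return {"diagnosis": label, "escalated": False}
        # Exchange anonymized peer opinions and let each agent revise
        opinions = [
            a.diagnose(
                case,
                peer_opinions=[o for j, o in enumerate(opinions) if j != i],
            )
            for i, a in enumerate(agents)
        ]
    # No consensus within the round budget: defer to physician oversight
    return {"diagnosis": None, "escalated": True, "opinions": opinions}
```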
===== Code Example: Clinical Diagnosis Pipeline =====
<code python>
class MACDFramework:
    def __init__(self, summarizer_llm, diagnostician_llms, evaluator_llm,
                 ablation_threshold=0.0, consensus_threshold=0.8):
        self.summarizer = KnowledgeSummarizer(summarizer_llm)
        self.diagnosticians = [
            DiagnosticianAgent(llm) for llm in diagnostician_llms
        ]
        self.evaluator = EvaluatorAgent(evaluator_llm)
        # Minimum delta-accuracy for a concept to survive causal ablation
        self.threshold = ablation_threshold
        self.consensus_threshold = consensus_threshold

    def build_knowledge_base(self, historical_cases, disease):
        # Extract raw concepts, then keep a diverse, high-impact subset
        raw_concepts = self.summarizer.extract_concepts(historical_cases)
        diverse_concepts = self.summarizer.select_diverse(
            raw_concepts, method="maximal_marginal_relevance"
        )
        refined_knowledge = []
        for concept in diverse_concepts:
            delta_acc = self.causal_ablation(concept, historical_cases)
            if delta_acc > self.threshold:
                refined_knowledge.append(concept)
        return refined_knowledge

    def diagnose(self, patient_case, knowledge_base):
        # Each diagnostician produces an independent,
        # knowledge-augmented diagnosis
        diagnoses = [
            agent.diagnose(patient_case, knowledge=knowledge_base)
            for agent in self.diagnosticians
        ]
        consensus = self.evaluator.check_consensus(diagnoses)
        if consensus.agreement_score > self.consensus_threshold:
            return consensus.primary_diagnosis
        return self.escalate_to_human(patient_case, diagnoses)

    def causal_ablation(self, concept, cases):
        # Delta-accuracy: diagnose cases with vs. without the concept
        acc_with = self.evaluate_accuracy(cases, include=concept)
        acc_without = self.evaluate_accuracy(cases, exclude=concept)
        return acc_with - acc_without
</code>
===== Results =====
Evaluated on 4,390 real-world cases from the **MIMIC-MACD** dataset across seven diseases:
^ Metric ^ Result ^
| Primary diagnostic accuracy vs. clinical guidelines (e.g., Mayo Clinic) | Up to **+22.3%** |
| Average improvement from Self-Learned Knowledge | **+11.6%** |
| MACD (Llama-3.1 70B) vs. human physicians | **0.81 vs. 0.65** (p < 0.001) |
| MACD-Human workflow vs. physicians only | **+18.6%** |
| Consensus rate (MACD-Human) | **58.6%** |
| Effective agent opinions | **88.5%** |
The self-learned knowledge transfers across models and provides traceable rationales for explainability.
===== Multi-Agent Diagnosis Diagram =====
<code>
flowchart TD
    A[Historical Patient Cases] --> B[Knowledge Summarizer Agent]
    B --> C[Raw Diagnostic Concepts]
    C --> D[Diversity Selection + Causal Ablation]
    D --> E[Refined Self-Learned Knowledge]
    F[New Patient Case] --> G[Diagnostician Agent 1]
    F --> H[Diagnostician Agent 2]
    F --> I[Diagnostician Agent 3]
    E --> G
    E --> H
    E --> I
    G --> J[Evaluator Agent]
    H --> J
    I --> J
    J --> K{Consensus?}
    K -->|Yes| L[Primary Diagnosis + Rationale]
    K -->|No| M[Escalate to Human Physician]
</code>
===== Clinical Significance =====
* **Outperforms clinical guidelines:** Self-learned knowledge from case data is more specific and actionable than generic guidelines
* **Exceeds human physicians:** On the MIMIC-MACD benchmark, MACD agents achieve accuracy 16 percentage points higher than physicians (0.81 vs. 0.65, p < 0.001)
* **Explainable diagnostics:** Each diagnosis includes traceable rationales linking patient evidence to specific knowledge concepts
* **Cross-model stability:** Self-learned knowledge transfers effectively across different LLM backbones
* **Human-AI collaboration:** The MACD-Human workflow preserves physician oversight while leveraging agent capabilities
===== References =====
* [[https://arxiv.org/abs/2509.20067|MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLMs (arXiv:2509.20067)]]
===== See Also =====
* [[causal_reasoning_agents|Causal Reasoning Agents: Causal-Copilot]]
* [[knowledge_graph_world_models|Knowledge Graph World Models: AriGraph]]
* [[agent_resource_management|Agent Resource Management: AgentRM]]