Clinical Diagnosis Agents: MACD

Multi-agent systems are entering clinical medicine, where specialized LLM agents collaborate on diagnostic tasks that traditionally require years of physician expertise. MACD (Multi-Agent Clinical Diagnosis, 2025) introduces a framework where agents self-learn reusable clinical knowledge from historical patient cases and apply it to achieve diagnostic accuracy that matches or exceeds human physicians.

Architecture: Self-Learned Knowledge

MACD's core innovation is Self-Learned Knowledge – structured diagnostic knowledge that agents automatically extract, refine, and apply from historical case data. This mimics how physicians build expertise through clinical experience.

The knowledge is stored as structured 5-tuples whose fields capture clinical features, associated conditions, relevance scores, and diagnostic implications. A greedy algorithm using maximal marginal relevance (MMR) then selects a diverse subset of concepts while removing redundant ones.
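The greedy MMR selection step can be sketched as follows. This is a minimal illustration, not MACD's actual implementation: the `relevance` scores and `similarity` function stand in for whatever scoring the Knowledge Summarizer produces.

```python
def mmr_select(concepts, relevance, similarity, k, lam=0.7):
    """Greedy maximal-marginal-relevance selection over concepts.

    relevance:  dict mapping concept -> relevance score (assumed given)
    similarity: callable (a, b) -> similarity in [0, 1] (assumed given)
    lam:        trade-off between relevance and diversity
    """
    selected = []
    candidates = list(concepts)
    while candidates and len(selected) < k:
        def mmr_score(c):
            # Penalize concepts similar to anything already selected.
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The `lam` parameter controls how aggressively near-duplicate concepts are penalized; at `lam=1.0` the selection degenerates to a pure top-k by relevance.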

Concept-Based Causal Intervention assesses knowledge importance by ablating each concept and measuring the change in diagnostic accuracy:

$$\Delta Acc_c = Acc_{\text{with } c} - Acc_{\text{without } c}$$

Concepts with high $\Delta Acc$ are retained as high-impact knowledge, while low-impact or redundant concepts are pruned.
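The ablation above is straightforward to express in code. In this sketch, `diagnose(case, knowledge)` is an assumed stand-in for the Diagnostician agent's inference call, and each case is assumed to carry a gold-standard label.

```python
def delta_accuracy(concept, knowledge, cases, diagnose):
    """ΔAcc_c for one concept: accuracy with the full knowledge base
    minus accuracy with the concept ablated."""
    def accuracy(kb):
        correct = sum(diagnose(case, kb) == case["gold"] for case in cases)
        return correct / len(cases)

    ablated = [k for k in knowledge if k != concept]
    return accuracy(knowledge) - accuracy(ablated)
```

A positive ΔAcc means the concept carries diagnostic signal and is retained; a ΔAcc at or below the pruning threshold means the concept is redundant or noise.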

Three Specialized Agents

Knowledge Summarizer Agent: Extracts and structures diagnostic concepts from a sampling set of historical patient cases. It then refines the knowledge base through diversity selection and causal ablation to retain only high-impact knowledge.

Diagnostician Agent: Applies the Self-Learned Knowledge during inference. For each new patient case (history, exams, labs, radiology), it augments its prompt with relevant knowledge and produces a primary diagnosis with explicit rationales linking evidence to knowledge.

Evaluator Agent: Normalizes diagnostic terminology through tolerant name-matching and computes BioBERT semantic similarity scores to assess consensus among multiple diagnostician agents using diverse LLMs.
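The Evaluator's two steps — tolerant normalization, then pairwise semantic similarity — can be sketched as below. A simple string-ratio similarity stands in for BioBERT embeddings here; the real Evaluator would embed each diagnosis and compare cosine scores.

```python
from difflib import SequenceMatcher


def normalize(term):
    """Tolerant name matching: lowercase, unify hyphens and whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def consensus(diagnoses, threshold=0.8):
    """Mean pairwise similarity among agent diagnoses.

    Returns (score, agreed). SequenceMatcher is a stand-in for
    BioBERT semantic similarity.
    """
    terms = [normalize(d) for d in diagnoses]
    pairs = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(terms)
        for b in terms[i + 1:]
    ]
    score = sum(pairs) / len(pairs)
    return score, score >= threshold
```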

MACD-Human Collaborative Workflow

In the extended workflow, multiple Diagnostician agents (powered by different LLMs, each with their own knowledge base) engage in iterative consultations:

  1. Each agent independently diagnoses the case
  2. Agents exchange anonymized opinions
  3. The Evaluator checks for consensus
  4. Unresolved cases escalate to human physician oversight

This simulates real-world clinical team consultations where multiple specialists review complex cases.
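The consultation loop above can be sketched as follows; the `diagnose`, `revise`, and `check_consensus` interfaces are assumptions for illustration, not MACD's actual API.

```python
from dataclasses import dataclass


@dataclass
class ConsensusResult:
    agreed: bool
    primary_diagnosis: str = ""


def escalate_to_human(case, opinions):
    """Placeholder for handing an unresolved case to a physician."""
    return ("needs_physician", opinions)


def consult(agents, case, evaluator, max_rounds=3):
    # Round 1: each agent diagnoses independently.
    opinions = [agent.diagnose(case) for agent in agents]
    for _ in range(max_rounds):
        result = evaluator.check_consensus(opinions)
        if result.agreed:
            return result.primary_diagnosis
        # No consensus: exchange anonymized opinions and revise.
        opinions = [agent.revise(case, opinions) for agent in agents]
    return escalate_to_human(case, opinions)
```

Bounding the loop with `max_rounds` guarantees that disagreement eventually reaches a human rather than cycling indefinitely.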

Code Example: Clinical Diagnosis Pipeline

class MACDFramework:
    def __init__(self, summarizer_llm, diagnostician_llms, evaluator_llm,
                 ablation_threshold=0.0, consensus_threshold=0.8):
        self.summarizer = KnowledgeSummarizer(summarizer_llm)
        self.diagnosticians = [
            DiagnosticianAgent(llm) for llm in diagnostician_llms
        ]
        self.evaluator = EvaluatorAgent(evaluator_llm)
        # Minimum ΔAcc for a concept to be kept as high-impact knowledge.
        self.threshold = ablation_threshold
        self.consensus_threshold = consensus_threshold
 
    def build_knowledge_base(self, historical_cases, disease):
        # Extract raw concepts, select a diverse subset, then prune
        # concepts whose ablation does not reduce diagnostic accuracy.
        raw_concepts = self.summarizer.extract_concepts(historical_cases)
        diverse_concepts = self.summarizer.select_diverse(
            raw_concepts, method="maximal_marginal_relevance"
        )
        refined_knowledge = []
        for concept in diverse_concepts:
            delta_acc = self.causal_ablation(concept, historical_cases)
            if delta_acc > self.threshold:
                refined_knowledge.append(concept)
        return refined_knowledge
 
    def diagnose(self, patient_case, knowledge_base):
        # Each diagnostician (a different LLM) produces an independent
        # diagnosis augmented with the Self-Learned Knowledge.
        diagnoses = [
            agent.diagnose(patient_case, knowledge=knowledge_base)
            for agent in self.diagnosticians
        ]
        consensus = self.evaluator.check_consensus(diagnoses)
        if consensus.agreement_score > self.consensus_threshold:
            return consensus.primary_diagnosis
        # Unresolved cases escalate to human physician oversight.
        return self.escalate_to_human(patient_case, diagnoses)
 
    def causal_ablation(self, concept, cases):
        # ΔAcc_c = Acc_with_c - Acc_without_c
        acc_with = self.evaluate_accuracy(cases, include=concept)
        acc_without = self.evaluate_accuracy(cases, exclude=concept)
        return acc_with - acc_without

Results

Evaluated on 4,390 real-world cases from the MIMIC-MACD dataset across seven diseases:

| Metric | Result |
| --- | --- |
| Primary diagnostic accuracy | Up to 22.3% improvement over clinical-guideline baselines (e.g., Mayo Clinic) |
| Avg. improvement from Self-Learned Knowledge | 11.6% |
| MACD vs. human physicians | Llama-3.1 70B: 0.81 vs. human: 0.65 (p < 0.001) |
| MACD-Human workflow vs. physicians only | 18.6% improvement |
| Consensus rate (MACD-Human) | 58.6% |
| Effective agent opinions | 88.5% |
The self-learned knowledge transfers across models and provides traceable rationales for explainability.

Multi-Agent Diagnosis Diagram

flowchart TD
    A[Historical Patient Cases] --> B[Knowledge Summarizer Agent]
    B --> C[Raw Diagnostic Concepts]
    C --> D[Diversity Selection + Causal Ablation]
    D --> E[Refined Self-Learned Knowledge]
    F[New Patient Case] --> G[Diagnostician Agent 1]
    F --> H[Diagnostician Agent 2]
    F --> I[Diagnostician Agent 3]
    E --> G
    E --> H
    E --> I
    G --> J[Evaluator Agent]
    H --> J
    I --> J
    J --> K{Consensus?}
    K -->|Yes| L[Primary Diagnosis + Rationale]
    K -->|No| M[Escalate to Human Physician]
