Multi-agent systems are entering clinical medicine, where specialized LLM agents collaborate on diagnostic tasks that traditionally require years of physician expertise. MACD (Multi-Agent Clinical Diagnosis, 2025) introduces a framework where agents self-learn reusable clinical knowledge from historical patient cases and apply it to achieve diagnostic accuracy that matches or exceeds human physicians.
MACD's core innovation is Self-Learned Knowledge – structured diagnostic knowledge that agents automatically extract, refine, and apply from historical case data. This mimics how physicians build expertise through clinical experience.
The knowledge is stored as structured 5-tuples capturing clinical features, conditions, relevance scores, and diagnostic implications. A greedy algorithm with maximal marginal relevance selects diverse concepts while removing redundancies.
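The greedy diversity-selection step can be sketched as classic maximal marginal relevance: at each step, pick the concept that balances its own relevance against its similarity to concepts already selected. This is a minimal illustration, not the paper's implementation; the λ weight, toy embeddings, and function names are assumptions.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(concepts, k, lam=0.7):
    """Greedy maximal marginal relevance over (relevance, embedding) pairs.

    Returns the indices of up to k concepts, trading off relevance
    against redundancy with already-selected concepts.
    """
    selected, remaining = [], list(range(len(concepts)))
    while remaining and len(selected) < k:
        def score(i):
            rel, emb = concepts[i]
            redundancy = max(
                (cosine(emb, concepts[j][1]) for j in selected), default=0.0
            )
            return lam * rel - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Two near-duplicate concepts and one diverse one: MMR keeps one duplicate
# and the diverse concept rather than both duplicates.
concepts = [(0.90, [1.0, 0.0]), (0.89, [1.0, 0.0]), (0.50, [0.0, 1.0])]
picked = mmr_select(concepts, k=2)
```

Here the second near-duplicate is skipped in favor of the lower-relevance but non-redundant concept, which is the behavior the redundancy-removal step relies on.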
Concept-Based Causal Intervention assesses knowledge importance by ablating each concept and measuring the change in diagnostic accuracy:
$$\Delta Acc_c = Acc_{\text{with } c} - Acc_{\text{without } c}$$
Concepts with high $\Delta Acc$ are retained as high-impact knowledge, while low-impact or redundant concepts are pruned.
- **Knowledge Summarizer Agent:** Extracts and structures diagnostic concepts from a sampling set of historical patient cases, then refines the knowledge base through diversity selection and causal ablation to retain only high-impact knowledge.
- **Diagnostician Agent:** Applies the Self-Learned Knowledge during inference. For each new patient case (history, exams, labs, radiology), it augments its prompt with relevant knowledge and produces a primary diagnosis with explicit rationales linking evidence to knowledge.
- **Evaluator Agent:** Normalizes diagnostic terminology through tolerant name-matching and computes BioBERT semantic similarity scores to assess consensus among multiple diagnostician agents backed by diverse LLMs.
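The consensus check can be sketched as name normalization plus mean pairwise embedding similarity. This is an illustrative stand-in: the paper uses BioBERT embeddings, while here the embeddings are assumed to be precomputed vectors, and the function names are hypothetical.

```python
from math import sqrt

def normalize(name: str) -> str:
    # Tolerant name matching, heavily simplified: case-fold and
    # collapse whitespace so "Myocardial  Infarction" == "myocardial infarction".
    return " ".join(name.lower().split())

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def agreement_score(embeddings):
    """Mean pairwise cosine similarity across diagnostician outputs.

    Scores near 1.0 indicate the agents' diagnoses are semantically
    equivalent; low scores would trigger escalation.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

norm = normalize("  Myocardial   Infarction ")
score = agreement_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
```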
In the extended workflow, multiple Diagnostician agents (powered by different LLMs, each with their own knowledge base) engage in iterative consultations, exchanging diagnoses across rounds until they reach consensus or the case is escalated.
This simulates real-world clinical team consultations where multiple specialists review complex cases.
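The consultation loop can be sketched as agents that see their peers' opinions each round and may revise their own. This is a hypothetical skeleton of the iterative workflow, not the paper's code; the agent interface (a callable taking the case and peer opinions) is an assumption.

```python
def consult(case, agents, max_rounds=3):
    """Iterative multi-agent consultation (illustrative sketch).

    Each agent is a callable: agent(case, peer_opinions) -> diagnosis string.
    Agents first diagnose independently, then revise after seeing peers'
    opinions, until unanimity or the round budget is exhausted.
    """
    # Round 0: independent diagnoses, no peer information.
    opinions = [agent(case, []) for agent in agents]
    for _ in range(max_rounds):
        if len(set(opinions)) == 1:
            return opinions[0], True        # consensus reached
        # Revision round: every agent sees the full set of peer opinions.
        opinions = [agent(case, opinions) for agent in agents]
    return opinions, False                  # no consensus: escalate

# Stub agents: one holds firm, one defers to the first peer opinion it sees.
stubborn = lambda case, peers: "pneumonia"
flexible = lambda case, peers: peers[0] if peers else "influenza"
diagnosis, reached = consult({"symptoms": "cough, fever"}, [stubborn, flexible])
```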
```python
class MACDFramework:
    def __init__(self, summarizer_llm, diagnostician_llms, evaluator_llm,
                 ablation_threshold=0.0):
        self.summarizer = KnowledgeSummarizer(summarizer_llm)
        self.diagnosticians = [
            DiagnosticianAgent(llm) for llm in diagnostician_llms
        ]
        self.evaluator = EvaluatorAgent(evaluator_llm)
        self.threshold = ablation_threshold  # minimum ΔAcc to keep a concept

    def build_knowledge_base(self, historical_cases, disease):
        # Extract raw concepts, then keep a diverse, high-impact subset.
        raw_concepts = self.summarizer.extract_concepts(historical_cases)
        diverse_concepts = self.summarizer.select_diverse(
            raw_concepts, method="maximal_marginal_relevance"
        )
        refined_knowledge = []
        for concept in diverse_concepts:
            delta_acc = self.causal_ablation(concept, historical_cases)
            if delta_acc > self.threshold:
                refined_knowledge.append(concept)
        return refined_knowledge

    def diagnose(self, patient_case, knowledge_base):
        # Each diagnostician produces an independent,
        # knowledge-augmented diagnosis.
        diagnoses = [
            agent.diagnose(patient_case, knowledge=knowledge_base)
            for agent in self.diagnosticians
        ]
        consensus = self.evaluator.check_consensus(diagnoses)
        if consensus.agreement_score > 0.8:
            return consensus.primary_diagnosis
        # Low agreement: hand the case off to a human physician.
        return self.escalate_to_human(patient_case, diagnoses)

    def causal_ablation(self, concept, cases):
        # ΔAcc_c = Acc_with_c − Acc_without_c
        acc_with = self.evaluate_accuracy(cases, include=concept)
        acc_without = self.evaluate_accuracy(cases, exclude=concept)
        return acc_with - acc_without
```
Evaluated on 4,390 real-world cases from the MIMIC-MACD dataset across seven diseases:
| Metric | Result |
|---|---|
| Primary diagnostic accuracy gain over clinical guidelines (e.g., Mayo Clinic) | Up to 22.3% |
| Avg improvement from Self-Learned Knowledge | 11.6% |
| MACD vs Human Physicians | Llama-3.1 70B: 0.81 vs Human: 0.65 (p < 0.001) |
| MACD-Human Workflow vs Physicians-only | 18.6% improvement |
| Consensus Rate (MACD-Human) | 58.6% |
| Effective Agent Opinions | 88.5% |
The self-learned knowledge transfers across models and provides traceable rationales for explainability.