The Harvard Medical AI Triage Study is a research initiative examining the application of advanced language models to emergency department triage decisions. Conducted collaboratively by Harvard Medical School, Beth Israel Deaconess Medical Center, and Stanford University, the study compared the performance of OpenAI's o1 model in clinical triage scenarios against that of experienced attending physicians.
The research used an experimental framework of six distinct triage scenarios designed to evaluate diagnostic accuracy and clinical decision-making. A key methodological strength was the use of blinded reviewers, who evaluated both AI-generated and physician-generated assessments without knowing the source of each. Blinding reduced the risk of bias toward or against AI-generated diagnoses and allowed a more objective comparison of the two sets of assessments 1).
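The blinding step described above can be illustrated with a minimal Python sketch. It is not the study's actual procedure, which the source does not describe in detail; the `Assessment` type, the `blind_assessments` function, and the token format are hypothetical names introduced here for illustration. The idea is simply to shuffle the assessments, replace the source label with an opaque token, and keep the token-to-source key away from the reviewers until scoring is complete.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One triage assessment (hypothetical structure, for illustration only)."""
    case_id: str
    source: str  # "ai" or "physician"; must never reach the reviewer
    text: str


def blind_assessments(assessments, seed=0):
    """Return (blinded_items, unblinding_key).

    blinded_items: dicts containing a token, the case id, and the text,
        in shuffled order and with no source field.
    unblinding_key: maps each token back to its source; it is meant to be
        held by a study coordinator, not by the reviewers.
    """
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    shuffled = list(assessments)
    rng.shuffle(shuffled)

    blinded_items = []
    unblinding_key = {}
    for index, item in enumerate(shuffled):
        token = f"R{index:03d}"
        blinded_items.append(
            {"token": token, "case_id": item.case_id, "text": item.text}
        )
        unblinding_key[token] = item.source
    return blinded_items, unblinding_key


if __name__ == "__main__":
    sample = [
        Assessment("case-1", "ai", "Likely sepsis; prioritize immediately."),
        Assessment("case-1", "physician", "Possible sepsis; urgent workup."),
    ]
    blinded, key = blind_assessments(sample, seed=42)
    for item in blinded:
        print(item["token"], item["text"])
```

In practice the text itself can also leak the source through phrasing or formatting, so a real protocol would probably normalize the presentation of assessments as well; this sketch only covers removal of the explicit label.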
The study design specifically focused on emergency department (ED) triage, a high-stakes clinical domain where rapid and accurate assessment directly impacts patient outcomes. Triage represents a particularly challenging decision point in acute care, as it requires synthesizing incomplete information under time pressure to classify patients by clinical urgency.
The experimental results showed OpenAI's o1 model outperforming attending physicians across the triage scenarios evaluated. In addition, blinded reviewers could not reliably distinguish AI-generated assessments from those written by physicians, suggesting that the AI output was at least comparable in quality to physician output. This finding carries significant implications for the potential role of AI systems in supporting clinical decision-making 2).
The performance advantage was most pronounced during the initial, uncertain phase of triage: the early minutes when clinical information is incomplete and diagnostic uncertainty is highest. This timing is clinically significant because early triage decisions establish the patient's clinical pathway and resource allocation within the emergency department. The model's gains were therefore concentrated where human clinicians face the greatest cognitive demands in managing diagnostic uncertainty.
The study results suggest potential applications for AI-assisted triage systems in emergency medicine settings. Rather than replacing physician judgment, such systems could support clinicians during high-uncertainty decision points by providing evidence-based recommendations and helping structure diagnostic reasoning. The finding that reviewers could not distinguish AI from human assessments suggests that the model's output already matches the form and standard that clinicians expect, which may ease its integration into existing review processes 3).
Emergency department triage traditionally relies on standardized protocols such as the Emergency Severity Index (ESI) scale, which categorizes patients into acuity levels based on resource requirements and urgency. AI systems could potentially enhance these existing frameworks by improving stratification accuracy and reducing inter-rater variability in triage assessment.
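Inter-rater variability of the kind mentioned above is commonly quantified with an agreement statistic such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following Python sketch is an illustration only and is not drawn from the study; the function name `cohens_kappa` and the example ratings are assumptions. It computes unweighted kappa for two raters who each assign an acuity level (for example, ESI 1 to 5) to the same set of patients.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
    of agreement and p_e is the agreement expected by chance given each
    rater's own label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty case list.")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every case;
        # agreement is perfect and the usual formula would divide by zero.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical acuity levels assigned by two triage nurses.
    nurse_1 = [2, 3, 3, 4, 5, 2, 3]
    nurse_2 = [2, 3, 4, 4, 5, 3, 3]
    print(round(cohens_kappa(nurse_1, nurse_2), 3))
```

Because acuity levels are ordinal, a weighted kappa that penalizes a one-level disagreement less than a three-level disagreement is often the more appropriate choice; the unweighted form is shown here only because it is the simplest to follow.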
While the study presents promising findings, several factors warrant careful consideration before clinical implementation. The research evaluated AI performance on six constructed experimental scenarios, which may not reflect real-world emergency department operations with their inherent complexity, time pressures, and diverse patient populations. Real-world deployment would require validation across multiple institutions and patient demographics.
Additionally, questions remain regarding the appropriate role of AI in clinical decision-making. Integration of AI-generated recommendations into existing clinical workflows requires careful attention to human-AI collaboration, maintaining physician oversight, and ensuring that AI support enhances rather than diminishes clinical reasoning 4).
The study contributes to the growing body of evidence regarding large language models' potential in medical applications, particularly in high-stakes decision-making domains where rapid assessment and diagnostic accuracy are critical clinical requirements.