OpenAI’s o1 Outperforms Doctors at Diagnosis in Harvard Study

OpenAI’s o1 reasoning model correctly diagnosed 67% of emergency room patients in a new Harvard study, while triage doctors working from the same notes landed between 50% and 55%. The trial, flagged on Hacker News and published in Science, pitted the model against hundreds of physicians on real ER cases at a Boston hospital. Independent experts called the results a ‘genuine step forward’ for clinical reasoning in large language models.

What the researchers actually did

The team ran two main experiments at Beth Israel Deaconess Medical Center. In the first, 76 patients who arrived at the ER had their standard electronic health record handed to both o1 and pairs of human doctors. That record was the bare bones of triage: vitals, demographics, and a few sentences from a nurse explaining why the patient came in.

The second experiment tested longer-term thinking. Forty-six doctors and the AI worked through five clinical case studies, building treatment plans that included antibiotic regimens and end-of-life planning. The humans got to use conventional resources like search engines.

The numbers

Here’s how the head-to-head shook out, according to the figures reported in the study:

Triage diagnosis (minimal info): o1 achieved 67% accuracy compared to human doctors at 50-55%.

Diagnosis with fuller detail: o1 reached 82% accuracy versus human doctors at 70-79%.

Long-term treatment plans: o1 achieved 89% accuracy while human doctors reached only 34%.

The gap on triage, where speed and limited information matter most, is the headline. The fuller-detail comparison was not statistically significant, so don’t read too much into the 82 vs 79. The treatment planning gap, on the other hand, is enormous and hard to explain away.
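To see why a three-point gap can fail to reach significance, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions that are not in the study: it treats the reported 82% and 79% as simple proportions over the same 76 patients (the article does not give the denominator for the fuller-detail comparison), and it uses a Wilson score interval, which is a generic textbook method and not necessarily what the authors used. The function name `wilson_interval` is illustrative only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# ASSUMPTION: 76 cases in the fuller-detail comparison, as in the triage experiment.
N = 76

# Convert the reported percentages into whole-patient counts (also an assumption).
o1_correct = round(0.82 * N)      # 62 of 76
doctor_correct = round(0.79 * N)  # 60 of 76, using the top of the 70-79% range

o1_low, o1_high = wilson_interval(o1_correct, N)
doc_low, doc_high = wilson_interval(doctor_correct, N)

print(f"o1:      {o1_low:.2f} to {o1_high:.2f}")
print(f"doctors: {doc_low:.2f} to {doc_high:.2f}")
print("intervals overlap:", o1_low < doc_high and doc_low < o1_high)
```

Under these assumptions each interval is roughly 17 to 18 percentage points wide, so a three-point difference sits well inside the noise. Overlapping intervals are only a rough heuristic, and because the same patients were assessed by both sides, a paired test would be the more appropriate formal check.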

One case stuck with the researchers. A patient came in with a pulmonary blood clot and worsening symptoms. The human doctors assumed the anticoagulants were failing. o1 caught something they missed: the patient’s lupus history pointed to lung inflammation as the real driver. The AI was right.

Why this matters for practitioners

Lead author Arjun Manrai, who runs an AI lab at Harvard Medical School, was clear that this isn’t a replacement story. The AI was working from text only. It never saw the patient’s face, never registered distress, never noticed the thousand small physical signals a doctor reads in person. Co-author Dr Adam Rodman framed the future as a ‘triadic care model’: doctor, patient, and AI working together.

That framing matches what’s already happening. Nearly one in five US physicians use AI to help with diagnosis, per recent research. In the UK, 16% of doctors use it daily and another 15% weekly, with clinical decision-making one of the top use cases in a Royal College of Physicians survey.

For anyone building or deploying clinical AI, the practical takeaway is sharper than ‘AI is good now.’ The model shines as a second opinion on the written record, especially when the differential diagnosis is wide and a doctor risks anchoring on the obvious answer too early. That’s exactly where it caught the lupus link.

The limitations the authors flagged

The researchers were careful to list what the study does not show:

It does not test physical examination, where doctors still have a clear edge.
It does not break down which patient groups the AI handles worse. Dr Wei Xing of the University of Sheffield asked the obvious question: does it stumble on elderly patients or non-English speakers? Nobody knows yet.
It does not address accountability. Rodman noted there’s no formal framework for who carries the liability when an AI gets it wrong.

Xing also raised automation bias. Doctors may start deferring to the AI rather than reasoning independently, and that risk grows the more routine the tool becomes.

Prof Ewen Harrison of the University of Edinburgh summed up the cautious read: these systems are moving past ‘passing medical exams’ into useful second-opinion territory, but routine clinical deployment is a different bar entirely.

