Research published in the journal Science delivered a result that will make a lot of people stop and think about the future of medicine.
A patient arrives at the emergency room with a pulmonary embolism, a serious condition where a blood clot travels to the lungs. The initial treatment works and the condition improves, but shortly after, the symptoms start getting worse again. The medical team suspects the medication is not working as it should.
That is where artificial intelligence comes in.
After analyzing the patient’s electronic medical records, the model points to a completely different hypothesis: lupus, an autoimmune disease that can cause inflammation of the heart and explain all that seemingly senseless deterioration. And you know what happened? The AI was right. 🎯
This scenario did not come from an episode of House or a science fiction script. It is a real case from the emergency department at Beth Israel Deaconess Medical Center in Boston, documented in a study published on Thursday in the journal Science by researchers at Harvard Medical School and Beth Israel Deaconess. The team tested a reasoning AI model developed by OpenAI directly against experienced emergency physicians. The results were, to say the least, surprising.
What the research actually tested
The study was not a generic simulation or an exercise with hypothetical scenarios designed to give the machine an edge. The researchers designed a series of experiments to evaluate the AI model’s clinical ability across different fronts. One of them involved real and complex clinical cases from Beth Israel’s own emergency department, including the lupus patient case. Another part used clinical cases published in the New England Journal of Medicine and standardized clinical vignettes, which are classic references in medical training around the world. These are situations that demand sharp clinical reasoning, careful cross-referencing of information, and often years of experience to reach a correct conclusion.
For the real emergency department cases, the team evaluated the model’s ability to provide an accurate diagnosis at three different points during patient care: at initial triage, during the intermediate evaluation, and at the time of hospital admission. The AI only had access to electronic health records and the same limited information that was available to the physicians at that moment. No internet, no external databases, and no additional advantage of any kind during the test.
The artificial intelligence model used is one of OpenAI’s so-called reasoning models. Unlike earlier models such as GPT-4, which produce an answer directly, this type of model works through intermediate reasoning steps before committing to a conclusion. This makes a huge difference when the subject is medical diagnosis, where every detail of a patient’s history can completely change the final conclusion.
On the other side were two experienced physicians, accustomed to making fast decisions in high-pressure emergency settings. The idea was to compare human performance with the AI’s under conditions as close to real hospital settings as possible, without giving either side an advantage. And that is where the numbers got interesting. 📊
The numbers nobody expected to see
Overall, the AI model outperformed both experienced physicians in real emergency department cases, using only electronic medical records and the limited information available at the time. On top of that, the model also outperformed its predecessor, GPT-4, showing a significant leap from one model generation to the next.
In the portion of the study using cases published in the New England Journal of Medicine and clinical vignettes, the model also stood out. According to Raj Manrai, assistant professor of Biomedical Informatics at Harvard Medical School and one of the study’s authors, the model outperformed a baseline drawn from a large group of physician evaluators.
What impressed the researchers even more was the model’s ability to identify rare and uncommon diagnoses, exactly like what happened in the lupus case mentioned at the beginning. Rare diseases are notoriously difficult to diagnose because their symptoms often mimic more common conditions. Physicians naturally tend to follow the most likely path first, which is totally understandable given patient volume and time pressure. An AI model, on the other hand, can weigh a wide range of possibilities at once and may be less prone to anchoring on the first plausible explanation, a bias that is so natural in human reasoning. This can give the model an advantage in cases that fall outside the norm.
Earlier versions of large language models struggled when they needed to handle uncertainty and generate differential diagnosis lists — those lists of possible conditions that could explain a patient’s symptoms. The progress documented in this new study shows just how much the technology has evolved in just a few years. 🧠
What experts outside the study are saying
The reaction from professionals who did not participate in the research also deserves attention. Dr. David Reich, chief clinical officer of the Mount Sinai Health System in New York, called the paper a nice summary of how much things have improved in the field. According to him, we now have something that is quite accurate and possibly ready for large-scale use. The big open question, in his view, is how to introduce this technology into clinical workflows in a way that actually improves patient care.
Reich also raised an important point. Arriving at a complicated final diagnosis — which is where the AI model shines — does not necessarily reflect how things work in day-to-day clinical medicine. In practice, outcomes are much more nuanced and varied than simply getting a diagnosis right or wrong. There are subtleties in treatment, in how a patient progresses, and in the decisions that need to be made over weeks or months of follow-up care.
Dr. Adam Rodman, a clinical researcher at Beth Israel and co-author of the study, acknowledges this limitation himself. For him, the big takeaway is that the model works with the real, messy data from the emergency department, and that is very significant. But he admits it is unlikely the AI would have performed as impressively if the team had fed it the records of a patient hospitalized for an entire month, for example. The emergency room is just one slice of the total care a patient receives.
Does this mean AI is going to replace doctors?
That is the question everyone asks as soon as they read a story like this, and the honest answer is: no, at least not in the way most people imagine. The study authors themselves were emphatic on this point. None of the researchers involved believe the results justify replacing doctors with artificial intelligence, despite, in Manrai’s words, what some companies will probably say and how they will probably use these results.
The authors emphasize that the AI model relied exclusively on text. In real life, doctors need to deal with many other types of information, such as imaging scans, stethoscope sounds, a patient’s nonverbal cues, and direct communication with the person being treated. Clinical reasoning goes far beyond getting a diagnosis right on paper. It involves empathy, detailed physical examination, conversations with family members, and real-time decision-making with information that is frequently incomplete and contradictory — things AI still cannot do on its own.
What the research does make clear, however, is that ignoring this technology in hospital settings is starting to look less and less reasonable. Imagine a scenario where the emergency physician has a model at their disposal that can review clinical reasoning in real time, suggest hypotheses that might have been overlooked, and flag symptom combinations that are statistically associated with rare conditions. That does not take away the professional’s autonomy. On the contrary, it could boost confidence in the decisions being made and help reduce the risk of diagnostic error, a problem that is far more common than people realize, even at top hospitals.
It is worth remembering that evidence-based medicine has already incorporated countless technological tools over the decades, from advanced imaging to cardiovascular risk algorithms. Artificial intelligence applied to diagnosis is, from that perspective, another step in the same direction — just with a pretty significant leap in capability. What changes now is the scale, speed, and depth at which these tools can operate. 🚀
The challenges that still need to be overcome
Despite the encouraging results, there are real and significant barriers between a scientific study and real-world implementation. One of the main ones is the question of how to validate these models across diverse populations. The study was conducted at a specific medical center in the United States, and it is essential to ensure that performance holds up when applied to different clinical, demographic, and cultural contexts.
Data privacy issues also come into play. For an AI model to work well in a hospital setting, it needs access to detailed electronic medical records, and that raises legitimate concerns about protecting sensitive patient information. Beyond that, there is the debate around legal liability in case of an error: if the model suggests a diagnosis and the doctor follows the recommendation, but the outcome is negative, who is held accountable?
Dr. David Reich pointed out that designing prospective clinical trials (studies that track the use of the technology going forward in time) is a very challenging process, but an absolutely necessary one. For him, this study is the perfect call for the medical community to start designing those trials rigorously. Only then will it be possible to have greater certainty about how the technology actually impacts clinical practice over the long term.
Another point worth paying attention to is practical integration into workflows. Hospitals operate on very well-defined protocols, and adding a layer of AI to the clinical decision-making process requires care to avoid creating confusion, delays, or information overload that could end up having the opposite of the desired effect. The technology needs to fit into the pace of patient care, not get in the way.
A profound shift already underway
Even with all the challenges ahead, the signal this study sends to the healthcare sector is pretty clear. As Manrai put it, we are witnessing a truly profound shift in technology, and it is going to reshape medicine. This is no longer a future promise or a lab curiosity. The data published in Science shows that, on text-based diagnostic tasks drawn from real clinical records, an AI model can already perform at or above the level of the experienced physicians it was compared against.
The research from Harvard and Beth Israel is not the final word in this discussion. It is a robust and scientifically rigorous starting point for a debate that is going to grow significantly in the coming years. And if the speed of AI model evolution over the past few years is any indicator, it is safe to say the next chapters of this story will arrive faster than most people expect. 💡
For anyone following the world of artificial intelligence, this is yet another milestone that reinforces how the technology is moving from being a supplement to becoming a central piece in areas that directly impact people’s lives. Medicine is probably the field where this transformation has the most tangible and urgent consequences. And the results of this study show that the future, at least in part, has already arrived in the emergency room.
