The Silent Problem With AI in Healthcare: Being Wrong With Conviction
Artificial Intelligence has already become routine in many hospitals and clinics around the world, but there is a problem that few people stop to discuss: what happens when it gets things wrong with absolute conviction?
This is not science fiction.
Doctors have set aside their own clinical intuition because an AI system flagged something different, with such high confidence that it seemed impossible to question. The result can be dangerous, especially when a wrong diagnosis is accepted as truth simply because it came from a machine that appears to know everything.
This is exactly the scenario that an international group of researchers led by MIT decided to tackle head-on. The study, published in the journal BMJ Health and Care Informatics, presents a proposal that sounds simple but is technically sophisticated: teaching AI to have humility.
Not humility in the poetic sense of the word, but the actual ability to recognize when it is not sure about what it is saying, flag that uncertainty for the doctor, and encourage the pursuit of more information before any decision is made.
The difference between an AI that acts like an oracle and one that acts like a copilot might seem small on paper, but in clinical practice it changes everything. 🏥
Leo Anthony Celi is a senior researcher at the MIT Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School. As he puts it, the idea is to use AI not as an entity that delivers ready-made answers, but as a partner that enhances the professional’s ability to connect the dots and make more informed decisions.
The Real Problem: When the Machine Is Too Confident
For years, the development of Artificial Intelligence systems for healthcare was driven by one very specific metric: accuracy. The more the model got right, the better it was considered. This produced incredibly precise tools under ideal conditions, but it also created a silent and dangerous side effect. Systems trained to always deliver a definitive answer started doing exactly that, even when the input data was ambiguous, incomplete, or outside the patterns the model had learned to recognize. In other words, the machine learned to appear confident regardless of whether it was actually correct.
This behavior has a technical name: overconfidence. And it is especially problematic in the clinical setting, because healthcare professionals who work alongside these tools tend to interpret a high confidence score as a kind of confirmation. When a system points to a 97% probability for a given diagnosis, it is psychologically very difficult for any human being to question it, even if something about the patient’s clinical presentation does not quite match what is on the screen.
Previous studies cited by the MIT researchers themselves show that ICU physicians tend to defer to AI systems they perceive as reliable, even when their own intuition points in the opposite direction. Both doctors and patients are more likely to accept incorrect AI recommendations when they are presented in an authoritative manner. Automation bias, which is the tendency to over-rely on automated systems, starts to compromise the quality of care in concrete and measurable ways.
The problem is not that AI makes mistakes. Any system will make mistakes at some point. The problem is when it makes mistakes without warning that it might be wrong, and that turns a support tool into a real source of risk for the patient.
The MIT Research: A Framework for Humble AI
The research group was led by Celi, with Sebastián Andrés Cajas Ordoñez as lead author. Cajas Ordoñez is a researcher at MIT Critical Data, a global consortium linked to the MIT Laboratory for Computational Physiology. The group started from a premise that seems obvious but is rarely put into practice: an honest Artificial Intelligence system needs to know when it does not know.
To achieve this, the consortium developed a framework composed of computational modules that can be integrated into existing AI systems. The first of these modules requires the AI model to assess its own certainty when making diagnostic predictions. Developed by consortium members Janan Arslan and Kurt Benke from the University of Melbourne, this component was named the Epistemic Virtue Score. It works as a kind of self-awareness check, ensuring that the system’s confidence is properly tempered by the inherent uncertainty and complexity of each clinical scenario.
With this self-awareness in place, the model adapts its response to the situation. If the system detects that its confidence exceeds what the available evidence supports, it can pause and flag the inconsistency. From there, it can request specific tests or additional medical history that might help resolve the uncertainty, or recommend a specialist consultation. The goal is to create an AI that not only provides answers but also signals when those answers should be treated with caution.
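One way to picture this behavior is as a simple decision rule. The sketch below is not the consortium’s Epistemic Virtue Score, whose formula this article does not describe. It assumes two hypothetical inputs, a model confidence and an evidence-support score, both between 0 and 1, plus a hypothetical margin, and it only illustrates the idea of flagging a prediction when confidence outruns the evidence.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    action: str  # "report" or "flag"
    note: str    # short message for the clinician


def review_prediction(confidence: float,
                      evidence_support: float,
                      margin: float = 0.15) -> Assessment:
    """Hypothetical rule: flag a prediction when its stated confidence
    exceeds the evidence-support score by more than `margin`.

    Both scores and the default margin are illustrative assumptions.
    """
    for score in (confidence, evidence_support):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")

    gap = confidence - evidence_support
    if gap > margin:
        return Assessment(
            action="flag",
            note=(f"Confidence exceeds evidence support by {gap:.2f}. "
                  "Consider further tests, additional history, "
                  "or a specialist consultation."),
        )
    return Assessment(
        action="report",
        note="Confidence is consistent with the available evidence.",
    )
```

With these assumed values, `review_prediction(0.97, 0.60)` is flagged, while `review_prediction(0.80, 0.75)` is reported as usual. In a real system the hard part is producing a trustworthy evidence-support score, which this sketch simply takes as given.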
In Celi’s words, it is like having a copilot who tells you that you need a fresh set of eyes to better understand that complex patient. 🔬
Technically, this involves what the field calls uncertainty calibration: the model’s ability to express confidence in its predictions in a way that matches how often those predictions turn out to be correct. A well-calibrated model does not report 95% confidence for a group of cases in which it is right far less than 95% of the time.
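A common way to measure calibration is the expected calibration error (ECE), which groups predictions by confidence and compares the average confidence in each group with the fraction that were actually correct. The article does not say which metric the researchers used, so the sketch below is only an illustration of the concept, with an assumed default of ten equal-width bins.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen label
    correct:     booleans, True where the prediction was right
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in last bin
        bins[index].append((c, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece
```

For example, four predictions made at 0.95 confidence of which only two are correct give an ECE of about 0.45, while four predictions at 0.75 confidence of which three are correct give an ECE of 0. The lower the value, the more closely stated confidence tracks reality.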
Additionally, the model was designed to automatically identify when a case falls outside the distribution of data it was trained on, which in technical language is called out-of-distribution detection. This is critical because a large portion of serious clinical AI errors happen precisely when the system encounters a patient profile, a combination of symptoms, or a type of image it never saw during training, and yet responds with high confidence as if it were completely familiar. With active detection of these cases, the system can alert the doctor that a specific diagnosis is being made in uncharted territory, and that human oversight is especially important there.
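To make the idea concrete, here is a deliberately simple stand-in for out-of-distribution detection: it flags any input feature that sits many standard deviations away from what was seen in training. This is an assumption for illustration, not the method used in the study, and the function names, threshold, and example features are hypothetical. Production systems typically rely on more sophisticated techniques.

```python
from statistics import mean, pstdev


def fit_feature_stats(training_rows):
    """Return (mean, standard deviation) for each feature column."""
    if not training_rows:
        raise ValueError("training data is empty")
    columns = list(zip(*training_rows))
    return [(mean(col), pstdev(col)) for col in columns]


def out_of_distribution_features(row, stats, z_threshold=3.0):
    """Return the indices of features that look unlike the training data.

    A feature is flagged when its z-score exceeds `z_threshold`
    (an illustrative default, not a clinically validated cutoff).
    """
    flagged = []
    for index, (value, (mu, sigma)) in enumerate(zip(row, stats)):
        if sigma == 0:
            # Training data had a single value; anything else is unfamiliar.
            if value != mu:
                flagged.append(index)
        elif abs(value - mu) / sigma > z_threshold:
            flagged.append(index)
    return flagged


# Hypothetical training set: (age in years, heart rate in bpm), adults only.
training = [[50, 70], [60, 80], [55, 75], [65, 85], [45, 65]]
stats = fit_feature_stats(training)
```

Here `out_of_distribution_features([58, 78], stats)` returns an empty list, while a three-year-old patient, `[3, 76]`, returns `[0]` because the age falls far outside the adult-only training data. That is the kind of case where the system should tell the doctor it is operating in unfamiliar territory.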
Real Collaboration: AI and Doctors Working Together
The concept of collaboration between humans and machines in healthcare is not new, but it has rarely been implemented in a genuine way. In most cases, what exists is a one-directional consultation relationship: the doctor inputs the data, the system returns an answer, and it is up to the professional to decide whether to follow that recommendation or not. The problem is that when the answer comes with an extremely high confidence score, that decision is rarely made in a truly independent way. The collaboration ends up being superficial, and the real weight of the decision becomes invisible.
The model proposed by MIT tries to change this dynamic at a structural level. By making uncertainty explicit and communicable, the system transforms the interaction between doctor and AI into something much closer to an actual dialogue. The healthcare professional does not just receive an answer — they receive context. They know what the competing hypotheses are, they understand which aspects of the case are harder to interpret based on available data, and they can use this map of uncertainties to guide the next steps of the clinical investigation.
This might mean ordering an additional test, calling in a specialist, or simply taking more time to observe the patient’s progression before finalizing the diagnosis.
As Cajas Ordoñez explains, the idea is to actively include humans in these AI systems, making it easier for people to reflect and reimagine collectively, rather than relying on isolated AI agents that do everything on their own. The goal is for humans to become more creative through the use of artificial intelligence, not less.
This approach also has an important impact on medical training and clinical culture as a whole. When Artificial Intelligence systems are transparent about their limitations, they reinforce in professionals the idea that uncertainty is part of the process and that acknowledging it is not weakness — it is competence. This moves in the opposite direction of the hyperconfidence culture that many current systems end up feeding unintentionally. Real collaboration begins when both sides, the human and the machine, are able to clearly state what they know and what they do not know. 🤝
Human Values as Part of the Design
One of the richest discussions that emerges from this research is about the role of human values in the development of AI systems for healthcare. For a long time, the design of these systems was guided almost exclusively by technical metrics: accuracy, sensitivity, specificity, area under the ROC curve. These metrics are important, but they do not capture fundamental dimensions of medical care, such as the importance of informed consent, the need for the patient to understand the uncertainties of their own diagnosis, or the ethical value of preserving the healthcare professional’s agency in the face of an automated recommendation.
The MIT work starts from the principle that incorporating human values into the design of an AI system is not an abstract philosophical question — it is an engineering decision with direct practical consequences. When you decide that the system should communicate uncertainty clearly, you are taking an ethical stance on the doctor’s right to complete information. When you decide that the model should flag cases outside its training distribution, you are taking a stance on responsibility and patient safety. Every architectural choice carries with it a set of values, and the question is whether those values were chosen consciously or simply inherited from past optimizations.
Thinking this way opens space for multidisciplinary teams — including not just engineers and data scientists, but also doctors, nurses, bioethicists, patients, and user experience specialists — to actively participate in design decisions from the very beginning of development. Human values cannot be added as a coat of varnish at the end of the process. They need to be present in the problem definition, in the selection of training data, in the way the interface communicates results, and in how the system handles its own mistakes. That is the difference between an AI that was made to be used by humans and one that was made to work with humans. ✨
The Data Challenge and the Push for More Inclusive AI
This study is part of a broader effort by Celi and his colleagues to create AI systems that are designed by and for the people who will be most impacted by these tools. Many AI models are trained on publicly available data from the United States, such as the MIMIC (Medical Information Mart for Intensive Care) database, which can introduce biases toward a certain way of thinking about medical issues while excluding other perspectives.
Bringing in more viewpoints is essential for overcoming these potential biases, according to Celi, who emphasizes that each member of the global consortium brings a distinct perspective to a broader collective understanding.
Another concrete problem with AI systems used for diagnosis is that they are generally trained on electronic health records, which were not originally created for that purpose. This means the data lacks much of the context that would be useful for making diagnoses and treatment recommendations. On top of that, many patients, such as people living in rural areas, never end up in these datasets because they lack access to care.
In data workshops organized by MIT Critical Data, groups that bring together data scientists, healthcare professionals, social scientists, patients, and other stakeholders work together on the design of new AI systems. Before getting started, everyone is encouraged to reflect on whether the data they are using captures all the factors that influence what they intend to predict, ensuring they do not inadvertently encode existing structural inequalities into their models.
Celi explains that he has participants question the dataset: whether they are confident about the training and validation data, whether they think there are patients who were excluded, intentionally or unintentionally, and how that will affect the model itself. He adds that it is not possible to stop or even slow down the development of AI, not just in healthcare but across all sectors, but it is necessary to be more deliberate and careful in how it is done.
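One of these questions, whether some patients are missing from the data, can be turned into a quick check. The sketch below compares each group’s share of a dataset with its share of a reference population and reports the groups that fall well short. The field name, the reference shares, and the tolerance are all hypothetical, and the workshops described above are not claimed to use this exact procedure.

```python
from collections import Counter


def underrepresented_groups(records, key, reference_shares, tolerance=0.5):
    """Find groups whose share of `records` is below
    `tolerance` times their share of the reference population.

    Returns {group: (observed_share, expected_share)}.
    """
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to audit")

    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if observed < tolerance * expected:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical dataset: 19 urban patients and 1 rural patient.
records = [{"setting": "urban"}] * 19 + [{"setting": "rural"}]
# Assumed reference population: 80% urban, 20% rural.
gaps = underrepresented_groups(records, "setting",
                               {"urban": 0.8, "rural": 0.2})
```

In this made-up example only the rural group is reported, since it makes up 5% of the records against an assumed 20% of the population. A check like this only covers attributes that were recorded in the first place, so it complements the discussion among participants rather than replacing it.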
What Changes in Clinical Practice
Translating all of this to the day-to-day reality of a hospital or clinic requires thinking not just about the technology, but about how it fits into the actual workflows of healthcare professionals. A system that communicates uncertainty needs to do so in a way that is readable and useful within the time constraints and pressure that define the clinical environment. There is no point in delivering a detailed technical report on probability distributions if the doctor has three minutes to make a decision about a critical patient. The interface and the way results are presented are just as important as the model itself.
In that regard, the researchers also worked with specialists in human-computer interaction to develop visual and textual ways of communicating the model’s uncertainty in an immediate and intuitive manner. This includes visual indicators that distinguish high-confidence cases from cases where the system is operating in a gray zone, as well as contextual messages that briefly explain which factors in the case are contributing to the uncertainty. The goal is for the doctor to be able to absorb this information quickly and use it to calibrate their own decision-making process, without having to dive into technical documentation to understand what the machine is saying.
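As a rough illustration of what such an indicator might look like in text form, the sketch below turns a confidence value, an unfamiliar-case flag, and a list of contributing factors into a single short line. The labels, thresholds, and wording are assumptions made for this example and do not describe the interface the researchers built.

```python
def uncertainty_banner(confidence, out_of_distribution, factors,
                       high=0.9, low=0.6):
    """Build a one-line summary a clinician can read at a glance.

    `high` and `low` are illustrative cutoffs, not validated values.
    """
    if out_of_distribution:
        level = "UNFAMILIAR CASE"  # takes priority over the score itself
    elif confidence >= high:
        level = "HIGH CONFIDENCE"
    elif confidence >= low:
        level = "GRAY ZONE"
    else:
        level = "LOW CONFIDENCE"

    # Show at most two factors to keep the message short.
    if factors:
        reason = "; ".join(factors[:2])
    else:
        reason = "no specific factors identified"
    return f"[{level}] {confidence:.0%} | Uncertainty drivers: {reason}"


banner = uncertainty_banner(
    0.72, False, ["missing lab values", "atypical symptom combination"]
)
```

Here `banner` reads `[GRAY ZONE] 72% | Uncertainty drivers: missing lab values; atypical symptom combination`. Capping the list at two factors is one possible design choice for a doctor with only minutes to decide. The unfamiliar-case label overrides the score because a high number means little for a case the model has never seen.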
Celi’s team is already implementing the new framework in AI systems based on the MIMIC database and introducing it to physicians within the Beth Israel Lahey Health system. This approach can also be applied to systems used to analyze X-ray images or to determine the best treatment options for patients in the emergency room, among other applications.
Doctors who worked with the humble version of the system reported greater trust in the tool — not because it got more things right, but because they knew when to trust it and when to be more careful. This distinction is crucial. Calibrated trust, which means knowing exactly how far a tool can go, is much more valuable than blind trust, which means assuming the tool always knows best. And that is exactly what an AI with genuine humility can build over time with the professionals who use it. 💡
The Road Ahead
The research, funded by the Boston-Korea Innovative Research Project through the Korea Health Industry Development Institute, represents an important step toward AI systems that treat the healthcare professional as a partner, not merely a recipient of automated instructions. More than an isolated technical innovation, the framework proposed by MIT places at the center of the discussion a question that the entire health technology sector needs to face: what good is an extremely accurate AI if it cannot recognize the limits of its own accuracy?
The answer the researchers offer is that computational humility is not a luxury or an extra feature. It is a fundamental necessity for Artificial Intelligence to truly fulfill the promise of helping doctors diagnose patients and personalize treatment options, without running the risk of pushing them in the wrong direction.
And perhaps the biggest takeaway here is that for technology to truly advance healthcare, it needs to learn something that the best doctors have known for centuries: the difference between having an answer and having the right answer is not always obvious, and recognizing that is what matters most. 🧠
