AI Medical Chatbots Wrong 40% of Time Despite Growing $10B Market, University Study Shows

February 15, 2025
6 mins read
A head made of electronic circuit boards. Public domain stock illustration.

You pull out your phone, type in your symptoms, and within seconds, an AI chatbot delivers what sounds like expert medical advice. The response comes wrapped in confident language, complete with treatment recommendations. But here’s what might surprise you: that digital doctor gets it wrong 4 out of 10 times.

A University of Florida College of Medicine study revealed that ChatGPT provided appropriate responses to medical questions only 60% of the time when tested on common urology topics. The research, published in the journal Urology, examined how AI chatbots handle real patient questions—the kind people actually ask their doctors during office visits.

Yet nearly half of Americans are turning to these digital tools for health guidance. A 2024 Deloitte survey found that 48% of consumers have used generative AI for health-related concerns, drawn by promises of accessible, affordable healthcare information.

The Reality Check: What Researchers Actually Found

Dr. Russell S. Terry, Assistant Professor of Urology at UF College of Medicine and senior author of the study, told Karmactive that the findings reveal significant gaps in AI’s medical capabilities. “I am not discouraging people from using chatbots, but don’t treat what you see as the final answer. Chatbots are not a substitute for a doctor,” Terry explained to Karmactive.

The University of Florida team tested ChatGPT on 13 common urologic questions, asking each three times since the AI can generate different responses to identical queries. Five board-certified urologists independently evaluated the 39 total responses using established medical guidelines from leading professional organizations including the American Urological Association.

The results paint a concerning picture. Beyond the 60% accuracy rate, researchers found that ChatGPT “misinterprets clinical care guidelines, dismisses important contextual information, conceals its sources and provides inappropriate references.”

When asked to provide sources for its medical advice, the chatbot was almost uniformly unable to do so. “It provided sources that were either completely made up or completely irrelevant,” Terry informed Karmactive. “Transparency is important so patients can assess what they’re being told.”

The variability problem adds another layer of complexity. In 25% of question sets, the same prompt yielded different “appropriateness designations,” meaning identical medical questions could receive contradictory answers depending on when you asked.
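For readers curious how a repeated-query evaluation like this can be organized, here is a minimal sketch in Python. The `ask_chatbot` function is a hypothetical stand-in that returns canned answers at random purely to simulate variability; the sketch only mirrors the study's design of asking each of 13 questions three times and flagging question sets whose answers differ, and is not the researchers' actual code.

```python
import random

# Hypothetical stand-in for a chatbot call. The study queried ChatGPT directly;
# this stub just returns one of two canned answers to simulate variable output.
def ask_chatbot(question: str) -> str:
    return random.choice(["answer variant A", "answer variant B"])

questions = [f"urology question {i}" for i in range(1, 14)]  # 13 topics, as in the study
runs_per_question = 3                                        # each asked three times -> 39 responses

responses = {q: [ask_chatbot(q) for _ in range(runs_per_question)] for q in questions}

# In the study, five urologists then graded every response against professional
# guidelines. Here we only reproduce the variability check: how many question
# sets produced non-identical answers across the three runs.
variable_sets = sum(1 for answers in responses.values() if len(set(answers)) > 1)
print(f"{variable_sets} of {len(questions)} question sets returned differing answers")
```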

The Confidence Trap: When AI Sounds Too Sure

Perhaps most troubling is how these digital tools deliver incorrect information. “One of the more dangerous characteristics of chatbots is that they can answer a patient’s inquiry with all the confidence of a veteran physician, even when completely wrong,” the UF Health study noted.

Dr. Terry explained to Karmactive that this confidence creates a disconnect. “These warnings are often dissonant with the high degree of confidence expressed by the chatbots in their responses, which can create confusion. It’s like the bot is delivering an answer which sounds very professional, confident, and reasonable—independently of whether it actually is accurate or not—and then it immediately couches the statement with a quick disclaimer saying ‘this is not medical advice’,” Terry told Karmactive.

In only one of the 39 evaluated responses did the AI note it “cannot give medical advice.” The chatbot recommended consulting with a doctor or medical adviser in just 62% of its responses.

The study found ChatGPT performed well on topics like hypogonadism, infertility, and overactive bladder, but struggled significantly with others. For recurrent urinary tract infections in women, it consistently provided incorrect information.



The Black Box Problem: What You Don’t Know Can Hurt You

Terry told Karmactive that the core issue stems from what experts call the “black box effect.” “Probably the biggest limitation to using chatbots for medical advice right now, however, is the ‘black box’ effect which results from lack of insight into the specific data and content on which the models were trained,” he explained.

“If the majority of the chatbot models’ training data was comprised of a vast swath of freely available information on the internet, the chatbot medical advice quality will be inherently bottlenecked and subject to the limitations and biases of its training data,” Terry informed Karmactive.

This creates a fundamental problem with context and specificity. Terry told Karmactive that patients often lack the medical training to frame questions optimally. “A query from a patient about ‘my side hurts’ would likely produce less specific results than a physician querying ‘differential diagnosis for patient with new onset acute flank pain, severe intensity, waxing and waning, associated with nausea, vomiting, and pink tinge urine,'” Terry informed Karmactive.

Where AI Actually Shines: The Educational Sweet Spot

Despite these limitations, Dr. Terry sees genuine potential for AI in healthcare education. For very general medical topics, he told Karmactive, chatbots probably have tremendous potential to help.

“Think about how many times the average primary care physician has the same exact conversation with their patients about the importance of weight loss, blood pressure control, recommended vaccinations, etc. This is incredibly important information for health maintenance, but much of it is not always individualized and could potentially be delivered interactively to interested and motivated patients via a chatbot modality,” Terry explained.

“This would then free up the physician to spend more time to focus on their patients’ complex individual needs which require a higher level of thought, medical decision making, and time to appropriately address,” he told Karmactive.

The healthcare chatbot market reflects this growing interest. The global healthcare chatbots market was valued at USD 1.49 billion in 2025 and is projected to reach USD 10.26 billion by 2034, expanding at a CAGR of 23.92%.
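As a quick arithmetic check, the growth rate implied by those two figures can be verified with the standard compound annual growth rate formula; the short sketch below uses only the numbers quoted above from the cited market report.

```python
# Sanity check of the projection quoted above, using the standard CAGR formula.
start_value = 1.49          # USD billions, 2025
end_value = 10.26           # USD billions, 2034
years = 2034 - 2025         # nine-year horizon

cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.2%}")   # ≈ 23.9%, consistent with the 23.92% figure cited
```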

Meanwhile, breakthroughs in AI medical diagnostics continue to emerge. Recent advances include AI-powered blood tests achieving 98% accuracy in early breast cancer detection and NHS trials using AI to detect 12 different cancers before symptoms appear. Additionally, AI heart scan technology has helped 24,300 patients avoid invasive procedures while saving the NHS £10 million.

However, current limitations persist even in educational applications. Terry informed Karmactive that “Some work has been done evaluating these metrics in chatbot responses to health questions, and in general the chatbot responses are written at a much higher reading level than is typically recommended for consumer health information which limits the ability of the average patient to absorb and use the information (even if it is accurate).”

The Path Forward: Development and User Responsibility

Terry told Karmactive that improvement efforts are underway, though progress will likely be incremental. “I think that the first generation of these chatbots will be narrowly focused on very specific use-cases and will be trained on validated and highly specific data for those use-cases in order to maximize accuracy. I think that it will be much longer before a larger, more generalized chatbot is able to consistently provide accurate and specific health advice,” he explained.

“Realistically, most of the advances in the improvement process are going to be on the developer side. However, a necessary step of validating models is for early users to provide actionable feedback to the developers regarding response quality. This could be via a simple thumbs up/thumbs down system or often a free-text response. Providing such feedback to language models can help the models to hone their ‘learning,'” Terry told Karmactive.
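To make the feedback loop Terry describes more concrete, here is a minimal sketch of what a thumbs-up/thumbs-down feedback record might look like on the developer side. The `ChatbotFeedback` structure, its field names, and the sample values are hypothetical illustrations, not any vendor's actual schema.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical shape for the kind of user feedback Dr. Terry describes.
# Real chatbot products define their own schemas; this is purely illustrative.
@dataclass
class ChatbotFeedback:
    question: str
    response: str
    thumbs_up: bool        # the simple up/down signal
    free_text: str = ""    # optional free-text explanation from the user

record = ChatbotFeedback(
    question="How should recurrent urinary tract infections be managed?",
    response="(chatbot answer would go here)",
    thumbs_up=False,
    free_text="Appears to contradict the guideline my urologist follows.",
)
print(json.dumps(asdict(record), indent=2))  # what a developer-side pipeline might ingest
```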

Recent studies support the need for caution beyond urology. A 2024 Journal of Medical Internet Research study examining four AI chatbots on emergency care questions found that their information about when to seek urgent care is “frequently incomplete and inaccurate, and patients may be unaware of misinformation.”

Healthcare AI research is advancing rapidly across multiple applications. Studies show that AI analysis of ECGs can identify patients at up to 24 times higher risk of heart failure than standard methods detect, demonstrating AI’s potential when properly applied to specific diagnostic tasks.

Understanding the Current Landscape

The numbers tell a story of rapid adoption despite ongoing concerns. The National Center for Health Statistics found that 58.5% of US adults used the internet to look for health or medical information in 2022, with convenience and breadth of information as primary motivations. A 2023 survey reported that 78.4% of subjects would be willing to use ChatGPT for self-diagnosis.

Healthcare organizations are responding to this demand through various AI implementations, but obstacles persist: patients remain hesitant to trust AI because of concerns over data privacy and the accuracy of medical advice, and integrating chatbots into existing healthcare infrastructure remains difficult.

Expert Guidance for Safe Usage

Dr. Terry emphasized to Karmactive the importance of maintaining skepticism while engaging with AI health tools. “For now, based on the current state of the chatbots, I think it’s reasonable for patients to explore them to ask questions and be more engaged in their own medical care. I think increased patient interest and engagement is always a positive thing. However, they need to maintain a very high level of skepticism about what they read and learn from the chatbots until they are able to have a clarifying subsequent discussion with their actual physician who can provide necessary and appropriate context,” he told Karmactive.

Terry suggested to Karmactive that the solution might involve changing how AI communicates uncertainty. “Maybe a more effective way to convey an appropriate level of skepticism to end-users would be to change the response style of medical chatbots to one that is less aggressively overconfident and more cautious, judicious, and hedging. That way, people will hopefully absorb the reality of the chatbot response limitations better,” he explained.

Industry experts echo these concerns. Julie McGuire, managing director of the BDO Center for Healthcare Excellence & Innovation, told PYMNTS that “Like other AI-powered tools, medical chatbots are more likely to provide highly accurate answers when thoroughly trained on high-quality, diverse data sets and when user prompts are clear and simple. However, when questions are more complicated or unusual, a medical chatbot may provide insufficient or incorrect answers.”


The current state of AI in healthcare presents both opportunities and risks. “It’s always a good thing when patients take ownership of their health care and do research to get information on their own,” Terry said. “And that’s great. But just as when you use Google, don’t accept anything at face value without checking with your health care provider.”

The conversation about AI in healthcare continues to evolve as technology advances and more research emerges. What remains clear is that while these digital tools offer convenience and accessibility, they cannot yet replace the nuanced judgment, contextual understanding, and personalized care that human medical professionals provide. The key lies in understanding their limitations while leveraging their potential benefits responsibly.

Tejal Somvanshi

Meet Tejal Somvanshi, a soulful wanderer and a staunch wellness advocate, who elegantly navigates through the enchanting domains of Fashion and Beauty with a natural panache. Her journey, vividly painted with hues from a vibrant past in the media production world, empowers her to carve out stories that slice through the cacophony, where brands morph into characters and marketing gimmicks evolve into intriguing plot twists. To Tejal, travel is not merely an activity; it unfolds as a chapter brimming with adventures and serendipitous tales, while health is not just a regimen but a steadfast companion in her everyday epic. In the realms of fashion and beauty, she discovers her muse, weaving a narrative where each style narrates a story, and every beauty trend sparks a dialogue. Tejal seamlessly melds the spontaneous spirit of the media industry with the eloquent prose of a storyteller, crafting tales as vibrant and dynamic as the industry she thrives in.
