There are more AI health tools than ever—but how well do they work?

Earlier this month, Microsoft launched Copilot Health, a new space within its Copilot app where users will be able to connect their medical records and ask specific questions about their health. A couple of days earlier, Amazon had announced that Health AI, an LLM-based tool previously restricted to members of its One Medical service, would now be widely available. These products join the ranks of ChatGPT Health, which OpenAI released back in January, and Anthropic’s Claude, which can access user health records if granted permission. Health AI for the masses is officially a trend. 

There’s a clear demand for chatbots that provide health advice, given how hard it is for many people to access it through existing medical systems. And some research suggests that current LLMs are capable of making safe and useful recommendations. But researchers say that these tools should be more rigorously evaluated by independent experts, ideally before they are widely released. 

In a high-stakes area like health, trusting companies to evaluate their own products could prove unwise, especially if those evaluations aren’t made available for external expert review. And even if the companies are doing quality, rigorous research—which some, including OpenAI, do seem to be—they might still have blind spots that the broader research community could help to fill.

“To the extent that you always are going to need more health care, I think we should definitely be chasing every route that works,” says Andrew Bean, a doctoral candidate at the Oxford Internet Institute. “It’s entirely plausible to me that these models have reached a point where they’re actually worth rolling out.”

“But,” he adds, “the evidence base really needs to be there.”

Tipping points 

To hear developers tell it, these health products are now being released because large language models have indeed reached a point where they can effectively provide medical advice. Dominic King, the vice president of health at Microsoft AI and a former surgeon, cites AI advancement as a core reason why the company’s health team was formed, and why Copilot Health now exists. “We’ve seen this enormous progress in the capabilities of generative AI to be able to answer health questions and give good responses,” he says.

But that’s only half the story, according to King. The other key factor is demand. Shortly before Copilot Health was launched, Microsoft published a report, and an accompanying blog post, detailing how people used Copilot for health advice. The company says it receives 50 million health questions each day, and health is the most popular discussion topic on the Copilot mobile app.

Other AI companies have noticed, and responded to, this trend. “Even before our health products, we were seeing just a rapid, rapid increase in the rate of people using ChatGPT for health-related questions,” says Karan Singhal, who leads OpenAI’s Health AI team. (OpenAI and Microsoft have a long-standing partnership, and Copilot is powered by OpenAI’s models.)

It’s possible that people simply prefer posing their health problems to a nonjudgmental bot that’s available to them 24-7. But many experts interpret this pattern in light of the current state of the health-care system. “There is a reason that these tools exist and they have a position in the overall landscape,” says Girish Nadkarni, chief AI officer at the Mount Sinai Health System. “That’s because access to health care is hard, and it’s particularly hard for certain populations.”

The virtuous vision of consumer-facing LLM health chatbots hinges on the possibility that they could improve user health while reducing pressure on the health-care system. That might involve helping users decide whether or not they need medical attention, a task known as triage. If chatbot triage works, then patients who need emergency care might seek it out earlier than they would have otherwise, and patients with milder concerns might feel comfortable managing their symptoms at home with the chatbot’s advice rather than unnecessarily crowding emergency rooms and doctors’ offices.

But a recent, widely discussed study from Nadkarni and other researchers at Mount Sinai found that ChatGPT Health sometimes recommends too much care for mild conditions and fails to identify emergencies. Though Singhal and some other experts have suggested that its methodology might not provide a complete picture of ChatGPT Health’s capabilities, the study has surfaced concerns about how little external evaluation these tools see before being released to the public.

Most of the academic experts interviewed for this piece agreed that LLM health chatbots could have real upsides, given how little access to health care some people have. But all six of them expressed concerns that these tools are being launched without testing from independent researchers to assess whether they are safe. While some advertised uses of these tools, such as recommending exercise plans or suggesting questions that a user might ask a doctor, are relatively harmless, others carry clear risks. Triage is one; another is asking a chatbot to provide a diagnosis or a treatment plan. 

The ChatGPT Health interface includes a prominent disclaimer stating that it is not intended for diagnosis or treatment, and the announcements for Copilot Health and Amazon’s Health AI include similar warnings. But those warnings are easy to ignore. “We all know that people are going to use it for diagnosis and management,” says Adam Rodman, an internal medicine physician and researcher at Beth Israel Deaconess Medical Center and a visiting researcher at Google.

Medical testing

Companies say they are testing the chatbots to ensure that they provide safe responses the vast majority of the time. OpenAI has designed and released HealthBench, a benchmark that scores LLMs on how they respond in realistic health-related conversations—though the conversations themselves are LLM-generated. When GPT-5, which powers both ChatGPT Health and Copilot Health, was released last year, OpenAI reported the model’s HealthBench scores: It did substantially better than previous OpenAI models, though its overall performance was far from perfect. 

But evaluations like HealthBench have limitations. In a study published last month, Bean—the Oxford doctoral candidate—and his colleagues found that even if an LLM can accurately identify a medical condition from a fictional written scenario on its own, a non-expert user who is given the scenario and asked to determine the condition with LLM assistance might figure it out only a third of the time. If they lack medical expertise, users might not know which parts of a scenario—or their real-life experience—are important to include in their prompt, or they might misinterpret the information that an LLM gives them.

Bean says that this performance gap could be significant for OpenAI’s models. In the original HealthBench study, the company reported that its models performed relatively poorly in conversations that required them to seek more information from the user. If that’s the case, then users who don’t have enough medical knowledge to provide a health chatbot with the information that it needs from the get-go might get unhelpful or inaccurate advice.

Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.

Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out almost a year ago and is now outdated. 

Earlier this month, Google released a study that meets Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE’s diagnoses were just as accurate as physicians’, and none of the conversations raised major safety concerns for researchers. 

Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.

Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There’s lots of reasons that the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”

The key there is “third party.” No matter how extensively companies evaluate their own products, it’s tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.

OpenAI’s Singhal says he’s strongly in favor of external evaluation. “We try our best to support the community,” he says. “Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.” 

Given how expensive it is to produce a high-quality evaluation, he says, he’s skeptical that any individual academic laboratory would be able to produce what he calls “the one evaluation to rule them all.” But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites, such as Stanford’s MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI’s GPT-5 holds the highest MedHELM score.

Nigam Shah, a professor of medicine at Stanford University who led the MedHELM project, says it has limitations. In particular, it only evaluates individual chatbot responses, but someone who’s seeking medical advice from a chatbot tool might engage it in a multi-turn, back-and-forth conversation. He says that he and some collaborators are gearing up to build an evaluation that can score those complex conversations, but that it will take time and money. “You and I have zero ability to stop these companies from releasing [health-oriented products], so they’re going to do whatever they damn please,” he says. “The only thing people like us can do is find a way to fund the benchmark.”

No one interviewed for this article argued that health LLMs need to perform perfectly on third-party evaluations in order to be released. Doctors themselves make mistakes—and for someone who has only occasional access to a doctor, a consistently accessible LLM that sometimes messes up could still be a huge improvement over the status quo, as long as its errors aren’t too grave. 

Given the current state of the evidence, however, it’s impossible to know for sure whether the available tools do in fact constitute an improvement, or whether their risks outweigh their benefits.



from MIT Technology Review https://ift.tt/ZX58sKl
