Not what the doctor ordered: ChatGPT health advice left wanting


ChatGPT-generated health advice becomes less reliable when it is infused with evidence from the internet, according to a world-first Australian study that highlights the risk of relying on large language models for answers.

Research led by the CSIRO and the University of Queensland (UQ) found that while the chatbot handles simple, question-only prompts relatively well, it struggles when questions are biased with evidence, regardless of whether that evidence is correct.

Evidence-biased questions were answered with just 28 per cent accuracy in some instances, revealing the heavy influence that “injecting evidence” from the internet has on answers.

The finding also runs contrary to advice that suggests using such prompts to avoid hallucinations, a well-known phenomenon in generative artificial intelligence tools like ChatGPT, according to the new study, circulated on Thursday.

Dr Bevan Koopman, CSIRO principal research scientist and Associate Professor at UQ. Image: CSIRO

Dr Bevan Koopman, the lead author of the study published in December, said that with prompt-based generative large language models (LLMs) growing in popularity for health information, research into the impact of prompts is increasingly important.

“While LLMs have the potential to greatly improve the way people access information, we need more research to understand where they are effective and where they are not,” Dr Koopman, a principal research scientist at CSIRO and Associate Professor at UQ, said.

For the study, Dr Koopman and co-author UQ Professor Guido Zuccon asked 100 consumer health questions of ChatGPT that ranged from ‘Can zinc help treat the common cold?’ to ‘Will drinking vinegar dissolve a stuck fish bone?’.

Both straight questions and “questions biased with supporting or contrary evidence” were used, with the responses compared to the known correct response, or ‘ground truth’, that a medical practitioner would provide.
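To make the two prompt styles concrete, here is a minimal sketch in Python of what a question-only prompt and an evidence-biased prompt might look like. The study’s actual templates and evidence passages are not reproduced in this article, so the wording below is purely illustrative.

```python
# Illustrative only: the exact prompt wording used in the study is not given in
# this article, so these templates are assumptions.

def question_only_prompt(question: str) -> str:
    """A plain yes/no consumer health question, with no evidence attached."""
    return f"Answer with Yes or No.\nQuestion: {question}\nAnswer:"

def evidence_biased_prompt(question: str, evidence: str) -> str:
    """The same question, but with a passage of web 'evidence' injected into the prompt."""
    return (
        "Answer with Yes or No, taking the evidence below into account.\n"
        f"Evidence: {evidence}\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(question_only_prompt("Can zinc help treat the common cold?"))
print(evidence_biased_prompt(
    "Will drinking vinegar dissolve a stuck fish bone?",
    "A forum post claims vinegar softens fish bones within minutes.",  # supporting (possibly wrong) evidence
))
```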

ChatGPT was found to consistently provide accurate responses to simpler, question-only prompts without any supporting or contrary evidence, answering 80 per cent of the questions accurately.

But this accuracy dropped to 63 per cent when the LLM was given an evidence-biased prompt, showing that “evidence is capable of overturning the model answers about treatments” with “detrimental effect on answer correctness”.

If the prompt allowed for an “unsure” answer, not just yes or no, then the accuracy fell even further, to 28 per cent. The researchers said this was “at odds with the idea of relying on retrieve-then-generate pipelines to reduce hallucinations in models like ChatGPT”.

“Prompting with evidence did not help flip incorrect answers to correct and actually caused correct answers to flip incorrect. This may serve as a warning for relying on retrieve-then-generate pipelines to reduce hallucinations…” the study said.

Retrieve-then-generate – or read-then-generate – is where information related to the question is identified on the internet and “passed to the model via the prompt to inform the model’s output”.
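The sketch below shows the general shape of such a pipeline. The search_web() and generate() helpers are hypothetical placeholders, not drawn from the study; a real system would query an actual search engine and an LLM API at those points.

```python
# A minimal retrieve-then-generate sketch. search_web() and generate() are
# placeholders, not real APIs: a production pipeline would call a search
# engine and an LLM service here.

def search_web(question: str) -> list[str]:
    # Placeholder: return passages found on the internet that relate to the question.
    return ["Zinc lozenges may shorten the duration of the common cold."]

def generate(prompt: str) -> str:
    # Placeholder for a call to a large language model such as ChatGPT.
    return "Yes"

def retrieve_then_generate(question: str) -> str:
    passages = search_web(question)            # retrieve: find related evidence online
    context = "\n".join(passages)
    prompt = (                                 # inject: pass the evidence to the model via the prompt
        f"Evidence:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer Yes, No or Unsure:"
    )
    return generate(prompt)                    # generate: the answer is now conditioned on the retrieved evidence

print(retrieve_then_generate("Can zinc help treat the common cold?"))
```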

Dr Koopman said that while it remains unclear why evidence-biasing can flip an answer regardless of whether the evidence is correct or not, it is possible that the “evidence adds too much noise, thus lowering accuracy”.

With search engines already integrating LLMs and search technologies in a process called Retrieval Augmented Generation, Professor Zuccon said the results of the study highlight the constraints of LLMs and how they might contribute toward inaccurate health information.

“We demonstrate that the interaction between the LLM and the search component is still poorly understood and controllable, resulting in the generation of inaccurate health information,” Professor Zuccon, the director of AI for the Queensland Digital Health Centre, said.

