Researchers, academics, and professionals are increasingly turning to large language models (LLMs) like ChatGPT to identify citations and perform literature reviews, tasks many consider tedious and mentally exhausting.
But this practice has led to career-ending errors in courtrooms, security vulnerabilities in AI-generated website code, government documents riddled with falsehoods, and misdiagnoses in medical settings. That is because LLMs fabricate information that looks plausible (a behavior known as hallucinating) and present it with supreme confidence, all while sycophantically praising the user.
Even if you ask ChatGPT about a specific, real article, can it tell whether that article has been retracted because of errors, fabricated data, or other research misconduct? No, it cannot, according to a new study.
The researchers found that ChatGPT failed to flag a single retraction, even among the most well-known retracted studies. Instead, 73% of the time it described these studies as “internationally excellent” or “world leading” in quality. And in the minority of cases where ChatGPT did rate a study as low quality, that rating had nothing to do with the retraction notice or the reason the study was retracted.
“Users should be cautious of literature summaries made by large language models in case they contain false information,” the researchers write.
The study was conducted by Mike Thelwall, Irini Katsirea, and Er-Te Zheng at the University of Sheffield, UK, and Marianna Lehtisaari and Kim Holmberg at the University of Turku, Finland. It was published in Learned Publishing, a journal focused on issues in scholarly publication and science communication.