A new review of 137 studies of AI chatbots’ health advice flags flawed methodology, ethics, and reporting protocols.
A new systematic review in JAMA Network Open has revealed major gaps in studies evaluating AI chatbots’ ability to provide health advice, with inconsistent reporting hindering reliable assessments.
Researchers analyzed 137 peer-reviewed articles published up to October 2023, finding most relied on proprietary models such as ChatGPT without detailing key technical parameters. This opacity limits reproducibility and clinical trust in these tools.
Large language models power chatbots by predicting text from vast training datasets, enabling responses to queries about treatment, diagnosis, or prevention. Yet 99.3% of the reviewed studies examined closed-source systems, only 0.7% specified the model version used, and many omitted details such as temperature settings (which control output randomness) or token limits.
Only 11.7% justified their choice of model, complicating performance comparisons.
Query strategies also lacked rigor: over 27% of studies omitted prompt sources, and 99.3% skipped a prompt-engineering phase to optimize inputs. Fewer than 40% noted query dates, which matter because models update frequently and results can shift. Verbatim transcripts appeared in 47.4% of studies for responses and 67.9% for prompts, but standardized evaluation tools were rare (13.1%).
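For illustration, the reporting items the review found missing (model version, decoding parameters such as temperature and token limits, prompt provenance, and query dates) could be captured in a simple structured record. This is a minimal sketch, not the CHART instrument itself; the class and field names are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class ChatbotQueryRecord:
    """Hypothetical per-query record covering the items the review
    found under-reported in chatbot health-advice studies."""
    model_name: str        # e.g. a proprietary chat model
    model_version: str     # reported in only 0.7% of studies
    temperature: float     # controls output randomness; rarely reported
    max_tokens: int        # token limit on the response; rarely reported
    prompt: str            # verbatim prompt (given in 67.9% of studies)
    prompt_source: str     # provenance, omitted by over 27% of studies
    query_date: date       # noted by fewer than 40%, though models update
    response_transcript: str = ""  # verbatim response (47.4% of studies)

    def to_report_row(self) -> dict:
        """Flatten the record for a supplementary reporting table."""
        row = asdict(self)
        row["query_date"] = self.query_date.isoformat()
        return row

# Usage: log one query exactly as it was issued.
record = ChatbotQueryRecord(
    model_name="example-chat-model",  # hypothetical name
    model_version="2023-10-01",
    temperature=0.7,
    max_tokens=512,
    prompt="What are the first-line treatments for hypertension?",
    prompt_source="authors, guideline-derived",
    query_date=date(2023, 10, 15),
)
print(record.to_report_row()["query_date"])  # → 2023-10-15
```

Publishing such a row for every query would let later researchers re-run the same prompts against the same (or updated) model versions and compare results directly.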
Performance metrics fared worst. About 65% of studies used subjective expert opinion as “ground truth” rather than clinical guidelines (15.3%), risking bias. Blinding was used in only 11.7%, and structured rubrics in under 29%. Surgical topics dominated (40.1%), followed by medicine (37.2%), with treatment advice the most commonly tested (66.4%).
Ethical oversights compounded these issues. Fewer than 33% of studies addressed patient safety or ethics, and only 16.1% addressed regulatory gaps, despite risks such as hallucinations, biases inherited from training data, and privacy breaches. Without tailored oversight, chatbots can propagate misinformation or expose patient data outside HIPAA-like protections.
The review’s authors urge standardized reporting tools (such as the proposed Chatbot Assessment Reporting Tool, CHART) for transparency about model characteristics, prompts, and objective benchmarks. Multidisciplinary teams of clinicians and AI experts must prioritize high-quality data, bias mitigation, and regulation. Until then, deploying these tools in medicine risks doing patients more harm than good.
Prospective, patient-centered trials using open-source models could validate real-world utility, but the current heterogeneity demands caution. Regulators should mandate audits, data protections, and explainability to bridge the gap between hype and safe integration, the authors conclude.