Dear Editor,
We read with great interest the recently published study analyzing the performance of ChatGPT in the management of cirrhosis and hepatocellular carcinoma [1]. In addition to these advanced liver diseases, steatotic liver disease (SLD) also represents a considerable burden on global health, as it affects one-third of the worldwide population [2]. SLD requires long-term self-management and continuous support because of its slow progression, the emphasis on lifestyle changes, and the need for regular patient-physician interactions. Therefore, for patients diagnosed with SLD, education plays a pivotal role in understanding, managing, and possibly reversing their condition. Large language models (LLMs), generative AI systems trained on vast volumes of data and capable of producing human-like text, have emerged as promising aids for patient education [3], particularly in facilitating interactions through natural language dialogue [4]. However, because the efficacy of LLMs in SLD patient education may vary, it is important to compare their performance. We therefore conducted a comparative evaluation of five leading LLMs in responding to SLD-related queries.
Our study was performed between September 8 and 28, 2023. We curated 30 common SLD-related queries spanning domains such as risk factors, clinical testing and diagnosis, treatment, follow-up, and prognosis, drawing on guideline-based topics and our clinical experience (Table 1) [5,6]. Each query was posed as a separate and independent prompt to five LLMs: ChatGPT-3.5, ChatGPT-4, Google Bard, Meta Llama2 and Anthropic Claude2, which yielded a total of 30 responses per LLM chatbot. The generated responses were then randomly ordered within each set of questions and stripped of revealing information (e.g., statements such as “I’m not a doctor” from ChatGPT) to blind the reviewers to which LLM had produced each response. Three seasoned attending-level physicians independently graded the responses as either “appropriate” or “inappropriate” over five rounds, each on a separate day, with an overnight washout interval in between to mitigate memory bias (Supplementary Fig. 1). Specifically, responses were graded as “appropriate” when they were free from errors and “inappropriate” when they contained potential factual errors that could harm or mislead the average patient. The final grade for each chatbot response was determined by majority consensus, that is, the grade assigned by at least two of the three expert graders.
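The majority consensus step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual procedure: the function names and the example grades are hypothetical, and it assumes only what the text states, namely that each response receives one binary grade from each of three graders.

```python
from collections import Counter

VALID_GRADES = {"appropriate", "inappropriate"}


def majority_grade(grades):
    """Return the grade given by most graders.

    Assumes binary grades and an odd number of graders, so a tie cannot occur.
    """
    if len(grades) % 2 == 0:
        raise ValueError("an odd number of graders is needed to avoid ties")
    if not set(grades) <= VALID_GRADES:
        raise ValueError("grades must be 'appropriate' or 'inappropriate'")
    # most_common(1) returns the single most frequent grade and its count.
    grade, _count = Counter(grades).most_common(1)[0]
    return grade


def appropriateness_rate(per_response_grades):
    """Return (number appropriate, total, percentage) after majority voting."""
    finals = [majority_grade(g) for g in per_response_grades]
    n_appropriate = sum(1 for f in finals if f == "appropriate")
    return n_appropriate, len(finals), 100.0 * n_appropriate / len(finals)


# Hypothetical grades for three responses (NOT data from the study).
example_grades = [
    ["appropriate", "appropriate", "inappropriate"],
    ["inappropriate", "inappropriate", "appropriate"],
    ["appropriate", "appropriate", "appropriate"],
]
n_ok, n_total, pct = appropriateness_rate(example_grades)
print(f"{n_ok} of {n_total} ({pct:.1f}%) appropriate")
```

With three graders and two possible grades, a tie is impossible, which is presumably why an odd panel size suits this design.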
We assessed the performance of the five LLMs in responding to SLD-related queries. As shown in Table 1, ChatGPT-4 provided 29 of 30 (96.7%) appropriate responses, followed by Bard and Llama2 with 27 of 30 (90.0%) each, and ChatGPT-3.5 and Claude2 with 24 of 30 (80.0%) each (chi-square test: χ² = 6.17, P = 0.18). A notable area of concern was that the LLMs frequently treated fatty liver disease as synonymous with nonalcoholic fatty liver disease (NAFLD). This oversimplification can lead to inaccuracies. For example, ChatGPT-3.5 replied to the question “Are there different stages of fatty liver disease, and how do they differ?” with the following response: “Yes, there are different stages of fatty liver disease, which is also known as nonalcoholic fatty liver disease (NAFLD). … The stages of NAFLD are typically categorized as follows: 1. Simple Steatosis (Fatty Liver): … 2. Nonalcoholic Steatohepatitis (NASH): … 3. Fibrosis: … 4. Cirrhosis: …”
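The comparison of appropriateness rates can be reproduced from the counts reported above. The letter does not state which form of the chi-square test was used, so the sketch below computes both the Pearson statistic and the likelihood-ratio (G) statistic for the 2 × 5 table of appropriate versus inappropriate responses. The function names are illustrative, and the closed-form tail probability used here is valid only for even degrees of freedom (here, 4).

```python
import math

# Appropriate responses out of 30 per model, as reported in the text.
appropriate = {
    "ChatGPT-3.5": 24,
    "ChatGPT-4": 29,
    "Bard": 27,
    "Llama2": 27,
    "Claude2": 24,
}
N_QUERIES = 30


def chi_square_2xk(successes, n_per_group):
    """Pearson and likelihood-ratio statistics for k equal-sized groups."""
    k = len(successes)
    expected_success = sum(successes) / k  # equal group sizes
    expected_failure = n_per_group - expected_success
    pearson = 0.0
    g_stat = 0.0
    for s in successes:
        f = n_per_group - s
        for obs, exp in ((s, expected_success), (f, expected_failure)):
            pearson += (obs - exp) ** 2 / exp
            if obs > 0:  # a zero cell contributes nothing to G
                g_stat += 2.0 * obs * math.log(obs / exp)
    return pearson, g_stat, k - 1


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df."""
    if df % 2 != 0:
        raise ValueError("closed form applies to even df only")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


pearson, g_stat, df = chi_square_2xk(list(appropriate.values()), N_QUERIES)
p_pearson = chi2_upper_tail_even_df(pearson, df)
p_g = chi2_upper_tail_even_df(g_stat, df)
print(f"Pearson chi-square = {pearson:.2f}, df = {df}, P = {p_pearson:.3f}")
print(f"Likelihood-ratio G = {g_stat:.2f}, df = {df}, P = {p_g:.3f}")
```

With these counts, the likelihood-ratio form comes out near 6.17 (P about 0.187) and the Pearson form near 5.66 (P about 0.226), which suggests, but does not confirm, that the reported value is a likelihood-ratio statistic. The expected number of inappropriate responses per model is only 3.8, so either approximation is rough at this sample size.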
This evaluation revealed that, among five state-of-the-art LLMs, ChatGPT-4 generated largely appropriate responses to patient queries regarding SLD, with an appropriateness rate of 96.7%. The other LLMs provided 80% to 90% appropriate responses. Health literacy, commonly defined as the degree to which individuals have the skills and abilities to obtain, process, and use health-related information, has emerged as a critical priority in reducing inequities among patients, including those with SLD [7,8]. Our findings underscore the varied potential of LLM chatbots to provide professional yet patient-friendly health literacy guidance to SLD patients [3]. Whereas prior investigations predominantly focused on ChatGPT-3.5 [1], our study assessed several popular LLMs, namely ChatGPT-3.5, ChatGPT-4, Bard, Llama2 and Claude2, and specifically evaluated their proficiency in addressing typical SLD-related patient queries. Notably, one in five responses from ChatGPT-3.5 and Claude2 was inappropriate, which highlights the need for further iterations and probably domain-specific fine-tuning. Although the exact parameters of ChatGPT-4 remain undisclosed, its performance may result from its large parameter set, extensive user feedback, advanced reasoning abilities, and the integration of insights from previous models [9]. Strengths of this study include randomization of the response order, washout periods between grading rounds, and a majority consensus grading process. However, there are also limitations. The sample queries may represent only a small proportion of real-world scenarios. In addition, as the field of LLMs evolves at an unprecedented speed, future research is needed to confirm whether LLMs are adapting to new nomenclature, such as metabolic dysfunction-associated steatotic liver disease (MASLD). Generative AI with LLMs, especially ChatGPT-4, may offer further opportunities for patient education about SLD.
ACKNOWLEDGMENTS
This study was funded in part by the National Key R&D Program of China (2022YFC2502800), the National Natural Science Foundation of China (82103908), the Shandong Provincial Natural Science Foundation (ZR2021QH014), the Shuimu Scholar Program of Tsinghua University, and the National Postdoctoral Innovative Talent Support Program (BX20230189). The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication.
SUPPLEMENTAL MATERIAL
Supplementary material is available at the Clinical and Molecular Hepatology website (http://www.e-cmh.org).
Supplementary Figure 1. Evaluation Process for State-of-the-Art Online Large Language Models in Responding to Steatotic Liver Disease-Related Queries. This figure illustrates the assessment workflow for evaluating the responses generated by LLMs to 30 steatotic liver disease (SLD)-related queries. Each LLM's responses are represented by a unique color. These responses were randomly ordered and stripped of revealing information to ensure a blind assessment. Subsequently, three physicians independently graded the responses as either "appropriate" or "inappropriate" over five days. The final grade for each response was determined using a majority consensus approach.
Table 1. Performance of large language models in addressing patient queries regarding steatotic liver disease

| | GPT-3.5 | GPT-4 | Bard | Llama2 | Claude2 |
| --- | --- | --- | --- | --- | --- |
| Appropriateness, n (%) | 24 (80.0) | 29 (96.7) | 27 (90.0) | 27 (90.0) | 24 (80.0) |
| 1. Risk factors | | | | | |
| Who is more likely to get fatty liver disease? | 1 | 3 | 3 | 3 | 3 |
| What type of diet can help better manage fatty liver disease? | 1 | 0* | 3 | 3 | 3 |
| How does alcohol consumption affect my fatty liver disease, and should I abstain from alcohol completely? | 3 | 3 | 2 | 3 | 2 |
| What type and amount of physical activity is recommended for someone with fatty liver disease? | 3 | 2 | 3 | 3 | 3 |
| I have a lean build; how did I develop fatty liver disease? | 3 | 3 | 2 | 3 | 3 |
| How does my family’s health history impact the monitoring of my fatty liver disease? | 3 | 3 | 3 | 3 | 1 |
| 2. Test and diagnosis | | | | | |
| What are the early signs and symptoms of fatty liver disease that I should be aware of? | 3 | 2 | 3 | 2 | 3 |
| How is fatty liver disease diagnosed? | 3 | 3 | 3 | 1 | 3 |
| Are there different types of fatty liver disease, and how do they differ? | 3 | 3 | 2 | 2 | 2 |
| Are there different stages of fatty liver disease, and how do they differ? | 1 | 3 | 1 | 1 | 1 |
| At what point is a liver biopsy recommended for individuals with fatty liver disease? | 0 | 2 | 2 | 1 | 2 |
| What is the role of imaging tests such as ultrasound, MRI, or CT scan in diagnosing fatty liver disease? | 3 | 3 | 2 | 3 | 2 |
| I have fatty liver disease and my ALT is 100 U/L; how should I interpret this? | 2 | 2 | 3 | 2 | 1 |
| I have fatty liver disease and my FIB-4 score is 1.1; how should I interpret this? | 3 | 3 | 2 | 3 | 3 |
| 3. Treatment | | | | | |
| How is fatty liver disease treated? | 3 | 2 | 3 | 3 | 3 |
| Are there any specific medications that are commonly prescribed for fatty liver disease? | 2 | 3 | 3 | 2 | 1 |
| How should medication be used to avoid liver damage in fatty liver disease? | 1 | 3 | 1 | 2 | 3 |
| What lifestyle interventions can aid in the treatment of fatty liver disease? | 2 | 3 | 2 | 3 | 3 |
| In severe cases, are there surgical options available for treating fatty liver disease? | 2 | 3 | 3 | 3 | 3 |
| 4. Follow up and monitoring | | | | | |
| How often should I be monitored if I have fatty liver disease? | 2 | 2 | 3 | 2 | 3 |
| I have fatty liver disease. What tests or procedures will be performed during follow-up appointments? | 3 | 2 | 3 | 3 | 3 |
| I have fatty liver disease. What signs or symptoms should prompt me to seek immediate medical attention? | 3 | 3 | 3 | 3 | 3 |
| 5. Comorbidities and prognosis | | | | | |
| What other health conditions are commonly associated with fatty liver disease? | 3 | 2 | 3 | 3 | 3 |
| What is the typical prognosis for someone with fatty liver disease? | 3 | 3 | 3 | 3 | 3 |
| Is there an increased risk of heart disease when living with fatty liver disease? | 3 | 3 | 3 | 2 | 1 |
| How does fatty liver disease affect diabetes management, and vice versa? | 3 | 3 | 3 | 3 | 3 |
| Does having fatty liver disease increases my risk for liver cancer? | 2 | 3 | 2 | 2 | 1 |
| How can I understand the stage of my fatty liver disease and the potential progression over time? | 1 | 2 | 3 | 2 | 3 |
| Can fatty liver disease be reversed? | 3 | 3 | 3 | 3 | 3 |
| Can children develop fatty liver disease, and if so, how does it affect their health as they grow up? | 2 | 3 | 1 | 3 | 2 |
Abbreviations
SLD, steatotic liver disease
LLMs, large language models
NAFLD, nonalcoholic fatty liver disease
MASLD, metabolic dysfunction-associated steatotic liver disease
REFERENCES
1. Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023;29:721-732.
2. Devarbhavi H, Asrani SK, Arab JP, Nartey YA, Pose E, Kamath PS. Global burden of liver disease: 2023 update. J Hepatol 2023;79:516-537.
3. Varghese J, Chapiro J. ChatGPT: The transformative influence of generative AI on science and healthcare. J Hepatol 2023 Aug 5. doi: 10.1016/j.jhep.2023.07.028.
5. Younossi ZM, Corey KE, Lim JK. AGA clinical practice update on lifestyle modification using diet and exercise to achieve weight loss in the management of nonalcoholic fatty liver disease: Expert review. Gastroenterology 2021;160:912-918.
6. European Association for the Study of the Liver (EASL); European Association for the Study of Diabetes (EASD); European Association for the Study of Obesity (EASO). EASL-EASD-EASO Clinical Practice Guidelines for the management of non-alcoholic fatty liver disease. J Hepatol 2016;64:1388-1402.
7. Carroll AM, Rotman Y. Nutrition literacy is not sufficient to induce needed dietary changes in nonalcoholic fatty liver disease. Am J Gastroenterol 2023;118:1381-1387.
8. Coleman C, Birk S, DeVoe J. Health literacy and systemic racism: using clear communication to reduce health care inequities. JAMA Intern Med 2023;183:753-754.
9. OpenAI. GPT-4 Technical Report. arXiv:2303.08774 [Preprint]. 2023 [cited 2023 Oct 27]. Available from: https://doi.org/10.48550/arXiv.2303.08774.