In a recent study published in JAMA Network Open, researchers compared the probabilistic reasoning of the Generative Pre-trained Transformer 4 (GPT-4) artificial intelligence (AI) model with that of human clinicians by assessing their pretest and posttest probability estimates in a series of diagnostic cases.
Background
Diagnosing disease requires estimating the probability of different illnesses from a patient's presenting symptoms (the pretest probability) and then revising those estimates as diagnostic test results become available (the posttest probability).
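In practice, this updating step follows Bayes' theorem, usually applied on the odds scale with a test's likelihood ratio. As a minimal illustrative sketch (the function and the numbers below are placeholders, not values from the study), the calculation looks like this:

# Illustrative only: Bayes' theorem on the odds scale, converting a pretest
# probability into a posttest probability via a test's likelihood ratio.
def posttest_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    pretest_odds = pretest_prob / (1 - pretest_prob)   # probability -> odds
    posttest_odds = pretest_odds * likelihood_ratio    # apply the likelihood ratio
    return posttest_odds / (1 + posttest_odds)         # odds -> probability

# Example: a 20% pretest probability and a positive test with LR+ = 10
print(round(posttest_probability(0.20, 10), 2))  # ~0.71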
Nevertheless, clinicians often struggle to estimate pretest and posttest probabilities accurately, whether in statistical vignettes or in actual patient cases. Large language models (LLMs) have shown promise in clinical reasoning, including tackling intricate diagnostic problems, passing medical examinations, and conducting empathetic patient interactions.
Further research is needed to explore the full potential and limitations of AI in complex, real-world diagnostic scenarios, as current studies show varying levels of AI performance in probabilistic reasoning compared to human clinicians.
About the study
The present study analyzed the probabilistic reasoning of 553 practitioners using data from a national survey conducted between June 2018 and November 2019. These practitioners were evaluated across five diagnostic cases, each paired with a scientific reference standard.
To assess the capabilities of AI in this domain, the researchers recreated each survey case as a prompt for the model, incorporating specific instructions designed to elicit from the AI a committed estimate of the pretest and posttest probabilities.
Given the stochastic nature of LLMs, the team took steps to ensure the reliability of their findings: they ran each identical prompt through the LLM's application programming interface (API) 100 times. This was done at the model's default temperature setting, which balances creativity and consistency in the responses. The process, conducted in October 2023, produced a distribution of the AI's output responses for each case.
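As a rough sketch of this repeated-sampling step (not the authors' actual code, and with a placeholder prompt and model identifier), the loop might look like the following in Python using the OpenAI client library:

# Assumes the OpenAI Python client (openai >= 1.0) and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()
prompt = "Placeholder case vignette: reply with a single pretest probability (0-100%)."

estimates = []
for _ in range(100):  # 100 runs of the identical prompt
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        # temperature intentionally left at its default, as in the study
    )
    estimates.append(response.choices[0].message.content)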
To quantify the AI's performance, the researchers calculated the median and interquartile ranges (IQRs) of the LLM's estimates. Additionally, they determined the mean absolute error (MAE) and mean absolute percentage error (MAPE) for both the AI and the human participants. The team conducted their analysis and created plots using R, version 4.3.0. The University of Maryland's institutional review board deemed this study exempt, as it did not involve human participants, and the study adhered to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline throughout its conduct.
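The authors performed this analysis in R; purely for illustration, a minimal Python sketch of these summary statistics on placeholder numbers (not study data) could read:

# Illustrative only: median, IQR, MAE, and MAPE on made-up placeholder values.
import numpy as np

estimates = np.array([26, 25, 30, 20, 28], dtype=float)  # stand-in for the 100 sampled estimates (%)
reference = 5.0                                           # stand-in reference probability (%)

median = np.median(estimates)
q1, q3 = np.percentile(estimates, [25, 75])                      # interquartile range (IQR)
mae = np.mean(np.abs(estimates - reference))                     # mean absolute error
mape = np.mean(np.abs(estimates - reference) / reference) * 100  # mean absolute percentage error
print(median, (q1, q3), mae, mape)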
Study results
In this comparison between human clinicians and an LLM, intriguing findings emerged regarding the estimation of pretest and posttest probabilities. Across the five cases analyzed, the LLM consistently demonstrated lower error rates than human practitioners when estimating probabilities after a negative test result.
A notable example of this was seen in the case involving asymptomatic bacteriuria. Here, the LLM's median pretest probability was estimated at 26% (with an IQR of 20%-30%), while the human clinicians' median estimate was slightly lower at 20% but with a much broader interquartile range of 10%-50%. Despite the LLM's median estimate being further from the correct answer than that of the humans, the LLM exhibited a lower MAE and MAPE, at 26.2 and 5240%, respectively.
In contrast, the figures for human clinicians were higher, at 32.2 for MAE and 6450% for MAPE. This difference could be attributed to the LLM's narrower distribution of responses, which provided a more consistent range of estimates than the wider variability seen in human responses.
Additionally, the LLM's estimation of posttest probability following a positive test result was notable yet inconsistent. For instance, in the breast cancer case and in a hypothetical testing scenario, the LLM surpassed human clinicians in accuracy, suggesting that it may have understood or handled these specific conditions better.
The AI's performance was comparable to that of the human clinicians in two other cases, suggesting reasoning on par with trained medical personnel. Nonetheless, in one case the LLM's accuracy was lower than that of the humans, pointing to areas where its diagnostic capabilities could still be improved.
These findings underscore the potential of AI, specifically LLMs, in the realm of medical diagnostics. The LLM's ability to often match or exceed human performance in estimating diagnostic probabilities showcases the advances in AI technology and its applicability in healthcare. However, the varied performance across different cases also indicates the need for continued refinement and understanding of AI's role and limitations in complex medical decision-making.