
Edition 5 - AI Performance in Medicine: Claude3 vs. GPT-4 & Bias in Diagnostics

Explore the comparative study of Claude3 and GPT-4 in medical knowledge tests, a critical look at AI bias in diagnostics, and how AI can help explain complex echocardiogram results to patients.

  1. How do Claude3 and GPT-4 perform in a medical knowledge test?

  2. Breaking the Bias: A fair assessment of AI.

  3. ChatGPT explains echo results to heart patients.

  • Featured follow of the week

  • Top posts of the week across social

  • Meet the editor

  • Want a featured article?

Specialty: All // Sub-Specialty: AI // Body Site: All


1. How do Claude3 and GPT-4 perform in a medical knowledge test?


A recent study assessed the medical capabilities of readily available large language models (LLMs). It compared the medical accuracy of OpenAI’s GPT-4 and Anthropic’s Claude3-Opus against each other, and against human medical experts, using questions based on objective medical knowledge. Claude3 edged out GPT-4 on accuracy, but both performed poorly compared with the human experts: each LLM answered roughly a third of the questions incorrectly, and GPT-4 got almost half of the questions requiring numerical answers wrong.
Read Full Article
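
To make the scoring concrete, here is a minimal Python sketch of the kind of accuracy comparison the study describes, with a separate tally for numeric-answer questions. All questions, gold answers, model outputs, and helper names below are hypothetical, not taken from the study.

```python
# Minimal sketch of scoring two models against a gold answer key.
# All data below is hypothetical, purely to illustrate the tallies.

def is_numeric(answer: str) -> bool:
    """Return True if the gold answer is a number."""
    try:
        float(answer)
        return True
    except ValueError:
        return False

def accuracy(preds: list[str], gold: list[str], numeric_only: bool = False) -> float:
    """Exact-match accuracy, optionally restricted to numeric-answer questions."""
    pairs = [(p, g) for p, g in zip(preds, gold)
             if not numeric_only or is_numeric(g)]
    if not pairs:
        return 0.0
    return sum(p.strip().lower() == g.strip().lower() for p, g in pairs) / len(pairs)

gold           = ["120", "metformin", "4", "left ventricle"]
gpt4_answers   = ["110", "metformin", "5", "left ventricle"]  # hypothetical outputs
claude_answers = ["120", "metformin", "5", "left ventricle"]  # hypothetical outputs

for name, preds in [("GPT-4", gpt4_answers), ("Claude3-Opus", claude_answers)]:
    print(f"{name}: overall {accuracy(preds, gold):.0%}, "
          f"numeric-only {accuracy(preds, gold, numeric_only=True):.0%}")
```

A real evaluation of this kind would also need careful answer normalisation (units, synonyms, phrasing), which the exact-match check above glosses over.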

Paul’s Thoughts:

While it is interesting to note that Claude3 was superior to GPT-4, the research shows that general-use LLMs still don’t measure up to medical professionals in interpreting and analysing the types of medical questions a physician encounters daily. For generative AI to realise its potential, the models must incorporate verified, domain-specific sources in their training data.

Timescale: Acute | 1 Year

Specialty: All // Sub-Specialty: AI // Body Site: Chest


2. Breaking the Bias: A fair assessment of AI


A discussion in the prestigious journal Radiology examines the bias introduced when assessing AI for diagnostic reporting. The letter, from Dr Bennett and colleagues in Aberdeen, provides commentary on a study by Dr Plesner et al in which radiologists outperformed four commercially available artificial intelligence tools at accurately diagnosing common pulmonary pathologies on chest radiographs. Dr Bennett argues that the study was unfair to the AI systems, which had to infer diagnoses from single images without clinical context. Additionally, the AI tools were not trained on local data, so they would not necessarily be robust to radiographic variations arising from technical heterogeneity in equipment and exposure settings.
Read Full Article

Paul’s Thoughts:

It is good to see members of the clinical community engaging in discussions that critically appraise the quality of AI assessment studies. Such studies are becoming increasingly common as new AI tools become available, and it is important that they are conducted correctly. This correspondence lays out some important factors to consider, which future studies will hopefully incorporate.

Timescale: Early | 2 Years

Specialty: Surgery // Sub-Specialty: AI // Body Site: Cardiac


3. ChatGPT explains echo results to heart patients


A study led by Lior Jankelson, MD, PhD, from New York University, used ChatGPT-4 to explain echocardiogram reports from 100 patients. Five echocardiographers evaluated the generated explanations on five-point Likert scales, rating acceptance, accuracy, relevance, understandability, and representation of quantitative information; additional questions assessed the importance of any inaccurate or missing information. The echocardiographers agreed or strongly agreed that 73% of the GPT-generated explanations were suitable to send to patients without modification. They rated 84% of the explanations as “all true” and the remaining 16% as “mostly correct.” Most importantly, none of the generated explanations contained incorrect or missing information that was potentially dangerous, and median Likert scores were four or five for accuracy, relevance, and understandability.
Read Full Article
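
As a rough illustration of the workflow, here is a minimal Python sketch that drafts a patient-friendly explanation of an echo report via the OpenAI chat completions API. The report excerpt, model name, and prompt wording are illustrative assumptions, not the study’s actual protocol.

```python
# Minimal sketch: draft a patient-friendly explanation of an echo report.
# The report excerpt and prompt wording are illustrative, not from the study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

echo_report = (
    "Left ventricular ejection fraction 40%. "
    "Moderate mitral regurgitation. Mild left atrial dilation."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study used ChatGPT-4; substitute your available model
    messages=[
        {
            "role": "system",
            "content": (
                "You explain echocardiogram reports to patients in plain, "
                "reassuring language. Do not add findings that are not in "
                "the report, and preserve all quantitative values."
            ),
        },
        {"role": "user", "content": f"Please explain this report:\n{echo_report}"},
    ],
)

print(response.choices[0].message.content)
```

As in the study, a clinician would still review each draft before anything is sent to a patient.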

Paul’s Thoughts:

Imaging reports, such as those from echocardiography exams, can be technical and difficult for patients to understand. With this work, the researchers suggest that AI tools could help clinicians explain imaging results to patients immediately, which could reduce both patient anxiety and physician workload.

Timescale: Early | 3 Years 

A round-up of some of the best posts we found online this week.

Was this email forwarded to you?
Our weekly email brings you the latest health trends and insights, combining top news and opinions into a straightforward, digestible format.

Want an article featured?

Have an insightful link or story about the future of medical health? Reach out below, and we may include it in a future release.
