Volume IV, Number 1 | Spring 2025

Can ChatGPT-4 Think Like an Orthopaedic Surgeon? Testing Clinical Judgement and Diagnostic Ability in Pathologies of the Foot and Ankle

Hartman H¹, Essis M², Tung W³, Peden S², Oh I², Gianakos A²
¹Lincoln Memorial University, Knoxville, United States; ²Yale Medicine, Orthopaedics and Rehabilitation, New Haven, United States

Introduction/Purpose
Artificial intelligence chatbots have risen to prominence in recent years, most notably with the release of ChatGPT, a chatbot built on a large language model capable of carrying on human-like conversation. Although ChatGPT has no medicine-specific training, prior studies have shown that its newest version, GPT-4, can pass professional licensing examinations and perform comparably to surgical residents on question bank sets. The purpose of this study was to explore the diagnostic and decision-making capacities of ChatGPT-4 in clinical management, specifically assessing accuracy in the identification and treatment of foot and ankle pathologies.

Methods
This study presented 16 foot and ankle cases to ChatGPT-4. Each case response was evaluated by 3 fellowship-trained foot and ankle orthopaedic surgeons. The scoring system comprised 5 criteria, each rated on a 5-point Likert scale, giving a minimum possible sum score of 5 and a maximum of 25. The criteria were stating the correct diagnosis, recommending the most appropriate procedure, identifying alternative treatments, providing comprehensive information beyond treatment, and not mentioning nonexistent therapies. ChatGPT-4 was addressed as “Dr. GPT,” a role prompt intended to encourage step-by-step reasoning and establish a peer dynamic in which the chatbot emulated an orthopaedic surgeon.
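The abstract does not specify the exact prompt wording or interface used. As a rough illustration only, the following Python sketch shows how such a “Dr. GPT” role prompt could be issued through the OpenAI chat API; the model identifier, system prompt text, and the case_text variable are assumptions for illustration, not the study’s actual protocol.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical case vignette; the study's actual case text is not given in the abstract.
    case_text = (
        "A 45-year-old runner presents with heel pain that is worst with the "
        "first steps in the morning. What is the diagnosis and treatment plan?"
    )

    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            # Role prompt: address the model as "Dr. GPT" and ask for step-by-step reasoning.
            {
                "role": "system",
                "content": (
                    "You are Dr. GPT, a fellowship-trained foot and ankle orthopaedic surgeon. "
                    "Reason step by step, state the most likely diagnosis, recommend the most "
                    "appropriate procedure, and list alternative treatments."
                ),
            },
            {"role": "user", "content": case_text},
        ],
    )

    print(response.choices[0].message.content)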

Results
The average score across all criteria for all 16 cases was 4.47, with an average sum score of 22.4. The plantar fasciitis case received the highest marks, with an average sum score of 24.7. The lowest score was observed for the peroneal tendon tear case, with an average sum score of 16.3. Subgroup analyses of each of the 5 criteria using Friedman rank sum tests showed no statistically significant differences in surgeon grading. Criterion 5 (no mention of nonexistent treatment options) and criterion 1 (the ability of ChatGPT-4 to state the correct diagnosis) received the highest subgroup scores, 4.88 and 4.77, respectively. The lowest subgroup score was observed for criterion 4 (4.05), which evaluated whether ChatGPT-4 provided comprehensive information beyond treatment options.
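As a hedged illustration of the statistical comparison described above, the sketch below applies SciPy’s Friedman rank sum test to hypothetical per-criterion ratings from the 3 surgeons across the 16 cases; the ratings are randomly generated placeholders, not the study’s data.

    import numpy as np
    from scipy.stats import friedmanchisquare

    # Placeholder Likert ratings for one criterion: 16 cases (rows) x 3 surgeons (columns).
    rng = np.random.default_rng(0)
    ratings = rng.integers(3, 6, size=(16, 3))  # hypothetical scores of 3-5

    # Friedman rank sum test: do the three surgeons' gradings of the same cases differ?
    stat, p_value = friedmanchisquare(ratings[:, 0], ratings[:, 1], ratings[:, 2])
    print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.3f}")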

Conclusion
This study demonstrates that ChatGPT-4 correctly diagnosed and provided reliable treatment recommendations for most of the foot and ankle cases presented, with consistent grading among surgeon evaluators. The individual criterion assessment revealed that ChatGPT-4 was most effective at diagnosing pathologies. Additionally, the chatbot consistently avoided suggesting nonexistent treatment options, a shortcoming commonly reported in prior studies of ChatGPT-3.5, in which fabricated information was presented as though it were true. This resource could be useful for clinicians seeking patient education materials on diagnoses and treatment options without fear of incorrect information being presented, though the comprehensiveness of information beyond treatment options may be limited.

The Journal of the American Osteopathic Academy of Orthopedics

Steven J. Heithoff, DO, MBA, FAOAO
Editor-in-Chief

© AOAO. All copyrights of published material within the JAOAO are reserved.   No part of this publication can be reproduced or transmitted in any way without the permission in writing from the JAOAO and AOAO.  Permission can be requested by contacting Joye Stewart at [email protected].