The Accuracy of ChatGPT in Classifying Lumbar Spondylolisthesis and Compression Fractures

Volume X, Number 1 | Spring 2026

Published May 29, 2026

Justin Chung BS¹; Rowen Lin BS¹; Evan Dunn DO²; Grace Kim BA³; Caleb Choi BS⁴; Kevin Mo DO²; William Fang DO²; Daniel Lee MD⁵
¹Touro University Nevada
²Valley Hospital Medical Center
³Western University of Health Sciences
⁴University of California, Irvine
⁵Desert Orthopaedic Center

DOI: http://doi.org/10.70709/qbghijluqge3j8

Abstract

Background
Chat Generative Pre-Trained Transformer (ChatGPT), released in the fall of 2022, made large language models accessible worldwide. Since then, ChatGPT’s capabilities have expanded to areas including image analysis across multiple models, such as the recent advanced reasoning o-models. The aim of this study was to determine whether four ChatGPT models can identify and classify two distinct lumbar spinal pathologies: L5-S1 spondylolisthesis and vertebral compression fractures (VCFs).

Methods
This study utilized 54 images from the VinDr-SpineXR database: 30 labeled as spondylolisthesis and 24 labeled as vertebral collapse. These images were uploaded into ChatGPT 5.2 Thinking, 4o, o4-mini, and o4-mini-high, all in temporary mode. Three standardized questions were asked: 1) Can you identify the pathology in this image? 2) Can you identify where the spondylolisthesis/compression fracture is in this image? 3) Can you grade the level of L5-S1 spondylolisthesis/percentage of compression in the image?

Results
Across both pathologies, ChatGPT 5.2 Thinking displayed the highest accuracy in initial pathology identification, at 86.7% for spondylolisthesis and 100.0% for VCFs (all p<0.001). Across localization and severity grading tasks, no single model uniformly outperformed the others. GPT o4-mini correctly identified the level of spondylolisthesis at the highest rate of 96.7%, while GPT 4o performed the best in VCF localization at a rate of 75.0%. In severity grading, ChatGPT 5.2 Thinking had the highest accuracy in Meyerding classification at 56.7%, while o4-mini-high significantly outperformed ChatGPT 5.2 Thinking in VCF grading (58.3% vs. 29.2%; p=0.043).

Conclusion
ChatGPT 5.2 Thinking demonstrated significantly higher accuracy in identifying the pathology across both lumbar spinal conditions. However, performance across localization and grading tasks was inconsistent and results varied by pathology type and model. These findings suggest that ChatGPT 5.2 Thinking may offer meaningful improvement in recognizing radiographic findings when utilized as a decision-support tool.

Keywords: ChatGPT; Artificial Intelligence; Spondylolisthesis; Vertebral Compression Fracture; Lumbar Spine; Medical Imaging

Introduction
Chat Generative Pre-trained Transformer (ChatGPT) is a large language model (LLM) developed by OpenAI and released in November 2022.^(1,2) Upon its release, the language-based chatbot rapidly attracted public attention and entered mainstream use due to its ability to respond in a conversationally logical manner.^(3,4) From its initial release as ChatGPT-3.5 to ChatGPT-4 in 2023, the pre-trained transformers continued to improve in handling complex tasks, leveraging an extensive database of websites, journals, and books.^(5,6) In 2024, OpenAI released ChatGPT 4omni (4o), a large multimodal model able to accept and analyze image inputs in addition to text inputs.^(7,8) Following this release, OpenAI introduced a new line of large language models, the o-models, in the fall of 2024.⁽⁹⁾ These models stood out due to their ability to perform advanced reasoning utilizing an internal chain of thought prior to responding. In December of 2025, OpenAI released its 5.2 model, specifically focused on enhancing performance in handling more complex inquiries and multi-step tasks.⁽¹⁰⁾ Across its model generations, ChatGPT has been reported to be useful in image identification.^(11–13) Multiple studies have assessed its image identification and clinical management capabilities across medical specialties such as emergency medicine, pathology, and orthopaedics.^(14–21)

Spondylolisthesis and vertebral compression fractures (VCFs) are two clinically significant lumbar spinal pathologies commonly identified and classified on plain radiographs. In the United States, more than 300,000 lumbar spine fusions are performed annually, with many being performed to correct lumbar spondylolisthesis.^(22–24) VCFs are also highly prevalent, with approximately 1-1.5 million occurring in the United States annually.⁽²⁵⁾ A tool that automatically identifies and classifies these spinal pathologies could make medical decision-making faster. It could potentially increase the speed of diagnosis, decrease healthcare costs, and improve overall clinical care.^(26–28) ChatGPT, as a large language model that has been continuously improving through multiple generations, holds significant promise in being able to identify and classify these spinal pathologies to improve the speed and accuracy of clinical decision-making.^(29–31)

The aim of this study was to assess the accuracy and reliability of four ChatGPT models (ChatGPT 5.2 Thinking, ChatGPT 4o, o4-mini, and o4-mini-high) in identifying, localizing, and classifying two distinct lumbar spinal pathologies: L5-S1 spondylolisthesis and vertebral compression fractures. By assessing both pathologies under an identical methodological framework, this study also allows for cross-pathology comparisons of model performance, offering a more comprehensive picture of ChatGPT’s current capabilities in lumbar spine imaging.

Material and Methods

Study Design
This study did not require Institutional Review Board (IRB) approval, as we utilized an anonymized dataset to protect patient information. This study did not involve human participants directly. All images were sourced from a publicly available, de-identified dataset; therefore, informed consent to participate was not applicable. The primary outcome was identification accuracy of pathology, defined as the percentage of images in which each model correctly identified the presence of the target pathology. Secondary outcomes included localization accuracy and severity classification accuracy, defined as the percentage of images in which each model correctly identified the affected vertebral level and assigned the correct severity grade, respectively. The sample size for each pathology was determined by the total number of images meeting inclusion criteria within the VinDr-SpineXR database. Thirty images were identified for spondylolisthesis and 24 for vertebral compression fractures, representing the complete eligible sample. Given the exploratory and descriptive nature of this study, formal power analysis was not performed; however, these sample sizes are consistent with prior pilot studies evaluating ChatGPT’s diagnostic image interpretation capabilities.^(14,32,33)

Model Selection
Radiographic images were extracted from the dataset and uploaded into four models: ChatGPT 5.2 Thinking, the current flagship model; ChatGPT 4o, the previous flagship; and the two most current o-models, o4-mini and o4-mini-high. ChatGPT 5.2 Thinking was selected as the baseline; ChatGPT 4o was selected as the previous-generation flagship; and o4-mini and o4-mini-high were selected as they are the first o-series models able to perform complex reasoning in addition to utilizing every tool within ChatGPT, such as multimodal input.⁽³⁴⁾ To minimize bias, all models were tested in temporary chat mode with the memory reset between prompts. The reference standard for each image was established by direct radiographic measurement using a DICOM viewer prior to model testing. No missing data were encountered; all selected images received a complete response from each model for all three questions.

Image Database
VinDr-SpineXR is a large, publicly available image database containing 10,469 images from 5,000 studies of multiple spinal abnormalities.^(35,36) This database was selected for this study for several reasons. First, it is a large-scale, accessible spine radiograph dataset available for research use, enabling reproducibility and transparency. Second, its standardized lesion-type labeling system allowed for systematic image selection based on confirmed pathology, reducing selection bias. Third, the dataset is fully de-identified, protecting patient privacy.

Data Collection – Spondylolisthesis
From the VinDr-SpineXR database, 30 radiographic images labeled under the “lesion_type” of “spondylolisthesis” were identified, reviewed, and subsequently separated to be uploaded into ChatGPT 5.2 Thinking, 4o, o4-mini, and o4-mini-high. The final number of images was determined after a thorough review of all images labeled “spondylolisthesis”; only lateral x-rays of L5-S1 spondylolisthesis were used. Screenshots of the radiographic images were used as image input. Images were radiographically measured using a DICOM viewer, with Meyerding grade determined by calculating the percentage of anterior slip of L5 relative to the S1 superior end plate.⁽³⁷⁾ Fourteen Grade I images and 16 Grade II images were used in this study to assess GPT accuracy. No Grade III or higher images meeting inclusion criteria were identified in the database.
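
The grading step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' measurement workflow: the function names and the example measurements are hypothetical, the quartile cut-offs are the commonly cited Meyerding thresholds, and assigning an exact boundary value (such as 25%) to the lower grade is an assumption.

```python
def slip_percentage(slip_mm: float, s1_endplate_mm: float) -> float:
    """Anterior slip of L5 as a percentage of the S1 superior end plate length."""
    if s1_endplate_mm <= 0:
        raise ValueError("S1 end plate length must be positive")
    return 100.0 * slip_mm / s1_endplate_mm


def meyerding_grade(slip_pct: float) -> str:
    """Map a slip percentage to a Meyerding grade.

    Assumption: boundary values (25, 50, 75, 100) fall into the lower grade.
    """
    if slip_pct <= 0:
        raise ValueError("no anterior slip to grade")
    if slip_pct <= 25:
        return "I"
    if slip_pct <= 50:
        return "II"
    if slip_pct <= 75:
        return "III"
    if slip_pct <= 100:
        return "IV"
    return "V"  # slip beyond 100% (spondyloptosis)


# Hypothetical measurements: 6 mm of slip on a 32 mm S1 end plate is 18.75%.
print(meyerding_grade(slip_percentage(6.0, 32.0)))   # Grade I
print(meyerding_grade(slip_percentage(12.0, 32.0)))  # Grade II (37.5%)
```

Only Grades I and II occur in the study sample, so the higher branches are included for completeness.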

Data Collection – Vertebral Compression Fractures
From the VinDr-SpineXR database, approximately 100 images labeled under the lesion_type “vertebral collapse” were reviewed. Twenty-four images were selected based on the following inclusion criteria: a lateral x-ray with minimal pathology outside the study scope; an image containing lumbar VCF. VCF location was confirmed on the DICOM viewer, and severity was measured using percentage of anterior height compression (PAHC), with mild: 20-25%, moderate: 25-40%, and severe: >40%. Radiographs were captured as screenshots for ChatGPT evaluation, as it cannot process DICOM files directly.

The final sample included fractures at L1 (n=17), L2 (n=6), and L3 (n=1). A thorough review of all available vertebral collapse images in the database revealed no images with isolated L4 or L5 vertebral collapse that met the remaining inclusion criteria. This is consistent with the epidemiology of osteoporotic VCFs, which disproportionately affect the thoracolumbar junction (T12-L2).⁽³⁸⁾ The absence of lower lumbar VCFs in this dataset is therefore a likely reflection of clinical frequency distributions.
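
The severity bands given above for PAHC can be expressed as a small Python helper. This is a sketch under stated assumptions, not the authors' procedure: the text does not specify which reference height was used, so `reference_height_mm` is a hypothetical parameter; a value of exactly 25% is assigned to "moderate" because the stated ranges share that boundary; and values under 20% are reported as falling below the mild threshold.

```python
def pahc(anterior_height_mm: float, reference_height_mm: float) -> float:
    """Percentage of anterior height compression.

    Assumption: reference_height_mm is the expected uncompressed anterior
    height (the source does not state how it was chosen).
    """
    if reference_height_mm <= 0:
        raise ValueError("reference height must be positive")
    return 100.0 * (1.0 - anterior_height_mm / reference_height_mm)


def vcf_severity(pahc_pct: float) -> str:
    """Classify compression using the bands stated in the text:
    mild 20-25%, moderate 25-40%, severe >40%."""
    if pahc_pct > 40:
        return "severe"
    if pahc_pct >= 25:  # assumption: exactly 25% counts as moderate
        return "moderate"
    if pahc_pct >= 20:
        return "mild"
    return "below mild threshold"


# Hypothetical measurement: 21 mm anterior height against a 30 mm reference.
value = pahc(21.0, 30.0)
print(round(value, 1), vcf_severity(value))
```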

Procedure
For each pathology presented, three standardized questions were asked alongside each uploaded image, in order, with memory reset between questions. For spondylolisthesis: 1) “Can you identify the pathology in this image?” 2) “Can you identify where the spondylolisthesis is in this image?” 3) “Can you grade the level of L5-S1 spondylolisthesis in the image?”. For vertebral compression fractures: 1) “Can you find the pathology in this image?” 2) “At which segment in the lumbar spine is there a compression fracture?” 3) “Can you identify the percentage of compression in the affected vertebra?”. These questions were selected to best assess: (1) pathology diagnosis when given a lateral x-ray; (2) identifying the location of the pathology; (3) grading the severity of the pathology presented. The output from ChatGPT 5.2 Thinking, 4o, o4-mini, and o4-mini-high was subsequently classified in a Y or N format according to each model’s correctness under the three categories. We then compared accurate identification, localization, and severity classification across the four models.
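
The Y/N scoring described above reduces to a simple tally per model and question. The following Python sketch shows one way to do it; the function name is hypothetical, and the example list only reproduces the counts reported for ChatGPT 5.2 Thinking on spondylolisthesis identification (26 of 30), with the per-image ordering made up for illustration.

```python
from collections import Counter


def accuracy(ratings):
    """Return (correct, total, percent) for an iterable of 'Y'/'N' ratings."""
    counts = Counter(ratings)
    unexpected = set(counts) - {"Y", "N"}
    if unexpected:
        raise ValueError(f"unexpected ratings: {sorted(unexpected)}")
    total = counts["Y"] + counts["N"]
    if total == 0:
        raise ValueError("no ratings supplied")
    return counts["Y"], total, round(100.0 * counts["Y"] / total, 1)


# Illustrative ratings: 26 correct and 4 incorrect responses out of 30 images.
ratings = ["Y"] * 26 + ["N"] * 4
print(accuracy(ratings))  # (26, 30, 86.7)
```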

Statistical Analysis
Statistical analyses were performed utilizing R (version 4.5.1) to run two-sample independent Welch t-tests with ChatGPT 5.2 Thinking as the baseline and assessing general accuracy in comparison to 4o, o4-mini and o4-mini-high. Statistical significance was set at a p-value of less than 0.05. Two-sample independent Welch t-tests were selected as the appropriate statistical test given the independence of model outputs and the continuous nature of per-image accuracy scores. While sample sizes were modest, the t-test is robust to mild violations of normality at these sizes.
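
For readers who want to trace the comparison, the Welch test can be sketched in Python using only the standard library. This is not the authors' R script: coding each image as 1 (correct) or 0 (incorrect) is an assumption about how the per-image accuracy scores were represented, the helper names are hypothetical, and the p-value is obtained by numerically integrating the t density with Simpson's rule instead of calling a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (mean difference, t statistic, Welch-Satterthwaite df)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    diff = mean(a) - mean(b)
    t = diff / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return diff, t, df


def t_pdf(x, df):
    """Density of Student's t distribution (df may be non-integer)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


# VCF grading counts from Table 3, coded 1 = correct, 0 = incorrect (assumed).
thinking = [1.0] * 7 + [0.0] * 17       # ChatGPT 5.2 Thinking: 7/24
o4_mini_high = [1.0] * 14 + [0.0] * 10  # o4-mini-high: 14/24
diff, t, df = welch_t(thinking, o4_mini_high)
print(round(diff, 3), round(t, 2), round(df, 1), round(two_sided_p(t, df), 3))
```

Under this 0/1 coding the result should land close to the mean difference and p-value reported for VCF grading in the Results.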

Results
In this study, 54 images were utilized: 30 spondylolisthesis images and 24 VCF images. Demographics were not attached to each image; however, across the total dataset, the mean age was 49 [6yrs-98yrs] with 37.83% being male and 62.17% female.^(35,36)

Part I: Spondylolisthesis (n=30)

Identification Accuracy (Question 1)
Across all 30 images, ChatGPT 5.2 Thinking correctly identified the pathology at the highest rate of 86.7% (26/30), in comparison to 4o, o4-mini, and o4-mini-high at 30.0% (9/30), 10.0% (3/30), and 16.7% (5/30), respectively (mean differences: 56.7% [95% CI 35.4-77.9%], 76.7% [95% CI 59.8-93.5%], and 70.0% [95% CI 51.2-88.8%]; all p<0.001) (Table 1). For Grade I spondylolisthesis alone, ChatGPT 5.2 Thinking performed the best in initial identification at 78.6% (11/14) when compared to 4o at 35.7% (5/14), o4-mini at 7.1% (1/14), and o4-mini-high at 14.3% (2/14) (Table 2). When only considering Grade II spondylolisthesis, ChatGPT 5.2 Thinking outperformed all models in initial identification at 93.8% (15/16), while the other models ranged from 12.5% to 25.0% (2/16 to 4/16) (Table 2).

Model	Identification (Q1)		Localization (Q2)		Meyerding Grading (Q3)
	n/30 (%)	p-value	n/30 (%)	p-value	n/30 (%)	p-value
GPT 5.2 Thinking	26/30 (86.7%)	—	27/30 (90.0%)	—	17/30 (56.7%)	—
GPT 4o	9/30 (30.0%)	<0.001**	24/30 (80.0%)	0.286	16/30 (53.3%)	0.799
o4-mini	3/30 (10.0%)	<0.001**	29/30 (96.7%)	0.310	14/30 (46.7%)	0.447
o4-mini-high	5/30 (16.7%)	<0.001**	24/30 (80.0%)	0.286	15/30 (50.0%)	0.612

Table 1. Spondylolisthesis overall model performance across all three questions. * p < 0.05; ** p < 0.001
GPT 5.2 Thinking served as the reference comparison. n = number; Q = question #

Model	Identification (Q1)		Localization (Q2)		Meyerding Grading (Q3)
	Grade I (n=14)	Grade II (n=16)	Grade I (n=14)	Grade II (n=16)	Grade I (n=14)	Grade II (n=16)
GPT 5.2 Thinking	11/14 (78.6%)	15/16 (93.8%)	12/14 (85.7%)	15/16 (93.8%)	10/14 (71.4%)	7/16 (43.8%)
GPT 4o	5/14 (35.7%)	4/16 (25.0%)	11/14 (78.6%)	13/16 (81.3%)	5/14 (35.7%)	11/16 (68.8%)
o4-mini	1/14 (7.1%)	2/16 (12.5%)	13/14 (92.9%)	16/16 (100.0%)	13/14 (92.9%)	1/16 (6.3%)
o4-mini-high	2/14 (14.3%)	3/16 (18.8%)	11/14 (78.6%)	13/16 (81.3%)	13/14 (92.9%)	2/16 (12.5%)

Table 2. Spondylolisthesis model performance stratified by Meyerding grade. n = number; Q = question

Localization Accuracy (Question 2)
Model o4-mini performed the best across all 30 images in localizing the level of L5-S1 spondylolisthesis at 96.7% (29/30), compared with 90.0% (27/30) for ChatGPT 5.2 Thinking, although this difference was not statistically significant (p=0.310) (Table 1). In addition, GPT 4o and o4-mini-high performed identically at 80.0% (24/30). When considering only Grade I spondylolisthesis, o4-mini accurately localized the spondylolisthesis at 92.9% (13/14) while ChatGPT 5.2 Thinking was close at 85.7% (12/14) (Table 2). ChatGPT 4o and o4-mini-high both identified the level of spondylolisthesis at a rate of 78.6% (11/14). Considering only Grade II spondylolisthesis, o4-mini correctly identified the level of spondylolisthesis for all images (16/16) while ChatGPT 5.2 Thinking had a success rate of 93.8% (15/16). ChatGPT 4o and o4-mini-high had the same rate of 81.3% (13/16) (Table 2).

Grading Accuracy (Question 3)
ChatGPT 5.2 Thinking performed the best across all 30 images, identifying the correct Meyerding classification at 56.7% (17/30) when compared to ChatGPT 4o at 53.3% (16/30), o4-mini at 46.7% (14/30), and o4-mini-high at 50.0% (15/30); none of these differences was statistically significant (Table 1). In classifying Grade I spondylolisthesis, ChatGPT 4o at 35.7% (5/14) was outperformed by ChatGPT 5.2 Thinking at 71.4% (10/14) and by o4-mini and o4-mini-high, both at 92.9% (13/14) (Table 2). However, ChatGPT 4o at 68.8% (11/16) outperformed ChatGPT 5.2 Thinking at 43.8% (7/16), o4-mini at 6.3% (1/16), and o4-mini-high at 12.5% (2/16) when considering only Grade II spondylolisthesis (Table 2).

Part II: Vertebral Compression Fractures

Identification Accuracy (Question 1)
Across all 24 images, ChatGPT 5.2 Thinking performed significantly better than the other models, correctly identifying the presence of VCF at a rate of 100.0% (24/24) compared to 41.7% (10/24), 45.8% (11/24), and 58.3% (14/24) for 4o, o4-mini, and o4-mini-high, respectively (mean differences: 58.3% [95% CI 37.1–79.6%], 54.2% [95% CI 32.7–75.7%], and 41.7% [95% CI 20.4–62.9%]; all p<0.001) (Table 3). When only considering L1 VCFs, ChatGPT 5.2 Thinking outperformed the other models in correctly identifying the presence of a VCF at 100.0% (17/17) (Table 4). For L2 VCFs, o4-mini-high and o4-mini performed identically at 66.7% (4/6) and 4o struggled at 33.3% (2/6), while ChatGPT 5.2 Thinking correctly identified all L2 VCFs (6/6) (Table 4). Only ChatGPT 5.2 Thinking was able to accurately identify the presence of the single L3 VCF (Table 4).

Model	Identification (Q1)		Localization (Q2)		Compression Grading (Q3)
	n/24 (%)	p-value	n/24 (%)	p-value	n/24 (%)	p-value
GPT 5.2 Thinking	24/24 (100.0%)	—	17/24 (70.8%)	—	7/24 (29.2%)	—
GPT 4o	10/24 (41.7%)	<0.001**	18/24 (75.0%)	0.752	11/24 (45.8%)	0.242
o4-mini	11/24 (45.8%)	<0.001**	14/24 (58.3%)	0.376	12/24 (50.0%)	0.146
o4-mini-high	14/24 (58.3%)	<0.001**	13/24 (54.2%)	0.242	14/24 (58.3%)	0.043*

Table 3. Vertebral Compression Fractures overall model performance across all three questions. * p < 0.05; ** p < 0.001
GPT 5.2 Thinking served as the reference comparison. n = number; Q = question #

Model	Identification (Q1)			Localization (Q2)			Compression Grading (Q3)
	L1 (n=17)	L2 (n=6)	L3 (n=1)	L1 (n=17)	L2 (n=6)	L3 (n=1)	L1 (n=17)	L2 (n=6)	L3 (n=1)
GPT 5.2 Thinking	17/17 (100.0%)	6/6 (100.0%)	1/1 (100.0%)	17/17 (100.0%)	0/6 (0.0%)	0/1 (0.0%)	7/17 (41.2%)	0/6 (0.0%)	0/1 (0.0%)
GPT 4o	8/17 (47.1%)	2/6 (33.3%)	0/1 (0.0%)	17/17 (100.0%)	1/6 (16.7%)	0/1 (0.0%)	6/17 (35.3%)	4/6 (66.7%)	1/1 (100.0%)
o4-mini	7/17 (41.2%)	4/6 (66.7%)	0/1 (0.0%)	12/17 (70.6%)	2/6 (33.3%)	0/1 (0.0%)	7/17 (41.2%)	4/6 (66.7%)	1/1 (100.0%)
o4-mini-high	10/17 (58.8%)	4/6 (66.7%)	0/1 (0.0%)	13/17 (76.5%)	0/6 (0.0%)	0/1 (0.0%)	10/17 (58.8%)	3/6 (50.0%)	1/1 (100.0%)

Table 4. Vertebral Compression Fractures model performance by vertebral level. n = number; Q = question #

Localization Accuracy (Question 2)
In this study, ChatGPT 4o outperformed the other three models when identifying the level of lumbar vertebra affected by VCF at a rate of 75.0% (18/24) (Table 3). When only considering the accuracy in the identification of L1 VCFs, ChatGPT 5.2 Thinking and 4o outperformed the other two models, accurately identifying all of the L1 VCFs (Table 4). All models performed poorly when having to localize L2 VCFs, with o4-mini performing the best at an accuracy rate of 33.3% (2/6) (Table 4). When considering the only L3 VCF, no ChatGPT models were able to accurately localize the VCF on the radiographic image.

Grading Accuracy (Question 3)
In this study, o4-mini-high significantly outperformed ChatGPT 5.2 Thinking when grading the severity of compression at 58.3% (14/24) vs. 29.2% (7/24) (mean difference -29.2%, 95% CI -57.3% to -1.0%; p=0.043), whereas the differences between ChatGPT 5.2 Thinking and the other two models were not significant (4o: p=0.242; o4-mini: p=0.146) (Table 3). When only considering L1 VCFs, o4-mini-high again outperformed the other three models, correctly measuring the severity of compression percentage at a rate of 58.8% (10/17) (Table 4). In L2 VCFs, ChatGPT 4o and o4-mini performed identically, correctly measuring the severity of compression percentage in 66.7% (4/6) of the images, while o4-mini-high accurately measured the severity of VCF at a rate of 50.0% (3/6), and ChatGPT 5.2 Thinking did not correctly measure the severity of any L2 VCF (0/6) (Table 4). In the only L3 VCF radiographic image presented, all but ChatGPT 5.2 Thinking accurately measured the severity of VCF (Table 4).

Discussion
ChatGPT, at its core, made large language models more easily accessible to the general public. ChatGPT 5.2 Thinking integrated the functionality of image recognition, analysis, and advanced reasoning, replacing ChatGPT 4o and its reasoning models of GPT o4-mini and o4-mini-high.⁽³⁹⁾ This prompted multiple studies into its diagnostic capabilities in the realm of medicine.^(28,40,41) This study aimed to assess the accuracy and reliability of ChatGPT models 5.2 Thinking, 4o, o4-mini, and o4-mini-high in identifying, localizing, and classifying two distinct lumbar spinal pathologies: L5-S1 spondylolisthesis and vertebral compression fractures (VCFs). Performance was highly variable across the four models tested, and no single model was uniformly superior. ChatGPT 5.2 Thinking achieved the highest accuracy in initial pathology identification across both conditions, while localization and grading performance varied by pathology type and model. This suggests that while ChatGPT and its family of models can be diagnostic aids, their capabilities remain limited, particularly in localization and severity grading.

Our findings indicate that ChatGPT 5.2 Thinking significantly outperformed the other three models in identifying the pathology on radiographic images. Similar studies have likewise found that ChatGPT and its models remain limited in diagnostic accuracy.^(42,43) Throughout data collection, ChatGPT 5.2 Thinking tended to have more variability when asked to measure and classify the Meyerding grade of spondylolisthesis in comparison to the other models. It tended to mainly output a Grade I. In a study completed by Lacaita, et al., it was concluded that the previous generation ChatGPT 4o exhibited higher accuracy for straightforward pathologies.⁽⁴⁴⁾ This may explain the tendency exhibited in our study for ChatGPT 5.2 Thinking to underestimate the severity of spondylolisthesis, since a milder case may present more straightforwardly. For compression grading, ChatGPT 5.2 Thinking was significantly outperformed by o4-mini-high in estimating PAHC, suggesting that the reasoning-optimized o-models may be better suited to the pixel-level spatial estimation required. Taken together, these severity grading limitations reinforce that ChatGPT’s clinical utility lies in preliminary screening.

ChatGPT 5.2 Thinking represents the most advanced model of the ChatGPT line yet, being able to accept any combination of text, audio, image, and video input while processing a response through the same neural network in multiple different languages.⁽¹⁰⁾ Through referencing its extensive training database, ranging from public web pages to multimodal data, and even data partnerships (Shutterstock) to access pay-walled content, OpenAI was able to produce an AI model that excels in visual comprehension.^(8,45) In being able to process multiple combinations of different inputs, ChatGPT 5.2 Thinking’s application in the medical field becomes extremely diverse. Furthermore, with the advent of ChatGPT 5.2 Thinking, OpenAI improved upon multi-step and spatial reasoning. While this study’s results were heterogeneous, ChatGPT 5.2 Thinking did exhibit a large improvement in identifying spondylolisthesis and VCFs in comparison to past models. This is promising in the future evolution of the GPT models, as well as future possible studies in training the model rather than resetting the memory between questions.

GPT o4-mini displayed the best overall localization of L5-S1 spondylolisthesis among the models tested in this study. Even when considering Grade I and Grade II individually, o4-mini proved to be quite effective, outperforming the other models at identifying the level of spondylolisthesis. In relation to Meyerding classification, ChatGPT 5.2 Thinking outperformed both o4-mini and o4-mini-high, suggesting that ChatGPT 5.2 Thinking, with its integrated advanced reasoning capabilities, may be more sensitive to these subtle vertebral shifts than initially thought.^(10,13,34)

The exact measurement method in assessing VHBL is controversial, with no definitive conclusion.⁽⁴⁶⁾ However, a study conducted by Hsu et al. did find vertebral body compression ratio (VBCR) to underestimate VHBL when compared to percentage of anterior height compression (PAHC), and concluded PAHC can assess VHBL with higher accuracy.⁽⁴⁷⁾ Furthermore, Hsu et al. also believe the percentage of middle height compression (PMHC) to be another important marker.⁽⁴⁷⁾ This is due to Ahn et al. finding the middle region of the vertebral body to often present clinically as the region with the most collapse.⁽⁴⁸⁾ Due to this finding in the literature, we opted in our study to calculate the PAHC as well as the PMHC, when vertebral bodies clearly displayed increased compression in the middle region.

This study assessed the past advanced reasoning models of o4-mini and o4-mini-high in image identification.⁽³⁴⁾ While studies of the previous iterations of the o-models have mainly assessed their chain-of-thought reasoning, further studies should continue to assess the advanced reasoning models in reading images.^(13,49,50) Our study showed that no single model was uniformly superior when it came to image reading. This suggests that among the current generation models, there may be further use beyond clinical reasoning and medical management with ChatGPT 5.2 Thinking as it attempts to blend both thinking and instant functions.

ChatGPT 5.2 Thinking, 4o, o4-mini, and o4-mini-high have the potential to serve as adjunct tools. Many studies of the previous iterations of ChatGPT have been conducted in the medical field, from medical examinations to general radiological applications.^(27,51–54) One notable example, Horiuchi, et al., found that GPT-4 based ChatGPT performed at the level of a radiology resident, and that the diagnostic accuracy of radiologists improved when utilizing ChatGPT as assistance.⁽⁵⁵⁾ In comparison, our study utilized the newer generation of ChatGPT 5.2 Thinking, ChatGPT 4o, o4-mini, and o4-mini-high. Although our study did not assess the same GPT-4 based model, it yielded similar results. Simply put, oversight from radiologists remains vital to ensure diagnostic accuracy.

This study had several limitations. First, the image sample size (n=54) limits statistical power and generalizability for a definitive conclusion. Second, only Grade I and Grade II Meyerding classification of spondylolisthesis and only compression fractures of L1, L2, and one L3 were included in this study. Within the dataset, there was no imaging matching the inclusion criteria found of higher grades of spondylolisthesis or vertebral body collapse of L4 or L5, which prevented us from assessing the GPT model’s performance in extreme cases to draw a more complete conclusion. Third, we focused on a single spinal level of L5-S1 spondylolisthesis. Results could vary if completed alongside varying cases of spondylolisthesis at different vertebral levels. Finally, the images may not be representative of what the radiologist sees. Because ChatGPT can only receive input from images and not DICOM files, it is not able to properly measure the Meyerding severity grade or PAHC to the same extent.⁽⁵⁶⁾ Despite these limitations, this study represents the first to assess ChatGPT 5.2 Thinking in relation to its previous generation of models, ChatGPT 4o, o4-mini and o4-mini-high, on its ability to read radiographic images of L5-S1 spondylolisthesis. Future studies should be conducted to compare other accessible large language models. Evaluation should also be extended to other specific pathologies, such as other spinal conditions, to assess the handling of different cases.

Conclusion
In conclusion, this study aimed to assess the accuracy and reliability of ChatGPT across four different models in identifying and classifying L5-S1 spondylolisthesis and VCFs. ChatGPT 5.2 Thinking demonstrated significantly higher accuracy in identifying the initial pathology across both conditions compared with the past models ChatGPT 4o, o4-mini, and o4-mini-high. However, performance across localization and severity grading varied considerably by pathology type. For spondylolisthesis, o4-mini was the most successful in localizing the affected segment, while ChatGPT 5.2 Thinking showed a small, statistically non-significant advantage in Meyerding grading. Conversely, for VCFs, ChatGPT 4o performed the best in localization, and o4-mini-high achieved the highest accuracy in grading compression severity, significantly outperforming ChatGPT 5.2 Thinking. Overall, these findings suggest that ChatGPT 5.2 Thinking may offer meaningful improvement in recognizing radiographic findings when utilized as a decision-support tool, but its inconsistent performance does not currently support use as a standalone diagnostic tool. Further studies with an expanded pathology scope and more extreme cases are warranted before clinical application.

All authors have read and approved the final manuscript for publication.

This study was not subject to institutional review board approval. Data is available for review.
No portions of this work were previously published.

Conflict of Interest: None to disclose

Funding: No funding was received for this work.

References

1. ChatGPT. Accessed June 21, 2025. https://chatgpt.com
2. Chen H, Jiao F, Li X, et al. ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up? arXiv. Preprint posted online January 15, 2024:arXiv:2311.16989. doi:10.48550/arXiv.2311.16989
3. Shahriar S, Hayawi K. Let’s have a chat! A Conversation with ChatGPT: Technology, Applications, and Limitations. Artif Intell Appl. 2023;2(1):11-20. doi:10.47852/bonviewAIA3202939
4. Unlocking the Potential of ChatGPT: A Comprehensive Exploration of its Applications, Advantages, Limitations, and Future Directions in Natural Language Processing. Accessed June 21, 2025. https://arxiv.org/html/2304.02017v14#bib.bib6
5. Introducing ChatGPT. March 13, 2024. Accessed July 8, 2025. https://openai.com/index/chatgpt/
6. GPT-4. January 12, 2024. Accessed July 10, 2025. https://openai.com/index/gpt-4-research/
7. Zhang N, Sun Z, Xie Y, Wu H, Li C. The latest version ChatGPT powered by GPT-4o: what will it bring to the medical field? Int J Surg Lond Engl. 2024;110(9):6018-6019. doi:10.1097/JS9.0000000000001754
8. OpenAI, Hurst A, Lerer A, et al. GPT-4o System Card. arXiv. Preprint posted online October 25, 2024:arXiv:2410.21276. doi:10.48550/arXiv.2410.21276
9. Introducing OpenAI o1. Accessed July 10, 2025. https://openai.com/index/introducing-openai-o1-preview/
10. Introducing GPT-5.2. February 13, 2026. Accessed February 14, 2026. https://openai.com/index/introducing-gpt-5-2/
11. Takita H, Kabata D, Walston SL, et al. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med. 2025;8(1):175. doi:10.1038/s41746-025-01543-z
12. Hosseini-Monfared P, Amiri S, Mirahmadi A, et al. ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings. Arch Acad Emerg Med. 2025;13(1):e42. doi:10.22037/aaemj.v13i1.2580
13. Lin Z, Li Y, Wu M, et al. Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study. Clin Exp Med. 2025;25(1):213. doi:10.1007/s10238-025-01743-7
14. Öztürk A, Günay S, Ateş S, Yiğit Y. Can Gpt-4o Accurately Diagnose Trauma X-Rays? A Comparative Study with Expert Evaluations. J Emerg Med. 2025;73:71-79. doi:10.1016/j.jemermed.2024.12.010
15. Ding L, Fan L, Shen M, et al. Evaluating ChatGPT’s diagnostic potential for pathology images. Front Med. 2025;11:1507203. doi:10.3389/fmed.2024.1507203
16. Fabijan A, Zawadzka-Fabijan A, Fabijan R, Zakrzewski K, Nowosławska E, Polis B. Artificial Intelligence in Medical Imaging: Analyzing the Performance of ChatGPT and Microsoft Bing in Scoliosis Detection and Cobb Angle Assessment. Diagnostics. 2024;14(7):773. doi:10.3390/diagnostics14070773
17. Ren Y, Guo Y, He Q, Cheng Z, Huang Q, Yang L. Exploring whether ChatGPT-4 with image analysis capabilities can diagnose osteosarcoma from X-ray images. Exp Hematol Oncol. 2024;13(1):71. doi:10.1186/s40164-024-00537-z
18. Chatterjee S, Bhattacharya M, Pal S, Lee SS, Chakraborty C. ChatGPT and large language models in orthopedics: from education and surgery to research. J Exp Orthop. 2023;10(1):128. doi:10.1186/s40634-023-00700-1
19. Sparks CA, Fasulo SM, Windsor JT, et al. ChatGPT Is Moderately Accurate in Providing a General Overview of Orthopaedic Conditions. JB JS Open Access. 2024;9(2):e23.00129. doi:10.2106/JBJS.OA.23.00129
20. Hayes DS, Foster BK, Makar G, et al. Artificial Intelligence in Orthopaedics: Performance of ChatGPT on Text and Image Questions on a Complete AAOS Orthopaedic In-Training Examination (OITE). J Surg Educ. 2024;81(11):1645-1649. doi:10.1016/j.jsurg.2024.08.002
21. Wang S, Wang Y, Jiang L, et al. Assessing the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in managing lumbar disc herniation. Eur J Med Res. 2025;30:45. doi:10.1186/s40001-025-02296-x
22. Li N, Scofield J, Mangham P, Cooper J, Sherman W, Kaye AD. Spondylolisthesis. Orthop Rev. 14(3):36917. doi:10.52965/001c.36917
23. Mikhael MM, Shapiro GS, Wang JC. High-Grade Adult Isthmic L5–S1 Spondylolisthesis: A Report of Intraoperative Slip Progression Treated with Surgical Reduction and Posterior Instrumented Fusion. Glob Spine J. 2012;2(2):119-124. doi:10.1055/s-0032-1307257
24. Denard PJ, Holton KF, Miller J, et al. Lumbar spondylolisthesis among elderly men: prevalence, correlates and progression. Spine. 2010;35(10):1072-1078. doi:10.1097/BRS.0b013e3181bd9e19
25. McDonald CL, Alsoof D, Daniels AH. Vertebral Compression Fractures. R I Med J 2013. 2022;105(8):40-45.
Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res. 2023;25:e48568. doi:10.2196/48568
Shin Y, Kim S, Lee YH. AI musculoskeletal clinical applications: how can AI increase my day-to-day efficiency? Skeletal Radiol. 2022;51(2):293-304. doi:10.1007/s00256-021-03876-8
Rao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res. 2023;25:e48659. doi:10.2196/48659
Teixeira-Marques F, Medeiros N, Nazaré F, et al. Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study. Eur Arch Oto-Rhino-Laryngol Off J Eur Fed Oto-Rhino-Laryngol Soc EUFOS Affil Ger Soc Oto-Rhino-Laryngol – Head Neck Surg. 2024;281(4):2023-2030. doi:10.1007/s00405-024-08498-z
Chen R, Zeng D, Li Y, Huang R, Sun D, Li T. Evaluating the performance and clinical decision-making impact of ChatGPT-4 in reproductive medicine. Int J Gynaecol Obstet Off Organ Int Fed Gynaecol Obstet. 2025;168(3):1285-1291. doi:10.1002/ijgo.15959
Miao J, Thongprayoon C, Fülöp T, Cheungpasitporn W. Enhancing clinical decision‐making: Optimizing ChatGPT’s performance in hypertension care. J Clin Hypertens. 2024;26(5):588-593. doi:10.1111/jch.14822
Maroncelli R, Rizzo V, Pasculli M, et al. Probing clarity: AI-generated simplified breast imaging reports for enhanced patient comprehension powered by ChatGPT-4o. Eur Radiol Exp. 2024;8(1):124. doi:10.1186/s41747-024-00526-1
Arruzza ES, Evangelista CM, Chau M. The performance of ChatGPT-4.0o in medical imaging evaluation: a cross-sectional study. J Educ Eval Health Prof. 2024;21:29. doi:10.3352/jeehp.2024.21.29
OpenAI o3 and o4-mini System Card. Accessed June 21, 2025. https://openai.com/index/o3-o4-mini-system-card/
Nguyen HT, Pham HH, Nguyen NT, et al. VinDr-SpineXR: A deep learning framework for spinal lesions detection and classification from radiographs. arXiv. Preprint posted online June 24, 2021:arXiv:2106.12930. doi:10.48550/arXiv.2106.12930
Pham HH, Nguyen Trung H, Nguyen HQ. VinDr-SpineXR: A large annotated medical image dataset for spinal lesions detection and classification from radiographs. doi:10.13026/Q45H-5H59
Koslosky E, Gendelberg D. Classification in Brief: The Meyerding Classification System of Spondylolisthesis. Clin Orthop. 2020;478(5):1125-1130. doi:10.1097/CORR.0000000000001153
Alexandru D, So W. Evaluation and Management of Vertebral Compression Fractures. Perm J. 2012;16(4):46-51. doi:10.7812/tpp/12-037
Wang J, Shue K, Liu L, Hu G. Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Sci Rep. 2025;15:10426. doi:10.1038/s41598-025-95233-1
Kanjee Z, Crowe B, Rodman A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA. 2023;330(1):78-80. doi:10.1001/jama.2023.8288
Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. doi:10.3389/frai.2023.1169595
Pradhan P. Accuracy of ChatGPT 3.5, 4.0, 4o and Gemini in diagnosing oral potentially malignant lesions based on clinical case reports and image recognition. Med Oral Patol Oral Cir Bucal. 2025;30(2):e224-e231. doi:10.4317/medoral.26824
Chen Z, Chambara N, Wu C, et al. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine. 2025;87(3):1041-1049. doi:10.1007/s12020-024-04066-x
Lacaita PG, Galijasevic M, Swoboda M, et al. The Accuracy of ChatGPT-4o in Interpreting Chest and Abdominal X-Ray Images. J Pers Med. 2025;15(5):5. doi:10.3390/jpm15050194
Zhang N, Sun Z, Xie Y, Wu H, Li C. The latest version ChatGPT powered by GPT-4o: what will it bring to the medical field? Int J Surg Lond Engl. 2024;110(9):6018-6019. doi:10.1097/JS9.0000000000001754
Vaccaro AR, Kim DH, Brodke DS, et al. Diagnosis and Management of Thoracolumbar Spine Fractures. JBJS. 2003;85(12):2456.
Hsu WE, Su KC, Chen KH, Pan CC, Lu WH, Lee CH. The Evaluation of Different Radiological Measurement Parameters of the Degree of Collapse of the Vertebral Body in Vertebral Compression Fractures. Appl Bionics Biomech. 2019;2019:4021640. doi:10.1155/2019/4021640
Ahn SE, Ryu KN, Park JS, Jin W, Park SY, Kim SB. Early Bone Marrow Edema Pattern of the Osteoporotic Vertebral Compression Fracture : Can Be Predictor of Vertebral Deformity Types and Prognosis? J Korean Neurosurg Soc. 2016;59(2):137-142. doi:10.3340/jkns.2016.59.2.137
Goto H, Shiraishi Y, Okada S. Performance Evaluation of GPT-4o and o1-Preview Using the Certification Examination for the Japanese “Operations Chief of Radiography With X-rays.” Cureus. 2024;16(11):e74262. doi:10.7759/cureus.74262
Tordjman M, Liu Z, Yuce M, et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med. Published online April 23, 2025. doi:10.1038/s41591-025-03726-3
Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. doi:10.2196/45312
Liu M, Okuhara T, Chang X, et al. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024;26:e60807. doi:10.2196/60807
Hirschmann A, Cyriac J, Stieltjes B, Kober T, Richiardi J, Omoumi P. Artificial Intelligence in Musculoskeletal Imaging: Review of Current Literature, Challenges, and Trends. Semin Musculoskelet Radiol. 2019;23(3):304-311. doi:10.1055/s-0039-1684024
Gorelik N, Gyftopoulos S. Applications of Artificial Intelligence in Musculoskeletal Imaging: From the Request to the Report. Can Assoc Radiol J J Assoc Can Radiol. 2021;72(1):45-59. doi:10.1177/0846537120947148
Horiuchi D, Tatekawa H, Oura T, et al. ChatGPT’s diagnostic performance based on textual vs. visual information compared to radiologists’ diagnostic performance in musculoskeletal radiology. Eur Radiol. 2025;35(1):506-516. doi:10.1007/s00330-024-10902-5
Chiang CH, Weng CL, Chiu HW. Automatic classification of medical image modality and anatomical location using convolutional neural network. PLoS ONE. 2021;16(6):e0253205. doi:10.1371/journal.pone.0253205

The Journal of the American Osteopathic Academy of Orthopedics

Published by the American Osteopathic Academy of Orthopedics

Steven J. Heithoff, DO, MBA, FAOAO
Editor-in-Chief

Joye Stewart
Managing Editor
[email protected]

Online ISSN: 2996-1742
Frequency: Triannually

© AOAO. All copyrights of published material within the JAOAO are reserved. No part of this publication can be reproduced or transmitted in any way without the permission in writing from the JAOAO and AOAO. Permission can be requested by contacting Joye Stewart at [email protected].
