Large language models show agreement with specialists in glaucoma classification study
Advanced large language models were able to distinguish glaucoma from glaucoma suspect cases using multimodal clinical data, according to a study presented at AGS 2026.
Researchers evaluated GPT-5 Pro, GPT-5, Gemini 2.5 Pro, and Grok 2.2 using demographic data, clinical measures, and imaging, including visual fields, OCT retinal nerve fiber layer scans, and fundus photographs. Model outputs were compared with chart diagnoses and independent assessments by glaucoma specialists.
Agreement between chart diagnoses and specialist assessments was 78.4%. Among the models, GPT-5 Pro showed the highest agreement with both chart diagnoses (74.0%) and specialists (79.7%). GPT-5 and Gemini 2.5 Pro showed moderate agreement, while Grok 2.2 showed lower agreement. Differences between models were statistically significant (P < 0.001).
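For readers unfamiliar with the metric, an agreement figure of this kind is typically the share of cases in which two raters assign the same label. The sketch below illustrates that calculation under this assumption. The function name and the toy labels are hypothetical and are not taken from the study, and the poster's exact method may differ.

```python
def percent_agreement(labels_a, labels_b):
    """Return the percentage of cases where two raters gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both raters must label the same number of cases.")
    if not labels_a:
        raise ValueError("At least one case is required.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical toy labels for five cases (not study data).
model_labels = ["glaucoma", "suspect", "glaucoma", "suspect", "glaucoma"]
specialist_labels = ["glaucoma", "suspect", "suspect", "suspect", "glaucoma"]

# The two raters match on four of the five cases.
print(percent_agreement(model_labels, specialist_labels))  # 80.0
```

Raw percent agreement does not correct for agreement expected by chance, which is one reason it is usually read alongside a significance test like the one reported above.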
Higher cup-to-disc ratio was associated with greater agreement with chart diagnoses, whereas age, sex, race or ethnicity, intraocular pressure, and visual acuity were not significant predictors.
Reference
Shean R, et al. Diagnostic agreement between large language models and glaucoma specialists in multimodal glaucoma classification. Poster presented at: American Glaucoma Society Annual Meeting; 2026.
