Diabetic Retinal Disease Characterization: Electronic Health Record Data Compared To Reading Center Grading Of Fundus Images

May 5, 2025

Kristi Dohm, Mozhdeh Bahrainian, Nancy Barrett, Rick Voland, John Garrett, Jomol Mathew, Michael D. Abramoff, Barbara Blodi, Amitha Domalpally, Roomasa Channa

Abstract

Purpose: Validity of diabetic retinal disease (DRD) severity level as documented in electronic health record (EHR) data is crucial for research and patient care. Our goal was to assess the validity of DRD diagnoses in the EHR by comparing these diagnoses with the Wisconsin Reading Center (WRC) (level II) reference standard grading of fundus photographs acquired at the same clinic encounter.

Methods: Patients diagnosed with DRD between January 2010 and May 2022 who had fundus photographs were identified and stratified by International Clinical Diabetic Retinopathy (ICDR) level based on diagnosis codes. Twenty-five patients were randomly selected per ICDR level. De-identified retinal images of both eyes were reviewed by WRC graders, who were masked to patients’ visit diagnoses, using a pipeline built to link clinical imaging to research. ICDR grades from visit diagnosis, reading center grading of fundus photographs, and manual chart review were compared, and agreement was assessed using weighted kappa.
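The agreement statistics named above can be illustrated with a minimal sketch. It assumes ICDR levels are encoded as integers 0 to 4 (no apparent retinopathy through PDR) and uses linear disagreement weights by default; the abstract does not state which weighting scheme was used, so that choice, the function names, and the example grades are all hypothetical and are not study data.

```python
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of eyes where both sources assign the same level."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_kappa(a: Sequence[int], b: Sequence[int],
                   n_levels: int, scheme: str = "linear") -> float:
    """Cohen's weighted kappa for two raters on an ordinal scale.

    Levels must be integers in range(n_levels). The weighting scheme
    ("linear" or "quadratic") is an assumption of this sketch.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    if scheme not in ("linear", "quadratic"):
        raise ValueError("scheme must be 'linear' or 'quadratic'")
    n = len(a)

    # Observed cross-tabulation: rows are rater a, columns are rater b.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_levels))
           for j in range(n_levels)]

    power = 1 if scheme == "linear" else 2
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            w = (abs(i - j) / (n_levels - 1)) ** power
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one level")
    return 1.0 - num / den


# Hypothetical grades for ten eyes (illustration only, not study data).
ehr_levels = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
wrc_levels = [1, 2, 2, 2, 2, 4, 3, 4, 4, 2]
print(percent_agreement(ehr_levels, wrc_levels))
print(weighted_kappa(ehr_levels, wrc_levels, n_levels=5))
```

Weighting penalizes a two-level disagreement (for example, mild NPDR versus severe NPDR) more than an adjacent-level one, which is why weighted kappa suits an ordinal severity scale better than the unweighted statistic.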

Results: 159 eyes (87 patients) were initially included. 15 eyes were then excluded due to data reconciliation issues. ICDR levels for patients based on visit diagnosis were mild non-proliferative diabetic retinopathy (NPDR) (n=23), moderate NPDR (n=20), severe NPDR (n=20), and proliferative diabetic retinopathy (PDR) (n=24). Agreement between EHR-based visit diagnosis and reading center grading was 39%, kappa: 0.19 (95% CI, 0.09, 0.29). Agreement between manual review of clinic charts and reading center grading was 59%, kappa: 0.38 (95% CI, 0.26, 0.51). Most disagreements were in the severe NPDR category.

Conclusions: Validity of DRD severity level, as recorded in the clinic chart and compared to the grading of fundus photographs by WRC graders, was low. Agreement between WRC graders and EHR-based DRD diagnoses was higher when manual chart review was conducted than when the recorded visit diagnosis codes were used alone. This low agreement may have multiple implications for the development and validation of AI models and for the reliability of conclusions about DRD based on diagnosis codes in the EHR. Strengthening EHR data through AI-assisted manual chart reviews, assessing the stability of diagnosis codes across multiple visits, and incorporating data from Current Procedural Terminology codes can help increase the accuracy of DRD levels in clinic charts.