Implementation of a Large-Scale Retinal Image Curation Workflow Using Deep Learning Framework (2022)

May 1, 2022

Rohit Balaji, Jen Heathcote, Robert Slater, Nancy Barrett, Rick Voland, Vesna Tomic, Jared McDonald, Barbara A. Blodi, Amitha Domalpally

Abstract

Purpose: The development of artificial intelligence (AI) algorithms for analyzing retinal pathologies requires training based on well-organized, labeled images. The goals of this project are to develop an AI model to curate 7-field retinal photographs and to explore ways to implement the AI model at the Wisconsin Reading Center (WRC) with high-volume image submissions.

Methods: Stereoscopic modified 7-field images of the retina are used to evaluate diabetic retinopathy on the Early Treatment Diabetic Retinopathy Study (ETDRS) Severity Scale, which serves as an outcome in clinical trials. The imaging protocol includes 7 pairs of images of the optic disc, macula, and surrounding retinal quadrants, along with an image of the anterior part of the eye (red reflex). Each field of the retina is identified by a field designation number (Figure 1). Clinical images submitted to the WRC are often not organized in a way that suits the training of AI algorithms. We trained a neural network to differentiate red reflex images from retinal images and to provide the appropriate retinal field designation. Model outputs included classification of each image into 1 of 8 classes (7 fields and red reflex) and a probability score (0 to 1) for each class, which was used to predict potential classification errors. The AI model was trained and internally validated on 17,529 images from multiple sites and tested on 3,004 independent images. The ground truth was generated by 2 WRC graders.
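The model output described above (one of 8 classes plus a probability score per class) can be illustrated with a minimal sketch. It assumes the network ends in a softmax layer, which the abstract does not state, and the class names and the `classify` helper are hypothetical placeholders rather than the WRC implementation.

```python
import math

# Hypothetical label names: 7 retinal fields plus red reflex (8 classes).
CLASSES = ["F1", "F2", "F3", "F4", "F5", "F6", "F7", "red_reflex"]


def softmax(logits):
    """Convert raw network scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return (predicted label, its probability score, all class scores)."""
    if len(logits) != len(CLASSES):
        raise ValueError("expected one raw score per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best], dict(zip(CLASSES, probs))


# Toy input, not real model output: the third class has the largest raw score.
label, score, per_class = classify([0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

The score of the winning class is the quantity that can later be compared against a cut-off to flag labels that may be wrong.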

Results: Exact agreement on field designations between graders and the AI model was found for 2,651 of 3,004 images (88%, kappa 0.87) (Figure 2). AI probability scores were 0.95 to 0.99 for labels that matched the human assessment and 0.39 to 0.84 for labels that did not match. Images with non-matching labels included those with poor image quality or focus and those in which retinal landmarks were improperly localized.
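For readers who want to reproduce agreement statistics of this kind, the sketch below computes exact agreement and an unweighted Cohen's kappa from two lists of labels. The abstract does not say which kappa variant was used, so the unweighted form is an assumption, and the function names are illustrative.

```python
from collections import Counter


def exact_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(labels_a)
    observed = exact_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# The reported proportion follows directly from the counts in the Results.
reported_agreement = 2651 / 3004  # about 0.882, reported as 88%
```

Kappa is lower than raw agreement because it discounts matches that two raters would produce by chance given how often each uses each label.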

Conclusions: The deep learning model provides an accurate, automated method for curating retinal images. The probability score provides a tool for flagging potential errors in AI labels so that they can be routed for human oversight. Large-scale AI implementation systems need a tiered approach to augment workflow, increase trust in AI models, and provide a means for continuous model development.
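One way to picture the tiered approach is a simple routing rule on the probability score. The sketch below is an assumption-laden illustration: the 0.90 cut-off is not from the abstract and was chosen only because it falls between the two score ranges reported in the Results, and the `route` and `triage` helpers and tier names are hypothetical.

```python
# Illustrative cut-off only (an assumption): it lies between the scores
# reported for non-matching labels (up to 0.84) and matching labels (from 0.95).
REVIEW_THRESHOLD = 0.90


def route(label, score, threshold=REVIEW_THRESHOLD):
    """Assign an AI label to a workflow tier based on its probability score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    tier = "auto_accept" if score >= threshold else "human_review"
    return {"label": label, "score": score, "tier": tier}


def triage(predictions, threshold=REVIEW_THRESHOLD):
    """Split (label, score) pairs into accepted labels and a review queue."""
    accepted, review = [], []
    for label, score in predictions:
        item = route(label, score, threshold)
        (accepted if item["tier"] == "auto_accept" else review).append(item)
    return accepted, review


# Toy predictions, not study data.
accepted, review = triage([("F2", 0.97), ("F4", 0.62), ("red_reflex", 0.99)])
```

Labels corrected in the review queue could then be fed back as additional training data, which is one reading of the continuous model development mentioned above.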