Influence of Race on Training Data Quality for Artificial Intelligence (AI) Algorithms (2022)

May 1, 2022

Amitha Domalpally, Rick Voland, Robert Slater, Ellie Corkery, Pamela Vargo, Rebecca Kuhtz, James Reimers, Roomasa Channa, Barbara Blodi

Abstract

Purpose: Diabetic retinopathy (DR) severity level is evaluated from stereoscopic 7-field color photographs by masked graders at the Wisconsin Reading Center and used as a reference standard for training and validation of AI algorithms. The influence of race on training data quality is understudied, and existing standards were established for White subjects. Retinal pigmentation is greater in individuals with darker skin tones, and the reduced contrast may affect the detection of DR features by graders. We explored the effect of race-related fundus pigmentation on the graders' ability to document DR.

Methods: All images were acquired by a certified photographer at a single site using the same camera and were evaluated for DR by two experienced graders masked to all demographics. Following our standard protocol for quality assessment, graders determined a confidence score (CS) of high or low (borderline and ungradable) for each image. Graders were permitted to use digital enhancement tools for better visualization. The Red, Green, and Blue (RGB) channel values were also obtained for a representative image of each subject.
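The abstract does not state how the RGB channel values were computed. As a minimal sketch, assuming the value for each channel is the mean intensity over the non-background pixels of an 8-bit color image, the calculation might look like the following. The function name `mean_rgb` and the `background_threshold` parameter are hypothetical and are not part of the study protocol.

```python
import numpy as np

def mean_rgb(image, background_threshold=0):
    """Mean R, G, B over pixels brighter than background_threshold.

    Assumption: `image` is an H x W x 3 array in RGB order, and pixels whose
    brightest channel is at or below the threshold belong to the dark border
    around the fundus and should be ignored.
    """
    pixels = np.asarray(image, dtype=np.float64).reshape(-1, 3)
    keep = pixels.max(axis=1) > background_threshold
    if not keep.any():
        raise ValueError("no pixels above the background threshold")
    r, g, b = pixels[keep].mean(axis=0)
    return float(r), float(g), float(b)

# Synthetic 2 x 2 example: two black border pixels and two colored pixels.
demo = np.zeros((2, 2, 3), dtype=np.uint8)
demo[0, 0] = (100, 50, 20)
demo[0, 1] = (200, 150, 40)
print(mean_rgb(demo))  # (150.0, 100.0, 30.0)
```

Excluding the border is a design choice made here for illustration; without it, the black frame typical of fundus photographs would pull every channel mean toward zero.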

Results: Of 183 subjects, 37 (20.2%) were identified as White, 11 (6%) as Black, and 135 (73.8%) as Other with Hispanic ethnicity. DR prevalence was 10.4% in the full cohort (366 eyes), mostly mild-moderate non-proliferative DR. DR prevalence across the 3 racial/ethnic groups was 12.2%, 0%, and 12.2%, respectively. CS was high in 93%, 77%, and 82%, borderline in 5%, 23%, and 16%, and ungradable in 1.4%, 0%, and 1.5%, respectively. There was a significant difference between high and low CS across the three groups (p = 0.029) and also for the red channel of RGB values: 129 (95% CI 95, 105), 85 (77, 104), and 94 (88, 99) (p < 0.001).

Conclusions: Grader confidence for evaluating DR features was lower in pigmented retinas, which could affect the accuracy of data from images obtained from darker-skinned individuals. It is possible that DR was undercalled in the Black population due to difficulty in detecting early features. This bias can transfer into AI models via training data. Photographer education in image capture techniques and grader training in racially diverse datasets are needed to obtain high-quality, categorically delineated data.