An investigation of the annotated data sparsity problem in the medical domain

Pujitha Appan Kandala

Abstract

Diabetic retinopathy (DR) is the most common eye disease in people with diabetes. It affects patients for a significant number of years and can lead to permanent blindness if left untreated. Early detection and treatment of DR are therefore of utmost importance for preventing blindness, and automatic disease detection and classification have been attracting much interest. High performance is critical to the adoption of such systems, which generally rely on training with a wide variety of annotated data. Such varied annotated data are scarce in medical imaging. The main focus of this thesis is to deal with the sparsity of annotated data and to develop computer-aided diagnostic (CAD) systems that require less annotated data and yet achieve high accuracy. We propose three different solutions to address this problem.

First, we propose a semi-supervised framework that paves the way for including unlabelled data in training. A co-training framework is used in which features are extracted from a limited training set and an independent model is learnt on each of the features; the models are then used to predict labels for new data. The images from the unlabelled set that are labelled with high confidence are added back to the training set and the process is repeated, thus expanding the set of known labels. This framework is showcased on retinal neovascularization (NV), which is a critical stage of proliferative DR. For NV detection, the framework achieved an AUC of 0.985, with a sensitivity of 96.2% at a specificity of 92.6%, which was superior to existing models.

Secondly, we propose crowdsourcing as a solution, in which we obtain annotations from a crowd and use them for training after refinement. We employ a strategy to overcome the noisy nature of crowdsourced annotations by i) assigning a reliability factor to each subject of the crowd based on their performance (at global and local levels) and experience, and ii) requiring region of interest (ROI) markings rather than pixel-level markings from the crowd. We also show that these annotations are reliable by training a deep neural net (DNN) for the detection of hard exudates, which occur in mild non-proliferative DR. Experimental results for hard exudate detection showed that training with refined crowdsourced data is effective: detection performance improves by 25% over training with expert markings alone.

Lastly, we explore synthetic data generation as a solution to this problem. We propose a novel method, based on generative adversarial networks (GAN), to generate images with lesions such that the overall severity level can be controlled. We showcase this approach for hard exudate and haemorrhage detection in retinal images with 4 levels of severity, which vary from mild to severe non-proliferative DR. The synthetic data were also shown to be reliable for developing a CAD system for DR detection. Hard exudate/haemorrhage detection was found to improve when synthetic data were included in the training set, with an improvement in sensitivity of about 25% over training with expert-marked data alone.
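The co-training loop summarised in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the nearest-centroid classifier, the softmax-over-distance confidence score, the two-view data layout, and the names `centroid_fit`, `centroid_predict`, `co_train`, `threshold` and `max_rounds` are all assumptions chosen to keep the example self-contained.

```python
import math


def centroid_fit(X, y):
    """Fit a nearest-centroid model: one mean vector per class (assumed stand-in classifier)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def centroid_predict(model, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {c: math.exp(-math.dist(x, m)) for c, m in model.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total


def fit_views(labelled):
    """Train one independent model per feature view."""
    labels = [y for _, _, y in labelled]
    model_a = centroid_fit([a for a, _, _ in labelled], labels)
    model_b = centroid_fit([b for _, b, _ in labelled], labels)
    return model_a, model_b


def co_train(labelled, unlabelled, threshold=0.9, max_rounds=10):
    """labelled: list of (view_a, view_b, label); unlabelled: list of (view_a, view_b).

    Each round, both view models label the pool; samples whose more confident
    prediction reaches `threshold` (a hypothetical cut-off) join the training set.
    """
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        model_a, model_b = fit_views(labelled)
        remaining, added = [], 0
        for a, b in pool:
            label_a, conf_a = centroid_predict(model_a, a)
            label_b, conf_b = centroid_predict(model_b, b)
            # Keep the prediction of whichever view is more confident.
            label, conf = (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)
            if conf >= threshold:
                labelled.append((a, b, label))
                added += 1
            else:
                remaining.append((a, b))
        pool = remaining
        if added == 0 or not pool:
            break
    # Refit so the returned models reflect the final expanded training set.
    model_a, model_b = fit_views(labelled)
    return model_a, model_b, labelled, pool
```

In this sketch, samples that neither view can label confidently stay in the unlabelled pool rather than being forced into a class, which mirrors the abstract's point that only highly confident labels are added back to the training set.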

Year of completion:	November 2018
Advisor:	Jayanthi Sivaswamy