Continual and Incremental Learning in Computer-aided Diagnosis Systems


Prathyusha Akundi

Abstract

Deep Neural Networks (DNNs) have shown remarkable performance in a broad range of computer vision tasks, including in the medical domain. With the advent of DNNs, the medical community has witnessed significant developments in segmentation, classification, and detection. But this success comes at the cost of a heavy reliance on abundant data. Medical data, however, is often highly limited in volume and quality due to the sparsity of patient contact, variability in medical care, and privacy concerns. Hence, to train large networks we seek data from different sources. In such a scenario, it is of interest to design a model that learns continuously and adapts to datasets or tasks as and when they become available. However, an important step towards such a never-ending learning process is to overcome Catastrophic Forgetting (CF) of previously seen data or tasks. CF refers to the significant degradation in performance on an old task/dataset when a model is trained on a new one. To avoid confusion, we call a training regime Continual Learning (CL) when CAD systems have to handle a sequence of datasets collected over time from different sites with different imaging parameters/populations. Similarly, Incremental Learning (IL) is when CAD systems have to learn new classes as and when new annotations are made available. The work described in this thesis addresses core aspects of both CL and IL and is compared against state-of-the-art methods. In this thesis, we assume that access to the data belonging to previously trained datasets or tasks is not available, which makes both CL and IL even more challenging. We start by developing a CL system that learns sequentially on different datasets and handles CF using an uncertainty mechanism. The system consists of an ensemble of models which are trained or fine-tuned on each dataset, and it considers the prediction from the model that has the least uncertainty. We then investigate a new way to tackle CF in CL by manifold learning, inspired by defense mechanisms against adversarial attacks. Our method uses a ‘Reformer’, essentially a denoising autoencoder, that ‘reforms’ or brings the data from all the datasets together towards a common manifold. These reformed samples are then passed to the network to learn the desired task. Towards IL, we propose a novel approach that ensures that a model remembers the causal factor behind its decisions on old classes while incrementally learning new ones. We introduce a common auxiliary task during the course of incremental training, whose hidden representations are shared across all the classification heads. All experiments for both CL and IL are conducted on multiple datasets and show significant improvement over state-of-the-art methods.
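
To make the uncertainty-gating idea concrete, here is a minimal sketch in PyTorch. It assumes Monte-Carlo dropout predictive entropy as the uncertainty estimate; the thesis does not commit to this particular estimator, and all function names here are illustrative:

    import torch
    import torch.nn.functional as F

    def mc_dropout_predict(model, x, n_samples=20):
        # Keep dropout active at inference time and average the softmax outputs.
        model.train()
        with torch.no_grad():
            probs = torch.stack([F.softmax(model(x), dim=-1)
                                 for _ in range(n_samples)])
        mean_probs = probs.mean(dim=0)                      # (batch, classes)
        # Predictive entropy as a simple uncertainty score (lower = more certain).
        entropy = -(mean_probs * (mean_probs + 1e-8).log()).sum(dim=-1)
        return mean_probs, entropy

    def ensemble_predict(models, x):
        # One model per dataset seen so far; trust the least-uncertain member.
        outputs = [mc_dropout_predict(m, x) for m in models]
        uncertainty = torch.stack([u for _, u in outputs])  # (n_models, batch)
        winner = uncertainty.argmin(dim=0)                  # per-sample model index
        preds = torch.stack([p.argmax(dim=-1) for p, _ in outputs])
        return preds.gather(0, winner.unsqueeze(0)).squeeze(0)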

Year of completion: April 2023
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

3D Interactive Solution for Neuroanatomy Education


Mythri V

Abstract

Typically, anatomy is taught through dissection, 2D images, presentations, and cross-sections. While these methods are convenient, they are non-interactive and fail to capture the spatial relationships and functional aspects of anatomy. New visualization technologies such as virtual reality and 3D displays can compensate for these impediments and provide better understanding while captivating the students. With recent advances in the industry, methods to provide a 3D experience have become economical. In this thesis, we introduce a low-cost 3D-interactive anatomy system designed for an audience of typical medical college students in an anatomy class. The setup used to achieve 3D visualization is dual-projector polarization. While there are other ways to achieve 3D visualization, such as alternate-frame sequencing and virtual reality, this technique can target a large audience and requires minimal accessories, making it a low-cost solution for an immersive 3D experience. The 3D interactive neuroanatomy solution is an end-to-end framework capable of designing anatomy lessons and visualizing the 3D stereoscopic projection of those lessons. To ensure superior comprehension, we accommodate each teacher’s unique teaching approach by providing the ability to create their own lessons. We have created anatomy lessons based on the human brain, a vital organ with a complex anatomy. Our aim is to help medical students understand the complexity of organ systems not just from an anatomical perspective but also from a radiological one. We use annotations on clinical case data such as MRI, MRA, etc., to create 3D models for anatomy visualization, incorporating clinical information and illustrating real cases. Annotations for structures of interest are done using manual, automatic, and semi-automatic segmentation methods. Manual delineation of structure boundaries is very tedious and time-consuming, while automatic segmentation is quick and convenient. However, small and complex structures were annotated manually for the 3D anatomy viewer because automatic segmentation was substandard for them; there is thus a need to improve automatic segmentation performance for these structures. While segmentation is an essential step in 3D modeling, it also plays a critical role in diagnosing many neurological diseases associated with degradation in the sub-cortical region. Therefore, accurate algorithms are needed for sub-cortical structure segmentation. Variance in the size of structures is significant, which introduces a performance bias towards larger structures in many deep learning approaches. In this part of the thesis, we aim to remove size bias in sub-cortical structure segmentation. The proposed method addresses this problem with a pre-training step that learns tissue characteristics and an ROI extraction step that aids in focusing on the local context; using structure ROIs elevates the influence of smaller structures in the network.
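
As a rough illustration of the ROI-extraction step described above, the sketch below crops a fixed-size patch around each coarsely localized structure so that small sub-cortical structures are not dominated by larger ones. The patch size and function name are assumptions for illustration, not details from the thesis:

    import numpy as np

    def extract_structure_roi(volume, coarse_mask, label, patch=(48, 48, 48)):
        # Centre of the structure's coarse segmentation mask.
        coords = np.argwhere(coarse_mask == label)
        centre = coords.mean(axis=0).astype(int)
        # Clamp the crop so it stays inside the volume bounds.
        starts = [int(max(0, min(c - p // 2, s - p)))
                  for c, p, s in zip(centre, patch, volume.shape)]
        slices = tuple(slice(st, st + p) for st, p in zip(starts, patch))
        return volume[slices]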

Year of completion: April 2023
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

Towards Generalization in Multi-View Pedestrian Detection


Jeet Vora

Abstract

Detecting humans in images and videos has emerged as an essential aspect of intelligent video systems that solve pedestrian detection, tracking, crowd counting, etc. It has many real-life applications, varying from visual surveillance and sports to autonomous driving. Despite achieving high performance, single-camera detection methods are susceptible to occlusions caused by humans, which drastically degrade performance when crowd density is high. A multi-camera setup therefore becomes necessary: it combines multiple camera views by computing precise 3D locations that can be projected to a top view, also termed the Bird’s Eye View (BEV) representation, which permits better occlusion reasoning in crowded scenes. The thesis, therefore, presents a multi-camera approach that globally aggregates the multi-view cues for detection and alleviates the impact of occlusions in a crowded environment. However, it remains largely unknown how well multi-view detectors generalize to unseen data. This is critical because a practical multi-view detector should be usable in scenarios such as i) when a model trained with a few camera views is deployed and one of the cameras fails during testing/inference, or when more camera views are added to the existing setup, ii) when camera positions change in the same environment, and finally iii) when the system is deployed in an unseen environment; an ideal multi-camera system should be adaptable to such changing conditions. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. We formalize three critical forms of generalization and outline the experiments to evaluate them: generalization i) across a varying number of cameras, ii) across varying camera positions, and finally, iii) to new scenes. We discover that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. To address these concerns, (a) we generate a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with varying times of day, camera configurations, and numbers of cameras, and (b) we discuss the properties essential for bringing generalization to MVD and develop a barebones model to incorporate them. We perform a series of experiments on the WildTrack, MultiViewX, and GMVD datasets to motivate the need to evaluate the generalization abilities of MVD methods and to demonstrate the efficacy of the developed approach.
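
The BEV representation underlying this approach can be sketched as follows: each camera view is warped onto a common ground-plane grid with a homography (derived from camera calibration in practice) and the warped views are aggregated. This is a simplified illustration of the general technique; the actual detectors aggregate deep feature maps rather than raw images:

    import cv2
    import numpy as np

    def aggregate_to_bev(views, homographies, bev_size=(480, 480)):
        # Warp every camera view onto the shared ground-plane (BEV) grid.
        h, w = bev_size
        bev = np.zeros((h, w, views[0].shape[2]), dtype=np.float32)
        for img, H in zip(views, homographies):
            # cv2.warpPerspective takes dsize as (width, height).
            bev += cv2.warpPerspective(img.astype(np.float32), H, (w, h))
        return bev / len(views)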

Year of completion: April 2023
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

Does Audio help in deep Audio-Visual Saliency prediction models?


Ritvik Agrawal

Abstract

The task of saliency prediction focuses on understanding and modeling human visual attention (HVA), i.e., where and what people pay attention to given visual stimuli. Audio has ideal properties to aid gaze fixation while viewing a scene. There exists substantial evidence of audio-visual interplay in human perception, and it is agreed that they jointly guide our visual attention. Learning computational models for saliency estimation is an effort to inch machines/robots closer to human cognitive abilities. The task of saliency prediction is helpful in many digital content-based applications like automated editing, perceptual video coding, human-robot interaction, etc. The field has progressed from using hand-crafted features to deep learning-based solutions. Efforts on static image saliency prediction are led by convolutional architectures, and these ideas were extended to videos by integrating temporal information using 3D convolutions or LSTMs. Many sophisticated multimodal, multi-stream architectures have been proposed to process multimodal information for saliency prediction. Although existing Audio-Visual Saliency Prediction (AVSP) models claim promising results by fusing the audio modality over visual-only models, most of these models effectively consider only visual cues and fail to leverage the auditory information that is ubiquitous in dynamic scenes. In this thesis, we investigate the relevance of audio cues in conjunction with visual ones and conduct an extensive analysis of why AVSP models appear superior, employing well-established audio modules and fusion techniques from diverse, correlated audio-visual tasks. Our analysis on ten diverse saliency datasets suggests that none of the methods succeed in incorporating audio. Our endeavour suggests that augmenting audio features ends up learning a predictive model agnostic to audio. Furthermore, we bring to light why AVSP models show a performance gain over visual-only models even though the audio branch is agnostic at inference. Our experiments clearly indicate that the visual modality dominates the learning; the current models largely ignore the audio information. The observation is consistent across three different audio backbones and four different fusion techniques, and contrasts with previous methods, which claim audio is a significant contributing factor. The performance gains are a byproduct of improved training, and the additional audio branch seems to have a regularizing effect. We show that similar gains are achieved when feeding random audio during training. Overall, our work questions the role of audio in current deep AVSP models and, by demonstrating that simpler alternatives work equally well, motivates the community to reconsider the complex architectures.
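
The random-audio control experiment mentioned above can be sketched as follows (PyTorch, illustrative names): if substituting noise for the real audio features during training leaves performance unchanged, the audio branch acts only as a regularizer rather than contributing audio information:

    import torch

    def training_step(model, frames, audio_feats, gt_saliency, loss_fn,
                      random_audio=True):
        if random_audio:
            # Control condition: noise with the same shape as real audio features.
            audio_feats = torch.randn_like(audio_feats)
        pred = model(frames, audio_feats)  # any audio-visual fusion model
        return loss_fn(pred, gt_saliency)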

Year of completion: April 2023
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

Development of Annotation Guidelines, Datasets and Deep Networks for Palm Leaf Manuscript Layout Understanding


Sowmya Aitha

Abstract

Ancient paper documents and palm leaf manuscripts from the Indian subcontinent have made a significant contribution to world literature and culture. These documents often have complex, uneven, and irregular layouts. Digitizing and deciphering the content of these documents without human intervention poses difficulties across a broad range of dimensions, including language, script, layout, element types and positions, and the number of manuscripts per image. Large-scale annotated Indic manuscript image datasets are needed for this kind of research. In order to meet this objective, we present Indiscapes, the first dataset containing multi-regional layout annotations for ancient Indian manuscripts. We also adapt a fully convolutional deep neural network architecture for fully automatic, instance-level spatial layout parsing of manuscript images, in order to deal with challenges such as the presence of dense, irregular layout elements, pictures, multiple documents per image, and a wide variety of scripts. We demonstrate the effectiveness of the proposed architecture on images from the Indiscapes dataset. Despite these advancements, semantic layout segmentation using typical deep network methods is not robust to the complex deformations observed across semantic regions. This problem is particularly evident in the domain of Indian palm-leaf manuscripts, which has limited resources. Therefore, we present Indiscapes2, a new expansive dataset of various Indic manuscripts with semantic layout annotations, to help address the issue. Indiscapes2 is 150% larger than Indiscapes and contains materials from four different historical collections. In addition, we propose a novel deep network called Palmira for reliable, deformation-aware region segmentation in handwritten manuscripts. As a performance measure, we additionally report a boundary-centric metric, the Hausdorff distance, and its variants. Our tests show that Palmira produces reliable layouts and outperforms both strong baseline methods and ablative versions. We also highlight our results on Arabic, South-East Asian, and Hebrew historical manuscripts to showcase the generalization capability of Palmira. Even with reliable deep-network-based approaches for understanding manuscript layout, these models implicitly assume one or two manuscripts per image, whereas in real-world scenarios multiple manuscripts are often scanned together to maximise scanner surface area and reduce manual labour. Isolating (segmenting) each individual manuscript within a scanned image on a per-instance basis thus becomes the first essential step in understanding its content; hence, there is a need for a precursor system that extracts individual manuscripts before downstream processing. The highly curved and deformed boundaries of manuscripts, which frequently cause them to overlap with each other, introduce a further complexity. We introduce another new document image dataset named IMMI (Indic Multi Manuscript Images) to address these issues. We also present a method that generates synthetic images to augment sourced non-synthetic images, in order to boost the effectiveness of the dataset and facilitate deep network training. Adapted versions of current document instance segmentation frameworks are used in our experiments, and the results demonstrate their efficacy for the task.
Overall, our contributions enable robust extraction of individual historical manuscript pages. This, in turn, could potentially enable better performance on downstream tasks such as region-level instance segmentation, optical character recognition, and word-spotting in historical Indic manuscripts at scale.
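
For reference, the boundary-centric measure mentioned above can be computed as below: the symmetric Hausdorff distance between predicted and ground-truth boundary point sets, using SciPy. This illustrates the standard metric, not necessarily the exact variant reported in the thesis:

    from scipy.spatial.distance import directed_hausdorff

    def hausdorff_distance(pred_boundary, gt_boundary):
        # Symmetric Hausdorff distance between two (N, 2) arrays of boundary points.
        d_pg = directed_hausdorff(pred_boundary, gt_boundary)[0]
        d_gp = directed_hausdorff(gt_boundary, pred_boundary)[0]
        return max(d_pg, d_gp)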

Year of completion: May 2023
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Downloads

thesis

More Articles …

1. Situation Recognition for Holistic Video Understanding
2. Computer Vision based Large Scale Urban Mobility Audit and Parametric Road Scene Parsing
3. Weakly supervised explanation generation for computer aided diagnostic systems
4. Improving the Efficiency of Fingerprint Recognition Systems