
Computer Vision based Large Scale Urban Mobility Audit and Parametric Road Scene Parsing


Durga Nagendra Raghava Kumar Modhugu

Abstract

The footprint of partially or fully autonomous vehicles is increasing gradually with time. The existence and availability of the necessary modern infrastructure are crucial for the widespread use of autonomous navigation. One of the most critical efforts in this direction is to build and maintain HD maps efficiently and accurately. The information in HD maps is organized in various levels: 1) the geometric layer, 2) the semantic layer, and 3) the map priors layer. Conventional approaches to capturing and extracting information at the different HD map levels rely heavily on huge sensor networks and manual annotation, and do not scale to creating HD maps for massive road networks. In this work, we propose two novel solutions to address these problems: the first deals with generating the geometric layer with parametric information of the road scene, and the other updates information on road infrastructure and traffic violations in the semantic layer.

Firstly, creating the geometric layer of the HD map requires understanding the road layout in terms of structure, number of lanes, lane width, curvature, etc. Predicting these attributes as part of a generalizable parametric model, from which the road layout can be rendered, suits the creation of the geometric layer. Many previous works that tried to solve this problem rely only on ground imagery and are limited by the narrow field of view of the camera, occlusions, and perspective foreshortening. This work demonstrates the effectiveness of using aerial imagery as an additional modality to overcome the above challenges. We propose a novel architecture, Unified, that combines aerial and ground imagery features to infer scene attributes. We evaluate quantitatively on the KITTI dataset and show that our Unified model outperforms prior works. Since this dataset is limited to road scenes close to the vehicle, we supplement the publicly available Argoverse dataset with scene attribute annotations and evaluate far-away scenes. We quantitatively and qualitatively show the importance of aerial imagery in understanding road scenes, especially in regions farther away from the ego-vehicle.

Finally, we also propose a simple mobile imaging setup to address and audit several common problems in urban mobility and road safety, which can enrich the information in the semantic layer of HD maps. Recent computer vision techniques are used to identify street irregularities (including missing lane markings and potholes), absence of street lights, and defective traffic signs using videos obtained from a moving camera-mounted vehicle. Beyond the inspection of static road infrastructure, we also demonstrate the applicability of mobile imaging solutions to spot traffic violations. We validate our proposal on long stretches of unconstrained road scenes covering over 2000 km and discuss practical challenges in applying computer vision techniques at such a scale. An exhaustive evaluation is carried out on 257 long stretches with unconstrained settings and 20 condition-based hierarchical frame-level labels for different timings, weather conditions, road types, traffic density, and states of road damage. For the first time, we demonstrate that large-scale analytics of irregular road infrastructure is feasible with existing computer vision techniques.
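Although the abstract gives no implementation details, the central idea of the Unified model (fusing ground and aerial features before predicting scene attributes) can be sketched in a few lines. The sketch below is a minimal, hypothetical PyTorch illustration: the ResNet-18 backbones, feature sizes, and the particular attribute heads are assumptions for exposition, not the architecture from the thesis.

    # Minimal sketch of a two-branch fusion model in the spirit of "Unified":
    # ground and aerial images are encoded separately, their features are
    # concatenated, and shared trunk features feed per-attribute heads.
    # Backbones, dimensions, and the attribute set are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class UnifiedSketch(nn.Module):
        def __init__(self, num_lane_classes=6):
            super().__init__()
            # Separate encoders per modality (assumed ResNet-18, fc layer dropped).
            self.ground_enc = nn.Sequential(*list(models.resnet18(weights=None).children())[:-1])
            self.aerial_enc = nn.Sequential(*list(models.resnet18(weights=None).children())[:-1])
            self.fuse = nn.Sequential(nn.Linear(512 * 2, 256), nn.ReLU())
            # Example heads: lane count (discrete), lane width and curvature (continuous).
            self.lane_count = nn.Linear(256, num_lane_classes)
            self.lane_width = nn.Linear(256, 1)
            self.curvature = nn.Linear(256, 1)

        def forward(self, ground, aerial):
            g = self.ground_enc(ground).flatten(1)   # (B, 512) ground-view features
            a = self.aerial_enc(aerial).flatten(1)   # (B, 512) aerial-view features
            h = self.fuse(torch.cat([g, a], dim=1))  # joint scene representation
            return self.lane_count(h), self.lane_width(h), self.curvature(h)

    model = UnifiedSketch()
    lanes, width, curvature = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))

The aerial crop would be geo-aligned with the ego-vehicle's position, which is what lets the second branch contribute information about regions the ground camera cannot see.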

Year of completion: December 2022
Advisor: C V Jawahar

Related Publications


    Downloads

    thesis

Weakly Supervised Explanation Generation for Computer Aided Diagnostic Systems


Aniket Joshi

Abstract

Computer Aided Diagnosis (CAD) systems are developed to aid doctors and clinicians in diagnosis after interpreting and examining a medical image, and they help perform this task more consistently. With the arrival of the data-driven deep learning paradigm and the availability of large amounts of data in the medical domain, CAD systems are being developed to diagnose a large variety of diseases, ranging from different types of cancer to heart and brain diseases, Alzheimer's disease, and diabetic retinopathy. These systems are highly competent at the task on which they are trained. Although such systems perform on par with trained clinicians, they suffer from the limitation that they are completely black-box in nature and are trained only on image-level class labels. This poses a problem for deploying CAD systems as standalone solutions for disease diagnosis, because decisions in the medical domain concern the health of a patient and must be well reasoned and backed by evidence, sometimes from multiple modalities. Hence, there is a critical need for CAD systems' decisions to be explainable. Restricting our focus to the image modality alone, one solution for designing explainable CAD systems would be to train the system using both class labels and local annotations and derive the explanation in a fully supervised manner. However, obtaining these local annotations is very expensive, time-consuming, and infeasible in most circumstances.

In this thesis, we address this explainability and data-scarcity problem and propose two different approaches towards the development of weakly supervised explainable CAD systems. Firstly, we aim to explain the classification decision by providing heatmaps denoting the important regions of interest in the image that helped the model make the prediction. In order to generate anatomically accurate heatmaps, we provide a mixed set of annotations to our model: class labels for the entire training set of images, and rough localizations of suspect regions for a smaller subset of the training images. The proposed approach is illustrated on two different disease classification tasks based on disparate image modalities: Diabetic Macular Edema (DME) classification from OCT slices and breast cancer detection from mammographic images. Good classification results are shown on public datasets, supplemented by explanations in the form of suspect regions; these are derived using just a third of the images with local annotations, emphasizing the potential generalisability of the proposed solution.
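As a concrete illustration of the mixed-supervision idea (class labels for all images, rough localizations for only a subset), a training objective could combine a classification term over the full batch with a heatmap term applied only where masks exist. The function below is a hedged sketch; the heatmap head, mask format, and weighting factor alpha are assumptions for illustration, not the formulation from the thesis.

    # Hedged sketch of a mixed-supervision loss: every image contributes a
    # classification loss, while the annotated subset adds an auxiliary loss
    # aligning predicted heatmaps with rough region masks.
    import torch
    import torch.nn.functional as F

    def mixed_loss(logits, heatmaps, labels, rough_masks, has_mask, alpha=0.5):
        # logits:      (B, C) class scores
        # heatmaps:    (B, 1, H, W) predicted suspect-region maps (pre-sigmoid)
        # labels:      (B,) class labels, available for all images
        # rough_masks: (B, 1, H, W) rough localizations, valid only where has_mask
        # has_mask:    (B,) bool, True for the annotated subset
        cls = F.cross_entropy(logits, labels)
        if has_mask.any():
            # Supervise heatmaps only on the annotated subset.
            loc = F.binary_cross_entropy_with_logits(
                heatmaps[has_mask], rough_masks[has_mask])
        else:
            loc = torch.zeros((), device=logits.device)
        return cls + alpha * loc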

Year of completion: November 2021
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

Interactive Layout Parsing of Highly Unstructured Document Images


Abhishek Trivedi

Abstract

Ancient historical handwritten documents are among the earliest forms of written media and contribute to the most valuable cultural and natural heritage of many countries globally. They hold early written knowledge about subjects like science, medicine, Buddhist doctrines, and astrology. India has the most extensive collection of manuscripts, and studies have been conducted on their digitization to pass their wealth of wisdom on to future generations. Targeted annotation systems for automatic multi-region instance segmentation of their document images exist, but their layout predictions are of relatively inferior quality compared to their human-annotated counterparts. Precise boundary annotations of image regions in historical document images are crucial for downstream applications like OCR, which rely on region-class semantics. Some document collections contain densely laid out, highly irregular, and overlapping multi-class region instances with a large range in aspect ratio. Addressing this, a web-based layout annotation and analytics system is proposed in this thesis. The system, called HInDoLA, features an intuitive annotation GUI, a graphical analytics dashboard, and interfaces with machine-learning-based intelligent modules on the backend. HInDoLA has successfully helped us create the first-ever large-scale dataset for layout parsing of Indic palm-leaf manuscripts, named Indiscapes. Keeping the non-technical background of domain experts in mind, the tool offers an interactive and relatively fast annotation process with the help of two modes, namely a Fully Automatic mode and a Semi-Supervised Intelligent mode. We then discuss the superiority of the semi-supervised approach over fully automatic approaches for historical document annotation.

Fully automatic boundary estimation approaches tend to be data-intensive, cannot handle variable-sized images, and produce sub-optimal results for the images mentioned above. BoundaryNet, a novel resizing-free approach for high-precision semi-automatic layout annotation, is another main contribution of this thesis. An attention-guided skip network first processes the variable-sized user-selected region of interest. The network optimization is guided via Fast Marching distance maps to obtain a good-quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph Convolution Network optimized using a Hausdorff loss to get the final region boundary. Experiments on a challenging manuscript image dataset demonstrate that BoundaryNet outperforms strong baselines and produces high-quality semantic region boundaries. Qualitatively, our approach generalizes across multiple document image datasets containing different script systems and layouts, all without additional fine-tuning.
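The Hausdorff loss that guides BoundaryNet's graph network can be illustrated with a simple point-set version: the averaged symmetric distance between predicted and ground-truth boundary points. This is a minimal sketch under the assumption of a plain averaged-Hausdorff formulation; the exact variant used in the thesis may differ.

    # Averaged symmetric Hausdorff-style loss between two boundary polygons,
    # written as a differentiable point-set distance so it can train a network
    # that regresses boundary point locations.
    import torch

    def avg_hausdorff_loss(pred, gt):
        # pred: (N, 2) predicted boundary points, gt: (M, 2) ground-truth points
        d = torch.cdist(pred, gt)  # (N, M) pairwise Euclidean distances
        # Mean nearest-neighbour distance in both directions.
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

    pred = torch.rand(100, 2, requires_grad=True)
    gt = torch.rand(120, 2)
    loss = avg_hausdorff_loss(pred, gt)
    loss.backward()  # differentiable, hence usable for boundary refinement

Unlike the max-based Hausdorff distance, the averaged form gives smooth gradients from every boundary point, which is why it is a common training surrogate.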

Year of completion: May 2022
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Downloads

thesis

Improving the Efficiency of Fingerprint Recognition Systems


Additya Popli

Abstract

Humans have used different characteristics to identify each other since early times. This practice of identification based on person-specific features, called biometric traits, has developed over time to use more sophisticated techniques and characteristics like fingerprints, irises, and gait in order to improve identification performance. Fingerprints, due to their distinctiveness, persistence over time, and ease of capture, have become one of the most widely used biometric traits for identification. However, with this ever-increasing dependence on fingerprint biometrics, it is very important to ensure the security of these recognition systems against potential attackers. One of the most common and successful ways to circumvent these systems is through the use of fake or artificial fingers, synthesized using commonly available materials like silicone and clay, to match the real fingerprint of a particular person. Most fingerprint recognition systems employ a spoof detection module to filter out these fake fingerprints. While they seem to work well in general, it is a well-established fact that spoof detectors cannot identify spoof fingerprints synthesized using "unseen" or "novel" spoof materials, i.e., materials that were not available during the training phase of the detector. While it is possible to synthesize a few fingers using the various available materials, present-day spoof detectors require a large number of samples for training, which is practically infeasible due to the high cost and complexity of fabricating spoof fingers. In this thesis, we propose a method for creating artificial fingerprint images using only a very limited number of artificial fingers created from a specific material. We train a style-transfer network on the available spoof fingerprint images; it learns to extract material properties from an image and then, for each material, uses the limited set of spoof fingerprint images to generate a huge dataset of artificial fingerprint images without actually fabricating spoof fingers. These artificial fingerprint images can then be utilised by the spoof detector for training. Through our experiments, we show that using these artificially generated spoof images for training can improve the performance of existing spoof detectors on unseen spoof materials.

Another major limitation of present-day recognition systems is their high resource requirements. Most fingerprint recognition systems use a spoof detector as a separate system, either in series or in parallel with a fingerprint matcher, leading to very high memory and time requirements during inference. To overcome this limitation, we explore the relationship between these two tasks in order to develop a common module capable of performing both spoof detection and matching. Our experiments show a high level of correlation between the features extracted for spoof detection and matching. We propose a new joint model which achieves fingerprint spoof detection and matching performance on various datasets similar to current state-of-the-art methods while using 50% less time and 40% less memory, thus providing a significant advantage for recognition systems deployed on resource-constrained devices like mobile phones.
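The joint model's core design, a single shared backbone feeding both a spoof-detection head and a matching-embedding head, can be sketched as below. The MobileNetV2 backbone, head sizes, and embedding dimension are illustrative assumptions rather than the thesis architecture; the point is that the two tasks share almost all computation, which is where the time and memory savings come from.

    # Hedged sketch of a joint spoof-detection + matching model: one shared
    # feature extractor, two lightweight task heads.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class JointSpoofMatcher(nn.Module):
        def __init__(self, embed_dim=256):
            super().__init__()
            backbone = models.mobilenet_v2(weights=None)
            self.features = backbone.features              # shared extractor
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.spoof_head = nn.Linear(1280, 2)           # live vs. spoof
            self.match_head = nn.Linear(1280, embed_dim)   # identity embedding

        def forward(self, x):
            f = self.pool(self.features(x)).flatten(1)     # (B, 1280)
            emb = nn.functional.normalize(self.match_head(f), dim=1)
            return self.spoof_head(f), emb

    model = JointSpoofMatcher()
    logits, emb = model(torch.randn(1, 3, 224, 224))
    # Spoof decision: argmax over the two-way logits.
    # Matching: cosine similarity between the embeddings of two prints.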

Year of completion: December 2021
Advisor: Anoop M Namboodiri

Related Publications


Downloads

thesis

Pose Based Action Recognition: Novel Frontiers for Fully Supervised Learning and Language Aided Generalised Zero-Shot Learning


Pranay Gupta

Abstract

Action recognition is indispensable not only to the umbrella field of computer vision but also to a multitude of allied fields such as video surveillance, human-computer interaction, robotics, and human-robot interaction. Typically, action recognition is performed on RGB videos; however, in recent years, skeleton-based action recognition has also gained a lot of traction. Much of this is owed to the development of frugal motion capture systems, which enabled the curation of large-scale skeleton action datasets. In this thesis, we focus on skeleton-based human action recognition. We begin our explorations with skeleton-action recognition in the wild by introducing Skeletics-152, a curated, 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. We also introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and of interpretative dance performances. We benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. The results from benchmarking the top performers of NTU-120 on the newly introduced datasets reveal the challenges and domain gap induced by actions in the wild. Overall, our work characterizes the strengths and limitations of existing approaches and datasets, and the introduced datasets enable new frontiers for human action recognition.

When moving from traditional supervised recognition to the more challenging zero-shot recognition, the language component of the action name becomes important. Focusing on this, we introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between the action sequence embedding and the embeddings of Parts-of-Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition. Our design choices enable SynSE to generalize compositionally, i.e., to recognize sequences whose action descriptions contain words not encountered during training. We also extend our approach to the more challenging Generalized Zero-Shot Learning (GZSL) problem via a confidence-based gating mechanism. We are the first to present zero-shot skeleton action recognition results on the large-scale NTU-60 and NTU-120 skeleton action datasets with multiple splits. Our results demonstrate SynSE's state-of-the-art performance in both the ZSL and GZSL settings compared to strong baselines on the NTU-60 and NTU-120 datasets.

3-D virtual avatars are extensively used in gaming, educational animation, and physical exercise coaching applications. The development of these visually interactive systems relies heavily on how humans perceive the actions performed by the 3-D avatars. To this end, we perform a short user study to gain insights into the recognizability of human actions performed virtually. Our results reveal that actions performed by 3-D avatars are significantly easier to recognize than those performed by 3-D skeletons. Concrete actions, i.e., actions which can only be performed using a fixed set of movements, are recognized more quickly and accurately than abstract actions. Overall, in this thesis we study various new frontiers in skeleton action recognition by means of novel datasets and tasks. We drift from unimodal approaches to a multimodal setup by incorporating language in a unique syntactically aware fashion, with the hope of utilizing similar ideas in more challenging problems like skeleton action generation.
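The confidence-based gating used to extend SynSE to GZSL can be illustrated compactly: a sample is assigned a seen class only if the seen-class classifier is sufficiently confident, and is otherwise routed to the zero-shot classifier over unseen classes. The sketch below assumes softmax confidence and a fixed threshold; both are illustrative choices, not the thesis's exact gating function.

    # Minimal sketch of confidence-based gating for generalized zero-shot
    # recognition: route low-confidence samples to the zero-shot branch.
    import torch

    def gated_predict(seen_logits, zsl_logits, num_seen, tau=0.7):
        # seen_logits: (B, S) scores over seen classes
        # zsl_logits:  (B, U) scores over unseen classes
        # Returns class indices in a combined label space [0, S + U).
        conf, seen_pred = seen_logits.softmax(dim=1).max(dim=1)
        zsl_pred = zsl_logits.argmax(dim=1) + num_seen  # offset unseen labels
        return torch.where(conf >= tau, seen_pred, zsl_pred)

    seen = torch.randn(4, 48)    # e.g., 48 seen classes
    unseen = torch.randn(4, 12)  # e.g., 12 unseen classes
    print(gated_predict(seen, unseen, num_seen=48))

In practice the threshold tau would be tuned on a validation split to balance seen- and unseen-class accuracy.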

Year of completion: May 2022
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Downloads

thesis

More Articles ...

1. Exploring Data Driven Graphics for Interactivity
2. A study of Automatic Segmentation of 3-D Brain MRI and its application to Deformable Registration
3. Saliency Estimation in Videos and Images
4. Fast and Accurate Image Recognition