
Towards Enhancing Semantic Segmentation in Resource-Constrained Settings


Ashutosh Mishra

Abstract

Understanding the semantics of a scene in order to fully automate the decision process of self-driving cars has become a crucial task in computer vision. With recent progress in autonomous driving, and with many semantic segmentation datasets for road-scene understanding being proposed, semantic segmentation of road scenes has evolved into an important problem to tackle. However, training semantic segmentation models is resource intensive, often requiring multi-GPU training, which becomes a bottleneck to quickly reproducing results. This thesis discusses these challenges and provides solutions that reduce the training time of segmentation models by introducing two small-scale datasets. Additionally, the thesis explores the potential of neural architecture search and automatic pruning techniques to create efficient segmentation modules in resource-constrained settings.

Chapter 2 of the thesis introduces the problem of semantic segmentation and discusses deep learning approaches to supervised semantic segmentation. We briefly discuss the different evaluation metrics used and also touch upon the statistics of the various datasets available in the literature for training semantic segmentation models.

Chapter 3 of the thesis explains the need for a dataset based on the Indian road scenario. Most datasets in the literature are captured in Western settings with well-defined traffic participants, delineated boundaries, etc., which seldom hold in the Indian setting. We describe the annotation pipeline, along with the quality-check framework used to annotate the dataset. Although the IDD dataset [121] caters to the Indian setting, it is still quite resource intensive in terms of GPU computation; hence, there is a need for a smaller-resolution dataset with a reduced label set for rapid prototyping. We introduce our proposed datasets and provide a detailed set of experiments and statistical comparisons with existing datasets to substantiate our claims about their usefulness. We also show through experiments that models trained on our datasets can be deployed on low-resource hardware such as a Raspberry Pi. At the end of this chapter, we also look into the significance of the proposed datasets in facilitating challenges at two prominent conferences in 2019: the International Conference on Computer Vision (ICCV) and the National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics (NCVPRIPG). These challenges aimed to address semantic segmentation in resource-constrained settings, inviting innovative architectures capable of achieving good accuracy on the proposed datasets. We also discuss the potential application of these datasets in teaching semantic segmentation through a course of notebooks introducing traditional as well as deep learning-based segmentation methods. These notebooks are plug-and-play: the first three can run on a laptop CPU, while the fourth requires GPU access.
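
The evaluation metrics mentioned above include the standard segmentation score, mean Intersection-over-Union (mIoU). As a minimal illustration of how it is typically computed from a pixel-wise confusion matrix (a sketch of the standard metric, not code from the thesis; all names are ours):

    import numpy as np

    def mean_iou(pred, gt, num_classes, ignore_index=255):
        """Mean Intersection-over-Union from a pixel-wise confusion matrix."""
        mask = gt != ignore_index
        pred, gt = pred[mask], gt[mask]
        # Confusion matrix: rows = ground truth, cols = prediction.
        cm = np.bincount(gt * num_classes + pred,
                         minlength=num_classes ** 2).reshape(num_classes, num_classes)
        inter = np.diag(cm)                    # true positives per class
        union = cm.sum(0) + cm.sum(1) - inter  # pred + gt minus the overlap
        iou = inter / np.maximum(union, 1)     # guard against empty classes
        return iou[union > 0].mean()           # average over classes present

    # Example: two 4x4 label maps with 3 classes.
    pred = np.random.randint(0, 3, (4, 4))
    gt = np.random.randint(0, 3, (4, 4))
    print(mean_iou(pred, gt, num_classes=3))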

Year of completion: January 2024
Advisors: C V Jawahar, Girish Varma

Downloads

thesis

High-Quality 3D Fingerprint Generation: Merging Skin Optics, Machine Learning and 3D Reconstruction Techniques


Apoorva Srivastava

Abstract

Fingerprints are a widely recognized and commonly used method of identification. Contact-based fingerprints, which involve pressing the finger against a surface to obtain images, are a popular method of capturing fingerprints. However, this process has several drawbacks, including skin deformation, unhygienic conditions, and high sensitivity to the moisture content of the finger, all of which can negatively impact the accuracy of the fingerprint. Moreover, fingerprints are three-dimensional anatomical structures, and two-dimensional fingerprints do not capture the depth information of the finger ridges. While 3D fingerprint capture is less sensitive to skin moisture levels and avoids skin deformation, its adoption is limited by high cost and system complexity, mainly attributable to the use of multiple cameras, projectors, and sometimes synchronously moving mechanical parts. Photometric stereo offers a promising route to low-cost, simple sensors for high-quality 3D capture using only a single camera and a few LEDs. However, the method assumes that the surface being imaged is Lambertian, which is not the case for human fingers. Existing 3D fingerprint scanners based on photometric stereo also assume that the finger is Lambertian, resulting in poor reconstructions.

In this context, we introduce the Split and Knit algorithm (SnK), a 3D reconstruction pipeline based on photometric stereo for finger surfaces. The algorithm splits the reconstruction of the ridge-valley pattern and the finger shape, then combines them to obtain, for the first time, a full-finger 3D fingerprint reconstruction with a single camera. To reconstruct the ridge-valley pattern, SnK introduces an efficient way of estimating the direct illumination component using a trained U-Net without extra hardware, which reduces the non-Lambertian nature of the finger image and enables a higher-quality reconstruction of the entire finger surface. To obtain the finger shape with a single camera, the algorithm introduces two novel approaches: a) using IR illumination, and b) using a mirror and parametric modeling of the finger shape. Finally, we combine the overall finger shape and the ridge-valley point cloud to obtain a 3D finger phalange. The high-quality 3D reconstruction results in better matching accuracy of the captured fingerprints. Splitting the ridge-valley pattern from the finger provides an implicit way to convert a 3D fingerprint into a 2D fingerprint, making the SnK algorithm compatible with 2D fingerprint recognition systems. To apply the SnK algorithm to fingerprints, we designed a 3D-printed photometric stereo setup that captures contactless finger images and obtains their 3D reconstructions.
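
For context, the classical Lambertian photometric stereo that existing scanners rely on recovers per-pixel albedo and surface normals by least squares from images taken under known light directions. A minimal sketch of that textbook baseline, assuming calibrated distant lights (this is the assumption SnK improves upon, not the SnK pipeline itself):

    import numpy as np

    # Lambertian model: I_k(p) = albedo(p) * max(0, L_k . n(p)) for light k, pixel p.
    def photometric_stereo(images, lights):
        """images: (K, H, W) grayscale stack; lights: (K, 3) unit direction vectors."""
        K, H, W = images.shape
        I = images.reshape(K, -1)                        # (K, HW)
        # Least-squares solve lights @ g = I for g = albedo * normal.
        g, *_ = np.linalg.lstsq(lights, I, rcond=None)   # (3, HW)
        albedo = np.linalg.norm(g, axis=0)               # (HW,)
        normals = g / np.maximum(albedo, 1e-8)           # unit surface normals
        return albedo.reshape(H, W), normals.T.reshape(H, W, 3)

At least three non-coplanar light directions are needed for the system to be well posed. Real fingers violate the Lambertian assumption, which is precisely the error that SnK's direct-illumination estimation mitigates.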

Year of completion: August 2023
Advisor: Anoop M Namboodiri

Downloads

thesis

A Holistic Framework for Multimodal Ecosystem of Pictionary


Nikhil Bansal

Abstract

In AI, the ability of intelligent agents to model human players in games such as Backgammon, Chess, and Go has been an important metric for benchmarking progress. Fundamentally, the games mentioned above are competitive and zero-sum. In contrast, games such as Pictionary and Dumb Charades fall into the category of ‘social’ games, where the emphasis is on cooperative and co-adaptive game play in a relaxed setting. Such social games can form the basis for the next wave of game-driven progress in AI. Pictionary™, the popular sketch-based guessing game we employ as a use case, is a wonderful example of cooperative game play toward a shared goal in communication-restricted settings.

To enable the study of Pictionary and to understand various aspects of its game play, we designed a software ecosystem for a web-based online version of the game, dubbed PICTGUESS. To overcome several technological and logistic barriers that the actual game presents, we implemented a simplified setting for PICTGUESS wherein a game consists of a time-limited episode involving two players: a Drawer and a Guesser. The Drawer is tasked with conveying a given target phrase to the Guesser by sketching on a whiteboard within the time limit.

However, players occasionally draw atypical sketch content. While such content is sometimes relevant in the game context, it can also represent a rule violation and impair the game experience. To address such situations in a timely and scalable manner, we introduce DRAWMON, a novel distributed framework for automatic detection of atypical sketch content in concurrently occurring Pictionary game sessions. We build specialized online interfaces to annotate atypical sketch content, resulting in ATYPICT, the first-ever atypical sketch content dataset. We use ATYPICT to train CANVASNET, a deep neural network for atypical content detection, and utilize CANVASNET as a core component of DRAWMON. Our analysis of post-deployment game session data indicates DRAWMON's effectiveness for scalable monitoring and atypical sketch content detection. Beyond Pictionary, our contributions can also serve as a design guide for customized atypical content response systems involving shared and interactive whiteboards.
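
The abstract does not detail DRAWMON's internals; the sketch below only illustrates the monitoring pattern it describes: concurrent game sessions push canvas snapshots to a queue while a detector worker flags atypical content. All names and the placeholder detector are ours, not from the thesis:

    import queue
    import threading

    def detect_atypical(canvas_image):
        """Placeholder for a CanvasNet-style detector; returns a list of alerts."""
        # A real deployment would run a trained detection network here.
        return []

    def monitor(snapshots: queue.Queue):
        # Consume canvas snapshots coming from concurrent game sessions.
        while True:
            session_id, canvas = snapshots.get()
            if canvas is None:  # sentinel to stop the worker
                break
            for alert in detect_atypical(canvas):
                print(f"session {session_id}: atypical content {alert}")

    snapshots = queue.Queue()
    worker = threading.Thread(target=monitor, args=(snapshots,))
    worker.start()
    snapshots.put(("game-42", b"png-bytes-of-canvas"))  # a session submits a snapshot
    snapshots.put(("game-42", None))                    # shut the worker down
    worker.join()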

Year of completion: September 2023
Advisor: Ravi Kiran Sarvadevabhatla

Downloads

thesis

Deploying Multi-Camera Multi-Player Detection and Tracking Systems in Top View


Swetanjal Murati Dutta

Abstract

Object detection and tracking algorithms in computer vision have progressed remarkably in the past few years, owing to rapid advances in deep learning. These algorithms find numerous applications in surveillance, robotics, autonomous driving, sports analytics, and human-computer interaction, and they achieve near-perfect performance on the benchmark datasets they are evaluated on. However, deploying such algorithms in the real world poses a large number of challenges, and they do not work as well as their benchmark performance would suggest. In this thesis, we present details of deploying one such multi-camera multi-object detection and tracking system to track players in the bird's-eye (top) view at a sports event, specifically in the game of cricket. Our system generates a top-view map of the fielders from cameras placed on the crowd stands of a stadium, making it extremely cost-effective compared to using spider cameras. We present the challenges and hurdles we faced while deploying our system and the solutions we designed to overcome them. Ultimately, we tailored a neatly engineered end-to-end system that went live in the Asia Cup of 2022.

Deploying such a multi-camera detection and tracking system poses many challenges. The first relates to camera placement: we had to devise strategies to place cameras so that every object of interest is captured by one or more cameras, while ensuring the objects do not appear so small that a detector struggles to localize them. Constructing the bird's-eye view of the detected players required camera calibration, but in the real world we may not find reliable point correspondences to calibrate a camera; even when correspondences exist, they may be too hard to pinpoint accurately. Detecting players in this setup proved challenging because the objects of interest appeared very small in the camera views. Moreover, since the detections came from multiple cameras, algorithms had to be devised to associate them correctly across camera views. Tracking in this setting was even harder because we had to rely solely on motion cues: appearance features added no value, as all the players being tracked wore jerseys of the same color. Each of these challenges became harder still because of the real-time requirements of our use case. Finally, setting up the hardware architecture to receive live feeds from each camera in a synchronized manner, and implementing a fast communication protocol for transmitting data between the various system components, required careful design choices, all of which are presented in detail in this thesis.
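
Mapping detections into the top view rests on calibrating each camera against the ground plane. A minimal sketch of the standard planar-homography approach with OpenCV (the point coordinates are illustrative, not from the deployed system):

    import cv2
    import numpy as np

    # Four or more correspondences between the camera image and the top-view
    # ground plane (e.g., field markings); coordinates here are made up.
    img_pts = np.float32([[410, 620], [1510, 640], [1280, 300], [640, 290]])
    map_pts = np.float32([[0, 0], [68, 0], [68, 105], [0, 105]])  # metres

    H, _ = cv2.findHomography(img_pts, map_pts)

    # Project a detected player's foot point into the top-view map.
    foot = np.float32([[[900, 500]]])               # shape (1, 1, 2), as OpenCV expects
    top_view_xy = cv2.perspectiveTransform(foot, H)[0, 0]
    print(top_view_xy)                              # ground-plane position in metres

The foot point is used rather than the box centre because the homography is only valid for points that actually lie on the ground plane.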

Year of completion: November 2023
Advisor: Vineet Gandhi

Downloads

thesis

Towards Building Controllable Text-to-Speech Systems


Saiteja Kosgi

Abstract

Text-to-speech (TTS) systems convert any given text to speech and play a vital role in making human-computer interaction (HCI) possible. As humans, we don't rely on text (language) alone to communicate; we use many other mechanisms, such as voice, gestures, and expressions, to communicate efficiently. In natural language processing, vocabulary and grammar tend to take center stage, but those elements of speech tell only half the story: the affective prosody of speech provides larger context, gives meaning to words, and keeps listeners engaged. Current HCI systems largely communicate in text and lack much of the prosodic information that is crucial in a conversation. For HCI systems to communicate in speech, TTS systems should be able to synthesize speech that is both expressive and controllable. However, existing TTS systems learn the average variation of the dataset they are trained on, and therefore synthesize samples in a neutral way, without many prosodic variations. To this end, we develop a TTS system that can synthesize a given emotion, where the emotion is represented as a tuple of Arousal, Valence, and Dominance (AVD) values.

TTS systems involve many complexities. Training such a system requires very clean, noiseless data, which is difficult to collect; if the data is noisy, unwanted artifacts appear in the synthesized samples. Training emotion-based TTS models is considerably more difficult and not straightforward, since obtaining emotion-annotated data for the desired speaker is costly and highly subjective. Current emotion-based systems can synthesize emotions, but with limitations: (1) emotion controllability comes at the cost of quality; (2) they support only discrete emotions, which lack finer control; and (3) they cannot be generalized to new speakers without annotated emotion data. We propose a system that overcomes these problems by leveraging the large available corpus of noisy speech annotated with emotions. Even though the data is noisy, our technique trains an emotion-based TTS system that can synthesize the desired emotion without any loss of quality in the output.

We present a method to control the emotional prosody of TTS systems by using phoneme-level intermediate variances (pitch, energy, and duration) as levers, learning how these variances change with respect to emotion. We achieve finer control over the synthesized speech by using AVD values, which represent emotions in a 3D space. Our proposed method also does not require emotion-annotated data for the target speaker: once trained on emotion-annotated data, it can be applied to any system that predicts the variances as an intermediate step. Through thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation based on individual sentences to a complete evaluation of HCI systems, presenting a novel experimental setup in which a TTS system replaces an actor in offline and live conversations, with the emotion to be rendered either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted "human touch" to machine dialogue.
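
To make the lever mechanism concrete, the sketch below scales phoneme-level pitch, energy, and duration by emotion-dependent factors derived from an AVD tuple. The linear map and its coefficients are purely illustrative; the thesis learns this mapping from data:

    import numpy as np

    def apply_emotion(pitch, energy, duration, avd):
        """Scale phoneme-level variances by emotion. The coefficients here are
        illustrative placeholders; the actual mapping is learned from data."""
        arousal, valence, dominance = avd                 # each assumed in [-1, 1]
        pitch = pitch * (1.0 + 0.3 * arousal + 0.1 * valence)
        energy = energy * (1.0 + 0.4 * arousal + 0.1 * dominance)
        duration = duration * (1.0 - 0.2 * arousal)       # higher arousal, faster speech
        return pitch, energy, duration

    # Per-phoneme predictions from a variance-adaptor-style TTS front end (dummy values).
    pitch = np.array([180.0, 200.0, 190.0])    # Hz
    energy = np.array([0.6, 0.8, 0.7])
    duration = np.array([0.09, 0.12, 0.08])    # seconds
    print(apply_emotion(pitch, energy, duration, avd=(0.7, 0.4, 0.1)))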

Year of completion: May 2023
Advisor: Vineet Gandhi

Downloads

thesis

More Articles …

1. Effective and Efficient Attribute-aware Open-set Face Verification
2. Face Reenactment: Crafting Realistic Talking Heads for Enhanced Video Communication and Beyond
3. Fingerprint Disentanglement for Presentation Attack Generalization Across Sensors and Materials
4. Extending PRT Framework for Lowly-Tessellated and Continuous Surfaces