
Behaviour Detection of Vehicles Using Graph Convolutional Networks with Attention


Sravan Mylavarapu

Abstract

Autonomous driving has evolved with the advent of deep learning and increased computational capacity over the past two decades. Among the many tasks that support autonomous navigation, understanding on-road vehicle behaviour from a continuous sequence of sensor data (camera or radar) is an important and useful one. Determining the state and motion of other vehicles aids decision making for the ego-vehicle (our own vehicle). Such systems can form part of driver-assistance systems or of path planners that help vehicles avoid obstacles. Examples of such decisions include applying the brakes when an external agent moves into our lane or approaches head-on, or keeping a safe distance from an aggressive driver who changes lanes and overtakes frequently. Specifically, our proposed methods classify the behaviours of on-road vehicles into one of the following categories: {Moving Away, Moving Towards Us, Parked/Stationary, Lane Change, Overtake}. Many current methods leverage 3D depth information obtained with expensive equipment, such as Lidar, together with complex algorithms. In this thesis, we propose a simpler pipeline for understanding vehicle behaviour from a monocular (single-camera) image sequence or video, which eases equipment installation and benefits driver-assistance systems. A simple video, together with camera parameters and scene semantics (objects detected in the images), is used to obtain information about the objects of interest (vehicles) and other static objects in the scene, such as lanes and poles. We treat detecting objects across frames as the pre-processing pipeline and propose two main methods for behaviour detection of vehicles: 1. Spatio-temporal MRGCN (Chapter 3) and 2. Relational-Attentive GCN (Chapter 4). We divide the process of identifying behaviours into first learning positional information of vehicles, and then observing them across time to make a prediction.
The positional information is encoded by a Multi-Relational Graph Convolutional Network (MR-GCN) in both methods. Temporal information is encoded by recurrent networks in Method 1, while it is formulated inside the graph in Method 2. Our experiments also show that attention is an important component of both methods. The proposed frameworks classify a variety of vehicle behaviours with high fidelity on diverse datasets that include European, Chinese and Indian on-road scenes. The frameworks also allow seamless transfer of models across datasets without entailing re-annotation, retraining or even fine-tuning. We report performance gains over baselines and detail a variety of ablations to showcase the efficacy of the framework.
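As an illustration of the multi-relational encoding step, a minimal MR-GCN-style layer can be sketched as follows; the relation names, shapes and normalisation are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def mr_gcn_layer(X, adjacency, rel_weights, W_self):
    """One simplified multi-relational graph convolution step.

    X          : (N, F) node features, one node per tracked object.
    adjacency  : dict mapping a relation name (e.g. 'front-of') to an
                 (N, N) adjacency matrix for that relation.
    rel_weights: dict mapping the same relation names to (F, F_out)
                 weight matrices, one per relation.
    W_self     : (F, F_out) weight matrix for each node's own features.
    """
    out = X @ W_self
    for rel, A in adjacency.items():
        A = np.asarray(A, dtype=float)
        # Row-normalise so each node averages over its neighbours under
        # this relation (isolated nodes contribute nothing).
        deg = A.sum(axis=1, keepdims=True)
        A_norm = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
        out = out + (A_norm @ X) @ rel_weights[rel]
    return np.maximum(out, 0.0)  # ReLU non-linearity
```

Stacking a few such layers, then feeding the per-node embeddings to a temporal model, mirrors the two-stage position-then-time design described above.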

Year of completion: February 2021
Advisors: Anoop M Namboodiri, Madhava Krishna

Related Publications


Downloads

thesis

Computer Vision for Atmospheric Turbulence: Generation, Restoration, and its Applications


Shyam Nandan Rai

Abstract

Real-world images often suffer from variations in weather conditions such as rain, fog, snow, and temperature. These variations adversely affect the performance of computer vision models in real-world scenarios. The problem can be bypassed by collecting and annotating images for each weather condition; however, doing so is an extremely tedious task that is both time-consuming and expensive. In this work, we address the aforementioned problems. Among all weather conditions, we focus on the distortions caused by high temperature, also known as atmospheric turbulence. These distortions introduce geometric deformations around object boundaries in an image, which cause vision algorithms to perform poorly and pose a major concern. Hence, in this thesis, we address the problems of artificially generating atmospheric turbulence and of restoring images degraded by it. In the first part of our work, we attempt to model atmospheric turbulence, since such models are critical to extending computer vision solutions developed in the laboratory to real-world use cases, and simulating atmospheric turbulence with statistical models or computer graphics is often computationally expensive. To overcome this problem, we train a generative adversarial network (GAN) that outputs an atmospheric turbulent image while using fewer computational resources than traditional methods. We propose a novel loss function to efficiently learn atmospheric turbulence at a finer level. Experiments show that, using the proposed loss function, our network outperforms existing state-of-the-art image-to-image translation networks in turbulent image generation. In the second part of the thesis, we address the ill-posed problem of restoring images degraded by atmospheric turbulence.
We propose a deep adversarial network to recover images distorted by atmospheric turbulence and show the applicability of the restored images in several tasks. Unlike previous methods, our approach neither uses prior knowledge about atmospheric turbulence conditions at inference time nor requires fusing multiple images to produce a single restored image. To train our models, we synthesize turbulent images through a series of efficient 2D operations. Thereafter, we run inference on real and synthesized turbulent images using our trained models. Our final restoration models, DT-GAN+ and DTD-GAN+, qualitatively and quantitatively outperform general state-of-the-art image-to-image translation models. The improved performance is due to optimized residual structures combined with channel attention and a sub-pixel mechanism, which exploit the information between channels and remove atmospheric turbulence at a finer level. We also perform extensive experiments on the restored images by utilizing them in downstream tasks such as classification, pose estimation, semantic keypoint estimation, and depth estimation. In the third part of our work, we study the problem of adapting semantic segmentation models to hot-climate cities. This issue can be circumvented by collecting and annotating images in such weather conditions and training segmentation models on them; however, semantically annotating images for every environment is painstaking and expensive. Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts: a restoration network that removes the geometric distortions caused by hot weather, and an adaptive segmentation network trained with an additional loss to adapt to the statistics of the ground-truth segmentation map.
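Turbulent-image synthesis of this kind can be caricatured as warping an image by a smooth random displacement field. The sketch below is only a toy stand-in for the thesis's actual 2D operations; the smoothing scheme and parameter choices are assumptions.

```python
import numpy as np

def simulate_turbulence(image, strength=2.0, seed=0):
    """Toy turbulence: resample an image through a smooth random
    per-pixel displacement field of roughly +/- `strength` pixels."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]

    def smooth_field():
        # Repeated 4-neighbour averaging acts as a cheap low-pass
        # filter; rescale so the field's std is ~`strength` pixels.
        n = rng.standard_normal((h, w))
        for _ in range(10):
            n = (n + np.roll(n, 1, 0) + np.roll(n, -1, 0)
                 + np.roll(n, 1, 1) + np.roll(n, -1, 1)) / 5.0
        return strength * n / (n.std() + 1e-8)

    dy, dx = smooth_field(), smooth_field()
    # Read each output pixel from its randomly displaced source location.
    ys = np.clip(np.arange(h)[:, None] + dy, 0, h - 1).round().astype(int)
    xs = np.clip(np.arange(w)[None, :] + dx, 0, w - 1).round().astype(int)
    return image[ys, xs]
```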
We train our framework on the Cityscapes dataset, achieving a total IoU gain of 12.707 over standard segmentation models. In the last part of our work, we improve the performance of our joint restoration and segmentation network via a feedback mechanism. In the previous approach, the restoration network does not learn directly from the errors of the segmentation network; in other words, the restoration network is not task-aware. Hence, we propose a semantic feedback learning approach, which improves semantic segmentation by feeding a feedback response into the restoration network. This response works as an attend-and-fix mechanism, focusing on those areas of an image where restoration needs improvement. We also propose two loss functions, Iterative Focal Loss (iFL) and Class-Balanced Iterative Focal Loss (CB-iFL), specifically designed to improve the performance of the feedback network. These losses focus more on samples that are continuously misclassified over successive iterations. Our approach gives a gain of 17.41 mIoU over the standard segmentation model, including an additional gain of 1.9 mIoU from CB-iFL, on the Cityscapes dataset.
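iFL and CB-iFL build on the standard focal loss, which down-weights already well-classified samples. A minimal sketch of that base loss follows; the iteration-dependent and class-balancing terms of the thesis's own losses are not reproduced here.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Standard focal loss: scales cross-entropy by (1 - p_t)^gamma so
    confident, correct predictions contribute little and training
    focuses on hard (often misclassified) samples.

    probs  : (N, C) per-class probabilities (rows sum to 1).
    targets: (N,) integer class labels.
    """
    p_t = probs[np.arange(len(targets)), targets]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + 1e-12)))
```

An iterative variant in the spirit of iFL would increase the weight of a sample each feedback iteration in which it remains misclassified.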

Year of completion: December 2020
Advisors: C V Jawahar, Vineeth Balasubramanian, Anbumani Subramanian

Related Publications


Downloads

thesis

Monocular 3D Human Body Reconstruction


Abbhinav Venkat

Abstract

Monocular 3D human reconstruction is a highly relevant problem owing to its numerous applications in the entertainment industry, e-commerce, health care, mobile AR/VR platforms, etc. However, it is severely ill-posed due to self-occlusions from complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, the single viewpoint, etc. Conventional approaches address these challenges with different sensing systems: marker-based and marker-less multi-view cameras, inertial sensors, and 3D scanners. Although effective, such methods are often expensive and have limited wide-scale applicability. In an attempt to produce scalable solutions, a few works have focused on fitting statistical body models to monocular images, but these are susceptible to a costly optimization process. Recent efforts use data-driven algorithms such as deep learning to learn priors directly from data; however, they focus on template-model recovery or rigid-object reconstruction, or propose paradigms that do not directly extend to recovering personalized models. To predict accurate surface geometry, our first attempt was VolumeNet, which predicts a 3D occupancy grid from a monocular image and was the first model of its kind for non-rigid human shapes at the time. To circumvent the ill-posed nature of the problem (aggravated by an unbounded 3D representation), we follow the ideology of providing maximal training priors through our unique training paradigms, so as to enable testing with minimal information. As we impose no body-model-based constraint, we are able to recover deformations induced by free-form clothing. Further, we extended VolumeNet to PoShNet by decoupling pose and shape: we learn the volumetric pose first and use it as a prior for learning the volumetric shape, thereby recovering a more accurate surface. Although volumetric regression recovers a more accurate surface, it does so without an animatable skeleton.
Further, such methods yield low-resolution reconstructions at a higher computational cost (regression over the cubic voxel grid) and often suffer from an inconsistent topology with broken or partial body parts. Hence, statistical body models become a natural choice to offset the ill-posed nature of the problem. Although theoretically low-dimensional, such models have been challenging to learn due to the complex non-linear mapping from the image to the relative axis-angle representation; most solutions therefore rely on different projections of the underlying mesh (2D/3D keypoints, silhouettes, etc.). To simplify the learning process, we propose the CR framework, which uses classification as a prior to guide the regression's learning. Although recovering personalized models with high-resolution meshes is not possible in this space, the framework shows that learning such template models can be difficult without additional supervision. As an alternative to directly learning parametric models, we propose HumanMeshNet, which learns an "implicitly structured point cloud", using the mesh topology as a prior to enable better learning. We hypothesize that instead of learning the highly non-linear SMPL parameters, learning the corresponding point cloud (although high-dimensional) and enforcing the same parametric template topology on it is an easier task. This paradigm can, in theory, learn local surface deformations that the body-model-based PCA space cannot capture. Going ahead, producing high-resolution meshes (with accurate geometric details) is a natural extension that is easier in 3D space than in the parametric one. In summary, in this thesis we attempt to address several of the aforementioned challenges and empower machines with the capability to interpret a 3D human body model (pose and shape) from a single image in a manner that is non-intrusive, inexpensive and scalable.
In doing so, we explore different 3D representations capable of producing accurate surface geometry, with the long-term goal of recovering personalized 3D human models.
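The occupancy-grid representation that a VolumeNet-style network regresses can be illustrated by voxelizing a 3D point set; the grid resolution and normalisation below are assumptions for illustration only.

```python
import numpy as np

def voxelize(points, grid=32):
    """Convert a 3D point set into a binary occupancy grid: each voxel
    is True iff at least one point falls inside it."""
    pts = np.asarray(points, dtype=float)
    mins = pts.min(axis=0)
    extent = np.maximum(pts.max(axis=0) - mins, 1e-9)
    # Map every point into [0, grid-1] integer voxel coordinates.
    idx = ((pts - mins) / extent * (grid - 1)).round().astype(int)
    vol = np.zeros((grid, grid, grid), dtype=bool)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return vol
```

A network predicting such a grid outputs one occupancy probability per voxel, which is why resolution grows cubically with grid size, the computational drawback noted above.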

Year of completion: November 2020
Advisor: Avinash Sharma

Related Publications


Downloads

thesis

Neural and Multilingual Approaches to Machine Translation for Indian Languages and its Applications


Jerin Philip

Abstract

Neural Machine Translation (NMT), together with multilingual formulations, has arisen as the de-facto standard for translating a sentence from a source language to a target language. However, unlike for many western languages, the available resources, such as training data of parallel sentences or trained models that can be used to build and demonstrate applications in other domains, are limited for the languages of the Indian subcontinent. This work takes a major step towards closing this gap. We describe the development of state-of-the-art translation solutions for 10 Indian languages and English, in four parts:
1. Considering the Hindi-English language pair, we successfully develop an NMT solution for a narrow domain, demonstrating its application in translating cricket commentary.
2. Through heavy data augmentation, we extend the above to the general domain and build a state-of-the-art MT system for the Hindi-English language pair. Further, we extend to five more languages by taking advantage of multiway formulations.
3. We demonstrate the application of NMT in contributing more resources to the already resource-scarce field, expanding to 11 languages, and its application in the multimodal task of translating a talking face to a target language with lip synchronization.
4. Finally, we iteratively improve both the data situation and translation performance for 11 Indian languages, placing our models within a standardized, comparable set of metrics so that future advances in the space can comprehensively evaluate and compare against them.

Year of completion: August 2020
Advisors: C V Jawahar, Vinay P. Namboodiri

Related Publications


Downloads

thesis

Lip-syncing Videos In The Wild


Prajwal K R

Abstract

The widespread access to the Internet has led to a meteoric rise in audio-visual content consumption. Our habits have shifted from listening to podcasts and radio broadcasts to watching videos on YouTube, and we now increasingly prefer the highly engaging nature of video calls over plain voice calls. Given this considerable shift in the desire for audio-visual content, there has also been a surge in video content creation to cater to these consumption needs. Within this fabric of video content creation, especially of videos containing people talking, lies the problem of making such videos accessible across language barriers. If we want to translate a deep learning lecture video from English to Hindi, not only the speech but also the visual stream, specifically the lip movements, must be translated. Learning to lip-sync arbitrary videos to any desired target speech is a problem with several applications, ranging from video translation to readily creating new content that would otherwise require humongous effort. However, speaker-independent lip synthesis for any voice and language is a very challenging task. In this thesis, we tackle the problem of lip-syncing videos in the wild to any given target speech. We propose two new models in this space: one that significantly improves generation quality, and another that significantly improves lip-sync accuracy. In the first model, LipGAN, we identify key issues that plague current approaches to speaker-independent lip synthesis and prevent them from reaching the generation quality of speaker-specific models. Notably, ours is the first model to generate face images that can be pasted back into the video frame; this feature is crucial for all real-world applications where the face is just a small part of the content being displayed.
We show that our improvements in quality enable multiple real-world applications that have not been demonstrated in any previous lip-sync work. In the second model, Wav2Lip, we investigate why current models are inaccurate when lip-syncing arbitrary talking-face videos, and hypothesize that the reason is weak penalization of inaccurate lip sync during training. This finding allows us to create a lip-sync model that generates lip-synced videos for any identity and voice with remarkable accuracy and quality. We re-think the current evaluation framework for this task and propose multiple new benchmarks, two new metrics, and a Real-world lip-Sync Evaluation Dataset (ReSyncED). Using our model, we show applications in lip-syncing dubbed movies and animating real CGI movie clips to new speech, and we demonstrate a futuristic video-call application useful under poor network connections. Finally, we present the two major applications our model can impact the most: social media content creation and personalization, and video translation. We hope that our advances in lip synthesis open up new avenues for research in talking-face generation from speech.
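Lip-sync experts of the SyncNet family judge audio-video correspondence by the similarity of paired window embeddings, which is what a stronger sync penalty builds on. A minimal sketch of such a score follows; the names and scaling are illustrative, not Wav2Lip's actual code.

```python
import numpy as np

def sync_score(video_emb, audio_emb):
    """Cosine-similarity score between a video-window embedding and an
    audio-window embedding, mapped from [-1, 1] to [0, 1]; higher means
    the lip movements better match the speech."""
    v = np.asarray(video_emb, dtype=float)
    a = np.asarray(audio_emb, dtype=float)
    cos = v @ a / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8)
    return float((cos + 1.0) / 2.0)
```

During training, penalizing the generator in proportion to (1 - score) gives the strong sync supervision that the weak penalization described above lacks.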

Year of completion: August 2020
Advisors: C V Jawahar, Vinay P. Namboodiri

Related Publications


Downloads

thesis

More Articles …

1. Multiscale Two-view Stereo using Convolutional Neural Networks for Unrectified Images
2. Semantic Edge Labeling using Depth cues
3. A New Algorithm for Ray Tracing Synthetic Light Fields on the GPU
4. Image Representations for Style Retrieval, Recognition and Background Replacement Tasks