
Driver Attention Monitoring using Facial Features


Isha Dua

Abstract

How can we assess the quality of human driving using AI? Driver inattention is one of the leading causes of vehicle crashes and incidents worldwide. It includes driver fatigue, which leads to drowsiness, and driver distraction, for example due to cellphone use or rubbernecking, all of which result in a lack of situational awareness. Hitherto, techniques for monitoring driver attention have evaluated factors such as fatigue and distraction independently. However, to develop a robust driver attention monitoring system, all the factors affecting a driver's attention need to be analyzed holistically. In this thesis, we present two novel approaches for on-road driver attention analysis: one using driver video alone, and one fusing driver and road video.
In the first approach, we propose a driver attention rating system that leverages the front camera of a windshield-mounted smartphone to monitor driver attention by combining several features. We derive a driver attention rating by fusing spatio-temporal features based on the driver's state and behavior, such as head pose, eye gaze, eye closure, yawns, and cellphone use. We present several architectures for feature aggregation, such as AutoRate and Attention-based AutoRate. We perform an extensive evaluation of the feature aggregation networks on real-world driving data, as well as on data from controlled, static vehicle settings, collected from 30 drivers in a large city. We compare the proposed method's automatically generated rating with the scores given by 5 human annotators, using the kappa coefficient, an evaluation metric for inter-rater agreement, to compare the generated rating with the ratings provided by the human annotators. We observe that Attention-based AutoRate outperforms the other proposed feature aggregation designs by 10%. Further, we use the learned temporal and spatial attention to visualize the key frame and the key action, which justifies the model's predicted rating. Finally, to provide driver-specific results, we fine-tune the Attention-based AutoRate model on data from an individual driver to give a personalized driving experience.
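The abstract does not detail the AutoRate internals, so the following is only a minimal sketch, in PyTorch, of the general pattern it describes: per-frame driver features (head pose, gaze, eye closure, yawn and cellphone scores) are aggregated with a learned temporal attention layer into a single attention rating. All layer names and sizes here are assumptions, not the thesis's actual architecture.

import torch
import torch.nn as nn

# Illustrative sketch only -- not the thesis's AutoRate implementation.
# Assumes each frame has already been reduced to a small feature vector.
class AttentionRater(nn.Module):
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # temporal encoding
        self.attn = nn.Linear(hidden, 1)                           # per-frame attention score
        self.head = nn.Linear(hidden, 1)                           # final driver rating

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)                   # (batch, frames, hidden)
        w = torch.softmax(self.attn(h), dim=1)   # temporal attention weights
        pooled = (w * h).sum(dim=1)              # attention-weighted clip summary
        return self.head(pooled).squeeze(-1), w.squeeze(-1)

model = AttentionRater()
rating, frame_weights = model(torch.randn(4, 300, 32))  # 4 clips of 300 frames each

The frame weights returned alongside the rating play the role of the temporal attention the abstract uses to visualize key frames; the inter-rater agreement with the human annotators can be computed with a standard implementation such as sklearn.metrics.cohen_kappa_score.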

Year of completion: June 2020
Advisor: Prof. C.V. Jawahar

Related Publications


Downloads

thesis

An Investigation of the Annotated Data Sparsity Problem in the Medical Domain


Pujitha Appan Kandala

Abstract

Diabetic retinopathy (DR) is the most common eye disease in people with diabetes. It affects them for a significant number of years and can lead to permanent blindness if left untreated. Early detection and treatment of DR are therefore of utmost importance for the prevention of blindness, and automatic disease detection and classification have been attracting much interest. High performance is critical for the adoption of such systems, which generally rely on training with a wide variety of annotated data, yet such varied annotated data is very scarce in medical imaging. The main focus of this thesis is to deal with this sparsity of annotated data and to develop computer-aided diagnostic (CAD) systems that require less annotated data yet achieve high accuracy. We propose three different solutions to this problem.

First, we propose a semi-supervised framework that allows unlabelled data to be included in training. We use a co-training framework in which features are extracted from a limited training set, independent models are learnt on each of the features, and the models are then used to predict labels for new data. The most confidently labelled images from the unlabelled set are added back to the training set and the process is repeated, thus expanding the number of known labels. This framework is showcased on retinal neovascularization (NV), a critical stage of proliferative DR. The analysis of the results for NV detection showed an AUC of 0.985 with a sensitivity of 96.2% at a specificity of 92.6%, which is superior to existing models.

Secondly, we propose crowdsourcing as a solution, in which we obtain annotations from a crowd, refine them, and use them for training. We refine the noisy crowdsourced annotations by i) assigning each member of the crowd a reliability factor based on their performance (at global and local levels) and experience, and ii) requiring region-of-interest (ROI) markings rather than pixel-level markings from the crowd. We show that these annotations are reliable by training a deep neural network (DNN) to detect hard exudates, which occur in mild non-proliferative DR. Experimental results for hard exudate detection showed that training with refined crowdsourced data is effective, as detection performance improves by 25% over training with expert markings alone.

Lastly, we explore synthetic data generation as a solution to this problem. We propose a novel method, based on generative adversarial networks (GANs), to generate images with lesions such that the overall severity level can be controlled. We showcase this approach for hard exudate and haemorrhage detection in retinal images with 4 levels of severity, varying from mild to severe non-proliferative DR. The synthetic data were also shown to be reliable for developing a CAD system for DR detection: hard exudate/haemorrhage detection improved with the inclusion of synthetic data in the training set, with an improvement in sensitivity of about 25% over training with expert-marked data alone.
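The co-training loop described in the first solution follows a standard semi-supervised pattern; the sketch below is a minimal illustration of that pattern, not the thesis's actual pipeline. The two feature views, the random-forest classifiers, and the confidence threshold are all assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Minimal co-training sketch: two independent feature "views" of the same
# labelled images each train a classifier; confident, agreeing predictions
# on the unlabelled pool are added back to the training set each round.
def co_train(views, y, unlabeled_views, rounds=5, threshold=0.95):
    views = [v.copy() for v in views]
    y = y.copy()
    for _ in range(rounds):
        clfs = [RandomForestClassifier(n_estimators=100).fit(v, y) for v in views]
        if len(unlabeled_views[0]) == 0:
            break
        probas = [clf.predict_proba(u) for clf, u in zip(clfs, unlabeled_views)]
        preds = [p.argmax(axis=1) for p in probas]
        confident = ((probas[0].max(axis=1) > threshold) &
                     (probas[1].max(axis=1) > threshold) & (preds[0] == preds[1]))
        if not confident.any():
            break
        # Move the confidently labelled samples into the training set.
        views = [np.vstack([v, u[confident]]) for v, u in zip(views, unlabeled_views)]
        y = np.concatenate([y, preds[0][confident]])
        unlabeled_views = [u[~confident] for u in unlabeled_views]
    return clfs

In the thesis the confident pseudo-labels come from models trained on different features extracted from the retinal images; the same loop structure applies.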

     

Year of completion: November 2018
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

Document Image Quality Assessment


Pranjal Kumar Rai

Abstract


       

Year of completion: November 2018
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

Computational Video Editing and Re-editing


Moneish Kumar

Abstract

The amelioration in video capture technology has made recording videos very easy. The introduction of smaller, affordable cameras which not only boast high-end specifications but are also capable of capturing video at very high resolutions (4K, 8K and even 16K) has made recording high-quality video accessible to everyone. Although this makes recording video straightforward and effortless, a major part of the video production process, editing, is still labor intensive and requires skill and expertise. This thesis takes a step towards automating the editing process and making it less time consuming. In this thesis, (1) we explore a novel approach for automatically editing stage performances such that both the context of the scene and close-up details of the actors are shown, and (2) we propose a new method to optimally retarget videos to any desired aspect ratio while retaining salient regions derived using gaze tracking.

Recordings of stage performances are easy to capture with a high-resolution camera, but are difficult to watch because the actors' faces are too small. We present an approach to automatically create a split-screen video that transforms these recordings to show both the context of the scene and close-up details of the actors. Given a static recording of a stage performance and tracking information about the actors' positions, our system generates videos showing a focus+context view based on computed close-up camera motions using crop-and-zoom. The key to our approach is to compute these camera motions such that they are cinematically valid close-ups and to ensure that the set of views of the different actors is properly coordinated and presented. We pose the computation of camera motions as a convex optimization that creates detailed views and smooth movements, subject to cinematic constraints such as not cutting faces with the edge of the frame. Additional constraints link the close-up views of each actor, causing them to merge seamlessly when actors are close. Generated views are placed in a resulting layout that preserves the spatial relationships between actors. This eliminates the need for the manual labour and expertise required both for capturing the performance and for later editing it; instead, the split screen of focus+context views allows the viewer to make an active decision about attending to whatever seems important. We demonstrate our results on a variety of staged theater and dance performances.

Videos are captured at a specific aspect ratio chosen with the size of the target screen in mind, which results in an inferior viewing experience when they are not watched on screens with their native aspect ratio. We present an approach to automatically retarget any given video to any desired aspect ratio while preserving its most salient regions, obtained using gaze tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions and (ii) adhere to the principles of cinematography. The algorithm has two steps. The first step uses dynamic programming to find a cropping window path that maximizes gaze inclusion within the window and also locates plausible new cuts (if required). The second step performs regularized convex optimization on the path obtained via dynamic programming to produce a smooth cropping window path composed of piecewise linear, constant and parabolic segments. We test our re-editing algorithm on a diverse collection of movie and theater sequences. A study conducted with 16 users confirms that our retargeting algorithm results in a superior viewing experience compared to gaze-driven re-editing [30] and letterboxing methods, especially for wide-angle static camera recordings.
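The first, dynamic-programming step can be illustrated with a toy version in which the state is the horizontal position of a fixed-width cropping window, the reward is the fraction of gaze points falling inside the window, and a movement penalty stands in for the cinematographic constraints. This is only a sketch under those assumptions, not the thesis's actual cost formulation or cut handling.

import numpy as np

# Toy DP: pick a crop position per frame that maximizes gaze inclusion
# while penalizing large jumps of the window between frames.
def crop_path(gaze_x, frame_w, crop_w, step=8, move_cost=0.05):
    positions = np.arange(0, frame_w - crop_w + 1, step)   # candidate window lefts
    n, m = len(gaze_x), len(positions)
    # Reward: fraction of gaze points inside the window at each candidate position.
    reward = np.array([[np.mean((g >= p) & (g <= p + crop_w)) if len(g) else 0.0
                        for p in positions] for g in gaze_x])
    dp = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)
    dp[0] = reward[0]
    for t in range(1, n):
        # transition score: previous value minus a cost that grows with window movement
        trans = dp[t - 1][None, :] - move_cost * np.abs(positions[:, None] - positions[None, :]) / step
        back[t] = trans.argmax(axis=1)
        dp[t] = reward[t] + trans.max(axis=1)
    path = [int(dp[-1].argmax())]                           # backtrack the best path
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return positions[np.array(path[::-1])]

gaze = [np.random.normal(960, 200, size=20) for _ in range(100)]   # synthetic gaze points
print(crop_path(gaze, frame_w=1920, crop_w=1080)[:10])

In the thesis this path is then smoothed by the second, convex-optimization step into piecewise linear, constant and parabolic segments; the toy code above stops at the raw DP path.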

         

Year of completion: November 2018
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

Geometric + Kinematic Priors and Part-based Graph Convolutional Network for Skeleton-based Human Action Recognition


Kalpit Thakkar

Abstract

Videos dominate the media archive in every capacity, generating a rich and vast source of information from which machines can learn to understand such data and predict useful attributes that can help technologies make human lives better. One of the most significant and instrumental parts of a computer vision system is the comprehension of human actions from visual sequences – a paragon of machine intelligence that can be achieved through computer vision. Human action recognition is of high importance as it facilitates several applications built around the recognition of human actions. Understanding actions from monocular sequences has been studied extensively, while comprehending human actions from skeleton videos has developed more recently. We propose action recognition frameworks that use the skeletal data of the human body (viz. the 3D locations of a set of joints) to learn the spatio-temporal representations necessary for recognizing actions. Information about human actions is composed of two integral dimensions, space and time, along which variations occur. The spatial representation of a human action aims at encapsulating the configuration of the human body essential to the action, while the temporal representation aims at capturing the evolution of such configurations across time. To this end, we propose to use geometric relations between human skeleton joints to discern the body pose relevant to the action, and physically inspired kinematic quantities to understand the temporal evolution of the body pose. Spatio-temporal understanding of human actions is thus conceived as a comprehension of geometric and kinematic information with the help of machine learning frameworks. Using a representation that combines geometric and kinematic features, we recognize human actions from skeleton videos (S-videos) with such frameworks.

We first present a non-parametric approach for the temporal sub-segmentation of trimmed action videos using the angular momentum trajectory of the skeletal pose sequence. A meaningful summarization of the pose sequence is obtained through systematic sampling of the resulting segments. Descriptors capturing geometric and kinematic statistics, encoded as histograms spread across a periodic range of orientations, are computed to represent the summarized pose sequences and are fed to a kernelized classifier for recognizing actions. This framework demonstrates the value of using geometric and kinematic properties of human pose sequences, which are important for spatio-temporal modeling of actions. However, a downside of this framework is its inability to scale with the availability of large amounts of visual data.

To mitigate this drawback, we next present geometric deep learning frameworks, specifically graph convolutional networks, for the same task. Representing the human skeleton as a sparse spatial graph is intuitive and yields a structured form that lies on a graph manifold. A human action video hence gives rise to a spatio-temporal graph, and graph convolutions facilitate the learning of a spatio-temporal descriptor for the action. Inspired by the success of Deformable Part-based Models (DPMs) for object understanding from images, we propose a part-based graph convolutional network (PB-GCN) that operates on a human skeletal graph divided into parts. Building on the insights gained from the success of geometric and kinematic features, we propose to use relative coordinates and temporal displacements of the 3D joint coordinates as features at the vertices of the skeletal graph. With these signals, the graph convolutional network attains state-of-the-art performance among several skeleton-based action recognition systems. In this thesis, we trace the development of the idea of using geometry and kinematics, transition to geometric deep learning frameworks, and design a PB-GCN with geometric + kinematic signals at the vertices for the task of human action recognition from skeleton videos.
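A single spatial graph convolution over a skeleton can be written as X' = σ(Â X W), where Â is the normalized joint adjacency matrix. The sketch below only illustrates that operation, together with the vertex features the abstract describes (relative joint coordinates concatenated with temporal displacements); the toy skeleton, adjacency and layer sizes are assumptions, and it does not reproduce the PB-GCN's part-based construction.

import torch
import torch.nn as nn

# One graph-convolution layer over a skeleton graph (illustrative only).
class SkeletonGraphConv(nn.Module):
    def __init__(self, adj, in_feats, out_feats):
        super().__init__()
        a = adj + torch.eye(adj.shape[0])                            # add self-loops
        d = a.sum(dim=1).pow(-0.5)                                   # D^{-1/2}
        self.register_buffer("a_hat", d[:, None] * a * d[None, :])   # normalized adjacency
        self.lin = nn.Linear(in_feats, out_feats)

    def forward(self, x):                                # x: (frames, joints, in_feats)
        return torch.relu(self.lin(self.a_hat @ x))      # propagate over joints, then transform

# Vertex features as in the abstract: coordinates relative to a root joint
# plus temporal displacements between consecutive frames.
def vertex_features(joints, root=0):
    rel = joints - joints[:, root:root + 1, :]           # relative coordinates
    disp = torch.zeros_like(joints)
    disp[1:] = joints[1:] - joints[:-1]                  # temporal displacement
    return torch.cat([rel, disp], dim=-1)                # (frames, joints, 6)

# Toy 5-joint chain skeleton, 30 frames of random poses.
adj = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    adj[i, j] = adj[j, i] = 1
feats = vertex_features(torch.randn(30, 5, 3))           # (30, 5, 6)
out = SkeletonGraphConv(adj, in_feats=6, out_feats=16)(feats)   # (30, 5, 16)

The PB-GCN additionally divides the skeletal graph into parts, as described above; this sketch treats the whole skeleton as a single graph.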

           

Year of completion: March 2019
Advisor: P J Narayanan

Related Publications


Downloads

thesis

More Articles …

1. Towards Data-Driven Cinematography and Video Retargeting using Gaze
2. Development and Tracking of Consensus Mesh for Monocular Depth Sequences
3. Adversarial Training for Unsupervised Monocular Depth Estimation
4. Exploring Binarization and Pruning of Convolutional Neural Networks