Investigation of Different Aspects of Image Quality


Murtuza Bohra

Abstract

Image quality is a fundamental problem in computer vision. For a variety of applications, such as scanning documents, QR codes, and bar codes, or for algorithms like object detection, recognition, tracking, and scene understanding, images with good contrast, high illumination, and sharpness are desired. Similarly, in computer graphics, for information visualisation, animations, presentations, etc., aesthetically pleasing design and good colorization of images are desired. The definition of image quality therefore depends on the context and application of the image. In this thesis we address various challenges pertaining to image quality: (1) for natural images, we explore a novel approach for predicting the capture quality of images taken in the wild; (2) for graphics designs, we explore the aesthetic quality of images by suggesting multiple aesthetically pleasing colorizations.

With increasing advancements and portability, the smartphone camera has become the default choice for capturing images in the wild. However, camera-captured images suffer from quality issues, for instance due to a lack of stability during capture. This hinders automatic workflows that take camera-captured images as input, e.g. Optical Character Recognition (OCR) for document images or face detection/recognition on human images. Part of this thesis is focused on Image Quality Assessment (IQA), whose aim is to quantify degradations such as out-of-focus blur and motion artefacts in a given image. One of the major challenges in IQA for images captured in the wild is the absence of ground truth for capture quality. Previous attempts at IQA therefore require a human in the loop to create the ground truth: large user studies are conducted and mean human opinion scores are used as the measure of quality. In this work, we use a signal-processing-based technique to generate the IQA ground truth and propose a comprehensive IQA dataset that is a good representative of the real degradations occurring during capture. Further, we propose a deep-learning-based approach to predict image quality for captures in the wild. Such an IQA algorithm can help either by giving online quality suggestions during capture or by rating the quality post capture.

Another dimension of image quality is aesthetic quality. With the increasing usage of the internet and social media and advancements in mobile cameras, photography has become a popular hobby for a large section of people. Even for a well-captured image, people apply a variety of filters and effects post capture to enhance its appearance, e.g. adjusting color temperature or contrast, or even blurring part of the image (bokeh effect). Capture quality alone is therefore not enough to define the aesthetic quality of an image. One of the major factors defining aesthetic quality is colorization, particularly in the computer graphics domain, where artificially generated images already have well-defined structure, shape, and components; there, sharpness and capture quality are not relevant, but color plays a very important role in the visualization and appearance of the image. In natural images, colors are largely associated with semantics, e.g. the sky is blue and grass is green, whereas in animations and computer graphics objects are only loosely associated with semantics, which leads to more choices of colors. Hence the problem of colorization becomes more challenging in the graphics domain, where the overall appearance of an image matters more than its naturalness. Therefore, this work also covers the aesthetic quality of graphics images: instead of measuring color quality, we propose an algorithm to produce better coloring suggestions for a given graphics image.
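To make the IQA part concrete, below is a minimal sketch of how a signal-processing proxy for capture quality could be used to label images automatically. The abstract does not specify the exact measure used in the thesis; the variance-of-Laplacian sharpness score, the sharpness_score and label_dataset helpers, and the OpenCV pipeline here are illustrative assumptions, not the actual ground-truth generation method.

```python
# Hedged sketch, NOT the thesis pipeline: illustrates how a signal-processing
# proxy (variance of the Laplacian) can serve as a pseudo ground-truth score
# for capture quality, which a CNN could then be trained to regress.
import cv2
import numpy as np

def sharpness_score(image_bgr: np.ndarray) -> float:
    """Variance of the Laplacian: low values suggest defocus or motion blur."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def label_dataset(paths):
    """Assign a pseudo ground-truth quality score to each captured image."""
    return {p: sharpness_score(cv2.imread(p)) for p in paths}
```

In such a setup, the scores produced here would play the role of the automatically generated ground truth that a deep quality-prediction model is trained against.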

Year of completion: March 2021
Advisor: Vineet Gandhi

Related Publications


    Downloads

    thesis

    Deep Learning Frameworks for Human Motion Analysis


    Neeraj Battan

    Abstract

The human body is a complex structure consisting of multiple organs, muscles, and bones. However, for modeling human motion, information about a fixed set of joint locations (with a fixed bone-length constraint) is sufficient to express the temporal evolution of poses. Thus, one can represent a human motion/activity as a temporally evolving skeleton sequence. In this thesis, we primarily explore two key problems related to modeling human motion: the first is efficient indexing and retrieval of human motion sequences, and the second is automated generation/synthesis of novel human motion sequences given an input class prior.

3D human motion indexing and retrieval is an interesting problem due to the rise of several data-driven applications aimed at analyzing and/or re-utilizing 3D human skeletal data, such as data-driven animation, analysis of sports biomechanics, human surveillance, etc. Spatio-temporal articulations of humans, noisy/missing data, different speeds of the same motion, etc., make it challenging, and several existing state-of-the-art methods use hand-crafted features along with optimization-based or histogram-based comparisons to perform retrieval. Further, they demonstrate results only on very small datasets with few classes. We make a case for using a learned representation that should recognize the motion as well as enforce a discriminative ranking. To that end, we propose a 3D human motion descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data, addressing the aforementioned challenges, and further enables sub-motion searching in its embedding space using another network. Our model exploits inter-class similarity using trajectory cues and performs far better in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human motion datasets: NTU RGB+D and HDM05.

For the second research problem, human motion generation/synthesis, we aim at long-term human motion synthesis that can aid human-centric video generation [9], with potential applications in augmented reality, 3D character animation, pedestrian trajectory prediction, etc. Long-term human motion synthesis is a challenging task due to multiple factors such as long-term temporal dependencies among poses, cyclic repetition across poses, bi-directional and multi-scale dependencies among poses, variable speed of actions, and a large as well as partially overlapping space of temporal pose variations across multiple classes/types of human activities. This work aims to address these challenges and synthesize a long-term (> 6000 ms) human motion trajectory across a large variety of human activity classes (> 50). We propose a two-stage motion synthesis method to achieve this goal: the first stage learns the long-term global pose dependencies in activity sequences by learning to synthesize a sparse motion trajectory, while the second stage synthesizes dense motion trajectories, taking the output of the first stage as input. We demonstrate the superiority of the proposed method over state-of-the-art methods using various quantitative evaluation metrics on publicly available datasets.

In summary, this thesis addresses two key research problems, namely efficient human motion indexing and retrieval and long-term human motion synthesis. In doing so, we explore different machine learning techniques that analyze human motion in a sequence of poses (also called frames), generate human motion, detect human pose from an RGB image, and index human motion.
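As a rough illustration of the indexing-and-retrieval idea (not the thesis architecture), the sketch below encodes a skeleton sequence into a fixed-length embedding and ranks a gallery by cosine similarity. The MotionEncoder module, its GRU backbone, and all dimensions are assumptions made for illustration.

```python
# Hedged sketch: a minimal skeleton-sequence encoder mapping a (T, J*3) pose
# sequence to a fixed-length unit-norm embedding, so motion retrieval reduces
# to nearest-neighbour search in the embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoder(nn.Module):
    def __init__(self, joints=25, dim=128):
        super().__init__()
        self.gru = nn.GRU(input_size=joints * 3, hidden_size=dim, batch_first=True)

    def forward(self, seq):                 # seq: (B, T, J*3)
        _, h = self.gru(seq)                # h: (1, B, dim), final hidden state
        return F.normalize(h[-1], dim=-1)   # unit-norm embedding per sequence

def retrieve(query_emb, gallery_embs, k=5):
    """Rank gallery motions by cosine similarity to the query embedding."""
    sims = gallery_embs @ query_emb         # (N,) since embeddings are unit-norm
    return torch.topk(sims, k).indices
```

A learned descriptor of this general shape is what makes sub-motion search and ranking over large skeleton datasets tractable.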

Year of completion: March 2021
Advisors: Anoop M Namboodiri, Madhava Krishna

    Related Publications


      Downloads

      thesis

      Behaviour Detection of Vehicles Using Graph Convolutional Networks with Attention


      Sravan Mylavarapu

      Abstract

Autonomous driving has evolved with the advent of deep learning and increased computational capacity over the past two decades. While many tasks have evolved to help autonomous navigation, understanding on-road vehicle behaviour from a continuous sequence of sensor data (camera or radar) is an important and useful task. Determining the state of other vehicles and their motion helps in decision making for the ego-vehicle (self/our vehicle). Such systems can be part of driver assistance systems or even of path planners that help vehicles avoid obstacles. One instance of such decision making is deciding to apply brakes in case an external agent is moving into our lane or coming towards us; another is keeping a distance from an aggressive driver who changes lanes and overtakes frequently, which is both helpful and safer. Specifically, our proposed methods classify the behaviours of on-road vehicles into one of the following categories: {Moving Away, Moving Towards Us, Parked/Stationary, Lane Change, Overtake}.

Many current methods leverage 3D depth information using expensive equipment such as LiDAR and complex algorithms. In this thesis, we propose a simpler pipeline for understanding vehicle behaviour from a monocular (single-camera) image sequence or video, which allows easier installation of the equipment and helps in driver assistance systems. A simple video, along with camera parameters and scene semantics (objects detected in the images), is used to get information about the objects of interest (vehicles) and other static objects such as lanes and poles in the scene. We consider detecting objects across frames as the pre-processing pipeline and propose two main methods for behaviour detection of vehicles: (1) spatio-temporal MR-GCN (Chapter 3) and (2) relational-attentive GCN (Chapter 4). We divide the process of identifying behaviours into first learning positional information of vehicles and then observing them across time to make a prediction. The positional information is encoded by a Multi-Relational Graph Convolutional Network (MR-GCN) in both methods. Temporal information is encoded by recurrent networks in Method 1, while it is formulated inside the graph in Method 2. Our experiments also showcase that attention is an important aspect of both methods. The proposed frameworks can classify a variety of vehicle behaviours with high fidelity on diverse datasets that include European, Chinese, and Indian on-road scenes. The framework also provides for seamless transfer of models across datasets without entailing re-annotation, retraining, or even fine-tuning. We provide comparative performance gains over baselines and detail a variety of ablations to showcase the efficacy of the framework.
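The sketch below illustrates the general idea of a multi-relational graph convolution of the kind MR-GCN builds on: a separate weight matrix per relation type (e.g. different spatial relations between vehicles and static objects), with messages aggregated over each relation's adjacency matrix. The layer structure, relation count, and normalisation shown are illustrative assumptions rather than the thesis implementation.

```python
# Hedged sketch of a multi-relational graph convolution layer: one linear map
# per relation, messages row-normalised per relation and summed with a
# self-loop transform. Shapes and relations are illustrative only.
import torch
import torch.nn as nn

class MultiRelGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)]
        )
        self.self_loop = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (R, N, N), one adjacency per relation
        out = self.self_loop(x)
        for r, lin in enumerate(self.rel_weights):
            deg = adj[r].sum(dim=1, keepdim=True).clamp(min=1)  # row-normalise
            out = out + (adj[r] / deg) @ lin(x)
        return torch.relu(out)
```

Stacking such layers gives per-vehicle positional encodings, which a recurrent network (Method 1) or a temporal graph formulation (Method 2) can then aggregate over time for behaviour classification.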

Year of completion: February 2021
Advisors: Anoop M Namboodiri, Madhava Krishna

      Related Publications


        Downloads

        thesis

        Computer Vision for Atmospheric Turbulence: Generation, Restoration, and its Applications


        Shyam Nandan Rai

        Abstract

Real-world images often suffer from variations in weather conditions such as rain, fog, snow, and temperature. These variations in the atmosphere adversely affect the performance of computer vision models in real-world scenarios. The problem can be bypassed by collecting and annotating images for each weather condition; however, collecting and annotating images in such conditions is an extremely tedious task that is time-consuming as well as expensive. In this work, we address the aforementioned problems. Among all the weather conditions, we focus on the distortions in an image caused by high temperature, also known as atmospheric turbulence. These distortions introduce geometric deformations around the boundaries of objects in an image, which cause vision algorithms to perform poorly and pose a major concern. Hence, in this thesis, we address the problems of artificially generating atmospheric turbulence and of restoring images degraded by it.

In the first part of our work, we attempt to model atmospheric turbulence, since such models are critical for extending computer vision solutions developed in the laboratory to real-world use cases, and simulating atmospheric turbulence using statistical models or computer graphics is often computationally expensive. To overcome this problem, we train a generative adversarial network (GAN) that outputs an atmospheric-turbulence image while utilizing fewer computational resources than traditional methods. We propose a novel loss function to efficiently learn the atmospheric turbulence at a finer level. Experiments show that, using the proposed loss function, our network outperforms existing state-of-the-art image-to-image translation networks in turbulent image generation.

In the second part of the thesis, we address the ill-posed problem of restoring images degraded by atmospheric turbulence. We propose a deep adversarial network to recover images distorted by atmospheric turbulence and show the applicability of the restored images in several tasks. Unlike previous methods, our approach neither uses any prior knowledge about atmospheric turbulence conditions at inference time nor requires the fusion of multiple images to obtain a single restored image. To train our models, we synthesize turbulent images by following a series of efficient 2D operations. Thereafter, we run inference with our trained models on real and synthesized turbulent images. Our final restoration models, DT-GAN+ and DTD-GAN+, qualitatively and quantitatively outperform general state-of-the-art image-to-image translation models. The improved performance of our model is due to the use of optimized residual structures along with channel attention and a sub-pixel mechanism, which exploits the information between channels and removes atmospheric turbulence at a finer level. We also perform extensive experiments on the restored images by utilizing them for downstream tasks such as classification, pose estimation, semantic keypoint estimation, and depth estimation.

In the third part of our work, we study the problem of adapting semantic segmentation models to hot-climate cities. This issue can be circumvented by collecting and annotating images in such weather conditions and training segmentation models on those images, but the task of semantically annotating images for every environment is painstaking and expensive. Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts: a restoration network that removes the geometric distortions caused by hot weather, and an adaptive segmentation network trained with an additional loss to adapt to the statistics of the ground-truth segmentation map. We train our framework on the Cityscapes dataset, where it shows a total IoU gain of 12.707 over standard segmentation models.

In the last part of our work, we improve the performance of our joint restoration and segmentation network via a feedback mechanism. In the previous approach, the restoration network does not learn directly from the errors of the segmentation network; in other words, the restoration network is not task-aware. Hence, we propose a semantic feedback learning approach, which improves semantic segmentation by giving a feedback response to the restoration network. This response works as an attend-and-fix mechanism, focusing on those areas of an image where restoration needs improvement. We also propose two loss functions, Iterative Focal Loss (iFL) and Class-Balanced Iterative Focal Loss (CB-iFL), which are specifically designed to improve the performance of the feedback network. These losses focus more on samples that are continuously misclassified over successive iterations. Our approach gives a gain of 17.41 mIoU over the standard segmentation model, including an additional gain of 1.9 mIoU with CB-iFL, on the Cityscapes dataset.
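The exact form of iFL/CB-iFL is not given in this abstract; the sketch below only illustrates the stated intuition, namely a focal-style loss whose weight grows for pixels that remain misclassified across successive feedback iterations. The miss_count bookkeeping and the beta scaling factor are assumptions made for illustration, not the thesis' actual loss.

```python
# Hedged sketch of an iteration-aware focal loss: the standard focal term
# down-weights easy pixels, and an extra running "miss count" boosts pixels
# that stay misclassified across feedback iterations. Illustration only.
import torch
import torch.nn.functional as F

def iterative_focal_loss(logits, target, miss_count, gamma=2.0, beta=0.5):
    """logits: (B, C, H, W), target: (B, H, W) long, miss_count: (B, H, W) float."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # (B, H, W)
    pt = log_pt.exp()
    weight = (1 - pt) ** gamma * (1 + beta * miss_count)       # grows with repeated misses
    loss = -(weight * log_pt).mean()

    pred = logits.argmax(dim=1)
    new_miss_count = miss_count + (pred != target).float()     # update for next iteration
    return loss, new_miss_count
```

A class-balanced variant would additionally reweight the per-pixel term by inverse class frequency, in the spirit of the CB-iFL name.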

Year of completion: December 2020
Advisors: C V Jawahar, Vineeth Balasubramanian, Anbumani Subramanian

        Related Publications


          Downloads

          thesis

          Monocular 3D Human Body Reconstruction


          Abbhinav Venkat

          Abstract

Monocular 3D human reconstruction is a very relevant problem due to numerous applications in the entertainment industry, e-commerce, health care, mobile-based AR/VR platforms, etc. However, it is severely ill-posed due to self-occlusions from complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, the single view, etc. Conventional approaches address these challenges with different sensing systems: marker-based systems, marker-less multi-view cameras, inertial sensors, and 3D scanners. Although effective, such methods are often expensive and have limited wide-scale applicability. In an attempt to produce scalable solutions, a few works have focused on fitting statistical body models to monocular images, but these suffer from a costly optimization process. Recent efforts focus on using data-driven algorithms such as deep learning to learn priors directly from data; however, they focus on template model recovery or rigid object reconstruction, or propose paradigms that do not directly extend to recovering personalized models.

To predict accurate surface geometry, our first attempt was VolumeNet, which predicts a 3D occupancy grid from a monocular image and was, at the time, the first model of its kind for non-rigid human shapes. To circumvent the ill-posed nature of this problem (aggravated by an unbounded 3D representation), we follow the ideology of providing maximal training priors through our unique training paradigms, to enable testing with minimal information. As we did not impose any body-model-based constraint, we were able to recover deformations induced by free-form clothing. Further, we extended VolumeNet to PoShNet by decoupling pose and shape: we learn the volumetric pose first and use it as a prior for learning the volumetric shape, thereby recovering a more accurate surface.

Although volumetric regression enables more accurate surface reconstruction, it does so without an animatable skeleton. Further, such methods yield low-resolution reconstructions at higher computational cost (regression over the cubic voxel grid) and often suffer from an inconsistent topology via broken or partial body parts. Hence, statistical body models become a natural choice to offset the ill-posed nature of this problem. Although theoretically low dimensional, learning such models has been challenging due to the complex non-linear mapping from the image to the relative axis-angle representation; hence, most solutions rely on different projections of the underlying mesh (2D/3D keypoints, silhouettes, etc.). To simplify the learning process, we propose the CR framework, which uses classification as a prior to guide the regression's learning process. Although recovering personalized models with high-resolution meshes is not possible in this space, the framework shows that learning such template models can be difficult without additional supervision.

As an alternative to directly learning parametric models, we propose HumanMeshNet, which learns an "implicitly structured point cloud" in which we use the mesh topology as a prior to enable better learning. We hypothesize that instead of learning the highly non-linear SMPL parameters, learning the corresponding point cloud (although high dimensional) and enforcing the same parametric template topology on it is an easier task. This proposed paradigm can theoretically learn local surface deformations that the body-model-based PCA space cannot capture. Further, producing high-resolution meshes (with accurate geometric details) is a natural extension that is easier in 3D space than in the parametric one.

In summary, in this thesis we attempt to address several of the aforementioned challenges and empower machines with the capability to interpret a 3D human body model (pose and shape) from a single image in a manner that is non-intrusive, inexpensive, and scalable. In doing so, we explore different 3D representations capable of producing accurate surface geometry, with the long-term goal of recovering personalized 3D human models.
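One hedged way to read the "implicitly structured point cloud" idea is a loss that regresses all template vertices directly while using the fixed mesh topology (its edge list) as a prior. The sketch below shows such a loss; the mesh_prior_loss function, the edge-length consistency term, and the weighting are illustrative assumptions, not the thesis' actual objective.

```python
# Hedged sketch: regress template vertices as a point cloud, with an
# edge-length consistency term derived from the fixed mesh topology acting
# as a structural prior. Names, shapes, and weights are illustrative only.
import torch

def mesh_prior_loss(pred_verts, gt_verts, edges, lam=0.1):
    """pred_verts/gt_verts: (B, V, 3); edges: (E, 2) long tensor of vertex ids."""
    vert_l2 = ((pred_verts - gt_verts) ** 2).sum(-1).mean()

    def edge_len(v):
        # Length of every template edge for each mesh in the batch: (B, E)
        return (v[:, edges[:, 0]] - v[:, edges[:, 1]]).norm(dim=-1)

    edge_term = ((edge_len(pred_verts) - edge_len(gt_verts)) ** 2).mean()
    return vert_l2 + lam * edge_term
```

Because the predicted vertices inherit the template's connectivity, the output can be rendered and post-processed as a standard mesh even though the network itself only regresses points.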

Year of completion: November 2020
Advisor: Avinash Sharma

          Related Publications


            Downloads

            thesis
