Effective and Efficient Attribute-aware Open-set Face Verification


Arun Kumar Subramanian

Abstract

While face recognition and verification in controlled settings are already solved problems for machines, what sets the face apart as a biometric is that its mode of capture is highly diverse. A face could be captured nearby or at a distance, at different poses, under different lighting, and by different devices. Face recognition/verification has several challenges to overcome to perform effectively under these varying conditions. Most current methods try to find salient features of an individual while ignoring these variations. This can be viewed through the paradigm of signal and noise. The signal is the information that is unique to an individual and does not vary with the capture condition. Noise represents those aspects that are unrelated to the identity itself and are influenced by the capture mechanism, physical setting, etc. Separating the two is usually done through metric learning approaches in addition to loss functions such as cross-entropy (e.g., Siamese networks, angular loss, and other margin losses such as ArcFace). Certain aspects, such as facial attributes (e.g., eyeglasses), lie between signal and noise. These may or may not be unique to the individual subject, but they introduce artifacts into the face image. The question then arises: why can't these variations be detected using learning methods, and the knowledge thus gained about them be put to good use during the matching process? This curiosity has resulted in aggregation strategies for matching, which were previously implemented for aspects such as pose and age. However, in the wild, humans demonstrate significant variability in facial attributes such as facial hair, eyeglasses, hairstyles, and make-up. This is common because one of the primary mechanisms of face image acquisition is covert capture in public (with ethics of consent in place), where people usually display significant variability in facial attributes.
Hence it is very important to address this variability during the matching process, and this work attempts to do so. The first question is whether matching performance indeed varies when the attribute prior is known. Even if it does, how does one conceptualize a system that exploits it? This thesis proposes two frameworks: one uses configuration-specific operating points, and the other suppresses attribute information in the face embedding prior to matching. Attribute suppression is attempted both directly on the final embedding and on the intermediate layers of a Vision Transformer deep neural network. Both frameworks require the facial attributes of each image to be detected before the images are passed into the proposed framework for matching. This naturally adds another task to the face verification pipeline. It is therefore necessary to find efficient and effective ways of performing face attribute detection (and face template generation), since efficient components mitigate the overhead of the expanded pipeline and make it viable for face verification. We observe that face attribute detection usually employs end-to-end networks, which results in a large number of parameters at inference. A feasible alternative is to consistently leverage state-of-the-art (SOTA) face recognition networks and use their earlier feature layers to perform the face attribute classification task. Since the most accurate face recognition methods are currently deep neural networks (DNNs), this thesis deals with DNNs. More narrowly, we focus on open-set face verification, where DNNs aim to find unique representations even for subjects not used in training.
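
The idea of configuration-specific operating points can be illustrated with a minimal sketch. It assumes that each image already comes with an embedding and a detected attribute label, and that one decision threshold has been tuned per pair of attribute states. The attribute names, the threshold values, and the function names below are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import math

# Hypothetical thresholds on cosine similarity, one per unordered pair of
# detected attribute states. The values are illustrative, not from the thesis.
THRESHOLDS = {
    frozenset(["none"]): 0.40,                # neither image shows eyeglasses
    frozenset(["none", "eyeglasses"]): 0.32,  # attribute mismatch between images
    frozenset(["eyeglasses"]): 0.36,          # both images show eyeglasses
}
DEFAULT_THRESHOLD = 0.40  # fallback for configurations without a tuned threshold


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(emb_a, attr_a, emb_b, attr_b):
    """Return (is_match, score, threshold) for one verification attempt.

    The threshold is looked up from the attribute configuration of the pair
    instead of using one global operating point.
    """
    key = frozenset([attr_a, attr_b])
    threshold = THRESHOLDS.get(key, DEFAULT_THRESHOLD)
    score = cosine_similarity(emb_a, emb_b)
    return score >= threshold, score, threshold
```

With these illustrative numbers, a pair scoring 0.35 would be rejected when neither image shows eyeglasses but accepted when exactly one of them does, which is the kind of behaviour a per-configuration operating point is meant to allow.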

Year of completion:  June 2023
 Advisor : Anoop M Namboodiri

Related Publications


    Downloads

    thesis

    Face Reenactment: Crafting Realistic Talking Heads for Enhanced Video Communication and Beyond


    Madhav Agarwal

    Abstract

    Face reenactment and synthetic talking-head methods have become widely popular for creating realistic face animations from a single image of a person. In light of recent developments in processing facial features in images and videos, as well as the ability to create realistic talking heads, we focus on two promising applications: using face reenactment for movie dubbing, and compressing video calls where the primary object is a talking face. We propose a novel method to generate realistic talking-head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated from learnable keypoints. We use audio as an additional input for high-quality lip sync, helping the network attend to the mouth region. We use additional priors from face segmentation and face mesh to preserve the structure of the reconstructed faces. Finally, we incorporate a carefully designed identity-aware generator module to produce realistic talking heads. The identity-aware generator takes the source image and the warped motion features as input and generates a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform current techniques both qualitatively and quantitatively. Our work opens up several applications, including low-bandwidth video calls and movie dubbing. We leverage the advancements in talking-head generation to propose an end-to-end system for video call compression. Our algorithm transmits pivot frames intermittently, while the rest of the talking-head video is generated by animating them. We use a state-of-the-art face reenactment network to detect keypoints in the non-pivot frames and transmit them to the receiver.
    A dense flow is then calculated to warp a pivot frame and reconstruct the non-pivot ones. Transmitting keypoints instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame interpolator at the receiver's end to improve the compression levels further. Finally, a face enhancement network improves reconstruction quality, significantly improving aspects such as the sharpness of the generated frames. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques.
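
To make the compression argument concrete, the following rough sketch estimates the payload when full pivot frames are sent only occasionally and keypoints are sent for every other frame. It uses a fixed pivot interval as a simplification; the adaptive pivot selection described in the abstract is not modelled here. The function name, the byte sizes, and the keypoint count are hypothetical.

```python
def plan_transmission(num_frames, pivot_interval, frame_bytes,
                      num_keypoints, bytes_per_keypoint=8):
    """Return (schedule, total_bytes) for a fixed-interval pivot scheme.

    Every `pivot_interval`-th frame is sent as a full pivot frame; all other
    frames are represented only by their keypoints. All sizes are assumed
    values for illustration, not measurements.
    """
    if pivot_interval < 1:
        raise ValueError("pivot_interval must be at least 1")
    schedule = []
    total_bytes = 0
    for index in range(num_frames):
        if index % pivot_interval == 0:
            schedule.append("pivot")
            total_bytes += frame_bytes
        else:
            schedule.append("keypoints")
            total_bytes += num_keypoints * bytes_per_keypoint
    return schedule, total_bytes


def compression_ratio(num_frames, frame_bytes, total_bytes):
    """Ratio of sending every frame in full to the planned payload."""
    return (num_frames * frame_bytes) / total_bytes
```

For example, `plan_transmission(10, 5, 10_000, 10)` schedules two pivot frames and eight keypoint-only frames. Under these assumed sizes the keypoint frames contribute only a small fraction of the total payload, which is where the savings come from.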

    Year of completion:  June 2023
     Advisor : C V Jawahar, Vinay P Namboodiri

    Related Publications


      Downloads

      thesis

      Fingerprint Disentanglement for Presentation Attack Generalization Across Sensors and Materials


      Gowri Lekshmy

      Abstract

      In today's digital era, biometric authentication has become increasingly widespread for verifying a user across a range of applications, from unlocking a smartphone to securing high-end systems. Various biometric modalities such as fingerprint, face, and iris each offer a distinct way to recognize a person automatically. Fingerprints are one of the most prevalent biometric modalities. They are widely used in security systems owing to their remarkable reliability, distinctiveness, invariance over time, and user convenience. As a result, automatic fingerprint recognition systems have become a prime target for attackers. Attackers fabricate fingerprints using materials like playdoh and gelatin, making them hard to distinguish from live fingerprints. This way of circumventing biometric systems is called a presentation attack (PA). To identify such attacks, a PA detector is added to these systems. Deep learning-based PA detectors require large amounts of data to distinguish PA fingerprints from live ones. However, far less training data exists for novel sensors and materials. Because of this, PA detectors do not generalize well when unknown sensors or materials are introduced. It is incredibly challenging to physically fabricate an extensive training dataset of high-quality counterfeit fingerprints made with novel materials and captured across multiple sensors. Existing fingerprint presentation attack detection (FPAD) solutions improve cross-sensor and cross-material generalization by using style-transfer-based augmentation wrappers over a two-class PA classifier. These solutions generate large artificial training datasets using style transfer, which learns the style properties from a few samples obtained from the attacker. They synthesize data by learning the style as a single entity containing both sensor and material characteristics.
      However, these strategies require learning the entire style again when a new sensor is added for an already known material, or vice versa. This thesis proposes a decomposition-based approach to improve cross-sensor and cross-material FPAD generalization. We model presentation attacks as a combination of two underlying components, material and sensor, rather than as one entire style. With this approach, our method can generate synthetic patches when a new sensor, a new material, or both are introduced. We explore two methods of fingerprint factorization: traditional and deep-learning-based. Traditional factorization of fingerprints into sensor and material representations using tensor decomposition establishes a machine-learning baseline for our hypothesis. The deep-learning method uses a decomposition-based augmentation wrapper to disentangle fingerprint style. The wrapper improves cross-sensor and cross-material FPAD using one fingerprint image of the target sensor and material. We also reduce computational complexity by generating compact representations and using fewer combinations of sensors and materials to produce several styles. Our approach enables us to generate a large variety of samples from a limited amount of data, which helps improve generalization.
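
The benefit of treating style as a combination of a sensor component and a material component can be shown with a deliberately tiny toy model. The sketch below assumes a single scalar style statistic per (sensor, material) pair and fits a rank-1 factorization by alternating least squares over the observed pairs only. This is a stand-in chosen for brevity, not the tensor decomposition or the deep-learning wrapper used in the thesis, and all names and numbers are hypothetical.

```python
def fit_rank1(observed, num_sensors, num_materials, iterations=100):
    """Fit value[s][m] ~= sensor[s] * material[m] from observed pairs.

    `observed` maps (sensor_index, material_index) to a scalar statistic.
    Each factor is updated by least squares over the observed entries that
    involve it, with the other factor held fixed.
    """
    sensor = [1.0] * num_sensors
    material = [1.0] * num_materials
    for _ in range(iterations):
        for s in range(num_sensors):
            pairs = [(m, v) for (si, m), v in observed.items() if si == s]
            denom = sum(material[m] ** 2 for m, _v in pairs)
            if denom > 0:
                sensor[s] = sum(v * material[m] for m, v in pairs) / denom
        for m in range(num_materials):
            pairs = [(s, v) for (s, mi), v in observed.items() if mi == m]
            denom = sum(sensor[s] ** 2 for s, _v in pairs)
            if denom > 0:
                material[m] = sum(v * sensor[s] for s, v in pairs) / denom
    return sensor, material


def predict(sensor, material, s, m):
    """Compose a style statistic for a (sensor, material) pair."""
    return sensor[s] * material[m]
```

Once the two factors are fitted, a pair that was never observed together can be composed from its parts, for instance `predict(sensor, material, 1, 2)`, without relearning a joint style for that pair. In this toy setting the quality of the composed value depends entirely on how well the rank-1 assumption holds.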

      Year of completion:  June 2023
       Advisor : Anoop M Namboodiri

      Related Publications


        Downloads

        thesis

        Extending PRT Framework for Lowly-Tessellated and Continuous Surfaces


        Dhawal Sirikonda

        Abstract

        Precomputed Radiance Transfer (PRT) is widely used for real-time photorealistic effects. PRT disentangles the rendering equation into transfer and lighting, enabling their precomputation. Transfer accounts for the cosine-weighted visibility of points in the scene, while lighting is usually distant emitted lighting, e.g., an environment map. Transfer computation involves tracing several rays into the scene from every point on the surface. For every ray, the binary visibility is calculated, yielding a spherical function. The spherical function is projected into the Spherical Harmonic (SH) domain. SH is a band-limited representation of spherical functions, and the SH order determines its representation capacity (the higher the SH order, the better the approximation of a spherical function). The SH domain also facilitates fast and efficient integral computation by reducing the integral to simple dot products and convolutions. The original formulation of PRT by Sloan et al. (2002) has different storage requirements for the transfer: vectors in the case of diffuse materials and matrices in the case of glossy materials. Using matrices for transfer representation becomes infeasible as the SH order increases. The Triple Product Formulation by Ng et al. (2004) extended the formulation to allow simple vector-based transfer storage even for glossy materials. Prior art stored precomputed transfer in a tabulated manner in vertex space. These values are fetched with interpolation at each point for shading. Since barycentric interpolation is used to calculate the final color across the geometry away from the vertex locations, vertex-space methods require densely tessellated meshes to obtain accurate radiance. High-density (tessellated) meshes can adversely affect runtimes and memory requirements.
        This is mainly observed in simple geometries that have no additional detailing but still demand higher triangle counts (e.g., planes and walls). The first work provides a solution by leveraging texture space, which is more continuous than vertex space. We also add functionality to obtain inter-reflection effects in texture space. While texture-space methods provide faithful results on meshes, they require a non-overlapping, area-preserving UV mapping and a high-resolution texture to avoid artifacts. In the subsequent work, we propose a compact transfer representation that is learnt directly on scene geometry points. Specifically, we train a small multi-layer perceptron (MLP) to predict the transfer at sampled surface points. Our approach is most beneficial where an inherent mesh storage structure and a natural UV mapping are unavailable, such as with implicit surfaces, as it learns the transfer values directly on the surface. Using our approach, we demonstrate real-time, photorealistic renderings of diffuse and glossy materials on SDF geometries with PRT.
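
The projection and dot-product steps of diffuse PRT can be sketched as follows. The sketch assumes only SH bands 0 and 1 (four coefficients), a single unoccluded surface point with normal along +z, and plain Monte Carlo sampling; albedo and normalization factors are left out. The function names are hypothetical, and this is an illustration of the general idea rather than the implementation used in the thesis.

```python
import math
import random


def sh_basis(x, y, z):
    """Real spherical harmonics for bands 0 and 1 at a unit direction."""
    return [0.282095, 0.488603 * y, 0.488603 * z, 0.488603 * x]


def sample_sphere(rng):
    """Draw a direction uniformly distributed on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return r * math.cos(phi), r * math.sin(phi), z


def project_to_sh(func, num_samples=50000, seed=0):
    """Monte Carlo projection of a spherical function onto 4 SH coefficients."""
    rng = random.Random(seed)
    coeffs = [0.0] * 4
    for _ in range(num_samples):
        direction = sample_sphere(rng)
        value = func(*direction)
        for i, basis_value in enumerate(sh_basis(*direction)):
            coeffs[i] += value * basis_value
    weight = 4.0 * math.pi / num_samples  # sphere area / sample count
    return [c * weight for c in coeffs]


def unoccluded_transfer(x, y, z):
    """Cosine-weighted visibility for normal (0, 0, 1) with no occluders."""
    return max(0.0, z)


def shade_diffuse(transfer_coeffs, lighting_coeffs):
    """Diffuse PRT shading: a dot product of transfer and lighting vectors."""
    return sum(t * l for t, l in zip(transfer_coeffs, lighting_coeffs))
```

As a sanity check, constant unit lighting has SH coefficients `[2 * math.sqrt(math.pi), 0, 0, 0]`, and the exact integral of the clamped cosine over the sphere is pi, so `shade_diffuse(project_to_sh(unoccluded_transfer), lighting)` should land close to pi up to Monte Carlo noise. In a scene with occluders, the transfer function would also multiply in the binary visibility obtained by ray tracing.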

        Year of completion:  June 2023
         Advisor : P J Narayanan

        Related Publications


          Downloads

          thesis

          Virtual World Creation


          Aryamaan Jain

          Abstract

          Synthesis, capture, and analysis of highly complex 3D terrain structures are essential for critical applications such as river/flood modelling, disaster mitigation planning, landslide modelling, and flight simulation. In addition, synthesis of natural-looking 3D terrains finds applications in the entertainment industry, such as computer gaming and VFX. This thesis explores novel learning-based techniques for the generation of immersive and realistic 3D virtual environments, catering to the needs of these applications. The generation of virtual worlds involves multiple components, including terrain, vegetation, and other objects. We primarily focus on three key aspects of virtual world generation: 1) novel AI-enabled 3D terrain authoring solutions based on real-world satellite and aerial data using a learning-based framework, 2) L-system grammar-based 3D tree generation, and 3) rendering techniques used to create high-quality visualizations of the generated world. Terrain generation is a critical component of 3D virtual world generation, as it provides the foundational structure for the environment. Traditional techniques for 3D terrain generation involve procedural generation, which relies on mathematical algorithms to generate landscapes. However, deep learning techniques have shown promise in generating more realistic terrain, as they can learn from real-world data to produce new, varied, and realistic landscapes with high visual fidelity. Specifically, we propose two novel learning-based frameworks: Interactive 3D Terrain Authoring & Manipulation, and Adaptive Multi-Resolution Infinite Terrain Generation. In addition to terrain, vegetation is another important component of virtual world generation.
          Trees and other plants provide visual interest and can help create a more immersive environment. L-systems are a popular technique for generating realistic vegetation, as they are capable of generating complex structures that resemble real-world plants. In this thesis, we propose a variant of L-systems for 3D tree generation and compare the results to traditional procedural generation techniques. Finally, rendering is a critical component of 3D virtual world generation, as it is responsible for creating the final visual output that users will see. We explore the use of real-time rendering techniques in conjunction with terrain generation to achieve high-quality visual results while maintaining performance. Overall, the research presented in this thesis aims to advance the state of the art in virtual world generation and contribute to the development of more realistic and immersive virtual environments. We perform extensive empirical evaluation on publicly available datasets and report detailed qualitative and quantitative results demonstrating the superiority of the proposed methods over existing solutions in the literature.
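
For readers unfamiliar with L-systems, the core mechanism is parallel string rewriting. The minimal sketch below shows a classic textbook-style bracketed L-system, not the variant proposed in the thesis; the rule set and the function name are hypothetical. In the usual turtle interpretation, `F` draws a branch segment, `+` and `-` turn, and `[` and `]` push and pop the drawing state to create side branches.

```python
def expand_lsystem(axiom, rules, iterations):
    """Rewrite every symbol in parallel, `iterations` times.

    Symbols without a rule are copied unchanged, which is how the
    bracket and turn symbols survive each rewriting step.
    """
    current = axiom
    for _ in range(iterations):
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Illustrative branching rule: each segment grows a left and a right branch.
TREE_RULES = {"F": "F[+F]F[-F]"}
```

Calling `expand_lsystem("F", TREE_RULES, 2)` produces a longer string in which every `F` from the first step has itself branched. A renderer would then walk this string to emit branch geometry, and each extra iteration multiplies the number of segments, which is how a short grammar yields complex plant-like structure.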

          Year of completion:  July 2023
           Advisor : Avinash Sharma, Rajan Krishnan Sundara

          Related Publications


            Downloads

            thesis
