
Vulnerability of Neural Network based Speaker Recognition Systems


Ritu Srivastava

Abstract

Speaker recognition (SR) is the automatic identification of individual speakers from their voices, typically by representing acoustic traits as fixed-dimensional speaker embeddings. A standard speaker recognition system (SRS) operates in three phases: training, enrollment, and recognition. In each phase, an acoustic feature extraction module derives the essential acoustic characteristics from the raw speech signal; commonly used features include the speech spectrogram, filter banks, and Mel-frequency cepstral coefficients. During training, a background model is trained to map training voices to embeddings. The traditional background model employs a Gaussian Mixture Model (GMM) to generate identity-vector (i-vector) embeddings, whereas more recent and promising background models leverage deep neural networks (DNNs) to generate deep embeddings such as the x-vector. In the enrollment stage, the voice of the person being enrolled is mapped to an enrollment embedding by the trained background model. In the recognition stage, the testing embedding of a given voice is first retrieved from the background model; the scoring module then measures the similarity between the enrollment and testing embeddings, and the decision module accepts or rejects the claimed identity by comparing the similarity score against a decision threshold.

The voiceprint is rapidly gaining prominence as an emerging biometric, primarily owing to its seamless integration with natural, human-centered Voice User Interfaces (VUIs). The fast progress of SRSs is intricately linked to the evolution of neural networks (NNs), particularly deep neural networks, and speaker recognition has benefited from advances in deep learning to find extensive applications across hardware and software platforms. However, NNs have been shown to be vulnerable to adversarial attacks: even though users enjoy the convenience of authenticating through speaker recognition services, these solutions can be deceived by adversarial examples. This vulnerability means that speaker recognition faces real security threats and raises significant concerns about user privacy. Adversarial attacks were first demonstrated on images, where image classification models were successfully deceived by adversarial examples, and there is growing interest in extending these techniques to the audio domain. Convolutional neural networks (CNNs) have proven unstable to artificially crafted perturbations that remain imperceptible to the human eye, and virtually every type of model, from CNNs to graph neural networks (GNNs), has shown vulnerability to adversarial examples, particularly in image classification. Deep learning models typically consume audio by first converting it into a spectrogram for further processing.
A spectrogram is a condensed, image-like representation of an audio signal and is therefore frequently used as input to deep learning models, especially Convolutional Neural Networks (CNNs) adapted for audio tasks from architectures originally designed for image processing. This thesis contributes to the assessment of CNNs for their resilience against adversarial attacks, a question that has not yet been extensively investigated for end-to-end trained CNNs in speaker recognition; such an examination is essential for sustaining the integrity and security of speaker recognition systems. Our study fills this gap by exploring variations of the iterative Fast Gradient Sign Method (FGSM) to carry out adversarial attacks. We observe that a vanilla iterative FGSM can alter the identity of any speaker sample to that of any other speaker within the LibriSpeech dataset. Additionally, we introduce adversarial attacks specific to Mel spectrogram features by (a) constraining the number of manipulated pixels, (b) confining alterations to certain frequency bands, (c) limiting changes to particular time segments, and (d) employing a substitute model to generate the adversarial sample. Through comprehensive qualitative and quantitative analyses, we illustrate the vulnerability and counterintuitive behavior of existing CNN-based speaker recognition systems, wherein the predicted speaker identities can be inverted without discernible alterations in the audio. The samples are available at https://advdemo.github.io/speech/.
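To make the attack concrete, the following is a minimal illustrative PyTorch sketch of a targeted iterative FGSM on a Mel spectrogram. It assumes a hypothetical classifier `model` that maps a (1, n_mels, n_frames) spectrogram to speaker logits; the model, the step sizes, and the optional `mask` argument (which confines the perturbation to chosen frequency bands or time segments, in the spirit of attacks (b) and (c)) are assumptions for illustration, not the thesis implementation.

    import torch
    import torch.nn.functional as F

    def iterative_fgsm(model, mel, target_speaker, eps=0.03, alpha=0.005,
                       steps=40, mask=None):
        """Targeted iterative FGSM: nudge `mel` towards `target_speaker`
        while keeping the perturbation inside an L-infinity ball of radius eps."""
        target = torch.tensor([target_speaker])
        mel_adv = mel.clone().detach()
        for _ in range(steps):
            mel_adv.requires_grad_(True)
            logits = model(mel_adv.unsqueeze(0))       # add a batch dimension
            loss = F.cross_entropy(logits, target)
            grad, = torch.autograd.grad(loss, mel_adv)
            step = alpha * grad.sign()
            if mask is not None:                       # optionally confine the attack to
                step = step * mask                     # selected frequency bins / time frames
            mel_adv = mel_adv.detach() - step          # targeted: descend the target-class loss
            mel_adv = mel + torch.clamp(mel_adv - mel, -eps, eps)
        return mel_adv.detach()

An untargeted variant would instead ascend the gradient of the loss with respect to the true speaker label.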

Year of completion: June 2024
Advisor: Vineet Gandhi

Related Publications


    Downloads

    thesis

Beyond Text: Expanding Speech Synthesis with Lip-to-Speech and Multi-Modal Fusion


Neha Sahipjohn

Abstract

Speech constitutes a fundamental aspect of human communication; therefore, the ability of computers to synthesize speech is paramount for achieving more natural human-computer interaction and increased accessibility, particularly for individuals with reading limitations. Recent advancements in AI and machine learning, alongside generative AI techniques, have significantly improved speech synthesis quality. Text is a common input modality for speech synthesis, and Text-to-Speech (TTS) systems have achieved notable milestones in intelligibility and naturalness. In this thesis, we propose a system to synthesize speech directly from lip movements and explore the idea of a unified speech synthesis model that can synthesize speech from different modalities, such as text-only, video-only, or combined text and video inputs. This facilitates applications in dubbing and in accessibility initiatives aimed at giving a voice to individuals who are unable to vocalize, and it promises streamlined communication in noisy environments as well. We propose a novel system for lip-to-speech synthesis that achieves state-of-the-art performance by leveraging advancements in self-supervised learning and sequence-to-sequence networks, enabling the generation of highly intelligible and natural-sounding speech even with limited data. Existing lip-to-speech systems primarily focus on directly synthesizing speech or mel-spectrograms from lip movements, which often compromises intelligibility and naturalness because speech content becomes entangled with ambient information and speaker characteristics. We propose a modularized approach that uses representations that disentangle speech content from speaker characteristics, leading to superior performance; our work also sheds light on the information-rich nature of embedding spaces compared to tokenized representations. The system maps lip movement representations to disentangled speech representations, which are then fed into a vocoder for speech generation. Recognizing the potential applications in dubbing and the importance of synthesizing accurate speech, we also explore a multimodal input setting by incorporating text alongside lip movements. Through extensive experimentation and evaluation across various datasets and metrics, we demonstrate the superior performance of our proposed method; it achieves high correctness and intelligibility, paving the way for practical deployment in real-world scenarios. Our work contributes significantly to advancing the field of lip-to-speech synthesis, offering a robust and versatile solution for generating natural-sounding speech from silent videos, with broader implications for accessibility, human-computer interaction, and communication technology.
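As a rough illustration of this modular design, the PyTorch sketch below maps per-frame lip features to speaker-disentangled speech-content representations that a separately trained vocoder could turn into a waveform. The module names, dimensions, and the choice of a Transformer mapper are assumptions for illustration, not the proposed architecture.

    import torch
    import torch.nn as nn

    class LipToContent(nn.Module):
        """Maps lip-movement features to disentangled speech-content representations."""
        def __init__(self, visual_dim=512, content_dim=256, n_layers=4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=visual_dim, nhead=8,
                                               batch_first=True)
            self.mapper = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.to_content = nn.Linear(visual_dim, content_dim)

        def forward(self, lip_feats):            # lip_feats: (B, T, visual_dim)
            hidden = self.mapper(lip_feats)      # temporal context over the lip sequence
            return self.to_content(hidden)       # (B, T, content_dim), speaker-agnostic

    # Usage sketch: the content sequence is passed to a vocoder together with a
    # separately supplied speaker embedding, e.g. wav = vocoder(content, spk_emb).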

Year of completion: June 2024
Advisor: Vineet Gandhi

Related Publications


    Downloads

    thesis

Unsupervised Learning of Disentangled Video Representation for Future Frame Prediction


Ujjwal Tiwari

Abstract

Predicting what may happen in the future is a critical design element of an intelligent decision-making system. This thesis aims to shed some light on video prediction models, which predict future frames of a video sequence by observing a set of previously known frames. These models learn video representations that encode the causal rules governing the physical world, and hence have been used extensively in the design of vision-guided robotic systems, with further applications in reinforcement learning, autonomous navigation, and healthcare. Video frame prediction remains challenging despite the availability of large amounts of video data and the recent progress of generative modeling techniques in synthesizing high-quality images. The challenges can be attributed to two significant characteristics of video data: the high dimensionality of video frames and the stochastic nature of the motion exhibited in video sequences. Existing video prediction models address the challenge of predicting frames in high-dimensional pixel space by learning a low-dimensional disentangled video representation. These methods factorize video representations into dynamic and static components, and the disentangled representation is subsequently used for the downstream task of future frame prediction. In Chapter 3, we propose a mutual information-based predictive autoencoder, MIPAE, a self-supervised learning framework that factorizes the latent representation of videos into a static content component and a dynamic pose component. The MIPAE architecture comprises a content encoder, a pose encoder, a decoder, and a standard LSTM network. We train MIPAE with a two-step procedure: in the first step, the content encoder, pose encoder, and decoder are trained to learn disentangled frame representations, where the content encoder is trained with a slow feature analysis constraint and the pose encoder with a novel mutual information loss term to achieve proper disentanglement. In the second step, we train an LSTM network to predict the low-dimensional pose representations of future frames; the predicted pose and the learned content representations are then decoded to generate the future frames of the video sequence. We present detailed qualitative and quantitative results on standard video prediction datasets like DSprites, MPI3D-real, and SMNIST using the visual quality assessment metrics LPIPS, SSIM, and PSNR. We also present a metric based on the mutual information gap (MIG) to quantitatively evaluate the degree of disentanglement between the factorized latent variables, pose and content; the MIG score is then used for a detailed comparative study of the proposed framework against other disentanglement-based video prediction approaches, showcasing the efficacy of our disentanglement approach. We conclude the analysis by showcasing the visual superiority of the frames predicted by MIPAE. In Chapter 4, we explore stochastic video prediction models, which aim to capture the inherent uncertainty in real-world videos by using a stochastic latent variable to predict a different but plausible sequence of future frames for each sample of that variable. We modify the architecture of two stochastic video prediction models and apply a novel cycle-consistency loss term to disentangle the video representation space into pose and content factors, modeling the uncertainty in the pose of the objects in the scene to generate sharp and plausible frame predictions.
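To illustrate how the factorized representation is used at prediction time, the sketch below outlines a hypothetical MIPAE-style inference path in PyTorch: the content code of the last observed frame is held fixed while an LSTM rolls the low-dimensional pose forward, and the decoder renders each predicted pose together with the content code. The concrete encoders, decoder, and dimensions are placeholders, not the thesis implementation.

    import torch
    import torch.nn as nn

    class MIPAEPredictor(nn.Module):
        def __init__(self, content_enc, pose_enc, decoder, pose_dim=32, hidden=256):
            super().__init__()
            self.content_enc, self.pose_enc, self.decoder = content_enc, pose_enc, decoder
            self.lstm = nn.LSTM(pose_dim, hidden, batch_first=True)
            self.to_pose = nn.Linear(hidden, pose_dim)

        def predict(self, past, n_future):                 # past: (B, T, C, H, W)
            content = self.content_enc(past[:, -1])        # static content from the last observed frame
            poses = torch.stack([self.pose_enc(past[:, t])
                                 for t in range(past.shape[1])], dim=1)
            out, state = self.lstm(poses)                  # encode the observed pose sequence
            pose_t = self.to_pose(out[:, -1])
            frames = []
            for _ in range(n_future):                      # autoregressive pose roll-out
                frames.append(self.decoder(content, pose_t))
                out, state = self.lstm(pose_t.unsqueeze(1), state)
                pose_t = self.to_pose(out[:, -1])
            return torch.stack(frames, dim=1)              # (B, n_future, C, H, W)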

Year of completion: June 2024
Advisor: Anoop M Namboodiri

Related Publications


    Downloads

    thesis

Targeted Segmentation: Leveraging Localization with DAFT for Improved Medical Image Segmentation


Samruddhi Shastri

Abstract

Medical imaging plays a pivotal role in modern healthcare, providing clinicians with crucial insights into the internal structures of the human body. However, extracting meaningful information from medical images, such as X-rays and Computed Tomography (CT) scans, remains challenging, particularly in the context of accurate segmentation. This thesis presents a novel two-stage Deep Learning (DL) pipeline designed to address the limitations of existing single-stage models and to improve segmentation performance in two critical medical imaging tasks: pneumothorax segmentation in chest radiographs and multi-organ segmentation in abdominal CT scans. The first stage of the proposed pipeline localizes the target organs or lesions within the image, using a specialized module tailored to the specific organ/lesion and image type; it outputs a “localization map” highlighting the most probable regions where the target resides, guiding the next step. The second stage, fine-grained segmentation, precisely delineates the organ/lesion boundaries. This is achieved by combining a U-Net, known for its ability to capture both general and detailed features, with Dynamic Affine Feature-Map Transform (DAFT) modules that dynamically adjust information within the network; this combination yields more accurate boundary delineation, meticulously outlining the exact borders of the target organ/lesion after it is roughly located in the first stage. One application of the proposed pipeline is pneumothorax segmentation, which leverages not only the image data but also the accompanying free-text radiology reports. By incorporating text-guided attention and DAFT, the pipeline produces low-dimensional region-localization maps, significantly reducing false-positive predictions and improving segmentation accuracy. Extensive experiments on the CANDID-PTX dataset demonstrate the efficacy of the approach, achieving a Dice Similarity Coefficient (DSC) of 0.60 for positive cases and a False Positive Rate (FPR) of 0.052 for negative cases, with DSC ranging from 0.70 to 0.85 for medium and large pneumothoraces. A second application of the pipeline is multi-organ segmentation in abdominal CT scans, where accurate delineation of organ boundaries is crucial for various medical tasks. The proposed Guided-nnUNet leverages spatial guidance from a ResNet-50-based localization map in the first stage, followed by a DAFT-enhanced 3D U-Net (nnU-Net implementation). Evaluation on the AMOS and Beyond The Cranial Vault (BTCV) datasets demonstrates a significant improvement over baseline models, with an average increase of 7% and 9% on the respective datasets; moreover, Guided-nnUNet outperforms state-of-the-art (SOTA) methods, including MedNeXt, by 3.6% and 5.3% on the AMOS and BTCV datasets, respectively. Overall, this thesis proposes a novel two-stage deep learning pipeline for medical image segmentation and demonstrates its effectiveness across a wide range of anatomical structures and image modalities (2D X-ray, 3D CT), for both single-organ tasks (e.g., pneumothorax segmentation in chest radiographs) and multi-organ tasks (e.g., abdominal CT scans). This comprehensive approach offers significant advancements and contributes to improved medical image analysis, potentially leading to better healthcare outcomes.
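To make the role of DAFT concrete, the sketch below shows a hypothetical DAFT-style block in PyTorch: an auxiliary vector (assumed here to be a pooled localization map from the first stage) predicts per-channel scale and shift factors that modulate a U-Net feature map. The dimensions and the exact placement of the block inside the network are assumptions for illustration, not the thesis implementation.

    import torch
    import torch.nn as nn

    class DAFTBlock(nn.Module):
        """Dynamic affine feature-map transform conditioned on an auxiliary vector."""
        def __init__(self, n_channels, aux_dim, bottleneck=32):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(aux_dim, bottleneck), nn.ReLU(),
                nn.Linear(bottleneck, 2 * n_channels))     # -> [scale | shift]

        def forward(self, feat, aux):        # feat: (B, C, H, W), aux: (B, aux_dim)
            scale, shift = self.mlp(aux).chunk(2, dim=1)
            scale = scale[:, :, None, None]                # broadcast over H and W
            shift = shift[:, :, None, None]                # (add another None axis for 3D features)
            return feat * (1 + scale) + shift              # affine modulation of the feature map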

Year of completion: June 2024
Advisor: Jayanthi Sivaswamy

Related Publications


    Downloads

    thesis

Estimating 3D Human Pose, Shape, and Correspondences from Monocular Input


Amogh Tiwari

Abstract

In recent years, advances in computer vision have opened up applications in virtual reality, healthcare, robotics, and many other domains. One crucial problem domain in computer vision, and a key research focus lately, is estimating 3D human pose, shape, and correspondences from monocular input. This problem has applications in industries such as fashion, entertainment, and healthcare, but it is also highly challenging for several reasons: large variations in the pose, shape, and appearance of humans and their clothing, external and self-occlusions, and the difficulty of ensuring consistency. In this thesis, we tackle two key problems related to 3D human pose, shape, and correspondence estimation. First, we focus on temporally consistent 3D human pose and shape estimation from monocular videos. Next, we focus on dense correspondence estimation across images of different (or the same) humans. We show that, despite receiving a lot of research attention lately, existing methods for these tasks still perform sub-optimally in many challenging scenarios and have significant scope for improvement; we aim to overcome some of their limitations and advance the state-of-the-art (SOTA) solutions to these problems. First, we propose a novel method for temporally consistent 3D human pose and shape estimation from a monocular video. Instead of the generic ResNet-like features used traditionally, our method uses a body-aware feature representation and an independent per-frame pose and camera initialization over a temporal window, followed by a novel spatio-temporal feature aggregation that combines self-similarity and self-attention over the body-aware features and the per-frame initializations. Together, they yield enhanced spatio-temporal context for every frame by considering the remaining past and future frames. These features are used to predict the pose and shape parameters of the human body model, which are further refined using an LSTM. Next, we expand our focus to dense correspondence estimation between humans, which requires understanding the relations between different body regions (represented using dense correspondences), including the clothing details, of the same or different human(s). We present Continuous Volumetric Embeddings (ConVol-E), a novel, robust representation for dense correspondence matching across RGB images of different human subjects in arbitrary poses and appearances under non-rigid deformation. Unlike existing representations, ConVol-E captures the deviation from the underlying parametric body model by choosing suitable anchor/key points on the parametric body surface and then representing any point in the volume by its Euclidean relationship with the anchor points. This allows us to represent any arbitrary point around the parametric body (clothing details, hair, etc.) by an embedding vector. Subsequently, given a monocular RGB image of a person, we learn to predict a per-pixel ConVol-E embedding, which carries the same meaning across different subjects and is invariant to pose and appearance, thereby acting as a descriptor for establishing robust, dense correspondences across different images of humans. We thoroughly evaluate our methods on publicly available benchmark datasets and show that they outperform the existing SOTA. Finally, we summarize our contributions and discuss potential future research directions in this problem domain. We believe that this thesis improves the research landscape for human body pose, shape, and correspondence estimation and helps accelerate progress in this direction.
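The core of the ConVol-E representation can be illustrated with a short NumPy sketch: a query point anywhere in the volume around the body is described by its Euclidean relationship to a set of anchor points chosen on the parametric body surface. The anchor selection and the use of raw distances (with no learned refinement) are simplifying assumptions for illustration, not the thesis implementation.

    import numpy as np

    def convol_e_embedding(points, anchors):
        """points: (N, 3) query points around the body; anchors: (K, 3) anchor
        points on the parametric body surface. Returns an (N, K) embedding of
        per-anchor Euclidean distances."""
        diff = points[:, None, :] - anchors[None, :, :]    # (N, K, 3) pairwise offsets
        return np.linalg.norm(diff, axis=-1)               # (N, K) distances

    # Two images can then be matched by comparing per-pixel predicted embeddings,
    # e.g. via nearest-neighbour search in this embedding space.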

Year of completion: June 2024
Advisor: Avinash Sharma

Related Publications


    Downloads

    thesis
