Vulnerability of Neural Network based Speaker Recognition Systems
Ritu Srivastava
Abstract
Speaker recognition (SR) involves the automatic identification of individual speakers based on their voices, typically representing acoustic traits as fixed-dimensional speaker embeddings. A standard speaker recognition system (SRS) consists of three key phases: training, enrollment, and recognition. In each stage, an acoustic feature extraction module derives the essential acoustic characteristics from the raw speech signal; commonly used acoustic features include the speech spectrogram, filter banks, and Mel-frequency cepstral coefficients. During the training stage, a background model is trained to map training voices to embeddings. Traditional background models employ a Gaussian Mixture Model (GMM) to produce identity-vector (i-vector) embeddings, whereas more recent and promising background models leverage deep neural networks (DNNs) to produce deep embeddings such as the x-vector. In the enrollment stage, a voice spoken by the enrolling individual is mapped to an enrollment embedding using the trained background model. In the recognition stage, the testing embedding of a given voice is first obtained from the background model; the scoring module then measures the similarity between the enrollment and testing embeddings, and the decision module compares the resulting score against a decision threshold to determine whether the claimed identity of the speaker is accepted or rejected.

The voiceprint is rapidly gaining prominence as an emerging biometric, primarily owing to its seamless integration with natural, human-centered Voice User Interfaces (VUIs). The rapid progress of SRSs is closely tied to the evolution of neural networks (NNs), in particular deep neural networks (DNNs), and with the strides made in deep learning, SR has found extensive applications across hardware and software platforms. However, NNs have been shown to be vulnerable to adversarial attacks. Thus, even though SR services offer users convenient authentication, these solutions are susceptible to adversarial manipulation, which exposes SR to security threats and raises significant concerns about user privacy. Adversarial attacks were first demonstrated on images, where an image classification model was deceived by adversarial examples, and inspired by this progress there is growing interest in extending these techniques to the audio domain. Convolutional neural networks (CNNs) have been shown to be unstable under artificially crafted perturbations that remain imperceptible to the human eye, and virtually every type of model, from CNNs to graph neural networks (GNNs), has proven vulnerable to adversarial examples, particularly in image classification. Deep learning models typically ingest audio by first converting it into a spectrogram for further processing.
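As an illustration of the scoring and decision step described above, the following minimal sketch compares an enrollment embedding with a testing embedding using cosine similarity and a fixed decision threshold. The function names and the threshold value are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np

def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between the enrollment and testing embeddings."""
    enroll_emb = enroll_emb / np.linalg.norm(enroll_emb)
    test_emb = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(enroll_emb, test_emb))

def verify(enroll_emb, test_emb, threshold=0.65):
    """Accept the claimed identity only if the score clears the threshold."""
    score = cosine_score(enroll_emb, test_emb)
    return score >= threshold, score
```

In practice the threshold is tuned on development trials to trade off false acceptances against false rejections, and richer scoring back-ends such as PLDA can replace plain cosine similarity.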
A spectrogram serves as a condensed, image-like representation of an audio signal and is therefore frequently used as the input to deep learning models, especially Convolutional Neural Networks (CNNs) adapted for audio tasks, since CNN-based architectures were originally designed for image processing. This thesis contributes to the assessment of the resilience of CNNs against adversarial attacks, a domain that has not yet been extensively investigated for end-to-end trained CNNs for speaker recognition. Such an examination is essential for sustaining the integrity and security of speaker recognition systems. Our study fills this gap by exploring variations of the iterative Fast Gradient Sign Method (FGSM) to carry out adversarial attacks. We show that a vanilla iterative FGSM attack can alter the identity of each speaker sample to that of any other speaker in the LibriSpeech dataset. Additionally, we introduce adversarial attacks specific to Mel spectrogram features by (a) constraining the number of manipulated pixels, (b) confining alterations to certain frequency bands, (c) limiting changes to particular time segments, and (d) employing a substitute model to generate the adversarial samples. Through comprehensive qualitative and quantitative analyses, we illustrate the vulnerability and counterintuitive behavior of existing CNN-based speaker recognition systems, wherein the predicted speaker identities can be inverted without discernible alterations in the audio. The samples are available at "https://advdemo.github.io/speech/".
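To illustrate the attack variants described above, the following sketch implements a targeted iterative FGSM loop on a Mel spectrogram input, with an optional mask that restricts the perturbation to chosen pixels, frequency bands, or time segments. It assumes a PyTorch classifier that maps a spectrogram to speaker logits; the function name, hyperparameter values, and mask convention are illustrative assumptions, not the thesis's exact implementation.

```python
import torch
import torch.nn.functional as F

def iterative_fgsm_attack(model, mel_spec, target_speaker,
                          epsilon=0.01, alpha=0.001, steps=50, mask=None):
    """Targeted iterative FGSM on a Mel-spectrogram input.

    model          -- CNN mapping a spectrogram to speaker logits (eval mode)
    mel_spec       -- tensor of shape (1, n_mels, n_frames)
    target_speaker -- tensor holding the desired (wrong) speaker label
    mask           -- optional 0/1 tensor broadcastable to mel_spec that
                      restricts the perturbation to chosen pixels,
                      frequency bands, or time segments
    """
    x0 = mel_spec.detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target_speaker)
        grad = torch.autograd.grad(loss, x_adv)[0]
        step = alpha * grad.sign()
        if mask is not None:
            step = step * mask            # perturb only the allowed region
        # Step towards the target speaker (minimise its loss), then project
        # the perturbation back into the epsilon-ball around the original.
        x_adv = x_adv.detach() - step
        x_adv = x0 + torch.clamp(x_adv - x0, -epsilon, epsilon)
    return x_adv
```

Leaving `mask` unset recovers the vanilla iterative FGSM attack; a mask that zeroes out all but a few Mel bands or frames corresponds to the frequency- and time-restricted variants, and running the same loop against a different (substitute) model yields the transfer-based variant.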
Year of completion: June 2024
Advisor: Vineet Gandhi