Open-Vocabulary Audio Keyword Spotting with Low Resource Language Adaptation

Kirandevraj R


Open-vocabulary keyword spotting is the task of spotting audio keywords in an utterance, where the keyword set can include words the system has not seen during training, functioning as zero-shot keyword spotting. The traditional approach uses Automatic Speech Recognition (ASR) to transcribe the audio to text and then searches the text to spot keywords. Other methods obtain posteriors from a Deep Neural Network (DNN) and apply template-matching algorithms to score similarity. Unlike full transcription, keyword spotting only needs to detect the specific words of interest.

In this thesis, we explore the use of an ASR system for keyword spotting. We demonstrate that the intermediate representations of an ASR model can be used for open-vocabulary keyword spotting, and we show the effectiveness of the Connectionist Temporal Classification (CTC) loss for learning word embeddings for this task. We propose a novel method that combines the CTC loss with the traditional triplet loss to learn word embeddings, evaluated on the TIMIT English-language audio dataset. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model during training on TIMIT, whereas the Multi-View recurrent method, which learns jointly on text and acoustic embeddings, achieves only 0.218 on out-of-vocabulary words.

We further propose a novel method to generalize our approach to the low-resource languages Tamil, Vallader, and Hausa. Here we use transliteration to convert Tamil script to English such that the Tamil words, written with the English alphabet, sound similar to the originals. The model is trained with the CTC and triplet loss functions to predict the transliterated text for input Tamil audio. We show that this method helps transfer the knowledge learned from a high-resource language (English) to a low-resource language (Tamil).
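The combination of CTC and triplet losses described above can be sketched in miniature as follows. This is an illustrative assumption, not the thesis's exact formulation: the forward-pass CTC computation is the standard textbook recursion, the triplet loss uses a Euclidean margin, and the weighting factor `lam` is hypothetical.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under per-frame label
    distributions `log_probs` (T x V), via the standard CTC forward pass."""
    T, V = log_probs.shape
    # Extend target with blanks: blank, l1, blank, l2, blank, ...
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                      # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])          # advance one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])          # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

def triplet_loss(anchor, positive, negative, margin=0.4):
    """Hinge triplet loss on Euclidean distances (margin is a guess)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# Toy example: 2 frames, vocab {blank, 'a'}, uniform per-frame probabilities.
# Paths collapsing to "a" are (a,-), (-,a), (a,a): total probability 0.75.
ctc = ctc_neg_log_likelihood(np.log(np.full((2, 2), 0.5)), [1])  # -log(0.75)
trip = triplet_loss(np.zeros(8), np.zeros(8), np.ones(8))
lam = 0.5                      # hypothetical weighting between the two terms
total = ctc + lam * trip
```

In training, the CTC term would be computed on the ASR branch's frame-level outputs while the triplet term shapes the pooled word embedding space; here both are shown only as scalar functions.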
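The transliteration step can be illustrated with a deliberately naive character-level sketch. The mapping table below is a toy assumption covering a single example word; the thesis presumably relies on a full transliteration scheme rather than this hand-built dictionary.

```python
# Toy Tamil-to-Latin mapping (assumption: covers only the example word).
TAMIL_TO_LATIN = {
    "\u0BB5": "va",  # வ
    "\u0BA3": "na",  # ண (retroflex n, folded to plain "n" here)
    "\u0B95": "ka",  # க
    "\u0BAE": "ma",  # ம
}
VIRAMA = "\u0BCD"  # ்  suppresses the inherent vowel of the preceding consonant

def transliterate(word):
    out = ""
    for ch in word:
        if ch == VIRAMA:
            out = out[:-1]               # drop the inherent 'a'
        else:
            out += TAMIL_TO_LATIN.get(ch, ch)
    return out

print(transliterate("வணக்கம்"))  # → vanakkam ("hello")
```

The point of the transliterated target is that a CTC model with a Latin-character output layer, pretrained on English, can be fine-tuned on Tamil audio without changing its output vocabulary.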
We further reduce the model size to make it work in small-footprint scenarios such as mobile phones. To this end, we explore various knowledge distillation loss functions: MSE, KL divergence, and cosine embedding loss. We observe that a small-footprint ASR representation is competitive with knowledge distillation methods for small-footprint keyword spotting. Our methodology reuses existing ASR networks trained on massive datasets and converts them into open-vocabulary keyword spotting systems that can also be generalized to low-resource languages.
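The three distillation objectives named above can be sketched as scalar functions of a student/teacher pair. This is a minimal sketch under standard definitions; the temperature `T=2.0` and the `T**2` scaling follow the common distillation recipe and are assumptions about this thesis's setup.

```python
import numpy as np

def softmax(x, T=1.0):
    """Temperature-softened softmax (numerically stabilized)."""
    z = np.exp((x - x.max()) / T)
    return z / z.sum()

def mse_loss(student, teacher):
    """Mean squared error between student and teacher representations."""
    return np.mean((student - teacher) ** 2)

def kl_div_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T**2 as in the usual distillation recipe (assumed here)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * np.sum(p * (np.log(p) - np.log(q)))

def cos_embedding_loss(student, teacher):
    """1 - cosine similarity, pulling embedding directions together."""
    cos = np.dot(student, teacher) / (
        np.linalg.norm(student) * np.linalg.norm(teacher))
    return 1.0 - cos
```

MSE and cosine embedding losses act on intermediate representations, while the KL term matches the output distributions; which layers the thesis distills from is not specified in the abstract.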

Year of completion: November 2022
Advisors: C V Jawahar, Vinay P Namboodiri, Vinod Kumar Kurmi

Related Publications