Towards building controllable Text to Speech systems

Saiteja Kosgi

Abstract

Text-to-speech systems convert any given text to speech. They play a vital role in making Humancomputer interaction (HCI) possible. As humans, we don't just rely on text (language) to communicate; we use many other mechanisms like voice, gestures, expressions, etc., to communicate efficiently. In natural language processing, vocabulary and grammar tend to take center stage, but those elements of speech only tell half the story. Affective prosody of speech provides larger context and gives meaning to words, and keeps listeners engaged. Current HCI systems largely communicate in text, and they lack a lot of prosodic information, which is crucial in a conversation. To make the HCI systems communicate in speech, text to speech systems should be able to synthesize speech that is expressive and controllable. But the existing text to speech systems learn the average variation in the dataset it’s trained on, which synthesizes samples in a neutral way without many prosodic variations. To this end, we develop a textto-speech system that can synthesize the given emotion where the emotion is represented as a tuple of Arousal, Valance and Dominance (AVD) values. Text to speech systems have a lot of complexities. Training such a system requires the data to be very clear, noiseless, and collecting such data is difficult. If the data is noisy, it will reflect unnecessary artifacts in the synthesized samples. Training emotion based text to speech models is considerably more difficult and not strait forward. The fact that obtaining emotion annotated data for the desired speaker is costly and very subjective makes it a cumbersome task. Current emotion based systems can synthesize emotions with some limitations. (1) Emotion controllability comes at the cost of loss in quality, (2) Have discreet emotions which lack the finer control, and (3) cannot be generalized to new speakers without the annotated emotion data. We propose a system that overcomes the above-mentioned problems by leveraging the largely available corpus of noisy speech annotated with emotions. Even though the data is noisy, our technique trains an emotion based text to speech system that can synthesize desired emotion without any loss of quality in the output. We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate variances/features (pitch, energy, and duration) as levers. We learn how the variances change with respect to emotion. We bring the finer control in the synthesized speech by using AVD values, which can represent emotions in a 3D space. Our proposed method also doesn’t require emotion annotated data for the target speaker. Once trained on the emotion annotated data, it can be applied to any system which has the prediction of the variances as an intermediate step. vi vii With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation of using individual sentences for a complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted ”human touch” in machine dialogue.

Year of completion:	May 2023
Advisor :	Vineet Gandhi