Automatically Generating Audio Descriptions for Movies
Abstract:
Audio Description is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges: the Audio Description must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. This requires a visual-language model that can address all three of the 'what', 'who', and 'when' questions: What is happening in the scene? Who are the characters in the scene? And when should a description be given?
Professor Andrew Zisserman visited IIIT-H and gave a talk on the 21st of August, 2023. He discussed how to build on large pre-trained models to construct a visual-language model that can generate Audio Descriptions addressing the following questions: (i) how to incorporate visual information into a pre-trained language model; (ii) how to train the model using only partial information; (iii) how to use a 'character bank' to provide information on who is in a scene; and (iv) how to improve the temporal alignment of an ASR model to obtain clean data for training.
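One way to picture point (i) is prefix-style conditioning: a small trainable module maps pooled frame features into soft "prefix" tokens that are prepended to the input of a frozen pre-trained language model, which then generates the description. The sketch below is illustrative only, assuming CLIP-style frame features and GPT-2 via the Hugging Face transformers library; the module names, dimensions, and mean-pooling choice are assumptions for exposition, not the implementation described in the talk.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class VisualPrefixLM(nn.Module):
    """Condition a frozen GPT-2 on video by prepending learned visual prefix tokens."""

    def __init__(self, visual_dim: int = 512, n_prefix: int = 8):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        for p in self.lm.parameters():  # the language model stays frozen
            p.requires_grad = False
        d = self.lm.config.n_embd  # GPT-2 embedding width (768 for the base model)
        # Only this mapper is trained: pooled frame features -> n_prefix soft tokens.
        self.mapper = nn.Linear(visual_dim, d * n_prefix)
        self.n_prefix = n_prefix

    def forward(self, frame_feats: torch.Tensor, input_ids: torch.Tensor):
        # frame_feats: (B, T, visual_dim), e.g. CLIP features for T sampled frames
        pooled = frame_feats.mean(dim=1)  # simple mean pooling over time (an assumption)
        prefix = self.mapper(pooled).view(-1, self.n_prefix, self.lm.config.n_embd)
        tok_emb = self.lm.transformer.wte(input_ids)  # embed the text tokens
        inputs = torch.cat([prefix, tok_emb], dim=1)  # visual prefix + text
        return self.lm(inputs_embeds=inputs).logits
```

Training only the mapper on paired clips and descriptions leaves the language model's general fluency intact, which is one reason this style of conditioning is attractive when paired Audio Description data is scarce or only partially aligned.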
Bio:
Andrew Zisserman is a Professor at the University of Oxford and is one of the principal architects of modern computer vision. He is best known for his leading role in establishing the computational theory of multiple-view reconstruction and for developing practical algorithms that are widely used today. This work culminated in the publication of his book with Richard Hartley, Multiple View Geometry in Computer Vision, already regarded as a standard text. He is a Fellow of the Royal Society and has won the prestigious Marr Prize three times.
