Monocular 3D Human Body Reconstruction


Abbhinav Venkat

Abstract

Monocular 3D human reconstruction is a highly relevant problem due to its numerous applications in the entertainment industry, e-commerce, health care, mobile-based AR/VR platforms, etc. However, it is severely ill-posed due to self-occlusions from complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, the single view, etc. Conventional approaches address these challenges with different sensing systems: marker-based systems, marker-less multi-view cameras, inertial sensors, and 3D scanners. Although effective, such methods are often expensive and have limited wide-scale applicability. In an attempt to produce scalable solutions, a few works have focused on fitting statistical body models to monocular images, but these depend on a costly optimization process. Recent efforts use data-driven algorithms such as deep learning to learn priors directly from data. However, they focus on template model recovery or rigid object reconstruction, or propose paradigms that do not directly extend to recovering personalized models.

To predict accurate surface geometry, our first attempt was VolumeNet, which predicts a 3D occupancy grid from a monocular image and was the first model of its kind for non-rigid human shapes at the time. To circumvent the ill-posed nature of this problem (aggravated by an unbounded 3D representation), we follow the ideology of providing maximal training priors through our unique training paradigms, so as to enable testing with minimal information. As we do not impose any body-model-based constraint, we are able to recover deformations induced by free-form clothing. Further, we extend VolumeNet to PoShNet by decoupling pose and shape: we learn the volumetric pose first and use it as a prior for learning the volumetric shape, thereby recovering a more accurate surface.

Although volumetric regression enables more accurate surface reconstruction, it does so without an animatable skeleton. Further, such methods yield low-resolution reconstructions at a high computational cost (regression over a cubic voxel grid) and often suffer from an inconsistent topology with broken or partial body parts. Hence, statistical body models become a natural choice to offset the ill-posed nature of the problem. Although they are theoretically low dimensional, learning such models has been challenging due to the complex non-linear mapping from the image to the relative axis-angle representation; hence, most solutions rely on different projections of the underlying mesh (2D/3D keypoints, silhouettes, etc.). To simplify the learning process, we propose the CR framework, which uses classification as a prior to guide the regression's learning. Although recovering personalized models with high-resolution meshes is not possible in this space, the framework shows that learning such template models can be difficult without additional supervision.

As an alternative to directly learning parametric models, we propose HumanMeshNet, which learns an "implicitly structured point cloud" by using the mesh topology as a prior to enable better learning. We hypothesize that instead of learning the highly non-linear SMPL parameters, learning the corresponding point cloud (although high dimensional) while enforcing the parametric template's topology on it is an easier task. This paradigm can, in principle, learn local surface deformations that the body-model-based PCA space cannot capture. Going forward, producing high-resolution meshes (with accurate geometric details) is a natural extension that is easier in 3D space than in the parametric one.

In summary, this thesis attempts to address several of the aforementioned challenges and to empower machines with the capability to interpret a 3D human body model (pose and shape) from a single image in a manner that is non-intrusive, inexpensive, and scalable. In doing so, we explore different 3D representations capable of producing accurate surface geometry, with the long-term goal of recovering personalized 3D human models.
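To make the HumanMeshNet idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the thesis code) of a loss for an "implicitly structured point cloud": predicted SMPL vertices are supervised directly, while the template's fixed edge connectivity regularizes the point cloud so it behaves like a mesh. The vertex count is SMPL's 6890; the exact terms and weights in the actual work may differ.

```python
import torch

def structured_point_cloud_loss(pred_verts, gt_verts, edges, w_edge=0.5):
    """pred_verts, gt_verts: (B, 6890, 3) SMPL vertex positions.
    edges: (E, 2) long tensor of vertex-index pairs taken from the
    fixed template mesh topology."""
    # Direct per-vertex supervision on the regressed point cloud.
    l_vert = (pred_verts - gt_verts).norm(dim=-1).mean()
    # Topology prior: predicted edge lengths should match ground truth,
    # enforcing the template's surface structure on the point cloud.
    pe = pred_verts[:, edges[:, 0]] - pred_verts[:, edges[:, 1]]
    ge = gt_verts[:, edges[:, 0]] - gt_verts[:, edges[:, 1]]
    l_edge = (pe.norm(dim=-1) - ge.norm(dim=-1)).abs().mean()
    return l_vert + w_edge * l_edge
```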

Year of completion: November 2020
Advisor: Avinash Sharma


Neural and Multilingual Approaches to Machine Translation for Indian Languages and its Applications

Jerin Philip

Abstract

Neural Machine Translation (NMT), together with multilingual formulations, has arisen as the de-facto standard for translating a sentence from a source language to a target language. However, unlike for many Western languages, the available resources for the languages of the Indian subcontinent, such as parallel training data or trained models that can be used to build and demonstrate applications in other domains, are limited. This work takes a major step towards closing this gap. We describe the development of state-of-the-art translation solutions for 10 Indian languages and English, in four parts:

1. Considering the Hindi-English language pair, we develop an NMT solution for a narrow domain, demonstrating its application in translating cricket commentary.
2. Through heavy data augmentation, we extend the above to the general domain and build a state-of-the-art MT system for the Hindi-English pair. We further extend to five more languages by taking advantage of multiway formulations.
3. We demonstrate the application of NMT in contributing more resources to the already resource-scarce field, expanding to 11 languages, and its application in the multimodal task of translating a talking face to a target language with lip synchronization.
4. Finally, we iteratively improve both the data situation and translation performance for 11 Indian languages, placing our models within a standardized, comparable set of metrics so that future advances in the space can be comprehensively evaluated and compared.
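As an illustration of the multiway formulation, a single shared model can serve many translation directions when each source sentence is prefixed with a token naming the desired target language. The sketch below is a hypothetical minimal example of such tagging; the token format and language codes are illustrative, not necessarily the thesis's exact scheme.

```python
def tag_for_multiway(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one shared encoder-decoder
    can translate into many languages, e.g. '__hi__ <sentence>'."""
    return f"__{tgt_lang}__ {src_sentence}"

pairs = [("The batsman hits a six.", "hi"), ("Vaccines save lives.", "ta")]
tagged_corpus = [tag_for_multiway(src, tgt) for src, tgt in pairs]
# The tagged corpus then goes through a standard NMT training pipeline.
```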

Year of completion: August 2020
Advisors: C V Jawahar, Vinay P. Namboodiri


Lip-syncing Videos In The Wild

Prajwal K R

Abstract

The widespread access to the Internet has led to a meteoric rise in audio-visual content consumption. Our habits have shifted from listening to podcasts and radio broadcasts to watching videos on YouTube, and we increasingly prefer the highly engaging nature of video calls over plain voice calls. Given this considerable shift in the desire for audio-visual content, there has also been a surge in video content creation to cater to these consumption needs. Within this fabric of video content creation, especially videos of people talking, lies the problem of making such videos accessible across language barriers. If we want to translate a deep learning lecture video from English to Hindi, not only should the speech be translated but also the visual stream, specifically the lip movements. Learning to lip-sync arbitrary videos to any desired target speech is a problem with several applications, ranging from video translation to readily creating new content that would otherwise require humongous effort. However, speaker-independent lip synthesis for any voice and language is a very challenging task.

In this thesis, we tackle the problem of lip-syncing videos in the wild to any given target speech. We propose two new models in this space: one that significantly improves generation quality and another that significantly improves lip-sync accuracy. In the first model, LipGAN, we identify key issues that plague current approaches to speaker-independent lip synthesis and prevent them from reaching the generation quality of speaker-specific models. Notably, ours is the first model to generate face images that can be pasted back into the video frame, a feature crucial for all real-world applications where the face is just a small part of the displayed content. We show that our improvements in quality enable multiple real-world applications that have not been demonstrated in any previous lip-sync work.

In the second model, Wav2Lip, we investigate why current models are inaccurate when lip-syncing arbitrary talking-face videos, and we hypothesize that the reason is weak penalization of off-sync results during training. This finding allows us to create a lip-sync model that can generate lip-synced videos for any identity and voice with remarkable accuracy and quality. We re-think the current evaluation framework for this task and propose multiple new benchmarks, two new metrics, and a Real-world lip-Sync Evaluation Dataset (ReSyncED). Using our model, we demonstrate applications in lip-syncing dubbed movies and animating real CGI movie clips to new speech, as well as a futuristic video-call application that is useful for poor network connections. Finally, we present the two major applications our model can impact the most: social media content creation and personalization, and video translation. We hope that our advances in lip synthesis open up new avenues for research in talking face generation from speech.
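A strong lip-sync penalty can be obtained by scoring generated frames with a frozen audio-visual sync "expert" and training the generator against that score. The following is a hypothetical PyTorch sketch of such a penalty; the names and exact formulation are illustrative rather than the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def sync_penalty(video_emb: torch.Tensor, audio_emb: torch.Tensor):
    """video_emb, audio_emb: (B, D) embeddings of generated mouth frames
    and the target speech, produced by a frozen sync expert."""
    # High cosine similarity means the lips match the audio, so we
    # minimize the negative log of the (clamped) similarity.
    sim = F.cosine_similarity(video_emb, audio_emb, dim=-1)
    prob = sim.clamp(min=1e-6, max=1.0)  # treat as probability of sync
    return -prob.log().mean()
```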

Year of completion: August 2020
Advisors: C V Jawahar, Vinay P. Namboodiri


Multiscale Two-view Stereo using Convolutional Neural Networks for Unrectified Images

Y N Pramod

Abstract

The two-view stereo problem is a special case of the multi-view stereo problem in which only two views or orientations are available for estimating depth or disparity. Given the constrained nature of this setup, traditional algorithms assume that either the intrinsic or the extrinsic parameters are available in advance in order to build the homographies between the views and rectify the images. Stereo rectification allows the epipolar constraint to be enforced, so that the corresponding projections of a 3D point can be searched in one dimension along the epipolar lines. When the calibration matrices are not available, the two-view stereo problem reduces to estimating the fundamental matrix. The condition number, which measures the instability of a function's output when its input changes, is high for a fundamental matrix estimated with the eight-point algorithm.

Deep learning methods have become the sought-after solution to numerous computer vision problems, as state-of-the-art research has exposed the learning power of neural networks. We explore stereo correspondence in an uncalibrated setting by estimating a depth map from a pair of unrectified stereo images. We seek an end-to-end solution in which the relative depths of pixels with respect to a single viewpoint can be extracted with the aid of another view. Extending the capabilities of the correlation layer devised in the FlowNet architecture, we design a modified FlowNet architecture that regresses depth maps, with an extension to multiscale correlations for handling textureless and repetitively textured surfaces. Due to the unavailability of a dataset for deep learning on unrectified images, we construct a constrained setup of turntable sequences for this purpose using Google 3D Warehouse models.

Following the concepts of attention modelling, we implement an architecture that combines correlations computed at multiple resolutions using a simple element-wise multiplication, helping the network resolve correspondences on textureless and repetitively textured surfaces. Our experiments show both qualitative and quantitative improvements in depth maps over the original FlowNet architecture and provide a solution to unrectified stereo depth estimation, whereas most algorithms in the literature compute depth maps/disparities from rectified stereo image pairs.
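The multiscale fusion described above can be sketched as follows: correlation volumes computed at several resolutions are upsampled to a common size and multiplied element-wise, so a candidate match survives only if it is consistent across scales. This is a hypothetical minimal PyTorch illustration, not the thesis's exact architecture.

```python
import torch
import torch.nn.functional as F

def fused_correlation(corrs):
    """corrs: list of (B, D, H_i, W_i) correlation volumes, ordered
    coarse to fine, with D displacement hypotheses per pixel."""
    target = corrs[-1].shape[-2:]  # finest spatial resolution
    up = [F.interpolate(c, size=target, mode="bilinear",
                        align_corners=False) for c in corrs]
    fused = up[0]
    for c in up[1:]:
        fused = fused * c  # attention-like gating across scales
    return fused
```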

Year of completion: May 2020
Advisor: Anoop M Namboodiri


Semantic Edge Labeling using Depth Cues

Nishit Soni

Abstract

Contours are critical in human perception of a scene: they provide information about object boundaries, surface planes, and surface intersections, and this information helps isolate objects from a scene. Contours have similar importance in computer vision, where it has been shown that labelled edges can contribute to segmentation, reconstruction, and recognition problems. This thesis addresses edge labeling of images of indoor and outdoor scenes using depth and RGB data. We classify contours as occluding or planar (depth discontinuities), and convex or concave (surface-normal discontinuities). This task is not straightforward and is one of the fundamental problems in computer vision.

We propose a novel algorithm that uses a random forest to classify edge pixels into occluding, planar, convex, and concave entities. We first focus on indoor images, where we use depth information from a Kinect, and release an indoor dataset of more than 500 RGBD images with pixel-wise ground-truth labels. Our method produces promising results, achieving an F-score of 0.84. We also test the approach on more complex images from the NYU Kinect dataset, where we obtain an F-score of 0.74.

When addressing this problem in outdoor images, where depth comes from stereo, we recognize the need for additional features: stereo depth of outdoor scenes has artifacts and errors that cannot confidently represent an edge type locally. We show that a simple feature based on semantic classes helps improve the labeling. On the KITTI outdoor driving stereo dataset, we obtain an average F-score of 0.77 for occluding and planar edges, while the approach performs poorly on curvature edges, i.e., convex and concave edges. We find this to be due to stereo depth errors and low-resolution depth at far distances, which lead to poor feature extraction. Nevertheless, we acknowledge the potential of semantic classes to improve edge labeling; with a large amount of ground-truth edge labels and better semantic segmentation, there is hope of improving the classification.
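To make the classification step concrete, here is a hypothetical minimal sketch using scikit-learn's random forest, assuming one precomputed feature vector per edge pixel (e.g., local depth and surface-normal statistics). The features and hyperparameters are placeholders, not the thesis's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

LABELS = ["occluding", "planar", "convex", "concave"]

# Placeholder data: X holds one feature vector per edge pixel,
# y holds the integer index of its label in LABELS.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 16))
y_train = rng.integers(0, 4, 1000)

clf = RandomForestClassifier(n_estimators=100, max_depth=12, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(rng.random((5, 16)))  # labels for unseen edge pixels
print([LABELS[i] for i in pred])
```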

Year of completion: May 2020
Advisor: Anoop M Namboodiri

