
Towards Handwriting Recognition and Search in Indic & Latin Scripts


Santhoshini Gongidi

Abstract

ML-powered document image analysis approaches can enable intelligent solutions for bringing handwritten information into the digital world. Two major components of handwriting understanding are handwritten text recognition (HTR) and handwritten search. The former enables the conversion of handwritten text to a digital format, whereas the latter provides easy access to handwritten information scattered across books, archives, manuscripts, and so on. This thesis focuses on both of these problems. Handwritten document image analysis for Indic scripts is still in its nascent stage compared to Latin scripts: many commercial applications and open-source demonstrations exist for Latin scripts, while hardly any such applications are known for Indic scripts. Developing solutions for Indic scripts is challenging due to (i) the variety of scripts within the Indian subcontinent, (ii) the lack of large annotated datasets and the difficulty of collecting data for multiple scripts, and (iii) inherent characteristics of Indic scripts such as inflections and the joining of multiple glyphs to form an akshara (the equivalent of a character in Latin scripts). While challenging, developing HTR and handwritten search approaches for Indic scripts is also crucial. Therefore, this thesis focuses primarily on approaches for Indic HTR and Indic handwritten search. In the last two decades, large digitization projects have converted paper documents and ancient historical manuscripts into digital form. However, they often remain inaccessible due to the unavailability of robust HTR solutions. Recognizing handwritten text is fundamental to any modern document analysis system. In recent years, efforts toward developing text recognition systems have advanced due to the success of deep neural networks and the availability of annotated datasets, especially for Latin scripts. In this thesis, we discuss the standard text recognition pipeline that comprises various neural network modules. We then present a simple and effective way to improve the text recognition pipeline and training approach, and report the resulting improvements on four benchmark datasets in Latin and Indic scripts. The existing state-of-the-art approaches for Latin HTR and Latin handwritten search are highly data-driven. Due to the lack of large-scale data, developing Indic HTR and Indic handwritten search is challenging. Therefore, we release a collective Indic handwritten dataset with text images from 10 of the most widely spoken Indic scripts. We establish a strong baseline for text recognition in prominent Indic scripts. Our recognition scheme follows contemporary design principles from the recognition literature and yields competitive results on English. We also explore the utility of pre-training for Indic HTR. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts. Finally, we investigate the problem of handwritten search and retrieval for unlabeled collections. Handwritten search pipelines are needed in online platforms such as e-libraries and digital archives. Such pipelines can efficiently search through handwritten collections and present relevant results, much like Google Search. With its ease of access and time-saving capability, a handwritten search application can prove valuable to the many communities that study such historical documents. In this thesis, we present one such pipeline for handwritten search that performs retrieval on new and unseen collections. The proposed retrieval approach is not fine-tuned for the specific writing styles or unknown vocabulary of a new collection and can therefore be applied directly to new unlabeled collections.
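As a minimal illustration of the kind of standard text recognition pipeline discussed above, the following PyTorch sketch combines a convolutional feature extractor with a bidirectional LSTM and CTC training. The layer sizes and vocabulary handling are hypothetical assumptions for illustration and do not correspond to the exact architecture developed in the thesis.

# Illustrative CRNN-style text recognition pipeline: CNN features,
# BiLSTM encoder over the width (time) axis, and CTC training.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Convolutional backbone: collapses image height, keeps width as time
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # height -> 1
        )
        # Recurrent encoder over the horizontal (time) dimension
        self.rnn = nn.LSTM(256, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_classes + 1)  # +1 for the CTC blank

    def forward(self, images):                     # images: (B, 1, H, W)
        feats = self.cnn(images)                   # (B, 256, 1, W')
        feats = feats.squeeze(2).permute(0, 2, 1)  # (B, W', 256)
        feats, _ = self.rnn(feats)                 # (B, W', 512)
        return self.head(feats).log_softmax(-1)    # per-timestep class scores

# Training uses CTC loss, so no character-level alignment is needed:
# loss = nn.CTCLoss(blank=num_classes)(logits.permute(1, 0, 2),
#                                       targets, input_lengths, target_lengths)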

Year of completion: November 2022
Advisor: C V Jawahar

Related Publications


Downloads

thesis

Learnable HMD Facial De-occlusion for VR Applications in a Person-Specific Setting


Surabhi Gupta

Abstract

Immersive technologies such as Virtual Reality (VR) and Augmented Reality (AR) are among the fastest growing and most fascinating technologies today. As the name suggests, these technologies promise to provide users with a far richer experience through immersive head-mounted displays. Medicine, culture, education, and architecture are some areas that have already taken advantage of this technology. Popular video conferencing platforms such as Microsoft Teams, Zoom, and Google Meet are working to improve the user experience by allowing users to use their digital avatars. Nevertheless, they lack immersiveness and realism. What if we could extend these applications to virtual reality platforms so that people feel truly present while talking to each other? Integrating virtual reality platforms into collaborative spaces such as virtual telepresence systems has become quite popular with globalization, since it enables multiple users to share the same virtual environment, thus mimicking real-life face-to-face interactions. For a better immersive experience in virtual telepresence/communication systems, it is essential to recover the entire face, including the portion masked by the headset (e.g., a Head-Mounted Display, abbreviated as HMD). Several methods have been proposed in the literature that deal with this problem in various forms, such as HMD removal and face inpainting. Despite some remarkable explorations, none of these methods provides results usable in virtual reality platforms. Addressing these challenges in the real-world deployment of AR/VR-based applications is drawing growing attention. Considering the limitations and usability of previous solutions, we explore various research challenges and propose a practical approach to facial de-occlusion/HMD removal for virtual telepresence systems. This thesis motivates and introduces the reader to the research challenges in facial de-occlusion, familiarizes them with existing solutions and their inapplicability to our problem domain, and then presents the idea and formulation of our proposed solution. With this view, the first chapter lays out the outline of this thesis. In the second chapter, we propose a method for facial de-occlusion and discuss the importance of personalized facial de-occlusion methods in enhancing the sense of realism in virtual environments. The third chapter discusses refinements to the previously proposed network that improve the reconstruction of the eye region. Last but not least, the final chapter briefly discusses existing face datasets for face reconstruction and inpainting, followed by an overview of the dataset we collected for this work, from acquisition to making it usable for training deep learning models. In addition, we also attempt to extend image-based facial de-occlusion to video frames using off-the-shelf approaches, as briefly explained in Appendix A.
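To make the facial de-occlusion setting concrete, the sketch below frames HMD removal as masked-face inpainting with a simple encoder-decoder generator. All layer sizes are illustrative assumptions; this is a generic example, not the person-specific network proposed in the thesis.

# Minimal sketch: zero out the HMD region, let a generator hallucinate it
# back, and composite the prediction into the occluded pixels only.
import torch
import torch.nn as nn

class DeocclusionGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),   # RGB + mask
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, face, hmd_mask):          # face: (B, 3, H, W), mask: (B, 1, H, W)
        masked = face * (1 - hmd_mask)          # hide the HMD region
        x = torch.cat([masked, hmd_mask], dim=1)
        completed = self.decoder(self.encoder(x))
        # Keep the visible pixels, replace only the occluded region
        return face * (1 - hmd_mask) + completed * hmd_mask

In practice such a generator would be trained with reconstruction and adversarial losses on images of the specific person, which is what makes the person-specific setting effective.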

Year of completion: November 2022
Advisors: Avinash Sharma, Anoop M Namboodiri

Related Publications


Downloads

thesis

Interactive Video Editing using Machine Learning Techniques


Anchit Gupta

Abstract

There is no doubt that video is today's most popular form of content consumption. With the rise of streaming giants such as YouTube and Netflix, video content is accessible to more people, and video content creation has naturally increased to cater to the rising demand. To reach a wider audience, creators dub their content. An important aspect of dubbing is not only changing the speech but also lip-synchronizing the speaker in the video. Talking-face video generation works have achieved state-of-the-art results in synthesizing videos with accurate lip synchronization. However, most previous works deal with low-resolution talking-face videos (up to 256 × 256 pixels), so generating extremely high-resolution videos remains a challenge. Moreover, with advancements in internet and camera technology, more and more people are able to create video content, often in ultra-high resolutions such as 4K (3840 × 2160). In this thesis, we take a giant leap and propose a novel method to synthesize talking-face videos at resolutions as high as 4K! Our task presents several key challenges: (i) scaling existing methods to such high resolutions is resource-constrained, both in terms of compute and the availability of very high-resolution datasets; (ii) the synthesized videos need to be spatially and temporally coherent. The sheer number of pixels the model must generate while maintaining temporal consistency at the video level makes this task non-trivial, and it has never been attempted in the literature. To address these issues, we propose to train the lip-sync generator in a compact Vector Quantized (VQ) space for the first time. Our core idea of encoding faces in a compact 16 × 16 representation allows us to model high-resolution videos. In our framework, we learn the lip movements in the quantized space on the newly collected 4K Talking Faces (4KTF) dataset. Our approach is speaker-agnostic and can handle various languages and voices. We benchmark our technique against several competitive works and show that we can achieve a remarkable 64 times more pixels than the current state-of-the-art! How, then, can one edit videos using the above algorithm, or any other deep learning algorithm? Currently, one has to download the source code of the required method and run it manually. How convenient would it be if people could use deep learning techniques in video editors at the click of a button? In this thesis, we also propose a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively. Apart from lip-syncing, the editor also uses audio and facial re-enactment to generate expressive talking faces. The manual control improves the overall experience of video editing without missing out on the benefits of modern synthetic video generation algorithms. This control enables us to lip-sync complex dubbed movie scenes, interviews, television shows, and other visual content. Furthermore, our editor provides features that automatically translate lectures, covering the spoken content, the lip-sync of the professor, and background content such as slides. While doing so, we also tackle the critical aspect of synchronizing background content with the translated speech. We qualitatively evaluate the usefulness of the proposed editor by conducting human evaluations. Our evaluations show a clear improvement in the efficiency of human editors and in the quality of the generated videos.
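As a rough illustration of what learning in a compact VQ space means, the sketch below quantizes an encoded face into a 16 × 16 grid of codebook indices. The codebook size and dimensions are assumptions for illustration, not the configuration used in the thesis.

# Vector quantization of an encoded face into a small grid of code indices.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                                  # z: (B, code_dim, 16, 16)
        B, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)        # (B*H*W, D)
        # Nearest codebook entry for every spatial position
        dists = torch.cdist(flat, self.codebook.weight)    # (B*H*W, num_codes)
        idx = dists.argmin(dim=1)
        quantized = self.codebook(idx).view(B, H, W, D).permute(0, 3, 1, 2)
        # Straight-through estimator so gradients still reach the encoder
        quantized = z + (quantized - z).detach()
        return quantized, idx.view(B, H, W)

A lip-sync generator can then predict the 16 × 16 grid of code indices conditioned on audio features, and a decoder maps the codes back to pixels, which is what keeps generation tractable even at very high output resolutions.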

Year of completion: November 2022
Advisors: C V Jawahar, Vinay P Namboodiri

Related Publications


Downloads

thesis

Refining 3D Human Digitization Using Learnable Priors


Routhu Snehith Goud

Abstract

The 3D reconstruction of a human from a monocular RGB image is an interesting yet very challenging research problem in computer vision. It has various applications in the movie industry, the gaming industry, and AR/VR. It is therefore important to recover detailed geometric 3D reconstructions of humans in order to enhance realism. The problem is ill-posed owing to the high freedom of human pose, self-occlusions, loose clothing, camera viewpoint, and illumination. Although it is possible to generate accurate geometric reconstructions using structured-light scanners or multi-view cameras, these are expensive and require a specific setup, e.g., numerous cameras and controlled illumination. With the advancement of deep learning techniques, the focus of the community has shifted to the 3D reconstruction of people from monocular RGB images. The goal of this thesis is the 3D digitization of people in loose clothing with accurate person-specific details. Reconstructing people in loose clothing is difficult because the topology of clothing differs from that of the human body. A multi-layered shape representation called PeeledHuman is able to deal with loose clothing and occlusions, but it suffers from discontinuities and distorted body parts in occluded regions. To overcome this, we propose peeled semantic segmentation maps, which provide semantic information about body parts across multiple peeled layers. These peeled semantic maps help the network predict consistent depth for pixels belonging to the same body part across different peeled maps, and they improve reconstruction both quantitatively and qualitatively. Additionally, the 3D semantic segmentation labels have various applications; for example, they can be used to extract the clothing from the reconstructed output. The face plays an important role in 3D human digitization: it gives a person their identity, and high-frequency facial details enhance realism. However, the face occupies only a small region of an image that captures the complete body, which makes high-fidelity face reconstruction along with the body even more challenging. We reconstruct person-specific facial geometry along with the complete body by incorporating a facial prior and refining it further using our proposed framework. Another common challenge faced by existing methods is surface noise, i.e., false geometric edges generated by textural edges present in image space. We address this problem by incorporating a wrinkle-map prior that distinguishes geometric edges from textural edges in image space. In summary, in this thesis we address the problem of 3D human digitization from monocular images and propose efficient and robust methods to recover accurate, personalized 3D humans. We evaluate our proposed solutions on various existing 3D human datasets and demonstrate that they outperform the existing state-of-the-art methods. We also discuss the limitations of the proposed methods, briefly outline potential solutions to address them, and conclude with future directions that can be explored based on this work.
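As a rough sketch of the idea of predicting multi-layer ("peeled") maps from a single image, the following shows a shared encoder with separate heads for peeled depth and peeled body-part segmentation. The layer count, label count, and channel sizes are illustrative assumptions rather than the network used in the thesis.

# Shared encoder with peeled-depth and peeled-segmentation heads.
import torch
import torch.nn as nn

NUM_PEEL_LAYERS = 4      # depth layers along each camera ray (assumed)
NUM_PART_LABELS = 25     # body-part classes per pixel (assumed)

class PeeledPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        # One depth value per peeled layer per pixel
        self.depth_head = nn.Conv2d(128, NUM_PEEL_LAYERS, 1)
        # A body-part label distribution per peeled layer per pixel
        self.seg_head = nn.Conv2d(128, NUM_PEEL_LAYERS * NUM_PART_LABELS, 1)

    def forward(self, image):                    # image: (B, 3, H, W)
        feats = self.encoder(image)
        depths = self.depth_head(feats)          # (B, L, H, W) peeled depth maps
        seg = self.seg_head(feats)               # (B, L*C, H, W)
        B, _, H, W = seg.shape
        seg = seg.view(B, NUM_PEEL_LAYERS, NUM_PART_LABELS, H, W)
        return depths, seg

Supervising the segmentation maps alongside depth is what encourages consistent depth for pixels of the same body part across the peeled layers.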

Year of completion: December 2022
Advisor: Avinash Sharma

Related Publications


Downloads

thesis

Deep Neural Models for Generalized Synthesis of Multi-Person Actions


Debtanu Gupta

Abstract

The ability to synthesize novel and diverse human motion at scale is indispensable not only to the umbrella field of computer vision but also to a multitude of allied fields such as animation, human-computer interaction, robotics, and human-robot interaction. Over the years, various approaches have been proposed, including physics-based simulation, key-framing, database methods, etc. Ever since the renaissance of deep learning and the rapid development of computing, however, the generation of synthetic human motion using deep learning based methods has received significant attention. Apart from pixel-based video data, the availability of reliable motion capture systems has enabled pose-based human action synthesis; much of this is owed to the development of low-cost motion capture systems, which enabled the curation of large-scale skeleton action datasets. In this thesis, we focus on skeleton-based human action generation. To begin with, we study an approach for large-scale skeleton-based action generation. In doing so, we introduce MUGL, a novel deep neural model for large-scale, diverse generation of single- and multi-person pose-based action sequences with locomotion. Our controllable approach enables variable-length generations customizable by action category, across more than 100 categories. To enable intra/inter-category diversity, we model the latent generative space using a Conditional Gaussian Mixture Variational Autoencoder. To enable realistic generation of actions involving locomotion, we decouple the local pose and global trajectory components of the action sequence. We incorporate duration-aware feature representations to enable variable-length sequence generation. We use a hybrid pose sequence representation with 3D pose sequences sourced from videos and 3D Kinect-based sequences from NTU-RGBD120. To enable a principled comparison of generation quality, we employ suitably modified strong baselines during evaluation. Although smaller and simpler than the baselines, MUGL provides better-quality generations, paving the way for practical and controllable large-scale human action generation. Further, we study methods that generalize across datasets with varying properties, as well as methods for dense skeleton action generation. Against this backdrop, we introduce DSAG, a controllable deep neural framework for action-conditioned generation of full-body, multi-actor, variable-duration actions. To compensate for incompletely detailed finger joints in existing large-scale datasets, we introduce full-body dataset variants with detailed finger joints. To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints. We also introduce novel spatiotemporal transformation blocks with multi-head self-attention and specialized temporal processing. These design choices enable generation for a large range of body joint counts (24-52), frame rates (13-50), global body movement (in-place, locomotion), and action categories (12-120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M). Our experimental results demonstrate DSAG's significant improvements over the state-of-the-art, its suitability for action-conditioned generation at scale, and its usefulness for the challenging task of long-term motion prediction.
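As a simplified sketch of category-conditioned generation from a mixture latent space, the example below samples a latent code from a per-action Gaussian mixture prior and decodes a pose sequence of the requested length. All sizes and module choices are illustrative assumptions and do not reproduce the MUGL or DSAG architectures.

# Pick a mixture component for the action class, sample a latent code,
# and decode a variable-length pose sequence.
import torch
import torch.nn as nn

NUM_ACTIONS, NUM_COMPONENTS, LATENT_DIM, POSE_DIM = 120, 10, 64, 75

class ActionDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Per-(action, component) Gaussian parameters of the latent prior
        self.mu = nn.Parameter(torch.zeros(NUM_ACTIONS, NUM_COMPONENTS, LATENT_DIM))
        self.logvar = nn.Parameter(torch.zeros(NUM_ACTIONS, NUM_COMPONENTS, LATENT_DIM))
        self.rnn = nn.GRU(LATENT_DIM, 256, batch_first=True)
        self.pose_head = nn.Linear(256, POSE_DIM)

    @torch.no_grad()
    def generate(self, action_id, num_frames):
        # Sample one mixture component, then a latent code from its Gaussian
        k = torch.randint(NUM_COMPONENTS, (1,)).item()
        z = self.mu[action_id, k] + torch.randn(LATENT_DIM) * \
            (0.5 * self.logvar[action_id, k]).exp()
        # Repeat the latent over time and decode frame by frame
        inp = z.view(1, 1, -1).repeat(1, num_frames, 1)
        hidden, _ = self.rnn(inp)
        return self.pose_head(hidden)            # (1, num_frames, POSE_DIM)

poses = ActionDecoder().generate(action_id=42, num_frames=60)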

Year of completion: December 2022
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Downloads

thesis
