Unsupervised Learning of Disentangled Video Representation for Future Frame Prediction


Ujjwal Tiwari

Abstract

Predicting what may happen in the future is a critical design element in developing an intelligent decision-making system. This thesis aims to shed some light on video prediction models that can predict future frames of a video sequence by observing a set of previously known frames. These models learn video representations encoding the causal rules that govern the physical world, and they have therefore been used extensively in the design of vision-guided robotic systems; they also have applications in reinforcement learning, autonomous navigation, and healthcare. Video frame prediction remains challenging despite the availability of large amounts of video data and the recent progress of generative modeling techniques in synthesizing high-quality images. The challenges associated with predicting future frames can be attributed to two significant characteristics of video data - the high dimensionality of video frames and the stochastic nature of the motion exhibited in these video sequences. Existing video prediction models address the challenge of predicting frames in high-dimensional pixel space by learning a low-dimensional disentangled video representation. These methods factorize video representations into dynamic and static components, and the disentangled representation is subsequently used for the downstream task of future frame prediction.

In Chapter 3, we propose a mutual information-based predictive autoencoder, MIPAE, a self-supervised learning framework that factorizes the latent space representation of videos into two components - a static content component and a dynamic pose component. The MIPAE architecture comprises a content encoder, a pose encoder, a decoder, and a standard LSTM network. We train MIPAE using a two-step procedure: in the first step, the content encoder, pose encoder, and decoder are trained to learn disentangled frame representations. The content encoder is trained using the slow feature analysis constraint, while the pose encoder is trained using a novel mutual information loss term to achieve proper disentanglement. In the second step, we train an LSTM network to predict the low-dimensional pose representation of future frames. The predicted pose and the learned content representations are then decoded to generate future frames of a video sequence. We present detailed qualitative and quantitative results comparing the performance of the proposed MIPAE framework, evaluating our approach on standard video prediction datasets like DSprites, MPI3D-real, and SMNIST using the visual quality assessment metrics LPIPS, SSIM, and PSNR. We also present a metric based on the mutual information gap (MIG) to quantitatively evaluate the degree of disentanglement between the factorized latent variables - pose and content. The MIG score is subsequently used for a detailed comparative study of the proposed framework against other disentanglement-based video prediction approaches to showcase the efficacy of our disentanglement approach. We conclude our analysis by showcasing the visual superiority of the frames predicted by MIPAE.

In Chapter 4, we explore the paradigm of stochastic video prediction models, which aim to capture the inherent uncertainty in real-world videos by using a stochastic latent variable to predict a different but plausible sequence of future frames for each sample of that latent variable. In our work, we modify the architecture of two stochastic video prediction models and apply a novel cycle consistency loss term that disentangles the video representation space into pose and content factors and models the uncertainty in the pose of the objects in the scene, yielding sharp and plausible frame predictions.
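
To make the two-step structure above concrete, here is a minimal PyTorch-style sketch of a disentangled predictor: content is encoded once, per-frame pose codes are rolled forward with an LSTM, and the decoder combines the two. Every module, tensor shape, and hyperparameter below is an illustrative placeholder rather than the MIPAE implementation, and the slow feature analysis and mutual information losses are omitted.

import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    def __init__(self, content_dim=128, pose_dim=16, hidden=256):
        super().__init__()
        # Placeholder encoders/decoder; any convolutional backbone could be substituted.
        self.content_enc = nn.Linear(64 * 64, content_dim)   # static content code
        self.pose_enc = nn.Linear(64 * 64, pose_dim)         # dynamic pose code
        self.lstm = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)
        self.decoder = nn.Sequential(nn.Linear(content_dim + pose_dim, 64 * 64), nn.Sigmoid())

    def forward(self, past_frames, n_future):
        # past_frames: (B, T, 1, 64, 64) grayscale clip of observed frames
        B, T = past_frames.shape[:2]
        flat = past_frames.reshape(B * T, -1)
        content = self.content_enc(flat).reshape(B, T, -1)[:, -1]   # content from the last observed frame
        poses = self.pose_enc(flat).reshape(B, T, -1)                # per-frame pose codes
        preds, state, pose_t = [], None, poses[:, -1:]
        for _ in range(n_future):
            out, state = self.lstm(pose_t, state)     # roll forward in the low-dimensional pose space
            pose_t = self.to_pose(out)
            frame = self.decoder(torch.cat([content, pose_t[:, 0]], dim=-1))
            preds.append(frame.reshape(B, 1, 1, 64, 64))
        return torch.cat(preds, dim=1)                # (B, n_future, 1, 64, 64)

model = FramePredictor()
future = model(torch.rand(2, 5, 1, 64, 64), n_future=10)  # predict 10 frames from 5 observed ones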

Year of completion: June 2024
Advisor: Anoop M Namboodiri

Related Publications


Downloads

thesis

Targeted Segmentation: Leveraging Localization with DAFT for Improved Medical Image Segmentation


Samruddhi Shastri

Abstract

Medical imaging plays a pivotal role in modern healthcare, providing clinicians with crucial insights into the human body’s internal structures. However, extracting meaningful information from medical images, such as X-rays and Computed Tomography (CT) scans, remains a challenging task, particularly in the context of accurate segmentation. This thesis presents a novel two-stage Deep Learning (DL) pipeline designed to address the limitations of existing single-stage models and improve segmentation performance in two critical medical imaging tasks: pneumothorax segmentation in chest radiographs and multi-organ segmentation in abdominal CT scans. The first stage of the proposed pipeline focuses on localizing target organs or lesions within the image. This localization stage uses a specialized module tailored to the specific organ/lesion and image type, and outputs a “localization map” highlighting the most probable regions where the target resides, guiding the next step. The second stage, fine-grained segmentation, precisely delineates the organ/lesion boundaries. This is achieved by combining UNet, known for its ability to capture both general and detailed features, with Dynamic Affine Feature-Map Transform (DAFT) modules that dynamically adjust information within the network. This combined approach leads to more accurate boundary delineation, meticulously outlining the exact borders of the target organ/lesion after roughly locating it in the first stage.

One application of the proposed pipeline is pneumothorax segmentation, which leverages not only the image data but also the accompanying free-text radiology reports. By incorporating text-guided attention and DAFT, the pipeline produces low-dimensional region-localization maps, significantly reducing false positive predictions and improving segmentation accuracy. Extensive experiments on the CANDID-PTX dataset demonstrate the efficacy of the approach, achieving a Dice Similarity Coefficient (DSC) of 0.60 for positive cases and a False Positive Rate (FPR) of 0.052 for negative cases, with DSC ranging from 0.70 to 0.85 for medium and large pneumothoraces.

Another application of the proposed pipeline is multi-organ segmentation in abdominal CT scans, where accurate delineation of organ boundaries is crucial for various medical tasks. The proposed Guided-nnUNet leverages spatial guidance from a ResNet-50-based localization map in the first stage, followed by a DAFT-enhanced 3D U-Net (nnU-Net implementation). Evaluation on the AMOS and Beyond The Cranial Vault (BTCV) datasets demonstrates a significant improvement over baseline models, with an average increase of 7% and 9% on the respective datasets. Moreover, Guided-nnUNet outperforms state-of-the-art (SOTA) methods, including MedNeXt, by 3.6% and 5.3% on the AMOS and BTCV datasets, respectively.

Overall, this thesis proposes a novel two-stage deep learning pipeline for medical image segmentation and demonstrates its effectiveness across anatomical structures and image modalities (2D X-ray, 3D CT), for both single-organ tasks (e.g., pneumothorax segmentation in chest radiographs) and multi-organ tasks (e.g., abdominal CT scans). This comprehensive approach contributes to improved medical image analysis, potentially leading to better healthcare outcomes.
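
As a rough illustration of the localize-then-segment idea above, the sketch below conditions segmentation features on a coarse localization map through an affine (FiLM/DAFT-style) scale-and-shift. The localizer, backbone, and modulation rule are simplified stand-ins chosen for illustration, not the thesis pipeline.

import torch
import torch.nn as nn

class AffineModulation(nn.Module):
    """Scale and shift feature maps using a summary of the localization map (FiLM/DAFT-style)."""
    def __init__(self, channels):
        super().__init__()
        self.to_scale = nn.Linear(1, channels)
        self.to_shift = nn.Linear(1, channels)

    def forward(self, feats, loc_map):
        # feats: (B, C, H, W); loc_map: (B, 1, H, W) coarse probability map from stage 1
        stat = loc_map.mean(dim=(2, 3))                   # (B, 1) crude summary of the map
        scale = self.to_scale(stat)[..., None, None]      # (B, C, 1, 1)
        shift = self.to_shift(stat)[..., None, None]
        return feats * (1 + scale) + shift

class TwoStageSegmenter(nn.Module):
    def __init__(self, localizer, backbone, channels=64):
        super().__init__()
        self.localizer = localizer      # stage 1: coarse localization network
        self.backbone = backbone        # stage 2: UNet-style feature extractor (placeholder)
        self.modulate = AffineModulation(channels)
        self.head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, image):
        loc_map = torch.sigmoid(self.localizer(image))    # where the target organ/lesion probably is
        feats = self.modulate(self.backbone(image), loc_map)
        return self.head(feats)                           # fine-grained segmentation logits

# Toy usage with single-layer stand-ins for the localizer and the UNet backbone.
model = TwoStageSegmenter(nn.Conv2d(1, 1, 3, padding=1), nn.Conv2d(1, 64, 3, padding=1))
logits = model(torch.rand(2, 1, 128, 128))                # (2, 1, 128, 128)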

Year of completion: June 2024
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

Estimating 3D Human Pose, Shape, and Correspondences from Monocular Input


Amogh Tiwari

Abstract

In recent years, advances in computer vision have opened up multiple applications in virtual reality, healthcare, robotics, and many other domains. One crucial problem in computer vision, which has been a key research focus lately, is estimating the 3D human pose, shape, and correspondences from monocular input. This problem has applications in industries such as fashion, entertainment, and healthcare. However, it is also highly challenging for various reasons: large variations in the pose, shape, and appearance of humans and clothing details, external and self-occlusions, and the difficulty of ensuring consistency. As part of this thesis, we tackle two key problems related to 3D human pose, shape, and correspondence estimation. First, we focus on temporally consistent 3D human pose and shape estimation from monocular videos. Next, we focus on dense correspondence estimation across images of different (or the same) humans. We show that, despite receiving a lot of research attention lately, existing methods for these tasks still perform sub-optimally in many challenging scenarios and have significant scope for improvement. We aim to overcome some of the limitations of existing methods and advance the state-of-the-art (SOTA) solutions to these problems.

First, we propose a novel method for temporally consistent 3D human pose and shape estimation from a monocular video. Instead of the generic ResNet-like features used traditionally, our method uses a body-aware feature representation and an independent per-frame pose and camera initialization over a temporal window, followed by a novel spatio-temporal feature aggregation that combines self-similarity and self-attention over the body-aware features and the per-frame initializations. Together, they yield enhanced spatio-temporal context for every frame by considering the remaining past and future frames. These features are used to predict the pose and shape parameters of the human body model, which are further refined using an LSTM.

Next, we expand our focus to dense correspondence estimation between humans, which requires understanding the relations between different body regions (represented using dense correspondences), including the clothing details, of the same or different human(s). We present Continuous Volumetric Embeddings (ConVol-E), a novel, robust representation for dense correspondence matching across RGB images of different human subjects in arbitrary poses and appearances under non-rigid deformation scenarios. Unlike existing representations, ConVol-E captures the deviation from the underlying parametric body model by choosing suitable anchor/key points on the parametric body surface and then representing any point in the volume by its Euclidean relationship with the anchor points. This allows us to represent any arbitrary point around the parametric body (clothing details, hair, etc.) by an embedding vector. Subsequently, given a monocular RGB image of a person, we learn to predict per-pixel ConVol-E embeddings, which carry the same meaning across different subjects and are invariant to pose and appearance, thereby acting as descriptors for establishing robust, dense correspondences across different images of humans.

We thoroughly evaluate our methods on publicly available benchmark datasets and show that they outperform the existing SOTA. Finally, we summarize our contributions and discuss potential future research directions in this problem domain. We believe that this thesis improves the research landscape of human body pose, shape, and correspondence estimation and helps accelerate progress in this direction.
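
A toy sketch of the anchor-point idea behind ConVol-E: a point near the body is described by its Euclidean distances to a fixed set of anchor points on the parametric body surface. The anchor selection, any normalization, and the learned per-pixel predictor are omitted or assumed here; this is only one plausible reading of the "Euclidean relationship" described above, not the method's actual formulation.

import numpy as np

def anchor_distance_embedding(points, anchors):
    """points: (N, 3) query points; anchors: (K, 3) anchor points on the body surface.
    Returns an (N, K) embedding of Euclidean distances to each anchor."""
    diff = points[:, None, :] - anchors[None, :, :]   # (N, K, 3) pairwise offsets
    return np.linalg.norm(diff, axis=-1)              # (N, K) distance-based descriptor

# Example: four hypothetical anchors on a unit "body" and one query point near the surface.
anchors = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.3, 0.5, 0.0], [-0.3, 0.5, 0.0]])
query = np.array([[0.1, 0.9, 0.2]])
print(anchor_distance_embedding(query, anchors))      # embedding vector for the query point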

Year of completion: June 2024
Advisor: Avinash Sharma

Related Publications


Downloads

thesis

Quality Beyond Perception: Introducing Image Quality Metrics for Enhanced Facial and Fingerprint Recognition


Prateek Jaiswal

Abstract

Assessing the quality of biometric images is key to making recognition technologies more accurate and reliable. Our research began with fingerprint recognition systems and later expanded to facial recognition systems, underscoring the importance of image quality in both areas.

For fingerprint recognition, image quality is vital for accuracy. We developed the Fingerprint Recognition-Based Quality (FRBQ) metric, which addresses the limitations of the NFIQ2 model. FRBQ leverages deep learning algorithms in a weakly supervised setting, using matching scores from DeepPrint, a Fixed-Length Fingerprint Representation Model. Each score is labeled to reflect the robustness of fingerprint image matches, providing a comprehensive metric that captures diverse perspectives on image quality. Comparative analysis with NFIQ2 reveals that FRBQ correlates more strongly with recognition scores and performs better in evaluating challenging fingerprint images. Tested on the FVC 2004 dataset, FRBQ has proven effective in assessing fingerprint image quality.

After our success with fingerprint recognition, we turned to facial recognition systems. In facial recognition, image quality involves more than perceptual aspects; it includes features that convey identity information. Existing datasets consider factors like illumination and pose, which enhance robustness and performance; however, age variations and emotional expressions can still pose challenges. To tackle these, we introduce the Unified Tri-Feature Quality Metric (U3FQ). This framework combines age variance, facial expression similarity, and congruence scores from advanced recognition models like VGG-Face, ArcFace, FaceNet, and OpenFace. U3FQ uses a regression network specifically designed for facial image quality assessment. We compared U3FQ to general image quality assessment techniques such as BRISQUE, BLINDS-II, and RankIQA, as well as specialized facial image quality methods such as PFE, SER-FIQA, and SDD-FIQA. Our results, supported by analyses such as DET plots, expression-match heat maps, and EVRC curves, show U3FQ’s effectiveness.

Our study highlights the transformative potential of artificial intelligence in biometrics, capturing critical details that traditional methods might miss. By providing precise quality assessments, we emphasize its role in advancing both fingerprint and facial recognition systems. This work sets the stage for further research and innovation in biometric analysis, underlining the importance of image quality in improving recognition technologies.
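
The weak-supervision recipe behind FRBQ can be sketched as follows: pseudo quality labels are derived from a matcher's genuine-pair scores, and a small regressor learns to predict them from the image alone. The labeling rule (mean matching score), the random placeholder scores, and the tiny network are assumptions for illustration; they stand in for DeepPrint and the actual FRBQ model.

import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_quality_labels(match_scores):
    # match_scores: (N, P) genuine matching scores of each image against P mated samples.
    # Assumed labeling rule: an image that matches its mates well is treated as high quality.
    return match_scores.mean(dim=1, keepdim=True)          # (N, 1) weak quality label

class QualityRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

    def forward(self, x):
        return self.net(x)                                  # predicted quality score

images = torch.rand(16, 1, 96, 96)        # toy fingerprint crops
match_scores = torch.rand(16, 5)          # placeholder scores from a fixed-length matcher
labels = pseudo_quality_labels(match_scores)
model = QualityRegressor()
loss = F.mse_loss(model(images), labels)  # regress the weak labels from the images alone
loss.backward()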

Year of completion: July 2024
Advisor: Anoop M Namboodiri

Related Publications


Downloads

thesis

Towards Label Free Few Shot Learning: How Far Can We Go?


Aditya Bharti

Abstract

Deep learning frameworks have consistently pushed the state-of-the-art across various problem domains, such as computer vision and natural language processing. Such performance improvements have only been made possible by the increasing availability of labeled data and computational resources, which makes applying such systems to low-data regimes extremely challenging. Computationally simple systems that are effective with limited data are essential to the continued proliferation of DNNs to more problem spaces. In addition, generalizing from limited data is a crucial step toward more human-like machine intelligence. Reducing the label requirement is an active and worthwhile area of research, since getting large amounts of high-quality annotated data is labor intensive and often impossible, depending on the domain. There are various approaches to this: artificially generating extra labeled data, using existing information (other than labels) as supervisory signals for training, and designing pipelines that specifically learn using only a few samples. We focus our efforts on the last class of approaches, which aims to learn from limited labeled data, also known as Few-Shot Learning.

Few-shot learning systems aim to generalize to novel classes given very few novel examples, usually one to five. Conventional few-shot pipelines use labeled data from the training set to guide training and then aim to generalize to the novel classes, which have limited samples. However, such approaches only shift the label requirement from the novel classes to the training dataset. In low-data regimes, where there is a dearth of labeled data, it may not be possible to get enough training samples. Our work aims to alleviate this label requirement by using no labels during training, and we examine how much performance is achievable using extremely simple pipelines.

Our contributions are hence twofold. (i) We present a more challenging label-free few-shot learning setup and examine how much performance can be squeezed out of a system without labels. (ii) We propose a computationally and conceptually simple pipeline to tackle this setting, addressing both the compute and data requirements by leveraging self-supervision for training and image similarity for testing.
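
The "image similarity for testing" step can be sketched as nearest-prototype matching: support and query images are embedded with a frozen encoder, each class gets a mean prototype, and a query takes the label of the most cosine-similar prototype. The random linear encoder and the prototype rule below are stand-ins for illustration, not necessarily the exact pipeline used in the thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

def few_shot_classify(encoder, support_imgs, support_labels, query_imgs, n_classes):
    """Assign each query to the class whose mean support embedding is most cosine-similar."""
    with torch.no_grad():
        s = F.normalize(encoder(support_imgs), dim=-1)     # (N_support, D)
        q = F.normalize(encoder(query_imgs), dim=-1)       # (N_query, D)
    protos = torch.stack([s[support_labels == c].mean(0) for c in range(n_classes)])
    protos = F.normalize(protos, dim=-1)                   # one prototype per class
    return (q @ protos.T).argmax(dim=-1)                   # nearest prototype by cosine similarity

# Toy usage: a random linear "encoder" stands in for a frozen self-supervised backbone.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
support = torch.rand(10, 3, 32, 32)
labels = torch.arange(5).repeat_interleave(2)              # 5-way, 2-shot episode
queries = torch.rand(4, 3, 32, 32)
print(few_shot_classify(encoder, support, labels, queries, n_classes=5))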

Year of completion: January 2024
Advisors: C V Jawahar, Vineeth Balasubramanian

Related Publications


Downloads

thesis
