Doctoral Dissertations

Efficient Physically Based Rendering with Analytic and Neural Approximations

Aakash KT

Abstract

Path tracing is ubiquitous for photorealistic rendering of various real-world appearances. It follows the principles of light transport adapted from physics, which describe light propagation as a set of integral equations. These equations are stochastically evaluated by tracing light rays in virtual scenes. Such stochastic evaluations with ray-tracing form the bulk of the path tracing algorithm, which is widely used in the industry.

Stochastic evaluations in path tracing converge to the correct answer with an error that decreases in inverse proportion to the square root of the number of samples. This, coupled with the fact that the underlying integrals are often complex and high-dimensional, results in a high computational cost. Research efforts have thus largely focused on accelerating path tracing by improving the stochastic sampling processes. However, it is also interesting to look at efficient analytic approximations obtained by making reasonable assumptions about the nature of these light transport integrals. Such analytic methods have the potential to achieve zero variance at the outset. In practice, they are often used in conjunction with stochastic methods, thereby achieving lower variance than their fully stochastic counterparts.
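The square-root convergence behaviour can be illustrated with a minimal sketch. It assumes a toy one-dimensional integral (the integral of x^2 over [0, 1], whose exact value is 1/3) in place of a real light transport integral; the function names `mc_estimate` and `rms_error` are hypothetical and are not taken from the thesis.

```python
import math
import random


def mc_estimate(f, n, rng):
    """Monte Carlo estimate of the integral of f over [0, 1) from n uniform samples."""
    return sum(f(rng.random()) for _ in range(n)) / n


def rms_error(f, exact, n, trials, rng):
    """Root-mean-square error of the n-sample estimator, measured over many trials."""
    squared = [(mc_estimate(f, n, rng) - exact) ** 2 for _ in range(trials)]
    return math.sqrt(sum(squared) / trials)


def integrand(x):
    # Toy integrand (an assumption for illustration): exact integral is 1/3.
    return x * x


rng = random.Random(0)  # fixed seed so the run is repeatable
exact = 1.0 / 3.0
errors = {n: rms_error(integrand, exact, n, 200, rng) for n in (100, 400, 1600)}

for n, err in errors.items():
    # err * sqrt(n) should stay roughly constant if the error scales as 1/sqrt(n).
    print(n, err, err * math.sqrt(n))
```

Under these assumptions, multiplying the sample count by 16 should reduce the error by roughly a factor of 4, which is why purely stochastic estimators need many samples to remove visible noise.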

The primary focus of this thesis is to develop new (semi-)analytic methods and improve existing ones to accelerate direct lighting computations in path tracing. We base our research on the theory of Linearly Transformed Cosines (LTC) applied to direct lighting from area lights. The LTC method produces plausible renderings by building on the principles of light transport from the ground up and has proved useful for tasks other than real-time rendering. We make the following three contributions that either build on LTCs or improve them.

We first explore fully analytic direct lighting for arbitrarily shaped area lights, built on LTCs at the core. Due to its assumptions, the LTC method can only handle polygonal area lights. Furthermore, rendering shadows with LTCs requires stochastic evaluations. Our contribution relaxes these assumptions, enabling fully analytic direct lighting with shadows from an arbitrarily shaped area light. We show that our method achieves plausible and noise-free renderings compared to semi-analytic LTCs and ground truth ray-tracing, given an equal compute budget.

Year of completion:	June 2025
Advisor :	Dr. P. J. Narayanan

3D Shape Analysis: Reconstruction and Classification

Jinka Sai Sagar

Abstract

The reconstruction and analysis of 3D objects by computational systems has been an intensive and long-lasting research problem in the graphics and computer vision communities. Traditional acquisition systems are largely restricted to studio setups that require multiple synchronized and calibrated cameras. The advent of active depth sensors, such as time-of-flight and structured-light sensors, made 3D acquisition feasible. This technological advance has paved the way for many research problems, such as 3D object localization, recognition, classification, and reconstruction, which demand sophisticated and elegant solutions to match their ever-growing applications. 3D human body reconstruction, in particular, has wide applications such as virtual mirrors and gait analysis. Lately, with the advent of deep learning, 3D reconstruction from monocular images has garnered significant interest in the research community, as it can be applied to in-the-wild settings.

Secondly, we propose PeeledHuman, a novel shape representation of the human body that is robust to self-occlusions. PeeledHuman encodes the human body as a set of Peeled Depth and RGB maps in 2D, obtained by performing ray-tracing on the 3D body model and extending each ray beyond its first intersection. We learn these Peeled maps in an end-to-end generative adversarial fashion using our novel framework, PeelGAN. PeelGAN enables us to predict the shape and color of the 3D human in an end-to-end fashion at significantly low inference rates.
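The idea of extending a ray beyond its first intersection can be sketched as follows. This is only an illustration under strong assumptions: two spheres stand in for the 3D body model, a single ray is traced instead of a full image of rays, and the helper names `ray_sphere_hits` and `peeled_depths` are hypothetical. It is not the PeelGAN pipeline.

```python
import math


def ray_sphere_hits(origin, direction, center, radius):
    """Distances along a unit-length ray at which it crosses a sphere (empty if it misses)."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    # Keep only crossings in front of the ray origin.
    return [t for t in (-b - root, -b + root) if t > 0]


def peeled_depths(origin, direction, spheres, num_layers=4):
    """Collect every crossing along the ray, nearest first; entry k is peel layer k."""
    hits = []
    for center, radius in spheres:
        hits.extend(ray_sphere_hits(origin, direction, center, radius))
    hits.sort()
    return hits[:num_layers]


# Hypothetical scene: two unit spheres, one occluding the other along the +z axis.
spheres = [((0.0, 0.0, 3.0), 1.0), ((0.0, 0.0, 6.0), 1.0)]
layers = peeled_depths((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), spheres)
print(layers)  # [2.0, 4.0, 5.0, 7.0]
```

A conventional depth map would keep only the first value (2.0). Keeping the later crossings as additional layers is what lets occluded surfaces, here the second sphere, be represented.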

Year of completion:	May 2023
Advisor :	Dr. Avinash Sharma

Surrogate Approximations for Similarity Measures

Nagendar G

Abstract

This thesis targets the problem of surrogate approximations for similarity measures to improve their performance in various applications. We present surrogate approximations for the popular dynamic time warping (DTW) distance, canonical correlation analysis (CCA), Intersection-over-Union (IoU), PCP, and PCKh measures. For DTW and CCA, our surrogate approximations are based on their corresponding definitions. For the IoU, PCP, and PCKh measures, we present a surrogate approximation using neural networks.
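For reference, the IoU measure itself can be computed as in the minimal sketch below. It assumes flat binary masks given as Python sequences, and the function name `iou` and the convention of returning 1.0 for two empty masks are assumptions made for illustration. It shows the exact measure, not the learned surrogate described in the thesis.

```python
def iou(pred, target):
    """Intersection-over-Union of two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(iou(pred, target))  # one shared pixel out of three covered pixels
```

Because the measure is built from counts over hard 0/1 decisions, it does not provide useful gradients with respect to a model's continuous outputs, which is one common motivation for training with a surrogate instead.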

First, we propose a linear approximation for the naïve DTW distance. We speed up the DTW distance computation by learning the optimal alignment from the training data. In our next contribution, we propose a surrogate kernel approximation over CCA. It enables us to use CCA in the kernel framework, further improving its performance. In our final contribution, we propose a surrogate approximation technique that uses neural networks to learn a surrogate loss function over the IoU, PCP, and PCKh measures. For the IoU loss, we validate our method on semantic segmentation models. For the PCP and PCKh losses, we validate it on human pose estimation models.
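As background for the first contribution, the naïve DTW distance is usually computed with a quadratic-time dynamic program. The sketch below shows that textbook baseline, assuming one-dimensional sequences and an absolute-difference local cost; the function name `dtw_distance` is hypothetical. It is the computation being approximated, not the linear approximation proposed in the thesis.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between two sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 is absorbed by warping
print(dtw_distance([0, 0], [1, 1]))           # 2.0
```

The nested loops over both sequences are the source of the quadratic cost that motivates faster approximations.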

Year of completion:	November 2023
Advisor :	Dr. Avinash Sharma

Learning with Weak Supervision for Visual Scene Understanding

Aditya Arun

Abstract

In recent years, computer vision has made remarkable progress in understanding visual scenes, including tasks such as object detection, human pose estimation, semantic segmentation, and instance segmentation. These advancements are largely driven by high-capacity models, such as deep neural networks, trained in fully supervised settings with large-scale labeled data sets. However, reliance on extensive annotations poses scalability challenges due to the significant human effort required to create these data sets. Fine-grained annotations, such as pixel-level segmentation masks, keypoint coordinates for pose estimation, or detailed object instance boundaries, provide the high precision needed for many tasks but are extremely time-consuming and costly to produce. Coarse annotations, on the other hand, such as image-level labels or approximate scribbles, are much easier and faster to create but lack the granularity required for detailed model supervision.

To address these challenges, researchers have increasingly explored alternatives to traditional supervised learning, with weakly supervised learning emerging as a promising approach. This approach mitigates annotation costs by utilizing coarse annotations (cheaper and less detailed) during training rather than the fine-grained annotations required at the output stage during testing. Despite its potential, weakly supervised learning faces challenges in transferring information from coarse annotations to fine-grained predictions, often encountering ambiguity and uncertainty during this process. Existing methods rely on various priors and heuristics to refine annotations, which are then used to train models for specific tasks. This involves managing uncertainty in latent variables during training and ensuring accurate predictions for both latent and output variables at test time.

Year of completion:	June 2025
Advisors :	Prof. C.V. Jawahar, Prof. M. Pawan Kumar

Document Image Layout Segmentation and Applications

Jobin K.V.

Abstract

A document conveys information through various physical entities or regions, such as headings, paragraphs, figures, captions, tables, and backgrounds, in addition to its textual content. To decipher the information in a document, a human reader uses a variety of additional cues, such as context, conventions, and information about language, script, and location, together with a complex reasoning process. Millions of documents are created and distributed daily over the Internet and in printed media. Understanding, analyzing, sorting, and comparing a massive collection of documents in a limited time is a demanding job for humans. Automatic document image understanding systems (DIUS) help humans carry out this tedious task within a limited time. A DIUS typically has a document image layout segmentation module and information extraction modules. This thesis focuses on the challenging problems related to document image layout segmentation in various types of documents and their applications.

In this thesis, we first analyse various document images using deep features, that is, features extracted using a pretrained deep neural network. To study deep texture features, we propose a deep network architecture that independently learns texture patterns, discriminative patches, and shapes to solve various document image analysis tasks. The tasks considered are document image classification, genre identification from book covers, scientific document figure classification, and script identification. The presented network learns global, texture, and discriminative features and combines them judiciously based on the nature of the problem. We compare the performance of the proposed approach with state-of-the-art techniques on multiple publicly available datasets such as Book covers, RVL-CDIP, CVSI, and DocFigure. Experiments show that our approach obtains significantly better performance than the state of the art on all tasks.

Next, we focus on the problem of document image layout segmentation and propose a solution for a class of document images, including historical, scientific, and classroom slide document images. The historical document image segmentation problem is modeled as a pixel labeling task where each pixel in the document image is classified into one of the predefined labels, such as text, comment, decoration, and background. The method first extracts deep features from the superpixels of the document image. Then, we learn an SVM classifier using these features and segment the document image. A pyramid pooling module is used to extract the logical regions of scientific document images. In classroom slide images, the logical regions are distributed according to their location in the image. To utilize the location of the logical regions for slide image segmentation, we propose an architecture called the Classroom Slide Segmentation Network (CSSN). Its attributes distinguish it from most other semantic segmentation networks. We validate the performance of our segmentation architecture using publicly available benchmark datasets.

Next, we analyze the output regions of document layout segmentation. Figures are among the most complex regions of a document for a DIUS to decipher. Hence, document figure classification (DFC) is an important stage of a DIUS. The design of a DFC system requires well-defined figure categories and datasets. Existing datasets for the classification of figures in document images are limited in size and number of categories. We introduce a scientific figure classification dataset named DocFigure. The dataset consists of 33K annotated figures from 28 different categories, taken from document images of scientific articles published in the last several years. Manual annotation of such a large number (33K) of figures is time-consuming and costly. We design a web-based annotation tool that can efficiently assign category labels to many figures with minimal effort from human annotators. To benchmark our dataset on the classification task, we propose three baseline classification techniques that use the deep feature, the deep texture feature, and both. Our analysis found that the combination of deep and texture features is more effective for document figure classification than either feature alone.

Finally, we propose an application backed by the research of this thesis. Slide presentations are an effective and efficient tool for classroom communication used by the teaching community. However, this teaching model can be challenging for blind and visually impaired (VI) students, as such students require personal human assistance to understand the presented slide. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio descriptions corresponding to the slide content. We pose this problem as an image-to-markup language generation task. We extract logical regions such as title, text, equation, figure, and table from the slide image using CSSN. We then extract the content (information) from the slide using four well-established modules: optical character recognition (OCR), figure classification, equation description, and table structure recognition. With this information, we build the CSNS to help VI students understand the slide content. Users gave better feedback on the output quality of the proposed CSNS than on that of existing systems such as Facebook's Automatic Alt-Text (AAT) and Tesseract.

Year of completion:	April 2025
Advisors :	Prof. C V Jawahar
