Development of Annotation Guidelines, Datasets and Deep Networks for Palm Leaf Manuscript Layout Understanding


Sowmya Aitha

Abstract

Ancient paper documents and palm leaf manuscripts from the Indian subcontinent have contributed significantly to the world's literary and cultural heritage. These documents often have complex, uneven, and irregular layouts. Digitising and deciphering their content without human intervention poses difficulties across a broad range of dimensions: language, script, layout, the type and position of layout elements, and the number of manuscripts per image. Large-scale annotated Indic manuscript image datasets are needed for this kind of research. To meet this objective, we present Indiscapes, the first dataset containing multi-regional layout annotations for ancient Indian manuscripts. We also adapt a fully convolutional deep neural network architecture for fully automatic, instance-level spatial layout parsing of manuscript images, designed to deal with challenges such as dense, irregular layout elements, pictures, multiple documents per image, and the wide variety of scripts. We demonstrate the effectiveness of the proposed architecture on images from the Indiscapes dataset.

Despite these advances, semantic layout segmentation using typical deep network methods is not robust to the complex deformations observed across semantic regions. This problem is particularly evident in the low-resource domain of Indian palm-leaf manuscripts. To help address it, we present Indiscapes2, a new, expansive dataset of Indic manuscripts with semantic layout annotations. Indiscapes2 is 150% larger than Indiscapes and contains material from four different historical collections. In addition, we propose Palmira, a novel deep network for reliable, deformation-aware region segmentation in handwritten manuscripts. As an additional performance metric, we report the boundary-centric Hausdorff distance and its variants. Our experiments show that Palmira produces reliable layouts and outperforms both strong baselines and ablative versions. We also present results on Arabic, South-East Asian, and Hebrew historical manuscripts to showcase Palmira's generalization capability.

Although these deep-network approaches to manuscript layout understanding are reliable, they implicitly assume one or two manuscripts per image, whereas in practice multiple manuscripts are often scanned together to maximise scanner surface area and reduce manual labour. Isolating (segmenting) each individual manuscript within a scanned image on a per-instance basis therefore becomes the essential first step in understanding its content, so a precursor system is needed to extract individual manuscripts before downstream processing. The highly curved and deformed boundaries of manuscripts, which frequently cause them to overlap with each other, add further complexity. We introduce a new document image dataset, IMMI (Indic Multi Manuscript Images), to address these issues. We also present a method that generates synthetic images to augment the sourced non-synthetic images, boosting the effective size of the dataset and facilitating deep network training. Our experiments use adapted versions of current document instance segmentation frameworks, and the results demonstrate their efficacy for the task.

Overall, our contributions enable robust extraction of individual historical manuscript pages. This, in turn, could enable better performance on downstream tasks such as region-level instance segmentation, optical character recognition, and word spotting in historical Indic manuscripts at scale.
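As a concrete illustration of the boundary-centric metric mentioned above, the sketch below computes the symmetric Hausdorff distance (and a commonly used averaged variant) between a predicted and a ground-truth region boundary, each represented as a 2D point set. The point-set representation and the use of SciPy are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch, not the thesis code: Hausdorff distance between a
# predicted and a ground-truth region boundary, each given as an (N, 2)
# array of (x, y) points sampled along the boundary.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff: max over both directed distances (scipy returns (d, i, j))."""
    return max(directed_hausdorff(pred, gt)[0], directed_hausdorff(gt, pred)[0])

def average_hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    """A common averaged variant: mean of per-point nearest-neighbour distances."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```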

Year of completion: May 2023
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications

Downloads

thesis

Situation Recognition for Holistic Video Understanding


Zeeshan Khan

Abstract

Video is a complex modality consisting of multiple events, complex actions, humans, objects, and their interactions, densely entangled over time. Understanding videos has been a core and one of the most challenging problems in computer vision and machine learning. What makes it even harder is the lack of a structured formulation of the task, especially for long videos consisting of multiple events and diverse scenes. Prior work in video understanding has addressed the problem only in a sparse, uni-dimensional way, for example action recognition, spatio-temporal grounding, question answering, and free-form captioning. However, fully capturing all the events, actions, and relations between entities, and representing any natural scene in the highest detail and the most faithful way, requires holistic understanding: answering questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) through semantic role labeling was framed as a task for structured prediction of multiple events, their relationships, and actions, with various verb-role pairs attached to descriptive entities. This is one of the densest video understanding tasks, posing several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs; it also faces evaluation challenges because roles are represented by free-form captions.

In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, without requiring ground-truth bounding boxes. Since evaluating free-form captions can be difficult and imprecise, this not only improves the current formulation and the evaluation setup but also improves the interpretability of the model's decisions, because grounding lets us visualise where the model is looking while generating a caption. To this end, we present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips, enabling fine-grained spatio-temporal reasoning. In the second stage, verb-role queries attend to and pool information from the object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions describing each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on the fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localise verb-roles without grounding annotations at training time.
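The sketch below illustrates the three-stage structure described above in PyTorch. The module choices, layer counts, dimensions, and query construction are illustrative assumptions, not the actual VideoWhisperer implementation.

```python
# A hypothetical sketch of a three-stage transformer: contextualise, then let
# verb-role queries pool information, then emit caption logits per role.
import torch
import torch.nn as nn

class ThreeStageSketch(nn.Module):
    def __init__(self, d=512, heads=8, vocab_size=30000):
        super().__init__()
        # Stage 1: contextualise video features jointly with object features.
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), num_layers=2)
        # Stage 2: verb-role queries attend to and pool from the context.
        self.role_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, heads, batch_first=True), num_layers=2)
        # Stage 3: turn each pooled verb-role representation into caption logits.
        self.caption_head = nn.Linear(d, vocab_size)

    def forward(self, video_feats, object_feats, role_queries):
        # video_feats: (B, T, d), object_feats: (B, O, d), role_queries: (B, R, d)
        context = self.context_encoder(torch.cat([video_feats, object_feats], dim=1))
        roles = self.role_decoder(role_queries, context)  # (B, R, d)
        return self.caption_head(roles)  # per-role token logits
```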

Year of completion: May 2023
Advisor: C V Jawahar, Makarand Tapaswi

Related Publications

Downloads

thesis

Computer Vision based Large Scale Urban Mobility Audit and Parametric Road Scene Parsing


Durga Nagendra Raghava Kumar Modhugu

Abstract

The footprint of partially or fully autonomous vehicles is increasing gradually with time. The existence and availability of the necessary modern infrastructure are crucial for the widespread use of autonomous navigation. One of the most critical efforts in this direction is to build and maintain HD maps efficiently and accurately. The information in HD maps is organized in various levels: 1) the geometric layer, 2) the semantic layer, and 3) the map priors layer. Conventional approaches to capturing and extracting information at the different HD map levels rely heavily on huge sensor networks and manual annotation, which does not scale to creating HD maps for massive road networks. In this work, we propose two novel solutions to these problems: the first deals with generating the geometric layer with parametric information of the road scene, and the other updates information on road infrastructure and traffic violations in the semantic layer.

Firstly, the creation of the geometric layer of the HD map requires understanding the road layout in terms of structure, number of lanes, lane width, curvature, etc. Predicting these attributes as part of a generalizable parametric model with which the road layout can be rendered would suit the creation of a geometric layer. Many previous works that tried to solve this problem rely only on ground imagery and are limited by the narrow field of view of the camera, occlusions, and perspective shortening. This work demonstrates the effectiveness of using aerial imagery as an additional modality to overcome these challenges. We propose a novel architecture, Unified, that combines aerial and ground imagery features to infer scene attributes. We evaluate quantitatively on the KITTI dataset and show that our Unified model outperforms prior works. Since this dataset is limited to road scenes close to the vehicle, we supplement the publicly available Argoverse dataset with scene attribute annotations and evaluate far-away scenes. We show quantitatively and qualitatively the importance of aerial imagery in understanding road scenes, especially in regions farther away from the ego-vehicle.

Finally, we also propose a simple mobile imaging setup to address and audit several common problems in urban mobility and road safety, which can enrich the information in the semantic layer of HD maps. Recent computer vision techniques are used to identify street irregularities (including missing lane markings and potholes), absence of street lights, and defective traffic signs from videos obtained by a moving camera-mounted vehicle. Beyond the inspection of static road infrastructure, we also demonstrate the applicability of mobile imaging solutions to spotting traffic violations. We validate our proposal on long stretches of unconstrained road scenes covering over 2000 km and discuss practical challenges in applying computer vision techniques at such a scale. Exhaustive evaluation is carried out on 257 long stretches with unconstrained settings and 20 condition-based hierarchical frame-level labels for different timings, weather conditions, road types, traffic densities, and states of road damage. For the first time, we demonstrate that large-scale analytics of irregular road infrastructure is feasible with existing computer vision techniques.
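To make the idea of inferring scene attributes from two modalities concrete, here is a minimal sketch of fusing ground and aerial image features to predict parametric road attributes, in the spirit of the Unified model described above. The backbones, the two example attribute heads, and all dimensions are assumptions for illustration, not the thesis architecture.

```python
# Hypothetical two-branch fusion: one CNN per modality, concatenated pooled
# features, and per-attribute prediction heads.
import torch
import torch.nn as nn
from torchvision import models

class AerialGroundFusion(nn.Module):
    def __init__(self, max_lanes=6):
        super().__init__()
        self.ground = models.resnet18(weights=None)
        self.aerial = models.resnet18(weights=None)
        self.ground.fc = nn.Identity()  # expose the 512-d pooled features
        self.aerial.fc = nn.Identity()
        self.lane_count = nn.Linear(1024, max_lanes)  # categorical: number of lanes
        self.lane_width = nn.Linear(1024, 1)          # regression: width in metres

    def forward(self, ground_img, aerial_img):
        fused = torch.cat([self.ground(ground_img), self.aerial(aerial_img)], dim=1)
        return self.lane_count(fused), self.lane_width(fused)
```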

Year of completion: December 2022
Advisor: C V Jawahar

Related Publications

Downloads

thesis

Weakly Supervised Explanation Generation for Computer Aided Diagnostic Systems


Aniket Joshi

Abstract

Computer Aided Diagnosis (CAD) systems are developed to aid doctors and clinicians in diagnosis after interpreting and examining a medical image, helping them perform the task more consistently. With the arrival of the data-driven deep learning paradigm and the availability of large amounts of data in the medical domain, CAD systems are being developed to diagnose a wide variety of diseases, including different types of cancer, heart and brain diseases, Alzheimer's disease, and diabetic retinopathy. These systems are highly competent at the task on which they are trained. Although they perform on par with trained clinicians, they suffer from the limitation that they are completely black-box in nature and are trained only on image-level class labels. This poses a problem for deploying CAD systems as stand-alone solutions for disease diagnosis, because decisions in the medical domain concern the health of a patient and must be well reasoned and backed by evidence, sometimes from multiple modalities. Hence, there is a critical need for CAD systems' decisions to be explainable. Restricting our focus to the image modality alone, one way to design explainable CAD systems would be to train the system on both class labels and local annotations and derive explanations in a fully supervised manner. However, obtaining these local annotations is expensive, time-consuming, and infeasible in most circumstances.

In this thesis, we address this explainability and data-scarcity problem and propose two different approaches towards the development of weakly supervised explainable CAD systems. First, we explain the classification decision by providing heatmaps denoting the important regions of interest in the image that led the model to its prediction. To generate anatomically accurate heatmaps, we provide a mixed set of annotations to our model: class labels for the entire training set, and rough localisation of suspect regions for a smaller subset of the training images. The proposed approach is illustrated on two disease classification tasks based on disparate image modalities: diabetic macular edema (DME) classification from OCT slices and breast cancer detection from mammographic images. Good classification results are shown on public datasets, supplemented by explanations in the form of suspect regions; these are derived using just a third of the images with local annotations, emphasising the potential generalisability of the proposed solution.
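For context on what such heatmap explanations look like computationally, the sketch below shows one generic way to obtain a class-evidence heatmap from a CNN classifier with global average pooling (a class activation map). It illustrates the kind of heatmap discussed above; it is not the mixed-supervision method proposed in the thesis.

```python
# Generic class activation map (CAM) sketch for a GAP-based classifier.
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weights, class_idx, out_size):
    # feature_maps: (C, h, w) from the classifier's last conv layer
    # fc_weights:  (num_classes, C) weights of the final linear layer
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], feature_maps)
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalise to [0, 1]
    # upsample to image resolution so it can overlay the input, e.g. an OCT slice
    return F.interpolate(cam[None, None], size=out_size, mode='bilinear',
                         align_corners=False)[0, 0]
```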

Year of completion: November 2021
Advisor: Jayanthi Sivaswamy

Related Publications

Downloads

thesis

Improving the Efficiency of Fingerprint Recognition Systems


Additya Popli

Abstract

Humans have used distinguishing characteristics to identify each other since early times. This practice of identification based on person-specific features, called biometric traits, has developed over time to use more sophisticated techniques and characteristics such as fingerprints, irises, and gait in order to improve identification performance. Fingerprints, owing to their distinctiveness, persistence over time, and ease of capture, have become one of the most widely used biometric traits for identification. However, with this ever-increasing dependence on fingerprint biometrics, it is very important to ensure the safety of these recognition systems against potential attackers. One of the most common and successful ways to circumvent such systems is the use of fake or artificial fingers, synthesised from commonly available materials like silicone and clay, to match the real fingerprint of a particular person. Most fingerprint recognition systems employ a spoof detection module to filter out these fake fingerprints. While spoof detectors seem to work well in general, it is well established that they fail to identify spoofs synthesised from "unseen" or "novel" materials, i.e., materials that were not available during the detector's training phase. While it is possible to synthesise a few fingers from the various available materials, present-day spoof detectors require a large number of samples for training, which is practically infeasible due to the high cost and complexity of fabricating spoof fingers. In this thesis, we propose a method for creating artificial fingerprint images using only a very limited number of artificial fingers made from a specific material. We train a style-transfer network on the available spoof fingerprint images; it learns to extract material properties from an image and then, for each material, uses the limited set of spoof fingerprint images to generate a large dataset of artificial fingerprint images without actually fabricating spoof fingers. These artificial images can then be used to train the spoof detector. Through our experiments, we show that training on these artificially generated spoof images improves the performance of existing spoof detectors on unseen spoof materials.

Another major limitation of present-day recognition systems is their high resource requirements. Most fingerprint recognition systems run a spoof detector as a separate system, either in series or in parallel with a fingerprint matcher, leading to very high memory and time requirements during inference. To overcome this limitation, we explore the relationship between the two tasks in order to develop a common module capable of performing both spoof detection and matching. Our experiments show a high level of correlation between the features extracted for spoof detection and for matching. We propose a new joint model that achieves fingerprint spoof detection and matching performance similar to current state-of-the-art methods on various datasets while using 50% less time and 40% less memory, providing a significant advantage for recognition systems deployed on resource-constrained devices like mobile phones.
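The sketch below illustrates the shared-computation idea described above: a single backbone feeding two light heads, one for spoof detection and one for a matching embedding. The backbone choice, head sizes, and similarity measure are illustrative assumptions, not the thesis design.

```python
# Hypothetical joint model: shared trunk, spoof-detection head, matching head.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class JointSpoofMatcher(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = models.mobilenet_v2(weights=None).features  # shared trunk
        self.spoof_head = nn.Linear(1280, 1)          # live-vs-spoof logit
        self.embed_head = nn.Linear(1280, embed_dim)  # identity embedding

    def forward(self, x):
        f = self.backbone(x).mean(dim=(2, 3))  # global average pool -> (B, 1280)
        spoof_logit = self.spoof_head(f)
        embedding = F.normalize(self.embed_head(f), dim=1)
        return spoof_logit, embedding  # match score: cosine similarity of embeddings
```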

Year of completion: December 2021
Advisor: Anoop M Namboodiri

Related Publications

Downloads

thesis
