
Multi-Object Multi-Part Scene Parsing


Pranav Gupta

Abstract

This thesis explores multi-object multi-part scene parsing in 2D images and presents two approaches, FLOAT and OLAF, each designed to improve scene parsing performance and scalability. The first work introduces FLOAT, a factorized label space framework that predicts object categories and part attributes independently, simplifying the segmentation task and improving scalability. FLOAT also incorporates a 'zoom' refinement technique at inference time that significantly improves segmentation accuracy, particularly for smaller objects and parts. Empirical results on the Pascal-Part datasets demonstrate FLOAT's strong performance, with notable gains in mean Intersection over Union (mIoU) and segmentation-quality IoU (sqIoU), especially on the most comprehensive variant, Pascal-Part-201, reflecting its effectiveness on diverse and complex scenes.
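
To make the factorization concrete, here is a minimal sketch of the idea (not the authors' implementation; the class counts, module names, and composition step are illustrative assumptions):

```python
# Minimal sketch of a factorized label space: instead of one head over every
# object-part combination, predict object class and part attribute separately
# and compose them per pixel. Class counts and names are illustrative.
import torch
import torch.nn as nn

class FactorizedHead(nn.Module):
    def __init__(self, feat_dim=256, num_objects=21, num_parts=16):
        super().__init__()
        self.object_head = nn.Conv2d(feat_dim, num_objects, kernel_size=1)
        self.part_head = nn.Conv2d(feat_dim, num_parts, kernel_size=1)

    def forward(self, features):
        # features: (B, feat_dim, H, W) from any segmentation backbone
        return self.object_head(features), self.part_head(features)

def compose_labels(obj_logits, part_logits):
    """Combine per-pixel object and part predictions into object-part labels."""
    obj = obj_logits.argmax(dim=1)    # (B, H, W) object category per pixel
    part = part_logits.argmax(dim=1)  # (B, H, W) part attribute per pixel
    # A pixel's final label is the (object, part) pair, e.g. "cow" + "head".
    return torch.stack([obj, part], dim=1)

if __name__ == "__main__":
    feats = torch.randn(1, 256, 64, 64)
    obj_logits, part_logits = FactorizedHead()(feats)
    print(compose_labels(obj_logits, part_logits).shape)  # (1, 2, 64, 64)
```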
 
The second work presents OLAF, a plug-and-play approach that augments traditional RGB inputs with object-based structural cues to better capture the complexity of scene structure. OLAF uses a weight adaptation technique that allows pre-trained RGB models to integrate the augmented input seamlessly, stabilizing the optimization process. In addition, the LDF encoder module provides low-level dense feature guidance, enhancing the segmentation of smaller parts. OLAF generalizes across architectures and datasets, achieving significant mIoU gains on multiple Pascal-Part benchmarks and demonstrating broad applicability and robust performance in challenging segmentation scenarios.
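
The weight adaptation idea can be illustrated with a small sketch: a first convolution pre-trained on RGB is widened to accept additional cue channels, with the new filters initialized to zero so the augmented model initially reproduces the RGB model's behaviour. This is a hedged illustration of the general technique, not OLAF's released code; layer shapes and the zero-initialization choice are assumptions.

```python
# Illustrative weight adaptation: reuse pre-trained RGB filters and let the
# extra structural-cue channels start neutral so optimization stays stable.
import torch
import torch.nn as nn

def adapt_first_conv(conv_rgb: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    new_conv = nn.Conv2d(
        conv_rgb.in_channels + extra_channels,
        conv_rgb.out_channels,
        kernel_size=conv_rgb.kernel_size,
        stride=conv_rgb.stride,
        padding=conv_rgb.padding,
        bias=conv_rgb.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight[:, :conv_rgb.in_channels] = conv_rgb.weight  # keep RGB filters
        new_conv.weight[:, conv_rgb.in_channels:] = 0.0              # new cues start at zero
        if conv_rgb.bias is not None:
            new_conv.bias.copy_(conv_rgb.bias)
    return new_conv

# Example: a ResNet-style stem adapted to RGB plus two structural-cue channels.
stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
adapted = adapt_first_conv(stem, extra_channels=2)
print(adapted.weight.shape)  # torch.Size([64, 5, 7, 7])
```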
 
Together, these studies contribute to the evolving field of computer vision by offering scalable, efficient, and effective solutions for multi-object multi-part scene parsing, and they represent a significant step toward parsing intricate scenes with high granularity and diversity.

 

Year of completion: March 2025
Advisor: Ravi Kiran Sarvadevabhatla


High Precision Text Line Segmentation in Palm Leaf Manuscripts


Niharika Vadlamudi

Abstract

Ancient manuscripts were among the first forms of written communication and offer key insights into our past, covering literature, medicine, culture, philosophy, religion, and more. Preserving these writings is essential to identifying and extracting the knowledge they hold. Such document collections frequently exhibit overlapping components, irregular patterns, dense layouts, extremely high aspect ratios, physical and chemical degradation (evident in ink-based manuscripts), text misalignment, and more. Compounding these difficulties are issues that arise during digitization, such as improper document positioning, inadequate illumination, and scanner effects. In this thesis, our main emphasis is on precisely identifying and segmenting text lines within these documents for downstream OCR applications.

We devise a two-stage approach, SeamFormer, that identifies text baselines in palm-leaf manuscripts using a multi-task strategy built on Vision Transformers (ViTs). In the first stage, we detect text strikethroughs, or 'scribbles,' which act as pointers to the locations of text line segments within the document. An encoder-decoder architecture analyzes input image patches and produces two separate maps: a scribble map and a binary map. In the second stage, we post-process these outputs to generate a diacritic-aware global energy map. To produce the final precise text line polygons, we apply a modified seam generation algorithm to these customized global maps.
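
As a rough illustration of the seam step, the sketch below runs a basic dynamic-programming seam through an energy map. The actual SeamFormer stage uses scribble-guided, diacritic-aware energy maps and a modified seam algorithm, so this is a simplification rather than the thesis code.

```python
# Basic left-to-right minimum-energy seam (seam-carving style) through an
# energy map; such seams can trace separations between adjacent text lines.
import numpy as np

def horizontal_seam(energy: np.ndarray) -> np.ndarray:
    """Return, for each column, the row index of a minimum-energy
    left-to-right path through `energy` (H x W)."""
    h, w = energy.shape
    cost = energy.copy()
    for x in range(1, w):
        for y in range(h):
            lo, hi = max(0, y - 1), min(h, y + 2)
            cost[y, x] += cost[lo:hi, x - 1].min()
    # Backtrack from the cheapest endpoint in the last column.
    seam = np.empty(w, dtype=int)
    seam[-1] = int(cost[:, -1].argmin())
    for x in range(w - 2, -1, -1):
        y = seam[x + 1]
        lo, hi = max(0, y - 1), min(h, y + 2)
        seam[x] = lo + int(cost[lo:hi, x].argmin())
    return seam

if __name__ == "__main__":
    energy = np.random.rand(64, 256)  # stand-in for the global energy map
    print(horizontal_seam(energy)[:10])
```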

This methodology is further enhanced by the proposed LineTR model, a multi-task DETR (Detection Transformer) that reconceptualizes scribble generation as a geometric problem, directly predicting line parameters for each text line present in an input image patch. This design enables zero-shot behavior across diverse historical manuscripts. The state-of-the-art approach produces precise text-line segmentation with a single 'unified' model and minimal post-processing, making it a strong candidate for image-to-OCR integration pipelines.

     

Year of completion: March 2025
Advisor: Ravi Kiran Sarvadevabhatla


Computer Vision on Road Scenes: Benchmarking, and Open World Object Detection


Deepak Kumar Singh

Abstract

Autonomous driving relies on multiple computer vision tasks, such as object detection, semantic segmentation, and instance segmentation, which play a crucial role in perceiving the environment around the vehicle. Understanding the behaviour and performance of these tasks helps identify and address key issues inherent in the system, some of which are latent in the deep learning architectures and others in the datasets on which the models are trained and tested. In this thesis, we benchmark popular deep learning models on road scene datasets for several computer vision tasks, and we formulate open-world object detection on road scenes by addressing the inherent issues present in road scene datasets.

In the first part of the work, we aim to understand the performance and behaviour of various deep learning models on road scene datasets: Cityscapes, IDD, and BDD. Object detection, semantic segmentation, and instance segmentation form the basis of many computer vision tasks in autonomous driving, and the complexity of these tasks increases as we move from object detection to instance segmentation. State-of-the-art models are typically evaluated on standard datasets such as PASCAL-VOC and MS-COCO, which do not consider the dynamics of road scenes. Driving datasets such as Cityscapes and Berkeley Deep Drive (BDD) are captured in structured environments with better road markings and fewer variations in the appearance of objects and background. However, the same does not hold for Indian roads. The Indian Driving Dataset (IDD) is captured in unstructured driving scenarios and is highly challenging for a model due to its diversity. This work presents a comprehensive evaluation of state-of-the-art models for object detection, semantic segmentation, and instance segmentation on road scene datasets. We present our analyses and compare their quantitative and qualitative performance on the structured driving datasets (Cityscapes and BDD) and the unstructured driving dataset (IDD). Understanding model behaviour on these datasets helps address various practical issues and supports the creation of real-life applications.
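
For reference, benchmarking semantic segmentation models across such datasets typically reports mean Intersection over Union (mIoU). The sketch below shows a standard confusion-matrix-based computation; dataset loading and model inference are assumed to happen elsewhere, and this is not the exact evaluation code used in the thesis.

```python
# Minimal mIoU computation for semantic segmentation benchmarking.
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    mask = gt != ignore_index
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)

def mean_iou(conf):
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)          # per-class IoU
    return iou, iou[union > 0].mean()        # ignore classes absent from both pred and gt

# Example with random label maps for a 19-class (Cityscapes-style) setup.
gt = np.random.randint(0, 19, size=(512, 1024))
pred = np.random.randint(0, 19, size=(512, 1024))
per_class, miou = mean_iou(confusion_matrix(pred, gt, num_classes=19))
print(f"mIoU: {miou:.3f}")
```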

       

Year of completion: March 2025
Advisor: Jawahar C V


Role of Scene Text Understanding in Enhancing Driver Assistance


George Tom

Abstract

Scene text conveys important information to drivers and pedestrians, serving as an indispensable means of communication in various environments. It carries details such as speed limits, route information, rest stops, and exits, which drivers and passengers need to understand for a safe and efficient journey. However, outdoor scenes are cluttered with text that can distract drivers, potentially compromising their ability to focus on essential details and navigate safely. Recognising scene text in motion aggravates this challenge, as textual cues appear only transiently and must be detected early and at a distance. Driving scenarios introduce additional complexities, including occlusions, motion blur, perspective distortion, and varying text sizes, further complicating scene text understanding.

In this thesis, we look at improving scene text understanding in driving scenarios through video question answering and by analyzing the present state of scene text detection, recognition, and tracking: (i) we introduce new video question answering tasks and datasets that require an understanding of text and road signs in driving videos to answer the questions; (ii) we examine the current state of scene text detection, tracking, and recognition in the driving domain through the RoadText-1K competition; and (iii) we explore detection and recognition in the special case of occlusions, a common yet under-explored complication in real-world driving scenarios. By focusing on these areas, the thesis advances scene text analysis methodologies, offering insights and solutions that are imperative for developing more intelligent and responsive driver assistance systems.

         

Year of completion: November 2024
Advisors: Prof. C V Jawahar, Prof. Dimosthenis Karatzas


Efficient Multimodal Video Representation Learning Through Language


Darshan Singh S

Abstract

This work presents several contributions to video representation learning and related multimodal tasks, addressing key challenges in datasets, efficient model adaptation with less data, and compositional and fine-grained visual understanding. Despite the rapid growth of online lecture videos in recent years, video-language research has primarily focused on instructional videos and movies, leaving a scarcity of specialized datasets for educational lecture videos. To address this, we first introduce AVLectures, a large-scale dataset of STEM lecture videos. It consists of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR for lecture frames and, optionally, lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. We then propose a novel unsupervised temporal segmentation task that divides lecture videos into bite-sized topics, and we show that multimodal cues can be effectively utilized to learn lecture-aware representations for this task, facilitating a richer analysis of educational content.

Next, we address the inefficiency of adapting pre-trained models like CLIP to videos. Existing methods typically rely on large-scale, sparsely annotated video caption datasets, resulting in slow and data-intensive adaptation. We propose SRL-CLIP, a novel approach that leverages the rich, structured semantic information within Semantic Role Labels (SRLs) for highly efficient adaptation. We use VidSitu for adaptation, as it provides dense SRL annotations that holistically represent the entire video. SRL-CLIP achieves comparable or superior performance on various video understanding benchmarks (zero-shot retrieval, situation recognition, dense video captioning, and localization) compared to state-of-the-art models that have 4-8x more parameters and are post-pretrained on up to 4,000x more data.

To further probe models' understanding of visual content, we introduce three novel benchmarks. First, VELOCITI evaluates the compositional reasoning abilities of video-language models, focusing on their ability to bind semantic concepts through time. Second, NDLB is a framework aimed at improving fine-grained image captioning; it uses self-retrieval as a key component, along with a new benchmark that checks whether a model can capture subtle visual distinctions. Finally, D3 is a benchmark specifically designed to evaluate the fine-grained visual discrimination capabilities of MLLMs using self-retrieval, further pushing the boundaries of fine-grained visual understanding. These contributions, which include novel datasets, efficient training recipes, and insightful benchmarks, collectively advance the state of the art in multimodal and video representation learning.
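
As an illustration of the self-retrieval idea used in the fine-grained captioning benchmarks, the sketch below scores generated captions by whether each caption retrieves its own image from a pool, using a dual-encoder such as CLIP. The encoder choice and scoring details are assumptions for illustration, not the thesis' exact setup.

```python
# Hedged sketch of self-retrieval scoring: a caption is fine-grained enough
# if it retrieves its own image from a pool of similar images.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def self_retrieval_accuracy(image_paths, captions):
    """captions[i] is the model-generated caption for image_paths[i]."""
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    tokens = clip.tokenize(captions).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(images).float()
        txt_emb = model.encode_text(tokens).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = txt_emb @ img_emb.T            # caption-to-image similarities
    retrieved = sims.argmax(dim=-1)       # image each caption retrieves
    target = torch.arange(len(captions), device=device)
    return (retrieved == target).float().mean().item()
```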

           

Year of completion: December 2024
Advisor: Jawahar C V

More Articles …

1. Weakly Supervised and Deep Learning Methods for Histopathological Image Classification in Neurological and Renal Disorders
2. Editing Neural Radiance Fields
3. Neural Fields for Hand-object Interactions
4. Vulnerability of Neural Network based Speaker Recognition Systems