
Ads and Anomalies: Structuring the Known and Probing the Unknown


Keralapura Nagaraju Amruth Sagar

Abstract

The convergence of computer vision and advertising analysis has seen progress, but existing advertisement datasets remain limited. Many are small subsets of larger datasets, and while larger datasets may offer multiple annotations, they often lack consistent organization across all images, making it challenging to structure ads hierarchically. This lack of clear categorization, together with overlap in labeling, hinders in-depth analysis. To address this, we introduce MAdVerse, a comprehensive, multilingual dataset of over 50,000 advertisements sourced from websites, social media, and e-newspapers. MAdVerse organizes ads into a hierarchy with 11 primary categories, 51 sub-categories, and 524 specific brands, facilitating fine-grained analysis across a diverse range of brands. We establish baseline performance metrics for key ad-related tasks, including hierarchical classification, source classification, and hierarchy induction in other ad datasets, all in a multilingual context, thereby providing a structured foundation for advertisement analysis.
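To make the hierarchy concrete, here is a minimal sketch of how such (category, sub-category, brand) labels can be represented and scored level by level; the brand names and helper functions are hypothetical illustrations, not the dataset's actual API.

# Minimal sketch of MAdVerse-style hierarchical labels (names hypothetical).
# Each brand sits on a (category, sub-category, brand) path, so a brand-level
# prediction can be scored at every level of the hierarchy.
from typing import Dict, List, Tuple

HIERARCHY: Dict[str, Tuple[str, str]] = {
    "brand_a": ("automobiles", "cars"),       # brand -> (category, sub-category)
    "brand_b": ("food", "beverages"),
}

def path(brand: str) -> Tuple[str, str, str]:
    category, sub_category = HIERARCHY[brand]
    return (category, sub_category, brand)

def levelwise_accuracy(pred_brands: List[str], gt_brands: List[str]) -> List[float]:
    """Accuracy at the category, sub-category, and brand levels."""
    hits = [0, 0, 0]
    for pred, gt in zip(pred_brands, gt_brands):
        for level, (p, g) in enumerate(zip(path(pred), path(gt))):
            hits[level] += int(p == g)
    return [h / len(gt_brands) for h in hits]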

In our second work, we investigate foundational aspects of out-of-distribution (OOD) detection. Existing OOD benchmarks typically focus on broad, class-level shifts but lack controlled environments for assessing how individual attribute changes, such as color or shape, affect OOD detection. To bridge this gap, we created two synthetic datasets, SHAPES and CHARS, each designed to allow controlled experimentation with isolated attribute shifts. Through variations in color, size, rotation, and other factors, these datasets enable a targeted examination of OOD detection performance under specific conditions, providing insight into how detection behaves under different attribute shifts. Finally, we apply OOD detection methods to advertisements, where models face the real-world distribution shifts characteristic of diverse advertising styles.
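One standard baseline for the kind of OOD detection studied here is maximum-softmax-probability (MSP) scoring; the sketch below is a generic illustration of that baseline, not the specific methods evaluated in the thesis, and the classifier and threshold are placeholders.

# Generic maximum-softmax-probability (MSP) OOD baseline (a sketch; not the
# thesis's specific methods). Low MSP flags a sample as out-of-distribution.
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_score(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    logits = model(images)              # (N, num_classes)
    probs = F.softmax(logits, dim=-1)
    return probs.max(dim=-1).values     # higher = more in-distribution

def is_ood(scores: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # In practice the threshold is calibrated on held-in validation data.
    return scores < threshold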

Our contributions, MAdVerse for structured ad analysis and SHAPES and CHARS for controlled OOD studies, emphasize the importance of robust, adaptable models for both foundational research and practical applications in advertisement analysis.

 

Year of completion: December 2024
Advisor: Ravi Kiran Sarvadevabhatla


Advancing Motion with LLMs: Leveraging Large Language Models for Enhanced Text-Conditioned Motion Generation and Retrieval

Kalakonda Sai Shashank

Abstract

In the field of artificial intelligence, the generation of human-like motion from natural language descriptions has garnered increasing attention across various research domains. Computer vision focuses on understanding and replicating visual cues for motion, while computer graphics aims to create and edit visually realistic animations. Similarly, multimedia research explores the intersection of data modalities, such as text, motion, and image, to enhance user experiences. Robotics and human-computer interaction are pivotal areas where language-driven motion systems improve the autonomy and responsiveness of machines, facilitating more efficient and meaningful human-robot interactions. Despite its significance, existing approaches still encounter considerable difficulties, particularly when generating motions from unseen or novel text descriptions. These models often lack the ability to fully capture intricate, low-level motion nuances that go beyond basic action labels. This limitation arises from the reliance on brief and simplistic textual descriptions, which fail to convey the complex and fine-grained characteristics of human motion, resulting in less diverse and realistic outputs. As a result, the generated motions frequently lack the subtlety and depth required for more dynamic and context-specific applications.

This thesis introduces two key contributions to overcome these limitations and advance text-conditioned human motion generation. First, we present Action-GPT, a novel framework aimed at significantly enhancing text-based action generation models by incorporating Large Language Models (LLMs). Traditional motion capture datasets tend to provide action descriptions that are brief and minimalistic, often failing to convey the full range of complexities involved in human movement. Such sparse descriptions limit the ability of models to generate diverse and nuanced motion sequences. Action-GPT leverages LLMs to create richer, more detailed descriptions of actions, capturing finer aspects of movement. By doing so, it improves the alignment between text and motion spaces, enabling models to generate more precise and contextually accurate motion sequences. This framework is designed to work with both stochastic models (e.g., VAE-based) and deterministic models, offering flexibility across different types of motion generation architectures. Experimental results demonstrate that Action-GPT not only enhances the quality of synthesized motions, in terms of both realism and diversity, but also excels in zero-shot generation, effectively handling previously unseen text descriptions.
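As a minimal sketch of the enrichment step (the prompt wording and the llm callable are placeholders, not Action-GPT's exact prompts or models):

# Sketch of LLM-based action-description enrichment in the spirit of
# Action-GPT. `llm` stands in for any text-generation model; the prompt
# template is illustrative, not the paper's exact wording.
from typing import Callable, List

PROMPT = ("Describe in detail how a person performs the action '{action}', "
          "covering body posture and the movement of each limb.")

def enrich_action(action: str, llm: Callable[[str], str], n: int = 4) -> List[str]:
    """Sample n rich descriptions for a terse action label such as 'jump'."""
    return [llm(PROMPT.format(action=action)) for _ in range(n)]

# The enriched descriptions are then encoded with the model's text encoder
# and aggregated before conditioning the motion generator.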

     

Year of completion: February 2025
Advisor: Ravi Kiran Sarvadevabhatla


Towards Understanding Small Objects in Indian Driving Situations

Umamahesh Janapareddi

Abstract

In Indian urban and rural driving scenarios, small objects are pervasive and often crucial for safe navigation. These objects can include pedestrians crossing roads, children playing near streets, cyclists, and stray animals, as well as small vehicles like scooters and motorbikes. Additionally, traffic signs, signal lights, potholes, and road markings (such as lane dividers or zebra crossings) are often small in size but essential for driving decisions. In such contexts, missing or inaccurately segmenting these small objects can lead to critical errors in detection, causing accidents or delays in the vehicle's decision-making process. Automated understanding of such objects requires detection and segmentation as a starting point.

Semantic segmentation is a critical task in computer vision with a wide range of applications. The objective is to partition an image, a collection of pixels, into distinct labeled regions, each corresponding to specific objects or parts of the scene. This process is crucial for scene understanding and enables the localization of objects within the image. Over time, significant progress has been made in semantic segmentation, especially with the advent of deep learning. The advances in this area have revolutionized computer vision, pushing beyond traditional methods and achieving remarkable improvements in performance.

When discussing semantic segmentation, we often focus on datasets, the objects within those datasets, and their corresponding segmentations. While many datasets exist for road scenarios, particularly those representing Western road conditions, there is relatively little research on road conditions specific to India. One notable exception is the Indian Driving Dataset (IDD), a dataset specifically designed for semantic segmentation of Indian road scenarios.

Road and driving datasets typically contain objects of varying sizes within each class label. These objects can be broadly categorized into three types: small, medium, and large. The importance of segmentation is well understood across several domains such as medical imaging, autonomous vehicles, aerial imagery, robotics, surveillance, and industrial automation. However, one of the most challenging problems in this area is the segmentation of small objects. It is particularly difficult due to factors such as (i) the limited number of pixels representing small objects, (ii) class imbalance during training, and (iii) the inherent challenges posed by small object representations. These factors hinder the performance of deep learning architectures, making it harder for modern techniques to accurately handle small objects.
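One common mitigation for the class-imbalance factor, shown here as a generic sketch rather than the specific method used in this thesis, is to weight the segmentation loss by inverse pixel frequency so that rarely occurring small-object classes contribute more:

# Sketch: inverse-frequency class weights for cross-entropy, a common remedy
# when small-object classes occupy very few pixels. Generic, not thesis-specific.
from typing import List
import numpy as np
import torch

def class_weights(label_maps: List[np.ndarray], num_classes: int) -> torch.Tensor:
    counts = np.zeros(num_classes, dtype=np.float64)
    for lm in label_maps:                              # lm: (H, W) int label map
        counts += np.bincount(lm.ravel(), minlength=num_classes)[:num_classes]
    freq = counts / counts.sum()
    weights = 1.0 / np.maximum(freq, 1e-8)             # rare classes weigh more
    weights /= weights.mean()                          # normalize around 1.0
    return torch.tensor(weights, dtype=torch.float32)

# usage: criterion = torch.nn.CrossEntropyLoss(weight=class_weights(labels, C))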

       

Year of completion: March 2025
Advisor: Jawahar C V


Multi-Object Multi-Part Scene Parsing

Pranav Gupta

Abstract

This thesis presents a comprehensive exploration of multi-object multi-part scene parsing in 2D images, showcasing significant advancements through two novel approaches, FLOAT and OLAF, each tailored to enhance scene parsing performance and scalability. The first work introduces FLOAT, a factorized label space framework designed to independently predict object categories and part attributes, thereby simplifying the segmentation task and enhancing scalability. Notably, FLOAT incorporates a unique 'zoom' refinement technique at inference time, significantly elevating segmentation accuracy, particularly for smaller objects and parts. Empirical results on the Pascal-Part datasets underscore FLOAT's superior performance, achieving notable improvements in mean Intersection Over Union (mIOU) and segmentation quality IOU (sqIOU), especially on the most comprehensive Pascal Part-201 dataset, reflecting its effectiveness in handling diverse and complex scenes.
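The factorization can be sketched as two lightweight prediction heads over a shared backbone, so the output space grows additively (objects + parts) rather than multiplicatively (objects x parts); the layer shapes and names below are illustrative, not FLOAT's actual architecture.

# Sketch of a factorized label space: one head predicts the object class per
# pixel, another predicts the part label, instead of a single head over every
# (object, part) combination. Illustrative only; not FLOAT's real architecture.
import torch
import torch.nn as nn

class FactorizedHeads(nn.Module):
    def __init__(self, feat_ch: int, num_objects: int, num_parts: int):
        super().__init__()
        self.object_head = nn.Conv2d(feat_ch, num_objects, kernel_size=1)
        self.part_head = nn.Conv2d(feat_ch, num_parts, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (N, feat_ch, H, W) from a shared backbone
        return self.object_head(feats), self.part_head(feats)

# At inference, per-pixel object and part predictions are combined into
# (object, part) labels.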
         
The second work delves into OLAF, a plug-and-play methodology that augments traditional RGB inputs with object-based structural cues to better capture the complexities of scene structures. This approach leverages a weight adaptation technique, allowing pre-trained RGB models to seamlessly integrate the augmented input, thus stabilizing the optimization process. Additionally, the introduction of the LDF encoder module provides low-level dense feature guidance, enhancing the segmentation of smaller parts. OLAF demonstrates its versatility across various architectures and datasets, achieving significant mIOU gains on multiple Pascal-Part benchmarks, highlighting its broad applicability and robust performance enhancements in challenging segmentation scenarios.
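A common recipe for this kind of weight adaptation is to keep the pre-trained RGB filters and initialize the filters for the new structural-cue channels from their mean; the sketch below shows that generic recipe, and OLAF's exact procedure may differ.

# Sketch: extend a pre-trained first convolution to accept extra input
# channels. Generic recipe; OLAF's actual adaptation may differ in detail.
import torch
import torch.nn as nn

def adapt_first_conv(conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    new_conv = nn.Conv2d(conv.in_channels + extra_channels, conv.out_channels,
                         conv.kernel_size, conv.stride, conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight[:, :conv.in_channels] = conv.weight   # keep RGB filters
        mean_w = conv.weight.mean(dim=1, keepdim=True)        # (out, 1, kH, kW)
        new_conv.weight[:, conv.in_channels:] = mean_w.repeat(1, extra_channels, 1, 1)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv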
         
Together, these studies contribute to the evolving field of computer vision by offering scalable, efficient, and effective solutions for multi-object multi-part scene parsing, reflecting a significant stride in parsing intricate scenes with high granularity and diversity.

         

Year of completion: March 2025
Advisor: Ravi Kiran Sarvadevabhatla


High Precision Text Line Segmentation in Palm Leaf Manuscripts

Niharika Vadlamudi

Abstract

Ancient manuscripts were among the first forms of written communication, offering key insights into our past and covering literature, medicine, culture, philosophy, religion, and more. It is imperative to preserve these writings so that their hidden knowledge can be identified and extracted. Document collections frequently exhibit overlapping components, irregular patterns, dense layouts, extremely high aspect ratios, physical and chemical degradation (evident in ink-based manuscripts), text misalignment, and more. Compounding these difficulties are issues that may arise during digitization, such as improper document positioning, inadequate illumination, and scanner effects. In this thesis, our main emphasis is on identifying and segmenting text lines within these documents with the utmost precision, for downstream OCR applications.

We devise a two-stage approach, SeamFormer, to identify text baselines in palm-leaf manuscripts using a multi-task strategy via Vision Transformers (ViTs). In the first stage, we detect text strikethroughs, namely 'scribbles', which act as pointers to the location of text line segments within the document. An encoder-decoder architecture analyzes input image patches and produces two separate maps: a scribble map and a binary map. In the second stage, we post-process the outputs of the first stage and generate a diacritic-aware global energy map. To generate the final precise text line polygons, we use a modified seam generation algorithm along with customized global maps.
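The seam step can be illustrated with the textbook dynamic-programming seam over an energy map; SeamFormer's modified algorithm and customized global maps are more involved than this sketch.

# Sketch: classic horizontal-seam dynamic programming over an energy map.
# A minimum-energy left-to-right path traces the separation between text lines.
import numpy as np

def horizontal_seam(energy: np.ndarray) -> np.ndarray:
    """Return the row index, per column, of a minimum-energy horizontal seam."""
    H, W = energy.shape
    cost = energy.astype(np.float64)
    for x in range(1, W):                      # accumulate costs left to right
        for y in range(H):
            lo, hi = max(y - 1, 0), min(y + 2, H)
            cost[y, x] += cost[lo:hi, x - 1].min()
    seam = np.empty(W, dtype=int)
    seam[-1] = int(cost[:, -1].argmin())
    for x in range(W - 2, -1, -1):             # backtrack right to left
        y = seam[x + 1]
        lo, hi = max(y - 1, 0), min(y + 2, H)
        seam[x] = lo + int(cost[lo:hi, x].argmin())
    return seam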

This methodology is further enhanced by the proposed LineTR model, a multi-task DETR (Detection Transformer) model that reconceptualizes scribble generation as a geometric problem: it directly generates line parameters for each text line present in the input image patch. This design decision enables the model to exhibit zero-shot behavior across diverse historical manuscripts. The approach produces precise text-line segmentations with a single 'unified' model and minimal post-processing, making it a strong candidate for image-to-OCR integration pipelines.

           

Year of completion: March 2025
Advisor: Ravi Kiran Sarvadevabhatla

