Thesis Students

Coreference Without Bells and Whistles

S Kawshik Manikantan

Abstract

Coreference resolution (CR) is the task of identifying text spans that refer to the same entity. It is a fundamental component of natural language understanding with applications in various downstream NLP tasks, such as question answering, knowledge graph construction, and summarization. Despite its significance and the advancements made by neural coreference models, CR models face a major bottleneck: their limited generalization capability.

Prior work attributes this generalization gap to differences in annotations, such as what constitutes a mention (or entity) and varying preferences for span boundaries. For a model to have strong referential capabilities, it must adapt to these annotation-specific nuances. However, achieving this level of adaptability remains a significant challenge, even for state-of-the-art (SOTA) models. This challenge is further amplified when evaluating the referential capabilities of large language models (LLMs) in a few-shot setting, where replicating nuanced annotations with just a few examples is highly unrealistic. We observe that these annotation-specific nuances can be beneficial but are not essential for downstream tasks or for evaluating the core referential capabilities of an LLM. We describe these nuances as bells and whistles.

In this work, we redefine the traditional formulation of coreference resolution by shifting focus away from its bells and whistles. Instead, we propose task formulations more aligned with practical applications and demonstrate improved generalizability across domains.

Our first contribution introduces an alternative referential task, Major Entity Identification (MEI). MEI simplifies referential tasks by: (a) assuming that target entities are explicitly provided in the input, and (b) focusing exclusively on frequent entities. Assuming entities to be part of the input shifts the responsibility for domain-specific annotation adaptation (determining which entities are annotated) from the training phase to inference. Through extensive experiments across multiple datasets, we show that MEI models generalize effectively across domains, using both supervised approaches and LLM-based few-shot prompting. Importantly, MEI aligns with the classification framework, enabling the use of robust, intuitive, and well-understood classification-based evaluation metrics. Beyond its theoretical appeal, MEI also has practical utility, as it allows users to efficiently search for all mentions of a specific entity or a group of entities of interest.
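Because MEI treats each mention as a classification target, standard per-class metrics apply directly. The following is a minimal sketch, not the thesis's evaluation code: it assumes each mention carries one gold label and one predicted label drawn from the provided major entities, plus a hypothetical `OTHER` label for mentions that refer to none of them. The function names are illustrative.

```python
from collections import Counter


def per_entity_f1(gold, pred):
    """Return {entity: (precision, recall, f1)} for two parallel label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted entity p, but the mention is not p
            fn[g] += 1  # missed a mention of entity g
    scores = {}
    for entity in set(gold) | set(pred):
        p_den = tp[entity] + fp[entity]
        r_den = tp[entity] + fn[entity]
        precision = tp[entity] / p_den if p_den else 0.0
        recall = tp[entity] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[entity] = (precision, recall, f1)
    return scores


def macro_f1(gold, pred, ignore=("OTHER",)):
    """Average F1 over entities, skipping the hypothetical OTHER label."""
    scores = per_entity_f1(gold, pred)
    kept = [f1 for entity, (_, _, f1) in scores.items() if entity not in ignore]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example: five mentions, two major entities.
gold_labels = ["Alice", "Alice", "Bob", "OTHER", "Bob"]
pred_labels = ["Alice", "Bob", "Bob", "OTHER", "Bob"]
entity_scores = per_entity_f1(gold_labels, pred_labels)
overall = macro_f1(gold_labels, pred_labels)
```

In the toy example, one mention of Alice is misattributed to Bob, which lowers Alice's recall and Bob's precision without involving any span-boundary or cluster-alignment logic.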

Our second major contribution addresses critical shortcomings identified in recent evaluations of LLMs on coreference resolution. These studies revealed that traditional output formats and evaluation metrics fail to fully capture models' referential understanding. Traditional evaluation methods require reproducing the entire document along with annotated cluster information or precisely replicating the antecedent span. This introduces additional bells and whistles, such as ensuring the accurate reproduction of spans and documents. To tackle this issue, we introduce IdentifyMe, a new benchmark for mention resolution that adopts a multiple-choice question (MCQ) format, a widely used evaluation approach for LLMs. With this simplified task design, any failure can now be attributed exclusively to issues with mention resolution. IdentifyMe presents long narratives and applies heuristics to eliminate easily identifiable mentions, resulting in a more challenging and rigorous task. The benchmark incorporates a curated mix of various mention types and their corresponding entities, enabling fine-grained analysis of model performance. Notably, LLM performance remains substantially below human-level performance on IdentifyMe, highlighting considerable room for improvement even for advanced models like GPT-4. The evaluation also reveals key weaknesses in current LLMs, particularly with pronominal mentions, nested mentions, and other nuanced cases.
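To illustrate how an MCQ format removes the need to reproduce spans or documents, here is a minimal sketch of turning one mention into a question and scoring answers by option letter. It is not the IdentifyMe pipeline: the `[[...]]` marker, the prompt wording, and the function names are all assumptions made for illustration.

```python
import string


def build_mcq(text, mention_span, candidates):
    """Mark the mention in the text and list candidate entities as lettered options."""
    if len(candidates) > len(string.ascii_uppercase):
        raise ValueError("too many candidates for single-letter options")
    start, end = mention_span
    # Hypothetical marking scheme: wrap the mention of interest in [[ ]].
    marked = text[:start] + "[[" + text[start:end] + "]]" + text[end:]
    letters = string.ascii_uppercase[: len(candidates)]
    options = "\n".join(f"{letter}. {cand}" for letter, cand in zip(letters, candidates))
    prompt = (
        f"{marked}\n\n"
        "Who or what does the marked mention refer to?\n"
        f"{options}\nAnswer:"
    )
    return prompt, dict(zip(letters, candidates))


def accuracy(gold_letters, model_answers):
    """Fraction of answers whose first non-space character matches the gold letter."""
    if not gold_letters:
        return 0.0
    correct = sum(
        gold == answer.strip().upper()[:1]
        for gold, answer in zip(gold_letters, model_answers)
    )
    return correct / len(gold_letters)


# Toy example with a pronominal mention.
story = "Anna met Ben. She smiled."
start = story.index("She")
prompt, option_map = build_mcq(story, (start, start + len("She")), ["Anna", "Ben"])
```

Scoring reduces to comparing a single letter, so a wrong answer reflects a resolution error rather than a formatting or span-copying error.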

Overall, this work moves beyond traditional coreference resolution formulations, focusing on tasks with practical applicability and providing fresh insights into the referential strengths and weaknesses of current models. We term this approach Coreference Without Bells and Whistles — a streamlined perspective that prioritizes utility and understanding of model capabilities over tailored annotation adaptation.

Year of completion:	May 2025
Advisor :	Vineet Gandhi

Related Publications

Downloads

Predictive Modeling of Accident-Prone Road Zones and Action Recognition in Unstructured Traffic Scenarios using ADAS Systems at Population Scale

Ravi Shankar Mishra

Abstract

This thesis addresses the critical challenge of improving road safety by introducing novel approaches to predictive modeling of accident-prone zones and action recognition in critical traffic scenarios. It makes two key contributions: the early identification of accident-prone zones using Advanced Driver Assistance System (ADAS) data and the development of IDD-CRS, a comprehensive dataset for action recognition in unstructured road environments.

In the first study, geo-tagged collision alert data from a fleet of 200 ADAS-equipped city buses in Nagpur, India, is leveraged to proactively identify high-risk zones across urban road networks. Using Kernel Density Estimation (KDE), this study captures the spatiotemporal distribution of collision alerts, enabling the detection of emerging blackspots before accidents occur. A novel recall-based metric evaluates the alignment of these predicted zones with historical blackspots, while Earth Mover Distance (EMD)-based analysis identifies previously unreported accident-prone areas. This predictive framework provides civic authorities with actionable insights for targeted interventions, such as traffic-calming measures and infrastructure improvements, thereby enhancing public safety.
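The idea of ranking locations by alert density and checking them against known blackspots can be sketched in a few lines. This is a simplified, purely spatial illustration under stated assumptions, not the thesis's method: coordinates are treated as planar (for example, kilometres in a local projection), the kernel is an isotropic Gaussian, and the recall definition (a blackspot counts as recovered if any predicted zone lies within a given radius) is a hypothetical stand-in for the recall-based metric mentioned above.

```python
import math


def kde_density(point, alerts, bandwidth):
    """Gaussian kernel density estimate at `point` from 2D alert locations."""
    if not alerts:
        return 0.0
    x, y = point
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(alerts))
    total = 0.0
    for ax, ay in alerts:
        sq_dist = (x - ax) ** 2 + (y - ay) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total


def top_zones(grid, alerts, bandwidth, k):
    """Return the k grid points with the highest estimated alert density."""
    ranked = sorted(
        grid, key=lambda p: kde_density(p, alerts, bandwidth), reverse=True
    )
    return ranked[:k]


def blackspot_recall(zones, blackspots, radius):
    """Fraction of known blackspots lying within `radius` of any predicted zone."""
    if not blackspots:
        return 0.0
    hits = sum(
        any(math.dist(spot, zone) <= radius for zone in zones)
        for spot in blackspots
    )
    return hits / len(blackspots)


# Toy data: three alerts clustered near the origin and one isolated alert.
alerts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
grid = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
zones = top_zones(grid, alerts, bandwidth=0.5, k=1)
recall = blackspot_recall(zones, [(0.2, 0.1), (5.0, 5.0)], radius=0.5)
```

A spatiotemporal variant would add a time dimension to each alert and to the kernel; the bandwidth, grid resolution, and radius here are illustrative values only.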

The second part of the thesis introduces the IDD-CRS dataset, a large-scale collection of traffic scenarios recorded using ADAS and dash cameras. IDD-CRS fills a critical gap in existing datasets by focusing on complex interactions between vehicles and pedestrians, with scenarios such as high-speed lane changes, unsafe vehicle approaches, and near-miss incidents. With precise temporal annotations powered by ADAS technology, the dataset ensures accurate event boundaries, providing a robust benchmark for action recognition and long-tail action recognition tasks. It includes 90 hours of footage spanning 5,400 one-minute videos and 135,000 frames, with hard negative examples to challenge existing models. Initial benchmarks highlight the limitations of current video backbones in recognizing rare events, emphasizing the need for further advancements.

Together, these contributions provide a holistic framework for improving road safety through proactive accident prevention and robust action recognition in traffic scenarios. By addressing both spatial accident prediction and temporal event recognition, this work offers foundational resources and actionable insights to advance research and practical solutions for safer road environments.

Year of completion:	April 2025
Advisor :	Ravi Kiran Sarvadevabhatla

Related Publications

Downloads

Ads and Anomalies: Structuring the Known and Probing the Unknown

Keralapura Nagaraju Amruth Sagar

Abstract

The convergence of computer vision and advertising analysis has seen progress, but existing advertisement datasets remain limited. Many are small subsets of larger datasets, and while larger datasets may offer multiple annotations, they often lack consistent organization across all images, making it challenging to structure ads hierarchically. This lack of clear categorization and overlap in labeling hinders in-depth analysis. To address this, we introduce MAdVerse, a comprehensive, multilingual dataset of over 50,000 advertisements sourced from websites, social media, and e-newspapers. MAdVerse organizes ads into a hierarchy with 11 primary categories, 51 sub-categories, and 524 specific brands, facilitating fine-grained analysis across a diverse range of brands. We establish baseline performance metrics for key ad-related tasks, including hierarchical classification, source classification, and hierarchy induction in other ad datasets, as well as in a multilingual context, thereby providing a structured foundation for advertisement analysis.

In our second work, we investigate foundational aspects of out-of-distribution (OOD) detection. Existing OOD benchmarks typically focus on broad, class-level shifts but lack controlled environments for assessing how individual attribute changes, such as color or shape, affect OOD detection. To bridge this gap, we created two synthetic datasets, SHAPES and CHARS, each designed to allow controlled experimentation with isolated shifts in attributes. Through variations in color, size, rotation, and other factors, these datasets facilitate a targeted examination of OOD detection performance under specific attribute shifts. Later, we apply OOD detection methods to advertisements, where models face real-world distribution shifts characteristic of diverse advertising styles.

Our contributions, MAdVerse for structured ad analysis and SHAPES and CHARS for controlled OOD studies, emphasize the importance of robust, adaptable models for both foundational research and practical applications in advertisement analysis.

Year of completion:	December 2024
Advisor :	Ravi Kiran Sarvadevabhatla

Related Publications

Downloads

Advancing Motion with LLMs: Leveraging Large Language Models for Enhanced Text-Conditioned Motion Generation and Retrieval

Kalakonda Sai Shashank

Abstract

In the field of artificial intelligence, the generation of human-like motion from natural language descriptions has garnered increasing attention across various research domains. Computer vision focuses on understanding and replicating visual cues for motion, while computer graphics aims to create and edit visually realistic animations. Similarly, multimedia research explores the intersection of data modalities, such as text, motion, and image, to enhance user experiences. Robotics and human-computer interaction are pivotal areas where language-driven motion systems improve the autonomy and responsiveness of machines, facilitating more efficient and meaningful human-robot interactions. Despite its significance, existing approaches still encounter significant difficulties, particularly when generating motions from unseen or novel text descriptions. These models often lack the ability to fully capture intricate, low-level motion nuances that go beyond basic action labels. This limitation arises from the reliance on brief and simplistic textual descriptions, which fail to convey the complex and fine-grained characteristics of human motion, resulting in less diverse and realistic outputs. As a result, the generated motions frequently lack the subtlety and depth required for more dynamic and context-specific applications.

This thesis introduces two key contributions to overcome these limitations and advance text-conditioned human motion generation. First, we present Action-GPT, a novel framework aimed at significantly enhancing text-based action generation models by incorporating Large Language Models (LLMs). Traditional motion capture datasets tend to provide action descriptions that are brief and minimalistic, often failing to convey the full range of complexities involved in human movement. Such sparse descriptions limit the ability of models to generate diverse and nuanced motion sequences. Action-GPT leverages LLMs to create richer, more detailed descriptions of actions, capturing finer aspects of movement. By doing so, it improves the alignment between text and motion spaces, enabling models to generate more precise and contextually accurate motion sequences. This framework is designed to work with both stochastic models (e.g., VAE-based) and deterministic models, offering flexibility across different types of motion generation architectures. Experimental results demonstrate that Action-GPT not only enhances the quality of synthesized motions, in terms of both realism and diversity, but also excels in zero-shot generation, effectively handling previously unseen text descriptions.

Year of completion:	February 2025
Advisor :	Ravi Kiran Sarvadevabhatla

Related Publications

Downloads

Towards Understanding Small Objects in Indian Driving Situations

Umamahesh Janapareddi

Abstract

In Indian urban and rural driving scenarios, small objects are pervasive and often crucial for safe navigation. These objects can include pedestrians crossing roads, children playing near streets, cyclists, and stray animals, as well as small vehicles like scooters and motorbikes. Additionally, traffic signs, signal lights, potholes, and road markings (such as lane dividers or zebra crossings) are often small in size but essential for driving decisions. In such contexts, missing or inaccurately segmenting these small objects can lead to critical errors in detection, causing accidents or delays in the vehicle's decision-making process. Automated understanding of such objects begins with detection and segmentation.

Semantic Segmentation is a critical task in computer vision with a wide range of applications. The objective is to partition an image—a collection of pixels—into distinct labeled regions, each corresponding to specific objects or parts of the scene. This process is crucial for scene understanding and enables the localization of objects within the image. Over time, significant progress has been made in semantic segmentation, especially with the advent of deep learning. The advances in this area have revolutionized computer vision, pushing beyond traditional methods and achieving remarkable improvements in performance.

When discussing semantic segmentation, we often focus on datasets, the objects within those datasets, and their corresponding segmentations. While many datasets exist for road scenarios, particularly those representing Western road conditions, there is relatively little research on road conditions specific to India. One notable exception is the Indian Driving Dataset (IDD), a dataset specifically designed for semantic segmentation of Indian road scenarios.

Road and driving datasets typically contain objects of varying sizes within each class label. These objects can be broadly categorized into three types: small, medium, and large. The importance of segmentation is well understood across several domains such as medical imaging, autonomous vehicles, aerial imagery, robotics, surveillance, and industrial automation. However, one of the most challenging problems in segmentation is the segmentation of small objects. Small object segmentation is particularly difficult due to factors such as (i) the limited number of pixels representing small objects, (ii) class imbalance during training, and (iii) the inherent challenges posed by small object representations. These factors hinder the performance of deep learning architectures, making it harder for modern techniques to accurately handle small objects.

Year of completion:	March 2025
Advisor :	Jawahar C V
