Thesis Students

Deploying Multi Camera Multi Player Detection and Tracking Systems in Top View

Swetanjal Murati Dutta

Abstract

Object Detection and Tracking computer vision algorithms have remarkably progressed in the past few years, owing to the rapid progress in the field of deep learning. These algorithms find numerous applications in surveillance, robotics, autonomous driving, sports analytics, and human-computer interaction. They achieve near-perfect performance on the benchmark datasets they have been evaluated on. However, deploying such algorithms in real-world poses a large number of challenges, and they do not work as well as we expect them to work by looking at their performance on benchmark datasets. In this thesis, we present details of deploying one such Multi Camera Multi-Object Detection and Tracking system to track players in the bird’s eye (top) view in a sports event, specifically in the game of cricket. Our system is able to generate the top-view map of the fielders from cameras placed on crowd stands of a stadium, making it extremely cost-effective compared to using spider cameras. We present details on the challenges and hurdles we faced while deploying our systems and the solutions we designed to overcome these challenges. Ultimately, we tailored a neatly engineered end-to-end system that went live in the Asia Cup of 2022. Deploying such a multi-camera detection and tracking system poses many challenges. The first of them is related to camera placement. We had to devise strategies to place cameras optimally so that all the objects of interest could be captured by one or more of the cameras placed, at the same time ensuring that they do not appear so small that it becomes difficult for a detector to localize them. Constructing the bird’s eye view of the detected players required camera calibration. In the real world, we may not find reliable point correspondences to calibrate a camera. Even if we might find point correspondences, sometimes they may be really hard to see to accurately calibrate a camera. Detecting players in a setup like this proved to be really challenging because the objects of interest appeared very small in the camera views. Moreover, since the detections were coming from multiple cameras, algorithms had to be devised to associate them correctly across camera views. Tracking in a setting like this was even harder because we had to rely only on motion cues to track the objects of interest. Appearance features did not add any value because of the fact all the players being tracked wore jerseys of the same color. Each of these challenges became even more, harder to solve because of the real-time requirements of our use case. Finally, setting up the hardware architecture to receive the real-time live feeds from each camera in a synchronized manner and implementing a fast communication protocol for transmitting data between the various system components required careful design choices to be made, all of which have been presented in detail in this thesis.

Year of completion:	November 2023
Advisor :	Vineet Gandhi

Related Publications

Downloads

Towards building controllable Text to Speech systems

Saiteja Kosgi

Abstract

Text-to-speech systems convert any given text to speech. They play a vital role in making Humancomputer interaction (HCI) possible. As humans, we don't just rely on text (language) to communicate; we use many other mechanisms like voice, gestures, expressions, etc., to communicate efficiently. In natural language processing, vocabulary and grammar tend to take center stage, but those elements of speech only tell half the story. Affective prosody of speech provides larger context and gives meaning to words, and keeps listeners engaged. Current HCI systems largely communicate in text, and they lack a lot of prosodic information, which is crucial in a conversation. To make the HCI systems communicate in speech, text to speech systems should be able to synthesize speech that is expressive and controllable. But the existing text to speech systems learn the average variation in the dataset it’s trained on, which synthesizes samples in a neutral way without many prosodic variations. To this end, we develop a textto-speech system that can synthesize the given emotion where the emotion is represented as a tuple of Arousal, Valance and Dominance (AVD) values. Text to speech systems have a lot of complexities. Training such a system requires the data to be very clear, noiseless, and collecting such data is difficult. If the data is noisy, it will reflect unnecessary artifacts in the synthesized samples. Training emotion based text to speech models is considerably more difficult and not strait forward. The fact that obtaining emotion annotated data for the desired speaker is costly and very subjective makes it a cumbersome task. Current emotion based systems can synthesize emotions with some limitations. (1) Emotion controllability comes at the cost of loss in quality, (2) Have discreet emotions which lack the finer control, and (3) cannot be generalized to new speakers without the annotated emotion data. We propose a system that overcomes the above-mentioned problems by leveraging the largely available corpus of noisy speech annotated with emotions. Even though the data is noisy, our technique trains an emotion based text to speech system that can synthesize desired emotion without any loss of quality in the output. We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate variances/features (pitch, energy, and duration) as levers. We learn how the variances change with respect to emotion. We bring the finer control in the synthesized speech by using AVD values, which can represent emotions in a 3D space. Our proposed method also doesn’t require emotion annotated data for the target speaker. Once trained on the emotion annotated data, it can be applied to any system which has the prediction of the variances as an intermediate step. vi vii With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation of using individual sentences for a complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted ”human touch” in machine dialogue.

Year of completion:	May 2023
Advisor :	Vineet Gandhi

Related Publications

Downloads

Effective and Efficient Attribute-aware Open-set Face Verification

Arun Kumar Subramanian

Abstract

While face recognition and verification in controlled settings is already a solved problem for machines, the uniqueness of face as a biometric is that the mode of capture is highly diverse. A face could be captured nearby or at distance, at different poses, with different lighting, and by different devices. Face recognition/verification has several challenges to overcome to effectively perform under these varying conditions. Most current methods, try to find salient features of an individual by ignoring these variations. This can be looked at from the paradigm of signal and noise. The signal here refers to that information that is unique to an individual, but not varying as per the condition. Noise represents those aspects that are not related to the identity itself and are influenced by the capture mechanism, physical setting, etc. This is usually done through metric learning approaches in addition to the use of loss functions such as cross-entropy (e.g., Siamese networks, angular loss, and other margin losses such as ArcFace). There are certain aspects that lie between signal and noise such as facial attributes (such as eyeglasses). These may or may not be unique to the individual subject, but introduces artifacts into the face image. The question then arises, why can’t these variations be detected using learning methods, and the knowledge thus attained about the variations be put to good use during the matching process? It is this curiosity that has resulted in aggregation strategies for matching, which were previously implemented for aspects such as pose, age, etc. However, in the wild, humans demonstrate significant variability in facial attributes such as facial hair, eyeglasses, hairstyles, and make-up. This is common as one of the primary mechanisms of face image acquisition is covert capture in public (with ethics of consent in place), where people usually display significant variability in facial attributes. Hence it is very important to address this variability during the matching process. This work attempts to do the same. The curious question that arises however is if indeed matching performance varies if the attribute prior is known. Even if it does, how does one conceptualize a system that exploits the same? It is here that this thesis proposed two frameworks. One of the configuration-specific operating points and the other involves suppression of attribute information in face embedding prior to matching. The attribute suppression is attempted both directly at the final embedding, and suppression of intermediary layers of a Vision Transformer Deep Neural network. Both of these require the facial attribute of each image to be detected prior to passing the images into the proposed framework for matching. The above naturally adds another task to the face verification pipeline. It is therefore extremely necessary to find efficient and effective ways of performing face attribute detection (and face template generation), since efficiently performing parts, mitigates the pipeline expansion overhead and makes this a viable pipeline to consider for face verification. We observe that face attribute detection usually employs end-to-end networks, which results in a lot of parameters for inference. A feasible alternative is to constantly leverage the SOTA (state-of-the-art) face recognition networks and use the earlier feature layers to perform the face attribute classification task. Since the highly accurate SOTA is currently DNNs (Deep Neural Networks) for face, the same is dealt with in this thesis. More narrowly, we focus on open-set face verification, where DNNs aim to find unique representation even for subjects not used for training the DNN.

Year of completion:	June 2023
Advisor :	Anoop M Namboodiri

Related Publications

Downloads

Face Reenactment: Crafting Realistic Talking Heads for Enhanced Video Communication and Beyond

Madhav Agarwal

Abstract

Face Reenactment and Synthetic Talking Head works have been widely popular for creating realistic face animations by using a single image of a person. In light of the recent developments in processing facial features in images and videos, as well as the ability to create realistic talking heads, We are focusing on two promising applications. These applications include utilizing face reenactment for movie dubbing and compressing video calls where the primary object is a talking face. We propose a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints. We use audio as an additional input for high-quality lip sync, by helping the network to attend to the mouth region. We use additional priors using face segmentation and face mesh to preserve the structure of the reconstructed faces. Finally, we incorporate a carefully designed identity- aware generator module to get realistic quality of talking heads. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperforming the current techniques both qualitative and quantitatively. Our work opens up several applications, including enabling low-bandwidth video calls and movie dubbing. We leverage the advancements in talking head generation to propose an end-to-end system for video call compression. Our algorithm transmits pivot frames intermittently while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect keypoints in the non-pivot frames and transmit them to the receiver. A dense flow is then calculated to warp a pivot frame to reconstruct the non-pivot ones. Transmitting keypoints instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame-interpolater at the receiver’s end to improve the compression levels further. Finally, a face enhancement network improves reconstruction quality, significantly improving several aspects, like the sharpness of the generations. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques

Year of completion:	June 2023
Advisor :	C V Jawahar, Vinay P Namboodiri

Related Publications

Downloads

Fingerprint Disentanglement for Presentation Attack Generalization Across Sensors and Materials

Gowri Lekshmy

Abstract

In today's digital era, biometric authentication has become increasingly widespread for verifying a user across a range of applications, from unlocking a smartphone to securing high-end systems. Various biometric modalities such as fingerprint, face, and iris offer a distinct way to recognize a person automatically. Fingerprints are one of the most prevalent biometric modalities. They are widely utilized in security systems owing to their remarkable reliability, distinctiveness, invariance over time and user convenience. Nowadays, automatic fingerprint recognition systems have become a prime target for attackers. Attackers fabricate fingerprints using materials like playdoh and gelatin, making it hard to distinguish them from live fingerprints. This way of circumventing biometric systems is called a presentation attack (PA). To identify such attacks, a PA detector is added to these systems. Deep learning-based PA detectors require large amounts of data to distinguish PA fingerprints from live ones. However, there exists significantly less training data with novel sensors and materials. Due to this, PA detectors do not generalize well on introducing unknown sensors or materials. It is incredibly challenging to physically fabricate an extensive train dataset of high-quality counterfeit fingerprints generated with novel materials captured across multiple sensors. Existing fingerprint presentation attack detection (FPAD) solutions improve cross-sensor and cross-material generalization by utilizing styletransfer-based augmentation wrappers over a two-class PA classifier. These solutions generate large artificial datasets for training by using style transfer which learns the style properties from a few samples obtained from the attacker. They synthesize data by learning the style as a single entity, containing both sensor and material characteristics. However, these strategies necessitate learning the entire style upon adding a new sensor for an already known material or vice versa. This thesis proposes a decomposition-based approach to improve cross-sensor and cross-material FPAD generalization. We model presentation attacks as a combination of two underlying components, i.e., material and sensor, rather than the entire style. By utilizing this approach, our method can generate synthetic patches upon introducing either a new sensor, a new material, or both. We perform two different methods of fingerprint factorization - traditional and deep-learning based. Traditional factorization of fingerprints into sensor and material representations using tensor decomposition establishes a baseline using machine learning for our hypothesis. The deep-learning method uses a decompositionbased augmentation wrapper for disentangling fingerprint style. The wrapper improves cross-sensor and cross-material FPAD, utilizing one fingerprint image of the target sensor and material. We also reduce vi vii computational complexity by generating compact representations and utilizing lesser combinations of sensors and materials to produce several styles. Our approach enables us to generate a large variety of samples using a limited amount of data, which helps improve generalization

Year of completion:	June 2023
Advisor :	Anoop M Namboodiri

Deploying Multi Camera Multi Player Detection and Tracking Systems in Top View

Swetanjal Murati Dutta

Abstract

Related Publications

Downloads

Towards building controllable Text to Speech systems

Saiteja Kosgi

Abstract

Related Publications

Downloads

Effective and Efficient Attribute-aware Open-set Face Verification

Arun Kumar Subramanian

Abstract

Related Publications

Downloads

Face Reenactment: Crafting Realistic Talking Heads for Enhanced Video Communication and Beyond

Madhav Agarwal

Abstract

Related Publications

Downloads

Fingerprint Disentanglement for Presentation Attack Generalization Across Sensors and Materials

Gowri Lekshmy

Abstract

Related Publications

Downloads

More Articles …