Thesis Students

High-Quality 3D Fingerprint Generation: Merging Skin Optics, Machine Learning and 3D Reconstruction Techniques

Apoorva Srivastava

Abstract

Fingerprints are a widely recognized and commonly used means of identification. Contact-based capture, in which the finger is pressed against a surface to obtain images, is a popular way of acquiring them. However, this process has several drawbacks, including skin deformation, unhygienic conditions, and high sensitivity to the moisture content of the finger. These factors can negatively impact the accuracy of the fingerprint. Moreover, fingerprints are three-dimensional anatomical structures, and two-dimensional fingerprints do not capture the depth information of the finger ridges. While 3D fingerprint capture is less sensitive to skin moisture levels and avoids skin deformation, its adoption is limited by high cost and system complexity. Both are mainly attributed to the use of multiple cameras, projectors, and sometimes synchronously moving mechanical parts. Photometric stereo offers a promising way to build low-cost, simple sensors for high-quality 3D capture using only a single camera and a few LEDs. However, the method assumes that the surface being imaged is Lambertian, which is not the case for human fingers. Existing 3D fingerprint scanners based on photometric stereo also assume that the finger is Lambertian, resulting in poor reconstruction results. In this context, we introduce the Split and Knit algorithm (SnK), a 3D reconstruction pipeline based on photometric stereo for finger surfaces. The algorithm splits the reconstruction of the ridge-valley pattern from that of the finger shape and then combines the two to obtain, for the first time, a 3D fingerprint reconstruction of the full finger with a single camera.
To reconstruct the ridge-valley pattern, SnK introduces an efficient way of estimating the direct illumination component by using a trained U-Net without extra hardware, which reduces the non-Lambertian nature of the finger image and enables a higher-quality reconstruction of the entire finger surface. To obtain the finger shape using a single camera, the algorithm introduces two novel approaches: a) using IR illumination and b) using a mirror and parametric modeling of the finger shape. Finally, we combine the overall finger shape and the ridge-valley point cloud to obtain a 3D finger phalange. The high-quality 3D reconstruction results in better matching accuracy of the captured fingerprints. Splitting the ridge-valley pattern from the finger shape provides an implicit way to convert a 3D fingerprint into a 2D fingerprint, making the SnK algorithm compatible with 2D fingerprint recognition systems. To apply the SnK algorithm to fingerprints, we designed a 3D-printed photometric stereo-based setup that captures contactless finger images and obtains their 3D reconstructions.
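
The Lambertian assumption mentioned in the abstract can be illustrated with a minimal sketch of classical three-light photometric stereo for a single pixel. This is the textbook baseline, not the SnK pipeline itself; the function names (`solve3`, `lambertian_normal`) and the light directions used in the usage note are assumptions made for illustration only.

```python
from math import sqrt


def solve3(A, b):
    """Solve the 3x3 linear system A x = b with Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(A)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    x = []
    for col in range(3):
        # Replace one column of A with b and take the determinant ratio.
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = b[r]
        x.append(det(m) / d)
    return x


def lambertian_normal(light_dirs, intensities):
    """Recover (albedo, unit normal) for one pixel.

    Assumes the Lambertian model I_k = albedo * dot(l_k, n) with three
    known unit light directions l_k and no shadows or specular highlights.
    """
    g = solve3(light_dirs, intensities)  # g = albedo * n
    albedo = sqrt(sum(v * v for v in g))
    if albedo == 0.0:
        return 0.0, [0.0, 0.0, 0.0]  # completely dark pixel: normal undefined
    return albedo, [v / albedo for v in g]
```

For example, with the illustrative light directions `[[0, 0, 1], [0.6, 0, 0.8], [0, 0.6, 0.8]]` and intensities `[0.5, 0.4, 0.4]`, the sketch recovers an albedo of about 0.5 and a normal pointing along the z axis. Real finger skin violates this model, which is why the abstract stresses estimating the direct illumination component first.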

Year of completion:	August 2023
Advisor :	Anoop M Namboodiri

A Holistic Framework for Multimodal Ecosystem of Pictionary

Nikhil Bansal

Abstract

In AI, the ability of an intelligent agent to model human players in games such as Backgammon, Chess and Go has been an important metric for benchmarking progress. Fundamentally, the games mentioned above can be characterized as competitive and zero-sum. In contrast, games such as Pictionary and Dumb Charades fall into the category of ‘social’ games. Unlike competitive games, the emphasis is on cooperative and co-adaptive game-play in a relaxed setting. Such social games can form the basis for the next wave of game-driven progress in AI. Pictionary™ is a wonderful example of cooperative game play to achieve a shared goal in communication-restricted settings. We employ this popular sketch-based guessing game as a use case for analyzing such shared-goal cooperative game play. To enable the study of Pictionary and to understand various aspects associated with the game play, we designed a software ecosystem for a web-based online game of Pictionary, dubbed PICTGUESS. To overcome several technological and logistic barriers that the actual game presents, we implemented a simplified setting for PICTGUESS wherein a game consists of a time-limited episode involving two players: a Drawer and a Guesser. The Drawer is tasked with conveying a given target phrase to a counterpart Guesser by sketching on a whiteboard within that time limit. However, some players in Pictionary occasionally draw atypical sketch content. While such content is sometimes relevant in the game context, it can also represent a rule violation and impair the game experience. To address such situations in a timely and scalable manner, we introduce DRAWMON, a novel distributed framework for automatic detection of atypical sketch content in concurrently occurring Pictionary game sessions. We build specialized online interfaces to annotate atypical sketch content, resulting in ATYPICT, the first-ever atypical sketch content dataset.
We use ATYPICT to train CANVASNET, a deep neural atypical content detection network, which serves as a core component of DRAWMON. Our analysis of post-deployment game session data indicates DRAWMON’s effectiveness for scalable monitoring and atypical sketch content detection. Beyond Pictionary, our contributions can also serve as a design guide for customized atypical content response systems involving shared and interactive whiteboards.

Year of completion:	September 2023
Advisor :	Ravi Kiran Sarvadevabhatla

Deploying Multi Camera Multi Player Detection and Tracking Systems in Top View

Swetanjal Murati Dutta

Abstract

Object detection and tracking algorithms in computer vision have progressed remarkably in the past few years, owing to the rapid progress in the field of deep learning. These algorithms find numerous applications in surveillance, robotics, autonomous driving, sports analytics, and human-computer interaction. They achieve near-perfect performance on the benchmark datasets they have been evaluated on. However, deploying such algorithms in the real world poses a large number of challenges, and they do not work as well as their benchmark performance would suggest. In this thesis, we present details of deploying one such Multi Camera Multi-Object Detection and Tracking system to track players in the bird’s eye (top) view in a sports event, specifically in the game of cricket. Our system is able to generate the top-view map of the fielders from cameras placed on crowd stands of a stadium, making it extremely cost-effective compared to using spider cameras. We present details on the challenges and hurdles we faced while deploying our system and the solutions we designed to overcome them. Ultimately, we tailored a neatly engineered end-to-end system that went live in the Asia Cup of 2022. Deploying such a multi-camera detection and tracking system poses many challenges. The first of them is related to camera placement. We had to devise strategies to place cameras optimally so that all the objects of interest could be captured by one or more of the cameras, while ensuring that they do not appear so small that it becomes difficult for a detector to localize them. Constructing the bird’s eye view of the detected players required camera calibration. In the real world, we may not find reliable point correspondences to calibrate a camera. Even when point correspondences exist, they may be too hard to see for the camera to be calibrated accurately.
Detecting players in a setup like this proved to be really challenging because the objects of interest appeared very small in the camera views. Moreover, since the detections were coming from multiple cameras, algorithms had to be devised to associate them correctly across camera views. Tracking in a setting like this was even harder because we had to rely only on motion cues to track the objects of interest. Appearance features did not add any value because all the players being tracked wore jerseys of the same color. Each of these challenges became even harder to solve because of the real-time requirements of our use case. Finally, setting up the hardware architecture to receive the real-time live feeds from each camera in a synchronized manner and implementing a fast communication protocol for transmitting data between the various system components required careful design choices, all of which are presented in detail in this thesis.
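
The step from calibrated camera views to a top-view map can be sketched as follows, assuming each camera has already been calibrated to a 3x3 ground-plane homography. The function names (`to_top_view`, `merge_detections`) and the greedy distance-based merging are illustrative assumptions, a simple stand-in rather than the association algorithm used in the thesis.

```python
from math import hypot


def to_top_view(H, point):
    """Map an image point (u, v) to ground-plane coordinates.

    H is a 3x3 homography (list of rows) assumed to come from calibration.
    """
    u, v = point
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return (x / w, y / w)


def merge_detections(points, radius):
    """Greedily merge top-view points that fall within `radius` of a cluster.

    Each cluster is represented by the running mean of its members, so two
    cameras seeing the same player yield a single averaged position.
    """
    clusters = []  # each entry: [sum_x, sum_y, count]
    for x, y in points:
        for c in clusters:
            cx, cy = c[0] / c[2], c[1] / c[2]
            if hypot(x - cx, y - cy) <= radius:
                c[0] += x
                c[1] += y
                c[2] += 1
                break
        else:
            clusters.append([x, y, 1])
    return [(c[0] / c[2], c[1] / c[2]) for c in clusters]
```

In use, every camera's detections would first be projected with that camera's own homography, and the combined list of ground-plane points would then be passed to `merge_detections`. The merge radius is a tuning parameter that trades duplicate players against wrongly fused neighbors.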

Year of completion:	November 2023
Advisor :	Vineet Gandhi

Towards building controllable Text to Speech systems

Saiteja Kosgi

Abstract

Text-to-speech systems convert any given text to speech. They play a vital role in making human-computer interaction (HCI) possible. As humans, we don't just rely on text (language) to communicate; we use many other mechanisms like voice, gestures, expressions, etc., to communicate efficiently. In natural language processing, vocabulary and grammar tend to take center stage, but those elements of speech only tell half the story. The affective prosody of speech provides larger context, gives meaning to words, and keeps listeners engaged. Current HCI systems largely communicate in text and lack much of the prosodic information that is crucial in a conversation. For HCI systems to communicate in speech, text-to-speech systems should be able to synthesize speech that is expressive and controllable. But existing text-to-speech systems learn the average variation in the dataset they are trained on, so they synthesize samples in a neutral way without many prosodic variations. To this end, we develop a text-to-speech system that can synthesize a given emotion, where the emotion is represented as a tuple of Arousal, Valence and Dominance (AVD) values. Text-to-speech systems have a lot of complexities. Training such a system requires very clean, noiseless data, and collecting such data is difficult. If the data is noisy, unwanted artifacts appear in the synthesized samples. Training emotion-based text-to-speech models is considerably more difficult and not straightforward. Obtaining emotion-annotated data for the desired speaker is costly and very subjective, which makes it a cumbersome task. Current emotion-based systems can synthesize emotions with some limitations: (1) emotion controllability comes at the cost of quality, (2) they support only discrete emotions, which lack finer control, and (3) they cannot be generalized to new speakers without annotated emotion data.
We propose a system that overcomes the above-mentioned problems by leveraging the large available corpus of noisy speech annotated with emotions. Even though the data is noisy, our technique trains an emotion-based text-to-speech system that can synthesize the desired emotion without any loss of quality in the output. We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate variances/features (pitch, energy, and duration) as levers. We learn how the variances change with respect to emotion. We bring finer control to the synthesized speech by using AVD values, which can represent emotions in a 3D space. Our proposed method also doesn’t require emotion-annotated data for the target speaker. Once trained on the emotion-annotated data, it can be applied to any system that predicts the variances as an intermediate step. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation, which uses individual sentences, towards a more complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted "human touch" to machine dialogue.
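
The idea of using phoneme-level pitch, energy, and duration as levers can be sketched as below. In the thesis the relationship between emotion and these variances is learned from data; the hand-set linear coefficients, the assumed AVD range of [-1, 1], and the names `Phoneme`, `prosody_scales`, and `apply_emotion` are all hypothetical and serve only to show where such a mapping would plug in.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    symbol: str
    pitch_hz: float    # predicted pitch for this phoneme
    energy: float      # predicted energy for this phoneme
    duration_s: float  # predicted duration in seconds


def prosody_scales(arousal, valence, dominance):
    """Return (pitch, energy, duration) scale factors for an AVD tuple.

    Hypothetical hand-set coefficients; a real system would learn this
    mapping from emotion-annotated speech. AVD values assumed in [-1, 1].
    """
    pitch = 1.0 + 0.20 * arousal + 0.05 * valence
    energy = 1.0 + 0.25 * arousal + 0.10 * dominance
    duration = 1.0 - 0.15 * arousal
    return pitch, energy, duration


def apply_emotion(phonemes, avd):
    """Scale the neutral per-phoneme predictions towards the target emotion."""
    p, e, d = prosody_scales(*avd)
    return [
        Phoneme(ph.symbol, ph.pitch_hz * p, ph.energy * e, ph.duration_s * d)
        for ph in phonemes
    ]
```

Because the adjustment acts only on the intermediate variances, a sketch like this could sit between the variance predictor and the decoder of any TTS model that exposes those predictions, which mirrors the speaker-independence argument made in the abstract.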

Year of completion:	May 2023
Advisor :	Vineet Gandhi

Effective and Efficient Attribute-aware Open-set Face Verification

Arun Kumar Subramanian

Abstract

While face recognition and verification in controlled settings is already a solved problem for machines, the uniqueness of the face as a biometric is that the mode of capture is highly diverse. A face could be captured nearby or at a distance, at different poses, with different lighting, and by different devices. Face recognition/verification has several challenges to overcome to perform effectively under these varying conditions. Most current methods try to find salient features of an individual by ignoring these variations. This can be looked at from the paradigm of signal and noise. The signal here refers to the information that is unique to an individual and does not vary with the capture condition. Noise represents those aspects that are not related to the identity itself and are influenced by the capture mechanism, physical setting, etc. Separating the two is usually done through metric learning approaches in addition to the use of loss functions such as cross-entropy (e.g., Siamese networks, angular loss, and other margin losses such as ArcFace). Certain aspects lie between signal and noise, such as facial attributes (for example, eyeglasses). These may or may not be unique to the individual subject, but they introduce artifacts into the face image. The question then arises: why can’t these variations be detected using learning methods, and the knowledge thus attained about the variations be put to good use during the matching process? It is this curiosity that has resulted in aggregation strategies for matching, which were previously implemented for aspects such as pose, age, etc. However, in the wild, humans demonstrate significant variability in facial attributes such as facial hair, eyeglasses, hairstyles, and make-up. This is common because one of the primary mechanisms of face image acquisition is covert capture in public (with ethics of consent in place), where people usually display significant variability in facial attributes.
Hence it is very important to address this variability during the matching process, and this work attempts to do so. The question that arises, however, is whether matching performance indeed varies when the attribute prior is known. Even if it does, how does one conceptualize a system that exploits it? To this end, this thesis proposes two frameworks. One uses configuration-specific operating points, and the other involves suppression of attribute information in the face embedding prior to matching. The attribute suppression is attempted both directly at the final embedding and at the intermediary layers of a Vision Transformer deep neural network. Both frameworks require the facial attributes of each image to be detected before the images are passed into the proposed framework for matching. This naturally adds another task to the face verification pipeline. It is therefore necessary to find efficient and effective ways of performing face attribute detection (and face template generation), since efficient components mitigate the pipeline expansion overhead and make this a viable pipeline for face verification. We observe that face attribute detection usually employs end-to-end networks, which results in a lot of parameters for inference. A feasible alternative is to leverage the SOTA (state-of-the-art) face recognition networks and use their earlier feature layers to perform the face attribute classification task. Since the most accurate SOTA face models are currently DNNs (Deep Neural Networks), this thesis deals with them. More narrowly, we focus on open-set face verification, where DNNs aim to find unique representations even for subjects not used for training the DNN.
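
One way to read "configuration-specific operating points" is that the decision threshold depends on the detected attributes of the two images being compared. The sketch below assumes that reading; the threshold values, the attribute labels, and the names `THRESHOLDS`, `DEFAULT_THRESHOLD`, and `verify` are hypothetical, and cosine similarity is used only as a common choice of matching score.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Hypothetical operating points, keyed by the unordered pair of attributes.
# In practice each threshold would be tuned on validation data per configuration.
THRESHOLDS = {
    frozenset(["none"]): 0.50,
    frozenset(["none", "eyeglasses"]): 0.42,
    frozenset(["eyeglasses"]): 0.45,
}
DEFAULT_THRESHOLD = 0.50


def verify(emb_a, attr_a, emb_b, attr_b):
    """Return (is_match, score) using the threshold for this attribute pair."""
    threshold = THRESHOLDS.get(frozenset([attr_a, attr_b]), DEFAULT_THRESHOLD)
    score = cosine_similarity(emb_a, emb_b)
    return score >= threshold, score
```

With these illustrative numbers, the same similarity score can be rejected when neither image has eyeglasses yet accepted when one of them does, which is the kind of behavior an attribute-aware operating point is meant to allow.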

Year of completion:	June 2023
Advisor :	Anoop M Namboodiri
