
Efficient Annotation of Objects for Video Analysis


Swetha Sirnam (Home Page)

Abstract

The field of computer vision is rapidly expanding and has significantly more processing power and memory available today than in previous decades. Video has become one of the most popular visual media for communication and entertainment. In particular, automatic analysis and understanding of the content of a video is one of the long-standing goals of computer vision. One of the fundamental problems is to model the appearance and behavior of the objects in videos. Such models mainly depend on the problem definition: typically, a change in the problem statement is followed by changes in the annotation and its complexity. Creating large-scale datasets in this scenario through manual annotation is monotonous, time-consuming and non-scalable. Even though the vision community has advanced in problem solving, data generation and annotation remain difficult: annotation is expensive, tedious and involves a lot of human effort, and even after annotation it is essential to validate the goodness of the annotations, which is again a tiresome process. To address this challenge and move towards practical, large-scale annotated video datasets, we investigate methods to autonomously learn and adapt object models using temporal information in videos, which involves learning robust representations of the video. The aim of this thesis is two-fold: first, to propose solutions for efficient and accurate object annotation in video sequences, and second, to raise awareness in the community of the importance of the annotation problem and the attention it deserves.

As our first contribution, we propose an efficient, scalable and accurate object bounding-box annotation method for large-scale, complex video datasets. We focus on minimizing the annotation effort while simultaneously increasing the annotation propagation accuracy, so as to obtain a precise and tight bounding box around the object of interest. Using a self-training approach, we combine a semi-automatic initialization method with an energy minimization framework to propagate the annotations; using an energy minimization system for segmentation yields accurate and tight bounding boxes around the object. We validate the results quantitatively and qualitatively on publicly available datasets.

In the second half, we propose an annotation scheme for human pose in video sequences. The proposed model is based on a fully automatic initialization from any generic state-of-the-art method. This initialization is, however, prone to error due to the challenges inherent in video data, so we exploit the redundant information available in videos and build the model on the temporal smoothness assumption. We formulate the problem as a sequence-to-sequence learning task, in which a Long Short-Term Memory (LSTM) encoder-decoder architecture encodes the temporal context and annotates the pose. We show results on state-of-the-art datasets.
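Below is a minimal sketch of the kind of LSTM encoder-decoder described above, applied to refining noisy per-frame pose estimates. The keypoint format (17 joints with 2D coordinates), the layer sizes and the loss shown are illustrative assumptions, not the exact configuration used in the thesis.

    # Minimal sketch of an LSTM encoder-decoder for refining noisy pose sequences.
    # The keypoint format (17 joints x 2 coordinates) and layer sizes are
    # illustrative assumptions, not taken from the thesis.
    import torch
    import torch.nn as nn

    class PoseSeq2Seq(nn.Module):
        def __init__(self, num_joints=17, hidden=256):
            super().__init__()
            dim = num_joints * 2                      # flattened (x, y) per joint
            self.encoder = nn.LSTM(dim, hidden, batch_first=True)
            self.decoder = nn.LSTM(dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, dim)

        def forward(self, noisy_poses):
            # noisy_poses: (batch, time, num_joints * 2) initial per-frame estimates
            _, state = self.encoder(noisy_poses)      # summarize temporal context
            dec_out, _ = self.decoder(noisy_poses, state)
            return self.out(dec_out)                  # corrected pose per frame

    model = PoseSeq2Seq()
    noisy = torch.randn(4, 30, 34)                    # 4 clips, 30 frames each
    corrected = model(noisy)
    loss = nn.functional.mse_loss(corrected, noisy)   # train against ground truth instead

At training time the regression target would be the ground-truth poses rather than the noisy input; the loss line here only illustrates the tensor shapes involved.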


Year of completion: June 2018
Advisor: C. V. Jawahar, Vineeth N. Balasubramanian

Related Publications

  • Sirnam Swetha, Vineeth N. Balasubramanian and C. V. Jawahar - Sequence-to-Sequence Learning for Human Pose Correction in Videos, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017. [PDF]

  • Sirnam Swetha, Anand Mishra, Guruprasad M. Hegde and C. V. Jawahar - Efficient Object Annotation for Surveillance and Automotive Applications - Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshop (WACVW 2016), March 7-9, 2016. [PDF]

  • Rajat Aggarwal, Sirnam Swetha, Anoop M. Namboodiri, Jayanthi Sivaswamy and C. V. Jawahar - Online Handwriting Recognition using Depth Sensors, Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France. [PDF]


Downloads

thesis

Landmark Detection in Retinal Images


Gaurav Mittal (Home Page)

Abstract

Advances in the medical field and in imaging systems have resulted in a series of devices that sense, record, transform and process digital data. In the case of the human eye, this digital data takes the form of fundus images, which are images of the back of the eye showing the retina. Automatic analysis of these images is required to process large amounts of data and to help doctors make the final diagnosis. Retinal images have three major visible landmarks: the optic disk (OD), the macula and the blood vessels. In retinal images, the OD appears as a bright elliptical structure, the macula as a small dark region, and the blood vessels as dark tree-branch-like structures. In this thesis, we propose methods for detecting these retinal landmarks. Accurate detection of the OD and macula is important, as computer-assisted diagnosis systems use the locations of these landmarks to understand the retinal image and exploit clinical facts about the retina for improving diagnosis. Retinal landmark detection also aids in assessing the severity of diseases based on the locations of abnormalities relative to these landmarks.

We first use a retina atlas for OD and macula detection. The idea of a retina atlas is inspired by the 3D brain atlas [34]. We create two retina atlases, an intensity atlas and a probability atlas, by annotating public datasets locally. We use the probabilistic atlas for OD and macula detection, but the detection rates and accuracy of this system are low. To achieve better detection, we then use Generalized Motion Patterns (GMP) [14][23] for OD and macula detection. The GMP is derived by inducing motion to an image, which serves to smooth out unwanted information while highlighting the structures of interest. Our GMP-based detection is fully unsupervised, and its results outperform all other unsupervised methods while being comparable to those of supervised methods. The proposed GMP-based system is completely parallelizable and handles illumination differences efficiently.

Blood vessels are another important retinal landmark, and we find that current research uses evaluation measures such as sensitivity, specificity, accuracy, area under the curve and the Matthews correlation coefficient to evaluate vessel segmentation performance. We identify several gaps in these evaluation measures and propose local accuracy, an extension of [39]. We show that local accuracy is especially useful in settings where segmentation of weak vessels and accurate estimation of vessel width are required.
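As a rough illustration of the motion-induced smoothing idea behind GMP described above, the sketch below shifts a grayscale fundus image along a direction and coalesces the shifted copies with a max operation. The choice of translational motion, the step sizes and the max operator are assumptions made for illustration, not the exact formulation from the thesis or [14][23].

    # Rough sketch of inducing "motion" in an image and coalescing the moved copies,
    # in the spirit of the Generalized Motion Pattern idea. Translational motion and
    # a max coalescing function are illustrative assumptions.
    import numpy as np

    def motion_pattern(gray, direction, steps=10, step_px=2):
        dy, dx = direction
        moved = [np.roll(np.roll(gray, i * step_px * dy, axis=0),
                         i * step_px * dx, axis=1) for i in range(steps)]
        # Max over the moved copies smears fine clutter while keeping bright,
        # compact structures (such as an OD-like region) prominent.
        return np.max(np.stack(moved), axis=0)

    # Usage: load a grayscale fundus image as a float array `img`, then e.g.
    # pattern = motion_pattern(img, direction=(0, 1))   # horizontal motion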


Year of completion: June 2018
Advisor: Jayanthi Sivaswamy

Related Publications

  • Gaurav Mittal, Jayanthi Sivaswamy - Optic Disk and Macula Detection from Retinal Images using Generalized Motion Pattern, Proceedings of the Fifth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2015), 16-19 Dec 2015, Patna, India. [PDF]


Downloads

thesis

Multimodal Emotion Recognition from Advertisements with Application to Computational Advertising


Abhinav Shukla (Home Page)

Abstract

Advertisements (ads) are often filled with strong affective content covering a gamut of emotions intended to capture viewer attention and convey an effective message. However, most approaches to computationally analyze the emotion present in ads are based on the text modality, and only a limited amount of work has been done on affective understanding of advertisement videos from the content and user-centric perspectives. This work attempts to bring together recent advances in deep learning (especially in the domain of visual recognition) and affective computing, and use them to perform affect estimation of advertisements. We first create a dataset of 100 ads which are annotated by 5 experts and are evenly distributed over the valence-arousal plane. We then perform content-based affect recognition via a transfer learning based approach to estimate the affective content in this dataset, using prior affective knowledge gained from a large annotated movie dataset. We employ both visual features from video frames and audio features from spectrograms to train our deep neural networks. This approach vastly outperforms the existing benchmark.

It is also very interesting to see how human physiological signals, such as those captured by Electroencephalography (EEG), are able to provide useful affective insights into the content from a user-centric perspective. Using this time series of the electrical activity of the brain, we train models which are able to classify the emotional dimensions of the data. This also enables us to benchmark the user-centric performance and compare it to the content-centric deep learning based models; we find that the user-centric models outperform the content-centric models and set the state of the art in ad affect recognition. We also combine the two kinds of modalities (audiovisual and EEG) using decision fusion and find that the fusion performance is greater than that of either single modality, which shows that human physiological signals and the audiovisual content contain complementary affective information. We further use multi-task learning (MTL) on top of the features of each kind to exploit the intrinsic relatedness of the data and boost performance.

Lastly, we validate the hypothesis that better affect estimation can enhance a real-world application by supplying the affective values computed by our methods to a computational advertising framework to obtain a video program sequence with ads inserted at emotionally relevant points, determined based on the affective relevance between the program content and the ads. Multiple user studies find that our methods significantly outperform the existing algorithms and come very close to (and sometimes exceed) human-level performance. We achieve much more emotionally relevant and non-disruptive advertisement insertion into a program video stream.
In summary, this work (1) compiles an affective ad dataset capable of evoking coherent emotions across users; (2) explores the efficacy of content-centric convolutional neural network (CNN) features for affect recognition (AR), finding that CNN features outperform low-level audio-visual descriptors; (3) studies user-centric ad affect recognition from Electroencephalogram (EEG) responses acquired while viewing the content (with conventional classifiers as well as a novel CNN architecture for EEG), which outperforms content descriptors; (4) examines a multi-task learning framework based on CNN and EEG features which provides state-of-the-art AR from ads; and (5) demonstrates how better affect predictions facilitate more effective computational advertising in a real-world application.
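The decision-level fusion mentioned above can be illustrated with a toy sketch that averages the per-class posteriors of the two unimodal classifiers; the weighting and the example probabilities below are purely illustrative assumptions, not the exact fusion rule used in this work.

    # Toy illustration of late (decision-level) fusion of two unimodal classifiers,
    # as in the audiovisual + EEG combination described above. The weight and the
    # example posteriors are illustrative assumptions.
    import numpy as np

    def fuse(p_audiovisual, p_eeg, w=0.5):
        """Weighted average of per-class posteriors from the two modalities."""
        return w * np.asarray(p_audiovisual) + (1 - w) * np.asarray(p_eeg)

    # e.g. binary high/low valence posteriors from each modality for one ad clip
    p_av, p_eeg = [0.35, 0.65], [0.20, 0.80]
    print(fuse(p_av, p_eeg, w=0.4))     # fused posterior; argmax gives the final label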


Year of completion: June 2018
Advisor: Ramanathan Subramanian

Related Publications

  • Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli and Ramanathan Subramanian - Evaluating Content-centric vs User-centric Ad Affect Recognition, In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI '17). [PDF]

  • Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli and Ramanathan Subramanian - Affect Recognition in Ads with Application to Computational Advertising, In Proceedings of the 25th ACM International Conference on Multimedia (ACM MM '17). [PDF]


Downloads

thesis

Unconstrained Arabic & Urdu Text Recognition using Deep CNN-RNN Hybrid Networks


Mohit Jain (Home Page)

Abstract

We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, typically known as Urdu OCR, and Arabic text embedded in videos and natural scenes. When dealing with low-resource languages like Arabic and Urdu, a major adversary in developing a robust recognizer is the lack of a large quantity of annotated data. We overcome this problem by synthesizing millions of images from a large vocabulary of words and phrases scraped from Wikipedia’s Arabic and Urdu versions, using a wide variety of fonts downloaded from various online resources.

Building robust recognizers for Arabic and Urdu text has always been a challenging task. Though a lot of research has been done in the field of text recognition, the focus of the vision community has been primarily on English. While Arabic script has started to receive some attention as far as text recognition is concerned, work on other languages which use the Nabataean family of scripts, such as Urdu and Persian, is very limited. Moreover, the works presented in this field generally lack a standardized structure, making it hard to reproduce and verify the claims or results. This is quite surprising considering that Arabic is the fifth most spoken language in the world after Chinese, English, Spanish and Hindi, catering to 4.7% of the world’s population, while Urdu has over 100 million speakers and is spoken widely in Pakistan, where it is the national language, and in India, where it is recognized as one of the 22 official languages.

In this thesis, we introduce the problems associated with text recognition for low-resource languages, namely Arabic and Urdu, in various scenarios. We propose a language-independent hybrid CNN-RNN architecture which can be trained in an end-to-end fashion and demonstrate its superiority over plain RNN-based methods. Moreover, we dive deeper into the working of its convolutional layers and verify the robustness of the convolutional features through layer visualizations. We also propose a method to synthesize artificial text images to do away with the need for annotating large amounts of training data. We outperform previous state-of-the-art methods on existing benchmarks by a significant margin and release two new benchmark datasets, for Arabic scene text and Urdu printed text recognition, to instill interest among fellow researchers in the field.
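A minimal sketch of a CNN-RNN hybrid line recognizer of the kind described above is shown below: a convolutional front-end feeds a bidirectional LSTM whose per-timestep outputs are intended for a CTC loss. The layer sizes, the 32-pixel line height and the alphabet size are illustrative assumptions rather than the exact configuration from the thesis.

    # Minimal sketch of a CNN-RNN hybrid text-line recognizer trained with CTC.
    # Layer sizes, line height (32 px) and alphabet size are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        def __init__(self, num_classes, hidden=256):
            super().__init__()
            self.cnn = nn.Sequential(                    # grayscale line image, height 32
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            )
            self.rnn = nn.LSTM(256 * 4, hidden, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(2 * hidden, num_classes)  # num_classes includes CTC blank

        def forward(self, x):                             # x: (batch, 1, 32, width)
            f = self.cnn(x)                               # (batch, 256, 4, width/4)
            f = f.permute(0, 3, 1, 2).flatten(2)          # (batch, width/4, 256*4)
            out, _ = self.rnn(f)
            return self.fc(out).log_softmax(-1)           # per-timestep scores for CTC

    model = CRNN(num_classes=100)
    logits = model(torch.randn(2, 1, 32, 128))            # -> (2, 32, 100)
    # transpose to (T, N, C) before passing to nn.CTCLoss during training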

 

Year of completion: June 2018
Advisor: Prof. C. V. Jawahar

Related Publications

  • Mohit Jain, Minesh Mathew and C. V. Jawahar - Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017. [PDF]

  • Minesh Mathew, Mohit Jain and C. V. Jawahar - Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam, 6th International Workshop on Multilingual OCR, Kyoto, Japan, 2017. [PDF]

  • Mohit Jain, Minesh Mathew and C. V. Jawahar - Unconstrained Scene Text and Video Text Recognition for Arabic Script, 1st International Workshop on Arabic Script Analysis and Recognition (ASAR 2017), Nancy, France, 2017. [PDF]


Downloads

thesis

Studies in Recognition of Telugu Document Images


Venkat Rasagna (homepage)

Abstract

The rapid evolution of information technology (IT) has prompted a massive growth in the digitization of books. Accessing these huge digital collections requires solutions that make the archived materials searchable, and such solutions can only come from research in document image understanding. In the last three decades, many significant developments have been made in the recognition of Latin-based scripts, but recognition systems for Indian languages lag far behind recognizers for European languages such as English. The diversity of archived printed documents poses an additional challenge to document analysis and understanding. In this work, we explore the recognition of printed text in Telugu, a south Indian language.

We begin by building the Telugu script model for recognition and adapting an existing optical character recognition system for it. A comprehensive study of all the modules of the optical recognizer is carried out, with the focus mainly on the recognition module, which we evaluate on synthetic and real datasets. We achieve an accuracy of 98% on the synthetic dataset, but the accuracy drops to 91% on 200 pages from scanned books (the real dataset). To analyze the drop in accuracy and identify the modules propagating errors, we create datasets of different qualities, namely a laser-print dataset, a good-quality real dataset and a challenging real dataset. Analysis of these experiments reveals the major problems in the character recognition module. We observe that the recognizer is not robust enough to tackle the multifont problem: the classifier’s component accuracy varies significantly across pages from different books, and there is a huge gap between the component and word accuracies. Even with a component accuracy of 91%, the word accuracy is just 62%. This motivates us to solve the multifont problem and improve the word accuracies; solving these problems would boost the OCR accuracy of any language.

A major requirement in the design of robust OCRs is the invariance of the feature extraction scheme across the popular fonts used in print. Many statistical and structural features have been tried for character classification in the past. In this work, motivated by recent successes in the object category recognition literature, we use a spatial extension of the histogram of oriented gradients (HOG) for character classification. We conduct experiments on 1.46 million Telugu character samples spanning 359 classes and 15 fonts, and on this dataset we obtain an accuracy of 96-98% with an SVM classifier. A typical optical character recognizer (OCR) uses only local information about a particular character or word to recognize it. In this thesis, we also propose a document-level OCR which exploits the fact that multiple occurrences of the same word image should be recognized as the same word; whenever the OCR output differs for the same word, it must be due to recognition errors. We propose a method to identify such recognition errors and automatically correct them. First, multiple instances of the same word image are clustered using a fast clustering algorithm based on locality sensitive hashing. Three different techniques are then proposed to correct the OCR errors by looking at the differences in the OCR output for the words in a cluster: character majority voting, an alignment technique based on dynamic time warping, and one based on progressive alignment of multiple sequences. We demonstrate the approach on hundreds of document images from English and Telugu books by correcting the output of the best-performing OCRs for English and Telugu. The recognition accuracy at the word level improves from 93% to 97% for English and from 58% to 66% for Telugu. Our approach is applicable to documents in any language or script.
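As a toy illustration of the character majority voting scheme mentioned above, the sketch below picks the most frequent character at each position across several OCR hypotheses for the same clustered word image. It assumes the hypotheses are already aligned to equal length, which the thesis handles with DTW and progressive alignment in the general case.

    # Toy sketch of character-level majority voting over multiple OCR outputs of the
    # same clustered word image; equal-length, pre-aligned hypotheses are assumed.
    from collections import Counter

    def majority_vote(hypotheses):
        """Pick the most frequent character at each position across the hypotheses."""
        return ''.join(Counter(chars).most_common(1)[0][0] for chars in zip(*hypotheses))

    print(majority_vote(["telugu", "te1ugu", "telugn"]))   # -> "telugu"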

 

Year of completion: May 2013
Advisor: Prof. C. V. Jawahar

Related Publications


Downloads

thesis

 ppt
