Active Learning & its Applications


Priyam Bakliwal 

Abstract

Active learning, also known as query learning, is a sub-field of machine learning. It relies on the assumption that if the learning algorithm is allowed to choose the data from which it learns, it will perform better with less training. Active learning is predominantly used in areas where obtaining a large amount of annotated training data is infeasible or extremely expensive. Active learning models aim to overcome this annotation bottleneck by posing queries, in the form of unlabelled instances, to be labelled by a human. In this way, the framework aims to achieve high accuracy using far fewer labelled instances, thereby minimizing the annotation cost.

In the first part of our work, we propose an active learning based image annotation model. Automatic image annotation is the computer vision task of assigning a set of appropriate textual tags to a novel image, with the eventual aim of bridging the semantic gap between visual and textual representations. The advantages of the proposed model are: (a) it can output a variable number of tags per image, which improves accuracy; and (b) it effectively chooses the difficult samples that need to be manually annotated, thereby reducing the human annotation effort. Studies on the Corel and IAPR TC-12 datasets validate the effectiveness of this model.

In the second part of the thesis, we propose an active learning based solution for efficient, scalable and accurate annotation of objects in video sequences. We focus on reducing the human annotation effort while simultaneously increasing tracking accuracy, to obtain precise, tight bounding boxes around an object of interest. We use a novel combination of two different tracking algorithms to track an object through the whole video sequence, and propose a sampling strategy that selects the most informative frame for human annotation. This newly annotated frame is used to update the previous annotations; thus, through the collaborative efforts of the human and the system, we obtain accurate annotations with minimal effort. We have quantitatively and qualitatively validated the results on eight different datasets.

Active learning is effective on natural language documents as well. Multilingual processing tasks like statistical machine translation and cross-language information retrieval rely mainly on the availability of accurate parallel corpora. In the third section we propose a simple yet efficient method to generate a large amount of reasonably accurate parallel corpus using OCR, with minimal user effort. We show the performance of the proposed method on a manually aligned dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.

In the last section we use active learning for model updating in community question answering (cQA) systems. cQA platforms like Yahoo! Answers, Baidu Zhidao, Quora and StackOverflow enable experts to give precise and targeted answers to any question posted by a user. These sites form huge repositories of information in the form of questions and answers. Retrieving semantically relevant questions and answers from cQA forums has been an important research area for the past few years. Given the ever-growing nature of the data in cQA forums, these models cannot be kept stagnant; they need to be continuously updated so that they can adapt to the changing patterns of questions and answers over time. Such update procedures are expensive and time consuming.
We propose a novel topic model based active sampler named Picky. It intelligently selects a smaller subset of the newly added question-answer pairs to feed to the existing model for updating. Evaluations on real-life cQA datasets show that our approach converges at a faster rate, giving performance comparable to baseline sampling strategies updated with ten times as much data.
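As a concrete illustration of the query loop described in this abstract, here is a minimal sketch of pool-based active learning with least-confidence sampling; the logistic-regression learner and the `oracle` callback standing in for the human annotator are illustrative assumptions, not the models used in the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_lab, y_lab, X_pool, oracle, budget=50):
    """Pool-based active learning with least-confidence sampling."""
    clf = LogisticRegression(max_iter=1000)
    for _ in range(budget):
        clf.fit(X_lab, y_lab)
        probs = clf.predict_proba(X_pool)
        i = np.argmin(probs.max(axis=1))          # most uncertain pool sample
        X_lab = np.vstack([X_lab, X_pool[i]])     # query the human annotator
        y_lab = np.append(y_lab, oracle(X_pool[i]))
        X_pool = np.delete(X_pool, i, axis=0)     # remove it from the pool
    return clf
```

Each round retrains on the enlarged labelled set, so the classifier improves with far fewer labels than random sampling would need.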

 

Year of completion: July 2017
Advisor: C V Jawahar

Related Publications

  • Priyam Bakliwal, Guruprasad M. Hegde and C.V. Jawahar - Collaborative Contributions for Better Annotations, VISIGRAPP (6: VISAPP), 2017. [PDF]

  • Priyam Bakliwal, Devadath V V and C.V. Jawahar - Align Me: A framework to generate Parallel Corpus Using OCRs & Bilingual Dictionaries, WSSANLP 2016 (2016): 173. [PDF]

  • Priyam Bakliwal and C.V. Jawahar - Active Learning Based Image Annotation, Proceedings of the Fifth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2015), 16-19 Dec 2015, Patna, India. [PDF]


Downloads

thesis

Population specific template construction and brain structure segmentation using deep learning methods


Raghav Mehta 

Abstract

A brain template, such as MNI152, is a digital (magnetic resonance image, or MRI) representation of the brain in a reference coordinate system for neuroscience research. Structural atlases, such as AAL and DKA, delineate the brain into cortical and subcortical structures, which are used in Voxel Based Morphometry (VBM) and fMRI analysis. Many population-specific templates, e.g. Chinese and Korean, have been constructed recently, as morphological differences have been observed between the average brains of eastern and western populations. In this thesis, we report on the development of a population-specific brain template for the young Indian population, derived from a multi-centric MRI dataset of 100 Indian adults (21-30 years old). Measurements made with this template indicate that the Indian brain, on average, is smaller in height and width than the Caucasian and Chinese brains.

A second problem this thesis examines is the automated segmentation of cortical and non-cortical human brain structures using multiple structural atlases. This has hitherto been approached with computationally expensive non-rigid registration followed by label fusion. We propose an alternative approach using a Convolutional Neural Network (CNN) that classifies each voxel into one of many structures. Evaluation of the proposed method on various datasets showed that the mean Dice coefficient varied from 0.844±0.031 to 0.743±0.019 for the datasets with the fewest (32) and the most (134) labels, respectively. These figures are marginally better than or on par with those obtained by current state-of-the-art methods on nearly all datasets, at a reduced computational cost.

We also propose an end-to-end trainable Fully Convolutional Neural Network (FCNN) architecture, called the M-net, for segmenting deep (human) brain structures. A novel scheme is used to learn to combine and represent the 3D context of a given slice in a 2D slice. Consequently, the M-net uses only 2D convolutions even though it operates on 3D data. Experimental results show that the M-net outperforms other state-of-the-art model-based segmentation methods in terms of Dice coefficient and is at least 3 times faster.
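For reference, the Dice coefficient used as the evaluation measure above can be computed per structure as in this short sketch; this is the standard definition, not code from the thesis.

```python
import numpy as np

def dice_coefficient(pred, gt, label):
    """Dice overlap between predicted and ground-truth masks for one structure label."""
    p = (pred == label)
    g = (gt == label)
    intersection = np.logical_and(p, g).sum()
    denom = p.sum() + g.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Mean Dice over all structure labels in an atlas:
# mean_dice = np.mean([dice_coefficient(pred, gt, l) for l in labels])
```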

 

Year of completion: July 2017
Advisor: Jayanthi Sivaswamy

Related Publications

  • Jayanthi Sivaswamy, Thottupattu AJ, Mehta R, Sheelakumari R and Kesavadas C - Construction of Indian Human Brain Atlas, Neurology India (to appear). [PDF]

  • Majumdar A, Mehta R and Jayanthi Sivaswamy - To Learn Or Not To Learn Features For Deformable Registration? Deep Learning Fails, MICCAI 2018. [PDF]

  • Raghav Mehta and Jayanthi Sivaswamy - M-net: A Convolutional Neural Network for deep brain structure segmentation, 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), IEEE, 2017. [PDF]

  • R Mehta, Aabhas Majumdar and Jayanthi Sivaswamy - BrainSegNet: a convolutional neural network architecture for automated segmentation of human brain structures, Journal of Medical Imaging 4.2 (2017): 024003. [PDF]

  • Raghav Mehta and Jayanthi Sivaswamy - A Hybrid Approach to Tissue-based Intensity Standardization of Brain MRI Images, Proc. of IEEE International Symposium on Biomedical Imaging (ISBI), 13-16 April 2016, Prague. [PDF]


Downloads

thesis

Error Detection and Correction in Indic OCRs


Vinitha VS 

Abstract

Indian languages have a rich literature that is not available in digitized form. Attempts have been made to preserve this repository of art and information by maintaining digital libraries of scanned books. However, this does not fully serve the purpose, as indexing and searching documents stored as images is difficult. An OCR system can be used to convert the scanned documents to editable form, but OCR systems are error prone. These errors are largely unavoidable and arise from issues such as poor-quality images, complex fonts and unknown glyphs. A post-processing system can improve accuracy by using information about the patterns and constraints of word and sentence formation to identify errors and correct them. OCR is considered a problem attempted with marked success in Latin scripts, especially English. This is not the case with Indic scripts, where the error rates of the available OCR systems are comparatively high.

The OCR pipeline has three main stages: segmentation, text recognition and post-processing. Indic scripts are complex, and glyph segmentation itself is a challenge; the existence of visually similar glyphs also makes recognition difficult. The challenges in the post-processing stage are largely due to the properties of the Indian languages themselves. The inflectional nature of languages like Telugu and Malayalam, together with the agglutination of words, creates issues because of the enormous and growing vocabulary of these languages. Unlike the alphabetic system of English, Indic scripts follow an alphasyllabary writing system, so the choice of Unicode code points as the basic unit of a word is questionable; aksharas, which are a more meaningful unit, are considered a better alternative.

In this thesis, we analyze the challenges in building an efficient post-processor for Indic-language OCRs and propose two novel error detection techniques. The post-processing module deals with the detection of errors in the recognized text and the correction of those detected errors. To understand the issues in post-processing Indian languages, we first perform a statistical analysis of textual data. The unavailability of a large corpus prompted us to crawl various newspaper sites and a Wikipedia dump to obtain the required text. We compare the unique-word distribution and word cover of popular Indian languages with English, and observe that languages like Telugu, Tamil, Kannada and Malayalam have a far larger number of unique words. We also observe how many words turn into other valid words of the language, using the Hamming distance between words as a measure.

We then empirically analyze the effectiveness of statistical language models for error detection and correction. First, we use an n-gram model for detecting errors in the OCR output: we use akshara-split words to build bigram and trigram language models that give the probability of a word, and a word is declared erroneous if its probability is lower than a pre-computed threshold. For error correction, we replace the lowest-probability n-gram with a higher-probability one from the n-gram list. We observe that akshara-level n-grams perform better than Unicode-level n-gram models in both error detection and correction. We also discuss why the dictionary-based method, popular for English, is not a reliable solution for error detection and correction in the case of Indic OCRs.
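A minimal sketch of the akshara-level n-gram error detection described above, under stated assumptions: the `tokenize` function that splits a word into aksharas, the add-alpha smoothing, and the thresholding rule are all illustrative stand-ins rather than the thesis's exact formulation.

```python
import math
from collections import Counter

def train_bigram_lm(corpus_words, tokenize):
    """Count akshara unigrams and bigrams over a corpus of correct words."""
    uni, bi = Counter(), Counter()
    for w in corpus_words:
        toks = ["<s>"] + tokenize(w) + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def word_log_prob(word, uni, bi, tokenize, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a word."""
    toks = ["<s>"] + tokenize(word) + ["</s>"]
    V = len(uni)
    return sum(math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * V))
               for a, b in zip(toks, toks[1:]))

# Flag a word as an OCR error when its length-normalized log-probability
# falls below a threshold tuned on held-out data:
# is_error = word_log_prob(w, uni, bi, tokenize) / len(tokenize(w)) < threshold
```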
We use a simple binary dictionary method for error detection: if a word is present in the dictionary it is tagged as correct, and as an error otherwise. The major bottleneck in using a lexicon is the enormous number of words in languages like Telugu and Malayalam. For error correction, we use Levenshtein and Gestalt scores to select candidate replacement words from the dictionary. Inflection makes it difficult to select the correct word, as the candidate list contains many words close to the error word.

We propose two novel methods for detecting errors in the OCR output. Both are language independent and require no knowledge of the language's grammar. The first uses a recurrent neural network to learn the patterns of erroneous and correct words in the OCR output; the second uses a Gaussian mixture model based clustering technique. Both methods build their features from language models over Unicode as well as akshara-split words. We argue that aksharas, each formed by the combination of one or more Unicode characters, are a better choice than Unicode code points as the basic unit of a word. We tested our methods on four popular Indian languages and report an average error detection performance above 80% on a dataset of 5K pages recognized by two state-of-the-art OCR systems.
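The dictionary-based candidate selection described above can be sketched as follows; Python's `difflib.SequenceMatcher` computes a Gestalt pattern-matching ratio, and the Levenshtein distance is a plain dynamic-programming version. How the thesis combines the two scores is not specified, so the ranking rule here is an assumption.

```python
import difflib

def levenshtein(a, b):
    """Classic edit-distance dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def candidates(error_word, dictionary, k=5):
    """Rank dictionary words by Levenshtein distance, breaking ties by Gestalt ratio."""
    def score(w):
        gestalt = difflib.SequenceMatcher(None, error_word, w).ratio()
        return (levenshtein(error_word, w), -gestalt)
    return sorted(dictionary, key=score)[:k]
```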

 

Year of completion: December 2017
Advisor: C V Jawahar

Related Publications

  • V S Vinitha, Minesh Mathew and C. V. Jawahar - An Empirical Study of Effectiveness of Post-processing in Indic Scripts, 6th International Workshop on Multilingual OCR, Kyoto, Japan, 2017. [PDF]

  • V S Vinitha and C V Jawahar - Error Detection in Indic OCRs, Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS'16), 11-14 April 2016, Santorini, Greece. [PDF]


Downloads

thesis

Understanding Short Social Video Clips using Visual-Semantic Joint Embedding


Aditya Singh 

Abstract

The number of videos recorded and shared on the internet has grown massively in the past decade, largely due to the cheap availability of mobile camera phones and easy access to social media websites and their mobile applications. Applications such as Instagram, Vine and Snapchat allow users to record and share content in a matter of seconds. These are not the only media-sharing platforms available, but their monthly active user counts of 600, 200 and 150 million respectively indicate the interest people have in recording, sharing and viewing such content [1,3]. The number of photos and videos shared on Instagram alone exceeds 40 billion [1]; Vine contains approximately 40 million user-created videos [3], played a billion times daily. This cheaply available data can empower many learning tasks that require huge amounts of curated data, and the videos contain novel viewpoints and reflect real-world dynamics. Unlike the content on older, established websites such as YouTube, the content shared here is short (typically a few seconds) and comes with a description and associated hash-tags. Hash-tags can be thought of as keywords assigned by the user to highlight the contextual aspects of the shared media. However, unlike English words, they have no definite meaning on their own: their interpretation is heavily reliant on the content alongside which they are used, so deciphering a hash-tag requires the associated media. Hash-tags are therefore more ambiguous and difficult to categorise than English words.

In this thesis, we attempt to shed some light on the applicability and utility of videos shared on the popular social media website vine.co. The videos shared there are called vines and are typically 6 seconds long; each carries a description composed of a mixture of English words and hash-tags. We recognise actions and recommend hash-tags for an unseen vine by utilising the visual content, the semantic content and the hash-tags provided with the vines. In doing so, we show how this untapped resource can benefit resource-intensive tasks that require huge amounts of curated data.

Action recognition deals with categorising the action performed in a video into one of the seen categories. With recent developments, considerable precision has been achieved on established datasets; in an open-world scenario, however, these approaches fail, as the conditions are unpredictable. We show that vines are a much more difficult category of videos than those currently in circulation for such tasks. To avoid manual annotation of vines, which would defeat the purpose of their easy availability, we develop a semi-supervised bootstrapping approach. We iteratively build an efficient classifier that leverages an existing dataset for 7 action categories as well as the visual and semantic information present in the vines. The existing dataset forms the source domain and the vines compose the target domain. We utilise the semantic word2vec space as a common subspace in which to embed video features from both the labeled source domain and the unlabeled target domain. Our method incrementally augments the labeled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together, as sketched below.
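A skeletal view of such a bootstrapping loop, assuming hypothetical `fit_embed` and `predict_proba` helpers for the embedding and classification steps; the confidence threshold and round count are illustrative, not the thesis's actual criteria.

```python
import numpy as np

def bootstrap(src_X, src_y, tgt_X, fit_embed, predict_proba, thresh=0.9, rounds=5):
    """Self-training sketch: grow the labeled set with confident target samples
    and re-fit the embedding each round, pulling the two domains together."""
    X, y = src_X, src_y
    model = None
    for _ in range(rounds):
        model = fit_embed(X, y)               # embed video features in word2vec space
        probs = predict_proba(model, tgt_X)   # class posteriors for unlabeled vines
        confident = probs.max(axis=1) >= thresh
        if not confident.any():
            break                             # nothing confident left to add
        X = np.vstack([X, tgt_X[confident]])
        y = np.append(y, probs[confident].argmax(axis=1))  # pseudo-labels
        tgt_X = tgt_X[~confident]             # shrink the unlabeled pool
    return model
```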
Additionally, we utilise a multi-modal representation that incorporates the noisy semantic information available in the form of hash-tags. Hash-tags are an integral part of vines: adding more, relevant hash-tags expands the set of categories for which a vine can be selected, enhancing its utility by supplying missing tags. We design a hash-tag recommendation system to assign tags to an unseen vine from 29 categories. The system uses only a vine's visual content, after accumulating knowledge gathered in an unsupervised fashion. We build a Tag2Vec space from a corpus of 10 million hash-tags using skip-grams, and train an embedding function to map video features into this low-dimensional Tag2Vec space. We learn the embedding from 29 categories of short video clips with hash-tags. A query video without any tag information can then be mapped directly into the tag vector space using the learned embedding, and relevant tags can be found by a simple nearest-neighbour retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
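The Tag2Vec space itself can be sketched with gensim's skip-gram Word2Vec (gensim 4.x API assumed); the toy tag lists and the mapping from video features to `video_vec` are placeholders for the real 10-million-tag corpus and the learned embedding function.

```python
from gensim.models import Word2Vec

# Each training "sentence" is the list of hash-tags attached to one vine.
tag_lists = [["#dog", "#cute", "#puppy"], ["#soccer", "#goal"]]  # toy stand-in
model = Word2Vec(sentences=tag_lists, vector_size=100, sg=1,  # sg=1 -> skip-gram
                 window=5, min_count=1, workers=4)

def recommend_tags(video_vec, k=10):
    """Nearest neighbours of an embedded video vector in the Tag2Vec space."""
    return model.wv.similar_by_vector(video_vec, topn=k)
```

A query video is first mapped by the learned embedding function into this 100-dimensional space; `recommend_tags` then returns the k closest hash-tags by cosine similarity.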

 

Year of completion: December 2017
Advisor: P J Narayanan

Related Publications

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P. J. Narayanan - Learning to hash-tag videos with Tag2Vec, Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, ACM, 2016. [PDF]

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P J Narayanan - From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips, 38th German Conference on Pattern Recognition (GCPR 2016), Hannover, Germany, September 12-15, 2016. [PDF]

  • Aditya Deshpande, Siddharth Choudhary, P J Narayanan, Krishna Kumar Singh, Kaustav Kundu, Aditya Singh and Apurva Kumar - Geometry Directed Browser for Personal Photographs, Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing, 16-19 Dec 2012, Bombay, India. [PDF]


Downloads

thesis

Fine Pose Estimation and Region Proposals from a Single Image


Sudipto Banerjee (Home Page)

Abstract

Understanding the precise 3D structure of an environment is one of the fundamental goals of computer vision, and it is challenging due to factors such as appearance variation, illumination, pose, noise, occlusion and scene clutter. A generic solution to the problem is ill-posed because depth information is lost during imaging. In this thesis, we consider a specific but common situation, where the scene contains known objects. Given 3D models of a set of known objects and a cluttered scene image, we detect these objects in the image and align the 3D models to their images to find their exact pose, posing this as a 3D-to-2D alignment problem. We also deal with pose estimation of 3D articulated objects in images. We evaluate the proposed method on the BigBird dataset and our own tabletop dataset, and present experimental comparisons with state-of-the-art methods. To find the pose of an object, we adopt a hierarchical approach: we first obtain an initial estimate of the pose and then refine it using a robust algorithm. Obtaining a good initial estimate is crucial, as the refinement is entirely dependent on it.

Estimating object (region) proposals from an image is a well-known but difficult task, whose complexity is intensified by object-object interaction and background clutter. We tackle this with a robust Convolutional Neural Network based method that learns object proposals in a supervised manner. As we need region proposals at the object level, we solve the problem of instance-level semantic segmentation, where each pixel in the image is classified into one of the known classes, and two pixels are labelled differently if they belong to two different instances of the same class. We show quantitative and qualitative comparisons of our network models with previous approaches, and present results on the challenging PASCAL VOC dataset.
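As a hedged illustration of the coarse-to-fine 3D-to-2D alignment described above, the sketch below uses OpenCV's PnP solvers: a RANSAC estimate provides the initial pose and Levenberg-Marquardt refinement tightens it (OpenCV >= 4.1 is assumed for `solvePnPRefineLM`). The 2D-3D correspondences are assumed given, and this is a stand-in, not the thesis's actual alignment algorithm.

```python
import cv2
import numpy as np

def estimate_pose(model_pts_3d, image_pts_2d, K):
    """Coarse-to-fine pose: RANSAC PnP initial estimate, then LM refinement.
    model_pts_3d: Nx3 float32, image_pts_2d: Nx2 float32, K: 3x3 intrinsics."""
    dist = np.zeros(5)  # assume an undistorted image
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_pts_3d, image_pts_2d, K, dist, reprojectionError=4.0)
    if not ok:
        raise RuntimeError("PnP failed to find an initial pose")
    # Refine using only the RANSAC inliers; the result depends on this initial estimate.
    rvec, tvec = cv2.solvePnPRefineLM(
        model_pts_3d[inliers.ravel()], image_pts_2d[inliers.ravel()],
        K, dist, rvec, tvec)
    return rvec, tvec  # object rotation (Rodrigues vector) and translation
```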


Year of completion: March 2018
Advisor: Anoop M Namboodiri

Related Publications

  • Sudipto Banerjee, Sanchit Aggarwal and Anoop M. Namboodiri - Fine Pose Estimation of Known Objects in Cluttered Scene Images, Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]


Downloads

thesis
