Thesis Students

Error Detection and Correction in Indic OCRs

Vinitha VS

Abstract

Indian languages have a rich literature that is largely unavailable in digitized form. Attempts have been made to preserve this repository of art and information by maintaining digital libraries of scanned books. However, scanned images alone do not serve the purpose, because indexing and searching documents stored as images is difficult. An OCR system can convert scanned documents into editable text, but OCR systems are error-prone. These errors are largely unavoidable and arise from issues such as poor-quality images, complex fonts and unknown glyphs. A post-processing system can improve accuracy by using the patterns and constraints of word and sentence formation to identify errors and correct them. OCR is considered a problem solved with marked success for Latin scripts, especially English. This is not the case for Indic scripts, where the error rates of the available OCR systems are comparatively high. The OCR pipeline has three main stages: segmentation, text recognition and post-processing. Indic scripts are complex, and glyph segmentation is itself a challenge. The existence of visually similar glyphs also makes recognition difficult. The challenges in the post-processing stage stem largely from the properties of Indian languages. The inflectional nature of languages such as Telugu and Malayalam, together with the agglutination of words, results in an enormous and growing vocabulary. Unlike the alphabetic system of English, Indic scripts follow an alphasyllabary writing system. Hence the choice of Unicode characters as the basic unit of a word is questionable. Aksharas, which are a more meaningful unit, are considered a better alternative to Unicode characters. In this thesis, we analyze the challenges in building an efficient post-processor for Indic language OCRs and propose two novel error detection techniques.
The post-processing module deals with detecting errors in the recognized text and correcting the detected errors. To understand the issues in post-processing for Indian languages, we first perform a statistical analysis of textual data. Because no large corpus was available, we crawled various newspaper sites and used a Wikipedia dump to obtain the required text. We compare the unique word distribution and word coverage of popular Indian languages with those of English. We observe that languages such as Telugu, Tamil, Kannada and Malayalam tend to have a far larger number of unique words than English. We also measure how many words turn into other valid words of the language under small changes, using the Hamming distance between words as the measure. We empirically analyze the effectiveness of statistical language models for error detection and correction. First, we use an n-gram model to detect errors in the OCR output. We use akshara-split words to build bigram and trigram language models that give the probability of a word. A word is declared an error if its probability is lower than a pre-computed threshold. For error correction, we replace the lowest-probability n-gram with a higher-probability one from the n-gram list. We observe that akshara-level n-gram models perform better than Unicode-level n-gram models in both error detection and correction. We also discuss why the dictionary-based method, popular for English, is not a reliable solution for error detection and correction in Indic OCRs. We use a simple binary dictionary method for error detection: a word is tagged as correct if it is present in the dictionary and as an error otherwise. The major bottleneck in using a lexicon is the enormous number of words in languages such as Telugu and Malayalam. For error correction, we use Levenshtein and Gestalt scores to select candidate words from the dictionary to replace the error word.
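The dictionary-based baseline described above can be illustrated with a minimal sketch. It is not the thesis implementation: the toy English lexicon, the function names and the tie-breaking order are assumptions made for illustration, and the Gestalt score is approximated with Python's difflib.SequenceMatcher, which is based on Ratcliff/Obershelp gestalt pattern matching.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def gestalt(a, b):
    """Similarity in [0, 1] from difflib's gestalt-style matcher."""
    return SequenceMatcher(None, a, b).ratio()


def detect_error(word, lexicon):
    """Binary dictionary check: any word outside the lexicon is an error."""
    return word not in lexicon


def rank_candidates(word, lexicon, top_k=3):
    """Order lexicon words by edit distance, then by gestalt similarity."""
    ranked = sorted(
        lexicon,
        key=lambda w: (levenshtein(word, w), -gestalt(word, w), w),
    )
    return ranked[:top_k]


# Hypothetical toy lexicon; a real Indic lexicon would be far larger.
LEXICON = {"reading", "leading", "heading", "reads", "read"}

if __name__ == "__main__":
    print(detect_error("reoding", LEXICON))
    print(rank_candidates("reoding", LEXICON))
```

Even in this toy lexicon, several candidates sit at the same edit distance from the error word, which hints at why heavily inflected vocabularies make the choice of a replacement hard.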
Inflection makes it hard to select the correct word, because the candidate list contains many words that are close to the error word. We propose two novel methods for detecting errors in the OCR output. Both methods are language independent and do not require knowledge of the language's grammar. The first method uses a recurrent neural network to learn the patterns of erroneous and correct words in the OCR output. The second method uses a clustering technique based on a Gaussian mixture model. Both methods build their features from language models over Unicode-split as well as akshara-split words. We argue that aksharas are a better choice than Unicode characters as the basic unit of a word. An akshara is formed by combining one or more Unicode characters. We tested our methods on four popular Indian languages and report an average error detection performance above 80% on a dataset of 5K pages recognized using two state-of-the-art OCR systems.
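As an illustration of akshara-level language modelling with a probability threshold, the sketch below groups Unicode code points into approximate aksharas and scores a word with a smoothed bigram model. It is a simplified sketch under stated assumptions, not the thesis method: the grouping rule (attach combining marks to the previous unit, and join a consonant that follows a virama) ignores many script-specific cases, and the add-one smoothing, the length normalisation and the threshold value are all hypothetical choices.

```python
import math
import unicodedata
from collections import Counter


def split_aksharas(word):
    """Group code points into approximate aksharas (simplified rule)."""
    aksharas = []
    join_next = False
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if aksharas and (is_mark or join_next):
            aksharas[-1] += ch      # extend the current akshara
        else:
            aksharas.append(ch)     # start a new akshara
        # A virama joins the following consonant into a conjunct.
        join_next = "VIRAMA" in unicodedata.name(ch, "")
    return aksharas


class BigramModel:
    """Add-one smoothed bigram model over units produced by `tokenize`."""

    def __init__(self, words, tokenize):
        self.tokenize = tokenize
        self.contexts = Counter()
        self.bigrams = Counter()
        for w in words:
            units = ["<s>"] + tokenize(w) + ["</s>"]
            self.contexts.update(units[:-1])
            self.bigrams.update(zip(units, units[1:]))
        self.vocab = len(set(self.contexts) | {"</s>"})

    def log_prob(self, word):
        """Average log-probability per bigram, so long words are not penalised."""
        units = ["<s>"] + self.tokenize(word) + ["</s>"]
        total = 0.0
        for a, b in zip(units, units[1:]):
            total += math.log(
                (self.bigrams[(a, b)] + 1) / (self.contexts[a] + self.vocab)
            )
        return total / (len(units) - 1)


def is_error(model, word, threshold):
    """Flag a word whose score falls below a pre-computed threshold."""
    return model.log_prob(word) < threshold


if __name__ == "__main__":
    namaste = "\u0928\u092e\u0938\u094d\u0924\u0947"  # Devanagari, 6 code points
    print(split_aksharas(namaste))                     # 3 aksharas
    model = BigramModel([namaste] * 3, split_aksharas)
    print(is_error(model, namaste, threshold=-1.0))
```

Passing `list` instead of `split_aksharas` as the tokenizer gives the Unicode-level counterpart, so the two units can be compared on the same data.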

Year of completion:	December 2017
Advisor :	C V Jawahar

Related Publications

V S Vinitha, Minesh Mathew and C. V. Jawahar - An Empirical Study of Effectiveness of Post-processing in Indic Scripts - 6th International Workshop on Multilingual OCR, Kyoto, Japan, 2017.
V S Vinitha and C. V. Jawahar - Error Detection in Indic OCRs - Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS'16), 11-14 April 2016, Santorini, Greece.

Understanding Short Social Video Clips using Visual-Semantic Joint Embedding

Aditya Singh

Abstract

The number of videos recorded and shared on the internet has grown massively in the past decade. Most of this growth is due to the cheap availability of mobile camera phones and easy access to social media websites and their mobile applications. Applications such as Instagram, Vine and Snapchat allow users to record and share content in a matter of seconds. These three are not the only media sharing platforms available, but their monthly active user counts of 600, 200 and 150 million respectively indicate the interest people have in recording, sharing and viewing content [1,3]. The number of photos and videos shared on Instagram alone exceeds 40 billion [1]. Vine contains approximately 40 million user-created videos [3], which are played 1 billion times daily. This cheaply available data can empower many learning tasks that require huge amounts of curated data. The videos also contain novel viewpoints and reflect real-world dynamics. Unlike the content on older, established websites such as YouTube, the content shared here is short (typically a few seconds) and comes with a description and associated hash-tags. Hash-tags can be thought of as keywords assigned by the user to highlight the contextual aspect of the shared media. However, unlike English words, they have no definite meaning, because their interpretation relies heavily on the content alongside which they are used. To decipher the meaning of a hash-tag, one needs the associated media. Hence, hash-tags are more ambiguous and more difficult to categorise than English words. In this thesis, we attempt to shed some light on the applicability and utility of videos shared on a popular social media website, vine.co. The videos shared there are called vines and are typically 6 seconds long. Each comes with a description composed of a mixture of English words and hash-tags.
We recognise actions in unseen vines and recommend hash-tags for them by utilising the visual content, the semantic content and the hash-tags that vines provide. In doing so, we try to show how this untapped resource of a popular social media format can benefit resource-intensive tasks that require huge amounts of curated data. Action recognition deals with assigning the action performed in a video to one of the seen categories. With recent developments, considerable precision has been achieved on established datasets. However, in an open-world scenario these approaches fail, because the conditions are unpredictable. We show that vines are a much more difficult category of videos than those currently used for such tasks. Manually annotating vines would defeat the purpose of their easy availability, so we develop a semi-supervised bootstrapping approach that avoids manual annotation. We iteratively build an efficient classifier that leverages an existing dataset for 7 action categories together with the visual and semantic information present in the vines. The existing dataset forms the source domain and the vines form the target domain. We use the semantic word2vec space as a common subspace in which to embed video features from both the labeled source domain and the unlabeled target domain. Our method incrementally augments the labeled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together. Additionally, we use a multi-modal representation that incorporates the noisy semantic information available in the form of hash-tags. Hash-tags form an integral part of vines. Adding more relevant hash-tags can expand the categories for which a vine can be selected. This enhances the utility of vines by providing missing tags and expanding their scope. We design a hash-tag recommendation system that assigns tags from 29 categories to an unseen vine.
This system uses only a vine's visual content, after accumulating knowledge gathered in an unsupervised fashion. We build a Tag2Vec space by training skip-grams on a corpus of 10 million hash-tags. We then train an embedding function that maps video features to the low-dimensional Tag2Vec space. We learn this embedding for 29 categories of short video clips with hash-tags. A query video without any tag information can then be mapped directly to the vector space of tags using the learned embedding, and relevant tags can be found by a simple nearest-neighbor retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
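The retrieval step can be sketched as follows: a video feature vector is mapped into the tag space and the closest tag vectors are returned. Everything concrete here is an assumption for illustration: the two-dimensional tag vectors, the linear projection matrix standing in for the learned embedding function, the cosine similarity measure and all names are hypothetical rather than taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embed_video(features, weights):
    """Stand-in for the learned embedding: a plain linear projection."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def recommend_tags(video_vec, tag_vectors, k=2):
    """Nearest-neighbor retrieval: the k tags most similar to the video."""
    ranked = sorted(
        tag_vectors,
        key=lambda tag: cosine(video_vec, tag_vectors[tag]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-D tag space; a real one would come from skip-gram training.
TAG_VECTORS = {
    "#dog": [0.9, 0.1],
    "#puppy": [0.8, 0.3],
    "#soccer": [0.1, 0.9],
    "#goal": [0.2, 0.8],
}

# Hypothetical projection from 3-D video features to the 2-D tag space.
WEIGHTS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

if __name__ == "__main__":
    query = embed_video([0.7, 0.2, 0.5], WEIGHTS)
    print(recommend_tags(query, TAG_VECTORS))
```

Because the query needs only visual features at test time, tags can be suggested for a clip that arrives with no description at all.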

Year of completion:	December 2017
Advisor :	P J Narayanan

Related Publications

Aditya Singh, Saurabh Saini, Rajvi Shah and P. J. Narayanan - Learning to hash-tag videos with Tag2Vec - Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, ACM, 2016.
Aditya Singh, Saurabh Saini, Rajvi Shah and P. J. Narayanan - From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips - 38th German Conference on Pattern Recognition (GCPR 2016), Hannover, Germany, 12-15 September 2016.
Aditya Deshpande, Siddharth Choudhary, P. J. Narayanan, Krishna Kumar Singh, Kaustav Kundu, Aditya Singh and Apurva Kumar - Geometry Directed Browser for Personal Photographs - Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing, 16-19 Dec 2012, Bombay, India.

Fine Pose Estimation and Region Proposals from a Single Image

Sudipto Banerjee

Abstract

Understanding the precise 3D structure of an environment is one of the fundamental goals of computer vision and is challenging due to factors such as appearance variation, illumination, pose, noise, occlusion and scene clutter. A generic solution to the problem is ill-posed because depth information is lost during imaging. In this work, we consider a specific but common situation in which the scene contains known objects. Given 3D models of a set of known objects and a cluttered scene image, we detect these objects in the image and align the 3D models to their images to find their exact pose. We develop an approach that poses this as a 3D-to-2D alignment problem. We also deal with pose estimation of 3D articulated objects in images. We evaluate our proposed method on the BigBird dataset and our own tabletop dataset, and present experimental comparisons with state-of-the-art methods. To find the pose of an object, we use a hierarchical approach in which we first obtain an initial estimate of the pose and then refine it using a robust algorithm. Obtaining the initial estimate is crucial, as the refinement is entirely dependent on it. Estimating object proposals or region proposals from an image is a well-known but difficult task, and the problem becomes harder in the presence of object-object interaction and background clutter. We tackle it with a robust method based on a Convolutional Neural Network that learns object proposals in a supervised manner. As we need region proposals at the object level, we solve the problem of instance-level semantic segmentation, where each pixel in the image is classified into one of the known classes. Moreover, two pixels are labelled differently if they belong to two different instances of the same class. We present quantitative and qualitative comparisons of our proposed network models with previous approaches, and report results on the challenging PASCAL VOC dataset.

Year of completion:	March 2018
Advisor :	Anoop M Namboodiri

Related Publications

Sudipto Banerjee, Sanchit Aggarwal and Anoop M. Namboodiri - Fine Pose Estimation of Known Objects in Cluttered Scene Images - Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia.

Efficient Annotation of Objects for Video Analysis

Swetha Sirnam

Abstract

The field of computer vision is expanding rapidly and has significantly more processing power and memory at its disposal today than in previous decades. Video has become one of the most popular visual media for communication and entertainment. In particular, automatically analysing and understanding the content of a video is one of the long-standing goals of computer vision. One of the fundamental problems is to model the appearance and behavior of the objects in videos. Such models depend mainly on the problem definition. In many scenarios, a change in the problem statement is followed by changes in the annotation and its complexity. Creating large-scale datasets by manual annotation in this setting is monotonous, time-consuming and non-scalable. To address this challenge and move towards practical, large-scale annotated video datasets, we investigate methods that autonomously learn and adapt object models using temporal information in videos. Although the vision community has advanced in problem solving, data generation and annotation remain difficult. Data annotation is expensive, tedious and involves a lot of human effort. Even after annotation, it is essential to validate the quality of the annotations, which is again a tiresome process. Addressing this involves learning robust representations of the video. The aim of this thesis is two-fold: first, to propose solutions for efficient and accurate object annotation in video sequences, and second, to raise awareness in the community of the importance of annotation and the attention it deserves. As our first contribution, we propose an efficient, scalable and accurate object bounding box annotation method for large-scale complex video datasets.
We focus on minimizing the annotation effort while increasing the accuracy of annotation propagation, so as to obtain a precise and tight bounding box around the object of interest. Using a self-training approach, we combine a semi-automatic initialization method with an energy minimization framework to propagate the annotations. Using energy minimization for segmentation gives accurate and tight bounding boxes around the object. We have validated the results quantitatively and qualitatively on publicly available datasets. In the second half, we propose an annotation scheme for human pose in video sequences. The proposed model relies on a fully automatic initialization from any generic state-of-the-art method. However, the initialization is prone to error because of the challenges inherent in video data. We exploit the redundant information available in video data. The model is built on the assumption of temporal smoothness in videos. We formulate the problem as sequence-to-sequence learning; the architecture uses a Long Short-Term Memory encoder-decoder model to encode the temporal context and annotate the pose. We show results on state-of-the-art datasets.

Year of completion:	June 2018
Advisor :	C V Jawahar, Vineeth Balasubramanian

Related Publications

Sirnam Swetha, Vineeth N Balasubramanian and C. V. Jawahar - Sequence-to-Sequence Learning for Human Pose Correction in Videos - 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017.
Sirnam Swetha, Anand Mishra, Guruprasad M. Hegde and C. V. Jawahar - Efficient Object Annotation for Surveillance and Automotive Applications - Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshop (WACVW 2016), 7-9 March 2016.
Rajat Aggarwal, Sirnam Swetha, Anoop M. Namboodiri, Jayanthi Sivaswamy and C. V. Jawahar - Online Handwriting Recognition using Depth Sensors - Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France.

Landmark Detection in Retinal Images

Gaurav Mittal

Abstract

Advances in the medical field and in imaging systems have resulted in a series of devices that sense, record, transform and process digital data. In the case of the human eye, this digital data consists of fundus images, which are images of the back of the eye, where the retina lies. Automatic analysis of these images is required to process large amounts of data and help doctors make the final diagnosis. Retina images have 3 major visible landmarks: the optic disk (OD), the macula and the blood vessels. In retina images, the OD appears as a bright elliptical structure, the macula appears as a small dark region and the blood vessels appear as a dark, tree-branch-like structure. In this thesis, we propose methods for detecting retina landmarks. Accurate detection of the OD and macula is important because computer-assisted diagnosis systems use the locations of these landmarks to understand the retina image and apply clinical facts about the retina to improve diagnosis. Retina landmark detection also aids in assessing the severity of diseases based on the locations of abnormalities relative to these landmarks. We first used a retina atlas for OD and macula detection. The idea of a retina atlas is inspired by the 3D brain atlas [34]. We create 2 retina atlases, an intensity atlas and a probability atlas, by annotating public datasets locally. We use the probability atlas for OD and macula detection, but the detection rate and accuracy of this system are low. To achieve better detection, we then used Generalized Motion Patterns (GMP) [14][23] for OD and macula detection. The GMP is derived by inducing motion in an image, which serves to smooth out unwanted information while highlighting the structures of interest. Our GMP-based detection is fully unsupervised and its results outperform all other unsupervised methods. The results are comparable to those of supervised methods. The proposed GMP-based system is completely parallelizable and handles illumination differences efficiently.
Blood vessels are another important retina landmark. We find that current research evaluates vessel segmentation performance with measures such as sensitivity, specificity, accuracy, area under the curve and the Matthews correlation coefficient. We identify several gaps in these evaluation measures and propose local accuracy, which is an extension of [39]. We show that local accuracy is especially useful in settings where segmentation of weak vessels and accurate estimation of vessel width are required.
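For reference, the standard global measures named above can be computed from pixel-wise confusion counts, as in the sketch below. The function name and the example counts are hypothetical, and the proposed local accuracy measure is not reproduced here because the abstract does not define it.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def vessel_metrics(tp, fp, tn, fn):
    """Global pixel-wise measures from confusion counts.

    tp: vessel pixels labelled vessel, fp: background labelled vessel,
    tn: background labelled background, fn: vessel labelled background.
    """
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Hypothetical counts for one segmented image.
    for name, value in vessel_metrics(tp=80, fp=10, tn=900, fn=20).items():
        print(f"{name}: {value:.4f}")
```

Because background pixels far outnumber vessel pixels in a typical fundus image, accuracy and specificity tend to stay high even when thin vessels are missed, which is one reason a single global score can hide errors on weak vessels.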

Year of completion:	June 2018
Advisor :	Jayanthi Sivaswamy

Related Publications

Gaurav Mittal and Jayanthi Sivaswamy - Optic Disk and Macula Detection from Retinal Images using Generalized Motion Pattern - Proceedings of the Fifth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2015), 16-19 Dec 2015, Patna, India.
