
Understanding Short Social Video Clips using Visual-Semantic Joint Embedding


Aditya Singh 

Abstract

The amount of video recorded and shared on the internet has grown massively in the past decade, largely due to the wide availability of cheap mobile camera phones and easy access to social media websites and their mobile applications. Applications such as Instagram, Vine and Snapchat allow users to record and share content in a matter of seconds. These are far from the only media-sharing platforms available, but their monthly active user counts of roughly 600, 200 and 150 million respectively indicate the interest people have in recording, sharing and viewing such content [1,3]. The number of photos and videos shared on Instagram alone exceeds 40 billion [1], and Vine hosts approximately 40 million user-created videos [3] that are played around a billion times a day. This cheaply available data can empower many learning tasks that require large amounts of curated data; the videos also contain novel viewpoints and reflect real-world dynamics. Unlike the content on older, established websites such as YouTube, the content shared here is shorter (typically a few seconds) and comes with a description and associated hash-tags. Hash-tags can be thought of as keywords assigned by the user to highlight the contextual aspects of the shared media. However, unlike English words they have no definite meaning of their own: their meaning depends heavily on the content alongside which they are used, and deciphering a hash-tag generally requires the associated media. Hash-tags are therefore more ambiguous and harder to categorise than English words.

In this thesis, we attempt to shed some light on the applicability and utility of videos shared on the popular social media website vine.co. The videos shared there, called vines, are typically 6 seconds long and carry a description composed of a mixture of English words and hash-tags. We attempt to recognise actions in, and recommend hash-tags for, an unseen vine by exploiting the visual content, the semantic content and the hash-tags provided with the vines. In doing so, we try to show how this untapped resource, a popular social media format, can benefit resource-intensive tasks that require large amounts of curated data.

Action recognition deals with categorising the action performed in a video into one of a set of seen categories. With recent developments, considerable precision has been achieved on established datasets; in an open-world scenario, however, these approaches fail because the conditions are unpredictable. We show that vines are a much more difficult category of videos than those currently in circulation for such tasks. To avoid manual annotation of vines, which would defeat the purpose of their easy availability, we develop a semi-supervised bootstrapping approach. We iteratively build an efficient classifier that leverages an existing dataset for 7 action categories as well as the visual and semantic information present in the vines. The existing dataset forms the source domain and the vines compose the target domain. We use the semantic word2vec space as a common subspace in which to embed video features from both the labeled source domain and the unlabeled target domain. Our method incrementally augments the labeled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together.
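To make the bootstrapping idea concrete, the following is a minimal sketch: video features and word2vec label vectors are assumed to be precomputed, a ridge regressor stands in for the embedding function, and the most confident target predictions are folded back into the training pool each round. All names and parameters are illustrative and not taken from the thesis code.

```python
# Sketch of semi-supervised bootstrapping in a shared word2vec space (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

def bootstrap_embedding(src_X, src_label_ids, tgt_X, label_vecs,
                        n_rounds=5, keep_frac=0.1):
    """Iteratively map video features into the word2vec label space and
    augment the labeled source pool with the most confident target samples."""
    X, Y = src_X, label_vecs[src_label_ids]              # labeled source pool
    for _ in range(n_rounds):
        reg = Ridge(alpha=1.0).fit(X, Y)                  # embedding: feature -> word2vec
        tgt_emb = reg.predict(tgt_X)
        # cosine similarity of each embedded target video to every label vector
        sims = (tgt_emb @ label_vecs.T) / (
            np.linalg.norm(tgt_emb, axis=1, keepdims=True)
            * np.linalg.norm(label_vecs, axis=1) + 1e-8)
        pred, conf = sims.argmax(1), sims.max(1)
        # keep the most confident target predictions as pseudo-labels
        top = np.argsort(-conf)[: max(1, int(keep_frac * len(tgt_X)))]
        X = np.vstack([src_X, tgt_X[top]])
        Y = np.vstack([label_vecs[src_label_ids], label_vecs[pred[top]]])
    return reg
```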
Additionally, we use a multi-modal representation that incorporates the noisy semantic information available in the form of hash-tags. Hash-tags form an integral part of vines: adding more, relevant hash-tags expands the set of categories for which a vine can be retrieved, enhancing its utility by supplying missing tags and widening its scope. We design a hash-tag recommendation system that assigns tags to an unseen vine from 29 categories. The system uses only the vine's visual content, after accumulating knowledge gathered in an unsupervised fashion. We build a Tag2Vec space from a corpus of 10 million hash-tags using skip-grams, and then train an embedding function that maps video features into this low-dimensional Tag2Vec space. We learn the embedding from 29 categories of short video clips with hash-tags. A query video without any tag information can then be mapped directly into the vector space of tags using the learned embedding, and relevant tags can be found by a simple nearest-neighbour retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
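The tag-recommendation step can be illustrated with a small sketch: a skip-gram model trained on per-vine hash-tag lists stands in for the Tag2Vec space, and video_to_tag2vec is a placeholder for the learned video-to-Tag2Vec embedding. This is an assumption-laden illustration, not the system built in the thesis.

```python
# Sketch of tag recommendation by nearest-neighbour lookup in a Tag2Vec space.
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is the list of hash-tags attached to one vine (toy corpus here).
tag_corpus = [["#dog", "#puppy", "#cute"], ["#soccer", "#goal", "#sports"]]
tag2vec = Word2Vec(sentences=tag_corpus, vector_size=100, sg=1, min_count=1)

def recommend_tags(video_feature, video_to_tag2vec, topn=5):
    """Map a video feature into Tag2Vec space and return the nearest tags."""
    query = video_to_tag2vec(video_feature)               # learned embedding function
    return tag2vec.wv.similar_by_vector(
        np.asarray(query, dtype=np.float32), topn=topn)
```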

 

Year of completion: December 2017
Advisor: P J Narayanan

Related Publications

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P. J. Narayanan - Learning to Hash-Tag Videos with Tag2Vec, Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. ACM, 2016. [PDF]

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P J Narayanan - From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips, 38th German Conference on Pattern Recognition (GCPR 2016), Hannover, Germany, September 12-15, 2016. [PDF]

  • Aditya Deshpande, Siddharth Choudhary, P J Narayanan, Krishna Kumar Singh, Kaustav Kundu, Aditya Singh, Apurva Kumar - Geometry Directed Browser for Personal Photographs, Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing (ICVGIP), 16-19 Dec 2012, Bombay, India. [PDF]


Downloads

thesis

Fine Pose Estimation and Region Proposals from a Single Image


Sudipto Banerjee (Home Page)

Abstract

Understanding the precise 3D structure of an environment is one of the fundamental goals of computer vision, and it is challenging due to factors such as appearance variation, illumination, pose, noise, occlusion and scene clutter. A generic solution to the problem is ill-posed because depth information is lost during imaging. In this thesis, we consider a specific but common situation in which the scene contains known objects. Given 3D models of a set of known objects and a cluttered scene image, we try to detect these objects in the image and align the 3D models to their images to find their exact pose. We develop an approach that formulates this as a 3D-to-2D alignment problem, and we also deal with pose estimation of 3D articulated objects in images. We evaluate the proposed method on the BigBird dataset and on our own tabletop dataset, and present experimental comparisons with state-of-the-art methods. To find the pose of an object, we use a hierarchical approach in which we first obtain an initial estimate of the pose and then refine it using a robust algorithm; obtaining a good initial estimate is crucial, as the refinement depends entirely on it. Estimating object (region) proposals from an image is a well-known but difficult task, and the complexity of the problem intensifies in the presence of object-object interaction and background clutter. We tackle this with a robust Convolutional Neural Network based method that learns object proposals in a supervised manner. As we need region proposals at the object level, we solve the problem of instance-level semantic segmentation, where each pixel in the image is classified into one of the known classes and two pixels are labelled differently if they belong to two different instances of the same class. We show quantitative and qualitative comparisons of our proposed network models with previous approaches, and show our results on the challenging PASCAL VOC dataset.
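As a rough illustration of the 3D-to-2D alignment step (recovering an initial pose estimate from correspondences between 3D model points and their 2D image projections), the sketch below uses OpenCV's RANSAC PnP solver as a stand-in; the thesis's actual detection, alignment and refinement pipeline is more involved.

```python
# Sketch of 3D-to-2D pose alignment from known correspondences (illustrative only).
import numpy as np
import cv2

def estimate_pose(model_pts_3d, image_pts_2d, camera_matrix):
    """Return the object rotation (Rodrigues vector) and translation."""
    dist_coeffs = np.zeros(4)                       # assume an undistorted image
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_pts_3d.astype(np.float32),            # (N, 3) points on the 3D model
        image_pts_2d.astype(np.float32),            # (N, 2) detected image points
        camera_matrix, dist_coeffs)
    if not ok:
        raise RuntimeError("pose estimation failed")
    return rvec, tvec
```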


Year of completion: March 2018
Advisor: Anoop M Namboodiri

Related Publications

  • Sudipto Banerjee, Sanchit Aggarwal, Anoop M. Namboodiri - Fine Pose Estimation of Known Objects in Cluttered Scene Images, Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]


Downloads

thesis

Efficient Annotation of Objects for Video Analysis


Swetha Sirnam (Home Page)

Abstract

The field of computer vision is rapidly expanding, with significantly more processing power and memory available today than in previous decades. Video has become one of the most popular visual media for communication and entertainment, and automatic analysis and understanding of video content is one of the long-standing goals of computer vision. A fundamental problem is modelling the appearance and behaviour of the objects in videos. Such models depend mainly on the problem definition; typically, a change in the problem statement is followed by changes in the annotation and its complexity. Creating large-scale datasets in this setting through manual annotation is monotonous, time-consuming and non-scalable. Although the vision community has advanced in problem solving, data generation and annotation remain difficult: annotation is expensive, tedious and involves a great deal of human effort, and even after annotation it is essential to validate the quality of the annotations, which is again a tiresome process. To address this challenge and move towards practical large-scale annotated video datasets, we investigate methods to autonomously learn and adapt object models using temporal information in videos, which involves learning robust representations of the video. The aim of this thesis is two-fold: first, to propose solutions for efficient and accurate object annotation in video sequences, and second, to raise awareness in the community about the importance of this problem and the attention it deserves. As our first contribution, we propose an efficient, scalable and accurate object bounding-box annotation method for large-scale, complex video datasets. We focus on minimising annotation effort while simultaneously increasing the annotation-propagation accuracy, to obtain a precise, tight bounding box around the object of interest. Using a self-training approach, we combine a semi-automatic initialisation method with an energy-minimisation framework to propagate the annotations; using energy minimisation for segmentation yields accurate, tight bounding boxes around the object. We validate the results quantitatively and qualitatively on publicly available datasets. In the second half, we propose an annotation scheme for human pose in video sequences. The proposed model starts from a fully automatic initialisation produced by any generic state-of-the-art method, but this initialisation is prone to error due to the challenges inherent in video data. We exploit the redundant information available across frames and build the model on the assumption of temporal smoothness in videos. We formulate the task as a sequence-to-sequence learning problem whose architecture uses a Long Short-Term Memory encoder-decoder to encode the temporal context and annotate the pose. We show results on state-of-the-art datasets.
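A minimal PyTorch sketch of the sequence-to-sequence formulation is given below: an LSTM encoder summarises the noisy pose sequence and an LSTM decoder, conditioned on that context, emits the corrected poses. The joint count, layer sizes and tensor shapes are illustrative assumptions, not the thesis's architecture.

```python
# Sketch of an LSTM encoder-decoder for pose correction (illustrative sizes).
import torch
import torch.nn as nn

class PoseSeq2Seq(nn.Module):
    def __init__(self, n_joints=14, hidden=256):
        super().__init__()
        d = n_joints * 2                                  # (x, y) per joint
        self.encoder = nn.LSTM(d, hidden, batch_first=True)
        self.decoder = nn.LSTM(d, hidden, batch_first=True)
        self.out = nn.Linear(hidden, d)

    def forward(self, noisy_poses):                       # (B, T, n_joints*2)
        _, state = self.encoder(noisy_poses)              # temporal context
        dec_out, _ = self.decoder(noisy_poses, state)     # condition on context
        return self.out(dec_out)                          # corrected pose sequence

model = PoseSeq2Seq()
corrected = model(torch.randn(2, 30, 28))                 # 2 clips, 30 frames each
```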


Year of completion: June 2018
Advisors: C V Jawahar, Vineeth Balasubramanian

Related Publications

  • Sirnam Swetha, Vineeth N Balasubramanian and C. V. Jawahar - Sequence-to-Sequence Learning for Human Pose Correction in Videos, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017. [PDF]

  • Sirnam Swetha, Anand Mishra, Guruprasad M. Hegde and C. V. Jawahar - Efficient Object Annotation for Surveillance and Automotive Applications - Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshop (WACVW 2016), March 7-9, 2016. [PDF]

  • Rajat Aggarwal, Sirnam Swetha, Anoop M. Namboodiri, Jayanthi Sivaswamy, C. V. Jawahar - Online Handwriting Recognition using Depth Sensors, Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France. [PDF]


Downloads

thesis

Landmark Detection in Retinal Images


Gaurav Mittal (Home Page)

Abstract

Advances in medicine and imaging systems have resulted in a series of devices that sense, record, transform and process digital data. For the human eye, this digital data takes the form of fundus images, which depict the back part of the retina. Automatic analysis of these images is required to process large amounts of data and to help doctors make the final diagnosis. Retinal images have three major visible landmarks: the optic disk (OD), the macula and the blood vessels. In retinal images, the OD appears as a bright elliptical structure, the macula as a small dark region, and the blood vessels as dark, tree-branch-like structures. In this thesis, we propose methods for detecting these retinal landmarks. Accurate detection of the OD and macula is important because computer-assisted diagnosis systems use the locations of these landmarks to understand the retinal image and to exploit clinical facts about the retina for improving diagnosis. Landmark detection also aids in assessing the severity of diseases based on the locations of abnormalities relative to these landmarks. We first used a retinal atlas for OD and macula detection, an idea inspired by 3D brain atlases [34]. We create two retinal atlases, an intensity atlas and a probability atlas, by annotating public datasets locally. We use the probabilistic atlas for OD and macula detection, but the detection rates and accuracy of this system are low. To achieve better detection, we then used Generalized Motion Patterns (GMP) [14][23] for OD and macula detection. A GMP is derived by inducing motion in an image, which smooths out unwanted information while highlighting the structures of interest. Our GMP-based detection is fully unsupervised, outperforms all other unsupervised methods, and produces results comparable to those of supervised methods; the proposed system is completely parallelizable and handles illumination differences efficiently. Blood vessels are another important retinal landmark, and we find that current research uses evaluation measures such as sensitivity, specificity, accuracy, area under the curve and the Matthews correlation coefficient to evaluate vessel segmentation performance. We identify several gaps in these measures and propose local accuracy, an extension of [39]. We show that local accuracy is especially useful in settings where segmentation of weak vessels and accurate estimation of vessel width are required.
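The GMP idea can be illustrated with a simplified sketch: copies of the image are "moved" (here, rotated about the centre) and the stack is coalesced pixel-wise, smearing out motion-variant detail while the structures of interest survive. The choice of motion, angles and coalescing function below are illustrative assumptions, not the exact formulation of [14][23].

```python
# Simplified generalized-motion-pattern sketch (illustrative parameters only).
import numpy as np
from scipy.ndimage import rotate

def gmp(image, angles=range(-10, 11, 2), coalesce=np.max):
    """Coalesce copies of the image rotated about its centre.

    np.max tends to preserve bright structures (e.g. the optic disc) while
    smearing out thin vessels; np.min would instead favour dark structures.
    """
    stack = np.stack([rotate(image, a, reshape=False, mode='nearest')
                      for a in angles], axis=0)
    return coalesce(stack, axis=0)
```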


Year of completion: June 2018
Advisor: Jayanthi Sivaswamy

Related Publications

  • Gaurav Mittal, Jayanthi Sivaswamy - Optic Disk and Macula Detection from Retinal Images using Generalized Motion Pattern, Proceedings of the Fifth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2015), 16-19 Dec 2015, Patna, India. [PDF]


Downloads

thesis

Multimodal Emotion Recognition from Advertisements with Application to Computational Advertising


Abhinav Shukla (Home Page)

Abstract

Advertisements (ads) are often filled with strong affective content covering a gamut of emotions, intended to capture viewer attention and convey an effective message. However, most approaches to computationally analysing the emotion in ads are based on the text modality, and only a limited amount of work has addressed affective understanding of advertisement videos from content-centric and user-centric perspectives. This work brings together recent advances in deep learning (especially visual recognition) and affective computing, and uses them to estimate the affect of advertisements. We first create a dataset of 100 ads, annotated by 5 experts and evenly distributed over the valence-arousal plane. We then perform content-based affect recognition via a transfer learning approach, estimating the affective content of this dataset using prior affective knowledge gained from a large annotated movie dataset. We employ both visual features from video frames and audio features from spectrograms to train our deep neural networks; this approach vastly outperforms the existing benchmark. It is also interesting to see how human physiological signals, such as Electroencephalography (EEG) recordings, provide useful affective insights into the content from a user-centric perspective. Using this time series of the brain's electrical activity, we train models that classify the emotional dimensions of the data. This lets us benchmark the user-centric performance against the content-centric deep learning models; we find that the user-centric models outperform the content-centric ones and set the state of the art in ad affect recognition. We also combine the two kinds of modalities (audiovisual and EEG) using decision fusion and find that the fused performance exceeds either single modality, showing that human physiological signals and the audiovisual content carry complementary affective information. We further use multi-task learning (MTL) on top of the features of each kind to exploit the intrinsic relatedness of the data and boost performance. Lastly, we validate the hypothesis that better affect estimation can enhance a real-world application: we supply the affective values computed by our methods to a computational advertising framework to obtain a video programme sequence with ads inserted at emotionally relevant points, chosen according to the affective relevance between the programme content and the ads. Multiple user studies find that our methods significantly outperform existing algorithms and come close to (and sometimes exceed) human-level performance, achieving much more emotionally relevant and non-disruptive advertisement insertion into a programme video stream.
In summary, this work (1) compiles an affective ad dataset capable of evoking coherent emotions across users; (2) explores the efficacy of content-centric convolutional neural network (CNN) features for affect recognition (AR), finding that CNN features outperform low-level audio-visual descriptors; (3) studies user-centric ad affect recognition from Electroencephalogram (EEG) responses acquired while viewing the content (with conventional classifiers as well as a novel CNN architecture for EEG), which outperform content descriptors; (4) examines a multi-task learning framework based on CNN and EEG features which provides state-of-the-art AR for ads; and (5) demonstrates how better affect predictions facilitate more effective computational advertising in a real-world application.
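As an illustration of the decision-fusion step, the sketch below combines per-modality class probabilities with a weighted sum; the weight and the two-class setup are assumptions made for the example, not values from the thesis.

```python
# Sketch of late (decision-level) fusion of audiovisual and EEG classifier outputs.
import numpy as np

def fuse_decisions(p_audiovisual, p_eeg, alpha=0.5):
    """Weighted late fusion of two modality-wise probability vectors."""
    p = alpha * np.asarray(p_audiovisual) + (1 - alpha) * np.asarray(p_eeg)
    p = p / p.sum()                      # renormalise fused scores
    return p, int(np.argmax(p))          # fused probabilities and predicted class

probs, label = fuse_decisions([0.3, 0.7], [0.6, 0.4], alpha=0.6)
```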


Year of completion: June 2018
Advisor: Ramanathan Subramanian

Related Publications

  • Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli and Ramanathan Subramanian - Evaluating Content-centric vs User-centric Ad Affect Recognition, In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI'17), 2017. [PDF]

  • Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli and Ramanathan Subramanian - Affect Recognition in Ads with Application to Computational Advertising, In Proceedings of the 25th ACM International Conference on Multimedia (ACM MM '17), 2017. [PDF]


Downloads

thesis
