Scene Text Understanding





Recognizing scene text is a challenging problem, even more so than the recognition of scanned documents. Given the rapid growth of camera-based applications readily available on mobile phones, understanding scene text is more important than ever. One could, for instance, foresee an application that answers questions such as, “What does this sign say?”. This is related to the problem of Optical Character Recognition (OCR), which has a long history in the computer vision community. However, the success of OCR systems is largely restricted to text from scanned documents. Scene text exhibits a large variability in appearance, and can prove challenging even for state-of-the-art OCR methods. Many scene understanding methods successfully recognize objects and regions such as roads, trees and sky in an image, but tend to ignore the text on sign boards. Our goal is to fill this gap in scene understanding. Our contributions include:


  • Binarization as a labelling problem (see our ICDAR'11 paper)
  • Both open- and closed-vocabulary word recognition (see our CVPR'12 and BMVC'12 papers)
  • Use of both top-down (lexicon) and bottom-up (character detection) cues; a minimal sketch of this idea follows the list
  • A holistic approach to lexicon-driven scene text recognition (see our ICDAR'13 paper)
  • Applications in image retrieval (see our ICCV'13 paper)
  • Word recognition with large lexicons and cropped-word image retrieval (see our ACCV'14 paper)
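
The interplay of bottom-up and top-down cues can be illustrated with a toy example: per-window character detection scores (bottom-up) are combined with a small lexicon (top-down) by scoring every candidate word against the detections. This is only a minimal sketch in Python under simplified assumptions (one detection window per character, a tiny lexicon), not the CRF/energy-minimization formulation used in our papers.

```python
def recognize_word(char_scores, lexicon):
    """char_scores: list of dicts, one per character window, mapping a
    character to its detection score (higher is better) -- the bottom-up cue.
    lexicon: iterable of candidate words -- the top-down cue."""
    best_word, best_score = None, float("-inf")
    for word in lexicon:
        if len(word) != len(char_scores):   # toy assumption: one window per character
            continue
        score = sum(pos.get(ch, float("-inf")) for pos, ch in zip(char_scores, word))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Example: three character windows and a tiny lexicon.
scores = [{"p": 0.9, "b": 0.4}, {"a": 0.8, "o": 0.5}, {"r": 0.7, "n": 0.6}]
print(recognize_word(scores, ["par", "bar", "pan"]))   # -> "par"
```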


Generating exemplars (based on our ICDAR'13 paper) [README]

Coming Soon:

  • Scene Text Binarization
  • Scene Character Recognition


Lexicons (ACCV '14)

IIIT 5K-word


IIIT Scene Text Retrieval (IIIT STR)

Video Scene Text Retrieval Datasets (Sports-10K and TV series-1M)

Related Publications

  • Anand Mishra, Karteek Alahari and C. V. Jawahar - Enhancing Energy Minimization Framework for Scene Text Recognition with Top-Down Cues, Computer Vision and Image Understanding (CVIU), volume 145, pages 30-42, 2016. [PDF]

  • Udit Roy, Anand Mishra, Karteek Alahari and C. V. Jawahar - Scene Text Recognition and Retrieval for Large Lexicons, Proceedings of the 12th Asian Conference on Computer Vision (ACCV), 01-05 Nov. 2014, Singapore. [PDF] [Abstract] [Poster] [Lexicons] [bibtex]

  • Anand Mishra, Karteek Alahari and C. V. Jawahar - Image Retrieval using Textual Cues, Proceedings of the International Conference on Computer Vision (ICCV), 1-8 Dec. 2013, Sydney, Australia. [PDF] [Abstract] [Project page] [bibtex]

  • Vibhor Goel, Anand Mishra, Karteek Alahari and C. V. Jawahar - Whole is Greater than Sum of Parts: Recognizing Scene Text Words, Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), 25-28 Aug. 2013, Washington DC, USA. [PDF] [Abstract] [bibtex]

  • Anand Mishra, Karteek Alahari and C. V. Jawahar - Scene Text Recognition using Higher Order Language Priors, Proceedings of the British Machine Vision Conference (BMVC), 3-7 Sep. 2012, Guildford, UK. [PDF] [Abstract] [Slides] [bibtex]

  • Anand Mishra, Karteek Alahari and C. V. Jawahar - Top-down and Bottom-up Cues for Scene Text Recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 16-21 June 2012, pp. 2287-2294, Providence, RI, USA. [PDF] [Abstract] [Poster] [bibtex]

  • Anand Mishra, Karteek Alahari and C. V. Jawahar - An MRF Model for Binarization of Natural Scene Text, Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 18-21 Sep. 2011, Beijing, China. [PDF] [Abstract] [Slides] [bibtex]




Anand Mishra is partly supported by the MSR India PhD Fellowship 2012.

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.

Computational Displays


Displays have seen many improvements over the years. With advancements in pixel resolution, color gamut, refresh rate and power consumption, displays have become personal and accessible devices for sharing and consuming information. Advances in human-computer interaction have also enabled novel ways of interacting with displays; touch panels, for example, are now common and provide better interaction with the content shown on a device. Displays today still have many shortcomings, including their rectangular shape, limited color gamut, low dynamic range, lack of focus+context viewing and lack of 3D viewing. Efforts are being made to create better and more natural displays than the 2D flat rectangular screens we see today, and much research has gone into designing better displays, including 3D displays, focus and context displays and HDR displays. Such displays work at the device level and use physical (mechanical, metallurgical, chemical, etc.) means to improve the display. These approaches prove expensive and are hard to scale to the color, refresh-rate and other standards of current displays. An alternative is to use computation to enhance displays. We propose Computational Displays, which employ computation to economically alleviate some of the shortcomings of available displays. Specifically, we give solutions to the walk-around 3D viewing, focus+context and color-resolution problems. Our systems work on top of existing display methodologies and are independent of any specific display technology, which makes them scalable to any display method. In conclusion, we argue that such computation/algorithms should be built into the displays themselves to enhance the visual experience.

View Dependent, Multi-Planar, Walkaround 3D Displays



Displays have remained flat for the most part since their inception. In this work we focus on non-planar displays made out of planar facets, and on rendering 3D scenes on them in a perspective-correct manner. We propose an accurate rendering mechanism using GPU shaders that produces the correct color and depth map on each facet, ensuring consistency across facet boundaries. We compare our results with previous methods such as CAVE and Projective Texture Mapping and conclude that our method is artifact-free and superior in terms of visual quality - a requirement for visualization applications. We also provide a scalable GPU culling algorithm that extends our rendering scheme to any display shape, consisting of over a thousand facets. The proposed pipeline uses commodity GPUs and can handle any type of scene, including mesh deformations, as shown in the video.
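
The per-facet rendering can be illustrated with the standard generalized (off-axis) perspective projection: given the tracked eye position and three corners of a planar facet, one builds a projection matrix that keeps the image consistent across facet boundaries. The NumPy sketch below follows this well-known construction; the function and parameter names are ours, and it is not the paper's GPU shader implementation.

```python
import numpy as np

def facet_projection(eye, pa, pb, pc, near=0.1, far=100.0):
    """Off-axis projection for one planar facet with corners pa (lower-left),
    pb (lower-right) and pc (upper-left), for an eye at position `eye`."""
    eye, pa, pb, pc = (np.asarray(v, dtype=float) for v in (eye, pa, pb, pc))
    vr = pb - pa; vr /= np.linalg.norm(vr)            # facet right axis
    vu = pc - pa; vu /= np.linalg.norm(vu)            # facet up axis
    vn = np.cross(vr, vu); vn /= np.linalg.norm(vn)   # facet normal

    va, vb, vc = pa - eye, pb - eye, pc - eye         # eye -> corner vectors
    d = -np.dot(va, vn)                               # eye-to-facet-plane distance
    l = np.dot(vr, va) * near / d                     # frustum extents on near plane
    r = np.dot(vr, vb) * near / d
    b = np.dot(vu, va) * near / d
    t = np.dot(vu, vc) * near / d

    P = np.array([[2*near/(r-l), 0, (r+l)/(r-l), 0],
                  [0, 2*near/(t-b), (t+b)/(t-b), 0],
                  [0, 0, -(far+near)/(far-near), -2*far*near/(far-near)],
                  [0, 0, -1, 0]])
    M = np.eye(4); M[0, :3], M[1, :3], M[2, :3] = vr, vu, vn  # rotate into facet frame
    T = np.eye(4); T[:3, 3] = -eye                            # move eye to origin
    return P @ M @ T                                          # full facet projection
```

Applying the matrix returned for each facet to the same scene yields views that agree along shared facet edges, which is what keeps the composite display perspectively correct for the tracked viewer.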

Distributed Massive Model Rendering



Graphics models are becoming increasingly bulky, with detailed geometry, textures, normal maps, etc. There is a lot of interest in modelling and navigating through detailed models of large monuments. Many monuments of interest have both rich detail and large spatial extent; rendering them for navigation on a single workstation is practically impossible, even with the power of today's CPUs and GPUs. Many models do not fit in the GPU memory, the CPU memory, or even the secondary storage of a single machine. Distributed rendering using a cluster of workstations is the only way to navigate through such models. We present the design of a distributed rendering system intended for massive models. Our design has a server that holds the skeleton of the whole model, namely its scenegraph with the actual geometry replaced by bounding boxes at all levels. The server divides the screen space among a number of clients and, using a frustum-culling step, sends each client the list of objects it needs to render. Each client uses two GPUs, one devoted to visibility culling and the other to rendering. Frustum culling at the server, visibility culling on one GPU, and rendering on the second GPU form the stages of our distributed rendering pipeline. We describe the design and implementation of our system and demonstrate the results of rendering relatively large models using different clusters of clients. The demonstration video shows interactive rendering of huge scenes, the Coal Powerplant (approx. 96M triangles) and Fatehpur Sikri (approx. 172M triangles) models, on a 4-client setup.
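
A hypothetical sketch of the server-side step is given below: the server holds only a bounding-box scenegraph and, for each client's sub-frustum, collects the object ids that client must render. The Node layout and the plane-based box test are illustrative assumptions, not the system's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Node:
    bbox_min: np.ndarray                  # axis-aligned bounding box of the subtree
    bbox_max: np.ndarray
    children: list = field(default_factory=list)
    object_id: Optional[int] = None       # set only on leaf nodes holding geometry

def box_outside_plane(bmin, bmax, plane):
    """plane = (normal, d); a point x is inside when dot(normal, x) + d >= 0.
    Returns True when the whole box lies outside this plane."""
    n, d = plane
    p = np.where(n >= 0, bmax, bmin)      # box corner farthest along the normal
    return np.dot(n, p) + d < 0

def cull(node, frustum_planes, visible):
    """Depth-first frustum culling over the bounding-box scenegraph."""
    if any(box_outside_plane(node.bbox_min, node.bbox_max, pl) for pl in frustum_planes):
        return                            # whole subtree outside this client's frustum
    if node.object_id is not None:
        visible.append(node.object_id)    # the client should fetch and render this object
    for child in node.children:
        cull(child, frustum_planes, visible)
```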

Increasing Intensity Resolution on Single Displays



Displays have seen many improvements in spatial resolution and vertical refresh rate, providing a smoother visual experience. Color intensity resolution, however, has not changed much: most displays are still limited to 8 bits per channel. At the same time, much work has gone into capturing high dynamic range images, and mapping these directly to current displays loses information that may be critical to many applications. We present a way to enhance the intensity resolution of a given display by mixing intensities over the spatial or temporal domain. Our system sacrifices high vertical refresh rate and spatial resolution in order to gain intensity resolution. We present three ways to mix intensities: spatially, temporally and spatio-temporally. The systems produce in-between intensities not present on the base display that are clearly distinguishable by the naked eye. We evaluate our systems using both a camera and human subjects, verifying that the intensity resolution is indeed scaled and that the newly generated intensities follow the display model.
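
The idea behind temporal mixing can be shown with a toy example: alternating two displayable 8-bit levels over a few frames produces a perceived level in between, assuming the display response is linear (or has been linearized through the display model). The function name and frame count below are illustrative.

```python
def temporal_pattern(target, T=4):
    """Return T consecutive 8-bit frame values whose average approximates
    `target`, a fractional level (e.g. 127.25) the panel cannot show directly."""
    lo, hi = int(target), int(target) + 1
    k = round((target - lo) * T)          # frames shown at the higher level
    return [hi] * k + [lo] * (T - k)

print(temporal_pattern(127.25, T=4))      # -> [128, 127, 127, 127], mean 127.25
```

Spatial mixing works analogously by distributing the two levels over neighbouring pixels instead of consecutive frames, trading spatial resolution rather than refresh rate for the extra intensity levels.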

View Dependent Parametric/Implicit Displays



This work extends view-dependent displays to non-planar parametric shapes. Here the display surface is defined not by planar facets but by a set of parametric or implicit equations. Rendering perspective-correct 3D on such a display requires non-linear rasterization, which no graphics pipeline currently supports. We approximate it using the tessellation hardware provided by Shader Model 5.0 / DirectX 11 GPUs: large triangles are broken on the fly, to a user-defined error threshold, and the resulting vertices are moved using per-vertex raycasting to correct their positions for the observer's eye. This produces a correct perspective view for a head-tracked user. The method is an approximate rendering scheme for any parametric display, but unlike Projective Texture Mapping it does not interpolate pixels. Rendering is orders of magnitude faster than approximating the surface with planes, owing to single-pass rendering and the use of tessellation hardware.
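
The per-vertex correction step can be sketched for a concrete surface: for a spherical display, each tessellated vertex is replaced by the intersection of the eye-to-vertex ray with the sphere, expressed in the surface's parametric coordinates. The choice of a sphere, the parametrization and the names below are illustrative assumptions, not the actual shader code.

```python
import numpy as np

def project_vertex(eye, vertex, R=1.0):
    """Intersect the ray from `eye` through `vertex` with a sphere of radius R
    centred at the origin and return parametric (u, v) coordinates on it."""
    eye, vertex = np.asarray(eye, float), np.asarray(vertex, float)
    d = vertex - eye
    d /= np.linalg.norm(d)
    b = np.dot(eye, d)                            # ray-sphere intersection terms
    disc = b * b - (np.dot(eye, eye) - R * R)
    if disc < 0:
        return None                               # ray misses the display surface
    t = -b - np.sqrt(disc)
    if t < 0:                                     # eye inside the sphere: take far hit
        t = -b + np.sqrt(disc)
    p = eye + t * d
    u = np.arctan2(p[1], p[0])                    # longitude on the sphere
    v = np.arcsin(np.clip(p[2] / R, -1.0, 1.0))   # latitude
    return u, v                                   # mapped to screen space by the display
```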

Garuda: A Scalable, Tiled Display Wall using Commodity PCs



Garuda is a client-server display wall system that employs distributed rendering to render massive 3D environments at interactive frame rates on a tiled display, designed and built entirely from commodity hardware. Features such as client-side caching and a server-push philosophy, in conjunction with UDP multicast, help the system scale to very large tiled configurations. The system uses a novel culling algorithm to determine which parts of the scene fall in which tiles; the algorithm is scalable both in scene complexity and in the number of tiles. The system is built as a library-intercept mechanism over the Open Scene Graph (OSG) API, so any OSG application can be ported to Garuda without modifying, recompiling or relinking the code: the OSG executable runs directly on the tiled display. Rendering capability increases with the number of tiles, as distributed rendering is employed, and no machine in Garuda, server or client, renders the entire environment. Culling is performed at the server and parts of the scene graph are sent to clients over standard Ethernet. Clients cache the geometry and evict it on an LRU basis, exploiting temporal coherence in the scene. Rendering happens at the client end, with the server used to synchronize the buffer swap.
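
Client-side caching with LRU eviction can be sketched as follows; the class and the byte budget are our assumptions, meant only to illustrate how temporal coherence is exploited while navigating the scene.

```python
from collections import OrderedDict

class GeometryCache:
    """Caches scene-graph geometry pushed by the server, evicting the least
    recently used nodes once a memory budget is exceeded."""
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.items = OrderedDict()          # node_id -> (geometry, size_in_bytes)

    def get(self, node_id):
        if node_id in self.items:
            self.items.move_to_end(node_id) # mark as recently used
            return self.items[node_id][0]
        return None                         # miss: geometry must come from the server

    def put(self, node_id, geometry, size):
        self.items[node_id] = (geometry, size)
        self.items.move_to_end(node_id)
        self.used += size
        while self.used > self.budget:      # evict in least-recently-used order
            _, (_, old_size) = self.items.popitem(last=False)
            self.used -= old_size
```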

Related Publications

  • Pawan Harish, Parikshit Sakurikar and P. J. Narayanan - Increasing Intensity Resolution on a Single Display using Spatio-Temporal Mixing, Proceedings of the 8th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 16-19 Dec. 2012, Bombay, India. [PDF]

  • Revanth N R and P. J. Narayanan - Distributed Massive Model Rendering, Proceedings of the 8th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 16-19 Dec. 2012, Bombay, India. [PDF]

  • Pawan Harish and P. J. Narayanan - Designing Perspectively-Correct Multiplanar Displays, IEEE Transactions on Visualization and Computer Graphics, Vol. 99, 2012. [PDF]

  • Nirnimesh, Pawan Harish and P. J. Narayanan - Culling an Object Hierarchy to a Frustum Hierarchy, 5th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), Madurai, India, LNCS 4338, pp. 252-263, 2006. [PDF]

  • Pawan Harish, Parikshit Sakurikar and P. J. Narayanan - Spatio-Temporal Mixing to Increase Intensity Resolution on a Single Display (Poster), CVPR Workshop on Computational Cameras and Displays, 2012. [PDF]

  • Pawan Harish and P. J. Narayanan - View Dependent Rendering to Simple Parametric Display Surfaces (Short Paper), JVRC of EuroVR-EGVE, 2011. [PDF]

  • Pawan Harish and P. J. Narayanan - A View Dependent, Polyhedral 3D Display, ACM SIGGRAPH Virtual Reality Continuum and its Applications in Industry (VRCAI), 2009. [PDF]

  • Nirnimesh, Pawan Harish and P. J. Narayanan - Garuda: A Scalable Tiled Display Wall using Commodity PCs, IEEE Transactions on Visualization and Computer Graphics, 2007. [PDF]

Related Videos

Multiplanar Displays.
Parametric Displays.
Polyhedral Displays.
Garuda Display Wall.


Decomposing Bag of Words Histograms



We aim to decompose a global histogram representation of an image into histograms of its associated objects and regions. This task is formulated as an optimization problem, given a set of linear classifiers, which can effectively discriminate the object categories present in the image. Our decomposition bypasses harder problems associated with accurately localizing and segmenting objects. We evaluate our method on a wide variety of composite histograms, and also compare it with MRF-based solutions. In addition to merely measuring the accuracy of decomposition, we also show the utility of the estimated object and background histograms for the task of image classification on the PASCAL VOC 2007 dataset.


Associated People

Ankit Gandhi

Karteek Alahari

C. V. Jawahar


An SVM classifier is often trained to recognize only a single class category. When multiple objects (or uncorrelated noise) are present in an image, its performance deteriorates. To better understand this issue, consider a split of the PASCAL VOC 2007 test data into images containing a single class category (PASCAL-S) and multiple class categories (PASCAL-M). In this setting, the average precision (AP) of the BoW-trained SVM classifier for the category “cat” is 0.589 on PASCAL-S, but only 0.189 on PASCAL-M. It has also been observed that BoW histograms of single isolated objects are relatively easy to classify: accuracies as high as 77.78% have been reported on the Caltech 101 dataset, while more complex images containing multiple objects and natural clutter are harder to work with (e.g., 62.8% is still the best score on PASCAL VOC 2007). An important reason for this deterioration in performance is that a classifier trained on single objects often fails to recognize the object when the global image representation (BoW) is “corrupted” by additional objects and clutter present in the image. A question of interest to us now is the following: is it possible to filter out the clutter and classify only the signal?
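
As a simplified stand-in for the optimization used in the paper, the sketch below decomposes a global BoW histogram into non-negative contributions of per-category "prototype" histograms (e.g. class means) via non-negative least squares. The actual method works with linear classifiers rather than prototypes; this only illustrates the idea of decomposing the histogram without localizing or segmenting objects.

```python
import numpy as np
from scipy.optimize import nnls

def decompose(global_hist, prototypes):
    """global_hist: (V,) BoW histogram of an image.
    prototypes: dict mapping a category name to a (V,) prototype histogram.
    Returns estimated per-category histograms whose sum approximates the input."""
    names = list(prototypes)
    D = np.stack([prototypes[n] for n in names], axis=1)  # V x C dictionary
    weights, _ = nnls(D, global_hist)                     # non-negative mixing weights
    return {n: w * prototypes[n] for n, w in zip(names, weights)}

# Usage idea: classify each recovered per-category histogram on its own,
# instead of the clutter-corrupted global histogram.
```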


Related Publications

  • Ankit Gandhi, Karteek Alahari and C. V. Jawahar - Decomposing Bag of Words Histograms, Proceedings of the International Conference on Computer Vision (ICCV), 1-8 Dec. 2013, Sydney, Australia. [PDF] [Poster] [bibtex]





Image Annotation


In many real-life scenarios, an object can be categorized into multiple categories. For example, a newspaper column can be tagged as "political", "election" and "democracy"; an image may contain "tiger", "grass" and "river"; and so on. These are instances of multi-label classification, which deals with the task of associating multiple labels with a single data sample. It is a difficult problem because one needs to consider the intricate correlations that exist among different labels.

Automatic image annotation is a multi-label classification problem that aims at associating a set of textual labels with an image to describe its semantics. It has potential applications in image retrieval, image description, etc. The recent outburst of multimedia content on the Internet and in personal collections has raised the demand for auto-annotation methods, making this an active area of research.

A Modified KNN for Image Annotation [1]

Example annotations (one label set per image):

  • {bear, reflection, water, black, river}
  • {field, horses, mare, foals, tree}
  • {green, phone, woman, hair, suit}
  • {fight, grass, game, anime, man}
  • {building, base, horse, statue, man}
  • {fence, mountain, range, airplane, sky}


  • For a given image, the labels are usually predicted from an annotation vocabulary of a few hundred labels. Because of the large vocabulary, there is high variance in label frequency ("class imbalance"). Moreover, due to the limitations of manual annotation, a significant number of available images are not annotated with all the relevant labels ("weak labelling"). These two issues affect the performance of many existing image annotation models.
  • In this work, we proposed 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, that tries to address these two issues (a simplified sketch is given after this list). We also proposed a metric-learning framework over 2PKNN for learning better distances.
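
A simplified Python sketch of the two-pass idea: pass 1 builds a balanced "semantic neighbourhood" by taking the K1 nearest training images per label (addressing class imbalance), and pass 2 transfers labels from that neighbourhood with distance-based weights. The variable names and the exponential weighting are illustrative assumptions, not the exact formulation or the learned metric from the paper.

```python
import numpy as np

def annotate(query, features, labels, vocab, K1=3, top=5, beta=1.0):
    """features: (N, D) training features; labels: list of N label sets;
    vocab: list of all labels; query: (D,) feature of the test image."""
    dists = np.linalg.norm(features - query, axis=1)
    # Pass 1: per-label nearest neighbours form the semantic neighbourhood.
    neighbourhood = set()
    for lab in vocab:
        idx = [i for i, ls in enumerate(labels) if lab in ls]
        idx.sort(key=lambda i: dists[i])
        neighbourhood.update(idx[:K1])
    # Pass 2: weighted label transfer from the neighbourhood.
    scores = {lab: 0.0 for lab in vocab}
    for i in neighbourhood:
        w = np.exp(-beta * dists[i])
        for lab in labels[i]:
            scores[lab] += w
    return sorted(scores, key=scores.get, reverse=True)[:top]
```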

Generating Image Description [2]





Example generated descriptions (two candidate sentences per image):

  • "A black ferrari is parked in front of a green tree." / "A sporty car is parked on a concrete driveway."
  • "An adult hound is laying on an orange couch." / "A sweet cat is curling on a pink blanket."
  • "A blond woman is posing with an elvis impersonator." / "An orange fixture is hanging in a messy kitchen."
  • "A small sailboat is passing near a yellow buoy." / "An ocean boat is travelling in a narrow water."


  • In this work, we proposed a method to describe an image in a sentence.
  • It is based on annotating an image with linguistically motivated phrases.
  • These phrases are combined to generate the image description (a toy sketch of this template idea follows the list).
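
A toy illustration of the template idea: predicted attribute-noun and verb/preposition phrases are slotted into a fixed sentence pattern. The real system's phrase prediction and sentence generation are not reproduced here; the function and template are our assumptions.

```python
def describe(subj_attr, subj_noun, verb, prep, obj_attr, obj_noun):
    """Combine linguistically motivated phrases into a single sentence."""
    article = "An" if subj_attr[0].lower() in "aeiou" else "A"
    return f"{article} {subj_attr} {subj_noun} is {verb} {prep} a {obj_attr} {obj_noun}."

print(describe("black", "ferrari", "parked", "in front of", "green", "tree"))
# -> "A black ferrari is parked in front of a green tree."
```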

Related Publications

  • Yashaswi Verma and C. V. Jawahar - Image Annotation using Metric Learning in Semantic Neighbourhoods, Proceedings of the 12th European Conference on Computer Vision (ECCV), 7-13 Oct. 2012, LNCS 7574, Part III, pp. 114-128, Print ISBN 978-3-642-33711-6, Firenze, Italy. [PDF]


Associated People

Yashaswi Verma

C. V. Jawahar

Action Recognition using Canonical Correlation Kernels


Action recognition has gained significant attention from the computer vision community in recent years. It is a challenging problem, mainly due to significant camera motion, viewpoint transitions, varying illumination conditions and cluttered backgrounds in videos. A wide spectrum of features and representations has been used for action recognition in the past. Recent advances are propelled by (i) the use of local as well as global features, which have significantly helped in object and scene recognition, computed over 2D frames or over the 3D video volume, and (ii) the use of factorization techniques over video-volume tensors and similarity measures defined over the resulting lower-dimensional factors. In this project, we try to take advantage of both these approaches by defining a canonical correlation kernel that is computed from a tensor representation of the videos. This also enables seamless feature fusion by combining multiple feature kernels.


[Example: "volleyball spiking" action from the YouTube dataset]




Canonical Correlation Kernel (CCK)

We represent a video as a 3D tensor, which is then flattened into three 2D tensors/matrices. The CCK between two videos, which is based on canonical correlation analysis (CCA) and its kernelized version (KCCA), is defined as the sum of correlations between the corresponding flattened matrices, obtained from both CCA and KCCA. An overview of the computation of CCK is given below.

[Flow chart: overview of the CCK computation]
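
A minimal NumPy sketch of the CCA part of the kernel: each video tensor is flattened along its three modes, each flattening is reduced to an orthonormal basis, and the canonical correlations between two videos are the singular values of the product of the corresponding bases; the kernel sums the top correlations over the three modes. It assumes both videos share the same (T, H, W) size and omits the KCCA term used in the full CCK.

```python
import numpy as np

def flatten_modes(video):
    """video: (T, H, W) array -> its three mode flattenings as 2D matrices."""
    T, H, W = video.shape
    return [video.reshape(T, H * W),                     # mode-1 (time)
            video.transpose(1, 0, 2).reshape(H, T * W),  # mode-2 (rows)
            video.transpose(2, 0, 1).reshape(W, T * H)]  # mode-3 (columns)

def cck(video_a, video_b, rank=5):
    """Sum of the top canonical correlations over the three flattenings."""
    total = 0.0
    for A, B in zip(flatten_modes(video_a), flatten_modes(video_b)):
        Qa, _ = np.linalg.qr(A.T)           # orthonormal basis of A's row space
        Qb, _ = np.linalg.qr(B.T)
        corr = np.linalg.svd(Qa.T @ Qb, compute_uv=False)  # canonical correlations
        total += corr[:rank].sum()
    return total
```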



We tested CCK on four popular action recognition datasets: Cambridge, UCF, KTH and YouTube. CCK kernels are computed over intensity values, HOG, HOF, SIFT and MBH features. Different CCK feature kernels are combined using a simple weighting scheme, and all kernels are used with an SVM classifier in a one-vs-rest setting (a sketch of this fusion step is given below).
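
The fusion step can be sketched as a weighted sum of precomputed kernel matrices fed to a one-vs-rest SVM with a precomputed kernel; the weights and function names below are illustrative, not the values used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def combine_kernels(kernels, weights):
    """Weighted sum of precomputed (CCK) kernel matrices, one per feature."""
    return sum(w * K for w, K in zip(weights, kernels))

def train_and_predict(K_train, K_test, y, weights):
    """K_train: list of (N, N) train kernels; K_test: list of (M, N) test-vs-train
    kernels; y: (N,) action labels."""
    clf = OneVsRestClassifier(SVC(kernel="precomputed"))
    clf.fit(combine_kernels(K_train, weights), y)
    return clf.predict(combine_kernels(K_test, weights))
```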


Feature    Cambridge (CCK / DT)   UCF (CCK / DT)   KTH (CCK / DT)   YouTube (CCK / DT)
Pixels     93.1 / -               93.5 / -         97.5 / -         82.5 / -
HOG        89.0 / -               83.8 / 83.8      98.3 / 86.5      83.2 / 74.5
SIFT       95.1 / -               85.7 / -         98.6 / -         79.1 / -
HOF        95.2 / -               81.5 / 77.6      94.3 / 93.2      80.4 / 72.8
MBH        75.1 / -               80.4 / 84.8      98.9 / 95.0      80.1 / 83.9
Combined   96.4 / -               93.5 / 88.2      98.9 / 94.2      86.3 / 84.2

Comparison of CCK with DT (Dense trajectories) over different features.


CCK is also compared with other methods: TCCA [CVPR 2007], Product Manifold [CVPR 2010], Tangent Bundle [FG 2011], Dense Trajectories [CVPR 2011], Le et al. [CVPR 2011], Ikizler-Cinbis et al. [ECCV 2010] and Jiang Wang et al. [CVPR 2011], against the proposed approach using pixel values alone, using multiple features, and using combined CCK and DT feature kernels.

Related Publications

  • G. Nagendar, Sai Ganesh, Mahesh Goud and C. V. Jawahar - Action Recognition using Canonical Correlation Kernels, Proceedings of the 11th Asian Conference on Computer Vision (ACCV), 5-9 Nov. 2012, Daejeon, Korea. [PDF]