Matching Handwritten Document Images


Abstract

We address the problem of predicting similarity between a pair of handwritten document images written by different individuals. This has applications in matching and mining image collections containing handwritten content. A similarity score is computed by detecting patterns of text re-use between document images, irrespective of minor variations in word morphology, word ordering, layout and paraphrasing of the content. Our method does not depend on an accurate segmentation of words and lines. We formulate the document matching problem as a structured comparison of the word distributions across two document images. To match two word images, we propose a convolutional neural network based feature descriptor. The performance of this representation surpasses the state of the art on handwritten word spotting. Finally, we demonstrate the applicability of our method on a practical problem of matching handwritten assignments.


Problem Statement

In this work, we compute a similarity score by detecting patterns of text re-use across documents written by different individuals, irrespective of minor variations in word forms, word ordering, layout or paraphrasing of the content.

 

Motivation (figure)

 

 

Major Contributions:

  • To address the lack of training data for handwritten word images, we build a synthetic handwritten dataset of 9 million word images, referred to as the IIIT-HWS dataset.
  • We report a 56% error reduction in the word spotting task on the challenging IAM dataset and on pages from the George Washington collection.
  • We propose a normalized feature representation for word images which is invariant to the different inflectional endings or suffixes present in words.
  • We demonstrate two immediate applications: (i) searching handwritten text in instructional videos, and (ii) comparing handwritten assignments.

IIIT-HWS dataset

 

IIIT HWS

Generating synthetic images is the art of emulating the natural image formation process as closely as possible. In this work, we exploit such a framework for data generation in the handwritten domain: we render synthetic data using open-source fonts and incorporate data augmentation schemes. As part of this work, we release a corpus of 9 million synthetic handwritten word images, which can be used for training deep network architectures and advancing performance on handwritten word spotting and recognition tasks.
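As a rough illustration of this kind of rendering pipeline (not the exact IIIT-HWS procedure), the following Python sketch draws a word with a randomly chosen font and applies simple augmentations; the font paths, sizes and augmentation ranges are assumptions.

import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

# Hypothetical font paths; any handwriting-style TrueType fonts would do.
FONTS = ["fonts/handwriting_1.ttf", "fonts/handwriting_2.ttf"]

def render_word(word):
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(40, 64))
    left, top, right, bottom = font.getbbox(word)
    pad = random.randint(5, 15)
    img = Image.new("L", (right - left + 2 * pad, bottom - top + 2 * pad), color=255)
    ImageDraw.Draw(img).text((pad - left, pad - top), word, font=font, fill=0)
    # Simple augmentations: a small rotation and an occasional blur.
    img = img.rotate(random.uniform(-5, 5), expand=True, fillcolor=255)
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))
    return img

render_word("handwriting").save("sample_word.png")

A real pipeline would additionally vary stroke width, slant, ink colour and background texture, and would sample words from a large vocabulary.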

Download link:

Description | Download Link | File Size
Readme file | Readme | 4.0 KB
IIIT-HWS image corpus | IIIT-HWS | 32 GB
Ground truth files | GroundTruth | 229 MB

 

 

Please cite the paper below if you use the dataset.

     

    HWNet Architecture

    (HWNet architecture figure)
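    The page shows the HWNet architecture only as a figure. As an illustration of the general idea, here is a minimal PyTorch sketch of a CNN that maps a grayscale word image to a fixed-length, L2-normalised descriptor; the layer sizes and embedding dimension are placeholders, not the published HWNet configuration.

import torch
import torch.nn as nn

class WordImageEmbedder(nn.Module):
    # Illustrative word-image embedder; layer sizes are assumptions, not HWNet's.
    def __init__(self, embed_dim=2048):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 12)),   # fixed-size map for variable-width words
        )
        self.embed = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 4 * 12, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x):
        # x: (batch, 1, H, W) grayscale word crops; output is L2-normalised.
        z = self.embed(self.features(x))
        return nn.functional.normalize(z, dim=1)

# Example: descriptors for a batch of 48x128 word crops -> shape (2, 2048).
emb = WordImageEmbedder()(torch.rand(2, 1, 48, 128))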

     

     

     

    Measure of document similarity (MODS)

     

    (MODS pipeline figure)
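    As a simplified illustration of the underlying idea (matching word descriptors across two pages and aggregating the matches), the sketch below computes a document-level score from cosine similarities of word embeddings. This is not the exact MODS formulation from the paper; the matching threshold and the random descriptors are assumptions.

import numpy as np

def document_similarity(desc_a, desc_b, thresh=0.8):
    """desc_a: (n, d) and desc_b: (m, d) L2-normalised word descriptors of two pages."""
    sims = desc_a @ desc_b.T              # cosine similarities, shape (n, m)
    best = sims.max(axis=1)               # best match in B for each word of A
    matched = best >= thresh              # count only confident matches
    return float(matched.mean()) if len(best) else 0.0

# Toy usage with random unit vectors as stand-ins for word descriptors.
a = np.random.randn(120, 2048); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = np.random.randn(150, 2048); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(document_similarity(a, b))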

     


    Datasets and Codes

    Please contact author


    Related Papers

    • Praveen Krishnan and C.V. Jawahar - Matching Handwritten Document Images, The 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 2016. [PDF]

     


    Contact

     

    Currency Recognition on Mobile Phones



    Abstract

    In this paper, we present an application for recognizing currency bills using computer vision techniques that can run on a low-end smartphone. The application runs on the device without the need for any remote server. It is intended for robust, practical use by the visually impaired. Though we use paper bills of the Indian National Rupee as a working example, our method is generic and scalable to multiple domains, including those beyond currency bills. Our solution uses a visual Bag of Words (BoW) based method for recognition. To enable robust recognition in a cluttered environment, we first segment the bill from the background using an algorithm based on iterative graph cuts. We formulate the recognition problem as an instance retrieval task. This is an example of fine-grained instance retrieval that can run on mobile devices. We evaluate the performance on a set of images captured in diverse natural environments, and report an accuracy of 96.7% on 2584 images.
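    The sketch below illustrates such a pipeline with standard OpenCV components: GrabCut-based segmentation followed by a SIFT bag-of-words histogram and nearest-neighbour retrieval. The visual vocabulary and reference histograms (vocab.npy, refs.npy, labels.npy) are assumed to be precomputed offline, and all parameters are illustrative rather than those used in the paper.

import cv2
import numpy as np

def segment_bill(img):
    # Iterative graph-cut (GrabCut) segmentation initialised with a rough central box.
    mask = np.zeros(img.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    h, w = img.shape[:2]
    rect = (int(0.05 * w), int(0.05 * h), int(0.9 * w), int(0.9 * h))
    cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
    return cv2.bitwise_and(img, img, mask=fg)

def bow_histogram(img, vocab):
    # SIFT descriptors quantised against a precomputed visual vocabulary.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    if desc is None:
        return np.zeros(len(vocab))
    dists = np.linalg.norm(desc[:, None, :] - vocab[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(vocab)).astype(float)
    return hist / (hist.sum() + 1e-8)

vocab = np.load("vocab.npy")                       # (k, 128) visual words, assumed precomputed
refs, labels = np.load("refs.npy"), np.load("labels.npy")
query = bow_histogram(segment_bill(cv2.imread("bill.jpg")), vocab)
print("Predicted denomination:", labels[np.linalg.norm(refs - query, axis=1).argmin()])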


    Downloads and links

    Using the App:

    1. Install the app using the APK.
    2. On first use, wait for some time while the app copies its files.

    Demo:


    Publication

    • Suriya Singh, Shushman Choudhury, Kumar Vishal and C.V. Jawahar - Currency Recognition on Mobile Phones, Proceedings of the 22nd International Conference on Pattern Recognition, 24-28 Aug 2014, Stockholm, Sweden. [PDF]

     

    CVPR 2016 Paper - First Person Action Recognition Using Deep Learned Descriptors


    Abstract

    We focus on the problem of wearer's action recognition in first-person, a.k.a. egocentric, videos. This problem is more challenging than third-person activity recognition due to the unavailability of the wearer's pose and the sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hand and object cues have been shown to be successful for limited targeted datasets. We propose convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion and a saliency map. It is compact and can be trained from the relatively small labeled egocentric video datasets that are available. We show that the proposed network can generalize and gives state-of-the-art performance on various disparate egocentric action datasets.

    Method

    2D/3D network architecture and fusion architecture (figures)
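    As a rough illustration of the fusion idea only (not the published architecture), the following PyTorch sketch fuses two cue streams, e.g. an appearance/saliency stream and a head-motion/flow stream, by concatenating their pooled features before classification; the backbones and dimensions are placeholders.

import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    # Late fusion of two cue streams; layer sizes are illustrative assumptions.
    def __init__(self, num_actions):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.spatial, self.motion = stream(), stream()
        self.classifier = nn.Linear(64 * 2, num_actions)   # concatenate, then classify

    def forward(self, rgb, flow):
        fused = torch.cat([self.spatial(rgb), self.motion(flow)], dim=1)
        return self.classifier(fused)

# Example: one RGB-like and one flow-like input per clip -> action logits.
logits = TwoStreamFusion(num_actions=10)(torch.rand(4, 3, 112, 112), torch.rand(4, 3, 112, 112))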

    Paper

    PDF

    Downloads

    Code

    Datasets and annotations

    * Note: All videos are processed at 15 fps.

     


     
     

    PR 2016 Paper - Trajectory Aligned Features For First Person Action Recognition


    Abstract

    Egocentric videos are characterised by their first-person view. With the popularity of Google Glass and GoPro, the use of egocentric videos is on the rise. Recognizing the action of the wearer from egocentric videos is an important problem. Unstructured movement of the camera due to the natural head motion of the wearer causes sharp changes in the visual field of the egocentric camera, causing many standard third-person action recognition techniques to perform poorly on such videos. Objects present in the scene and hand gestures of the wearer are the most important cues for first-person action recognition, but they are difficult to segment and recognize in an egocentric video. We propose a novel representation of first-person actions derived from feature trajectories. The features are simple to compute using standard point tracking and, unlike many previous approaches, do not assume segmentation of hands/objects or recognition of object or hand pose. We train a bag-of-words classifier with the proposed features and report a performance improvement of more than 11% on publicly available datasets. Although not designed for the particular case, we show that our technique can also recognize the wearer's actions when hands or objects are not visible.
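    A minimal sketch of trajectory-style features is given below: points are tracked with pyramidal Lucas-Kanade optical flow and the normalised frame-to-frame displacements are stacked into a descriptor. The tracking parameters and trajectory length are assumptions for illustration, not the settings used in the paper.

import cv2
import numpy as np

def trajectory_features(frames, track_len=15, max_pts=200):
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    pts = cv2.goodFeaturesToTrack(gray[0], maxCorners=max_pts, qualityLevel=0.01, minDistance=5)
    tracks = [pts]                                   # list of (N, 1, 2) float32 point arrays
    for prev, cur in zip(gray, gray[1:track_len + 1]):
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, tracks[-1], None)
        # A full implementation would drop points with status == 0 (lost tracks).
        tracks.append(nxt)
    traj = np.concatenate(tracks, axis=1)            # (N, track_len+1, 2) point positions
    disp = np.diff(traj, axis=1)                     # frame-to-frame displacements
    scale = np.linalg.norm(disp, axis=2).sum(axis=1, keepdims=True)[..., None] + 1e-8
    return (disp / scale).reshape(len(disp), -1)     # one descriptor per trajectory

Each returned descriptor can then be quantised into a bag-of-words histogram and fed to a standard classifier such as an SVM.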

    Paper

    PDF

    Downloads

    Code

    Dataset and Annotations

     


     
     

    NCVPRIPG 2015 Paper - Generic Action Recognition from Egocentric Videos


    Abstract

    Egocentric cameras are wearable cameras mounted on a person's head or shoulder. With their first-person view, such cameras are spawning a new set of exciting applications in computer vision. Recognising the activity of the wearer from an egocentric video is an important but challenging problem. The task is made especially difficult by the unavailability of the wearer's pose as well as the extreme camera shake due to the motion of the wearer's head. Solutions suggested so far have focussed either on short-term actions such as pour and stir, or on long-term activities such as walking and driving. The features used for the two styles are very different, and a technique developed for one style often fails miserably on the other. In this paper we propose a technique to identify whether a long-term or a short-term action is present in an egocentric video segment. This allows us to build a generic first-person action recognition system in which we can recognise both short-term and long-term actions of the wearer. We report an accuracy of 90.15% for our classifier on a publicly available egocentric video dataset comprising 18 hours of video, amounting to 1.9 million tested samples.

    Paper

    PDF

     

    Associated People

    Panoramic Stereo Videos Using a Single Camera


    Abstract

    We present a practical solution for generating 360° stereo panoramic videos using a single camera. Current approaches either use a moving camera that captures multiple images of a scene, which are then stitched together to form the final panorama, or use multiple synchronized cameras. A moving camera limits the solution to static scenes, while multi-camera solutions require dedicated calibrated setups. Our approach improves upon the existing solutions in two significant ways: it solves the problem using a single camera, thus minimizing the calibration problem and providing the ability to convert any digital camera into a panoramic stereo capture device; and it captures all the light rays required for stereo panoramas in a single frame using a compact, custom-designed mirror, thus making the design practical to manufacture and easy to use. We analyze several properties of the design and present panoramic stereo and depth estimation results.


    Primary Challenges

    • To capture all the light rays corresponding to both eyes' views without causing blind spots or occlusions in the panoramas created.
    • To design an optical system which is not bulky, easy to calibrate and use, as well as simple to manufacture.
    • To be able to capture 360° stereo panoramas using a single digital camera for immersive human experience.
    • To be able to perceive depth correctly from the generated stereo panoramas.

    Major Contributions

    • We propose a custom-designed mirror surface, which we call the "coffee-filter mirror", for generating 360° stereo panoramas. Our optical system has the following advantages over other stereo panoramic devices: simplicity of data acquisition, ease of calibration and post-processing, and adaptability to various applications.
    • We have optimised the surface equations of the mirror, and calibrated it to avoid visual misperceptions in 3D such as virtual parallax or misalignments.
    • Our design is easy to manufacture and its size can be scaled up or down according to the application. The resolution of the created panoramas improves with the sensor resolution.
    • While designed with human consumption in mind, the stereo pairs can also be used for depth estimation.

    Datasets

    We used POV-Ray, a freely available ray-tracing software package that accurately simulates imaging by tracing rays through a given scene. We used the 3D scene datasets listed below to demonstrate how the proposed mirror is used to create stereo panoramas.

    The datasets used for the simulation can be downloaded from the following links:

    Please mail us at our @research.iiit.ac.in addresses for any queries.


    Results

    Red-cyan anaglyph panoramas of the Office, Patio and TRAVIESO scenes, obtained using the proposed setup with the POV-Ray datasets (figures).
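    For reference, a red-cyan anaglyph such as those above can be composed from a left/right panorama pair by taking the red channel from the left view and the green and blue channels from the right. The minimal sketch below assumes the two panoramas have already been rendered to files with the placeholder names shown.

import cv2

left = cv2.imread("pano_left.png")       # left-eye panorama (BGR)
right = cv2.imread("pano_right.png")     # right-eye panorama (BGR)
anaglyph = right.copy()
anaglyph[:, :, 2] = left[:, :, 2]        # OpenCV stores BGR, so index 2 is the red channel
cv2.imwrite("anaglyph.png", anaglyph)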

     

    360° stereo view of the Patio scene captured using the coffee-filter mirror. The scene can be viewed using any HMD. More videos will be added soon.

     

    Comparison of the reconstructed depth map obtained using the proposed setup (b) with the ground-truth depth map (a) (figure).


     

    Please visit PanoStereo for more videos and results.


    Related Publications

    Rajat Aggarwal*, Amrisha Vohra*, Anoop M. Namboodiri - Panoramic Stereo Videos Using A Single Camera, IEEE Conference on Computer Vision & Pattern Recognition (CVPR), 26 June - 1 July 2016. [PDF]


    Associated People

     

    * Equal Contribution

     

    Face Fiducial Detection by Consensus of Exemplars



    Abstract

    Facial fiducial detection is a challenging problem for several reasons, such as varying pose, appearance, expression and partial occlusion. In the past, several approaches such as mixtures of trees, regression-based methods and exemplar-based methods have been proposed to tackle this challenge. In this paper, we propose an exemplar-based approach to select the best solution from among the outputs of regression and mixture-of-trees based algorithms (which we call candidate algorithms). We show that by using a very simple SIFT and HOG based descriptor, it is possible to identify the most accurate fiducial outputs from a set of results produced by candidate algorithms on any given test image. Our approach manifests as two algorithms, one based on optimizing an objective function with quadratic terms and the other based on simple kNN. Both algorithms take as input the fiducial locations produced by running state-of-the-art candidate algorithms on an input image, and output accurate fiducials using a set of automatically selected exemplar images with annotations. Our surprising result is that, in this case, a simple algorithm like kNN is able to take advantage of the seemingly huge complementarity of these candidate algorithms better than optimization-based algorithms. We do extensive experiments on several datasets, and show that our approach consistently outperforms the state of the art. In some cases, we report as much as a 10% improvement in accuracy. We also extensively analyze each component of our approach to illustrate its efficacy.


    CONTRIBUTIONS

    • Our approach poses fiducial detection as a classification problem of differentiating between the best and the rest among the fiducial detection outputs of state-of-the-art algorithms. To our knowledge, this is the first time such an approach has been attempted.
    • Since we only focus on selecting from a variety of candidate solutions, our pre-processing routine can generate outputs corresponding to a variety of face detector initializations, rendering our algorithm insensitive to initialization, unlike other approaches.
    • Combining approaches better geared for sub-pixel accuracy with algorithms designed for robustness leads to our approach outperforming the state of the art in both accuracy and robustness.

    Method

    (Method overview figure)
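    As a rough illustration of the kNN-based selection (one of the two algorithms described above, not the paper's exact implementation), the sketch below describes each candidate's fiducial configuration by HOG patches around its predicted points and picks the candidate whose descriptor is closest to the exemplar set. The patch size, HOG parameters and the exemplar descriptor matrix are assumptions for illustration.

import numpy as np
from skimage.feature import hog

def fiducial_descriptor(gray, points, patch=32):
    # Concatenate HOG features of patches centred on each predicted fiducial.
    feats = []
    for x, y in points.astype(int):
        x0, y0 = max(x - patch // 2, 0), max(y - patch // 2, 0)
        crop = gray[y0:y0 + patch, x0:x0 + patch]
        crop = np.pad(crop, ((0, patch - crop.shape[0]), (0, patch - crop.shape[1])))
        feats.append(hog(crop, pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.concatenate(feats)

def select_candidate(gray, candidates, exemplar_descs, k=5):
    """candidates: list of (n_fiducials, 2) arrays, one per candidate algorithm.
    exemplar_descs: (n_exemplars, d) descriptors of annotated exemplars (assumed given)."""
    scores = []
    for pts in candidates:
        d = fiducial_descriptor(gray, pts)
        dists = np.linalg.norm(exemplar_descs - d, axis=1)
        scores.append(np.sort(dists)[:k].mean())     # mean distance to k nearest exemplars
    return candidates[int(np.argmin(scores))]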

     


    Code and Dataset

    Code.

    We evaluate our algorithms on three state-of-the-art datasets: LFPW, COFW and AFLW.

    In case of queries/doubts, please contact the authors.


    Related Publications


    Results

    Results and failure-rate graphs on LFPW, COFW and AFLW (figures)

    Associated People