
Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure


Abstract

Inspecting and assessing the quality of traffic infrastructure (such as the state of signboards or road markings) is challenging for humans due to (i) the massive length of roads that countries have and (ii) the regular frequency at which this needs to be done. In this paper, we demonstrate a scalable system that uses computer vision for automatic inspection of road infrastructure from a simple video captured from a moving vehicle. We validated our method on 1500 km of roads captured in and around the city of Hyderabad, India. Qualitative and quantitative results demonstrate the feasibility, scalability and effectiveness of our solution.

 


Related Publications

  • Raghava Modhugu, Ranjith Reddy and C. V. Jawahar - Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure, National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2019 [pdf]

Dataset

We release a traffic sign board dataset of Indian roads with 51 different classes. [Dataset]

Improving Word Recognition using Multiple Hypotheses and Deep Embeddings


Siddhant Bansal   Praveen Krishnan   C.V. Jawahar  

ICPR 2020

We propose to fuse recognition-based and recognition-free approaches for word recognition using learning-based methods. For this purpose, results obtained using a text recognizer and deep embeddings (generated using an End2End network) are fused. To further improve the embeddings, we propose EmbedNet, which uses a triplet loss for training and learns an embedding space where the embedding of a word image lies closer to the embedding of its corresponding text transcription. This updated embedding space helps in choosing the correct prediction with higher confidence. To further improve the accuracy, we propose a plug-and-play module called the Confidence based Accuracy Booster (CAB). It takes the confidence scores obtained from the text recognizer and the Euclidean distances between the embeddings, and generates an updated distance vector. This vector has lower distance values for the correct words and higher distance values for the incorrect words. We evaluate our proposed method on a collection of books in the Hindi language, where it achieves an absolute improvement of around 10% in word recognition accuracy.
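The exact EmbedNet architecture is specified in the paper; the PyTorch sketch below is only a minimal illustration of the idea of a triplet-loss-trained projection, with assumed (not paper-specified) layer sizes, embedding dimension and margin.

import torch
import torch.nn as nn

class EmbedNet(nn.Module):
    """Illustrative projection network; layer sizes are placeholder assumptions."""
    def __init__(self, in_dim=2048, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, x):
        # L2-normalise so that Euclidean distances are comparable across samples
        return nn.functional.normalize(self.net(x), dim=-1)

embed_net = EmbedNet()
triplet_loss = nn.TripletMarginLoss(margin=0.2)  # margin value is an assumption

# anchor: word-image embedding; positive: its transcription's embedding;
# negative: embedding of an incorrect transcription (all produced by the E2E network)
anchor, positive, negative = torch.randn(32, 2048), torch.randn(32, 2048), torch.randn(32, 2048)
loss = triplet_loss(embed_net(anchor), embed_net(positive), embed_net(negative))
loss.backward()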

For generating the textual transcription, we pass the word image through the CRNN and the End2End (E2E) network simultaneously. The CRNN generates multiple (K) textual transcriptions for the input image, whereas the E2E network generates the word image's embedding. The K textual transcriptions generated by the CRNN are also passed through the E2E network to generate their embeddings. We then pass these embeddings through the EmbedNet proposed in this work. The EmbedNet projects the input embeddings into an updated Euclidean space, giving us an updated word image embedding and updated embeddings for the K transcriptions. We calculate the Euclidean distance between the word image embedding and each of the K transcription embeddings, and pass these distance values through the novel Confidence based Accuracy Booster (CAB), which combines them with the confidence scores from the CRNN to generate an updated list of Euclidean distances used to select the correct prediction.
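At inference time, the selection step can be pictured roughly as follows; the way confidences and distances are combined here is a placeholder, not the exact CAB formulation from the paper.

import torch

def select_hypothesis(word_emb, hyp_embs, crnn_confidences):
    """word_emb: (d,) image embedding after EmbedNet.
    hyp_embs: (K, d) embeddings of the K CRNN transcriptions after EmbedNet.
    crnn_confidences: (K,) confidence scores from the CRNN, in [0, 1]."""
    distances = torch.norm(hyp_embs - word_emb, dim=1)   # (K,) Euclidean distances
    adjusted = distances * (1.0 - crnn_confidences)      # shrink distances of confident hypotheses
    return torch.argmin(adjusted).item()                 # index of the selected transcription

best = select_hypothesis(torch.randn(256), torch.randn(20, 256), torch.rand(20))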

 

[Figure: word recognition pipeline. CRNN hypotheses and E2E embeddings are projected by EmbedNet, and CAB re-scores the Euclidean distances to select the final prediction.]

 

Paper

  • ArXiv: PDF
  • ICPR: Coming soon!

 

Please consider citing if you make use of this work and/or the corresponding code:
 
@misc{bansal2020improving,
      title={Improving Word Recognition using Multiple Hypotheses and Deep Embeddings}, 
      author={Siddhant Bansal and Praveen Krishnan and C. V. Jawahar},
      year={2020},
      eprint={2010.14411},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Code

This work is implemented using the PyTorch neural network framework. Code is available in this GitHub repository.

DGAZE Dataset for driver gaze mapping on road


Isha Dua   Thrupthi John   Riya Gupta   C.V. Jawahar

Mercedes Benz   IIIT Hyderabad   IIIT Hyderabad   IIIT Hyderabad

IROS 2020

 

[Figure: road view]


Dataset Overview

DGAZE is a new dataset for mapping the driver's gaze onto the road. Existing driver gaze datasets are collected using eye-tracking hardware, which is expensive and cumbersome, and thus unsuited for use during testing. Our dataset is therefore designed so that no costly equipment is required at test time: models trained on our dataset require only a dashboard-mounted mobile phone during deployment, as our data is collected using mobile phones. We collect the data in a lab setting with a video of a road projected in front of the driver. We overcome the limitation of not using eye trackers by annotating points on the road video and asking the drivers to look at them. For more details, please refer to our paper.
[ paper ]
We collected road videos using mobile phones mounted on the dashboards of cars driven in the city. We combined the road videos to create a single 18-minute video with a good mix of road, lighting, and traffic conditions. The road images have varied illumination, as they were captured from morning to evening in real cars on actual roads. For each frame, we annotated a single object belonging to one of the classes: car, bus, motorbike, pedestrian, auto-rickshaw, traffic signal and sign board. We also marked the center of each bounding box to serve as the ground truth for the point detection challenge. We annotated objects that typically take up a driver's attention, such as relevant signage, pedestrians, and intercepting vehicles.

Use Cases

The task of driver eye-gaze estimation can be posed as point-wise or object-wise prediction, and we provide annotations so that our dataset can be used for both. Both types of eye-gaze prediction are useful. Predicting the object which the driver is looking at is useful for higher-level ADAS systems; this may be done by obtaining object candidates from an object detection algorithm and using the eye gaze to predict which object is being observed. Object-wise prediction can be used, for example, to determine whether a driver is focusing on a pedestrian or has noticed a signboard. Point-wise predictions are more fine-grained and more useful for nearby objects, as they show which part of the object is being focused on. They can be used to determine the saccade patterns of the eyes or to create a heatmap of the driver's attention, and they can even be converted into object-wise predictions. Our dataset allows both types of analysis to be conducted.
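For instance, a point-wise gaze prediction can be mapped to an object-wise one by checking which annotated bounding box contains the predicted point. The helper below is a hypothetical sketch (not part of the released dataset tools) and assumes boxes given as (label, x1, y1, x2, y2) in pixel coordinates.

def gaze_point_to_object(gaze_xy, boxes):
    """Return the label of the annotated box containing the gaze point, or None."""
    gx, gy = gaze_xy
    best_label, best_area = None, float("inf")
    for label, x1, y1, x2, y2 in boxes:
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            area = (x2 - x1) * (y2 - y1)
            if area < best_area:            # prefer the tightest enclosing box
                best_label, best_area = label, area
    return best_label

frame_boxes = [("pedestrian", 420, 310, 470, 420), ("car", 600, 300, 780, 430)]
print(gaze_point_to_object((455, 380), frame_boxes))   # -> "pedestrian"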

Citation

All documents and papers that report research results obtained using the DGAZE dataset should cite the paper below:
Citation: Isha Dua, Thrupthi Ann John, Riya Gupta and C. V. Jawahar. DGAZE: Driver Gaze Mapping on Road. In IROS 2020.


Bibtex

@inproceedings{isha2020iros,
  title={DGAZE: Driver Gaze Mapping on Road},
  author={Dua, Isha and John, Thrupthi Ann and Gupta, Riya and Jawahar, CV},
  booktitle={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020},
  year={2020}
}

Download Dataset

The DGAZE dataset is available free of charge for research purposes only. Please use the link below to download the dataset (a password-protected zip file). We request you to send a mail to Thrupthi Ann John and Isha Dua (contact details below) with the duly filled dataset agreement form, upon which we will send you the password.
Disclaimer: While every effort has been made to ensure accuracy, the DGAZE dataset owners cannot accept responsibility for errors or omissions.
[Download Dataset Agreement] [Download Dataset] [Download README]


Acknowledgements

We would like to thank Akshay Uttama Nambi and Venkat Padmanabhan from Microsoft Research for providing us with resources to collect the DGAZE dataset. This work is partly supported by DST through the IMPRINT program. Thrupthi Ann John is supported by the Visvesvaraya Ph.D. fellowship.


Contact

For any queries about the dataset, please contact the authors below:
Isha Dua
Thrupthi Ann John

Retrieval from Large Document Image Collections


Overview

Extracting relevant information from a large number of documents is a challenging and tedious task. The quality of results generated by traditional full-text search engines and text-based image retrieval systems is not optimal. These Information Retrieval (IR) tasks become even more challenging with non-traditional language scripts, for example Indic scripts. We have developed the OCR (Optical Character Recognition) Search Engine, an attempt to build an Information Retrieval and Extraction (IRE) system that replicates current state-of-the-art methods using IRE and basic Natural Language Processing (NLP) techniques. In this project we demonstrate a study of the methods used for performing search and retrieval tasks. We also present brief descriptions of the functionalities supported in our system, along with statistics of the dataset. We use the Indic-OCR developed at CVIT for generating the text for the OCR Search Engine.


[Figure: pipeline] This is the basic pipeline followed by the OCR Search Engine. Digitised images, after preprocessing steps such as denoising, are passed through a segmentation pipeline, which generates line-level segmentations for the document. These are then passed to the Indic-OCR developed at CVIT, which generates the text output. This text output, the segmented output (line-level segmentations) and the document images form the database of the IRE system.
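The storage backend of the deployed system is not described on this page; as a simplified stand-in, the sketch below builds a plain inverted index over the OCR output, mapping each token to the (book, page, line) positions where it occurs.

from collections import defaultdict

def build_inverted_index(ocr_lines):
    """ocr_lines: iterable of (book_id, page_no, line_no, text) tuples from the OCR output."""
    index = defaultdict(list)
    for book_id, page_no, line_no, text in ocr_lines:
        for token in text.split():
            index[token].append((book_id, page_no, line_no))
    return index

index = build_inverted_index([
    ("book-1", 3, 12, "धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः"),
])
print(index["कुरुक्षेत्रे"])   # -> [('book-1', 3, 12)]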

Functionalities implemented in the IRE system

The developed IRE system performs the following functionalities on the user end:

  • Search based on the keywords entered by the user: single-word queries, multiple-word queries, and exact-phrase queries enclosed in double quotes (a simplified sketch of this query handling follows the list).
  • Search based on the chosen language. The system also supports multilingual search, which lets users search books containing multiple languages, for instance the "Bhagawadgitha", where a shloka and its translation are provided together.
  • Search results are ranked by relevance, and the line or lines containing the searched query are highlighted within the complete text.
  • Additionally, primary (language, type, author, etc.) and secondary (genre and source) filters for the books have been added to ease the task of search.
  • Our system also supports transliteration, which enables users to benefit from the phonetics of another language while typing.
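As referenced in the first item above, the sketch below illustrates keyword versus quoted-phrase query handling over OCR'd lines. It is a simplification for exposition only; the deployed system additionally handles relevance ranking, filters and transliteration.

def search(query, ocr_lines):
    """ocr_lines: iterable of (book_id, page_no, line_no, text) tuples."""
    if len(query) > 1 and query.startswith('"') and query.endswith('"'):
        phrase = query.strip('"')
        # exact-phrase query: match lines containing the phrase verbatim
        return [(b, p, l) for b, p, l, text in ocr_lines if phrase in text]
    # keyword query: match lines containing every keyword
    keywords = query.split()
    return [(b, p, l) for b, p, l, text in ocr_lines
            if all(k in text.split() for k in keywords)]

lines = [("book-1", 3, 12, "the quick brown fox"), ("book-1", 3, 13, "quick red fox")]
print(search('"brown fox"', lines))   # -> [('book-1', 3, 12)]
print(search("fox quick", lines))     # -> both lines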

Live Demo

--- We will update the link ---

Code available at: [Code]


Dataset

[Figure: datasets] The datasets (NDLI data & British Library data) contain document images of digitised books in Indic languages.

The dataset provided for use consists of more than 1.5 million documents in Hindi, Telugu and Tamil (roughly 5 lakh each). It contains books from various genres such as religious texts, historical texts, science and philosophy. Dataset statistics are presented in the table below.

[Figure: Statistics of the NDLI dataset and the British Library dataset.]

In addition to the existing Indic languages, we are also working on Bangla data provided to us by the British Library. We are improving our OCR model's text predictions for Bangla, which will help achieve better search results.


Acknowledgements

This work was supported, in part, by the National Digital Library of India (IIT Kharagpur), which provided us with more than 1.5 million document images in Indian languages (Hindi, Tamil and Telugu). We also thank the British Library for providing us with ancient document images in the Bangla language. We would also like to acknowledge Krishna Tulsyan for assisting with the project initially.

Contact

For further information about this project, please feel free to contact:
Riya Gupta
Dataset and Server Management: Varun Bhargavan, Aradhana Vinod [varun.bhargavan;aradhana.vinod]@research.iiit.ac.in

Indiscapes: Instance segmentation networks for layout parsing of historical indic manuscripts


Abstract

Historical palm-leaf manuscripts and early paper documents from the Indian subcontinent form an important part of the world’s literary and cultural heritage. Despite their importance, large-scale annotated Indic manuscript image datasets do not exist. To address this deficiency, we introduce Indiscapes, the first dataset with multi-regional layout annotations for historical Indic manuscripts. To address the challenge of large diversity in scripts and the presence of dense, irregular layout elements (e.g. text lines, pictures, multiple documents per image), we adapt a Fully Convolutional Deep Neural Network architecture for fully automatic, instance-level spatial layout parsing of manuscript images. We demonstrate the effectiveness of the proposed architecture on images from the OpalNet dataset. For annotation flexibility, and keeping the non-technical background of domain experts in mind, we also contribute a custom, web-based GUI annotation tool and a dashboard-style analytics portal. Overall, our contributions set the stage for enabling downstream applications such as OCR and word-spotting in historical Indic manuscripts at scale.
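The exact network configuration and the set of layout region classes are specified in the paper and its code release; the snippet below is only a rough sketch of how an instance segmentation model for layout parsing might be set up, using torchvision's Mask R-CNN as a stand-in and an assumed number of region classes.

# requires torchvision >= 0.13 for the `weights` argument
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_region_classes = 1 + 9  # background + an assumed number of layout region types

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# replace the box and mask heads so they predict manuscript layout regions
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_region_classes)

in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, num_region_classes)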

 


To access the code and paper, click here.
