
Scene Text Recognition for Indian Languages


Sanjana Gunna

Abstract

Text recognition has been an active field in computer vision since before the deep learning era. Owing to the varied applications of recognition models, the research area is divided into categories based on the domain of the data: optical character recognition (OCR) focuses on scanned documents, whereas images of natural scenes with far more complex backgrounds fall under scene text recognition. Scene text recognition has become an exciting area of research due to difficulties such as complex backgrounds, improper illumination, noisy and distorted images, inconsistent fonts and font sizes, and text that is often not horizontally aligned. Such cases make the task of scene text recognition considerably more complicated and challenging.

In recent years, with the rise of deep learning, there has been steady growth in recognition algorithms and in the datasets available for training and testing. This surge has lifted the performance of recognizing text in natural scenes above that of earlier baseline models trained on hand-crafted features. Most of these works, however, centred on Latin text and did not deeply investigate scene text recognition for non-Latin languages. On scrutiny, we observe that the current best recognition models exceed 90% accuracy on scene text benchmark datasets, yet they do not perform as well on non-Latin languages as they do on Latin (English) datasets. This striking difference in performance across languages is a rising concern among researchers focusing on low-resource languages, and it is the motivation behind our work. Scene text recognition in low-resource non-Latin languages is difficult and challenging due to inherently complex scripts, multiple writing systems, and varied fonts and orientations. Despite such differences, Latin (English) text-like performance can also be achieved for low-resource non-Latin languages.

In this thesis, we look at the parameters involved in the process of text recognition and determine their importance through thorough experiments. We use synthetic data for controlled experiments in which we test the aforementioned parameters in isolation, to effectively identify the catalysts of text recognition, and we analyse the complexity of the scripts via these synthetic-data experiments. We present the results of our experiments on two baseline models, CRNN and STAR-Net, over the available datasets to ensure generalisability. In addition, we propose an error correction module that corrects labels by utilizing the training data of the real test datasets. To further improve results on real test datasets, we propose transfer learning from English, to exploit the abundant data available for that language. We show that the transfer from English is not helpful: it actually lowers the performance of the individual language models. Given this failure, we shift our focus to the Indian languages alone and examine the characteristics of each language via character n-gram plots, visual features such as vowel signs and conjunct characters, and other word statistics; the languages also resemble one another with respect to several of these factors. We then propose transfer learning across languages to enhance the performance of the language models.
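As a rough illustration of the character n-gram analysis mentioned above, the following Python sketch counts character n-grams over a per-language word list and reports the most frequent ones; the file paths, the value of n, and the word-list format (one word per line) are all assumptions for the example.

```python
# Minimal sketch of a character n-gram frequency analysis; paths are placeholders.
from collections import Counter

def char_ngrams(word, n=2):
    """Yield all character n-grams of a word."""
    for i in range(len(word) - n + 1):
        yield word[i:i + n]

def ngram_distribution(path, n=2, top_k=10):
    """Count character n-grams over a one-word-per-line list; return the most common."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(char_ngrams(line.strip(), n))
    return counts.most_common(top_k)

# Compare bigram distributions across languages (hypothetical word lists).
for lang, path in [("hindi", "hindi_words.txt"), ("telugu", "telugu_words.txt")]:
    print(lang, ngram_distribution(path))
```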
We demonstrate that transfers among Indian languages that are visually close improve results on real datasets, matching and sometimes beating the individual language models; such transfers prove far more profitable than transfers from English.

Through synthetic-data experiments on English test datasets, we establish the significance of the variety and number of fonts used during data generation: synthetic data must embody a wide range of fonts to give the diversity needed for robust recognition systems. To strengthen this diversity, we incorporate over 500 Hindi fonts (including Unicode and non-Unicode fonts) into the synthetic data, improving performance on Hindi real test datasets, and we describe how to incorporate non-Unicode fonts of Indian languages into the training process error-free. In addition to these fonts, we add an augmentation pipeline, employing more than nine augmentation techniques, to further diversify the data and boost the performance of Hindi STR systems.

With these changes we achieve significant improvements over previous works in evaluations under natural settings. We set new benchmark accuracies for STR on Hindi, Telugu, and Malayalam from the IIIT-ILST dataset, with gains of 6%, 5%, and 2% in Word Recognition Rate (WRR) over previous works. Similarly, we achieve a 23% WRR improvement for Bangla on the MLT-17 dataset, and improve this result further by incorporating the error correction module mentioned above into the training pipeline. We also release two STR datasets, for Gujarati and Tamil, containing 440 scene images that yield 500 Gujarati and 2535 Tamil cropped word images, and we report 5% and 3% WRR gains over our baseline models for Gujarati and Tamil, respectively. We further establish benchmark results on the MLT-19 and Bangla datasets, with 8% and 4% WRR improvements over the baselines. Enriching the synthetic dataset with non-Unicode fonts and multiple augmentations helps us achieve a remarkable WRR gain of over 33% on the IIIT-ILST Hindi dataset. Additionally, we implement a lexicon-based transcription approach that uses a dynamic lexicon for each image at test time, and we present its results for the languages mentioned above.

Keywords – Scene text recognition · transfer learning · photo OCR · multilingual OCR · Indian languages · Indic OCR · synthetic data · data diversity
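For reference, WRR is simply the percentage of cropped word images whose predicted transcription exactly matches the ground truth, and lexicon-based transcription snaps each raw prediction to the closest lexicon entry. The sketch below shows both under simple assumptions (similarity via Python's standard library; the dynamic per-image lexicon construction from the thesis is not reproduced here).

```python
# Sketch of WRR and a simple lexicon-based transcription step; the
# similarity measure and lexicon contents are illustrative assumptions.
from difflib import SequenceMatcher

def lexicon_transcribe(raw_prediction, lexicon):
    """Snap a raw model prediction to the most similar lexicon entry."""
    return max(lexicon, key=lambda w: SequenceMatcher(None, raw_prediction, w).ratio())

def word_recognition_rate(predictions, ground_truths):
    """WRR: percentage of word images transcribed exactly right."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)
```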

Year of completion: June 2022
Advisor: C V Jawahar

Related Publications


Downloads

thesis

Summarizing Day Long Egocentric Videos


Anuj Rathore

Abstract

The popularity of egocentric cameras and their always-on nature have led to an abundance of day-long first-person videos. Because of extreme camera shake and their highly redundant nature, these videos are difficult to watch from beginning to end and often require summarization tools for efficient consumption. However, traditional summarization techniques developed for static surveillance videos, or for highly curated sports videos and movies, are either not suitable or simply do not scale to such hours-long videos in the wild. On the other hand, specialized summarization techniques developed for egocentric videos limit their focus to important objects and people. In this work, we present a novel unsupervised reinforcement learning technique to generate video summaries from day-long egocentric videos. Our approach can be adapted to generate summaries of various lengths, making it possible to view even a one-minute summary of one’s entire day. The technique can also be adapted to various rewards, such as the distinctiveness and indicativeness of the summary. When using a facial-saliency-based reward, we show that our approach generates summaries focusing on social interactions, similar to the current state of the art. Quantitative comparison on the benchmark Disney dataset shows that our method achieves significant improvements in Relaxed F-Score (RFS) (32.56 vs. 19.21) and BLEU score (12.12 vs. 10.64). Finally, we show that our technique can also be applied to summarizing traditional, short, hand-held videos, where we improve the state-of-the-art F-scores on the benchmark SumMe and TVSum datasets from 41.4 to 45.6 and from 57.6 to 59.1, respectively.
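To make the reward-driven formulation concrete, here is a minimal sketch (not the thesis implementation) of the kind of distinctiveness and indicativeness rewards commonly used in unsupervised RL-based summarization, computed from per-frame feature embeddings and the frame indices the agent selects; the function names and tensor shapes are assumptions.

```python
# Sketch of distinctiveness/indicativeness rewards over frame features
# of shape (num_frames, dim); `picks` holds >= 2 selected frame indices.
import torch
import torch.nn.functional as F

def distinctiveness_reward(features, picks):
    """Mean pairwise dissimilarity among the selected frames."""
    sel = F.normalize(features[picks], dim=1)
    sim = sel @ sel.t()                              # cosine similarity matrix
    off_diag = sim[~torch.eye(len(picks), dtype=torch.bool)]
    return (1.0 - off_diag).mean()

def indicativeness_reward(features, picks):
    """How well the selected frames cover the entire video."""
    dists = torch.cdist(features, features[picks])   # all frames vs. selected
    return torch.exp(-dists.min(dim=1).values.mean())

# total_reward = distinctiveness_reward(f, p) + indicativeness_reward(f, p)
```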

Year of completion: July 2022
Advisors: C V Jawahar, Chetan Arora

Related Publications


Downloads

thesis

Counting in the 2020s: Binned Representations and Inclusive Performance Measures for Deep Crowd Counting Approaches


Sravya Vardhani Shivapuja

Abstract

Crowd counting is an important task in security, surveillance, and monitoring, and many competitive benchmark datasets are available in this domain. The data distributions in crowd counting datasets are heavy-tailed and discontinuous, a property largely ignored when building solutions to this problem, even though the skew contradicts assumptions made at several stages of the training pipeline. As a consequence of this skew, unacceptably large standard deviations are observed with respect to the customary performance measures (MAE, MSE). To address these issues, this thesis provides modifications that incorporate the dataset skew into the training and evaluation pipelines. In the training pipeline, to enable principled and balanced minibatch sampling, a novel smoothed Bayesian binning approach is presented that stratifies the entire count range; these strata are then sampled to construct uniform minibatches. The optimization is augmented with a novel strata-aware cost function that can be readily incorporated into existing crowd counting deep networks. In the evaluation pipeline, as alternatives to the customary MAE, this thesis provides three evaluation measures. Firstly, strata-level performance, reported as mean and standard deviation, gives range-specific insights. Secondly, a relative-error perspective is brought in by a novel Thresholded Percentage Error Ratio (TPER). Lastly, a localization-aware counting error metric, Grid Average Mean absolute Error (GAME), is used to evaluate the different networks. This thesis shows that the proposed binning-based modifications retain their superiority with respect to the novel strata-level performance measure. Overall, the thesis contributes a practically useful training pipeline and a detail-oriented characterization of performance for crowd counting approaches.
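As a concrete reference for the last of these measures, the sketch below computes GAME(L) for a single image by splitting the predicted and ground-truth density maps into a 2^L x 2^L grid and accumulating per-cell absolute count differences; GAME(0) reduces to the plain absolute count error. Array names are placeholders.

```python
# Sketch of the GAME(L) metric for one image; density maps are 2-D arrays.
import numpy as np

def game(pred_density, gt_density, L=2):
    """Grid Average Mean absolute Error; L=0 reduces to |pred - gt| on counts."""
    h, w = gt_density.shape
    cells = 2 ** L
    err = 0.0
    for i in range(cells):
        for j in range(cells):
            ys = slice(i * h // cells, (i + 1) * h // cells)
            xs = slice(j * w // cells, (j + 1) * w // cells)
            err += abs(pred_density[ys, xs].sum() - gt_density[ys, xs].sum())
    return err
```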

Year of completion: July 2022
Advisors: Ravi Kiran Sarvadevabhatla, Ganesh Ramakrishnan

Related Publications


Downloads

thesis

Deep Learning Methods for 3D Garment Digitization


Astitva Srivastava

Abstract

The reconstruction of 3D objects from monocular images is an active field of research in 3D computer vision, further boosted by advances in deep learning. In the context of the human body, modeling realistic 3D virtual avatars from 2D images is a recent trend, thanks to the advent of AR/VR and the metaverse. The problem is challenging owing to the non-rigid nature of the human body, especially because of garments. Various attempts have been made to solve it, at least for relatively tight clothing styles, but loose clothing still poses a huge challenge. The problem has also sparked considerable interest in the fashion e-commerce domain, where the objective is to model 3D garments independently of the underlying body, in order to enable applications like virtual try-on systems. 3D garment digitization has garnered much interest in the past few years as demand for online window-shopping and other e-commerce activities has grown, in no small part due to the COVID-19 crisis.

Though the problem of 3D garment digitization seems intriguing, solving it is not as straightforward as it looks. Most existing works in the field are deep-learning-based solutions. The majority rely on predefined garment templates, which makes texture synthesis easier but restricts usage to the fixed set of garment styles for which templates are available. Additionally, these methods do not deal with issues like complex poses and self-occlusions, which are very common under an in-the-wild assumption. Template-free methods, which enable modeling arbitrary clothing styles, have also been explored; however, they lack the texture information essential for a high-quality photorealistic appearance.

This thesis aims to resolve the aforementioned issues with novel solutions. The main objective is 3D digitization of garments from a monocular RGB image of a person wearing the garment, in both template-based and template-free settings. Initially, we address challenges in existing state-of-the-art template-based methods. We handle complex human poses, occlusions, etc. by proposing a robust keypoint regressor that estimates keypoints on the input monocular image. These keypoints define a thin-plate-spline (TPS) warping of texture from the input image to the UV space of a predefined template. We then utilize a deep inpainting network to handle missing texture information. To train these neural networks, we curate a synthetic dataset of garments with varying textures, draped on 3D human characters in various complex poses; this dataset enables robust training and generalization to real images. We achieve state-of-the-art results for specific clothing styles (e.g. t-shirt and trouser). However, template-based methods cannot model arbitrary garment styles, so we next aim to handle arbitrary styles in a template-free setting. Existing state-of-the-art template-free methods can model the geometric details of arbitrary garment styles to some extent but fail to recover texture information. To model arbitrary garment geometry, we propose to use an explicit, sparse representation originally introduced for modeling the human body; this representation handles self-occlusion and loose clothing as well.
We extend this representation by introducing semantic segmentation information to differentiate between the clothing styles present in the input image (top wear/bottom wear) and the human body. Furthermore, the representation is exploited in a novel way to provide seams for texture mapping, thereby retaining high-quality textural detail and opening the way to many useful applications such as texture editing, appearance manipulation, and texture super-resolution. The proposed method is the first to model arbitrary garment styles and recover their textures as well. We evaluate our proposed solutions on various publicly available datasets, outperforming existing state-of-the-art methods. We also discuss the limitations of the proposed methods, provide potential solutions that can be explored, and discuss future extensions. We believe this thesis significantly improves the research landscape in 3D garment digitization and accelerates progress in this direction.
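For the TPS-based texture warping step described above, the following is a minimal sketch using OpenCV's thin-plate-spline shape transformer (available in opencv-contrib-python); the keypoint arrays stand in for the regressed garment keypoints and their template UV positions, and are assumptions for the example.

```python
# Sketch of TPS texture warping between keypoint sets; requires
# opencv-contrib-python. Keypoints here are placeholders.
import cv2
import numpy as np

def tps_warp(image, src_pts, dst_pts):
    """Warp `image` so that src_pts map approximately onto dst_pts."""
    tps = cv2.createThinPlateSplineShapeTransformer()
    src = np.asarray(src_pts, np.float32).reshape(1, -1, 2)
    dst = np.asarray(dst_pts, np.float32).reshape(1, -1, 2)
    matches = [cv2.DMatch(i, i, 0) for i in range(src.shape[1])]
    # Note OpenCV's argument order here: the target shape comes first.
    tps.estimateTransformation(dst, src, matches)
    return tps.warpImage(image)
```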

Year of completion: August 2022
Advisor: Avinash Sharma

Related Publications


Downloads

thesis

Skeleton-based Action Recognition in Non-contextual, In-the-wild and Dense Joint Scenarios


Neel Trivedi

Abstract

Human action recognition, with its undeniable and varied use cases across surveillance, robotics, human-object interaction analysis, and more, has gained critical importance and attention in the field of computer vision. Traditionally based entirely on RGB sequences, the action recognition domain has in recent years shifted its focus towards skeleton sequences, owing to the easy availability of skeleton-capture devices and the release of large-scale datasets. Skeleton-based human action recognition, which is superior to traditional RGB-based action recognition in terms of privacy, robustness, and computational efficiency, is the primary focus of this thesis.

Ever since the release of the large-scale skeleton action datasets NTU RGB+D and NTU RGB+D 120, the community has focused on developing ever more complex approaches, ranging from CNNs to complex GCNs and, more recently, transformers, to achieve the best classification accuracy on these datasets. However, in this race for state-of-the-art performance, the community has overlooked a major drawback at the data level that bottlenecks even the most sophisticated approaches, and this drawback is where we start our explorations in this thesis. The pose tree provided in the NTU RGB+D datasets contains only 25 joints, of which only 6 (3 per hand) are finger joints. This is a major drawback, since 3 finger-level joints are not sufficient to distinguish between action categories such as "Thumbs up" and "Thumbs down", or "Make ok sign" and "Make victory sign". To address this bottleneck, we introduce two new pose-based human action datasets, NTU60-X and NTU120-X, which extend the largest existing action recognition dataset, NTU RGB+D. In addition to the 25 body joints per skeleton in NTU RGB+D, the NTU60-X and NTU120-X datasets include finger and facial joints, enabling a richer skeleton representation. We appropriately modify state-of-the-art approaches to enable training on the introduced datasets. Our results demonstrate the effectiveness of the NTU-X datasets in overcoming the aforementioned bottleneck and improving state-of-the-art performance, both overall and on the previously worst-performing action categories.

Pose-based action recognition is predominantly tackled by approaches that treat the input skeleton in a monolithic fashion, i.e. the joints in the pose tree are processed as a whole. Such approaches ignore the fact that action categories are often characterized by localized dynamics involving only small subsets of part-joint groups, such as the hands (e.g. ‘Thumbs up’) or legs (e.g. ‘Kicking’). Although part-grouping-based approaches exist, they do not consider each part group within the global pose frame, causing them to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their networks multiple times over these streams, which massively increases the number of training parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose global-frame-based part streams, as opposed to conventional modality-based streams; within each part stream, the associated data from multiple modalities is unified and consumed by a single processing pipeline.
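To illustrate the modalities and part streams just described, here is a rough sketch of deriving bone and velocity modalities from a joint tensor and slicing out one part stream; the parent indices and part groups below are illustrative placeholders, not the actual NTU pose tree or PSUMNet's grouping.

```python
# Sketch of skeleton modalities over a joint tensor of shape (T, V, C);
# PARENTS and PART_GROUPS are hypothetical, not the NTU definitions.
import numpy as np

PARENTS = [0, 0, 1, 2, 1, 4]               # parent joint index per joint (V = 6)
PART_GROUPS = {"arms": [1, 2, 3, 4, 5]}    # joint subsets per part stream

def bone_modality(joints):
    """Bone vectors: each joint minus its parent joint."""
    return joints - joints[:, PARENTS, :]

def velocity_modality(joints):
    """Temporal differences, zero at the first frame."""
    vel = np.zeros_like(joints)
    vel[1:] = joints[1:] - joints[:-1]
    return vel

def part_stream(joints, group="arms"):
    """Stack all modalities for one part group along the channel axis."""
    idx = PART_GROUPS[group]
    mods = [joints, bone_modality(joints), velocity_modality(joints)]
    return np.concatenate([m[:, idx, :] for m in mods], axis=-1)
```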
Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTU RGB+D 60/120 datasets and on the dense-joint skeleton datasets NTU60-X/120-X. PSUMNet is highly efficient and outperforms competing methods that use 100%-400% more parameters, and it also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance, and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices. Finally, we conclude this thesis by exploring new and more challenging frontiers under the umbrella of skeleton action recognition, namely "in-the-wild" and "non-contextual" skeleton action recognition. We introduce Skeletics-152, a curated 3D pose dataset derived from the RGB videos of the larger Kinetics-700 dataset, to explore in-the-wild skeleton action recognition. We further introduce Skeleton-Mimetics, a 3D pose dataset derived from the recently introduced non-contextual action dataset Mimetics. By benchmarking and analysing various approaches on these two new datasets, we lay the groundwork for future exploration of these two challenging problems within skeleton action recognition. Overall, this thesis draws attention to prevailing drawbacks in existing skeleton action datasets and introduces extensions of these datasets to counter their shortcomings. We also introduce PSUMNet, a novel, efficient, and highly reliable skeleton action recognition approach. Finally, we explore the more challenging tasks of in-the-wild and non-contextual action recognition.

Year of completion: September 2022
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Downloads

thesis
