Enhancing OCR Performance with Low Supervision

Deepayan Das

Abstract

Over the last decade, tremendous emphasis has been placed on collecting and digitizing vast numbers of books, leading to the creation of so-called ‘Digital Libraries’. Projects like Google Books and Project Gutenberg have made significant progress in digitizing millions of books and making them available to the public. Efforts have also been made for Indic languages, where the National Digital Library of India has undertaken the task of identifying and recognizing books in several Indian languages. The advantages of digital libraries are manifold. Digitization of ancient manuscripts ensures the preservation of knowledge and promotes research. Books in digital libraries are indexed, which facilitates easy search and retrieval. They are easy to store and require less maintenance effort than their physical counterparts. One of the most important steps in the digitization effort is the recognition and conversion of physical pages into editable text using an OCR. OCR systems such as the open-source Tesseract and the commercial ABBYY FineReader are available; however, the ability of an OCR to recognize text without committing too many errors depends heavily on the print quality of the pages as well as the font style of the typewritten text. A pre-trained OCR will invariably make errors on pages whose distribution differs, in terms of fonts and print quality, from that of the pages on which it was trained. If the domain gap is too large, the number of error words will be high, and correcting them will require significant effort. Since the books need to be indexed, one cannot afford too many word errors in the OCR-recognized pages. Thus, a major effort must be spent on correcting the words misrecognized by the OCR. Manually correcting each error word in isolation incurs a huge cost and is infeasible. In this thesis, we look at methods to improve OCR accuracy with minimal human involvement.
To this end, we propose two approaches. In the first approach, we strive to improve OCR performance via an efficient post-processing technique in which we group similar erroneous words and correct them simultaneously. We argue that since a book has a common underlying theme, it will contain many word repetitions. These word repetitions can be exploited by grouping similar error words and correcting them in batches. We propose a novel clustering scheme that combines features from both the word images and their text transcriptions to group error word predictions. The grouped error predictions can then be corrected either automatically or with the help of a human annotator. We show experimentally that automatic correction of error word batches might not be the most efficient way to correct the error words, and that employing a human annotator to verify the error word clusters is a more systematic way to address the issue.
Next, we look at the problem of adapting an OCR to a new dataset without requiring too many annotated pages. The conventional practice is to finetune the existing OCR on a portion of the target data. However, annotating even a portion of the data to create image-label pairs can be costly. To address this, we employ a self-training approach in which the OCR is finetuned on its own predictions on the target dataset. To curtail the effects of noise present in the predictions, we include in the training set only those samples on which the model is sufficiently confident. We also show that by employing various regularization strategies, we can outperform the traditional finetuning method without the need for any additional labelled data. We further show that combining self-training with finetuning achieves the largest gain in OCR accuracy across all the datasets. We furnish thorough empirical evidence to support all our claims.
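The confidence-based selection step of self-training can be illustrated with a minimal sketch. It is not the thesis implementation: the `Prediction` record, the `select_pseudo_labels` helper, the way confidence is obtained, and the threshold of 0.9 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """One OCR output on an unlabelled target image (hypothetical record)."""
    image_id: str      # assumed identifier of a word or line image
    text: str          # OCR transcription, used as the pseudo-label
    confidence: float  # assumed model confidence in [0, 1]


def select_pseudo_labels(
    predictions: List[Prediction], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Keep (image_id, text) pairs whose confidence is at least `threshold`.

    Low-confidence predictions are dropped so that noisy transcriptions
    do not enter the training set used for the next finetuning round.
    """
    return [
        (p.image_id, p.text)
        for p in predictions
        if p.confidence >= threshold
    ]


# Toy predictions on three target-domain word images (made-up values).
predictions = [
    Prediction("w1", "library", 0.97),
    Prediction("w2", "digita1", 0.42),
    Prediction("w3", "books", 0.90),
]

selected = select_pseudo_labels(predictions, threshold=0.9)
print(selected)
```

In a full pipeline, the selected pairs would be used to finetune the OCR, and the predict-select-finetune cycle could be repeated. The threshold trades the amount of pseudo-labelled data against label noise, so its value would need to be tuned for each dataset.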

Year of completion:	March 2021
Advisor:	C V Jawahar