Blending the Past and Present of Automatic Image Annotation

Ayushi Dutta

Abstract

Real world images depict varying scenes, actions and multiple objects interacting with each other. We consider the fundamental Computer Vision problem of image annotation, where an image needs to be automatically tagged with a set of discrete labels that best describe its semantics. As more and more digital images become available, image annotation can help in the automatic archival and retrieval of large image collections. Being at the heart of image understanding, image annotation can also assist in other visual learning tasks, such as image captioning, scene recognition, multi-object recognition, etc.. With the advent of deep neural networks, recent research has achieved ground-breaking results in single-label image classification. However, for images representing the real world, containing different objects in varying scales and viewpoints, modelling the semantic relationship between images and all of their associated labels continues to remain a challenging problem. Additional challenges are posed from class-imbalance, incomplete labelling, label-ambiguity and several other issues that are commonly observed in the image annotation datasets. In this thesis, we study the image annotation task from two aspects. First, we bring to attention some of the core issues in the image annotation domain related to dataset properties and evaluation metrics that inherently affect the annotation performance of existing approaches to a significant extent. To examine these key aspects, we evaluate ten benchmark image annotation techniques on five popular datasets using the same baseline features, and perform thorough empirical analyses. With novel experiments, we explore possible reasons behind variations in per-label versus per-image evaluation criteria and discuss when each one of these should be used. We investigate dataset specific biases and propose new quantitative measures to quantify the degree of image and label diversity in a dataset, that can also be useful in developing new image annotation datasets. We believe the conclusions derived in this analysis would be helpful in making systematic advancements in this domain. Second, we attempt to address the annotation task by a CNN-RNN framework that jointly models label dependencies in an image while annotating it. We base our approach on the premise that labels corresponding to different visual concepts in an image share rich semantic relationships among them (e.g., “sky” is related to “cloud”). We follow recent works that have explored the CNN-RNN style models due to RNN’s capacity to model higher-order dependencies, but are limited in their approach to train the RNN in a pre-defined label order sequence. We overcome this limitation and propose a new method to learn multiple label prediction paths. We evaluate our proposed method on a number of popular and relevant datasets and achieve superior results compared to existing CNN-RNN based approaches. We also discuss the scope of the CNN-RNN framework in the context of image annotation.

Year of completion:	November 2019
Advisor :	Prof. C.V. Jawahar and Yashaswi Verma