Combining Class Taxonomies and Multi Task Learning To Regularize Fine-grained Recognition
Fine-grained classification is an extremely challenging problem in computer vision, impaired by subtle differences in shape, pose, illumination and appearance, and further compounded by subtle intra-class differences and striking inter-class similarities. While convolutional neural networks have become versatile jack-of-all-trades tool in modern computer vision, approaches for fine-grained recognition still rely on localization of keypoints and parts to learn discriminative features for recognition. In order to achieve this, most approaches necessitate copious amounts of expensive manual annotations for bounding boxes and keypoints. As a result, most of the current methods inevitably end up becoming complex, multi-stage pipelines, with a deluge of tunable knobs, which makes it infeasible to reproduce them or deploy them for any practical scenario. Since image level annotation is prohibitively expensive for most fine-grained problems, we look at the problem from a rather different perspective, and try to reason about what might be the minimum amount of additional annotation that might be required to obtain an improvement in performance on the challenging task of fine-grained recognition. In order to tackle this problem, we aim to leverage the (taxonomic and/or semantic) relationships present among fine-grained classes. The crux of our proposed approach lies in the notion that fine-grained recognition effectively deals with subordinate-level classification, and as such, subordinated classes imply the presence of inter-class and intra-class relationships. These relationships may be taxonomical, such as super-classes, and/or semantic, such as attributes or factors, and are easily obtainable in the sense that domain expertise is needed for each fine-grained label, not for each image separately. We propose to exploit the rich latent knowledge embedded in these inter-class relationships for visual recognition. We posit the problem as a multi-task learning problem where each different label obtained from inter-class relationships can be treated as a related yet different task for a comprehensive multi-task model. Additional tasks/labels, which might be super-classes or attributes, or factor-classes can act as regularizers, and increase the generalization capabilities of the network. Class relationships are almost always a free source of labels that can be used as auxiliary tasks to train a multi-task loss which is usually a weighted sum of the different individual losses. Multiple tasks will try to take the network in diverging directions, and the network must reach a common minimum by adapting and learning features common to all tasks in its shared layers. Our main contribution is to utilize the taxonomic/semantic hierarchies among classes, where each level in the hierarchy is posed as a classification problem, and solved jointly using multi-task learning. We employ a cascaded multi-task network architecture, where the output of one task feeds into the next, thusenabling transfer of knowledge from the easier tasks to the more difficult ones. To gauge the relative importance of tasks, and apply appropriate learning rates for each task to ensure that the related tasks aid and unrelated tasks does not hamper performance on the primary task, we propose a novel task-wise dynamic coefficient which controls its contribution to the global objective function. We validate our proposed methods for improving fine-grained recognition via multi-task learning using class taxonomies on two datasets, viz. CIFAR 100, which has a simple 2 level hierarchy, albeit a bit noisy, which we use to estimate how robust our proposed approach is to hyperparameter sensitivities, and CUB-200-2011, which has a 4 level hierarchy, and is a more challenging real-world dataset in terms of image size, which we use to see how transferable our proposed approach is to pre-trained networks and fine-tuning. We perform ablation studies on CIFAR 100 to establish the usefulness of multi-task learning using hierarchical labels, and measure the sensitivity of our proposed architectures to different hyperparameters and design choices in an imperfect 2 level hierarchy. Further experiments on the popular, real-world, large-scale, fine-grained CUB-200-2011 dataset with a 4 level hierarchy re-affirm our claim that employing super-classes in an end-to-end model improves performance, compared to methods employing additional expensive annotations such as keypoints and bounding boxes and/or using multi-stage pipelines. We also prove the improved generalization capabilities of our multi-task models, by showing how multiple connected tasks act as regularizers, reducing the gap between training and testing errors. Additionally, we demonstrate how dynamically estimating auxiliary task relatedness and updating auxiliary task coefficients is more optimal than manual hyperparameter tuning for the same purpose.
|Year of completion:
|Prof. Anoop M Namboodiri
Koustav Ghosal, Ameya Prabhu, Riddhiman Dasgupta, Anoop M. Namboodiri - Learning Clustered Sub-spaces for Sketch-based Image Retrieval Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]