Saliency Estimation in Videos and Images

Samyak Jain


With the ever-growing volume of images and videos, it becomes imperative to develop solutions that filter out the important (salient) information in this data. Learning computational models for visual attention (saliency estimation) is an effort to inch machines and robots closer to human visual cognitive abilities. Such models can reduce human effort and serve various applications, such as automatic cropping and segmentation. We approach this problem by first investigating saliency estimation in images, i.e., predicting the response of the human visual system when exposed to an image. We start by identifying four key components of saliency models: input features, multi-level integration, readout architecture, and loss functions. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, architectural design choices are often empirical and frequently lead to models more complex than necessary; this complexity, in turn, hinders application requirements. We review state-of-the-art models along these four components and propose two novel and simpler end-to-end architectures, SimpleNet and MDNSal. They are straightforward, neater, minimal, and more interpretable than other architectures, while achieving state-of-the-art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture. MDNSal is a parametric model that directly predicts the parameters of a Gaussian mixture model (GMM) and aims to bring more interpretability to the predicted maps. We conclude that the way forward is not necessarily to design ever more complex architectures, but to perform a modular analysis that optimizes each component and explores novel (and simpler) alternatives. After exploring these components, we shift our focus to saliency estimation in videos, where user attention is captured in a dynamic scenario.
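A parametric readout of this kind can be sketched as follows: given predicted mixture parameters (weights, means, standard deviations), the saliency map is rendered by evaluating the mixture density over the image grid. The function name, parameter shapes, and axis-aligned Gaussian form below are illustrative assumptions, not MDNSal's exact interface.

```python
import numpy as np

def render_gmm_saliency(weights, means, stds, height, width):
    """Render a saliency map by evaluating an axis-aligned GMM on the
    image grid. Coordinates are normalized to [0, 1].

    weights: (K,)   mixture weights
    means:   (K, 2) component means (x, y)
    stds:    (K, 2) per-axis standard deviations
    """
    ys, xs = np.mgrid[0:height, 0:width]
    xs = xs / (width - 1)
    ys = ys / (height - 1)
    sal = np.zeros((height, width))
    for w, (mx, my), (sx, sy) in zip(weights, means, stds):
        # Axis-aligned 2D Gaussian density (unnormalized suffices for a map)
        g = np.exp(-0.5 * (((xs - mx) / sx) ** 2 + ((ys - my) / sy) ** 2))
        sal += w * g
    return sal / sal.max()  # normalize the map to [0, 1]

# Example: two components, a dominant one at the center and a weaker
# one in the upper-left corner
sal = render_gmm_saliency(
    weights=np.array([0.7, 0.3]),
    means=np.array([[0.5, 0.5], [0.2, 0.2]]),
    stds=np.array([[0.15, 0.15], [0.1, 0.1]]),
    height=64, width=64,
)
```

Because the output is a handful of interpretable parameters rather than a dense map, one can directly read off where the model expects attention to concentrate and how spread out it is.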
We propose the ViNet architecture for the task of video saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real time. ViNet does not use audio as input, yet it outperforms state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual). We also explore a variant of the ViNet architecture that augments the decoder with audio features. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and produces the same output irrespective of the audio input. Interestingly, we observe similar behavior in previous state-of-the-art models for audio-visual saliency prediction. Our findings contrast with earlier work on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations that incorporate audio more effectively.
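The decoder's core resizing step, upsampling a spatio-temporal feature volume before fusing it with a higher-resolution encoder feature, can be illustrated as separable trilinear interpolation: linear interpolation applied along time, height, and width in turn. This is a minimal numpy sketch of the interpolation step only, under assumed function names and shapes; ViNet's actual decoder additionally applies learned 3D convolutions after each upsampling.

```python
import numpy as np

def resize_linear(a, axis, size):
    """Linearly resample array `a` to `size` samples along `axis`."""
    old = np.linspace(0.0, 1.0, a.shape[axis])
    new = np.linspace(0.0, 1.0, size)
    return np.apply_along_axis(lambda v: np.interp(new, old, v), axis, a)

def trilinear_upsample(vol, t, h, w):
    """Trilinear interpolation as three separable 1D linear passes.

    vol: feature volume of shape (T, H, W)
    """
    vol = resize_linear(vol, 0, t)  # temporal axis
    vol = resize_linear(vol, 1, h)  # spatial height
    return resize_linear(vol, 2, w)  # spatial width

# Example: upsample a small 3D feature volume by a factor of 2 per axis
feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
up = trilinear_upsample(feat, 4, 8, 8)
```

Because linear interpolation with matched endpoints preserves the corner samples, the upsampled volume agrees with the original at its boundaries while filling in intermediate frames and pixels smoothly.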

Year of completion: February 2022
Advisor: Vineet Gandhi

Related Publications