From 2023

Becoming friends with pixels through intermediate representations

Dr. Kuldeep Kulkarni


Date : 30/08/2023



Manipulating natural images for tasks like object insertion, out-painting, or creating animations is extremely difficult when operating purely in the pixel domain. Dr. Kuldeep Kulkarni talked about the advantages of expressing visual data in intermediate representations and manipulating those representations instead of the pixels. Specifically, the focus was on his recent work with image out-painting and animating still images as target applications. He first presented a novel semantically-aware paradigm for image extrapolation that enables the addition of new object instances. Expressing images in the semantic label space allows existing objects to be completed more effectively and entirely new objects to be added, both of which are very difficult when working in the pixel domain. Dr. Kulkarni also discussed methods he developed that exploit intermediate representations like optical flow and surface normal maps to generate cinemagraphs depicting the animation of fluid elements and human clothing.
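To make the flow-based idea concrete, below is a minimal sketch of how a per-pixel optical-flow field can animate a still image by repeated backward warping. The function names and the nearest-neighbour warping scheme are illustrative assumptions for exposition, not Dr. Kulkarni's actual method:

```python
import numpy as np

def warp_with_flow(image, flow):
    """Backward-warp an H x W image by a per-pixel (dy, dx) flow field
    using nearest-neighbour sampling (a simplification; real systems use
    bilinear sampling and learned flow)."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return image[src_y, src_x]

def make_cinemagraph(image, flow, num_frames=8):
    """Apply the same flow repeatedly to turn a still image into frames."""
    frames, current = [], image
    for _ in range(num_frames):
        current = warp_with_flow(current, flow)
        frames.append(current)
    return frames
```

A constant rightward flow, for example, produces frames in which content drifts across the image, which is the basic mechanism behind fluid-element cinemagraphs.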


Dr. Kuldeep Kulkarni is a research scientist at Adobe Research, Bengaluru, working in the BigData Experience Labs. His current research interests broadly span computer vision, with a bent toward synthesizing beautiful and creative images and clips. Before that, he was a post-doctoral researcher at Carnegie Mellon University with Prof. Aswin Sankaranarayanan. Kuldeep received his Ph.D. in Electrical Engineering from Arizona State University under the supervision of Prof. Pavan Turaga. His Ph.D. thesis focused on tackling computer vision problems from compressive cameras at extremely low measurement rates, combining computer vision and compressed sensing.

Automatically Generating Audio Descriptions for Movies

Professor Andrew Zisserman


Date : 21/08/2023



Audio Description is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges: the Audio Description must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. This requires a visual-language model that can address all three of the 'what', 'who', and 'when' questions: What is happening in the scene? Who are the characters in the scene? And when should a description be given?

Professor Andrew Zisserman visited IIIT-H and gave a talk on the 21st of August, 2023. He discussed how to build on large pre-trained models to construct a visual-language model that can generate Audio Descriptions addressing the following questions: (i) how to incorporate visual information into a pre-trained language model; (ii) how to train the model using only partial information; (iii) how to use a 'character bank' to provide information on who is in a scene; and (iv) how to improve the temporal alignment of an ASR model to obtain clean data for training.


Andrew Zisserman is a Professor at the University of Oxford and is one of the principal architects of modern computer vision. He is best known for his leading role in establishing the computational theory of multiple-view reconstruction and the development of practical algorithms that are widely in use today. This culminated in the publication of his book with Richard Hartley, already regarded as a standard text. He is a fellow of the Royal Society and has won the prestigious Marr Prize three times.

Towards Trustworthy and Fair Medical Image Analysis Models

Raghav Mehta


Date : 15/05/2023



Although Deep Learning (DL) models have been shown to perform very well on various medical imaging tasks, inference in the presence of pathology presents several challenges to common models, impeding their integration into real clinical workflows. Deployment in real clinical contexts requires: (1) that the confidence of DL model predictions be accurately expressed in the form of uncertainties, and (2) that the models exhibit robustness and fairness across different sub-populations.

In this talk, we look at our recent work developing an uncertainty quantification score for the task of Brain Tumour Segmentation. We evaluated the score's usefulness during two consecutive Brain Tumour Segmentation (BraTS) challenges, BraTS 2019 and BraTS 2020. Overall, our findings confirm the importance and complementary value that uncertainty estimates provide to segmentation algorithms, highlighting the need for uncertainty quantification in medical image analysis.

Additionally, we bring uncertainty estimates together with fairness across demographic subgroups. Through extensive experiments on multiple tasks, we show that popular ML methods for achieving fairness across subgroups, such as data balancing and distributionally robust optimization, succeed in terms of model performance on some tasks, but this can come at the cost of poor uncertainty estimates associated with the model predictions. Finally, we discuss our ongoing work on a fairness mitigation framework for calibration. Although several methods have been shown to successfully mitigate biases across subgroups in terms of accuracy, they do not consider these models' calibration across subgroups. To this end, we propose a novel two-stage method, Cluster-Focal. Extensive experiments on two medical image classification datasets show that our method effectively controls calibration error in the worst-performing subgroups while preserving prediction performance, outperforming recent baselines.
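One common, simple way to attach per-pixel uncertainties to a segmentation model is the predictive entropy of the averaged softmax output over an ensemble (or Monte Carlo dropout samples). The sketch below is illustrative of that general idea only; it is not the specific BraTS uncertainty score developed in this work:

```python
import numpy as np

def predictive_entropy(prob_maps):
    """prob_maps: (n_models, H, W, n_classes) softmax outputs from an
    ensemble or MC-dropout samples. Returns the per-pixel entropy of the
    mean prediction; higher values indicate less confident pixels."""
    mean_probs = prob_maps.mean(axis=0)        # (H, W, n_classes)
    eps = 1e-12                                # guard against log(0)
    return -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
```

Pixels where the averaged prediction is near-uniform get entropy close to log(n_classes), while confidently classified pixels get entropy near zero; thresholding such a map is one way to flag regions a clinician should double-check.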


Raghav Mehta is a Ph.D. candidate in the Department of Electrical and Computer Engineering at McGill University. He works with Prof. Tal Arbel in the Probabilistic Vision Group, Centre for Intelligent Machines. His primary research is in the field of neuroimage analysis and machine learning. Specifically, he works on quantifying and leveraging uncertainty in deep neural networks for the medical image analysis pipeline. Previously, Raghav completed his master's degree at IIIT Hyderabad. He was one of the main students in the project, The Construction of Brain Atlas for Young

Robustness and Safety. Raghav has received several awards, including the MEITA scholarship, a reviewer award at MIDL, and best paper awards at the DART and UNSURE workshops at MICCAI.

Computer Vision: Self-Supervised Representation Learning

Yash Patel


Date : 16/03/2023



During his presentation, Yash Patel discussed a range of topics related to his research interests, which primarily include self-supervised representation learning, image compression, scene text detection and recognition, tracking and segmentation in videos, and 3D reconstruction.
Specifically, he discussed the challenges associated with training neural networks on non-differentiable losses and presented various approaches for overcoming them. In particular, he presented his proposed technique for training a neural network by minimizing a surrogate loss that approximates a target evaluation metric which may be non-differentiable. To achieve this, the surrogate is learned via a deep embedding method in which the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. Additionally, he described his work proposing a differentiable surrogate loss for the recall metric. To enable training with a very large batch size, which is crucial for metrics computed on the entire retrieval database, he used an implementation that sidesteps the hardware constraints of GPU memory. Furthermore, an efficient mixup approach operating on pairwise scalar similarities was employed to further increase the effective batch size.
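The idea of learning a surrogate whose embedding distance tracks a non-differentiable metric can be sketched in a toy form. The version below uses a linear embedding and plain gradient descent purely for illustration; the actual approach uses deep networks, and the metric and all names here are assumptions, not the speaker's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def metric(pred, gt):
    """A non-differentiable evaluation metric (here: fraction of
    thresholded disagreements; stands in for e.g. IoU or recall)."""
    return np.mean((pred > 0.5) != (gt > 0.5))

# Linear embedding f(x) = W @ x, trained so that the Euclidean distance
# ||f(pred) - f(gt)|| matches the metric value on sampled pairs.
dim, emb_dim = 8, 4
W = rng.normal(scale=0.1, size=(emb_dim, dim))

def surrogate(pred, gt, W):
    """Differentiable stand-in for the metric: distance in embedding space."""
    return np.linalg.norm(W @ (pred - gt))

for step in range(2000):
    pred, gt = rng.random(dim), rng.random(dim)
    d = W @ (pred - gt)
    s = np.linalg.norm(d) + 1e-9
    err = s - metric(pred, gt)
    # Gradient of (s - metric)^2 w.r.t. W: 2 * err * (d / s) (pred - gt)^T
    W -= 0.01 * 2 * err * np.outer(d / s, pred - gt)
```

Once trained, the surrogate is differentiable in the prediction, so it can be backpropagated through the network in place of the original metric.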


Yash Patel, a Ph.D. candidate at the Center for Machine Perception, Czech Technical University, advised by Prof. Jiri Matas and an esteemed alumnus of IIIT Hyderabad, visited our institute on 16th March 2023.

He holds a Bachelor of Technology with Honors by Research in Computer Science and Engineering from the International Institute of Information Technology, Hyderabad (IIIT-H). During his undergraduate studies, he worked with Prof. C.V. Jawahar at the Center for Visual Information Technology (CVIT). He also holds a Master's degree in Computer Vision from the Robotics Institute at Carnegie Mellon University, where he worked with Prof. Abhinav Gupta.
