From 2025

Learning Co-speech Gesture Representations

Sindhu B. Hegde

 

Date : 11/04/2025

 

Abstract:

Humans gesture when they speak -- gesturing is an integral part of non-verbal communication. Yet, large-scale understanding of co-speech gestures remains relatively underexplored. In this talk, I will delve into different approaches for learning co-speech gesture representations, highlight key challenges, and outline promising directions to advance gesture understanding in real-world, multimodal settings.

Bio:

Sindhu Hegde is a second-year PhD student in the Visual Geometry Group (VGG) at the University of Oxford, supervised by Prof. Andrew Zisserman. Her research is in Computer Vision, particularly in multimodal learning, video understanding, and self-supervised learning. Prior to joining Oxford, she worked as a Lead Data Scientist at Verisk Analytics. Before that, she pursued a Master's by Research (MS) at the Centre for Visual Information Technology (CVIT), IIIT Hyderabad, supervised by Prof. C V Jawahar (IIIT-H) and Prof. Vinay Namboodiri (University of Bath, UK). Her Master's research focused on exploiting the redundancies in vision and speech modalities for cross-modal generation.


Sounds of Pouring


Piyush Bagad

 

Date : 09/04/2025

 

Abstract:

What can possibly be scientifically interesting about such a mundane chore as pouring a liquid into a glass? We perform this action all the time but barely realise that we effortlessly learn to infer several useful physical properties in the process. For example, evidence in psychoacoustics suggests that humans can accurately infer the level of the liquid, the time to fill, the size of the container, and even the temperature of the liquid merely from the sound of pouring. How do we do it? What is the physics behind pouring? How can we use it to train an audio model to predict some of these physical properties solely from the sound of pouring? I will answer these questions in the talk.
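One commonly cited piece of that physics (an illustrative assumption here, not necessarily the model presented in the talk) is that the air column above the liquid behaves roughly like a quarter-wavelength resonator, so the dominant pitch rises as the container fills. A minimal sketch of that relationship:

    # Minimal sketch: pitch of the air column above the liquid while pouring.
    # Assumes a roughly cylindrical container acting as a tube closed at the
    # liquid surface and open at the top (quarter-wavelength resonator); this
    # is an illustrative simplification, not the speaker's actual model.

    SPEED_OF_SOUND = 343.0  # m/s, in air at ~20 C

    def air_column_pitch(container_height_m: float, fill_fraction: float) -> float:
        """Fundamental resonance (Hz) of the air column above the liquid."""
        air_column = container_height_m * (1.0 - fill_fraction)
        if air_column <= 0:
            raise ValueError("container is full")
        return SPEED_OF_SOUND / (4.0 * air_column)

    if __name__ == "__main__":
        # The pitch rises as the glass fills: the audible cue humans (and audio
        # models) can exploit to estimate liquid level and time-to-fill.
        for fill in (0.1, 0.5, 0.9):
            print(f"fill={fill:.0%}  pitch ~ {air_column_pitch(0.15, fill):.0f} Hz")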

Bio:

Piyush is a PhD student at the VGG lab in Oxford. He is supervised by Prof. Andrew Zisserman. His interests lie in time-sensitive multi-modal video understanding. Previously, he did his Master’s in AI at the University of Amsterdam. He has also worked as a Research Fellow at Wadhwani AI in Mumbai. 

 

Model Compression (Pruning and Quantization Strategies)


Dr. Srinivas Rana

 

Date : 02/04/2025

 

Abstract:

Model compression, focusing specifically on pruning and quantization strategies, is an essential topic in machine learning. As machine learning models become more complex and resource-demanding, exploring ways to make them more efficient without sacrificing their performance is crucial. Pruning and quantization are two key techniques that help reduce the size of models, improve inference speed, and lower resource consumption, making them more suitable for deployment on edge devices or in environments with limited computational power.
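For readers who want a concrete picture of the two techniques, the sketch below applies magnitude pruning and dynamic 8-bit quantization to a small network using standard PyTorch utilities; the toy model and the layer choices are illustrative assumptions, not material from the talk.

    # Minimal sketch: magnitude pruning followed by dynamic int8 quantization.
    # The toy model and the 50% sparsity level are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    # Pruning: zero out the 50% smallest-magnitude weights of each Linear layer.
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.5)
            prune.remove(module, "weight")  # make the sparsity permanent

    # Quantization: store Linear weights as 8-bit integers and run int8 matmuls
    # at inference time (dynamic quantization).
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    print(quantized)  # smaller weights, faster CPU inference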

Bio:

Dr. Srinivas Rana is currently a Senior ML Scientist at Wadhwani AI, where he leads a portfolio of healthcare solutions pertaining to screening or diagnostics across diverse domains. Previously, he worked in the UK with different organizations in the field of medical devices and life sciences. He holds a PhD from IIT Madras.

 



From 2024

Advances in Robot Perception and Communication Towards Networked Autonomy


Rajat Talak

 

Date : 29/05/2024

 

Abstract:

Networked autonomy aspires to build a team of robots that can seamlessly perceive, communicate, and interact with humans and the environment. A robot perception system needs to build an actionable 3D scene representation in real-time that is scalable for long-horizon networked autonomy, and also implement mechanisms that can tackle failures due to changing environmental conditions. Operating in constantly changing environments, networked autonomous systems need to exchange time-sensitive information to ensure situational awareness and safety.

Bio:

Rajat Talak is a Research Scientist in the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology (MIT). Prior to this, he was a Postdoctoral Associate in the Department of Aeronautics and Astronautics at MIT from 2020 to 2022. He received his Ph.D. from the Laboratory for Information and Decision Systems at MIT in 2020. He holds a Master of Science degree from the Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India, and a B.Tech from the Department of Electronics and Communication Engineering, National Institute of Technology, Surathkal, India. He is the recipient of the ACM MobiHoc 2018 Best Paper Award and a gold medal for his MSc thesis from the Indian Institute of Science.


Finding Patterns in Pictures

Prof. B.S. Manjunath

 

Date : 09/02/2024

 

Abstract:

Explore the evolving landscape of computer vision as Prof. B.S. Manjunath delves into its broader applications beyond everyday object recognition. The discussion will span diverse fields such as neuroscience, materials science, ecology, and conservation. Prof. Manjunath will showcase innovative applications, including a novel approach to malware detection and classification. The talk will conclude with an overview of a unique platform designed to foster reproducible computer vision and encourage interdisciplinary collaborations between vision/ML researchers and domain scientists through the sharing of methods and data.

Bio:

Prof. B.S. Manjunath is a distinguished academician and Chair of the ECE Department at UCSB, California. He directs the Center for Multimodal Big Data Science and Healthcare. He has published over 300 peer-reviewed articles in journals and conferences, and his publications have been cited extensively. He is an inventor on 24 patents and co-edited the first book on the ISO/MPEG-7 multimedia content representation standard. He directed the NSF/ITR-supported center on Bio-Image Informatics. His team is developing the open-source BisQue image informatics platform, which helps users manage, annotate, analyze and share their multimodal images in a scalable manner with a focus on reproducible science.


Domain Adaptation for Fair and Robust Computer Vision

Tarun Kalluri

 

Date : 09/01/2024

 

Abstract:

While recent progress significantly advances the state of the art in computer vision across several tasks, the poor ability of these models to generalize to domains and categories under-represented in the training set remains a problem, posing a direct challenge to fair and inclusive computer vision. In my talk, I will present my recent efforts towards improving generalizability and robustness in computer vision using domain adaptation. First, I will talk about our work on scaling domain adaptation to large-scale datasets using metric learning. Next, I will introduce our new dataset effort, called GeoNet, aimed at benchmarking and developing novel algorithms for geographical robustness in various vision tasks. Finally, I will discuss future research directions on leveraging rich multimodal (vision, language) data to improve the adaptation of visual models to new domains.

Bio:

Tarun Kalluri is a fourth-year PhD student at UC San Diego in the Visual Computing Group. Prior to that, he graduated with a bachelor's degree from the Indian Institute of Technology, Guwahati, and worked as a data scientist at Oracle. His research interests lie in label- and data-efficient learning from images and videos, domain adaptation, and improving fairness in AI. He is a recipient of the IPE PhD fellowship for 2020-21.


What Do Generative Image Models Know? Understanding their Latent Grasp of Reality and Limitations

Anand Bhattad

 

Date : 06/01/2024

 

Abstract:

Recent generative image models like StyleGAN, Stable Diffusion, and VQGANs have achieved remarkable results, producing images that closely mimic real-world scenes. However, their true understanding of visual information remains unclear. My research delves into this question, exploring whether these models operate on purely abstract representations or possess an understanding akin to traditional rendering principles. My findings reveal two things about these models.

They encode a wide range of scene attributes like lighting, albedo, depth, and normals, suggesting a nuanced understanding of image composition and structure. Despite this, they exhibit significant shortcomings in depicting projective geometry and shadow consistency, often misrepresenting relationships and light interactions, leading to subtle visual anomalies.

This talk aims to illuminate the complex narrative of what generative image models truly understand about the visual world, highlighting their capabilities and limitations. I’ll conclude with insights into the broader implications of these findings and discuss potential directions for enhancing the realism and utility of generative imagery in applications demanding high fidelity and physical plausibility.

Bio:

Anand Bhattad is a Research Assistant Professor at the Toyota Technological Institute in Chicago (TTIC). Before this role, he earned his PhD from the University of Illinois Urbana-Champaign (UIUC), working with his advisor David Forsyth. His primary research interests lie in computer vision, with a specific focus on knowledge in generative models, and their applications to computer vision, computer graphics, and computational photography. His recent work, DIVeR, received a best paper nomination at CVPR 2022. He was listed as an Outstanding Reviewer at ICCV 2023 and an Outstanding Emergency Reviewer at CVPR 2021. 


Object-centric 3D Scene Understanding from Videos

Yash Bhalgat

 

Date : 06/01/2024

 

Abstract:

The growing demand for immersive, interactive experiences has underscored the importance of 3D data in understanding our surroundings. Traditional methods for capturing 3D data are often complex and equipment-intensive. In contrast, Yash Bhalgat's research aims to utilize unconstrained videos, such as those from augmented reality glasses, to effortlessly capture scenes and objects in their full 3D complexity.

Yash Bhalgat's talk at IIIT-H on January 6, 2024, described a method to incorporate epipolar geometry priors in multi-view Transformer models, enabling the identification of objects across extreme pose variations. Next, he talked about his recent work on 3D object segmentation using 2D pre-trained foundation models. Finally, he touched upon his ongoing work on object-centric dynamic scene representations.

Bio:

Yash Bhalgat is a third-year PhD student at the University of Oxford's Visual Geometry Group, supervised by Andrew Zisserman, Andrea Vedaldi, Joao Henriques and Iro Laina. His research is broadly in 3D computer vision and machine learning, with a specific focus on geometry-aware deep networks (transformers), 3D reconstruction, and neural rendering. Yash also works on the intersection of 3D and LLMs. Previously, he was a Senior Researcher at Qualcomm AI Research working on efficient deep learning. Yash received his Master's in Computer Science from the University of Michigan, Ann Arbor, and his Bachelor's in Electrical Engineering (with a CS minor) from IIT Bombay.


3D Representation and Analytics for Computer Vision

Prof. Chandra Kambhamettu

 

Date : 09/01/2024

 

Abstract:

Data in computer vision is commonly either in a homogeneous format in 2D projective space (e.g., images and videos) or a heterogeneous format in 3D Euclidean space (e.g., point clouds). One advantage of 3D data is its invariance to illumination-based appearance changes. 3D point clouds are now a vital data source for vision tasks such as autonomous driving, robotic perception, and scene understanding. Thus, deep learning for 3D points has become an essential branch of geometry-based research. However, deep learning over unstructured point clouds is quite challenging. Moreover, due to the 3D data explosion, new representations are necessary for compression and encryption. Therefore, we introduce a new sphere-based representation (3DSaint) to model a 3D scene and further utilize it in deep networks. Our representation produces state-of-the-art results on several 3D understanding tasks, including 3D shape classification. We also present differential geometry-based networks for 3D analysis tasks.

Bio:

Dr. Chandra Kambhamettu is a Full Professor of the Computer Science department at the University of Delaware, where he directs the Video/Image Modelling and Synthesis (VIMS) group. His research interests span Artificial Intelligence and Data Science, including computer vision, machine learning, and big data visual analytics. His Lab focuses on novel schemes for multimodal image analysis methodologies. Some of his recent research includes image analysis for the visually impaired, 3D point cloud analysis, drone and vehicular-based camera imagery acquisition, analysis and reconstruction, and plant science image analysis. His work on nonrigid structure from motion was published in CVPR in 2000 and cited as one of the first two papers in the field of Nonrigid Structure from Motion. Several of Dr. Kambhamettu’s works also focus on problems that highly impact earth life, such as arctic sea ice observations with application towards mammal habitat quantification and climate change and hurricane image studies, among several others. Before joining UD, he was a research scientist at NASA-Goddard, where he received the “1995 Excellence in Research Award.” In addition, he received the NSF CAREER award in 2000 and NASA’s “Group Achievement Award” in 2013 for his work in the deployment of the Arctic Collaborative Environment.


From 2023

Becoming friends with pixels through intermediate representations

Dr. Kuldeep Kulkarni

 

Date : 30/08/2023

 

Abstract:

Manipulation of natural images for tasks like object insertion, out-painting or creating animations is extremely difficult if we operate purely in the pixel domain. Dr Kuldeep Kulkarni talked about the advantages of manipulating visual data by expressing them in intermediate representations and manipulating those instead of the pixels. Specifically, the focus was on his recent works with image out-painting and animating still images as target applications. He first talked about a semantically-aware novel paradigm to perform image extrapolation that enables the addition of new object instances. Expressing the images in semantic label space allows us to complete the existing objects more effectively and to add completely new objects, which is otherwise very difficult when working in the pixel domain. Dr. Kulkarni also discussed methods he developed to exploit intermediate representations like optical flow and surface normal maps to generate cinemagraphs depicting the animation of fluid elements and human clothing.

Bio:

Dr Kuldeep Kulkarni is a research scientist at Adobe Research, Bengaluru, working in the BigData Experience Labs. His current research interests broadly span computer vision, with a bent toward synthesizing beautiful and creative images and clips. Before this, Kuldeep did a post-doc stint at Carnegie Mellon University with Prof. Aswin Sankaranarayanan. Kuldeep received his Ph.D. in Electrical Engineering from Arizona State University under the supervision of Prof. Pavan Turaga. His PhD thesis focussed on tackling computer vision problems from compressive cameras at extremely low measurement rates, combining computer vision and compressed sensing.


Automatically Generating Audio Descriptions for Movies

Professor Andrew Zisserman

 

Date : 21/08/2023

 

Abstract:

Audio Description is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges - the Audio Description must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. This requires a visual-language model that can address all three of the 'what', 'who', and 'when' questions: What is happening in the scene? Who are the characters in the scene? And when should a description be given?

Professor Andrew Zisserman visited IIIT-H and gave a talk on the 21st of August, 2023. He discussed how to build on large pre-trained models to construct a visual-language model that can generate Audio Descriptions addressing the following questions: (i) how to incorporate visual information into a pre-trained language model; (ii) how to train the model using only partial information; (iii) how to use a 'character bank' to provide information on who is in a scene; and (iv) how to improve the temporal alignment of an ASR model to obtain clean data for training.

Bio:

Andrew Zisserman is a Professor at the University of Oxford and is one of the principal architects of modern computer vision. He is best known for his leading role in establishing the computational theory of multiple-view reconstruction and the development of practical algorithms that are widely in use today. This culminated in the publication of his book with Richard Hartley, already regarded as a standard text. He is a fellow of the Royal Society and has won the prestigious Marr Prize three times.


Towards Trustworthy and Fair Medical Image Analysis Models

Raghav Mehta

 

Date : 15/05/2023

 

Abstract:

Although Deep Learning (DL) models have been shown to perform very well on various medical imaging tasks, inference in the presence of pathology presents several challenges to common models. These challenges impede the integration of DL models into real clinical workflows. Deployment of these models into real clinical contexts requires: (1) that the confidence in DL model predictions be accurately expressed in the form of uncertainties and (2) that they exhibit robustness and fairness across different sub-populations.

In this talk, we will look at our recent work, where we developed an uncertainty quantification score for the task of Brain Tumour Segmentation. We evaluated the score's usefulness during two consecutive Brain Tumour Segmentation (BraTS) challenges, BraTS 2019 and BraTS 2020. Overall, our findings confirm the importance and complementary value that uncertainty estimates provide to segmentation algorithms, highlighting the need for uncertainty quantification in medical image analyses. Additionally, we bring the aspect of uncertainty estimates together with fairness across demographic subgroups. By performing extensive experiments on multiple tasks, we show that popular ML methods for achieving fairness across different subgroups, such as data balancing and distributionally robust optimization, succeed in terms of model performance for some of the tasks. However, this can come at the cost of poor uncertainty estimates associated with the model predictions.

Finally, we discuss our ongoing work on a fairness-mitigation framework in terms of calibration. Although several methods have been shown to successfully mitigate biases across subgroups in terms of accuracy, they do not consider the calibration of these models across different subgroups. To this end, we propose a novel two-stage method, Cluster-Focal. Extensive experiments on two different medical image classification datasets show that our method effectively controls calibration error in the worst-performing subgroups while preserving prediction performance, outperforming recent baselines.
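For context on the quantity being controlled, the sketch below computes a standard expected calibration error per demographic subgroup and reports the worst subgroup; it is a generic illustration with assumed variable names, not the proposed Cluster-Focal method.

    # Minimal sketch: expected calibration error (ECE) per subgroup and the
    # worst-subgroup value; a generic illustration, not Cluster-Focal itself.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Average |accuracy - confidence| over equal-width confidence bins."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += mask.mean() * gap
        return ece

    def worst_subgroup_ece(confidences, correct, subgroups):
        """Calibration error of the worst-calibrated demographic subgroup."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        subgroups = np.asarray(subgroups)
        return max(expected_calibration_error(confidences[subgroups == g],
                                              correct[subgroups == g])
                   for g in np.unique(subgroups))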

Bio:

Raghav Mehta is a Ph.D. candidate in the Department of Electrical and Computer Engineering at McGill University. He works with Prof. Tal Arbel in the Probabilistic Vision Group, Centre for Intelligent Machines. His primary research is in the field of neuroimage analysis and machine learning. Specifically, he works on quantifying and leveraging uncertainty in deep neural networks for the medical image analysis pipeline. Previously, Raghav completed his master's from IIIT Hyderabad, where he was one of the main students in the project The Construction of Brain Atlas for Young [...] Robustness and Safety. Raghav has received several awards, including the MEITA scholarship, a reviewer award at MIDL, and best paper awards at the DART and UNSURE workshops at MICCAI.


Computer vision - Self-Supervised Representation Learning

Yash Patel

 

Date : 16/03/2023

 

Abstract:

Yash Patel, during his presentation, discussed a range of topics related to his research interests, which primarily include self-supervised representation learning, image compression, scene text detection and recognition, tracking and segmentation in videos, and 3D reconstruction.
Specifically, he discussed the challenges associated with training neural networks on non-differentiable losses and presented various approaches for overcoming these challenges. In particular, he presented his proposed technique for training a neural network by minimizing a surrogate loss that approximates a target evaluation metric that may be non-differentiable. To achieve this, the surrogate is learned via a deep embedding method in which the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. Additionally, he described his work on proposing a differentiable surrogate loss for the recall metric. To enable training with a very large batch size, which is crucial for metrics computed on the entire retrieval database, he utilized an implementation that sidesteps the hardware constraints of GPU memory. Furthermore, an efficient mixup approach that operates on pairwise scalar similarities was employed to virtually increase the batch size further.
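To make the surrogate-loss idea concrete, here is a minimal sketch under assumed shapes and networks (not the exact method from the papers): an embedding network is fit so that the Euclidean distance between the embedded prediction and ground truth regresses onto the true, possibly non-differentiable metric, and the task network is then trained through that differentiable surrogate.

    # Minimal sketch of learning a surrogate loss via a deep embedding; shapes,
    # networks, and the training loop are illustrative assumptions.
    import torch
    import torch.nn as nn

    embed = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

    def surrogate_loss(pred, target):
        """Differentiable stand-in for the evaluation metric."""
        return (embed(pred) - embed(target)).norm(dim=-1).mean()

    def fit_surrogate(pred, target, metric_value, opt):
        """Regress the embedding distance onto the true (non-differentiable) metric."""
        dist = (embed(pred.detach()) - embed(target)).norm(dim=-1)
        loss = ((dist - metric_value) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Training alternates two steps: update `embed` with fit_surrogate() using the
    # true metric computed offline, then update the task network by back-propagating
    # surrogate_loss(pred, target) with the embedding held fixed.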

Bio:

Yash Patel, a Ph.D. candidate at the Center for Machine Perception, Czech Technical University, advised by Prof. Jiri Matas and an esteemed alumnus of IIIT Hyderabad, visited our institute on 16th March 2023.

He holds a Bachelor of Technology with Honors by Research in Computer Science and Engineering from the International Institute of Information Technology, Hyderabad (IIIT-H). During his undergraduate studies, he worked with Prof. C.V. Jawahar at the Centre for Visual Information Technology (CVIT). He also holds a Master's degree in Computer Vision from the Robotics Institute of Carnegie Mellon University, where he worked with Prof. Abhinav Gupta.

Yash's research interests are primarily focused on computer vision with expertise in areas such as self-supervised representation learning, image compression, scene text detection and recognition, tracking and segmentation in videos, and 3D reconstruction.


From 2022

Building Maximal Vision Systems with Minimal Resources

Dr. Aayush Bansal

 

Date : 19/12/2022

 

Abstract:

Current vision and robotic systems are like mainframe machines of the 60s -- they require extensive resources: (1) dense data capture and massive human annotations, (2) large parametric models, and (3) intensive computational infrastructure. I build systems that can learn directly from sparse and unconstrained real-world samples with minimal resources, i.e., limited or no supervision, use simple and efficient models, and operate on everyday computational devices. Building systems with minimal resources allows us to democratize them for non-experts. My work has impacted important areas such as virtual reality, content creation and audio-visual editing, and providing a natural voice to speech-impaired individuals.

In my talk, I will present my efforts to build vision systems for novel view synthesis. I will discuss Neural Pixel Composition, a novel approach for continuous 3D-4D view synthesis that reliably operates on sparse and wide-baseline multi-view images/videos and can be trained efficiently within a few minutes for high-resolution (12MP) content using 1 GB GPU memory. I will present my efforts to build vision systems for unsupervised audio-visual synthesis. I will primarily discuss Exemplar Autoencoders that enable zero-shot audio-visual retargeting. Exemplar Autoencoders are built on remarkably simple insights: (1) autoencoders project out-of-sample data onto the distribution of the training set; and (2) exemplar learning enables us to capture the voice, stylistic prosody (emotions and ambiance), and visual appearance of the target. These properties enable an autoencoder trained on an individual's voice to generalize for unknown voices in different languages. Exemplar Autoencoders can synthesize natural voices for speech-impaired individuals and do a zero-shot multilingual translation.

Bio:

Aayush Bansal is currently a short-term research scientist at the Reality Labs Research of Meta Platforms, Inc. He received his Ph.D. in Robotics from Carnegie Mellon University under the supervision of Prof. Deva Ramanan and Prof. Yaser Sheikh. He was a Presidential Fellow at CMU, and a recipient of the Uber Presidential Fellowship (2016-17), Qualcomm Fellowship (2017-18), and Snap Fellowship (2019-20). His research has been covered by various national and international media such as NBC, CBS, WQED, 90.5 WESA FM, France TV, and Journalist. He has also worked with production houses such as BBC Studios, Full Frontal with Samantha Bee (TBS), etc. More details are available on his webpage: https://www.aayushbansal.xyz/


Towards Autonomous Driving in Dense, Heterogeneous, and Unstructured Environments

Dr. Rohan Chandra

 

Date : 07/11/2022

 

Abstract:

In this talk, I discuss key problems in autonomous driving towards handling dense, heterogeneous, and unstructured traffic environments. Autonomous vehicles (AVs) at present are restricted to operating on smooth and well-marked roads, in sparse traffic, and among well-behaved drivers. I present new techniques to perceive, predict, and navigate among human drivers in traffic that is significantly denser in terms of the number of traffic agents, more heterogeneous in terms of the size and dynamic constraints of traffic agents, and where many drivers may not follow the traffic rules and have varying behaviors. My talk is structured along three themes: perception, driver behavior modeling, and planning. More specifically, I will talk about (1) improved tracking and trajectory prediction algorithms for dense and heterogeneous traffic using a combination of computer vision and deep learning techniques; (2) a novel behavior modeling approach using graph theory for characterizing human drivers as aggressive or conservative from their trajectories; and (3) behavior-driven planning and navigation algorithms in mixed and unstructured traffic environments using game theory and risk-aware planning. Finally, I will conclude by discussing the future implications and broader applications of these ideas in the context of social robotics, where robots are deployed in warehouses, restaurants, hospitals, and inside homes to assist human beings.
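To give a flavour of the graph-theoretic behavior modeling mentioned above, the sketch below builds a proximity graph over traffic agents at each time step and averages a per-agent centrality score over time; the radius threshold and the choice of closeness centrality are illustrative assumptions, not the speaker's actual algorithm.

    # Minimal sketch: proximity graphs over traffic agents and a centrality-based
    # per-agent behavior feature; thresholds and features are assumptions.
    import networkx as nx

    def proximity_graph(positions, radius=10.0):
        """positions: {agent_id: (x, y)}; connect agents within `radius` metres."""
        g = nx.Graph()
        g.add_nodes_from(positions)
        ids = list(positions)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                (xa, ya), (xb, yb) = positions[a], positions[b]
                if ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 <= radius:
                    g.add_edge(a, b)
        return g

    def behavior_features(trajectory_frames, radius=10.0):
        """Per-agent closeness centrality averaged over time; a sharply rising
        centrality is the kind of cue one could associate with aggressive driving."""
        totals, counts = {}, {}
        for positions in trajectory_frames:  # one {agent_id: (x, y)} dict per step
            centrality = nx.closeness_centrality(proximity_graph(positions, radius))
            for agent, value in centrality.items():
                totals[agent] = totals.get(agent, 0.0) + value
                counts[agent] = counts.get(agent, 0) + 1
        return {agent: totals[agent] / counts[agent] for agent in totals}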

Bio:

Rohan Chandra is currently a postdoctoral researcher at the University of Texas, Austin, hosted by Dr. Joydeep Biswas. Rohan obtained his B.Tech from the Delhi Technological University, New Delhi in 2016 and completed his MS and PhD in 2018 and 2022 from the University of Maryland advised by Dr. Dinesh Manocha. His doctoral thesis focused on autonomous driving in dense, heterogeneous, and unstructured traffic environments. He is a UMD’20 Future Faculty Fellow, RSS’22 Pioneer, and a recipient of a UMD’20 summer research fellowship. He has published his work in top computer vision and robotics conferences (CVPR, ICRA, IROS) and has interned at NVIDIA in the autonomous driving team. He has served on the program committee of leading conferences in robotics, computer vision, artificial intelligence, and machine learning.


From 2018

Unsupervised Representation Learning


Maneesh Kumar Singh

 

Date : 28/11/2018

 

Abstract:

Supervised machine learning using deep neural networks has shown tremendous success on a variety of tasks in machine learning. However, supervised learning on each individual task is neither scalable nor the only way to create world models of visual (and other) phenomena and to do inference on them. Recent research thrusts aim at mitigating and surmounting such problems. The learning of static world models is also called representation learning. Unsupervised representation learning seeks to create structured latent representations to avoid the onerous need to generate supervisory labels and to enable learning of task-independent (universal) representations. In this talk, I will provide an overview of recent efforts in my group at Verisk, carried out in collaboration with various academic partners, on these topics. Most of this work is available in recent publications and accompanying code online.

Bio:

Dr. Singh is the Head of Verisk | AI and the Director of the Human and Computation Intelligence Lab at Verisk. He leads the R&D efforts for the development of AI and machine learning technologies in a variety of areas including computer vision, natural language processing and speech understanding. Verisk Analytics builds tools for risk assessment, risk forecasting and decision analytics in a variety of sectors including insurance, financial services, energy, government and human resources.

From 2013 to 2015, Dr. Singh was a Technology Leader in the Center for Vision Technologies at SRI International, Princeton, NJ. At SRI, he was the technical lead for the DARPA Visual Media Reasoning (VMR) project for Automatic Performance Characterization and led the development and implementation of efficient Pareto-optimal performance curves and a multithreaded APC system for benchmarking more than 40 CV and ML algorithms. Dr. Singh was the Algorithms Lead for the DARPA CwC CHAPLIN project for designing a human-computer collaboration (HCC) system to enable composition of visual narratives (cartoon strips, movies) with effective collaboration between a human actor and the computer. He was also a key performer on the DARPA DTM (Deep Temporal Models) seedling project for designing deep learning algorithms on video data.

Previously, Dr. Singh was a Staff Scientist at Siemens Corporate Technology, Princeton, NJ until 2013. At Siemens, he led and contributed to a large number of projects for the successful development and deployment of computer vision and machine learning technologies in multi-camera security and surveillance, aerial surveillance, advanced driver assistance and intelligent traffic control; industrial inspection; and medical image processing and patient diagnostics.

Dr. Singh received his Ph.D. in Electrical and Computer Engineering from the University of Illinois in 2003. He has authored over 35 publications and 15 U.S. and international patents.


Visual recognition of human communications


Dr Joon Son Chung

 

Date : 05/09/2018

 

Abstract:

The objective of this work is visual recognition of human communications. Solving this problem opens up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech, but most importantly it helps to advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communications.
Training a deep learning algorithm requires a lot of training data. We propose a method to automatically collect, process and generate a large-scale audio-visual corpus from television videos temporally aligned with the transcript. To build such a dataset, it is essential to know 'who' is speaking 'when'. We develop a ConvNet model that learns a joint embedding of the sound and the mouth images from unlabeled data, and apply this network to the tasks of audio-to-video synchronization and active speaker detection. We also show that the methods developed here can be extended to the problem of generating talking faces from audio and still images, and re-dubbing videos with audio samples from different speakers.
We then propose a number of deep learning models that are able to recognize visual speech at the sentence level. The lip reading performance beats a professional lip reader on videos from BBC television. We demonstrate that if audio is available, then visual information helps to improve speech recognition performance. We also propose methods to enhance noisy audio and to resolve multi-talker simultaneous speech using visual cues.
Finally, we explore the problem of speaker recognition. Whereas previous works for speaker identification have been limited to constrained conditions, here we build a new large-scale speaker recognition dataset collected from 'in the wild' videos using an automated pipeline. We propose a number of ConvNet architectures that outperform traditional baselines on this dataset.
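As a concrete illustration of the joint audio-visual embedding mentioned above, here is a minimal two-stream sketch in which in-sync audio and mouth-crop clips are pulled together and out-of-sync pairs pushed apart; the layer sizes, input shapes, and contrastive margin are illustrative assumptions, not the exact architecture from this work.

    # Minimal two-stream sketch of a joint audio/mouth embedding trained with a
    # contrastive synchronization objective; all dimensions are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoStreamSync(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            # audio stream: a short MFCC clip (13 x 20) -> embedding
            self.audio = nn.Sequential(nn.Flatten(), nn.Linear(13 * 20, 256),
                                       nn.ReLU(), nn.Linear(256, dim))
            # visual stream: a stack of 5 grayscale 48x48 mouth crops -> embedding
            self.video = nn.Sequential(nn.Flatten(), nn.Linear(5 * 48 * 48, 256),
                                       nn.ReLU(), nn.Linear(256, dim))

        def forward(self, mfcc, mouths):
            return (F.normalize(self.audio(mfcc), dim=-1),
                    F.normalize(self.video(mouths), dim=-1))

    def sync_loss(a, v, in_sync, margin=0.5):
        """Contrastive loss: small distance if in sync, at least `margin` otherwise."""
        d = (a - v).norm(dim=-1)
        return (in_sync * d.pow(2) + (1 - in_sync) * F.relu(margin - d).pow(2)).mean()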

Bio:

Joon Son is a recent graduate from the Visual Geometry Group at the University of Oxford, and a research scientist at Naver Corp. His research interests are in computer vision and machine learning.


Visual recognition of human communications


Triantafyllos Afouras

 

Date : 05/09/2018

 

Abstract:

The objective of this work is visual recognition of human communications. Solving this problem opens up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech, but most importantly it helps to advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communications.
Training a deep learning algorithm requires a lot of training data. We propose a method to automatically collect, process and generate a large-scale audio-visual corpus from television videos temporally aligned with the transcript. To build such a dataset, it is essential to know 'who' is speaking 'when'. We develop a ConvNet model that learns a joint embedding of the sound and the mouth images from unlabeled data, and apply this network to the tasks of audio-to-video synchronization and active speaker detection. We also show that the methods developed here can be extended to the problem of generating talking faces from audio and still images, and re-dubbing videos with audio samples from different speakers.
We then propose a number of deep learning models that are able to recognize visual speech at the sentence level. The lip reading performance beats a professional lip reader on videos from BBC television. We demonstrate that if audio is available, then visual information helps to improve speech recognition performance. We also propose methods to enhance noisy audio and to resolve multi-talker simultaneous speech using visual cues.
Finally, we explore the problem of speaker recognition. Whereas previous works for speaker identification have been limited to constrained conditions, here we build a new large-scale speaker recognition dataset collected from 'in the wild' videos using an automated pipeline. We propose a number of ConvNet architectures that outperform traditional baselines on this dataset.

Bio:

Triantafyllos is a second-year PhD student in the Visual Geometry Group at the University of Oxford, under the supervision of Prof. Andrew Zisserman. He currently researches computer vision for understanding human communication, which includes lip reading, audio-visual speech recognition and enhancement, and body language modeling.


Making sense of Urban Data: Experiences from interdisciplinary research studies


Dr Manik Gupta

 

Date : 03/08/2018

 

Abstract:

In this talk, I present an overview of the different research projects I have worked on during my research career, beginning with my PhD on air pollution monitoring, through my first postdoc on wheelchair accessibility and my second postdoc on building energy management data, to my current research grants on ear protection in noisy environments.
Finally, I summarize my learnings and future directions, pointing towards unlimited opportunities to make a difference to millions of lives in this very exciting new era of data and more data.

Bio:

Dr Manik Gupta is a lecturer (assistant professor) in the Computer Science and Informatics division at London South Bank University (LSBU), London, UK. She received her PhD from Queen Mary University of London (QMUL) in Wireless Sensor Networks (WSN)/Internet of Things (IoT). Her research interests include Data Science for IoT, with a focus on real-world data collection and data quality issues, knowledge discovery from large sensor datasets, IoT data processing using time series data mining and distributed analytics, and applications in environmental monitoring and wellbeing.
She has extensive experience in IoT sensor deployments and real-world data collection studies for EPSRC-funded research projects while at QMUL and University College London (UCL) on urban air pollution monitoring and wheelchair accessibility. Her most recent engagement as a postdoc at LSBU has been on an Innovate UK research project dealing with the application of topological data analysis and machine learning techniques to building energy management data. She is currently the co-investigator on an Innovate UK-funded research grant and a knowledge transfer partnership on intelligent ear protection to address occupational hearing loss in heavy industry.


Semantic Representation and Analysis of E-commerce Orders


Dr. Arijit Biswas

University of Maryland

Date : 29/01/2018

 

Abstract:

E-commerce websites such as Amazon, Alibaba, and Walmart typically process billions of orders every year. Semantic representation and understanding of these orders is extremely critical for an e-commerce company. Each order can be represented as a tuple of <customer, product, price, date>. In this talk, I will describe two of our recent works: (i) product embedding using MRNet-Product2Vec and (ii) generating fake orders using eCommerceGAN.

MRNet-Product2Vec [ECML-PKDD 2017]: In this work, we propose an approach called MRNet-Product2Vec for creating generic embeddings of products within an e-commerce ecosystem. We learn a dense and low-dimensional embedding where a diverse set of signals related to a product are explicitly injected into its representation. We train a Discriminative Multi-task Bidirectional Recurrent Neural Network (RNN), where the input is a product title fed through a Bidirectional RNN and at the output, product labels corresponding to fifteen different tasks are predicted.
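As a rough picture of the multi-task setup described above, the sketch below encodes a product title with a bidirectional LSTM and attaches one classification head per auxiliary task, keeping the pooled hidden state as the product embedding; the vocabulary size, dimensions, and task layout are assumptions, not the published MRNet-Product2Vec configuration.

    # Minimal sketch of a multi-task bidirectional RNN over product titles; all
    # sizes and the fifteen assumed binary tasks are illustrative.
    import torch
    import torch.nn as nn

    class MultiTaskProduct2Vec(nn.Module):
        def __init__(self, vocab=50_000, emb=128, hidden=128,
                     task_classes=(2,) * 15):  # fifteen tasks, assumed binary here
            super().__init__()
            self.embed = nn.Embedding(vocab, emb, padding_idx=0)
            self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
            self.heads = nn.ModuleList(nn.Linear(2 * hidden, c) for c in task_classes)

        def forward(self, title_tokens):               # (batch, seq_len) token ids
            h, _ = self.rnn(self.embed(title_tokens))  # (batch, seq_len, 2*hidden)
            product_vec = h.mean(dim=1)                # the reusable product embedding
            return product_vec, [head(product_vec) for head in self.heads]

    # Training would sum a cross-entropy loss over the task heads; at inference,
    # only `product_vec` is kept as the generic product representation.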

eCommerceGAN: Exploring the space of all plausible orders could help us better understand the relationships between the various entities in an e-commerce ecosystem, namely the customers and the products they purchase. In this paper, we propose a Generative Adversarial Network (GAN) for orders made on e-commerce websites. Once trained, the generator in the GAN can generate any number of plausible orders. Our contributions include: (a) creating a dense and low-dimensional representation of e-commerce orders, (b) training an ecommerceGAN (ecGAN) with real orders to show the feasibility of the proposed paradigm, and (c) training an ecommerce-conditional-GAN (ec^2GAN) to generate plausible orders involving a particular product. We propose several qualitative methods to evaluate ecGAN and demonstrate its effectiveness.

 

Bio:

Arijit Biswas is currently a machine learning scientist on the India Machine Learning team at Amazon, Bangalore. His research interests are mainly in deep learning, machine learning and computer vision. Earlier, he was a research scientist at Xerox Research Centre India (XRCI) from June 2014 to July 2016. He received his PhD in Computer Science from the University of Maryland, College Park in April 2014. His PhD thesis was on Semi-supervised and Active Learning Methods for Image Clustering. His thesis advisor was David Jacobs, and he closely collaborated with Devi Parikh and Peter Belhumeur during his stay at UMD. While doing his PhD, Arijit also did internships at Xerox PARC and the Toyota Technological Institute at Chicago (TTIC). He has published papers in CVPR, ECCV, ACM-MM, BMVC, IJCV and CVIU. Arijit has a Bachelor's degree in Electronics and Telecommunication Engineering from Jadavpur University, Kolkata. Arijit is also a recipient of the MIT Technology Review Innovators Under 35 award from India in 2016.


 
