Towards Visual 3D Scene Understanding and Prediction for ADAS
NEC Labs America
Date : 23/12/2016
Modern advanced driver assistance systems (ADAS) rely on a range of sensors including radar, ultrasound, cameras and LIDAR. Active sensors such as radar are primarily used for detecting traffic participants (TPs) and measuring their distance. More expensive LIDAR sensors are used for estimating both traffic participants and scene elements (SEs). However, camera-based systems have the potential to achieve the same capabilities at a much lower cost, while enabling new ones such as determining TP and SE types as well as their interactions in complex traffic scenes.
In this talk, we present several technological developments for ADAS. A common theme is to overcome challenges posed by the lack of large-scale annotations in deep learning frameworks. We introduce approaches to correspondence estimation that are trained on purely synthetic data but adapt well to real data at test time. Posing the problem in a metric learning framework with fully convolutional architectures allows estimation accuracies that surpass other state-of-the-art methods by large margins. We introduce object detectors that are light enough for ADAS, trained with knowledge distillation to retain the accuracies of deeper architectures. Our semantic segmentation methods are trained with weak supervision that requires only a tenth of conventional annotation time. We propose methods for 3D reconstruction that use deep supervision to recover fine object part locations, but rely on purely synthetic 3D CAD models. Further, we develop generative adversarial frameworks for reconstruction that alleviate the need to even align the 3D CAD models with images at train time. Finally, we present a framework for TP behavior prediction in complex traffic scenes that utilizes the above as inputs to predict future trajectories that fully account for TP-TP and TP-SE interactions. Our approach allows prediction of diverse uncertain outcomes and is trained with inverse optimal control to predict long-term strategic behaviors in complex scenes.
Manmohan Chandraker is an assistant professor at the CSE department of University of California, San Diego and leads the computer vision research effort at NEC Labs America in Cupertino. He received a B.Tech. in Electrical Engineering at the Indian Institute of Technology, Bombay and a PhD in Computer Science at the University of California, San Diego. His principal research interests are sparse and dense 3D reconstruction, including structure-from-motion, 3D scene understanding and dense modeling under complex illumination or material behavior, with applications to autonomous driving, robotics or human-computer interfaces. His work on provably optimal algorithms for structure and motion estimation received the Marr Prize Honorable Mention for Best Paper at ICCV 2007, the 2009 CSE Dissertation Award for Best Thesis at UC San Diego and was a nominee for the 2010 ACM Dissertation Award. His work on shape recovery from motion cues for complex material and illumination received the Best Paper Award at CVPR 2014.
Learning without exhaustive supervision
Carnegie Mellon University
Date : 15/12/2016
Recent progress in visual recognition can be attributed to large datasets and high capacity learning models. Sadly, these data hungry models tend to be supervision hungry as well. In my research, I focus on algorithms that learn from large amounts of data without exhaustive supervision. The key approach to make algorithms "supervision efficient" is to exploit the structure or prior properties available in the data or labels, and model it in the learning algorithm. In this talk, I will focus on three axes to get around having exhaustive annotations: 1) finding structure and natural supervision in data to reduce the need for manual labels: unsupervised and semi-supervised learning from video [ECCV'16, CVPR'15]; 2) sharing information across tasks so that tasks which are easier or "free" to label can help other tasks [CVPR'16]; and 3) finding structure in the labels and the labeling process so that one can utilize labels in the wild [CVPR'16].
Ishan Misra is a PhD student at Carnegie Mellon University, working with Martial Hebert and Abhinav Gupta. His research interests are in Computer Vision and Machine Learning, particularly in visual recognition. Ishan got his BTech in Computer Science from IIIT-Hyderabad where he worked with PJ Narayanan. He got the Siebel fellowship in 2014, and has spent two summers as an intern at Microsoft Research, Redmond.
Understanding Stories by Joint Analysis of Language and Vision
Date : 14/09/2016
Humans spend a large amount of time listening to, watching, and reading stories. We argue that the ability to model, analyze, and create new stories is a stepping stone towards strong AI. We thus work on teaching AI to understand stories in films and TV series. To obtain a holistic view of the story, we align videos with novel sources of text such as plot synopses and books. Plots contain a summary of the core story and allow us to obtain a high-level overview. In contrast, books provide rich details about characters, scenes and interactions, allowing us to ground visual information in corresponding textual descriptions. We also work on testing machine understanding of stories by asking it to answer questions. To this end, we create a large benchmark dataset of almost 15,000 questions from 400 movies and explore its characteristics with several baselines.
Makarand Tapaswi received his undergraduate education from NITK Surathkal in Electronics and Communications Engineering. Thereafter he pursued an Erasmus Mundus Masters program in Information and Communication Technologies from UPC Barcelona and KIT Germany. He continued with the Computer Vision lab at Karlsruhe Institute of Technology in Germany and recently completed his PhD. He will be going to University of Toronto as a post-doctoral fellow starting in October.
Computer Vision @ Facebook
Facebook AI Research
Date : 26/07/2016
Over the past 5 years the community has made significant strides in the field of Computer Vision. Thanks to large-scale datasets, specialized computing in the form of GPUs, and many breakthroughs in modeling better convnet architectures, Computer Vision systems in the wild at scale are becoming a reality. At Facebook AI Research we want to embark on the journey of making breakthroughs in the field of AI and using them for the benefit of connecting people and helping remove barriers to communication. In that regard, Computer Vision plays a significant role, as the media content coming to Facebook is ever increasing and building models that understand this content is crucial in achieving our mission of connecting everyone. In this talk I will give an overview of how we think about problems related to Computer Vision at Facebook and touch on various aspects related to supervised, semi-supervised, and unsupervised learning. I will jump between various research efforts involving representation learning and predictive learning as well. I will highlight some large-scale applications and talk about limitations of current systems throughout the talk to motivate tackling these problems.
Object detection methods for Common Objects in Context
Facebook AI Research
Date : 19/01/2016
In this talk, we discuss this year's COCO object detection challenge and Facebook's entry, which placed 2nd in the competition using deep convnets and specially designed cost functions. We shall also discuss the DeepMask segmentation system that forms the core of our detection pipeline. http://mscoco.org/dataset/#detections-leaderboard
Bayesian Nonparametric Modeling of Temporal Coherence for Entity-driven Video Analytics
PhD, IISc Bangalore
Date : 19/01/2016
With the advent of video-sharing sites like YouTube, online user-generated video content is increasing very rapidly. To simplify the search for meaningful information in such a huge volume of content, Computer Vision researchers have started to work on problems like Video Summarization and Scene Discovery from videos. People understand videos based on high-level semantic concepts. But most of the current research in video analytics makes use of low-level features and descriptors, which may not have a semantic interpretation. We have aimed to fill this gap by modeling implicit structural information about videos, such as spatiotemporal properties. We have represented videos as a collection of semantic visual concepts which we call “entities”, such as persons in a movie. To aid these tasks, we have attempted to model the important property of “temporal coherence”, which means that adjacent frames are likely to have similar visual features and contain the same set of entities. Bayesian nonparametrics is a natural way of modeling all of these, but they have also given rise to the need for new models and algorithms.
Recent Advances in the Field of Missing Data: Missingness Graphs, Recoverability and Testability
Date : 06/01/2016
The talk will discuss recent advances in the field of missing data, including: 1) the graphical representation called "Missingness Graph" that portrays the causal mechanisms responsible for missingness, 2) the notion of recoverability, i.e. deciding whether there exists a consistent estimator for a given query, 3) graphical conditions (necessary and sufficient) for recovering joint and conditional distributions and algorithms for detecting these conditions in the missingness graph, 4) the question of testability, i.e. whether an assumed model can be subjected to statistical tests, considering the missingness in the data, and 5) the indispensability of causal assumptions for large sets of missing data problems.