I-Do, You-Learn: Techniques for Unsupervised Procedure Learning using Egocentric Videos

Siddhant Bansal


Consider an autonomous agent that can watch multiple humans make a pizza and then make one itself the next time! Motivated by the goal of creating systems capable of understanding and reasoning about instructions at the human level, this thesis tackles procedure learning: identifying the key-steps of a task and determining the logical order in which to perform them. The first portion of the thesis focuses on the datasets curated for procedure learning. Existing datasets commonly consist of third-person videos, in which the manipulated object appears small and is often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. To this end, for studying procedure learning from egocentric videos, we propose the EgoProceL dataset. However, procedure learning from egocentric videos is challenging because the wearer's head motion causes extreme changes in the camera view and introduces unrelated frames. As a result, the assumption made by current state-of-the-art methods, that actions occur at approximately the same time and are of the same duration, does not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework that identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. We perform experiments on the benchmark ProceL and CrossTask datasets and achieve state-of-the-art results. In the second portion of the thesis, we look at various approaches to generate the signal for learning the embedding space. Existing approaches use only one or a couple of videos for this purpose.
However, we argue that this makes key-step discovery challenging, as the algorithms lack an inter-videos perspective. To this end, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph, which represents all the videos of a task as a single graph to obtain both intra-video and inter-videos context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using KMeans. We test GPL on the benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement over the state of the art of 2% on the third-person datasets and 3.6% on EgoProceL. We hope this work motivates future research on procedure learning from egocentric videos. Furthermore, the unsupervised approaches proposed in the thesis will help create scalable systems and drive future research toward creative solutions.
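The GPL pipeline described above (a graph over all videos of a task, node embeddings updated without supervision, and KMeans to surface the key-steps) can be sketched as a toy example. This is a minimal illustration under stated assumptions, not the thesis implementation: the hand-built adjacency matrix stands in for UnityGraph, a random-walk co-occurrence SVD stands in for Node2Vec, and the two-video/two-key-step setup, graph sizes, and hyperparameters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption, for illustration only): 2 videos, 6 frames each;
# in both videos, frames 0-2 show key-step A and frames 3-5 show key-step B.
# Global node id = video_index * n_frames + frame_index.
n_videos, n_frames = 2, 6
N = n_videos * n_frames

# UnityGraph-style adjacency, simplified to unweighted edges:
# intra-video edges link consecutive frames; inter-videos edges link
# temporally corresponding frames (which share a key-step here).
A = np.zeros((N, N))
for v in range(n_videos):
    for f in range(n_frames - 1):
        i, j = v * n_frames + f, v * n_frames + f + 1
        A[i, j] = A[j, i] = 1            # intra-video temporal edge
for f in range(n_frames):
    i, j = f, n_frames + f
    A[i, j] = A[j, i] = 1                # inter-videos correspondence edge

def random_walks(adj, walks_per_node=20, walk_len=8):
    """Uniform random walks; a crude stand-in for Node2Vec's biased walks."""
    walks = []
    for start in range(len(adj)):
        for _ in range(walks_per_node):
            walk, cur = [start], start
            for _ in range(walk_len - 1):
                cur = rng.choice(np.flatnonzero(adj[cur]))
                walk.append(int(cur))
            walks.append(walk)
    return walks

# Co-occurrence counts from the walks (window of 1 for brevity),
# factorized with SVD to get node embeddings.
C = np.zeros((N, N))
for walk in random_walks(A):
    for a, b in zip(walk, walk[1:]):
        C[a, b] += 1
        C[b, a] += 1
U, S, _ = np.linalg.svd(np.log1p(C))
emb = U[:, :4] * S[:4]                   # 4-d embedding per node

def kmeans(X, k, iters=50):
    """Minimal KMeans: assign to nearest center, recompute means."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

# Cluster embeddings; labels[i] is the discovered key-step id of node i.
labels = kmeans(emb, k=2)
```

The inter-videos edges are what distinguish this from per-video clustering: frames of the same key-step in different videos co-occur on walks, so they land near each other in the embedding space and fall into the same cluster.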

Year of completion: February 2023
Advisors: C V Jawahar, Chetan Arora

Related Publications