Unified 2D-3D Vision-Language Models for Long-Horizon Embodied Perception

Abstract:

Vision-language models typically tokenize RGB frames into dense 2D patches and apply all-to-all attention, which scales poorly for long-horizon embodied perception: the token count grows with every frame observed. This talk argues for operating over compact 3D scene representations instead, so that model complexity scales with scene content rather than with raw sensory input. Ayush Jain will present unified 2D-3D VLMs that leverage abundant 2D data while learning 3D-aware representations, achieving strong results on tasks such as segmentation, referential grounding, and VQA. He will also discuss extensions to dynamic 3D scenes and to video understanding without explicit 3D sensing.
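To make the scaling argument concrete, below is a minimal back-of-the-envelope sketch in Python (not from the talk; the patch-grid size, scene-token budget, and the attention_pairs helper are all illustrative assumptions). It contrasts the all-to-all attention cost when every frame contributes dense 2D patch tokens against a compact 3D scene representation whose token count is bounded by scene content rather than by the horizon length.

# Illustrative scaling sketch; all numbers are hypothetical assumptions.
# All-to-all self-attention over N tokens costs O(N^2) pairwise interactions.

def attention_pairs(num_tokens: int) -> int:
    """Pairwise interaction count for all-to-all attention over num_tokens."""
    return num_tokens * num_tokens

PATCHES_PER_FRAME = 576  # assumed: a 24x24 patch grid per RGB frame
SCENE_TOKENS = 2_000     # assumed: fixed budget of compact 3D scene tokens

for num_frames in (8, 64, 512):
    # Dense 2D tokenization: tokens accumulate linearly with the horizon,
    # so attention cost grows quadratically with the number of frames.
    dense_tokens = num_frames * PATCHES_PER_FRAME
    # Compact 3D representation: new frames update the scene tokens in
    # place, so cost stays bounded by scene content, not by the horizon.
    print(f"{num_frames:>4} frames | dense 2D: {attention_pairs(dense_tokens):.2e} pairs"
          f" | 3D scene: {attention_pairs(SCENE_TOKENS):.2e} pairs")

Under these assumed numbers, the dense 2D path reaches hundreds of billions of attention pairs at 512 frames, while the scene-token path stays constant; the crossover in favor of the 3D representation arrives as soon as the accumulated patch tokens exceed the scene-token budget.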

Speaker bio:

Ayush Jain is advised by Dr. Katerina Fragkiadaki at CMU. His work on unified 2D-3D vision-language models has been published at CVPR, ECCV, RSS, NeurIPS, and ICML, including multiple spotlight presentations, and he has received several outstanding reviewer awards. He has also interned at Apple ML Research and Meta (FAIR and Reality Labs).