What Do Generative Image Models Know? Understanding their Latent Grasp of Reality and Limitations
Abstract:
Recent generative image models like StyleGAN, Stable Diffusion, and VQGANs have achieved remarkable results, producing images that closely mimic real-world scenes. However, what they actually understand about visual information remains unclear. My research explores this question, asking whether these models operate on purely abstract representations or possess an understanding akin to traditional rendering principles. My findings reveal that these models encode a wide range of scene attributes, such as lighting, albedo, depth, and surface normals, suggesting a nuanced understanding of image composition and structure. Despite this, they exhibit significant shortcomings in depicting projective geometry and shadow consistency, often misrepresenting spatial relationships and light interactions, which leads to subtle visual anomalies.
This talk aims to illuminate the complex narrative of what generative image models truly understand about the visual world, highlighting their capabilities and limitations. I’ll conclude with insights into the broader implications of these findings and discuss potential directions for enhancing the realism and utility of generative imagery in applications demanding high fidelity and physical plausibility.
Bio:
Anand Bhattad is a Research Assistant Professor at the Toyota Technological Institute at Chicago (TTIC). Before this role, he earned his PhD from the University of Illinois Urbana-Champaign (UIUC), working with his advisor David Forsyth. His primary research interests lie in computer vision, with a specific focus on knowledge in generative models and its applications to computer vision, computer graphics, and computational photography. His recent work, DIVeR, received a best paper nomination at CVPR 2022. He was listed as an Outstanding Reviewer at ICCV 2023 and an Outstanding Emergency Reviewer at CVPR 2021.
