TALK    [MERL Seminar Series 2022] Prof. Vincent Sitzmann presents a talk titled Self-Supervised Scene Representation Learning

Date released: March 30, 2022


  • (Learn more about the MERL Seminar Series.)
  • Date & Time:

    Wednesday, March 30, 2022; 11:00 AM EDT

  • Abstract:

    Given only a single picture, people are capable of inferring a mental representation that encodes rich information about the underlying 3D scene. We acquire this skill not through massive labeled datasets of 3D scenes, but through self-supervised observation and interaction. Building machines that can infer similarly rich neural scene representations is critical if they are to one day parallel people’s ability to understand, navigate, and interact with their surroundings. This poses a unique set of challenges that sets neural scene representations apart from conventional representations of 3D scenes: Rendering and processing operations need to be differentiable, and the type of information they encode is unknown a priori, requiring them to be extraordinarily flexible. At the same time, training them without ground-truth 3D supervision is an underdetermined problem, highlighting the need for structure and inductive biases without which models converge to spurious explanations.

    I will demonstrate how we can equip neural networks with inductive biases that enable them to learn 3D geometry, appearance, and even semantic information, self-supervised only from posed images. I will show how this approach unlocks the learning of priors, enabling 3D reconstruction from only a single posed 2D image, and how we may extend these representations to other modalities such as sound. I will then discuss recent work on learning the neural rendering operator to make rendering and training fast, and how this speed-up enables us to learn object-centric neural scene representations, learning to decompose 3D scenes into objects given only images. Finally, I will talk about a recent application of self-supervised scene representation learning in robotic manipulation, where it enables us to learn to manipulate classes of objects in unseen poses from only a handful of human demonstrations; a small illustrative sketch of the underlying self-supervised recipe follows below.
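
    The snippet below is a minimal, illustrative sketch (in PyTorch) of the basic recipe this line of work builds on, not the speaker's actual code or architecture: a neural field queried along camera rays and composited with a differentiable volume renderer, so that 3D geometry and appearance are supervised only by a photometric loss against posed 2D images. Names such as SceneField, render_rays, and train_step are assumptions made for the example.

        # Minimal sketch (illustrative only): fitting a neural scene representation
        # to posed images via differentiable volume rendering. The only supervision
        # is a photometric loss on rendered pixel colors; no 3D labels are used.
        import torch
        import torch.nn as nn

        class SceneField(nn.Module):
            """MLP mapping a 3D point to RGB color and volume density."""
            def __init__(self, hidden=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(3, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 4),  # (r, g, b, density)
                )

            def forward(self, xyz):
                out = self.net(xyz)
                rgb = torch.sigmoid(out[..., :3])
                sigma = torch.relu(out[..., 3])
                return rgb, sigma

        def render_rays(field, origins, dirs, near=0.5, far=3.0, n_samples=64):
            """Differentiable volume rendering: composite samples along each ray."""
            t = torch.linspace(near, far, n_samples, device=origins.device)
            pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]  # (R, S, 3)
            rgb, sigma = field(pts)
            delta = (far - near) / n_samples
            alpha = 1.0 - torch.exp(-sigma * delta)                          # (R, S)
            trans = torch.cumprod(torch.cat(
                [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
                dim=-1)[:, :-1]
            weights = alpha * trans
            return (weights[..., None] * rgb).sum(dim=1)                     # (R, 3)

        def train_step(field, optimizer, ray_origins, ray_dirs, target_rgb):
            """One gradient step: rendered colors vs. observed pixel colors."""
            optimizer.zero_grad()
            pred = render_rays(field, ray_origins, ray_dirs)
            loss = ((pred - target_rgb) ** 2).mean()
            loss.backward()
            optimizer.step()
            return loss.item()

    Because every operation above is differentiable, gradients from the pixel-space loss flow back into the scene representation itself; the talk's themes (learning priors across scenes, faster learned rendering operators, object-centric decompositions, and manipulation from demonstrations) build on and extend this basic idea.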


  • Speaker:

    Vincent Sitzmann
    MIT

    Vincent is an incoming Assistant Professor at MIT EECS, where he will be leading the Scene Representation Group. Currently, he is a postdoc at MIT's CSAIL with Josh Tenenbaum, Bill Freeman, and Fredo Durand. Previously, he completed his Ph.D. at Stanford University. His research interest lies in neural scene representations - the way neural networks learn to represent information about our world. His goal is to allow independent agents to reason about our world given visual observations, such as inferring a complete model of a scene with information about geometry, materials, lighting, etc. from only a few observations, a task that is simple for humans but currently impossible for AI.

  • Research Areas:

    Artificial Intelligence, Computer Vision, Machine Learning