Learning to see the physical world



Bibliographic Details
Main Author: Wu, Jiajun, Ph.D., Massachusetts Institute of Technology.
Other Authors: William T. Freeman and Joshua B. Tenenbaum.
Format: Others
Language: English
Published: Massachusetts Institute of Technology, 2020
Subjects:
Online Access: https://hdl.handle.net/1721.1/128332
Description
Summary: Thesis: Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020. Cataloged from PDF of thesis. Includes bibliographical references (pages 271-303).

Human intelligence goes beyond pattern recognition. From a single image, we can explain what we see, reconstruct the scene in 3D, predict what is going to happen, and plan our actions accordingly. Despite its phenomenal development over the past decade, artificial intelligence, and deep learning in particular, still falls short of human intelligence in key respects: such systems generally tackle specific problems, require large amounts of training data, and break easily when generalizing to new tasks or environments. This dissertation studies the problem of physical scene understanding: building versatile, data-efficient, and generalizable machines that learn to see, reason about, and interact with the physical world. The core idea is to exploit the generic, causal structure behind the world, including knowledge from computer graphics, physics, and language, in the form of approximate simulation engines, and to integrate them with deep learning.

Here, learning plays a multifaceted role: models may learn to invert simulation engines for efficient inference, and they may also learn to approximate or augment simulation engines for more powerful forward simulation. The dissertation consists of three parts, which investigate the use of such hybrid models for perception, dynamics modeling, and cognitive reasoning, respectively. In Part I, we use learning in conjunction with graphics engines to build an object-centered scene representation capturing object shape, pose, and texture. In Part II, in addition to graphics engines, we pair learning with physics engines to simultaneously infer physical object properties; we also explore learning approximate simulation engines for greater flexibility and expressiveness.

In Part III, we leverage and extend the models introduced in Parts I and II for concept discovery and cognitive reasoning by looping in a program execution engine. The enhanced models discover program-like structures in objects and scenes and, in turn, exploit them for downstream tasks such as visual question answering and scene manipulation.

by Jiajun Wu. Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.
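The abstract's central notion of "inverting a simulation engine" can be illustrated with a toy analysis-by-synthesis sketch. This is not code from the thesis: the 1-D `render` "graphics engine" and the brute-force `infer` routine are hypothetical stand-ins, showing only the idea that inference means searching for latent parameters whose forward rendering matches the observation (the thesis instead trains neural networks to amortize this inversion).

```python
# Toy analysis-by-synthesis sketch (illustrative only, not the thesis's method).
# A forward "graphics engine" renders a scene from latent parameters; inference
# inverts it by searching for parameters whose rendering matches the observation.

def render(position, width, length=10):
    """Forward model: draw an object of the given width at the given
    position on a 1-D grid of binary cells."""
    return [1 if position <= i < position + width else 0
            for i in range(length)]

def infer(observation, length=10):
    """Invert the renderer by exhaustive search: return the (position,
    width) whose rendering reproduces the observation exactly."""
    for position in range(length):
        for width in range(1, length - position + 1):
            if render(position, width, length) == observation:
                return position, width
    return None  # no parameter setting explains the observation

observed = render(3, 4)   # scene generated with position=3, width=4
print(infer(observed))    # recovers (3, 4)
```

Exhaustive search is tractable only for this two-parameter toy; for realistic graphics or physics engines, the search space is why learned (amortized) inversion becomes attractive.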