Learning to Plan via Deep Optimistic Value Exploration

Deep exploration requires coordinated long-term planning. We present a model-based reinforcement learning algorithm that guides policy learning through a value function that exhibits optimism in the face of uncertainty. We capture uncertainty over values by combining predictions from an ensemble of models and formulate an upper confidence bound (UCB) objective to recover optimistic estimates. Training the policy on ensemble rollouts with the learned value function as the terminal cost allows for projecting long-term interactions into a limited planning horizon, thus enabling deep optimistic exploration. We do not assume a priori knowledge of either the dynamics or reward function. We demonstrate that our approach can accommodate both dense and sparse reward signals, while improving sample complexity on a variety of benchmarking tasks.


Bibliographic Details
Main Authors: Seyde, Tim (Author), Schwarting, Wilko (Author), Karaman, Sertac (Author), Rus, Daniela L (Author)
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory (Contributor), Massachusetts Institute of Technology. Laboratory for Information and Decision Systems (Contributor)
Format: Article
Language: English
Published: 2020-05-11T19:59:29Z.
Online Access: Get fulltext
LEADER 01743 am a22002053u 4500
001 125161
042 |a dc 
100 1 0 |a Seyde, Tim  |e author 
100 1 0 |a Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory  |e contributor 
100 1 0 |a Massachusetts Institute of Technology. Laboratory for Information and Decision Systems  |e contributor 
700 1 0 |a Schwarting, Wilko  |e author 
700 1 0 |a Karaman, Sertac  |e author 
700 1 0 |a Rus, Daniela L  |e author 
245 0 0 |a Learning to Plan via Deep Optimistic Value Exploration 
260 |c 2020-05-11T19:59:29Z. 
856 |z Get fulltext  |u https://hdl.handle.net/1721.1/125161 
520 |a Deep exploration requires coordinated long-term planning. We present a model-based reinforcement learning algorithm that guides policy learning through a value function that exhibits optimism in the face of uncertainty. We capture uncertainty over values by combining predictions from an ensemble of models and formulate an upper confidence bound (UCB) objective to recover optimistic estimates. Training the policy on ensemble rollouts with the learned value function as the terminal cost allows for projecting long-term interactions into a limited planning horizon, thus enabling deep optimistic exploration. We do not assume a priori knowledge of either the dynamics or reward function. We demonstrate that our approach can accommodate both dense and sparse reward signals, while improving sample complexity on a variety of benchmarking tasks. Keywords: Reinforcement Learning; Deep Exploration; Model-Based; Value Function; UCB 
520 |a Office of Naval Research; Qualcomm; Toyota Research Institute 
655 7 |a Article 
773 |t Proceedings of Machine Learning Research
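The 520 abstract describes recovering optimistic value estimates through an upper confidence bound over an ensemble of value predictions, which then serves as the terminal cost for limited-horizon rollouts. A minimal Python sketch of that UCB step is given below; it is an illustration only, and the ensemble representation and the exploration coefficient beta are assumptions rather than the authors' implementation.

import numpy as np

def optimistic_value(value_ensemble, state, beta=1.0):
    # Illustrative sketch, not the authors' code: upper confidence bound (UCB)
    # over an ensemble of value estimates for a single state.
    # value_ensemble: iterable of callables, each mapping a state to a scalar value.
    # beta: hypothetical coefficient weighting the ensemble spread (optimism bonus).
    estimates = np.array([v(state) for v in value_ensemble])
    return estimates.mean() + beta * estimates.std()

In the setup the abstract describes, an estimate like this would be attached as the terminal value of ensemble rollouts over a limited planning horizon, so that optimism about uncertain long-term returns can steer the policy toward deep exploration.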