Summary: | Master's === National Chiao Tung University === Department of Electrical Engineering === 107 ===
Reinforcement learning generally learns through interaction with the environment. During the learning procedure, the agent interacts with the environment, acquires observations, decides on actions, and then receives rewards. Reinforcement learning is thus seen as a kind of trial-and-error learning for sequential decision-making problems, and it leverages the knowledge accumulated from interaction with the environment. Remarkable performance has been achieved in numerous practical systems by applying the many breakthroughs developed in recent years. With the help of deep learning, reinforcement learning is capable of dealing with complicated problems that were empirically intractable with conventional methods. For example, deep reinforcement learning algorithms have been successfully developed for building video game agents that learn directly from image pixels, performing robotic control directly from camera inputs, training Go agents that learn from self-play, and searching neural network architectures. In general, reinforcement learning methods fall into three types according to the updating procedure and the way decisions are made: the value-based agent, the policy-based agent, and the actor-critic agent. A value-based agent continuously updates only the state value estimator through the Bellman optimality operator. A policy-based agent directly updates the parameterized policy through policy optimization. An actor-critic agent updates not only the parameterized policy but also the state value estimator during learning. This thesis deals with the challenges in multi-goal actor-critic reinforcement learning and proposes a method that improves performance on the sparse reward problem by utilizing update information from trajectories, i.e., the histories of many successive states and actions. Distributional policy optimization is proposed not only to realize the concept of the value distribution, by estimating the distribution of the sum of discounted rewards so as to accommodate the uncertainty in exploring the environment, but also to apply it to multi-goal reinforcement learning with insufficient reward.
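To make the value-based updating concrete, here is a minimal runnable sketch (Python/NumPy) of the Bellman optimality operator applied to a tiny hand-made MDP; the transition and reward numbers are illustrative assumptions and are not taken from the thesis.

import numpy as np

# A tiny two-state, two-action MDP used only for illustration.
# P[s, a, s'] are transition probabilities and R[s, a] expected rewards;
# the numbers are arbitrary assumptions, not the thesis experiments.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[ 0.0, 1.0],
              [-1.0, 2.0]])
gamma = 0.9

# Value-based updating: repeatedly apply the Bellman optimality operator
#   (T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    V = np.max(R + gamma * (P @ V), axis=1)

print(V)  # the fixed point approximates the optimal state values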
In this study, an algorithm is adopted to estimate the value distribution, i.e., the distribution of the sum of discounted rewards. The expectation of this distribution is calculated as the value function for reinforcement learning. Basically, the value distribution provides a value representation of a state into which uncertainty information is merged. Such an algorithm performs well but suffers from the limitation of considering only the states and actions of two successive time steps. We present a distributional policy optimization in which an entire trajectory is taken into account to carry out the value distribution for policy optimization. On the other hand, we face the challenge of sparse reward in multi-goal reinforcement learning, where the environment involves multiple goals and accordingly the rewards corresponding to individual goals become sparse. An enhancement of hindsight experience replay is incorporated to deal with the sparse reward problem by measuring whether goals are close to one another. In addition, the issue of multiple goals is handled with an actor-critic method in which an actor network and a value distribution network are provided, and the updating is performed using trajectories. Such a framework provides a way to construct agents that achieve desirable performance. Experiments on the inverted pendulum environment from OpenAI Gym with one of our models, distributional proximal policy optimization, show that the value distribution can indeed be derived from the history of the agent's interaction with the environment. Compared to a standard actor-critic, which learns a deterministic value network estimating the expectation of the sum of discounted rewards, our model learns a value distribution from which inference can be made. In addition, we propose a modified model that leverages experience replay, called distributional actor-critic experience replay. Moreover, combined with hindsight experience replay, the last model, called distributional actor-critic hindsight experience replay, leverages the distance between two value distributions of the universal value function estimator as a criterion for the distance between trajectories that achieve different goals.
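The following PyTorch sketch illustrates how a universal value-distribution critic and a distance between two value distributions might be organized; the network sizes, the categorical atom support, and the use of the 1-Wasserstein distance are assumptions made for illustration, not the exact architecture or criterion of the proposed models.

import torch
import torch.nn as nn

class ValueDistributionNet(nn.Module):
    """Categorical value-distribution critic: given a state and a goal, output
    a probability mass over fixed return atoms; its expectation serves as the
    scalar value estimate. All sizes here are illustrative assumptions."""
    def __init__(self, state_dim, goal_dim, n_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.atoms = torch.linspace(v_min, v_max, n_atoms)  # support of the return distribution
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 128), nn.ReLU(),
            nn.Linear(128, n_atoms),
        )

    def forward(self, state, goal):
        logits = self.net(torch.cat([state, goal], dim=-1))
        probs = torch.softmax(logits, dim=-1)       # value distribution over return atoms
        value = (probs * self.atoms).sum(dim=-1)    # expectation used as V(s, g)
        return probs, value

def distribution_distance(p, q, atoms):
    """1-Wasserstein distance between two categorical distributions on a common
    support; one possible criterion for judging whether two goals (and hence
    their relabeled trajectories) are close, standing in for the distance
    criterion described above."""
    cdf_gap = torch.cumsum(p - q, dim=-1)
    return (cdf_gap.abs() * (atoms[1] - atoms[0])).sum(dim=-1)

# Usage sketch with assumed dimensions (e.g., a low-dimensional control task).
critic = ValueDistributionNet(state_dim=4, goal_dim=2)
state = torch.randn(1, 4)
p1, v1 = critic(state, torch.tensor([[0.0, 0.0]]))
p2, v2 = critic(state, torch.tensor([[1.0, 1.0]]))
goal_gap = distribution_distance(p1, p2, critic.atoms)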
|