Topics in Online Markov Decision Processes

This dissertation studies sequential decision-making problems in non-stationary environments. Online learning algorithms are designed for non-stationary environments, but they generally lack a notion of a dynamic state that models the future impact of past actions. State-based models are standard in stochastic control, but well-known frameworks such as Markov decision processes (MDPs) assume a known, stationary environment. In recent years there has been growing interest in fusing these two learning frameworks by considering an MDP setting in which the cost function is allowed to change arbitrarily over time. A number of online MDP algorithms have been designed to work under various assumptions about the state-transition dynamics and to provide performance guarantees, i.e., bounds on the regret, defined as the gap between the total cost incurred by the learner and the total cost of the best stationary policy that could have been chosen in hindsight.

However, most work in this area has been algorithmic: given a problem, one develops an algorithm almost from scratch and proves its performance guarantees on a case-by-case basis. Moreover, the presence of the state and the assumption of an arbitrarily varying environment complicate both the theoretical analysis and the development of computationally efficient methods. Another potential issue is that, by removing distributional assumptions about the mechanism generating the cost sequences, existing methods must plan against the worst case, which may render their solutions too conservative when the environment exhibits some degree of predictability.

This dissertation contributes several novel techniques that address these challenges of the online MDP framework and opens up new research directions for online MDPs.

The proposed general framework for deriving algorithms in the online MDP setting leads to a unifying view of existing methods and provides a general procedure for constructing new ones. Several new algorithms are developed and analyzed within this framework. We develop convex-analytic algorithms that exploit possible regularity in the observed cost sequences while maintaining worst-case performance guarantees. To study these convex-analytic methods further, we step back to the traditional MDP problem and extend the linear-programming (LP) approach to MDPs by adding a relative entropy regularization term. A computationally efficient algorithm for this class of MDPs is constructed under mild assumptions on the state-transition models. Two-player zero-sum stochastic games are also investigated as an important extension of the online MDP setting. In short, this dissertation provides an in-depth analysis of the online MDP problem and answers several important questions in this field.
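For reference, the regret criterion described in the abstract is commonly formalized as follows; the notation (horizon T, time-varying costs c_t, and stationary policy class \Pi) is illustrative and not taken from the record:

\[
R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} c_t(x_t, a_t)\right] \;-\; \min_{\pi \in \Pi}\, \mathbb{E}\!\left[\sum_{t=1}^{T} c_t(x_t^{\pi}, a_t^{\pi})\right],
\]

where (x_t, a_t) is the learner's state-action trajectory, (x_t^{\pi}, a_t^{\pi}) is the trajectory induced by a fixed stationary policy \pi, and the costs c_1, ..., c_T may change arbitrarily over time. Sublinear growth of R_T in T means the learner's average cost approaches that of the best stationary policy chosen in hindsight.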
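The abstract also mentions extending the LP approach to MDPs with a relative entropy regularization term. As a hedged sketch only (the occupation measure \mu, reference measure \mu_0, and regularization weight 1/\beta below are illustrative assumptions, not details taken from the record), the classical average-cost LP over state-action occupation measures,

\[
\min_{\mu \ge 0}\; \sum_{x,a} \mu(x,a)\, c(x,a)
\quad \text{s.t.} \quad
\sum_{a} \mu(x',a) \;=\; \sum_{x,a} \mu(x,a)\, P(x' \mid x,a)\ \ \forall x',
\qquad \sum_{x,a} \mu(x,a) = 1,
\]

might be regularized by adding a term of the form \tfrac{1}{\beta}\, D(\mu \,\|\, \mu_0), where D(\cdot \,\|\, \cdot) denotes relative entropy (Kullback-Leibler divergence). Because relative entropy is strictly convex in \mu, the regularized objective is strictly convex, which is one route to the computationally efficient solution methods the abstract refers to.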

Bibliographic Details
Main Author: Guan, Peng
Other Authors: Willett, Rebecca; Raginsky, Maxim
Format: Dissertation
Published: 2015
Subjects: Electrical engineering; Mathematics; Operations research; Game theory; Markov Decision Process; Online learning; Optimization; Reinforcement learning; Stochastic control
Online Access: http://hdl.handle.net/10161/11312