Regret Minimization in Structured Reinforcement Learning

We consider a class of sequential decision-making problems in the presence of uncertainty, which belongs to the field of Reinforcement Learning (RL). Specifically, we study discrete Markov Decision Processes (MDPs), which model a decision maker, or agent, that interacts with a stochastic and dynamic environment and receives feedback from it in the form of a reward. The agent seeks to maximize a notion of cumulative reward. Because the environment (both the system dynamics and the reward function) is unknown, the agent faces an exploration-exploitation dilemma: it must balance exploring its available actions with exploiting what it believes to be the best one. This dilemma is captured by the notion of regret, which compares the rewards that the agent has accumulated thus far with those that would have been obtained by an optimal policy. The agent is then said to behave optimally if it minimizes its regret. This thesis investigates the fundamental regret limits that can be achieved by any agent. We derive general asymptotic and problem-specific regret lower bounds for the cases of ergodic and deterministic MDPs. We make these bounds explicit for unstructured ergodic MDPs, for MDPs with Lipschitz transitions and rewards, and for deterministic MDPs that satisfy a decoupling property. Furthermore, we propose DEL, an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure.
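As a point of reference, the following is a minimal sketch of how regret is commonly formalized for average-reward MDPs in the RL literature; the symbols (g*, r_t, c(theta)) are illustrative notation and not necessarily the exact definitions adopted in the thesis:

% Regret after T steps: the gap between the reward an optimal policy would
% accumulate and the reward the learning agent actually collected.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
\[
  R_T \;=\; T\, g^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],
\]
where $g^{*}$ is the optimal long-run average reward of the MDP and $r_t$ is
the reward received at step $t$. Asymptotic, problem-specific lower bounds of
the form $\liminf_{T \to \infty} R_T / \log T \ge c(\theta)$, with $c(\theta)$
depending on the MDP $\theta$ and its structure, are the kind of fundamental
limit the abstract refers to; an algorithm is then called optimal when its
regret upper bound matches this constant.
\end{document}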

Bibliographic Details
Main Author: Tranos, Damianos
Format: Others
Language: English
Published: KTH, Reglerteknik 2021
Subjects: Reinforcement Learning; Control Engineering; Reglerteknik
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296238
http://nbn-resolving.de/urn:isbn:978-91-7873-839-7
id ndltd-UPSALLA1-oai-DiVA.org-kth-296238
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-kth-2962382021-06-10T05:24:34ZRegret Minimization in Structured Reinforcement LearningengTranos, DamianosKTH, ReglerteknikStockholm2021Reinforcement LearningControl EngineeringReglerteknikWe consider a class of sequential decision-making problems in the presence of uncertainty, which belongs to the field of Reinforcement Learning (RL). Specifically, we study discrete Markov Decision Processes (MDPs), which model a decision maker, or agent, that interacts with a stochastic and dynamic environment and receives feedback from it in the form of a reward. The agent seeks to maximize a notion of cumulative reward. Because the environment (both the system dynamics and the reward function) is unknown, the agent faces an exploration-exploitation dilemma: it must balance exploring its available actions with exploiting what it believes to be the best one. This dilemma is captured by the notion of regret, which compares the rewards that the agent has accumulated thus far with those that would have been obtained by an optimal policy. The agent is then said to behave optimally if it minimizes its regret. This thesis investigates the fundamental regret limits that can be achieved by any agent. We derive general asymptotic and problem-specific regret lower bounds for the cases of ergodic and deterministic MDPs. We make these bounds explicit for unstructured ergodic MDPs, for MDPs with Lipschitz transitions and rewards, and for deterministic MDPs that satisfy a decoupling property. Furthermore, we propose DEL, an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure. QC 20210603 Licentiate thesis, monographinfo:eu-repo/semantics/masterThesistexthttp://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296238urn:isbn:978-91-7873-839-7TRITA-EECS-AVL ; 2021:26application/pdfinfo:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Reinforcement Learning
Control Engineering
Reglerteknik
spellingShingle Reinforcement Learning
Control Engineering
Reglerteknik
Tranos, Damianos
Regret Minimization in Structured Reinforcement Learning
description We consider a class of sequential decision-making problems in the presence of uncertainty, which belongs to the field of Reinforcement Learning (RL). Specifically, we study discrete Markov Decision Processes (MDPs), which model a decision maker, or agent, that interacts with a stochastic and dynamic environment and receives feedback from it in the form of a reward. The agent seeks to maximize a notion of cumulative reward. Because the environment (both the system dynamics and the reward function) is unknown, the agent faces an exploration-exploitation dilemma: it must balance exploring its available actions with exploiting what it believes to be the best one. This dilemma is captured by the notion of regret, which compares the rewards that the agent has accumulated thus far with those that would have been obtained by an optimal policy. The agent is then said to behave optimally if it minimizes its regret. This thesis investigates the fundamental regret limits that can be achieved by any agent. We derive general asymptotic and problem-specific regret lower bounds for the cases of ergodic and deterministic MDPs. We make these bounds explicit for unstructured ergodic MDPs, for MDPs with Lipschitz transitions and rewards, and for deterministic MDPs that satisfy a decoupling property. Furthermore, we propose DEL, an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure. === QC 20210603
author Tranos, Damianos
author_facet Tranos, Damianos
author_sort Tranos, Damianos
title Regret Minimization in Structured Reinforcement Learning
title_short Regret Minimization in Structured Reinforcement Learning
title_full Regret Minimization in Structured Reinforcement Learning
title_fullStr Regret Minimization in Structured Reinforcement Learning
title_full_unstemmed Regret Minimization in Structured Reinforcement Learning
title_sort regret minimization in structured reinforcement learning
publisher KTH, Reglerteknik
publishDate 2021
url http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296238
http://nbn-resolving.de/urn:isbn:978-91-7873-839-7
work_keys_str_mv AT tranosdamianos regretminimizationinstructuredreinforcementlearning
_version_ 1719409646420099072