Regret Minimization in Structured Reinforcement Learning

We consider a class of sequential decision-making problems in the presence of uncertainty, which belongs to the field of Reinforcement Learning (RL). Specifically, we study discrete Markov Decision Processes (MDPs), which model a decision maker, or agent, that interacts with a stochastic and dynamic environment and receives feedback from it in the form of a reward. The agent seeks to maximize a notion of cumulative reward. Because the environment (both the system dynamics and the reward function) is unknown, the agent faces an exploration-exploitation dilemma: it must balance exploring its available actions with exploiting what it believes to be the best one. This dilemma is captured by the notion of regret, which compares the rewards that the agent has accumulated thus far with those that would have been obtained by an optimal policy. The agent is then said to behave optimally if it minimizes its regret. This thesis investigates the fundamental regret limits that can be achieved by any agent. We derive general asymptotic and problem-specific regret lower bounds for the cases of ergodic and deterministic MDPs. We make these bounds explicit for unstructured ergodic MDPs, for MDPs with Lipschitz transitions and rewards, and for deterministic MDPs that satisfy a decoupling property. Furthermore, we propose DEL, an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure.
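As a point of reference, the following is a minimal sketch of how regret is commonly formalized for average-reward MDPs in the RL literature; the symbols (g*, r_t, c(theta)) are illustrative notation and not necessarily the exact definitions adopted in the thesis:

% Regret after T steps: the gap between the reward an optimal policy would
% accumulate and the reward the learning agent actually collected.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
\[
  R_T \;=\; T\, g^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],
\]
where $g^{*}$ is the optimal long-run average reward of the MDP and $r_t$ is
the reward received at step $t$. Asymptotic, problem-specific lower bounds of
the form $\liminf_{T \to \infty} R_T / \log T \ge c(\theta)$, with $c(\theta)$
depending on the MDP $\theta$ and its structure, are the kind of fundamental
limit the abstract refers to; an algorithm is then called optimal when its
regret upper bound matches this constant.
\end{document}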

Bibliographic Details
Main Author: Tranos, Damianos
Format: Others
Language: English
Published: KTH, Reglerteknik 2021
Subjects: Reinforcement Learning; Control Engineering; Reglerteknik
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296238
http://nbn-resolving.de/urn:isbn:978-91-7873-839-7
id ndltd-UPSALLA1-oai-DiVA.org-kth-296238
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-kth-2962382021-06-10T05:24:34ZRegret Minimization in Structured Reinforcement LearningengTranos, DamianosKTH, ReglerteknikStockholm2021Reinforcement LearningControl EngineeringReglerteknikWe consider a class of sequential decision-making problems in the presence of uncertainty, which belongs to the field of Reinforcement Learning (RL). Specifically, we study discrete Markov Decision Processes (MDPs), which model a decision maker, or agent, that interacts with a stochastic and dynamic environment and receives feedback from it in the form of a reward. The agent seeks to maximize a notion of cumulative reward. Because the environment (both the system dynamics and the reward function) is unknown, the agent faces an exploration-exploitation dilemma: it must balance exploring its available actions with exploiting what it believes to be the best one. This dilemma is captured by the notion of regret, which compares the rewards that the agent has accumulated thus far with those that would have been obtained by an optimal policy. The agent is then said to behave optimally if it minimizes its regret. This thesis investigates the fundamental regret limits that can be achieved by any agent. We derive general asymptotic and problem-specific regret lower bounds for the cases of ergodic and deterministic MDPs. We make these bounds explicit for unstructured ergodic MDPs, for MDPs with Lipschitz transitions and rewards, and for deterministic MDPs that satisfy a decoupling property. Furthermore, we propose DEL, an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure. QC 20210603 Licentiate thesis, monographinfo:eu-repo/semantics/masterThesistexthttp://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296238urn:isbn:978-91-7873-839-7TRITA-EECS-AVL ; 2021:26application/pdfinfo:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Reinforcement Learning
Control Engineering
Reglerteknik
spellingShingle Reinforcement Learning
Control Engineering
Reglerteknik
Tranos, Damianos
Regret Minimization in Structured Reinforcement Learning
description We consider a class of sequential decision-making problems in the presence of uncertainty, which belongs to the field of Reinforcement Learning (RL). Specifically, we study discrete Markov Decision Processes (MDPs), which model a decision maker, or agent, that interacts with a stochastic and dynamic environment and receives feedback from it in the form of a reward. The agent seeks to maximize a notion of cumulative reward. Because the environment (both the system dynamics and the reward function) is unknown, the agent faces an exploration-exploitation dilemma: it must balance exploring its available actions with exploiting what it believes to be the best one. This dilemma is captured by the notion of regret, which compares the rewards that the agent has accumulated thus far with those that would have been obtained by an optimal policy. The agent is then said to behave optimally if it minimizes its regret. This thesis investigates the fundamental regret limits that can be achieved by any agent. We derive general asymptotic and problem-specific regret lower bounds for the cases of ergodic and deterministic MDPs. We make these bounds explicit for unstructured ergodic MDPs, for MDPs with Lipschitz transitions and rewards, and for deterministic MDPs that satisfy a decoupling property. Furthermore, we propose DEL, an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure. === QC 20210603
author Tranos, Damianos
author_facet Tranos, Damianos
author_sort Tranos, Damianos
title Regret Minimization in Structured Reinforcement Learning
title_short Regret Minimization in Structured Reinforcement Learning
title_full Regret Minimization in Structured Reinforcement Learning
title_fullStr Regret Minimization in Structured Reinforcement Learning
title_full_unstemmed Regret Minimization in Structured Reinforcement Learning
title_sort regret minimization in structured reinforcement learning
publisher KTH, Reglerteknik
publishDate 2021
url http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296238
http://nbn-resolving.de/urn:isbn:978-91-7873-839-7
work_keys_str_mv AT tranosdamianos regretminimizationinstructuredreinforcementlearning
_version_ 1719409646420099072