Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes

This dissertation considers a particular aspect of sequential decision making under uncertainty in which, at each stage, a decision-making agent operating in an uncertain world takes an action that elicits a reinforcement signal and causes the state of the world to change. Optimal learning is a pattern of behavior that yields the highest expected total reward over the entire duration of an agent's interaction with its uncertain world. The problem of determining an optimal learning strategy is a sort of meta-problem, with optimality defined with respect to a distribution of environments that the agent is likely to encounter. Given this prior uncertainty over possible environments, the optimal-learning agent must collect and use information in an intelligent way, balancing greedy exploitation of certainty-equivalent world models with exploratory actions aimed at discerning the true state of nature. My approach to approximating optimal learning strategies retains the full model of the sequential decision process that, in incorporating a Bayesian model for evolving uncertainty about unknown process parameters, takes the form of a Markov decision process defined over a set of "hyperstates" whose cardinality grows exponentially with the planning horizon. I develop computational procedures that retain the full Bayesian formulation, but sidestep intractability by utilizing techniques from reinforcement learning theory (specifically, Monte-Carlo simulation and the adoption of parameterized function approximators). By pursuing an approach that is grounded in a complete Bayesian world model, I develop algorithms that produce policies that exhibit performance gains over simple heuristics. Moreover, in contrast to many heuristics, the justification or legitimacy of the policies follows directly from the fact that they are clearly motivated by a complete characterization of the underlying decision problem to be solved. This dissertation's contributions include a reinforcement learning algorithm for estimating Gittins indices for multi-armed bandit problems, a Monte-Carlo gradient-based algorithm for approximating solutions to general problems of optimal learning, a gradient-based scheme for improving optimal learning policies instantiated as finite-state stochastic automata, and an investigation of diffusion processes as analytical models for evolving uncertainty.
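The abstract's central object, the "hyperstate", can be made concrete with a small example. The sketch below is my illustration, not code from the dissertation: it treats a two-armed Bernoulli bandit with independent Beta(1,1) priors, where a hyperstate is the tuple of posterior counts for each arm, each pull branches into a success and a failure successor, and the Bayes-optimal value follows by backward recursion over hyperstates. The naive lookahead tree branches exponentially in the horizon, which is exactly the intractability the abstract describes; in this simple bandit many branches coalesce into shared count vectors, which the memoization exploits.

```python
# Minimal sketch (illustration only) of a Bayes-adaptive Bernoulli bandit.
# A hyperstate is ((a0, b0), (a1, b1)): Beta posterior counts per arm.
# Bayesian updating after each pull moves the process to one of two
# successor hyperstates, so the naive lookahead tree branches exponentially
# with the planning horizon.
from functools import lru_cache

@lru_cache(maxsize=None)
def value(hyperstate, horizon):
    """Bayes-optimal expected total reward over `horizon` remaining pulls."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for arm, (a, b) in enumerate(hyperstate):
        p = a / (a + b)  # posterior mean of this arm's success probability
        counts = list(hyperstate)
        counts[arm] = (a + 1, b)              # posterior after a success
        win = value(tuple(counts), horizon - 1)
        counts[arm] = (a, b + 1)              # posterior after a failure
        lose = value(tuple(counts), horizon - 1)
        # Expected immediate reward plus Bayes-expected continuation value.
        best = max(best, p * (1.0 + win) + (1.0 - p) * lose)
    return best

if __name__ == "__main__":
    prior = ((1, 1), (1, 1))  # uniform Beta(1,1) priors on both arms
    for h in (1, 5, 10):
        print(h, value(prior, h), value.cache_info().currsize)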

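The Monte-Carlo gradient-based approach the abstract mentions can likewise be sketched. The following REINFORCE-style example is again my illustration under assumed choices (posterior-mean features, a softmax policy over two arms, a uniform prior over arm probabilities, and an arbitrary learning rate): it estimates the gradient of expected total reward from simulated histories, with each history generated by first sampling an environment from the prior, so that optimality is defined with respect to the distribution of environments, as in the dissertation's formulation.

```python
# REINFORCE-style sketch (illustration only): a parameterized stochastic
# policy acts on hyperstate features; the gradient of expected total reward
# is estimated from simulated interaction histories.
import math
import random

def features(hyperstate):
    # Hand-chosen features: each arm's posterior mean (an assumption here;
    # any parameterized function approximator could take this place).
    return [a / (a + b) for (a, b) in hyperstate]

def policy_probs(theta, phi):
    # Softmax over per-arm scores theta[arm] * phi[arm].
    scores = [t * f for t, f in zip(theta, phi)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def episode(theta, true_p, horizon, rng):
    """Simulate one history; return (total reward, summed score function)."""
    hyper = [(1, 1), (1, 1)]
    total, grad = 0.0, [0.0, 0.0]
    for _ in range(horizon):
        phi = features(hyper)
        probs = policy_probs(theta, phi)
        arm = 0 if rng.random() < probs[0] else 1
        r = 1.0 if rng.random() < true_p[arm] else 0.0
        total += r
        # Accumulate grad log pi(arm): (indicator - prob) * feature.
        for k in range(2):
            ind = 1.0 if k == arm else 0.0
            grad[k] += (ind - probs[k]) * phi[k]
        a, b = hyper[arm]
        hyper[arm] = (a + 1, b) if r else (a, b + 1)  # Bayesian update
    return total, grad

def train(horizon=10, iters=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(iters):
        # Sample an environment from the prior, then a trajectory under it:
        # the objective is expected total reward over this prior.
        true_p = [rng.random(), rng.random()]
        total, grad = episode(theta, true_p, horizon, rng)
        theta = [t + lr * total * g for t, g in zip(theta, grad)]
    return theta

if __name__ == "__main__":
    print(train())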

Bibliographic Details
Main Author: Duff, Michael O'Gordon
Language: English
Published: ScholarWorks@UMass Amherst, 2002
Subjects: Computer science; Operations research; Statistics; Artificial intelligence
Online Access: https://scholarworks.umass.edu/dissertations/AAI3039353