Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments

Learning an action policy for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose to model this problem in a contextual bandit setting with delayed reward signals, specifically an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to directly model these delayed reward signals and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration, implemented with deep learning regressors, shows that the proposed algorithm is more effective than a baseline policy at learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team.
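
The abstract describes the method only at a high level, so the sketch below is one plausible reading of it, not the authors' implementation. The class name, the epsilon-greedy action selection, the beta weighting that mixes the two reward signals, and the choice of SGDRegressor as the oracle model are all assumptions made for illustration.

```python
"""Minimal sketch (assumptions, not the paper's code): contextual bandit
learning with two reward oracles and expert-guided exploration sampling."""
import numpy as np
from sklearn.linear_model import SGDRegressor


class BanditAgent:
    def __init__(self, n_actions, expert_policy, epsilon=0.1, beta=0.5):
        self.n_actions = n_actions
        self.expert_policy = expert_policy  # callable: context -> action (expert-designed policy)
        self.epsilon = epsilon              # exploration rate (assumed hyperparameter)
        self.beta = beta                    # short- vs. long-term reward weight (assumed)
        # One regressor ("reward oracle") per delayed reward signal,
        # each predicting a reward from (context, action) features.
        self.short_oracle = SGDRegressor()
        self.long_oracle = SGDRegressor()
        self._fitted = False

    def _features(self, context, action):
        # Concatenate the context with a one-hot encoding of the action.
        onehot = np.zeros(self.n_actions)
        onehot[action] = 1.0
        return np.concatenate([context, onehot]).reshape(1, -1)

    def act(self, context):
        # Exploration samples from the expert-designed policy rather than
        # uniformly at random -- the "sampling guidance" idea.
        if not self._fitted or np.random.rand() < self.epsilon:
            return self.expert_policy(context)
        # Otherwise pick the action with the best combined oracle estimate.
        scores = [
            self.beta * self.short_oracle.predict(self._features(context, a))[0]
            + (1 - self.beta) * self.long_oracle.predict(self._features(context, a))[0]
            for a in range(self.n_actions)
        ]
        return int(np.argmax(scores))

    def update(self, context, action, short_reward, long_reward):
        # Called once the delayed reward signals arrive; trains both oracles online.
        X = self._features(context, action)
        self.short_oracle.partial_fit(X, [short_reward])
        self.long_oracle.partial_fit(X, [long_reward])
        self._fitted = True


# Toy usage with a hypothetical constant expert policy:
agent = BanditAgent(n_actions=3, expert_policy=lambda ctx: 0)
action = agent.act(np.ones(4))
agent.update(np.ones(4), action, short_reward=1.0, long_reward=0.5)
```

In the paper's demonstration the oracles are deep learning regressors; the linear SGDRegressor here simply stands in as the smallest online-trainable substitute.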


Bibliographic Details
Main Authors: Mike Li (ORCID: 0000-0003-4514-7260), Quang Dang Nguyen (ORCID: 0000-0002-0403-6903)
Author Affiliation: Centre for Complex Systems, Faculty of Engineering, University of Sydney, Sydney, NSW, Australia (both authors)
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access, vol. 9, pp. 96641-96657
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3094623
Subjects: Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards
Online Access: https://ieeexplore.ieee.org/document/9474507/