Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments

Learning an action policy for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose to model this problem in a contextual bandit setting with delayed reward signals, specifically an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to directly model these delayed reward signals and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration, implemented with deep learning regressors, shows that the proposed algorithm is more effective than a baseline policy at learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team.
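
The abstract describes the method only at a high level, so the sketch below is one plausible reading of it, not the authors' implementation. The class name, the epsilon-greedy action selection, the beta weighting that mixes the two reward signals, and the choice of SGDRegressor as the oracle model are all assumptions made for illustration.

```python
"""Minimal sketch (assumptions, not the paper's code): contextual bandit
learning with two reward oracles and expert-guided exploration sampling."""
import numpy as np
from sklearn.linear_model import SGDRegressor


class BanditAgent:
    def __init__(self, n_actions, expert_policy, epsilon=0.1, beta=0.5):
        self.n_actions = n_actions
        self.expert_policy = expert_policy  # callable: context -> action (expert-designed policy)
        self.epsilon = epsilon              # exploration rate (assumed hyperparameter)
        self.beta = beta                    # short- vs. long-term reward weight (assumed)
        # One regressor ("reward oracle") per delayed reward signal,
        # each predicting a reward from (context, action) features.
        self.short_oracle = SGDRegressor()
        self.long_oracle = SGDRegressor()
        self._fitted = False

    def _features(self, context, action):
        # Concatenate the context with a one-hot encoding of the action.
        onehot = np.zeros(self.n_actions)
        onehot[action] = 1.0
        return np.concatenate([context, onehot]).reshape(1, -1)

    def act(self, context):
        # Exploration samples from the expert-designed policy rather than
        # uniformly at random -- the "sampling guidance" idea.
        if not self._fitted or np.random.rand() < self.epsilon:
            return self.expert_policy(context)
        # Otherwise pick the action with the best combined oracle estimate.
        scores = [
            self.beta * self.short_oracle.predict(self._features(context, a))[0]
            + (1 - self.beta) * self.long_oracle.predict(self._features(context, a))[0]
            for a in range(self.n_actions)
        ]
        return int(np.argmax(scores))

    def update(self, context, action, short_reward, long_reward):
        # Called once the delayed reward signals arrive; trains both oracles online.
        X = self._features(context, action)
        self.short_oracle.partial_fit(X, [short_reward])
        self.long_oracle.partial_fit(X, [long_reward])
        self._fitted = True


# Toy usage with a hypothetical constant expert policy:
agent = BanditAgent(n_actions=3, expert_policy=lambda ctx: 0)
action = agent.act(np.ones(4))
agent.update(np.ones(4), action, short_reward=1.0, long_reward=0.5)
```

In the paper's demonstration the oracles are deep learning regressors; the linear SGDRegressor here simply stands in as the smallest online-trainable substitute.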


Bibliographic Details
Main Authors: Mike Li (ORCID: 0000-0003-4514-7260), Quang Dang Nguyen (ORCID: 0000-0002-0403-6903)
Author Affiliation: Centre for Complex Systems, Faculty of Engineering, University of Sydney, Sydney, NSW, Australia (both authors)
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access, vol. 9, pp. 96641-96657
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3094623
Subjects: Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards
Online Access: https://ieeexplore.ieee.org/document/9474507/