Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments

Learning an action policy for autonomous agents in a decentralized multi-agent environment has remained an interesting but difficult research problem. We propose to model this problem in a contextual bandit setting with delayed reward signals, in particular an individual short-term reward signal and a...

Bibliographic Details
Main Authors: Mike Li, Quang Dang Nguyen
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9474507/