Summary: This paper addresses the low sample efficiency and poor robustness of current reinforcement learning algorithms used for path planning on Unmanned Aerial Vehicle (UAV) platforms, and proposes a model-based reinforcement learning algorithm with intrinsic rewards. The algorithm adopts a parallel architecture that completely decouples data collection from policy updates, improving learning efficiency. The intrinsic reward improves the agent's exploration efficiency and prevents convergence to sub-optimal policies. During policy learning, the agent learns against a dynamics model of the simulated environment, so that information such as states and rewards can be predicted more accurately within a limited horizon. Finally, by combining a finite number of planning steps with neural-network prediction, the accuracy of the value-function estimate is improved, which reduces the amount of experience data required to train the agent. Experimental results show that, compared with a model-free reinforcement learning algorithm of the same architecture, our algorithm requires approximately 600 fewer experience samples to reach the same training level, with greatly improved sample efficiency and robustness. Compared with traditional heuristic algorithms, the score improves by nearly 8,000 points. Compared with mainstream model-based reinforcement learning algorithms such as MVE, the average score improves by approximately 2,000 points, and the agent shows clear advantages in sample efficiency and stability.
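The combination of finite planning steps with neural-network prediction described above can be sketched as an H-step value target in the style of model-based value expansion (the MVE baseline mentioned in the summary): roll a learned dynamics model forward for a limited horizon, sum the discounted predicted rewards, and bootstrap the tail with a value network. This is a minimal illustration, not the paper's actual implementation; the names `policy`, `model`, and `value` are placeholder callables assumed for the sketch.

```python
def h_step_value_target(s0, policy, model, value, horizon=5, gamma=0.99):
    """H-step value target: roll the learned dynamics model forward for
    `horizon` imagined steps, accumulate discounted predicted rewards,
    then bootstrap with the value network at the final imagined state."""
    target, discount, s = 0.0, 1.0, s0
    for _ in range(horizon):
        a = policy(s)               # action from the current policy
        s, r = model(s, a)          # imagined transition from the learned model
        target += discount * r      # discounted predicted reward
        discount *= gamma
    return target + discount * value(s)  # neural-network tail estimate
```

Increasing `horizon` shifts more of the target onto the (finitely accurate) learned model and less onto the value network, which is the trade-off the summary alludes to when it says states and rewards "can be predicted more accurately within a limited horizon."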
|