Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Abstract

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi. Appendix: https://arxiv.org/abs/1903.04193.
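To illustrate the actor update described in the abstract, the following is a minimal tabular sketch (not the authors' implementation; see the linked source code for the actual BDPI algorithm) of an actor that slowly imitates the average greedy policy of several off-policy critics. All names and values here (n_states, n_actions, n_critics, actor_lr, update_actor) are assumptions made for this example.

import numpy as np

# Hypothetical sizes and learning rate, chosen only for illustration.
n_states, n_actions, n_critics = 10, 4, 8
actor_lr = 0.05  # how quickly the actor moves toward the critics' greedy policy

# Actor: a tabular stochastic policy, one probability vector per state.
actor = np.full((n_states, n_actions), 1.0 / n_actions)

# Critics: independent tabular Q-functions, each assumed to be trained
# off-policy (e.g., from its own experience-replay samples).
critics = [np.zeros((n_states, n_actions)) for _ in range(n_critics)]

def greedy_policy(q):
    """One-hot greedy policy derived from a critic's Q-values."""
    policy = np.zeros_like(q)
    policy[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return policy

def update_actor(actor, critics, lr):
    """Move the actor a small step toward the critics' average greedy policy."""
    avg_greedy = np.mean([greedy_policy(q) for q in critics], axis=0)
    return (1.0 - lr) * actor + lr * avg_greedy

actor = update_actor(actor, critics, actor_lr)

Because each critic is trained off-policy, this averaged greedy target can be computed from replayed experience, and the actor's slow imitation of it is what the paper relates to Thompson sampling.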

This paper was published in VU Research Portal.
