
Foresighted policy gradient reinforcement learning: solving large-scale social dilemmas with rational altruistic punishment

Abstract

Many important and difficult problems can be modeled as "social dilemmas", like Hardin's Tragedy of the Commons or the classic iterated Prisoner's Dilemma. It is well known that in these problems, it can be rational for self-interested agents to promote and sustain cooperation by altruistically dispensing costly punishment to other agents, thus maximizing their own long-term reward. However, self-interested agents using most current multi-agent reinforcement learning algorithms will not sustain cooperation in social dilemmas: the algorithms do not sufficiently capture the consequences that an agent's interactions with other agents have on its own reward. Recent, more foresighted algorithms specifically account for such expected consequences, and have been shown to work well for the small-scale Prisoner's Dilemma. However, this approach quickly becomes intractable for larger social dilemmas. Here, we advance on this work and develop a "teach/learn" stateless foresighted policy gradient reinforcement learning algorithm that applies to social dilemmas with negative, unilateral side-payments, in the form of costly punishment. In this setting, the algorithm allows agents to learn the most rewarding actions to take with respect to both the dilemma (Cooperate/Defect) and the "teaching" of other agents' behavior through the dispensing of punishment. Unlike other algorithms, we show that this approach scales well to large settings like the Tragedy of the Commons. We show for a variety of settings that large groups of self-interested agents using this algorithm will robustly find and sustain cooperation in social dilemmas where adaptive agents can punish the behavior of other similarly adaptive agents.
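The abstract describes stateless policy-gradient agents that choose both a dilemma action (Cooperate/Defect) and whether to dispense costly punishment. As a rough illustration of the kind of setup involved, here is a minimal sketch of REINFORCE-style stateless agents in a public-goods dilemma with a costly-punishment option. All payoffs, the learning rate, the punishment cost/fine ratio, and the update rule are illustrative assumptions, not the paper's actual "teach/learn" algorithm.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Agent:
    """Stateless agent with two policy parameters (logits):
    one for cooperating, one for punishing observed defectors."""
    def __init__(self):
        self.theta_coop = 0.0  # logit of P(cooperate)
        self.theta_pun = 0.0   # logit of P(punish a defector)

    def cooperates(self, rng):
        return rng.random() < sigmoid(self.theta_coop)

    def punishes(self, rng):
        return rng.random() < sigmoid(self.theta_pun)

def play_round(agents, rng, lr=0.1):
    """One round of a public-goods dilemma with costly punishment.
    Payoff constants (1.5 multiplier, 0.2 cost, 1.0 fine) are hypothetical."""
    n = len(agents)
    coop = [a.cooperates(rng) for a in agents]
    n_coop = sum(coop)
    # Each cooperator pays 1 into a pot; the pot is multiplied and shared.
    rewards = [1.5 * n_coop / n - (1.0 if c else 0.0) for c in coop]
    # Costly punishment: cooperators may pay a cost to fine each defector.
    for i, a in enumerate(agents):
        if coop[i]:
            for j in range(n):
                if not coop[j] and a.punishes(rng):
                    rewards[i] -= 0.2  # cost borne by the punisher
                    rewards[j] -= 1.0  # fine imposed on the defector
    # REINFORCE-style update on the cooperate/defect logit:
    # grad of log pi(action) w.r.t. theta, scaled by the round reward.
    for i, a in enumerate(agents):
        p = sigmoid(a.theta_coop)
        grad = (1.0 - p) if coop[i] else -p
        a.theta_coop += lr * rewards[i] * grad
    return rewards
```

The paper's foresighted variant additionally accounts for how an agent's punishment "teaches" the future behavior of other learning agents; this naive sketch updates only the dilemma action and leaves the punishment policy fixed.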


This paper was published in CWI's Institutional Repository.
