Thursday, August 29, 2013

Extortionate strategies in Prisoner's Dilemma

In an earlier post, I programmed a simple simulation of the evolution of prisoner's dilemma strategies.  I promised to read a bit of the established literature on the subject, and follow up.  In particular, I will write reports on the following two papers:

Press, W. & Dyson, F. J. Iterated Prisoners’ Dilemma contains strategies that dominate any evolutionary opponent. Proc. Natl Acad. Sci. USA 109, 10409–10413 (2012).

Adami, C. & Hintze, A. Evolutionary instability of zero-determinant strategies demonstrates that winning is not everything. Nat. Commun. 4, 2193 (2013).

The first paper demonstrates that in the Iterated Prisoner's Dilemma game, there exist extortionate strategies.  If you use such a strategy, your opponent can slightly improve their own score, but only by greatly benefiting you.  While this seems like a pretty powerful strategy, the second paper demonstrates that it is not evolutionarily stable.  This post will focus on the first paper.

The strategic landscape

In the previous post, I already explained the Iterated Prisoner's Dilemma.  I proposed a player who is described by three numbers: x, y, and z.  The three numbers describe the probability that the player will cooperate, given what their opponent did in the last round.  This player is known as a "memory-one" player, because they only remember the outcome of one previous round.  But I oversimplified it a bit.  In this paper, a "memory-one" player doesn't just remember what their opponent did in the previous round, they also remember what they themselves did in the previous round.  Thus you really need to describe the player with four numbers, each describing the probability of cooperating, given each of the four outcomes in the previous round.1

You could also have a "memory-two" player who remembers the outcomes of the previous two rounds.  Or a player of arbitrarily large memory.  You'd think that more memory makes for a better player.  However, the paper proves that there is a sense in which this is not true.

Suppose that player X is a memory-one player, and chooses some particular strategy.  And suppose player Y is a memory-two player, who chooses the best strategy to respond to X's strategy.  It turns out that Y's best memory-two strategy fares no better against X than the best memory-one strategy does.  Note that this is a precise statement, and it's not quite the same as saying that more memory is useless.2  But it provides some justification for focusing solely on memory-one strategies.

In summary, each player is described by four probabilities.  Player X is described by p1, p2, p3, and p4 (for the outcomes cc, cd, dc, and dd respectively, where c is cooperate and d is defect).  Player Y is described by q1, q2, q3, and q4.  Since these are all probabilities, they must each be between zero and one.
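
To make this concrete, here's how I'd represent the two players in code (a minimal Python/NumPy sketch of my own; the strategy numbers are made up for illustration):

    import numpy as np

    # A memory-one strategy is just four cooperation probabilities, indexed
    # by the previous round's outcome (cc, cd, dc, dd), each outcome seen
    # from that player's own point of view.
    p = np.array([0.9, 0.1, 0.8, 0.2])  # player X (hypothetical numbers)
    q = np.array([0.5, 0.4, 0.6, 0.3])  # player Y (hypothetical numbers)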

Zero-determinant strategies

From this point on, there will be a lot of math.  Skip to "extortionate ZD strategies" to get past the math.

When I calculated scores in my simulation, I used what's called the Markov matrix.  It looks like this:

[Figure 2A of the paper: the Markov matrix.]

The Markov matrix tells you how to transform the outcome probabilities of one iteration into those of the next.  Each iteration is associated with four probabilities, describing the likelihood of each of the four outcomes, and these can be written as the four components of a vector.  Multiplying this vector by the Markov matrix gives the vector for the next iteration, and multiplying by the Markov matrix repeatedly gives the probabilities after many iterations.
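
Here's a sketch of that calculation (my own reconstruction of the matrix in Figure 2A, so treat the details with appropriate suspicion).  One wrinkle: Y labels the outcomes from its own point of view, so cd and dc swap roles when indexing into q:

    def markov_matrix(p, q):
        # Rows are the current outcome (cc, cd, dc, dd); columns are the
        # next outcome.  Y sees X's cd as dc and vice versa, hence the
        # reordering of q.
        qy = [q[0], q[2], q[1], q[3]]
        M = np.empty((4, 4))
        for i in range(4):
            a, b = p[i], qy[i]  # P(X cooperates), P(Y cooperates)
            M[i] = [a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)]
        return M

    M = markov_matrix(p, q)
    v = np.array([0.25, 0.25, 0.25, 0.25])  # some starting distribution
    for _ in range(10):
        v = v @ M  # the outcome probabilities, ten iterations later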

But if you multiply by the Markov matrix many, many times, the outcome should approach some limiting state, called the stationary state.3  I didn't calculate the stationary state in my simulation, because I wasn't really sure how.  But the authors of the paper are a little more knowledgeable, and can calculate it directly.  In my simulation, I calculated the average outcome of the first 10 iterations; here, the authors effectively calculate the average outcome over an infinite number of iterations.
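
For the curious, here's one standard way to do it (my sketch, not the paper's code): the stationary state is the left eigenvector of the Markov matrix with eigenvalue 1, normalized so its entries sum to one.

    def stationary(M):
        # Solve v M = v by taking the eigenvector of the transpose of M
        # whose eigenvalue is (closest to) 1.
        w, vecs = np.linalg.eig(M.T)
        v = np.real(vecs[:, np.argmin(np.abs(w - 1))])
        return v / v.sum()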

Once you have the stationary state, you just need to calculate the scores.  To do this, you take the dot product of the probability vector with the score vector.  The score vector is (3,0,5,1) for the first player and (3,5,0,1) for the second.  It turns out that this dot product is proportional to the determinant of a special matrix:

[Figure 2B of the paper: the score as a determinant.  v is the stationary vector, while f is the score vector.]
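
In code, my reading of that figure looks like this (the helper name is mine, and the details are my reconstruction): the score comes out as a ratio of two such determinants, one with the real score vector f and one with f replaced by all ones.

    def score(p, q, f):
        p1, p2, p3, p4 = p
        q1, q2, q3, q4 = q
        def D(f):
            # The paper's special matrix: X alone controls the second
            # column, and Y alone controls the third.
            return np.linalg.det(np.array([
                [p1 * q1 - 1, p1 - 1, q1 - 1, f[0]],
                [p2 * q3,     p2 - 1, q3,     f[1]],
                [p3 * q2,     p3,     q2 - 1, f[2]],
                [p4 * q4,     p4,     q4,     f[3]],
            ]))
        return D(f) / D(np.ones(4))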

Why is this matrix so important?  Because player X has complete control over the second column, and player Y has complete control over the third column.  This allows for "zero-determinant" strategies (henceforth ZD strategies), where a player forces the determinant of the above matrix to be zero.  All you need to do is choose probabilities such that your column is equal (or proportional) to the last column.  If a matrix has two identical or proportional columns, its determinant is zero.

You might wonder, why would you ever want to set the determinant to zero?  Doesn't that just set someone's score to zero?  In fact, it does more than that.  You could, for instance, set the sum of your score and the opponent's score to zero.  Instead of using (3,0,5,1) or (3,5,0,1) for the score vector, all you have to do is use the sum of these two vectors, (6,5,5,2).  In general, you can enforce equations of the following form:

Equation 7 from the paper: α·sX + β·sY + γ = 0.  Here sX is X's score and sY is Y's score, while α, β, and γ are parameters chosen by the player using the ZD strategy.

This equation should make you skeptical.  Can X choose a strategy such that sX - 10 = 0, thus forcing their own average score to be 10?  Clearly this is impossible, since the maximum score is only 5.  If you try to construct such a ZD strategy, you end up having to set p1, p2, p3, and p4 outside their allowed range between zero and one.  This is the catch of ZD strategies: not every ZD strategy is feasible.
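
Here's the recipe as I understand it (a hedged sketch, with a helper name of my own): to enforce α·sX + β·sY + γ = 0, player X sets its column of the matrix, (p1-1, p2-1, p3, p4), equal to α·fX + β·fY + γ·(1,1,1,1).  The strategy only exists if the resulting probabilities land between zero and one; and since scaling α, β, and γ by a common positive factor enforces the same relationship, there's some freedom to hunt for a feasible version.

    def zd_strategy(alpha, beta, gamma):
        # Returns the strategy enforcing alpha*sX + beta*sY + gamma = 0,
        # or None if the required probabilities fall outside [0, 1].
        fX = np.array([3, 0, 5, 1])
        fY = np.array([3, 5, 0, 1])
        tilde = alpha * fX + beta * fY + gamma * np.ones(4)
        p = tilde + np.array([1, 1, 0, 0])  # undo the -1 shift in the column
        eps = 1e-12  # tolerance for floating-point round-off
        if np.all((p >= -eps) & (p <= 1 + eps)):
            return tuple(np.clip(p, 0, 1))
        return None

For instance, zd_strategy(1/25, -6/25, 1/5) returns (0.6, 0.0, 0.4, 0.0), the extortionate strategy we'll meet below (and in footnote 4), while no choice of scaling makes sX - 10 = 0 feasible.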

Extortionate ZD strategies

So far, I've shown that a player using a ZD strategy can enforce a linear relationship between their own score and their opponent's score.  However, not all linear relationships are possible.  Which ones are?

One thing you can do is unilaterally choose your opponent's score.  In a game where the outcome scores are 5/3/1/0, you may set your opponent's score anywhere between 1 and 3.  However, you can't unilaterally choose your own score using ZD.  After all, it would be a contradiction if both you and your opponent had complete control over your score.

Another thing you can do is enforce a positive correlation between your score and your opponent's score.  Thus, the more your opponent helps you, the better it is for your opponent.  One such ZD strategy is known as the "tit for tat" strategy, where you simply copy the action of your opponent in the last iteration.  In the tit for tat strategy, your opponent's average score will exactly equal your own average score.

But there are more extortionate ZD strategies, where you benefit far more than your opponent does.  For example, player X could enforce the rule (sX-1) = 6*(sY-1), meaning that whenever Y improves their own score, X's score improves six times as much.4  In general, you can be as extortionate as you want, although at some point your opponent loses the incentive to bother trying for such meager rewards.  Or, if your opponent is smart, they may give up on those rewards just to spite you.
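
If you want to check this yourself, the helper functions sketched above make it a one-liner: with the strategy from footnote 4, the relationship holds no matter what memory-one strategy Y plays.

    p = (3 / 5, 0, 2 / 5, 0)  # the extortionate strategy from footnote 4
    fX = np.array([3, 0, 5, 1])
    fY = np.array([3, 5, 0, 1])
    rng = np.random.default_rng(0)
    for _ in range(3):
        q = rng.random(4)  # a random opponent strategy
        print(score(p, q, fX) - 1, 6 * (score(p, q, fY) - 1))  # these match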

Consequences for evolution of altruism

It might seem like the existence of extortionate strategies is bad news for the evolution of altruism.  In fact, it's more complicated than that.  You can use extortionate strategies to hurt your opponent, but hurting your opponent doesn't necessarily help you.  The mere existence of an extortionate strategy doesn't lead to selfishness.  However, ZD-extortionate strategies have an additional property that I think was not properly emphasized in the paper: the more extortionate you are, the greater your own reward when your opponent cooperates.

For example, the least extortionate strategy is the tit for tat strategy.  If you persuade your opponent to fully cooperate, your average score will be 3.  But if you instead use the extortionate strategy above, with (sX-1) = 6*(sY-1), and if your opponent fully cooperates, then your average score will be 4.  The problem with ZD strategies is not that they're extortionate, but that they benefit more extortionate players.
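
Both numbers are easy to verify with the same machinery (again, my sketch, not the paper's code):

    all_c = (1, 1, 1, 1)  # an opponent who always cooperates
    for label, p in [("tit for tat", (1, 0, 1, 0)),
                     ("extortionate", (3 / 5, 0, 2 / 5, 0))]:
        v = stationary(markov_matrix(p, all_c))
        print(label, v @ fX)  # 3.0 for tit for tat, 4.0 for extortion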

So imagine that you are playing against an evolutionary opponent.  You can choose a ZD-extortionate strategy, and watch your opponent evolve towards a more cooperative strategy.  When they cooperate fully, you can use a more and more extortionate strategy to reap greater rewards (while also hurting your opponent).

Of course, in this picture, we're imagining that one player is intelligently choosing a ZD strategy, while the other player is blindly evolving.  The paper also briefly discusses the cases where both players use ZD strategies, or where one player has a "theory of mind".  I am a bit unsure about the applicability to evolution.  Even the most intelligent agents are not going to choose a strategy based on its eventual evolutionary outcome, many generations down the road.  And if agents are not intelligent, I'm not sure ZD would ever evolve.  Does a ZD strategy really give you an advantage if you're mostly playing against siblings and cousins who are using similar strategies?

I suspect that the next paper, which shows that ZD strategies are not evolutionarily stable, will bring up one or more of these issues.  I will read the paper and see.

-----------------------------------

1. Actually, a memory-one player requires five numbers.  The first four numbers specify what to do in each of the four possible outcomes of the previous round.  The fifth number specifies what to do in the very first round.  However, in the long run, the first round doesn't really matter (except in the scenario mentioned in footnote 3), and the paper's calculations don't depend on the outcome of the first iteration anyway.

2. The key thing is that if the low-memory player chooses a strategy first, and the high-memory player responds, then there is no advantage to having high memory.  However, if the high-memory player chooses a strategy first, and the low-memory player responds, there is a disadvantage to having low memory.

3. There's also the possibility that the game-state will circle around the stationary state without ever reaching it, and I think it's possible for there to be many stationary states.  The authors don't really address these possibilities, but they're smart so I'm sure they know what they're doing.  I think we can safely ignore these possibilities because they only occur for pathological cases.

4. To get this outcome, try (p1,p2,p3,p4)=(3/5,0,2/5,0).  Tit for tat is (1,0,1,0).

This is part of a miniseries on the evolution of Prisoner's Dilemma strategies.
1. Evolution of prisoner's dilemma strategies
2. Extortionate strategies in Prisoner's dilemma
3. Prisoner's Dilemma and evolutionary stability
4. A modified prisoner's dilemma simulation

1 comment:

Larry Hamelin said...

Good stuff! I'm eagerly awaiting the next installment.