What I'm doing (June 2021)

Risk in stochastic games

We started this project with Tobias (a new postdoc); so far, we have mostly read the relevant literature and explored the topic.

We focus on the general problem of solving stochastic games, which has been open for 60 years. That fact did not stop me from trying to solve it. Of course, I wasted a lot of time.

Definition

Imagine that two players compete in Starcraft (or some other game with incomplete information). Every minute, they commit to a strategy: build a massive attack, build a defense, or gather more resources. They might randomly discover what their opponent is doing (by scouting at the right hiding place).

In one step, every player chooses a strategy, and something random can happen. We can express the result as a state (player A defends, and B knows about it and gathers resources). Given the states and the possible transitions, the task is to describe the best strategy for both players.

I like to think about the game more sequentially, so I'll define it more formally. We have a directed graph where every vertex has two outgoing edges, except for two target vertices: win for A and win for B. We place a token on one vertex that denotes the current state (in Starcraft, a position with a base and just a few probes). Every non-target vertex has one of three types: A, B, or random. In a vertex of type A (or B), the corresponding player decides where the token moves. In a random vertex, it moves randomly. Each player wants to move the token to their target.

The question here is: what is the probability that the token reaches win for A if both players play optimally?
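
To make this concrete, here is one way such a game might be encoded; the graph, the vertex names, and the Python representation are my own toy illustration, not a standard benchmark.

```python
# Toy encoding of a stochastic game as a directed graph (a sketch,
# not from any particular paper). Each non-target vertex maps to
# (type, successors); the targets "win_A" and "win_B" are absorbing.
# Types: "A" (player A moves), "B" (player B moves), "R" (random).
game = {
    "s0": ("A", ["s1", "s2"]),        # A picks the next vertex
    "s1": ("R", ["win_A", "win_B"]),  # fair coin flip
    "s2": ("B", ["s1", "win_B"]),     # B picks the next vertex
}
```

Starting the token at s0: B would move from s2 straight to win_B, so A's best move is to s1, winning with probability 1/2.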

Motivation

In formal methods, we analyze all runs of a program and then reason about the properties of that program. Imagine a program that directs packets through a network to a destination. At some servers, you choose the next hop; at some, a hacker chooses it; and at some, it's random. This corresponds to a stochastic game.

Generally, stochastic games help us to reason about programs. Any progress in stochastic games will translate to other algorithms too.

What is known

The decision problem is in both NP and co-NP, which means it is at most as hard as the problems in NP (and co-NP).

If stochastic games are hard, i.e., there is no polynomial algorithm for them, it has far-reaching implications: since the problem lies in both NP and co-NP, proving that every algorithm takes super-polynomial time would imply that NP is not equal to P, the biggest open problem in computer science. That is one reason the majority of computer scientists believe a polynomial algorithm for stochastic games exists. Finding a fast algorithm would mean that we can solve this problem quickly (and similar ones too). So, I think there exists a fast algorithm.

For a formal methods course, I studied an algorithm for parity games. A good algorithm for parity games was an open problem for a long time, and then in 2017, a quasipolynomial algorithm was discovered. The paper was good and made sense; at least it was not magic that I would not be able to follow. It hints that a fast algorithm is findable (and maybe even by a Ph.D. student).

Value Iteration

In practice, for these problems, a technique called value iteration is used.

Every vertex has a value. Initially, every vertex has value 0, except A's target, which has 1 (a win for A). Then every vertex is updated: we look at the values of the vertices its edges lead to. We assign the maximum of these numbers if it's A's vertex, the minimum if it's B's vertex, and the average if it's a random vertex. We iterate this step until the numbers barely change. It corresponds to the players making better and better decisions with more information.
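
Here is a minimal sketch of value iteration on the toy game from the definition above; the stopping threshold and the uniform transition probabilities at random vertices are my simplifying assumptions.

```python
# Value iteration on the toy game from above. Random vertices are
# assumed to pick a successor uniformly; eps is an arbitrary
# stopping threshold.
game = {
    "s0": ("A", ["s1", "s2"]),
    "s1": ("R", ["win_A", "win_B"]),
    "s2": ("B", ["s1", "win_B"]),
}

def value_iteration(game, eps=1e-9):
    values = {v: 0.0 for v in game}
    values["win_A"], values["win_B"] = 1.0, 0.0  # targets are fixed
    while True:
        delta = 0.0
        for v, (kind, succ) in game.items():
            vals = [values[s] for s in succ]
            if kind == "A":    # A maximizes the probability of winning
                new = max(vals)
            elif kind == "B":  # B minimizes it
                new = min(vals)
            else:              # random vertex: average of successors
                new = sum(vals) / len(vals)
            delta = max(delta, abs(new - values[v]))
            values[v] = new
        if delta < eps:        # values barely changed; stop
            return values

print(value_iteration(game))
```

On this toy game, the values converge to 1/2 for s0 and s1 and to 0 for s2, matching the hand analysis above.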

Value iteration is fast in practice, but there exist instances where it takes exponential time. It's versatile, so it's used even for Markov decision processes and Markov chains, even though polynomial algorithms exist for them.

When I was thinking about stochastic games, I discovered an algorithm that runs fast on the instances where value iteration is slow. We will see if it leads somewhere.

Risk

I have to admit that, so far, I focused on solving stochastic games outright, but I finally moved to something more productive. We want to approach stochastic games from a different direction: instead of attacking the problem head-on, which has only a binary outcome (either a breakthrough or nothing), we try to evaluate risk in stochastic games.

For risk, we need to define weighted stochastic games: games with more target vertices, where each target yields some payoff (at one we get 100, at another 10, at yet another 0). Then we ask: what is A's expected payoff?

The solution everyone is looking for in weighted SGs is the one with the highest mean payoff. That is not always desirable. Would you take 97 euros, or play a game where you win 250 euros or lose 50 euros with equal probability? The gamble's expected payoff is 100 (half of 250 minus half of 50), so the mean-payoff solution says to play; yet many people would take the sure 97.

Most people prefer security over payoff, so that even the worst case is survivable. One such risk metric is called CVaR (conditional value at risk) and is used in finance. CVaR answers how much we get in an average "bad" run: CVaR(p) is the average payoff over the worst runs whose probabilities sum to p.
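
As a quick sketch of the definition (the way a point mass straddling the p boundary is split follows one common convention; conventions differ):

```python
# CVaR(p) of a discrete payoff distribution: the average payoff of
# the worst outcomes whose probabilities sum to p. The gamble below
# is the 97-vs-gamble example from the text.

def cvar(outcomes, p):
    """outcomes: list of (payoff, probability) pairs; 0 < p <= 1."""
    remaining = p
    total = 0.0
    for payoff, prob in sorted(outcomes):  # worst payoffs first
        take = min(prob, remaining)        # may split a point mass
        total += payoff * take
        remaining -= take
        if remaining <= 0:
            break
    return total / p

gamble = [(250, 0.5), (-50, 0.5)]
print(cvar(gamble, 0.5))   # -50.0: the average of the worst half
print(cvar(gamble, 1.0))   # 100.0: CVaR(1) is just the mean payoff
```

So while the gamble's mean payoff (100) beats the sure 97, its CVaR(0.5) is -50, far below it; that gap is exactly what a risk-averse player cares about.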

Risk was already studied in Markov decision processes (they are the same as stochastic games, but without the opponent's vertices). We want to create algorithms for computing CVaR in stochastic games.

Cooperation on graphs

Simple prisoner's dilemma

The prisoner's dilemma is a simple game: both players simultaneously choose a strategy, cooperate or defect. If both cooperate, each pays some cost and enjoys a benefit. If both defect, nothing happens. If one cooperates and the other defects, the cooperator pays the cost (and gets nothing), while the defector gets all the benefits.

If you play just once, it is better to defect: no matter what the other player chooses, we are better off defecting. But if both players play this "optimally", nobody gets anything; it's better if both cooperate. That is why it's called a dilemma (and it's called prisoner's because in the original motivation the players were trying to avoid prison).
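
In the donation-game form described above, the dominance is easy to check; the benefit b and cost c below are my illustrative parameters (any b > c > 0 works).

```python
# Payoffs in a one-shot prisoner's dilemma (donation-game form).
# Assumed parameters: benefit b = 3, cost c = 1.
b, c = 3, 1

def payoff(me, other):
    """'C' = cooperate, 'D' = defect; returns my payoff."""
    gain = b if other == "C" else 0  # I benefit if the other cooperates
    cost = c if me == "C" else 0     # cooperating costs me c
    return gain - cost

for me in "CD":
    for other in "CD":
        print(f"{me} vs {other}: {payoff(me, other)}")
# C vs C: 2, C vs D: -1, D vs C: 3, D vs D: 0.
# Defecting always gains c over cooperating, yet mutual cooperation
# (b - c = 2 each) beats mutual defection (0 each).
```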

These games usually describe our behavior well. Despite the dominance of selfishness, we also see cooperation, which means some forces promote cooperation. Understanding them would explain parts of evolution and help us organize our society better.

We deal with simple memoryless strategies: a player either always defects or always cooperates. Players don't remember what others played against them, which closes the easiest route to cooperation: reciprocity. I might start to cooperate hoping that my opponent cooperates too; if that does not work out and the opponent keeps defecting, I don't let him exploit me.

Structure

Instead of reciprocity, we use a spatial arrangement. In the real world, we don't interact with strangers, but we do with our friends. The whole process is simple: first, everyone plays with their friends; then they may switch to the strategy of a more successful friend.

Defectors (red) and cooperators (blue) on a square grid; the simulation is not finished yet, so the boundaries are still ragged.

In general, the structure can be an arbitrary graph, but everyone uses a square grid. There, cooperators form clusters that survive the onslaught of the defectors. Eventually, the process reaches a stable position where the number of defectors neither increases nor decreases. The question is how many cooperators are left in this stationary state.
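
Here is a minimal sketch of the whole process; the grid size, the payoff parameters, the von Neumann neighbourhood, and the imitate-the-best update rule are all my illustrative assumptions (papers vary in these choices).

```python
import random

# Spatial prisoner's dilemma on an N x N grid with wrap-around edges.
N, b, c = 20, 3, 1
grid = [[random.choice("CD") for _ in range(N)] for _ in range(N)]

def neighbours(i, j):
    # von Neumann neighbourhood (up, down, left, right), wrapping around
    return [((i - 1) % N, j), ((i + 1) % N, j),
            (i, (j - 1) % N), (i, (j + 1) % N)]

def score(i, j):
    # total donation-game payoff of cell (i, j) against its neighbours
    return sum((b if grid[x][y] == "C" else 0)
               - (c if grid[i][j] == "C" else 0)
               for x, y in neighbours(i, j))

def step():
    # everyone copies the strategy of the best-scoring cell among
    # themselves and their neighbours (all scores use the old grid)
    global grid
    best = {(i, j): max([(i, j)] + neighbours(i, j),
                        key=lambda p: score(*p))
            for i in range(N) for j in range(N)}
    grid = [[grid[best[(i, j)][0]][best[(i, j)][1]]
             for j in range(N)] for i in range(N)]

for _ in range(50):
    step()
print(sum(row.count("C") for row in grid), "cooperators remain")
```

Depending on the random start, cooperator clusters may survive or die out; the question above is about the typical count in the stationary state.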

Spreading cooperation using graphs

The papers that deal with prisoner's dilemmas usually pick some trick that lets cooperators survive. For instance, there might be a third strategy that punishes defectors (at some cost), or cooperators may rewire their graph. Then they simulate the process and present nice charts.

I'm more of a theoretician, so I like to prove what I see, not just simulate it. As I will describe (in the next post), these games are generally computationally hard: we cannot efficiently compute the ratio of cooperators to defectors (and even approximating it is hard), so simulations are the only practical approach. Still, I want to prove something about the process, either to help the simulations or to show what is possible.

First, I wanted to know how a graph can help cooperators against very strong defectors. It turned out that even if we allow defectors to rob the cooperators heavily (as long as a cooperator's payoff never goes negative), the right structure can salvage the situation.

I already have a lot of this written up. I know that a good structure exists and that no other structure can be much better. Now I'm thinking about the right baseline, because comparing to the structureless case is pointless: there, cooperation cannot spread, and defectors are always better off.

We will see.