Stochastic and Partially Observable Games

Stochastic Games

Stochastic games bring us a little closer to real life by introducing a random element (like dice)

A good example of a game that combines luck and skill is Backgammon:

Suppose White has rolled a 6–5. The legal moves are: \((5\rightarrow 11,5\rightarrow 10),(5\rightarrow 11,19\rightarrow 24),(5\rightarrow 10,10\rightarrow 16),\text{ or }(5\rightarrow 11,11\rightarrow 16)\)

A backgammon board state

At this point in the game, we know which moves we can make, but not which moves the opponent will be able to make. Why? Because the opponent’s legal moves depend on a dice roll that hasn’t happened yet.

We can’t construct a normal game tree like this! We need chance nodes, in addition to our other \(MIN\) and \(MAX\) nodes.

  • How many ways are there to roll two dice?

    • 36
  • How many unique ways are there to roll two dice?

    • 21 (a roll of (6,5) is the same as a roll of (5,6))
  • Does every roll have the same probability?

    • No: the 6 doubles like (1,1) each have probability \(\frac{1}{36}\), while the 15 mixed rolls each have probability \(\frac{1}{18}\) (see the quick check below)
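
A quick sanity check of those three answers, enumerating all 36 ordered outcomes:

```python
from itertools import product
from collections import Counter

# Count the distinct unordered rolls of two dice and their probabilities.
rolls = Counter(tuple(sorted(pair)) for pair in product(range(1, 7), repeat=2))

print(len(rolls))           # 21 distinct rolls
print(rolls[(1, 1)] / 36)   # a double: 1/36
print(rolls[(5, 6)] / 36)   # a mixed roll: 2/36 = 1/18
```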

We can’t use a pure minimax strategy here: positions no longer have definite minimax values, only expected values averaged over the outcomes at chance nodes

We can, however, use an expectiminimax value:

\[ Expectiminimax(s)= \begin{cases} Utility(s,MAX), &\text{if }IsTerminal(s)\\ \max_{a\in Actions(s)}Expectiminimax(Result(s,a)),& \text{if }ToMove(s)=MAX\\ \min_{a\in Actions(s)}Expectiminimax(Result(s,a)),& \text{if }ToMove(s)=MIN\\ \sum_rP(r)\,Expectiminimax(Result(s,r)),& \text{if }ToMove(s)=CHANCE \end{cases} \]

Where \(r\) is a possible dice roll (or other chance event) and \(Result(s,r)\) is state \(s\) with the outcome of the roll fixed to \(r\).
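
Here’s a minimal recursive sketch of that recurrence in Python. (The `game` interface, with `is_terminal`, `utility`, `to_move`, `actions`, `result`, and `chance_outcomes` returning (roll, probability) pairs, is a hypothetical stand-in, not any particular library.)

```python
def expectiminimax(game, s):
    """Expected utility of state s for MAX, assuming optimal MAX/MIN play."""
    if game.is_terminal(s):
        return game.utility(s, "MAX")
    player = game.to_move(s)
    if player == "MAX":
        return max(expectiminimax(game, game.result(s, a)) for a in game.actions(s))
    if player == "MIN":
        return min(expectiminimax(game, game.result(s, a)) for a in game.actions(s))
    # CHANCE node: probability-weighted average over the possible rolls r.
    return sum(p * expectiminimax(game, game.result(s, r))
               for r, p in game.chance_outcomes(s))
```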

Evaluation functions for games of chance

Just like minimax, a good strategy is to halt our search at some point and then estimate the values of the states at the frontier.

We might naively think that our evaluation function will look similar to a chess evaluation function… but a pernicious issue shows up. Minimax decisions are unchanged by any order-preserving (monotonic) transformation of the leaf values: only the ordering matters. With chance nodes the magnitudes matter too, because values get averaged, so two evaluation functions that agree on the ordering of positions can still recommend different moves. How might we address this?

To ensure this doesn’t happen, the evaluation function must return a positive linear transformation of the probability of winning (or of the expected utility) from that position
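
A tiny illustration with made-up numbers: the two evaluation functions below agree on the ordering of the four leaves, yet expectimax flips its choice.

```python
# Two evaluation functions that agree on the ordering of the leaves
# (1 < 2 < 3 < 4 and 1 < 20 < 30 < 400), yet expectimax flips its choice.
p = (0.9, 0.1)   # chance-node outcome probabilities

def best_move(left_leaves, right_leaves):
    expected = lambda leaves: sum(pi * v for pi, v in zip(p, leaves))
    return "left" if expected(left_leaves) > expected(right_leaves) else "right"

print(best_move((2, 3), (1, 4)))       # left  (2.1 vs 1.3)
print(best_move((20, 30), (1, 400)))   # right (21.0 vs 40.9)
```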

If there were no probabilistic element (if we knew the dice values ahead of time), then minimax would solve the tree in \(O(b^m)\) (as before)

But because we consider every dice outcome, we get \(O(b^mn^m)\), where \(n\) is the number of distinct rolls

Even when the search is limited to a small depth \(d\), the extra cost is prohibitive.

For example, in backgammon \(n = 21\) and \(b \approx 20\), though \(b\) can sometimes be as high as \(4000\) (when the roll is doubles), so 3 ply is about as deep as we can manage.
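
A rough back-of-the-envelope count of leaf positions at depth \(d\) using those typical values:

\[ (b\cdot n)^d = (20\cdot 21)^3 = 420^3 \approx 7.4\times 10^{7}, \qquad 420^4 \approx 3.1\times 10^{10} \]

so every additional ply multiplies the work by roughly a factor of \(420\).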

To see why this hurts so much, consider alpha-beta search again: it ignores future developments that won’t happen given optimal play, focusing effort on the likely (best-play) lines.

But in a game with dice throws, there are no such “likely” move sequences: for a sequence to occur, the dice would first have to come out the right way.

This is a general problem whenever uncertainty is introduced: the space of possible futures expands enormously, and going deep on planning a specific strategy is a waste, because the universe may not play along.

You may have considered applying alpha-beta pruning to these chance nodes. This is possible!

Let’s look at node \(C\) from our game tree: is it possible to find an upper bound on its value before evaluating all of its children (which is what alpha-beta needs)? Yes, provided we can bound the leaf utilities: if every utility lies in a known interval, we can bound a chance node’s expected value after seeing only some of its children.
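
A sketch of that bound (a hypothetical helper, assuming all utilities lie in \([lo, hi]\)): give every unevaluated child the extreme value.

```python
def chance_node_bounds(seen, remaining_prob, lo=-2.0, hi=2.0):
    """Bracket a chance node's expected value after evaluating only some children.

    seen: list of (probability, exact child value) pairs computed so far.
    remaining_prob: total probability mass of the not-yet-evaluated children.
    Every utility is assumed to lie in [lo, hi].
    """
    partial = sum(p * v for p, v in seen)
    return partial + remaining_prob * lo, partial + remaining_prob * hi

# Two dice outcomes evaluated so far; half the probability mass still unseen.
lower, upper = chance_node_bounds([(1/3, 2.0), (1/6, -1.0)], remaining_prob=1/2)
# If upper already falls below alpha, the whole chance node can be pruned.
print(lower, upper)   # -0.5 1.5
```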

In games like Yahtzee, where 5 dice are rolled, you may want to consider forward-pruning unlikely outcomes, or adopting another strategy entirely (like Monte Carlo tree search)

Partially Observable Games

Bobby Fischer declared that “chess is war,” but aside from the lack of bloodshed, chess also lacks war’s partial observability: the “fog of war.”

Many modern video games also have partial observability, and as in real war, the control of information is a weapon unto itself (concealment and misinformation).

In deterministic, partially observable games, the uncertainty arises from lack of knowledge of the opponent’s moves. E.g. Battleship, Stratego…

We’ll look at Kriegspiel, a partially observable chess variant. (other similar games are Phantom Go, Phantom tic-tac-toe and Screen Shogi)

Kriegspiel: Partially observable chess

Here’s how to play Kriegspiel:

  • White and Black see a board that contains only their pieces

  • A referee can see all pieces and adjudicates the game and occasionally dispenses information and rulings to both players

  • First, White proposes a move to the referee. If some black piece would prevent the move, the referee announces “illegal”; otherwise the move goes through. White can gather information by analyzing these “illegal” announcements

  • If a legal move goes through, the referee may announce “Capture on square X,” or “Check by D,” where D gives the direction/type of check: “rank,” “file,” “knight,” “long diagonal,” or “short diagonal”

  • If there is a checkmate or stalemate, the referee announces it, otherwise the game continues

So, who wants to play?

Humans can actually play Kriegspiel pretty well, and computers are starting to be able to as well

How might we explore this space?

  • Remember our belief states from Ch. 4? The set of “logically possible” states. Keeping track of them is exactly the “state estimation” problem we’ve seen.

  • Kriegspiel can be mapped onto the partially observable, nondeterministic framework and we can consider the opponent to be the source of the nondeterminism!

  • Instead of a strategy that specifies a move for each possible opponent move, we need a move for each possible percept sequence we might receive

  • In this case, a winning strategy (or guaranteed checkmate) is one that, for every possible percept sequence from the current belief state, leads to a checkmate (independent of how the opponent moves!)

    • This would still work even if the opponent can see the board

Guaranteed checkmate plan (KRK endgame)

These trees can be solved with our old \(AndOr\) algorithm!
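 
A compressed sketch of that idea, with hypothetical helpers `our_moves`, `is_goal`, and `percepts_and_updates` (which yields, for a proposed move, each possible referee announcement together with the updated belief state):

```python
def and_or_search(belief, path=()):
    """Return a strategy guaranteeing checkmate from this belief state, or None."""
    if is_goal(belief):                      # checkmate in every possible state
        return {}
    if belief in path:                       # avoid cycles
        return None
    for move in our_moves(belief):           # OR: we need one move that works...
        plan = {}
        for percept, next_belief in percepts_and_updates(belief, move):
            subplan = and_or_search(next_belief, path + (belief,))
            if subplan is None:              # AND: ...for every possible percept
                break
            plan[percept] = subplan
        else:
            return {"move": move, "then": plan}
    return None
```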

Besides Guaranteed Checkmates, there are now also Probabilistic Checkmates

Consider the situation of an otherwise-empty board with two kings: they will eventually find each other with probability 1, right?

The KBNK endgame may be won in this way, forcing an eventual reveal that can be converted into checkmate

In contrast, the KBBK endgame is won with probability \(1-\epsilon\): to force checkmate, White must eventually leave a bishop unprotected for one move, and with some small probability \(\epsilon\) the black king happens to be in position to capture it

Also, sometimes a strategy works for some of the board states in the current belief state and not others; in attempting it, a checkmate may happen anyway (an “accidental” checkmate, which commonly happens for human players)… so how likely is any given strategy to succeed?

…well how likely is any given board state to result in a win?

We could attempt to assign a probability to every state… but there’s a problem there too: those probabilities depend on how the opponent actually plays, and we’ve already assumed optimal play.

Wait, there’s more (problems)! Remember, information is both an asset and a liability, so we’re not just trying to find the optimal moves, but also to minimize the information we give the opponent!

This can be done by allowing ourselves to occasionally make random moves.

So… the ideal method is to produce an optimal randomized strategy (called an equilibrium), but this is currently too expensive for general Kriegspiel… the topic is considered “open”

Card Games

Name some card games with stochastic partial observability:

  • Bridge, whist, hearts, poker, etc.

It seems that the randomly dealt cards determine the moves available to each player, just like dice, except that all the “dice” are rolled at the beginning of the game!

This is not really correct, but it does suggest a way to reason about these systems:

  1. Treat the start of the game as a chance node with every possible deal as an outcome
  2. Use \(Expectiminimax\) to pick the best move
  3. Continue search as normal

This is called Averaging over Clairvoyance, because it assumes that after the one random element (the deal), the game becomes fully observable to both players.
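
A sketch of the idea via Monte Carlo sampling over deals (the helpers `sample_consistent_deal`, `legal_moves`, and the perfect-information `minimax_value` are hypothetical):

```python
from collections import defaultdict

def clairvoyant_choice(belief, n_samples=100):
    """Averaging over clairvoyance: score each move by its average
    perfect-information minimax value over sampled deals."""
    totals = defaultdict(float)
    for _ in range(n_samples):
        # A full deal consistent with everything we have observed so far.
        deal = sample_consistent_deal(belief)
        for move in legal_moves(belief):
            # Pretend the sampled deal is known to everyone from here on.
            totals[move] += minimax_value(deal, move)
    return max(totals, key=totals.get)
```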

As you may imagine this leads to problems…

Consider the following:

Day 1: Road \(A\) leads to a pot of gold. Road \(B\) leads to a fork. You can see that the left fork leads to two pots of gold, and the right fork leads to you being run over by a bus.

Day 2: Road \(A\) leads to a pot of gold. Road \(B\) leads to a fork. You can see that the right fork leads to two pots of gold, and the left fork leads to you being run over by a bus.

Day 3: Road \(A\) leads to a pot of gold. Road \(B\) leads to a fork. You can see that one fork leads to two pots of gold, and one fork leads to you being run over by a bus. But… you don’t know which is which.

If you average over clairvoyance, you may be led astray…

This strategy doesn’t consider the belief state that results from any given action.

A belief state of complete ignorance is a bad place to be (especially if you might get run over by a bus)

After the deal, we still don’t have perfect knowledge, but the algorithm will never select actions to gather knowledge

It also won’t try to hide information from the other player, nor provide information to a partner (it assumes they know), nor bluff in a game like poker.

Despite all of this, it can still be effective, with some tweaking and observations

One way of making it more reasonable is with abstraction: for example, in bridge you see two of the four hands at a time, leaving two unseen hands of 13 cards each, so the number of potential deals is

\[ \binom{26}{13}=10,400,600 \]

Solving even one deal is difficult; forget more than 10 million of them.
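
A quick check of that count:

```python
import math

# Ways to split the 26 unseen cards into two 13-card hands.
print(math.comb(26, 13))   # 10400600
```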

That said, it’s important where the aces and kings are, but not so much where the fours or fives are, so we can treat fours and fives (and other unimportant cards) as interchangeable.

Forward pruning can also be helpful, considering some number of random deals to get a sense of the statistics

(This can also be done in Kriegspiel)

Doing a heuristic search with a depth cutoff, rather than solving each deal exactly, also yields good guesses.

However, we still assume that all deals are equally likely, which is fine for games like hearts and whist, but not for bridge.

In bridge, players bid on the number of tricks they expect to win. Since a bid is informed by the cards the bidder can see (but not the opponents’ cards, and vice versa), opponents can learn something about a player’s cards from their bid.

It’s important to keep track of the probability of certain outcomes based on the information you get from your opponents while minimizing the information you give them

Computers have surpassed human performance in Poker

Libratus competed against four of the top poker players in the world in a 20-day no-limit Texas hold ’em competition and decisively won.

It used abstraction extensively, considering a hand of \(AAA72\) and \(AAA74\) to be equivalent, and a bid of \(\$200\) to be equivalent to \(\$201\). And, if it detected that another player was exploiting its abstraction, it would devote extra computation to estimating the relevant probabilities better than that player.

In total, it used 25 million CPU hours (on a supercomputer) to win.

This is to say, if you want to play a superhuman game of “insert game here”, on a budget, you’re likely out of luck.

That said, crowdsourcing the computation is an option (which has been utilized by \(LEELAZERO\) and others)

Machine learning models are often expensive to train, but once trained, can often be used (at tournament level) on a commodity CPU/GPU, \(ALPHASTAR\) did this successfully.

Limitations of Game Search Algorithms

It should be clear at this point that finding the optimal decision in games is an intractable problem

We are required to make some assumption or approximation to explore the search space at all.

Alpha-Beta search uses some heuristic evaluation function

Monte Carlo search computes an approximate average over a random selection of rollouts

If the branching factor is high, Monte Carlo tends to work better, but both approaches have intrinsic drawbacks.

Consider the following situation:

Two-ply game tree (minimax error possible)

What if our heuristic function has some small error?

If our estimation has \(\sigma = 5\), then the left node is the better option \(71\%\) of the time, and \(58\%\) of the time when \(\sigma = 2\)!

And that’s assuming the per-node errors are independent; if they’re not, the estimates get even worse!
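
Here is a simulation sketch of this kind of analysis. The leaf values below are made up (the real ones come from the figure above), so the exact percentages will differ, but it shows how noisy evaluations can flip the decision:

```python
import random

# Hypothetical two-ply tree: MAX to move, two MIN nodes with these true leaf values.
# True minimax values: min(LEFT) = 100 > min(RIGHT) = 99, so "left" is truly better.
LEFT, RIGHT = (100, 101), (99, 1000)

def noisy_min(leaves, sigma):
    """MIN-node value when each leaf estimate has independent Gaussian error."""
    return min(v + random.gauss(0, sigma) for v in leaves)

def p_correct(sigma, trials=100_000):
    """How often the truly better move still looks better under noise."""
    wins = sum(noisy_min(LEFT, sigma) > noisy_min(RIGHT, sigma) for _ in range(trials))
    return wins / trials

print(p_correct(2.0), p_correct(5.0))
```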

This is difficult to compensate for when we don’t have a good model of how the error in one node relates to the errors in others.

Another issue is that both alpha-beta and Monte Carlo search calculate the values of (or bounds on the values of) all the legal moves.

If a move is “obviously better”, then the algorithm should simply make the move (a valid checkmate for example)

A better algorithm would reason about the utility of node expansion: which expansions are likely to reveal a better move? If no expansion has utility higher than its search cost, stop searching and make a move!

(this works not just for “obvious” moves, but also moves that are exactly equivalent)
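
A toy sketch of such a stopping rule (all names hypothetical): if the best move’s lower bound already matches or beats every alternative’s upper bound, no further search can change the decision.

```python
def should_stop_and_move(bounds):
    """bounds: {move: (lower, upper)} value intervals from the search so far.
    Stop when one move's lower bound beats or ties every other move's upper
    bound, since no further expansion can change which move is best."""
    best = max(bounds, key=lambda m: bounds[m][0])
    best_lower = bounds[best][0]
    return all(upper <= best_lower for m, (_, upper) in bounds.items() if m != best)

# Example: the checkmating move is already provably at least as good as anything else.
print(should_stop_and_move({"Qh5#": (1.0, 1.0), "a3": (-1.0, 0.4)}))  # True
```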

This kind of reasoning about reasoning is called metareasoning, and is generally applicable to all reasoning (not just games)

Monte Carlo somewhat uses this idea to allocate resources to the more important parts of the tree, but does so clumsily

Also, humans don’t really play games the way computers do: we don’t just pick moves one at a time, we plan. We form higher-level goals (say, trapping the opponent’s queen) and then selectively search for move sequences that satisfy them.

The ability to incorporate machine learning will be an important consideration in the future, given its proven ability to capture the complicated behavior of real-world play.

Go check out the historical notes for some interesting extra details!