Following on from the idea of learning to make an optimal single decision, we can expand this to making multiple sequential decisions in an optimal way. To do this we’ll be exploring Monte Carlo Tree Search (MCTS); an algorithm that combines ideas from traditional tree search algorithms, and reinforcement learning (RL).

Today we’re going to focus on building intuition. In later posts we’ll discuss and implement vanilla MCTS, then we’ll explore some exciting improvements and variations 😀 In the end you’ll have implemented a superhuman AI that can beat you and everyone you know in a game.

## Markov Decision Processes (MDPs):

Whereas in previous posts on multi-armed bandits we were concerned with finding the single best choice, now we will be trying to find the best sequence of choices of actions to reach some goal or maximise some value. An example problem might be: what’s the best sequence of actions a Pac-Man 🟡 can take to get the highest score and win the game?

To do this, Pac-Man must plan ahead to decide which actions lead to food, and collect as much of that as possible, while also avoiding the ghosts. We can formalise this in a nice way as a ** Markov Decision Process **(MPD). In an MDP, we call the situation Pac-Man finds himself in at any time a

**state**. From a state Pac-Man can take

**actions**, which might grant Pac-Man some

**reward**, and then the action leads to new states.

In the above diagram, Pac-Man is in state s_t, where the t denotes the current time. Pac-Man has several choices of action, and chooses to go down (that is the action a_t at time t). As a result, Pac-Man gets a reward of +20 (that is the reward r_t at time t), and moves to state s_{t+1} at the next timestep. So Pac-Man’s goal is to maximise the reward from eating food, by planning a sequence of actions we call a **policy** (\pi).

## The Value of a State:

Since it will take Pac-Man many time-steps to achieve the goal, we can formulate this as the expected sum of future rewards ➕:

G_t = \sum_{\tau=t+1}^{T}R_\tau \space

All this says is that the **return** at any given timestep is the cumulative reward we will receive by the end of the game (the **terminal** timestep T). The policy \pi dictates which actions Pac-Man will take in each state, and so this also dictates how much reward we expect Pac-Man to receive by the end of the game. This allows us to determine the **value** of of each state when following a given policy:

\begin{aligned} v_\pi(s) &= \mathop{\mathbb{E}}_\pi \left[ \sum_{\tau = t+1}^{T} R_\tau \space | \space S_t=s\right] \\ &= \mathop{\mathbb{E}}_\pi \left[ G_t \space | \space S_t=s\right] \end{aligned}

This is an *expectation* because even though Pac-Man can deterministically choose which actions he takes and where he ends up, there are other parts of the environment he cannot reliably predict. For example, Pac-Man can’t predict where each ghost will move, so he has to plan for all outcomes of his actions with some probability. It might be the case that Pac-Man chooses to move up to get some food (receiving a reward), but a nearby ghost also chooses to move to the same space, killing Pac-Man (no reward, and maybe *game over *if Pac-Man is out of lives). Pac-Man needs to account for these possibilities when he is deciding where to move 🤔.

To expand that example a bit more: if Pac-Man doesn’t know how the ghosts move, he must treat them as totally random. In this case Pac-Man would assume the ghosts could move in any of the four directions with equal probability (0.25). If Pac-Man chooses to move up, there is a 0.75 probability that he gets the +20 reward, but a 0.25 probability that he gets 0 (walks into the ghost and loses a life). The expected immediate value of moving up to that piece of food is therefore 0.75 \times 20 = 15. If Pac-Man has another piece of food to his left, and there is no ghost in the immediate neighbourhood of that piece of food, then the expected value of moving left is the full 20. Pac-Man should move left.

## One-step lookahead is not enough:

There are two problems that we have to solve here.

- To know the expected value of the state, we had to simulate taking the action and calculate the probabilities of each outcome that could occur.
- Once we know the expected value of each state we can move to, we must compare them to determine the best action to take. Doing this repeatedly from each state provides us with our new policy \pi.

There is a subtle issue you may have noticed with problem 1, and in my example above: moving left may *seem* like the best choice if we only simulate one turn ahead, but it may also lead Pac-Man to a dead end where the ghost backs him into a corner and kills him 😱. We can only know this is the case if we simulate not one step ahead, but many. So the *true* value of moving to a state does not just come from the immediate reward we receive multiplied by a single probability. It instead comes from simulating all the possibilities many many steps ahead (preferably up to the end of the game) and figuring out the expected value from *all of those outcomes* recursively 🔁:

\begin{aligned} v_\pi(s) &= \mathop{\mathbb{E}}_\pi \left[ G_t | S_t = s \right] \\ &= \mathop{\mathbb{E}}_\pi \left[ R_{t+1} + G_{t+1} | S_t =s\right] \\ &= \sum_a \pi(a|s)\sum_{s'}\sum_{r}p(s', r | s, a) \left[ r +\mathop{\mathbb{E}}_\pi \left[ G_{t+1} | S_{t+1}=s' \right] \right] \\ &= \sum_a \pi(a|s) \sum_{s' r} p(s', r | s, a)[r+v_\pi(s')] \end{aligned}

Okay, I admit that looks terrifying. But all this is showing is how the expectation breaks down (how we calculate it); summing up all the rewards multiplied by the corresponding probability of that action/outcome occurring. You can see this by looking at the last line: it is the same thing we walked through in our example above, when multiplying the probabilities of each outcome with the reward for Pac-Man moving up (maybe getting killed by the ghost) or moving left (guaranteed reward).

Notice also on the last line, that v_\pi(s') is included. What does this mean? It means that we apply the same calculations again for each possible action and outcome in the next state (s’). It is a *recursive *function. So these calculations will happen for every state, and then every subsequent state produced by taking the possible actions, and so on and so on. We could almost think of this as expanding an enormous tree, where every branch is an action leading to a new state… 🌲

## Monte Carlo Tree Search:

This is where MCTS comes in! It’s a search algorithm which balances exploration and exploitation by combining familiar concepts from multi-armed bandits (like UCB action selection), as well as tree search (best-first search) and Monte Carlo prediction (rollouts and back-propagation).

The balancing part is important ⚖️: it allows the algorithm to explore a state space efficiently by focussing the search on only the most promising areas. This allows the algorithm to explore deeper in the state space, calculate good value estimates for states, and construct intelligent plans. It does not need to explore the state space exhaustively to do this effectively.

Another benefit of the algorithm is that while it is technically a heuristic search algorithm, the heuristic (UCB) is entirely problem independent; MCTS can be deployed out-of-the-box and work well in many settings without the need for hand-crafted guidelines on how to play.

We’ll dig into the details and implement MCTS in my next posts 🙂