Dan Carter - Daniel Carter - Doctoral Researcher

Following on from the idea of learning to make an optimal single decision, we can expand this to making multiple sequential decisions in an optimal way. To do this we’ll be exploring Monte Carlo Tree Search (MCTS); an algorithm that combines ideas from traditional tree search algorithms, and reinforcement learning (RL).

Today we’re going to focus on building intuition. In later posts we’ll discuss and implement vanilla MCTS, then we’ll explore some exciting improvements and variations 😀 In the end you’ll have implemented a superhuman AI that can beat you and everyone you know in a game.

Markov Decision Processes (MDPs):

Whereas in previous posts on multi-armed bandits we were concerned with finding the single best choice, now we will be trying to find the best sequence of choices of actions to reach some goal or maximise some value. An example problem might be: what’s the best sequence of actions a Pac-Man 🟡 can take to get the highest score and win the game?

To do this, Pac-Man must plan ahead to decide which actions lead to food, and collect as much of that as possible, while also avoiding the ghosts. We can formalise this in a nice way as a Markov Decision Process (MPD). In an MDP, we call the situation Pac-Man finds himself in at any time a state. From a state Pac-Man can take actions, which might grant Pac-Man some reward, and then the action leads to new states.

Source: Author & https://freepacman.org/

In the above diagram, Pac-Man is in state $s_t$ , where the t denotes the current time. Pac-Man has several choices of action, and chooses to go down (that is the action $a_t$ at time t). As a result, Pac-Man gets a reward of +20 (that is the reward $r_t$ at time t), and moves to state $s_{t+1}$ at the next timestep. So Pac-Man’s goal is to maximise the reward from eating food, by planning a sequence of actions we call a policy ( $\pi$ ).

The Value of a State:

Since it will take Pac-Man many time-steps to achieve the goal, we can formulate this as the expected sum of future rewards ➕:

G_t = \sum_{\tau=t+1}^{T}R_\tau \space

All this says is that the return at any given timestep is the cumulative reward we will receive by the end of the game (the terminal timestep $T$ ). The policy $\pi$ dictates which actions Pac-Man will take in each state, and so this also dictates how much reward we expect Pac-Man to receive by the end of the game. This allows us to determine the value of of each state when following a given policy:

\begin{aligned} v_\pi(s) &= \mathop{\mathbb{E}}_\pi \left[ \sum_{\tau = t+1}^{T} R_\tau \space | \space S_t=s\right] \\ &= \mathop{\mathbb{E}}_\pi \left[ G_t \space | \space S_t=s\right] \end{aligned}

This is an expectation because even though Pac-Man can deterministically choose which actions he takes and where he ends up, there are other parts of the environment he cannot reliably predict. For example, Pac-Man can’t predict where each ghost will move, so he has to plan for all outcomes of his actions with some probability. It might be the case that Pac-Man chooses to move up to get some food (receiving a reward), but a nearby ghost also chooses to move to the same space, killing Pac-Man (no reward, and maybe game over if Pac-Man is out of lives). Pac-Man needs to account for these possibilities when he is deciding where to move 🤔.

To expand that example a bit more: if Pac-Man doesn’t know how the ghosts move, he must treat them as totally random. In this case Pac-Man would assume the ghosts could move in any of the four directions with equal probability (0.25). If Pac-Man chooses to move up, there is a 0.75 probability that he gets the +20 reward, but a 0.25 probability that he gets 0 (walks into the ghost and loses a life). The expected immediate value of moving up to that piece of food is therefore $0.75 \times 20 = 15$ . If Pac-Man has another piece of food to his left, and there is no ghost in the immediate neighbourhood of that piece of food, then the expected value of moving left is the full 20. Pac-Man should move left.

One-step lookahead is not enough:

There are two problems that we have to solve here.

To know the expected value of the state, we had to simulate taking the action and calculate the probabilities of each outcome that could occur.
Once we know the expected value of each state we can move to, we must compare them to determine the best action to take. Doing this repeatedly from each state provides us with our new policy $\pi$ .

There is a subtle issue you may have noticed with problem 1, and in my example above: moving left may seem like the best choice if we only simulate one turn ahead, but it may also lead Pac-Man to a dead end where the ghost backs him into a corner and kills him 😱. We can only know this is the case if we simulate not one step ahead, but many. So the true value of moving to a state does not just come from the immediate reward we receive multiplied by a single probability. It instead comes from simulating all the possibilities many many steps ahead (preferably up to the end of the game) and figuring out the expected value from all of those outcomes recursively 🔁:

\begin{aligned} v_\pi(s) &= \mathop{\mathbb{E}}_\pi \left[ G_t | S_t = s \right] \\ &= \mathop{\mathbb{E}}_\pi \left[ R_{t+1} + G_{t+1} | S_t =s\right] \\ &= \sum_a \pi(a|s)\sum_{s'}\sum_{r}p(s', r | s, a) \left[ r +\mathop{\mathbb{E}}_\pi \left[ G_{t+1} | S_{t+1}=s' \right] \right] \\ &= \sum_a \pi(a|s) \sum_{s' r} p(s', r | s, a)[r+v_\pi(s')] \end{aligned}

Okay, I admit that looks terrifying. But all this is showing is how the expectation breaks down (how we calculate it); summing up all the rewards multiplied by the corresponding probability of that action/outcome occurring. You can see this by looking at the last line: it is the same thing we walked through in our example above, when multiplying the probabilities of each outcome with the reward for Pac-Man moving up (maybe getting killed by the ghost) or moving left (guaranteed reward).

Notice also on the last line, that $v_\pi(s')$ is included. What does this mean? It means that we apply the same calculations again for each possible action and outcome in the next state (s’). It is a recursive function. So these calculations will happen for every state, and then every subsequent state produced by taking the possible actions, and so on and so on. We could almost think of this as expanding an enormous tree, where every branch is an action leading to a new state… 🌲

Monte Carlo Tree Search:

This is where MCTS comes in! It’s a search algorithm which balances exploration and exploitation by combining familiar concepts from multi-armed bandits (like UCB action selection), as well as tree search (best-first search) and Monte Carlo prediction (rollouts and back-propagation).

The balancing part is important ⚖️: it allows the algorithm to explore a state space efficiently by focussing the search on only the most promising areas. This allows the algorithm to explore deeper in the state space, calculate good value estimates for states, and construct intelligent plans. It does not need to explore the state space exhaustively to do this effectively.

Another benefit of the algorithm is that while it is technically a heuristic search algorithm, the heuristic (UCB) is entirely problem independent; MCTS can be deployed out-of-the-box and work well in many settings without the need for hand-crafted guidelines on how to play.

We’ll dig into the details and implement MCTS in my next posts 🙂

In this post we’ll walk through some neat tricks to make $\epsilon$ -greedy more effective, and then we’ll dig into a smarter way to handle exploration: upper confidence bound action selection. We’ll be building on what we learned in my last post, and as always the code can be found in this colab notebook so you can follow along and try it out yourself 🙂

Optimistic initial values:

One problem with $\epsilon$ -greedy is that randomness isn’t a very intelligent way to conduct exploration, especially at the start of an episode when you don’t know anything about your options. Using our slot machines analogy again: due to the randomness, $\epsilon$ -greedy might not actually try all of the machines at least once for quite some time. Instead it might explore a few of the arms to begin with and then spend most of the time early on exploiting a sub-optimal arm.

A really simple and clever way we can make this early exploration more systematic is to give the $\epsilon$ -greedy agent optimistic initial estimates for each arm’s rewards. In practice this means setting the initial q-values quite high (not zero). This exploits the greediness of $\epsilon$ -greedy. The agent selects each arm one by one, expecting a high reward but instead getting a relatively low one. It then revises each estimate down each timestep until the estimate starts to converge on the true value. Overall, the agent is encouraged to explore much more effectively and the optimistic starting estimates gradually get reduced down to something more realistic.

You can see the results after applying this trick below:

Optimistic initialisation improves early exploration

The optimistic $\epsilon$ -greedy agent converges much faster than regular $\epsilon$ -greedy. This can make a big difference in problem settings when the number of steps per episode is smaller. Over longer timesteps the advantage mostly disappears as the impact of the early exploration reduces over time. It’s worth keeping in mind that this approach won’t help much in a non-stationary setting.

Unbiased constant step-size

There is a very subtle issue that we should address with the constant step size update rules (using a constant $\alpha$ for recency-weighting). The issue is that this update rule is biased by the inclusion of the initial estimate. Recall that the recency-weighted update rule for Q-values is essentially a weighted sum of all past rewards, plus a weighting of the initial Q value estimate (see my last post for a deeper analysis):

\begin{aligned} Q_{n+t} &= Q_n + \alpha[R_n - Q_n] \\ &= (1-\alpha)^nQ_1+ \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i \end{aligned}

This means our initial Q-value estimate permanently alters the subsequent Q-value estimate updates. The good news is that while the impact is permanent, it gradually reduces over more time until it is virtually non-existent. However, we’d like to remove this bias if possible, to make our agent’s early exploration more effective. To do that we need to alter our update rule formulation a little bit. The change involves altering the step size from just $\alpha$ to be:

\begin{aligned} \beta_n \doteq \alpha / \bar\omicron_n \end{aligned}

This $\omicron_n$ is the real interesting part, which is defined as follows:

\begin{aligned} \bar\omicron_n \doteq \bar\omicron_{n-1} + \alpha(1- \bar\omicron_{n-1}), \space \text{for} \space n \ge 0, \space \text{with} \space \bar\omicron_{0} = 0 \end{aligned}

Okay, so that maybe looks a bit confusing. In plain English this just means we need to keep track of a separate $\bar\omicron_{n}$ for each arm/action, and we update it using the above rule each time we pull its assigned arm. In code this just means a slightly modified update function that looks like:

  def update_estimate_weighted_unbiased(self, choice_index, reward):

    beta = self.alpha/self.omicrons[choice_index] if self.omicrons[choice_index] > 0 else self.alpha
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (beta*(reward - self.arm_qs[choice_index]))

    # update omicron for this action
    self.omicrons[choice_index] = self.omicrons[choice_index] + self.alpha*(1-self.omicrons[choice_index])

Simple, right? 😊 To see why this is an unbiased recency-weighted average we’ll have to do a bit more algebra. I’ll explain it as we go, but if you’re not interested then feel free to skip to the results.

First we’ll start with our original recency weighted average formula, but we’ll swap $\beta$ instead of $\alpha$ and rework things a bit to get only one $Q_n$ on the right hand side:

\begin{aligned} Q_{n+1} &= Q_n +\beta_n(R_n - Q_n) \\ &= Q_n + \beta_nR_n - \beta_nQ_n\\ &= \beta_n R_n + (1- \beta_n)Q_n \end{aligned}

The second line is just an expansion of the brackets in the first line, and the third line factors the two terms including $Q_n$ together. We’re mostly interested in $Q_2$ (because $Q_1$ is the biased initial estimate) to understand how the bias is eliminated:

\begin{aligned} Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\ \end{aligned}

But to figure this out we first need to figure out what $\beta_1$ is (this is the important part):

\begin{aligned} \beta_1 &= \frac{\alpha}{\bar\omicron_1} \\ &= \frac{\alpha}{\bar\omicron_0 + \alpha(1-\bar\omicron)} \\ &= \frac{\alpha}{0 + \alpha(1- 0)} \\ &= \frac{\alpha}{\alpha} \\ &= 1 \end{aligned}

In the first line we use the definition of $\Beta$ we saw earlier. Then the second line follows from the definition of $\bar\omicron$ we saw earlier too (it is calculated from the previous omicron and alpha). Finally, the third line follows from that same definition where $\bar\omicron_0$ is a special edge case which = 0. Now we can plug this back into the equation for $Q_2$ to see how the bias disappears:

\begin{aligned} Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\ &= 1 \cdot R_1 + (1 - 1)Q_1 \\ &= R_1 + Q_1 - Q_1 \\ &= R_1 \end{aligned}

Phew, and there you have it! 😅 There is no bias because the initial estimate $Q1$ is eliminated when we calculate $Q_2$ ! Since all the subsequent Q estimates are based on earlier Q estimates we have eliminated the bias permanently. This might seem like a lot of work, but in practice it’s just a couple of extra lines of code.

Now let’s take a look at how the unbiased version of recency-weighted e-greedy performs:

Unbiased update rule speeds up early exploration

The difference is hard to spot on these plots, but look at the plot on the right and you can see that the unbiased e-greedy is slightly faster at converging early on. This difference becomes negligible over time, but it’s a nice little performance boost for early steps in non stationary problems!

Upper confidence bound (UCB) action selection:

So far we’ve still been considering $\epsilon$ -greedy based agents, and we haven’t really tackled the problem that exploring randomly is not a smart way to approach exploration. Now we’re going to change that and discuss UCB – upper confidence bound action selection. This method selects actions by calculating confidence in the q-value estimate of each arm. The diagram below is a good visualisation to help explain:

What this shows is that the confidence in the estimate of Arm 1 is quite high (blue) and the confidence for the estimate of Arm 2 is low (orange). UCB will use this confidence estimate and calculate what the upper-confidence bound is (approximately 95% confidence level). In the plot the 95% confidence level for Arm 1 is shown by the green line, and the 95% confidence level of Arm 2 is shown by the red line. When choosing an action, UCB selects the action with the highest 95% upper-confidence bound, which in this case would be Arm 2.

This choice of action is controlled by the following formula:

\begin{aligned} A_t \doteq \argmax_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \space \right] \end{aligned}

$t$ is the current timestep and $\ln t$ is the natural logarithm of that value. $N_t(a)$ is the number of times the action a has been selected before time t. $c$ is a confidence level parameter that we can select (usually set to 2, which is 95% confidence). In code this formula completely replaces the action selection logic in choose_action() compared to $\epsilon$ -greedy:

def choose_action(self):
    action = np.argmax(self.ucb_calc())
    self.arm_ns[action] += 1

    reward, optimal = self.problem.draw_from_arm(action)

    self.update_estimate_incremental(action, reward)

    return reward, optimal
  
  def ucb_calc(self):
    t = np.sum(self.arm_ns)
    arm_ucbs = np.zeros(len(self.arm_qs))

    for i in range(0, len(self.arm_qs)):

      if self.arm_ns[i] == 0:
        # If we have not explored this arm before, we consider it maximising
        arm_ucbs[i] = np.Inf
        continue

      arm_variance = self.arm_qs[i] + self.confidence_level * np.sqrt((np.log(t)/self.arm_ns[i]))
      arm_ucbs[i] = arm_variance

    return arm_ucbs

For now we still update the q-value estimates using the incremental average update rule. Here are the results on a stationary problem:

upper confidence bound action selection performs well on stationary problems

The UCB algorithm outperforms both $\epsilon$ -greedy and fixed exploration greedy ( $\epsilon$ -first) on stationary problems! 📈 We can see that UCB finds the optimal action faster and more often than either of the other two previous best methods. But what about non-stationary problems?

upper confidence bound action selection performs poorly on non-stationary problems

Hmm, that’s disappointing 🤔 But this makes sense since we are using the incremental average update rule, which does not update estimates well in non-stationary environments! The problem is that UCB is tricky to adapt to the non-stationary setting, and this is an ongoing area of research (often with state-of-the-art results). This is mostly due to the confidence estimate requiring both the Q-value estimates and the pull counts too. However, there is one simple implementation we can use which follows a similar principle to the recency-weighted average update called discounted-UCB (D-UCB):

  def update_estimate_discounted(self, choice_index, reward):
    # Discount all reward estimates and pull counts (i.e. gradually forget old steps)
    self.arm_qs *= self.gamma
    self.arm_ns *= self.gamma

    self.arm_qs[choice_index] += reward
    self.arm_ns[choice_index] += 1

We just discount all the values for each arm q-estimate and pull count alike, like a recency-weighting for both the q-value estimate and the confidence estimate. Results using this method are much better and also beat the previous best approaches on non-stationary problems:

Discussion and future work:

That’s it! We’re finally done with multi-armed bandits. 🐙

We have covered a range of algorithms that make decisions and explore under uncertainty, in stationary or non-stationary settings. Most of these techniques will be applicable in more advanced reinforcement learning settings, but we’ll those cover in future posts. There are definitely more advanced multi-armed bandit methods out there, like Thompson sampling or klUCB, however I won’t write about those yet (and I’m not sure if/when I plan to). Hopefully you now have a good and thorough intuition about how and why all of these Multi-Armed Bandit methods work!

Author: Dan Carter

Monte Carlo Tree Search (Part 1): Introduction to MDPs