Monte Carlo Tree Search (Part 2): A Complete Explanation with Code

In the last post we discussed the problem of acting optimally in an episodic environment by estimating the value of a state. Monte Carlo Tree Search (MCTS) naturally fits the problem by incorporating intelligent exploration into decision-time multi-step planning. Give that post a read if you haven’t checked it out yet, but it isn’t necessary to understand today’s post. It might also be beneficial to get some intuition on UCB action selection.

Today we’re going to dig into ‘vanilla’ MCTS, the algorithm that state-of-the-art game-playing agents like AlphaZero build on. When we’re done you’ll understand how and why this works so well for general game playing (especially in board games like Chess and Go).

Accompanying code snippets are included in the post to help explain the ideas. You can also follow along and run the code in this colab notebook. A python implementation of Connect4 is included in the notebook so you can play against the finished MCTS algorithm 😄 🔴🔵🔴🔴

The Four Steps in MCTS:

MCTS builds up estimates for the values of each possible state and action by planning ahead. In order to do this it must have a perfect model of the environment. In other words it must know exactly where it ends up after taking each action – without actually having to take that action. This is a downside to MCTS, since it can’t work without this model.

Anyway, assuming we have this perfect model we can simulate taking actions and choose the best ones based on the outcomes of simulated games. Doing so repeatedly we can build up a game tree of actions and states the algorithm has explored 🌳. Here are the four steps MCTS repeats to do this:

  1. Selection
  2. Expansion
  3. Rollouts/Simulation
  4. Backup of results
Figure from Reinforcement Learning: An Introduction (Sutton & Barto, 2018), page 186.

1. Selection:

MCTS starts by selecting the best node in the tree so far according to the UCB formula. This formula estimates the value (and uncertainty in that value) of each state in the tree. This form of MCTS is actually called “upper-confidence bound applied to trees” (UCT). Furthermore, UCT is just an extension of UCB action selection from multi-armed bandit problems, but applied to multi-step decision making instead. Here’s the UCB formula adapted to tree search (with explanation below):

Score_i &= \bar{q_i} + U_i \\
&= \bar{q_i}+C \sqrt{\frac{2 \ln N}{n_i}}

To demonstrate, think about this from the perspective of a parent node with several child nodes to consider. The action selection score of each child node is made up of the current value estimate and the uncertainty in that estimate. The value estimate of the i-th child node is the current mean value of that node (\bar{q_i}). The uncertainty part U_i has a constant C that scales the uncertainty (bigger means more exploration, smaller means less exploration). We’ll leave this at C=1. In the square root, N is the number of visits to the parent, and n_i is the number of times child node i was chosen when passing through the parent.

As an example, check the diagram below:

Monte Carlo Tree Search selection process

If we are in the parent node deciding which child node to explore, we apply the above UCB formula to each of the children and choose the one with the highest score. For instance evaluating child 1 and child 2:

Score_1 &= \frac{5}{10} + \sqrt{\frac{{2\ln20}}{10}} = 1.274 \\
Score_2 &= \frac{3}{5} + \sqrt{\frac{{2\ln20}}{5}} = 1.695 \\

According to this score the action selection will decide to further explore child 2. This is because we have explored Child 2 only 5 times, but exploring that child won 3 out of the 5 simulated games! This is a high win rate. We are also less certain about this result because the number of times explored is low (5). So the UCB action selection correctly pushes the algorithm to further explore this promising node 🧭.
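The arithmetic above is easy to check in a few lines of Python. This is a minimal sketch rather than code from the notebook: the helper name `ucb_score` and its parameters are made up for illustration, and C is fixed at 1 as in the text.

```python
import math

def ucb_score(wins, visits, parent_visits, c=1.0):
    """Mean value of a child plus the exploration bonus C * sqrt(2 ln N / n_i)."""
    mean_value = wins / visits
    uncertainty = c * math.sqrt(2.0 * math.log(parent_visits) / visits)
    return mean_value + uncertainty

# Child 1: 5 wins in 10 visits. Child 2: 3 wins in 5 visits. Parent: 20 visits.
score_1 = ucb_score(wins=5, visits=10, parent_visits=20)
score_2 = ucb_score(wins=3, visits=5, parent_visits=20)
print(f"child 1: {score_1:.2f}, child 2: {score_2:.2f}")
```

Child 2 comes out ahead (roughly 1.69 against 1.27), mostly because its smaller visit count gives it a larger uncertainty bonus.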

Important side notes !!!

  1. When selecting from child nodes: if two children have an equal score, they must be selected from randomly. Any other tie-breaking method is not guaranteed to converge to an optimal policy.
  2. Additionally if a child node has not been explored at all (no visits) we automatically set its Q+U value to +infinity (to guarantee exploring any unexplored children when visiting a parent node).
  3. Actually just about any bandit action selection algorithm could work in place of UCB. \epsilon-greedy for example.
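To make the third note concrete, here is a rough sketch of what an \epsilon-greedy selection rule could look like in place of UCB. It is not part of the notebook: the function name `epsilon_greedy_child` and its arguments are hypothetical, and it keeps the first two notes (unvisited children first, random tie-breaking).

```python
import random

def epsilon_greedy_child(q_sums, visits, epsilon=0.1, rng=random):
    """Pick a child index with an epsilon-greedy rule instead of UCB."""
    # Note 2: always try unvisited children first.
    unvisited = [i for i, n in enumerate(visits) if n == 0]
    if unvisited:
        return rng.choice(unvisited)
    # Explore: with probability epsilon pick any child uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(visits))
    # Exploit: pick the best mean value, breaking ties randomly (note 1).
    means = [q / n for q, n in zip(q_sums, visits)]
    best = max(means)
    return rng.choice([i for i, m in enumerate(means) if m == best])

# Same two children as the worked example: 5/10 and 3/5.
print(epsilon_greedy_child([5, 3], [10, 5], epsilon=0.0))
```

Unlike UCB, this rule explores blindly: it does not favour children whose estimates are still uncertain.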

Here’s the first part of the code in python, wrapped in a TurnBasedUCTNode class. We start in tree_policy():

class TurnBasedUCTNode():
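  def __init__(self, env, player_1_state, player_2_state, player_turn,
               action=None, parent=None, id_in_parent=None, reward=None):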

    self.env = env
    self.player_1_state = player_1_state # 2d np array
    self.player_2_state = player_2_state # 2d np array
    self.player_turn = player_turn
    self.action = action # which action was chosen previously Q(s,a)
    self.parent = parent # parent node

    # self.action != id_in_parent e.g. action may be 6, but id_in_parent may be 0
    self.id_in_parent = id_in_parent # index of this child in parent's list of children

    self.is_expanded = False
    self.children = [] # list of child nodes
    self.child_visit_count = None  # need to expand before we know size of this list
    self.child_q_sum = None  # need to expand before we know size of this list
    self.reward = reward

  def child_q_estimates(self):
    return self.child_q_sum/(self.child_visit_count+1)
  def child_ucb1_estimates(self):
    # at the root there is no parent to read our own visit count from,
    # so use the total number of visits to our children instead
    if self.parent is None:
      my_visits = np.sum(self.child_visit_count)
    else:
      my_visits = self.number_visits

    U = np.sqrt(2.0*np.log(my_visits)/(self.child_visit_count+1))
    return U

  def select_best_child(self, max_value=1e6):
    # Get Q + U for each child, return max, break ties randomly

    if not self.is_expanded:
      return self
    q_u = self.child_q_estimates() + self.child_ucb1_estimates()
    q_u[self.child_visit_count==0.0] = max_value

    max_choices = np.flatnonzero(q_u == np.max(q_u))

    if len(max_choices) == 1:
      return max_choices[0]

    random_choice = rando._randbelow(len(max_choices))
    best_child_index = max_choices[random_choice]

    return best_child_index

  def tree_policy(self):
    current = self

    node_visits = 0
    while current.is_expanded:
      node_visits += 1
      best_child_index = current.select_best_child()
      current = current.children[best_child_index]

    return current, node_visits # a not expanded leaf node

2. Expansion:

This part is pretty easy to understand. Once selection reaches a node that has no children, we need to create those children. So we are expanding the tree. We do this by asking our simulated environment (perfect world model) which actions we can take given the state we are in. Then we take each of these actions in turn in our simulated environment and store the resulting states as new child nodes. Once this is done, we select one of these at random for the rollout/simulation stage.

Monte Carlo Tree Search expansion

The code for the expansion step is below, it is a method of the TurnBasedUCTNode class:

def expand(self):

    if self.reward is not None:
      return self

    possible_actions = self.env.get_legal_actions(self.player_1_state,
                                                  self.player_2_state)

    # perform action filtering in env.legal_actions to get legal actions
    action_num = len(possible_actions)

    if action_num == 0:
      return self

    next_player_turn = -1 if self.player_turn == 1 else 1 # flip player turn

    self.child_visit_count = np.zeros(action_num, dtype=np.uint32)
    self.child_q_sum = np.zeros(action_num, dtype=np.int32)

    # loop thru legal actions and simulate stepping each one
    i = 0
    for action in possible_actions:
      p1_state, p2_state, reward = self.env.step(action,
      child = TurnBasedUCTNode(self.env, 

      i+=1 # increment the index used for id_in_parent

    self.is_expanded = True

    # return a random child node for rollouts
    random_child = self.children[rando._randbelow(len(self.children))]                                              

    return random_child

3. Rollout / Simulation:

This part is pretty fun, and is at the heart of why MCTS works. It took me a while to get my head around this when I was first learning about MCTS. So I’ll spend some extra time on this section to map out the concepts and intuition.

You may have been wondering where the \bar{q_i} part came from in the UCB formula discussed earlier. It’s made up of two parts: the number of visits, and the sum of rewards. The number of visits makes sense, but where does the sum of rewards come from? It comes from the Monte Carlo rollouts 🧻.

In this stage the algorithm simulates a full game (all the way to the end) by randomly selecting actions for each player until the simulated game ends. Once that happens, the environment tells us if the game ended in a win/loss/draw, which is the reward.

Monte Carlo Tree Search rollout

To clarify: during rollouts we don’t create new nodes or copy state variables, which keeps the rollout stage fast.

This still doesn’t really explain why rollouts work though. How does the result of a random game help us decide which actions to take?

A great way to think about this is as an extension of multi-armed bandit problems. In M.A.B.s we usually sample an action several times to build an estimate of its value. The same thinking applies here too. Hence, the more times we expand, reach a leaf node, and randomly rollout, the better the estimate of the value of each node/action.

To make this clearer: imagine if we sorted all of the possible end-game states from left to right, in order of lose, draw, win (see the diagram below). So a winning rollout will always land to the right, and a losing one to the left, and a draw is somewhere in the middle. In this visualisation we can imagine that good actions move us to the right, whereas bad actions move us to the left.

Monte Carlo Tree Search builds a value estimate using rollouts

Consider that for any node we are building up the empirical probability distribution of the rewards from the simulated games (-1, 0, +1). The q-estimate of a node is then the mean of this distribution, computed from the rollouts that passed through this node. This allows us to determine if an action is good or bad! 😈
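A small numerical illustration of that last point may help. The rollout outcomes below are made-up numbers, not results from the notebook; the sketch only shows that the mean of the empirical reward distribution is the same thing as the reward sum divided by the visit count.

```python
import math
from collections import Counter

# Hypothetical outcomes of 10 rollouts that passed through one node.
rollout_rewards = [1, 1, -1, 0, 1, -1, 1, 0, 1, 1]

# Empirical distribution over the three possible rewards.
counts = Counter(rollout_rewards)
distribution = {r: counts[r] / len(rollout_rewards) for r in (-1, 0, 1)}

# Mean of that distribution ...
q_from_distribution = sum(r * p for r, p in distribution.items())
# ... equals the reward sum divided by the number of visits.
q_from_sums = sum(rollout_rewards) / len(rollout_rewards)

print(distribution)
print(q_from_sums)
```

This is why a node only needs to store two numbers, a visit count and a reward sum, rather than the full list of rollout results.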

Code for the rollout method is shared below to neatly summarise:

def rollout(self):
    max_t = (self.env.board_height*self.env.board_width)
    # If this node is terminal state, backup reward immediately
    if self.reward is not None:
      self.backup(self.reward)
      return

    reward = None

    temp_p1_state = self.player_1_state.copy()
    temp_p2_state = self.player_2_state.copy()
    rollout_turn = self.player_turn

    # a game can last at most one move per board cell
    for _ in range(max_t):
      # get legal actions from the environment
      legal_actions = self.env.get_legal_actions(temp_p1_state, temp_p2_state)

      # In this case, must be a draw.
      if not legal_actions:
        reward = 0
        break

      # choose random actions during rollouts
      action = legal_actions[rando._randbelow(len(legal_actions))]

      # don't need to copy states during rollout steps, just act on same state
      temp_p1_state, temp_p2_state, reward = self.env.step(action,

      # reward signals end of game
      if reward is not None:
        # reward is -1 if the player turn is not same as rollout turn
        # in other words: this action led to eventual loss
        if rollout_turn != self.player_turn:
          reward = reward * -1
        break

      # flip player_turn on each loop
      rollout_turn = -1 if rollout_turn == 1 else 1

    self.backup(reward) # backup the reward

4. Backup:

Once a rollout is complete we have to traverse the tree backwards and update the \bar{q_i} estimate of each node. This is done by incrementing its visit count and adding the reward from the rollout to its reward sum. Again, we need to backup so that each node is aware of the outcomes of games that ‘passed’ through it.

We first update the values in the leaf node, then move to the parent of that node and update those values, and so on until we reach the root node. At this point there is no parent, and we don’t need to update the values of the root node anyway.

Monte Carlo Tree Search backup

Code for the backup method is provided below. Beware: we backup in a NegaMax fashion, meaning we negate the value of the reward on each visit to subsequent parent nodes. Why? Because Connect4 is a two-player turn based game, so every second node is an opponent move. A win for us is a loss for the opponent. During the selection phase this allows MCTS to simulate the opponent picking the ‘best’ moves from their perspective during their turns! 🧠

 def backup(self, reward):
    current = self

    if reward is None:
      reward = 0.0

    # in case we reached root
    while current.parent is not None:
      current.number_visits += 1
      current.total_value += reward 
      current = current.parent
      reward = reward * -1 # Ensure correct sign when backing up rewards
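To see the sign flipping in isolation, here is a stripped-down sketch of the same loop on a chain of three nodes below a root. The `Node` class is a hypothetical stand-in for TurnBasedUCTNode that keeps only the fields the backup touches.

```python
class Node:
    """Hypothetical minimal node: just a parent link and the two statistics."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.number_visits = 0
        self.total_value = 0.0

def backup(node, reward):
    # Walk up towards the root, negating the reward at every level.
    # As in the post, the root itself is not updated.
    while node.parent is not None:
        node.number_visits += 1
        node.total_value += reward
        node = node.parent
        reward = -reward

root = Node("root")
a = Node("a", parent=root)
b = Node("b", parent=a)
c = Node("c", parent=b)

backup(c, 1.0)  # a rollout from c ended in a win
for n in (c, b, a, root):
    print(n.name, n.number_visits, n.total_value)
```

The win at c is recorded as +1 for c, -1 for b and +1 for a, which matches the idea that every second node belongs to the opponent.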

Playing against MCTS in Connect 4:

Now it’s time to play against the AI we have created 🙂 You can do so by running the colab notebook, and interacting with the game at the very bottom of the notebook. 🔴🔵🔴🔴

  1. Open the link.
  2. At the top, click “Runtime”.
  3. Press “Run All”
  4. Scroll to the bottom and enter your moves in the text box (you are circles, see image below).

Adjusting the difficulty: The MCTS agent thinks for a fixed amount of time per turn, set by the time_limit parameter of the uct_search function call (7.0 seconds in the snippet below). You can adjust the difficulty by giving the agent more or less time.

for i in range(0, max_turns):

  # p1: looks ahead using MCTS
  root_node = TurnBasedUCTNode(env, p1_state, p2_state, player_turn=1)
  action = uct_search(root_node, time_limit=7.0)
  print(f"Chosen action: {action}")
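The body of uct_search is not shown in this post, so the following is only a guess at how a time budget like time_limit is typically enforced: keep running search iterations until a deadline passes. The helper name `run_until` and the `iterate` callback are assumptions made for this sketch.

```python
import time

def run_until(time_limit, iterate):
    """Call iterate() repeatedly for about time_limit seconds; return the call count."""
    deadline = time.monotonic() + time_limit
    iterations = 0
    while time.monotonic() < deadline:
        iterate()  # one select / expand / rollout / backup pass would go here
        iterations += 1
    return iterations

# Placeholder iteration that does nothing, just to show the budget at work.
count = run_until(0.05, lambda: None)
print(f"ran {count} iterations in 0.05 seconds")
```

More time means more iterations, which means more rollouts behind every value estimate. That is why a larger time_limit makes the agent harder to beat.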


We’ve seen how the MCTS algorithm works step by step. In summary: first the algorithm repeatedly selects nodes using the UCB action selection formula until it reaches a leaf node. It then expands that leaf node to create new child leaf nodes. Next, randomly selecting one of the new child leaves, the algorithm simulates a random game all the way to the end. The result or reward from that simulated game is then backed up the tree by revisiting each parent node. MCTS goes through this process repeatedly until some stopping condition is reached (usually a time limit). 📝
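To see the four steps working together outside of Connect4, here is a compact sketch on a toy game. Everything in it is an assumption made for illustration and not the notebook's implementation: the game (a pile of stones, each player removes 1 or 2, whoever takes the last stone wins), the `Node` fields, a fixed iteration count instead of a time limit, and picking the most visited child as the final move.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, action=None):
        self.stones, self.parent, self.action = stones, parent, action
        self.children = []
        self.visits = 0
        self.value_sum = 0.0  # from the viewpoint of the player who moved INTO this node

def legal_actions(stones):
    return [a for a in (1, 2) if a <= stones]

def ucb(child, parent_visits, c=1.0):
    if child.visits == 0:
        return math.inf  # unexplored children are always tried first
    return child.value_sum / child.visits + c * math.sqrt(2 * math.log(parent_visits) / child.visits)

def select(node, rng):
    while node.children:  # 1. selection: follow the best UCB score down to a leaf
        scores = [ucb(ch, node.visits) for ch in node.children]
        best = max(scores)
        node = rng.choice([ch for ch, s in zip(node.children, scores) if s == best])
    return node

def expand(node, rng):
    for a in legal_actions(node.stones):  # 2. expansion: one child per legal move
        node.children.append(Node(node.stones - a, parent=node, action=a))
    return rng.choice(node.children) if node.children else node

def rollout(stones, rng):
    result = 1.0  # 3. rollout: +1 if the player who just moved ends up winning
    while stones > 0:
        stones -= rng.choice(legal_actions(stones))
        result = -result  # the other player has now made the latest move
    return result

def backup(node, reward):
    while node is not None:  # 4. backup: negate the reward at every level
        node.visits += 1  # (this sketch also updates the root's visit count)
        node.value_sum += reward
        reward = -reward
        node = node.parent

def uct_search(stones, iterations, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        child = expand(select(root, rng), rng)
        backup(child, rollout(child.stones, rng))
    return max(root.children, key=lambda ch: ch.visits).action

print(uct_search(stones=4, iterations=2000))
```

In this toy game a player facing a multiple of 3 stones loses against good play, so from 4 stones the search should settle on removing 1 stone.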

MCTS is one of the few general game playing algorithms out there. It doesn’t have any game-specific logic hard coded into it. Meaning we could drop the same algorithm into a different game like Chess and it would still work. All it needs is a simulator environment.

However, it also has its flaws. One of these is the need for the simulator environment – MCTS cannot plan ahead without one. This limits its use to situations where we have a simulator.

Future Work:

A relatively simple problem we could solve is that on every turn we build a brand new tree, decide on an action, and then throw the tree away. This means we’re rebuilding a lot of the same parts of the tree over and over again! With this in mind we could re-use the tree rather than starting from scratch each turn.
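One possible shape for that tree re-use is sketched below: after a move is played, the matching child of the old root becomes the new root and keeps its statistics. The `Node` class and the `reuse_subtree` helper are hypothetical and are not part of the notebook.

```python
class Node:
    """Hypothetical minimal node for illustrating tree re-use."""
    def __init__(self, action=None, parent=None):
        self.action = action
        self.parent = parent
        self.children = []
        self.visits = 0

def reuse_subtree(root, action_played):
    """Return the child reached by action_played as the new root, keeping its statistics."""
    for child in root.children:
        if child.action == action_played:
            child.parent = None  # detach so backups stop here and the old tree can be freed
            return child
    return Node()  # the move was never expanded, so start from scratch

# Build a tiny tree by hand: the root has three children, one of them well explored.
root = Node()
root.children = [Node(action=a, parent=root) for a in (0, 1, 2)]
root.children[1].visits = 7

new_root = reuse_subtree(root, action_played=1)
print(new_root.visits)
```

In a two-player game this would be applied twice per turn, once for our own move and once for the opponent's reply.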

Another problem we could address is with the game-theoretic convergence. This means that in immediate sudden-death situations the algorithm must simulate many times in order to converge on the game theoretic win/loss value. We can address this problem by implementing a MCTS-Solver (a topic I’ll cover in future). 🛣️

Thanks for reading! 🙂

Multi-Armed Bandits 2: ε-Greedy and Non-Stationary Problems

Today we’re going to address some of the problems with an ε-first exploration approach for multi-armed bandit problems. In the last post we saw how ε-first can perform very well on stationary problems where the true value q_* of each bandit arm (slot machine in our example) never changes. But in the real world we are often faced with problems where the true value of a choice changes over time. In these situations the ε-first exploration approach will not adapt to the changing environment, and will ignorantly keep selecting the same suboptimal action over and over. As with my previous post, you can follow along and run the code in this colab notebook. 📝

Non-stationary problems:

A non-stationary problem is one where the underlying true value (q_*) of each bandit arm can gradually change over the course of an episode. Using our slot machines analogy, we can imagine that the slot machines start the day with a random average payout value, and this average payout value can slowly increase or decrease throughout the day. We model this by adapting our BanditProblem class to allow the arm values to change gradually. This is done by simulating a ‘random walk’ for each arm: on every step we draw a small random number from a normal distribution for each arm and add it to that arm’s true value. So the true payouts can gradually go up or down for each arm.

class BanditProblem():
  def __init__(self, arms, seed, stationary=True):
    self.stationary = stationary
    np.random.seed(seed)  # seed NumPy so the arm values are reproducible

    self.bandit_arms_values = np.random.normal(loc=0.0, scale=1.0, size=arms)
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

  def draw_from_arm(self, arm_index):
    chose_optimal = 1 if arm_index == self.optimal_action_index else 0
    reward = np.random.normal(loc=self.bandit_arms_values[arm_index], scale=1.0)
    return reward, chose_optimal

  def step_arm_values(self):
    """Step to be called manually in episode loop."""
    q_star_value_shift = np.random.normal(loc=0.0, scale=0.01, size=len(self.bandit_arms_values))
    self.bandit_arms_values += q_star_value_shift
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

This new logic is handled in the step_arm_values function above, which makes small changes to the true arm values and is called after the bandits are done drawing from the arms.

Introducing ε-Greedy:

Our first attempt at tackling the non-stationary bandit problem uses the well-known ε-greedy approach. It is fairly simple and similar to ε-first but with a small difference; instead of exploring randomly for some fixed number of steps, ε-greedy explores randomly some % of the time throughout the entire episode. This means ε-greedy never stops exploring some small portion of the time, determined by the value ε. This could be a small number, like 0.01, meaning that the agent explores randomly approximately 1% of the time. It also means that ε-greedy can start exploiting the best found option right away – there is no long period of exploration time needed before exploitation. The incremental average update rule for the q-value estimate stays exactly the same as it was for ε-first, here it is as a reminder:

Q_{n+1} &=  Q_n + \frac{1}{n}[R_n - Q_n] 

Adding this one simple change – introducing some small chance of exploring randomly every step – is enough to allow the ε-greedy bandit to adapt to non-stationary problems, because it constantly updates its belief about the best choice by some small amount.
So is ε-greedy ‘better’ than ε-first? Well, it depends. On a stationary problem where the values of the slot machines never change, then ε-first is probably better (if you can afford the upfront exploration). On a non-stationary problem, ε-greedy will be better.
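As a quick sanity check of the incremental update rule above, the sketch below applies it to a stream of rewards and compares the result with the ordinary arithmetic mean. The reward stream is synthetic (normally distributed, seeded) and is only there for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(loc=1.0, scale=1.0, size=1000)  # synthetic rewards for one arm

q = 0.0
for n, reward in enumerate(rewards, start=1):
    # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
    q += (reward - q) / n

print(q, rewards.mean())
```

The two numbers agree up to floating point error, so the incremental rule is simply a memory-friendly way of computing the sample average.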

Here’s the code for an ε-greedy bandit:

class EpsilonGreedyAgent():
  def __init__(self, epsilon, bandit_problem, alpha=0.1, update_type="incremental"):
    self.epsilon = epsilon
    self.alpha = alpha
    self.problem = bandit_problem
    self.update_type = update_type

    self.arm_qs = np.zeros(len(bandit_problem.bandit_arms_values))
    self.arm_ns = np.zeros(len(bandit_problem.bandit_arms_values))

  def choose_action(self):
    if np.random.rand() > self.epsilon:
      # greedily pull best arm
      choice = np.argmax(self.arm_qs)
    else:
      # explore: pull any random arm (still a chance to pull best arm too)
      choice = np.random.randint(0, len(self.arm_qs))

    self.arm_ns[choice] += 1

    reward, optimal = self.problem.draw_from_arm(choice)

    if self.update_type == "incremental":
      self.update_estimate_incremental(choice, reward)
    elif self.update_type == "weighted":
      self.update_estimate_weighted(choice, reward)

    return reward, optimal

  def update_estimate_incremental(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (1/self.arm_ns[choice_index])*(reward - self.arm_qs[choice_index])

  def update_estimate_weighted(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (self.alpha*(reward - self.arm_qs[choice_index]))

Once again most of the logic is handled in the choose_action function. Notice how at each step (choice) we draw a random real number between 0.0 and 1.0, and we exploit the best found arm if that random number is bigger than ε (epsilon). But, if the random number is less than ε, then we explore (choose any random arm). Then we take note of the reward/payout received and update our estimate. There are two update types here, but the weighted one can be ignored for now, I’ll explain it later.

Results: ε-First vs ε-Greedy

So how does ε-greedy fare against ε-first on a stationary problem? See the graphs below:

As expected, on our stationary 10-armed bandit problem the ε-first agent fares better than the ε-greedy agent. This is because the values of the bandit arms never change, so once the ε-first bandit has locked on to the best choice, it exploits that choice continually. In contrast, ε-greedy takes a while longer to find the optimal choice, and even when it does find it, it still explores the other sub-optimal options 10% of the time. The only upside to the ε-greedy approach here is that it starts gathering good rewards almost right away, whereas ε-first takes 1000 exploration steps before collecting high rewards.

But what about a non-stationary problem – when the values of the bandit arms change? Can you predict what will happen in this case? Which approach will fare better?

Aha! Now the tables have turned. While ε-first does for a short moment often find a better choice than ε-greedy right after it is done with 1000 exploration steps, this choice quickly becomes stale as the values of the arms change. ε-greedy fares much better overall, and continues to increase its average score over the course of the episode! However, ε-greedy seems to exhibit the same staleness problem (though not as bad) as ε-first: as the episode goes on it chooses the optimal action less and less. Can you think of a reason for this? Hint: think about the q-value update rule shown above (we discussed this in my last post too).

The reason is that the update rule puts n in the denominator of the fraction that behaves as the step size, so this step size gets smaller and smaller as the episode goes on. Eventually, this step size will be so small that the q-value estimates will barely change on each update, and their rank (the order of each bandit arm according to our estimates) will almost never change after many steps. So, eventually, ε-greedy becomes almost exactly like ε-first and gets stuck pulling the same arm when exploiting (but still pulls other arms randomly some % of the time). This explains why ε-greedy slowly makes fewer and fewer optimal choices: the longer the episode goes on, the less its arm value estimates keep up with the changes to each arm’s true value q_*!
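The shrinking step size is easy to see with numbers. In the sketch below (the function name is made up for illustration) an arm's estimate is 0.0 and it receives a surprising reward of 1.0. How far the estimate moves depends only on how many times the arm has already been pulled.

```python
def sample_average_update(q, n, reward):
    """One incremental sample-average update with step size 1/n."""
    return q + (reward - q) / n

for n in (1, 10, 100, 1000):
    new_q = sample_average_update(q=0.0, n=n, reward=1.0)
    print(f"n = {n:>4}: estimate moves from 0.0 to {new_q}")
```

After 1000 pulls, the same surprise shifts the estimate by only a thousandth of what it did on the first pull.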

Recency Weighted Q-Update:

To truly solve the stale choice issue for non-stationary bandit problems we need to be a bit more intelligent with our q-value estimate update rule. We saw previously how ε-greedy with a sample average update rule has some problems: the step size gradually gets smaller and smaller. The fix is simple: keep this step size constant! The reason why this works is subtle.

Theory warning! The following section will be a bit math heavy. But if you find maths a little dense (I can relate) then I also include a plain English description and accompanying code below. Hopefully that helps! 😊

So we want to change our update rule to include a fixed step size \alpha. That’s easy enough: see line (1) below, which is all we need to implement this in code; the rest are just there for understanding. The key to understanding why this works as a recency-weighted update (newer rewards are given more weight) is the recursive way \alpha is applied to the existing estimate Q_n. Realise that Q_n at any time step is the result of previous updates being applied one after another. Lines (2) to (6) demonstrate this below, where we are essentially unrolling Q_n as a sum of updates from all previous time steps.

Q_{n+1} &= Q_n + \alpha[R_n - Q_n] \tag{1} \\
&= \alpha R_n + (1- \alpha)Q_n \tag{2} \\
&= \alpha R_n + (1- \alpha)[\alpha R_{n-1} + (1- \alpha)Q_{n-1}] \tag{3} \\
&= \alpha R_n + (1 - \alpha) \alpha R_{n-1} + (1- \alpha)^2 Q_{n-1} \tag{4} \\
&= \alpha R_n + (1 - \alpha)\alpha R_{n-1} + (1 - \alpha)^2 \alpha R_{n-2} + \tag{5} \\
& \qquad \qquad \qquad \cdots + (1- \alpha)^{n-1}\alpha R_{1} + (1- \alpha)^nQ_1 \tag{6} \\
&= (1-\alpha)^nQ_1 + \sum_{i=1}^{n} (1-\alpha)^{n-i}\alpha R_i \tag{7}

On line (7) we show this idea in a single line, where we express Q_{n+1} as a weighted sum of the initial estimate Q_1 and all previous rewards. Each reward R_i is weighted by \alpha(1-\alpha)^{n-i}, where n-i is the number of steps ago that reward was received. Because 1-\alpha is less than 1, the contribution of old rewards to our current estimate gets exponentially smaller as we keep updating Q.
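If the algebra is hard to trust by eye, it can be checked numerically. The sketch below applies the recursive update from line (1) to a synthetic reward stream and compares it with the closed form on line (7). The values of \alpha and Q_1 and the rewards themselves are arbitrary choices made for the check.

```python
import numpy as np

alpha = 0.1
q1 = 0.5  # arbitrary initial estimate Q_1
rng = np.random.default_rng(1)
rewards = rng.normal(size=50)  # synthetic rewards R_1 .. R_n
n = len(rewards)

# Line (1): apply the constant step size update one reward at a time.
q_recursive = q1
for reward in rewards:
    q_recursive += alpha * (reward - q_recursive)

# Line (7): weight reward R_i by alpha * (1 - alpha)^(n - i).
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
q_closed_form = (1 - alpha) ** n * q1 + np.sum(weights * rewards)

print(q_recursive, q_closed_form)
print(weights[-1] / weights[0])  # how much more the newest reward counts than the oldest
```

The weights on the rewards plus the weight (1-\alpha)^n on Q_1 add up to 1, so this really is a weighted average, just one that is tilted heavily towards recent rewards.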

It’s this attribute of our new recency-weighted average that allows our estimates to stay fresh as the ε-greedy agent continually explores and the true values of the bandit arms change over time. This update rule reacts much faster to changes in the values of the bandit arms and can do so for as long as the episode continues. Although, if it experiences an unlucky streak of exploration it may be temporarily misled into believing that the optimal action is not actually the best. Even so, it will usually fix this mistake quite quickly when the unlucky streak breaks.
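The difference between the two update rules shows up clearly on a deliberately simple, noise-free example. This is not one of the experiments from the post: the reward sequence is invented, and `track` is a hypothetical helper. An arm pays 0.0 for 500 pulls and then switches to paying 1.0 for another 500.

```python
def track(rewards, alpha=None):
    """Return the final estimate using step size alpha, or 1/n when alpha is None."""
    q, n = 0.0, 0
    for reward in rewards:
        n += 1
        step = alpha if alpha is not None else 1.0 / n
        q += step * (reward - q)
    return q

rewards = [0.0] * 500 + [1.0] * 500  # the arm's payout jumps halfway through

sample_average = track(rewards)               # remembers every reward equally
recency_weighted = track(rewards, alpha=0.1)  # mostly remembers recent rewards

print(f"sample average:   {sample_average:.3f}")
print(f"recency weighted: {recency_weighted:.3f}")
```

The sample average is stuck halfway between the old and new payouts, while the recency-weighted estimate has essentially caught up with the new value of 1.0.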

You can find the code for this update rule in the update_estimate_weighted function in the code snippet above.

Results: ε-First vs ε-Greedy vs Recency-Weighted ε-Greedy

Naturally, we wouldn’t expect the recency-weighted update ε-greedy to be any better when it comes to stationary problems, and this is true based on our results in the graphs below.

It’s clear that recency-weighted averages hurt performance in the stationary setting. Whereas ε-greedy eventually approaches choosing the best action 90% of the time (the remaining steps are exploration), the recency-weighted ε-greedy chooses the optimal action 80% of the time and does not seem to be improving. In this situation the recency-weighted ε-greedy is limited by the \alpha value, which determines the ‘forgetfulness’ of the update rule, so this could be tuned a bit to improve performance for the stationary problem setting. But where recency-weighted ε-greedy really shines is in the non-stationary setting:

Much better! Adding the recency-weighted update rule allows the ε-greedy agent to outperform both prior approaches on non-stationary problems. The recency-weighted update gradually ‘forgets’ about old updates so that it can quickly switch to the new optimal choice as the episode progresses and the values of the bandit arms gradually change.

Discussion and future work:

So that’s it, right? We’ve solved the exploration/exploitation problem? Uhm… actually no. In the last two posts we’ve seen some really simple (and quite effective) methods to balance exploration/exploitation, but there’s one more method I want to cover. This method addresses a problem with ε-greedy: when it explores it does so totally at random, but couldn’t there be a way to focus exploration on choices that seem most promising? This is the topic of our next post when I’ll cover Upper Confidence Bound (UCB) action selection, and also some neat tricks to make ε-greedy exploration more effective too! 🤖
