## Monte Carlo Tree Search (Part 2): A Complete Explanation with Code

In the last post we discussed the problem of acting optimally in an episodic environment by estimating the value of a state. Monte Carlo Tree Search (MCTS) naturally fits the problem by incorporating intelligent exploration into decision-time multi-step planning. Give that post a read if you haven’t checked it out yet, but it isn’t necessary to understand today’s post. It might also be beneficial to get some intuition on UCB action selection.

Today we’re going to dig into ‘vanilla’ MCTS, the same algorithm used in state of the art game-playing agents like AlphaZero. When we’re done you’ll understand how and why this works so well for general game playing (especially in board games like Chess and Go).

Accompanying code snippets are included in the post to help explain the ideas. You can also follow along and run the code in this colab notebook. A python implementation of Connect4 is included in the notebook for you can play against the finished MCTS algorithm ๐ ๐ด๐ต๐ด๐ด

## The Four Steps in MCTS:

MCTS builds up estimates for the values of each possible state by planning ahead. In order to do this it must have a perfect model of the environment. In other words it must know exactly where it ends up after taking each action – without actually having to take that action. This is a downside to MCTS, since it can’t work without this model.

Anyway, assuming we have this perfect model we can simulate taking actions and choose the best ones based on the outcomes of simulated games. Doing so repeatedly we can build up a game tree of actions and states the algorithm has explored ๐ณ. Here are the four steps MCTS repeats to do this:

1. Selection
2. Expansion
3. Rollouts/Simulation
4. Backup of results

## 1. Selection:

MCTS starts by selecting the best node in the tree so far according to the UCB formula. This formula estimates the value (and uncertainty in that value) of each state in the tree. This form of MCTS is actually called “upper-confidence bound applied to trees” (UCT). Furthermore, UCT is just an extension of UCB action selection from multi-armed bandit problems, but applied to multi-step decision making instead. Here’s the UCB formula adapted to tree search (with explanation below):

\begin{aligned}
Score_i &= \bar{q}_i + U_i \\
&= \bar{q_i}+C \sqrt{\frac{2 \ln N}{n_i}}
\end{aligned}

To demonstrate; think about this from the perspective of a parent node with several child nodes to consider. The action selection score of each child node is made up of the current estimate and uncertainty. The value estimate of the i-th child node is the current mean value of that node (\bar{q_i}). The uncertainty part has a constant C, that scales the uncertainty (bigger means more exploration, smaller means less exploration). We’ll leave this C=1. In the square root, N is the number of visits to the parent, and n_i are the number of times the child node i was chosen when passing through the parent N.

As an example, check the diagram below:

If we are in the parent node deciding which child node to explore, we apply the above UCB formula to each of the children and choose the one with the highest score. For instance evaluating child 1 and child 2:

\begin{aligned}
Score_1 &= \frac{5}{10} + \sqrt{\frac{{2\ln20}}{10}} = 1.274 \\
Score_2 &= \frac{3}{5} + \sqrt{\frac{{2\ln20}}{5}} = 1.694 \\
\end{aligned}

According to this score the action selection will decide to further explore child 2. This is because we have explored Child 2 only 5 times, but exploring that child won 3 out of the 5 simulated games! This is a high win rate. We are also less certain about this result because the number of times explored is low (5). So the UCB action selection correctly pushes the algorithm to further explore this promising node ๐งญ.

Important side notes !!!

1. When selecting from child nodes: if two children have an equal score, they must be selected from randomly. Any other tie-breaking method is not guaranteed to converge to an optimal policy.
2. Additionally if a child node has not been explored at all (no visits) we automatically set it’s Q+U value to be +infinity (to guarantee exploring any unexplored children when visiting a parent node).
3. Actually just about any bandit action selection algorithm could work in place of UCB. \epsilon-greedy for example.

Here’s the first part of the code in python, wrapped in a TurnBasedUCTNode class. We start in tree_policy():

class TurnBasedUCTNode():
def __init__(self,
env,
player_1_state,
player_2_state,
player_turn,
action=None,
reward=None,
parent=None,
id_in_parent=None):

self.env = env
self.player_1_state = player_1_state # 2d np array
self.player_2_state = player_2_state # 2d np array
self.player_turn = player_turn

self.action = action # which action was chosen previously Q(s,a)
self.parent = parent # parent node

# self.action != id_in_parent e.g. action may be 6, but id_in_parent may be 0
self.id_in_parent = id_in_parent # index of this child in parent's list of children

self.is_expanded = False
self.children = [] # list of child nodes
self.child_visit_count = None  # need to expand before we know size of this list
self.child_q_sum = None  # need to expand before we know size of this list
self.reward = reward

def child_q_estimates(self):
return self.child_q_sum/(self.child_visit_count+1)

def child_ucb1_estimates(self):
# handle case where we are at root with no parent
if self.parent is None:
my_visits = np.sum(self.child_visit_count)
else:
my_visits = self.number_visits

U = np.sqrt(2.0*np.log(my_visits)/(self.child_visit_count+1))
return U * EXPLORATION_CONSTANT

def select_best_child(self, max_value=1e6):
# Get Q + U for each child, return max, break ties randomly

if not self.is_expanded:
return self

q_u = self.child_q_estimates() + self.child_ucb1_estimates()
q_u[self.child_visit_count==0.0] = max_value

max_choices = np.flatnonzero(q_u == np.max(q_u))

if len(max_choices) == 1:
return max_choices[0]

random_choice = rando._randbelow(len(max_choices))
best_child_index = max_choices[random_choice]

return best_child_index

def tree_policy(self):
current = self

node_visits = 0
while current.is_expanded:
node_visits += 1
best_child_index = current.select_best_child()
current = current.children[best_child_index]

return current, node_visits # a not expanded leaf node

## 2. Expansion:

This part is pretty easy to understand. Once selection reaches a node that has no children, we need to create those children. So we are expanding the tree. We do this by asking our simulated environment (perfect world model) which actions we can take given the state we are in. Then we take each of these actions in turn in our simulated environment and store the resulting states as new child nodes. Once this is done, we select one of these at random for the rollout/simulation stage.

The code for the expansion step is below, it is a method of the TurnBasedUCTNode class:

def expand(self):

if self.reward is not None:
return self

possible_actions = self.env.get_legal_actions(self.player_1_state,
self.player_2_state)

# perform action filtering in env.legal_actions to get legal actions
action_num = len(possible_actions)

if action_num == 0:
return self

next_player_turn = -1 if self.player_turn == 1 else 1 # flip player turn

self.child_visit_count = np.zeros(action_num, dtype=np.uint32)
self.child_q_sum = np.zeros(action_num, dtype=np.int32)

# loop thru legal actions and simulate stepping each one
i = 0
for action in possible_actions:
p1_state, p2_state, reward = self.env.step(action,
self.player_turn,
self.player_1_state.copy(),
self.player_2_state.copy())
child = TurnBasedUCTNode(self.env,
p1_state,
p2_state,
player_turn=next_player_turn,
action=action,
reward=reward,
parent=self,
id_in_parent=i)

self.children.append(child)
i+=1 # increment the index used for id_in_parent

self.is_expanded = True

# return a random child node for rollouts
random_child = self.children[rando._randbelow(len(self.children))]

return random_child

## 3. Rollout / Simulation:

This part is pretty fun, and is at the heart of why MCTS works. It took me a while to get my head around this when I was first learning about MCTS. So I’ll spend some extra time on this section to map out the concepts and intuition.

You may have been wondering where the \bar{q_i} part came from in the UCB formula discussed earlier. It’s made up of two parts: the number of visits, and the sum of rewards. The number of visits makes sense, but where does the sum of rewards come from? It comes from the Monte Carlo rollouts ๐งป.

In this stage the algorithm simulates a full game (all the way to the end) by randomly selecting actions for each player until the simulated game ends. Once that happens, the environment tells us if the game ended in a win/loss/draw, which is the reward.

To clarify during rollouts we don’t need to create new nodes and copy state variables. Therefore the rollout stage is fast.

This still doesn’t really explain why rollouts work though. How does the result of a random game help us decide which actions to take?

A great way to think about this is as an extension of multi-armed bandit problems. In M.A.B.s we usually sample an action several times to build an estimate of its value. The same thinking applies here too. Hence, the more times we expand, reach a leaf node, and randomly rollout, the better the estimate of the value of each node/action.

To make this clearer: imagine if we sorted all of the possible end-game states from left to right, in order of lose, draw, win (see the diagram below). So a winning rollout will always land to the right, and a losing one to the left, and a draw is somewhere in the middle. In this visualisation we can imagine that good actions move us to the right, whereas bad actions move us to the left.

Consider that for any node we are building the probability density function of the rewards from the simulated games (-1, 0, +1). Therefore the q-estimate of a node is the mean of this p.d.f generated from rollouts that passed through this node. This allows us to determine if an action is good or bad! ๐

Code for the rollout method is shared below to neatly summarise:

def rollout(self):
max_t = (self.env.board_height*self.env.board_width)
# If this node is terminal state, backup reward immediately
if self.reward is not None:
self.backup(self.reward)

reward = None

temp_p1_state = self.player_1_state.copy()
temp_p2_state = self.player_2_state.copy()
rollout_turn = self.player_turn

while i < max_t:
# get legal actions from the environment
legal_actions = self.env.get_legal_actions(temp_p1_state, temp_p2_state)

# In this case, must be a draw.
if not legal_actions:
reward = 0
break

# choose random actions during rollouts
action = legal_actions[rando._randbelow(len(legal_actions))]

# don't need to copy states during rollout steps, just act on same state
temp_p1_state, temp_p2_state, reward = self.env.step(action,
rollout_turn,
temp_p2_state,
temp_p1_state)

# reward signals end of game
if reward is not None:
# reward is -1 if the player turn is not same as rollout turn
# in other words: this action led to eventual loss
if rollout_turn != self.player_turn:
reward = reward * -1
break

# flip player_turn on each loop
rollout_turn = -1 if rollout_turn == 1 else 1

self.backup(reward) # backup the reward

## 4. Backup:

Once a rollout is complete we have to traverse the tree backwards and update the \bar{q_i} estimate of each node. This is done by incrementing its visit count and adding the reward from the rollout to its reward sum. Again, we need to backup so that each node is aware of the outcomes of games that ‘passed’ through it.

We first update the values in the leaf node, then move to the parent of that node and update those values, and so on. Until we reach the root node. At this point there is no parent, and we don’t need to update the values of the root node anyway.

Code for the backup method is provided below. Beware: we backup in a NegaMax fashion. Meaning we negate the value of the reward on each visit to subsequent parent nodes. Why? Because Connect4 is a two-player turn based game, so every second node is an opponent move. A win for us is a loss for the opponent. During the selection phase this allows MCTS to simulate the opponent picking the ‘best’ moves from their perspective during their turns! ๐ง

 def backup(self, reward):
current = self

if reward is None:
reward = 0.0

# in case we reached root
while current.parent is not None:
current.number_visits += 1
current.total_value += reward
current = current.parent
reward = reward * -1 # Ensure correct sign when backing up rewards

## Playing against MCTS in Connect 4:

Now it’s time to play against the AI we have created ๐ You can do so by running the colab notebook, and interacting with the game at the very bottom of the notebook. ๐ด๐ต๐ด๐ด

2. At the top, click “Runtime”.
3. Press “Run All”
4. Scroll to the bottom and enter your moves in the text box (you are circles, see image below).

Adjusting the difficulty: The MCTS agent will take 5 seconds per turn. You can adjust the difficulty by giving the agent more or less time. Do this by passing adjusting the time_limit parameter in the uct_search function call.

for i in range(0, max_turns):

# p1: looks ahead using MCTS
root_node = TurnBasedUCTNode(env, p1_state, p2_state, player_turn=1)
action = uct_search(root_node, time_limit=7.0)
print(f"Chosen action: {action}")

## Summary:

We’ve seen how the MCTS algorithm works step by step. In summary: first the algorithm repeatedly selects nodes using the UCB action selection formula until it reaches a leaf node. It then expands a leaf node to create new child leaf nodes. Next, randomly selecting a new child leaf, the algorithm simulates a random game all the way to the end. The result or reward from that simulated game is then backed up the tree by revisiting each parent node. Furthermore, MCTS goes through this process repeatedly until some terminal condition is reached (usually a time limit). ๐

MCTS is one of the few general game playing algorithms out there. It doesn’t have any game-specific logic hard coded into it. Meaning we could drop the same algorithm into a different game like Chess and it would still work. All it needs is a simulator environment.

However, it also has its flaws. One of these is the need for the simulator environment – MCTS cannot plan ahead without one. This limits its use to situations where we have a simulator.

## Future Work:

A relatively simple problem we could solve is that on every turn we build a brand new tree, decide on an action, and then throw the tree away. This means we’re rebuilding a lot of the same parts of the tree over and over again! With this in mind we could re-use the tree rather than starting from scratch each turn.

Another problem we could address is with the game-theoretic convergence. This means that in immediate sudden-death situations the algorithm must simulate many times in order to converge on the game theoretic win/loss value. We can address this problem by implementing a MCTS-Solver (a topic I’ll cover in future). ๐ฃ๏ธ

## Monte Carlo Tree Search (Part 1): Introduction to MDPs

Following on from the idea of learning to make an optimal single decision, we can expand this to making multiple sequential decisions in an optimal way. To do this we’ll be exploring Monte Carlo Tree Search (MCTS); an algorithm that combines ideas from traditional tree search algorithms, and reinforcement learning (RL).

Today we’re going to focus on building intuition. In later posts we’ll discuss and implement vanilla MCTS, then we’ll explore some exciting improvements and variations ๐ In the end you’ll have implemented a superhuman AI that can beat you and everyone you know in a game.

## Markov Decision Processes (MDPs):

Whereas in previous posts on multi-armed bandits we were concerned with finding the single best choice, now we will be trying to find the best sequence of choices of actions to reach some goal or maximise some value. An example problem might be: what’s the best sequence of actions a Pac-Man ๐ก can take to get the highest score and win the game?

To do this, Pac-Man must plan ahead to decide which actions lead to food, and collect as much of that as possible, while also avoiding the ghosts. We can formalise this in a nice way as a Markov Decision Process (MPD). In an MDP, we call the situation Pac-Man finds himself in at any time a state. From a state Pac-Man can take actions, which might grant Pac-Man some reward, and then the action leads to new states.

In the above diagram, Pac-Man is in state s_t, where the t denotes the current time. Pac-Man has several choices of action, and chooses to go down (that is the action a_t at time t). As a result, Pac-Man gets a reward of +20 (that is the reward r_t at time t), and moves to state s_{t+1} at the next timestep. So Pac-Man’s goal is to maximise the reward from eating food, by planning a sequence of actions we call a policy (\pi).

## The Value of a State:

Since it will take Pac-Man many time-steps to achieve the goal, we can formulate this as the expected sum of future rewards โ:

G_t =  \sum_{\tau=t+1}^{T}R_\tau \space

All this says is that the return at any given timestep is the cumulative reward we will receive by the end of the game (the terminal timestep T). The policy \pi dictates which actions Pac-Man will take in each state, and so this also dictates how much reward we expect Pac-Man to receive by the end of the game. This allows us to determine the value of of each state when following a given policy:

\begin{aligned}
v_\pi(s) &= \mathop{\mathbb{E}}_\pi \left[ \sum_{\tau = t+1}^{T} R_\tau \space | \space S_t=s\right] \\
&= \mathop{\mathbb{E}}_\pi \left[ G_t \space | \space S_t=s\right]
\end{aligned}

This is an expectation because even though Pac-Man can deterministically choose which actions he takes and where he ends up, there are other parts of the environment he cannot reliably predict. For example, Pac-Man can’t predict where each ghost will move, so he has to plan for all outcomes of his actions with some probability. It might be the case that Pac-Man chooses to move up to get some food (receiving a reward), but a nearby ghost also chooses to move to the same space, killing Pac-Man (no reward, and maybe game over if Pac-Man is out of lives). Pac-Man needs to account for these possibilities when he is deciding where to move ๐ค.

To expand that example a bit more: if Pac-Man doesn’t know how the ghosts move, he must treat them as totally random. In this case Pac-Man would assume the ghosts could move in any of the four directions with equal probability (0.25). If Pac-Man chooses to move up, there is a 0.75 probability that he gets the +20 reward, but a 0.25 probability that he gets 0 (walks into the ghost and loses a life). The expected immediate value of moving up to that piece of food is therefore 0.75 \times 20 = 15. If Pac-Man has another piece of food to his left, and there is no ghost in the immediate neighbourhood of that piece of food, then the expected value of moving left is the full 20. Pac-Man should move left.

## One-step lookahead is not enough:

There are two problems that we have to solve here.

1. To know the expected value of the state, we had to simulate taking the action and calculate the probabilities of each outcome that could occur.
2. Once we know the expected value of each state we can move to, we must compare them to determine the best action to take. Doing this repeatedly from each state provides us with our new policy \pi.

There is a subtle issue you may have noticed with problem 1, and in my example above: moving left may seem like the best choice if we only simulate one turn ahead, but it may also lead Pac-Man to a dead end where the ghost backs him into a corner and kills him ๐ฑ. We can only know this is the case if we simulate not one step ahead, but many. So the true value of moving to a state does not just come from the immediate reward we receive multiplied by a single probability. It instead comes from simulating all the possibilities many many steps ahead (preferably up to the end of the game) and figuring out the expected value from all of those outcomes recursively ๐:

\begin{aligned}
v_\pi(s) &= \mathop{\mathbb{E}}_\pi \left[ G_t | S_t = s \right] \\
&= \mathop{\mathbb{E}}_\pi \left[ R_{t+1} + G_{t+1} | S_t =s\right] \\
&= \sum_a \pi(a|s)\sum_{s'}\sum_{r}p(s', r | s, a) \left[ r +\mathop{\mathbb{E}}_\pi \left[ G_{t+1} | S_{t+1}=s' \right] \right] \\
&= \sum_a \pi(a|s) \sum_{s' r} p(s', r | s, a)[r+v_\pi(s')]

\end{aligned}

Okay, I admit that looks terrifying. But all this is showing is how the expectation breaks down (how we calculate it); summing up all the rewards multiplied by the corresponding probability of that action/outcome occurring. You can see this by looking at the last line: it is the same thing we walked through in our example above, when multiplying the probabilities of each outcome with the reward for Pac-Man moving up (maybe getting killed by the ghost) or moving left (guaranteed reward).

Notice also on the last line, that v_\pi(s') is included. What does this mean? It means that we apply the same calculations again for each possible action and outcome in the next state (s’). It is a recursive function. So these calculations will happen for every state, and then every subsequent state produced by taking the possible actions, and so on and so on. We could almost think of this as expanding an enormous tree, where every branch is an action leading to a new state… ๐ฒ

## Monte Carlo Tree Search:

This is where MCTS comes in! It’s a search algorithm which balances exploration and exploitation by combining familiar concepts from multi-armed bandits (like UCB action selection), as well as tree search (best-first search) and Monte Carlo prediction (rollouts and back-propagation).

The balancing part is important โ๏ธ: it allows the algorithm to explore a state space efficiently by focussing the search on only the most promising areas. This allows the algorithm to explore deeper in the state space, calculate good value estimates for states, and construct intelligent plans. It does not need to explore the state space exhaustively to do this effectively.

Another benefit of the algorithm is that while it is technically a heuristic search algorithm, the heuristic (UCB) is entirely problem independent; MCTS can be deployed out-of-the-box and work well in many settings without the need for hand-crafted guidelines on how to play.

We’ll dig into the details and implement MCTS in my next posts ๐

## Multi-Armed Bandits 3: UCB and some exploration tricks

In this post we’ll walk through some neat tricks to make \epsilon-greedy more effective, and then we’ll dig into a smarter way to handle exploration: upper confidence bound action selection. We’ll be building on what we learned in my last post, and as always the code can be found in this colab notebook so you can follow along and try it out yourself ๐

## Optimistic initial values:

One problem with \epsilon-greedy is that randomness isn’t a very intelligent way to conduct exploration, especially at the start of an episode when you don’t know anything about your options. Using our slot machines analogy again: due to the randomness, \epsilon-greedy might not actually try all of the machines at least once for quite some time. Instead it might explore a few of the arms to begin with and then spend most of the time early on exploiting a sub-optimal arm.

A really simple and clever way we can make this early exploration more systematic is to give the \epsilon-greedy agent optimistic initial estimates for each arm’s rewards. In practice this means setting the initial q-values quite high (not zero). This exploits the greediness of \epsilon-greedy. The agent selects each arm one by one, expecting a high reward but instead getting a relatively low one. It then revises each estimate down each timestep until the estimate starts to converge on the true value. Overall, the agent is encouraged to explore much more effectively and the optimistic starting estimates gradually get reduced down to something more realistic.

You can see the results after applying this trick below:

The optimistic \epsilon-greedy agent converges much faster than regular \epsilon-greedy. This can make a big difference in problem settings when the number of steps per episode is smaller. Over longer timesteps the advantage mostly disappears as the impact of the early exploration reduces over time. It’s worth keeping in mind that this approach won’t help much in a non-stationary setting.

## Unbiased constant step-size

There is a very subtle issue that we should address with the constant step size update rules (using a constant \alpha for recency-weighting). The issue is that this update rule is biased by the inclusion of the initial estimate. Recall that the recency-weighted update rule for Q-values is essentially a weighted sum of all past rewards, plus a weighting of the initial Q value estimate (see my last post for a deeper analysis):

\begin{aligned}
Q_{n+t} &= Q_n + \alpha[R_n - Q_n] \\
&= (1-\alpha)^nQ_1+ \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i
\end{aligned}

This means our initial Q-value estimate permanently alters the subsequent Q-value estimate updates. The good news is that while the impact is permanent, it gradually reduces over more time until it is virtually non-existent. However, we’d like to remove this bias if possible, to make our agent’s early exploration more effective. To do that we need to alter our update rule formulation a little bit. The change involves altering the step size from just \alpha to be:

\begin{aligned}
\beta_n \doteq \alpha / \bar\omicron_n
\end{aligned}

This \omicron_n is the real interesting part, which is defined as follows:

\begin{aligned}
\bar\omicron_n \doteq \bar\omicron_{n-1} + \alpha(1- \bar\omicron_{n-1}), \space \text{for} \space n \ge 0, \space \text{with} \space \bar\omicron_{0} = 0
\end{aligned}

Okay, so that maybe looks a bit confusing. In plain English this just means we need to keep track of a separate \bar\omicron_{n} for each arm/action, and we update it using the above rule each time we pull its assigned arm. In code this just means a slightly modified update function that looks like:

  def update_estimate_weighted_unbiased(self, choice_index, reward):

beta = self.alpha/self.omicrons[choice_index] if self.omicrons[choice_index] > 0 else self.alpha
self.arm_qs[choice_index] = self.arm_qs[choice_index] + (beta*(reward - self.arm_qs[choice_index]))

# update omicron for this action
self.omicrons[choice_index] = self.omicrons[choice_index] + self.alpha*(1-self.omicrons[choice_index])

Simple, right? ๐ To see why this is an unbiased recency-weighted average we’ll have to do a bit more algebra. I’ll explain it as we go, but if you’re not interested then feel free to skip to the results.

First we’ll start with our original recency weighted average formula, but we’ll swap \beta instead of \alpha and rework things a bit to get only one Q_n on the right hand side:

\begin{aligned}
Q_{n+1} &= Q_n +\beta_n(R_n - Q_n) \\
&= Q_n + \beta_nR_n - \beta_nQ_n\\
&= \beta_n R_n + (1- \beta_n)Q_n
\end{aligned}

The second line is just an expansion of the brackets in the first line, and the third line factors the two terms including Q_n together. We’re mostly interested in Q_2 (because Q_1 is the biased initial estimate) to understand how the bias is eliminated:

\begin{aligned}
Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\
\end{aligned}

But to figure this out we first need to figure out what \beta_1 is (this is the important part):

\begin{aligned}
\beta_1 &= \frac{\alpha}{\bar\omicron_1} \\
&= \frac{\alpha}{\bar\omicron_0 + \alpha(1-\bar\omicron)} \\
&= \frac{\alpha}{0 + \alpha(1- 0)} \\
&= \frac{\alpha}{\alpha} \\
&= 1
\end{aligned}

In the first line we use the definition of \Beta we saw earlier. Then the second line follows from the definition of \bar\omicron we saw earlier too (it is calculated from the previous omicron and alpha). Finally, the third line follows from that same definition where \bar\omicron_0 is a special edge case which = 0. Now we can plug this back into the equation for Q_2 to see how the bias disappears:

\begin{aligned}
Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\
&= 1 \cdot R_1 + (1 - 1)Q_1 \\
&= R_1 + Q_1 - Q_1 \\
&= R_1
\end{aligned}

Phew, and there you have it! ๐ There is no bias because the initial estimate Q1 is eliminated when we calculate Q_2! Since all the subsequent Q estimates are based on earlier Q estimates we have eliminated the bias permanently. This might seem like a lot of work, but in practice it’s just a couple of extra lines of code.

Now let’s take a look at how the unbiased version of recency-weighted e-greedy performs:

The difference is hard to spot on these plots, but look at the plot on the right and you can see that the unbiased e-greedy is slightly faster at converging early on. This difference becomes negligible over time, but it’s a nice little performance boost for early steps in non stationary problems!

## Upper confidence bound (UCB) action selection:

So far we’ve still been considering \epsilon-greedy based agents, and we haven’t really tackled the problem that exploring randomly is not a smart way to approach exploration. Now we’re going to change that and discuss UCB – upper confidence bound action selection. This method selects actions by calculating confidence in the q-value estimate of each arm. The diagram below is a good visualisation to help explain:

What this shows is that the confidence in the estimate of Arm 1 is quite high (blue) and the confidence for the estimate of Arm 2 is low (orange). UCB will use this confidence estimate and calculate what the upper-confidence bound is (approximately 95% confidence level). In the plot the 95% confidence level for Arm 1 is shown by the green line, and the 95% confidence level of Arm 2 is shown by the red line. When choosing an action, UCB selects the action with the highest 95% upper-confidence bound, which in this case would be Arm 2.

This choice of action is controlled by the following formula:

\begin{aligned}
A_t \doteq \argmax_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \space \right]
\end{aligned}

t is the current timestep and \ln t is the natural logarithm of that value. N_t(a) is the number of times the action a has been selected before time t. c is a confidence level parameter that we can select (usually set to 2, which is 95% confidence). In code this formula completely replaces the action selection logic in choose_action() compared to \epsilon-greedy:

def choose_action(self):
action = np.argmax(self.ucb_calc())
self.arm_ns[action] += 1

reward, optimal = self.problem.draw_from_arm(action)

self.update_estimate_incremental(action, reward)

return reward, optimal

def ucb_calc(self):
t = np.sum(self.arm_ns)
arm_ucbs = np.zeros(len(self.arm_qs))

for i in range(0, len(self.arm_qs)):

if self.arm_ns[i] == 0:
# If we have not explored this arm before, we consider it maximising
arm_ucbs[i] = np.Inf
continue

arm_variance = self.arm_qs[i] + self.confidence_level * np.sqrt((np.log(t)/self.arm_ns[i]))
arm_ucbs[i] = arm_variance

return arm_ucbs

For now we still update the q-value estimates using the incremental average update rule. Here are the results on a stationary problem:

The UCB algorithm outperforms both \epsilon-greedy and fixed exploration greedy (\epsilon-first) on stationary problems! ๐ We can see that UCB finds the optimal action faster and more often than either of the other two previous best methods. But what about non-stationary problems?

Hmm, that’s disappointing ๐ค But this makes sense since we are using the incremental average update rule, which does not update estimates well in non-stationary environments! The problem is that UCB is tricky to adapt to the non-stationary setting, and this is an ongoing area of research (often with state-of-the-art results). This is mostly due to the confidence estimate requiring both the Q-value estimates and the pull counts too. However, there is one simple implementation we can use which follows a similar principle to the recency-weighted average update called discounted-UCB (D-UCB):

  def update_estimate_discounted(self, choice_index, reward):
# Discount all reward estimates and pull counts (i.e. gradually forget old steps)
self.arm_qs *= self.gamma
self.arm_ns *= self.gamma

self.arm_qs[choice_index] += reward
self.arm_ns[choice_index] += 1

We just discount all the values for each arm q-estimate and pull count alike, like a recency-weighting for both the q-value estimate and the confidence estimate. Results using this method are much better and also beat the previous best approaches on non-stationary problems:

## Discussion and future work:

That’s it! We’re finally done with multi-armed bandits. ๐

We have covered a range of algorithms that make decisions and explore under uncertainty, in stationary or non-stationary settings. Most of these techniques will be applicable in more advanced reinforcement learning settings, but we’ll those cover in future posts. There are definitely more advanced multi-armed bandit methods out there, like Thompson sampling or klUCB, however I won’t write about those yet (and I’m not sure if/when I plan to). Hopefully you now have a good and thorough intuition about how and why all of these Multi-Armed Bandit methods work!

Scroll to Top