Multi-Armed Bandits 3: UCB and some exploration tricks

In this post we’ll walk through some neat tricks to make \epsilon-greedy more effective, and then we’ll dig into a smarter way to handle exploration: upper confidence bound action selection. We’ll be building on what we learned in my last post, and as always the code can be found in this colab notebook so you can follow along and try it out yourself 🙂

Optimistic initial values:

One problem with \epsilon-greedy is that random exploration isn’t a very intelligent way to explore, especially at the start of an episode when you don’t know anything about your options. Using our slot machine analogy again: because exploration is random, \epsilon-greedy might not actually try all of the machines at least once for quite some time. Instead it might explore only a few of the arms to begin with and then spend most of the early steps exploiting a sub-optimal arm.

A really simple and clever way we can make this early exploration more systematic is to give the \epsilon-greedy agent optimistic initial estimates of each arm’s reward. In practice this means setting the initial q-values quite high rather than at zero. This exploits the greediness of \epsilon-greedy: the agent tries each arm in turn, expecting a high reward but receiving a comparatively low one, and so revises that arm’s estimate downwards on each timestep until it starts to converge on the true value. Overall, the agent is pushed to explore every arm early on, and the optimistic starting estimates are gradually pulled down to something more realistic.
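In code this only touches the agent’s constructor. Here’s a minimal sketch based on the EpsilonGreedyAgent from the last post; the initial_q parameter and the value of 5.0 are just illustrative choices, not necessarily what the notebook uses:

  def __init__(self, epsilon, bandit_problem, alpha=0.1, update_type="incremental", initial_q=5.0):
    self.epsilon = epsilon
    self.alpha = alpha
    self.problem = bandit_problem
    self.update_type = update_type

    # start every arm's estimate optimistically high instead of at zero
    self.arm_qs = np.full(len(bandit_problem.bandit_arms_values), initial_q)
    self.arm_ns = np.zeros(len(bandit_problem.bandit_arms_values))

Any value comfortably above the plausible reward range works; Sutton and Barto use +5 on their 10-armed testbed, where rewards are centred around 0.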

You can see the results after applying this trick below:

Optimistic initialisation improves early exploration

The optimistic \epsilon-greedy agent converges much faster than regular \epsilon-greedy. This can make a big difference in problem settings where the number of steps per episode is small. Over longer episodes the advantage mostly disappears, as the impact of the early exploration fades. It’s also worth keeping in mind that this approach won’t help much in a non-stationary setting, since its benefit is confined to the start of the episode.

Unbiased constant step-size:

There is a very subtle issue that we should address with the constant step size update rules (using a constant \alpha for recency-weighting). The issue is that this update rule is biased by the inclusion of the initial estimate. Recall that the recency-weighted update rule for Q-values is essentially a weighted sum of all past rewards, plus a weighting of the initial Q value estimate (see my last post for a deeper analysis):

\begin{aligned}
Q_{n+1} &= Q_n + \alpha[R_n - Q_n] \\
&= (1-\alpha)^nQ_1+ \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i
\end{aligned}

This means our initial Q-value estimate permanently influences every subsequent Q-value estimate. The good news is that while the influence never fully disappears, it shrinks over time until it is virtually non-existent. Still, we’d like to remove this bias entirely if possible, to make our agent’s early estimates (and hence its early exploration) more accurate. To do that we need to alter the update rule a little: instead of using the constant step size \alpha directly, we use

\begin{aligned}
\beta_n \doteq \alpha / \bar\omicron_n 
\end{aligned}

This \bar\omicron_n is the really interesting part, and it is defined as follows:

\begin{aligned}
\bar\omicron_n \doteq \bar\omicron_{n-1} + \alpha(1- \bar\omicron_{n-1}), \space \text{for} \space n \ge 1, \space \text{with} \space \bar\omicron_{0} = 0
\end{aligned}

Okay, so that maybe looks a bit confusing. In plain English this just means we need to keep track of a separate \bar\omicron_{n} for each arm/action, and we update it using the above rule each time we pull its assigned arm. In code this just means a slightly modified update function that looks like:

  def update_estimate_weighted_unbiased(self, choice_index, reward):

    # self.omicrons is an array of zeros (one per arm) set up in __init__.
    # Update omicron for this action first, so the very first step size is
    # beta_1 = alpha/alpha = 1 and the initial estimate is completely overwritten.
    self.omicrons[choice_index] = self.omicrons[choice_index] + self.alpha*(1 - self.omicrons[choice_index])

    beta = self.alpha/self.omicrons[choice_index]
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (beta*(reward - self.arm_qs[choice_index]))

Simple, right? 😊 To see why this is an unbiased recency-weighted average we’ll have to do a bit more algebra. I’ll explain it as we go, but if you’re not interested then feel free to skip to the results.

First we’ll start with our original recency-weighted average formula, but with \beta_n swapped in for \alpha, and rework things a bit so there is only one Q_n on the right hand side:

\begin{aligned}
Q_{n+1} &= Q_n +\beta_n(R_n - Q_n) \\
&= Q_n + \beta_nR_n - \beta_nQ_n\\
&= \beta_n R_n + (1- \beta_n)Q_n
\end{aligned}

The second line just expands the brackets from the first line, and the third line collects the two terms containing Q_n together. We’re mostly interested in Q_2 (because Q_1 is the biased initial estimate), as it shows how the bias is eliminated:

\begin{aligned}
Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\
\end{aligned}

But to work this out we first need to know what \beta_1 is (this is the important part):

\begin{aligned}
\beta_1 &= \frac{\alpha}{\bar\omicron_1} \\
&= \frac{\alpha}{\bar\omicron_0 + \alpha(1-\bar\omicron_0)} \\
&= \frac{\alpha}{0 + \alpha(1- 0)} \\
&= \frac{\alpha}{\alpha} \\
&= 1
\end{aligned}

In the first line we use the definition of \beta_n we saw earlier. The second line then follows from the definition of \bar\omicron_n (it is calculated from the previous omicron and \alpha). Finally, the third line follows from that same definition, where \bar\omicron_0 is the special base case equal to 0. Now we can plug this back into the equation for Q_2 to see how the bias disappears:

\begin{aligned}
Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\
&= 1 \cdot R_1 + (1 - 1)Q_1 \\
&= R_1 + Q_1 - Q_1 \\
&= R_1
\end{aligned}

Phew, and there you have it! 😅 There is no bias, because the initial estimate Q_1 is eliminated when we calculate Q_2! Since all the subsequent Q estimates are built from earlier Q estimates, the bias is eliminated permanently. This might seem like a lot of work, but in practice it’s just a couple of extra lines of code.
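As a quick gut check on how \beta_n behaves, here is the sequence for \alpha = 0.1. The effective step size starts at exactly 1 (completely overwriting the initial estimate) and then decays towards \alpha:

\begin{aligned}
\bar\omicron_1 &= 0 + 0.1(1 - 0) = 0.1, & \beta_1 &= 0.1/0.1 = 1 \\
\bar\omicron_2 &= 0.1 + 0.1(1 - 0.1) = 0.19, & \beta_2 &= 0.1/0.19 \approx 0.53 \\
\bar\omicron_3 &= 0.19 + 0.1(1 - 0.19) = 0.271, & \beta_3 &= 0.1/0.271 \approx 0.37
\end{aligned}

As n grows, \bar\omicron_n approaches 1 and \beta_n settles down to \alpha, so in the long run this rule behaves just like the ordinary constant step-size update.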

Now let’s take a look at how the unbiased version of recency-weighted \epsilon-greedy performs:

Unbiased update rule speeds up early exploration

The difference is hard to spot on these plots, but if you look at the plot on the right you can see that the unbiased \epsilon-greedy converges slightly faster early on. The difference becomes negligible over time, but it’s a nice little performance boost for the early steps of non-stationary problems!

Upper confidence bound (UCB) action selection:

So far we’ve only been considering \epsilon-greedy based agents, and we haven’t really tackled the underlying problem: exploring at random is not a smart way to explore. Now we’re going to change that and discuss UCB – upper confidence bound action selection. This method selects actions based on how confident we are in each arm’s q-value estimate. The diagram below is a good visualisation to help explain:

What this shows is that the confidence in the estimate of Arm 1 is quite high (blue) and the confidence in the estimate of Arm 2 is low (orange). UCB uses this uncertainty to calculate an upper confidence bound for each estimate. In the plot the upper confidence bound for Arm 1 is shown by the green line, and the upper confidence bound for Arm 2 by the red line. When choosing an action, UCB selects the action with the highest upper confidence bound, which in this case would be Arm 2.

This choice of action is controlled by the following formula:

\begin{aligned}
A_t \doteq \argmax_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \space \right]
\end{aligned}

t is the current timestep and \ln t is its natural logarithm. N_t(a) is the number of times action a has been selected before time t. c is a confidence level parameter that we choose; it’s often set to 2, which corresponds loosely to a two-standard-deviation (roughly 95%) confidence bound. In code this formula completely replaces the action selection logic in choose_action() compared to \epsilon-greedy:

  def choose_action(self):
    action = np.argmax(self.ucb_calc())
    self.arm_ns[action] += 1

    reward, optimal = self.problem.draw_from_arm(action)

    self.update_estimate_incremental(action, reward)

    return reward, optimal

  def ucb_calc(self):
    t = np.sum(self.arm_ns)
    arm_ucbs = np.zeros(len(self.arm_qs))

    for i in range(len(self.arm_qs)):

      if self.arm_ns[i] == 0:
        # If we have not explored this arm before, we consider it maximising
        arm_ucbs[i] = np.inf
        continue

      # q-value estimate plus the exploration bonus for this arm
      arm_ucbs[i] = self.arm_qs[i] + self.confidence_level * np.sqrt(np.log(t)/self.arm_ns[i])

    return arm_ucbs
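These two methods assume the agent keeps the same bookkeeping arrays as the \epsilon-greedy agent. Here’s a minimal constructor sketch under that assumption; the class name UCBAgent and the exact fields are illustrative rather than the notebook’s exact code:

class UCBAgent():
  def __init__(self, bandit_problem, confidence_level=2.0):
    self.problem = bandit_problem
    self.confidence_level = confidence_level  # the 'c' parameter in the UCB formula

    num_arms = len(bandit_problem.bandit_arms_values)
    self.arm_qs = np.zeros(num_arms)  # q-value estimate for each arm
    self.arm_ns = np.zeros(num_arms)  # number of times each arm has been pulled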

For now we still update the q-value estimates using the incremental average update rule. Here are the results on a stationary problem:

upper confidence bound action selection performs well on stationary problems

The UCB algorithm outperforms both \epsilon-greedy and fixed exploration greedy (\epsilon-first) on stationary problems! 📈 We can see that UCB finds the optimal action faster and more often than either of the other two previous best methods. But what about non-stationary problems?

upper confidence bound action selection performs poorly on non-stationary problems

Hmm, that’s disappointing 🤔 But it makes sense, since we are using the incremental average update rule, which does not track changing values well in non-stationary environments! Unfortunately UCB is tricky to adapt to the non-stationary setting, and doing it well is still an active area of research, largely because the confidence term depends on both the q-value estimates and the pull counts. However, there is one simple variant we can use which follows a similar principle to the recency-weighted average update, called discounted UCB (D-UCB):

  def update_estimate_discounted(self, choice_index, reward):
    # Discount all reward estimates and pull counts (i.e. gradually forget old steps)
    self.arm_qs *= self.gamma
    self.arm_ns *= self.gamma

    self.arm_qs[choice_index] += reward
    self.arm_ns[choice_index] += 1

We simply discount every arm’s q-estimate and pull count alike, which acts as a recency-weighting for both the q-value estimate and the confidence term. Results using this method are much better, and it also beats the previous best approaches on non-stationary problems.
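One detail worth spelling out: in the usual formulation of D-UCB, the q-value estimate used in the score is the discounted reward sum divided by the discounted pull count, and the discounted count also feeds the exploration bonus. Here is a minimal sketch of how ucb_calc could be adapted along those lines; this is my reading of the method rather than the notebook’s exact code, and the method name ucb_calc_discounted is just illustrative:

  def ucb_calc_discounted(self):
    # Discounted total pull count; guard the log so it never goes negative early on
    t = np.sum(self.arm_ns)
    log_t = np.log(t) if t > 1.0 else 0.0

    arm_ucbs = np.zeros(len(self.arm_qs))
    for i in range(len(self.arm_qs)):
      if self.arm_ns[i] == 0:
        arm_ucbs[i] = np.inf  # arms that have never been pulled get tried first
        continue

      # discounted average reward = discounted reward sum / discounted pull count
      q_estimate = self.arm_qs[i] / self.arm_ns[i]
      bonus = self.confidence_level * np.sqrt(log_t / self.arm_ns[i])
      arm_ucbs[i] = q_estimate + bonus

    return arm_ucbs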

Discussion and future work:

That’s it! We’re finally done with multi-armed bandits. 🐙

We have covered a range of algorithms that make decisions and explore under uncertainty, in both stationary and non-stationary settings. Most of these techniques will be applicable in more advanced reinforcement learning settings, but we’ll cover those in future posts. There are definitely more advanced multi-armed bandit methods out there, such as Thompson sampling or KL-UCB, but I won’t write about those yet (and I’m not sure if/when I will). Hopefully you now have a good and thorough intuition for how and why all of these multi-armed bandit methods work!

Multi-Armed Bandits 2: ε-Greedy and Non-Stationary Problems

Today we’re going to address some of the problems with the ε-first exploration approach for multi-armed bandit problems. In the last post we saw how ε-first can perform very well on stationary problems, where the true value q_* of each bandit arm (slot machine in our example) never changes. But in the real world we are often faced with problems where the true value of a choice changes over time. In these situations the ε-first approach will not adapt to the changing environment, and will blindly keep selecting the same suboptimal action over and over. As with my previous post, you can follow along and run the code in this colab notebook. 📝

Non-stationary problems:

A non-stationary problem is one where the underlying true value (q_*) of each bandit arm can gradually change over the course of an episode. Using our slot machine analogy, we can imagine that the slot machines start the day with a random average payout value, and this average payout can slowly increase or decrease throughout the day. We model this by adapting our BanditProblem class to let the arm values change gradually. This is done by simulating a ‘random walk’ for each arm: at every step we draw a small random number from a normal distribution for each arm and add it to that arm’s true value, so the true payouts can gradually drift up or down.

import numpy as np

class BanditProblem():
  def __init__(self, arms, seed, stationary=True):

    self.stationary = stationary

    np.random.seed(seed)  # seed the RNG so experiments are reproducible
    self.bandit_arms_values = np.random.normal(loc=0.0, scale=1.0, size=arms)
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

  def draw_from_arm(self, arm_index):
    chose_optimal = 1 if arm_index == self.optimal_action_index else 0
    reward = np.random.normal(loc=self.bandit_arms_values[arm_index], scale=1.0)

    return reward, chose_optimal

  def step_arm_values(self):
    '''
    Step to be called manually in the episode loop.
    '''
    q_star_value_shift = np.random.normal(loc=0.0, scale=0.01, size=len(self.bandit_arms_values))
    self.bandit_arms_values += q_star_value_shift
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

This new logic is handled by the step_arm_values function above, which makes small changes to the true arm values and is called once per step, after the agents have drawn from the arms.

Introducing ε-Greedy:

Our first attempt at tackling the non-stationary bandit problem uses the well-known ε-greedy approach. It is fairly simple and similar to ε-first, but with one small difference: instead of exploring randomly for some fixed number of steps at the start, ε-greedy explores randomly some percentage of the time throughout the entire episode. This means ε-greedy never stops exploring some small portion of the time, determined by the value of ε. This could be a small number, like 0.01, meaning that the agent explores randomly approximately 1% of the time. It also means that ε-greedy can start exploiting the best option it has found right away – there is no long period of exploration needed before exploitation. The incremental average update rule for the q-value estimate stays exactly the same as it was for ε-first; here it is as a reminder:

\begin{aligned}
Q_{n+1} &=  Q_n + \frac{1}{n}[R_n - Q_n] 
\end{aligned}

Adding this one simple change – a small chance of exploring randomly at every step – is enough to let the ε-greedy bandit adapt (at least partially, as we’ll see) to non-stationary problems, because it keeps updating its belief about which choice is best.
So is ε-greedy ‘better’ than ε-first? Well, it depends. On a stationary problem where the values of the slot machines never change, ε-first is probably better (if you can afford the upfront exploration). On a non-stationary problem, ε-greedy will be better.

Here’s the code for an ε-greedy bandit:

class EpsilonGreedyAgent():
  def __init__(self, epsilon, bandit_problem, alpha=0.1, update_type="incremental"):
    self.epsilon = epsilon
    self.alpha = alpha
    self.problem = bandit_problem
    self.update_type = update_type

    self.arm_qs = np.zeros(len(bandit_problem.bandit_arms_values))
    self.arm_ns = np.zeros(len(bandit_problem.bandit_arms_values))


  def choose_action(self):
    if np.random.rand() > self.epsilon:
      # greedily pull best arm
      choice = np.argmax(self.arm_qs)
    else:
      # explore: pull a random arm (there is still a chance this is the best arm)
      choice = np.random.randint(0, len(self.arm_qs))

    self.arm_ns[choice] += 1

    reward, optimal = self.problem.draw_from_arm(choice)

    if self.update_type == "incremental":
      self.update_estimate_incremental(choice, reward)
    elif self.update_type == "weighted":
      self.update_estimate_weighted(choice, reward)

    return reward, optimal

  def update_estimate_incremental(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (1/self.arm_ns[choice_index])*(reward - self.arm_qs[choice_index])

  def update_estimate_weighted(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (self.alpha*(reward - self.arm_qs[choice_index]))

Once again most of the logic is handled in the choose_action function. Notice how at each step we draw a random real number between 0.0 and 1.0: if it is greater than ε (epsilon) we exploit the best arm found so far, otherwise we explore by choosing an arm at random. We then record the reward/payout received and update our estimate for the chosen arm. There are two update types here, but the weighted one can be ignored for now; I’ll explain it later.
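Before looking at results, here is a rough sketch of how these pieces fit together in an episode loop for a non-stationary problem; the notebook’s actual experiment loop may differ (for example in how results are averaged over runs):

problem = BanditProblem(arms=10, seed=42, stationary=False)
agent = EpsilonGreedyAgent(epsilon=0.1, bandit_problem=problem)

rewards = []
for step in range(10000):
  reward, optimal = agent.choose_action()   # agent picks an arm and updates its estimate
  rewards.append(reward)
  problem.step_arm_values()                 # then the true arm values drift slightly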

Results: ε-First vs ε-Greedy

So how does ε-greedy fare against ε-first on a stationary problem? See the graphs below:

As expected, on our stationary 10-armed bandit problem the ε-first agent fares better than the ε-greedy agent. This is because the values of the bandit arms never change, so once the ε-first bandit has locked on to the best choice, it exploits that choice continually. In contrast, ε-greedy takes a while longer to find the optimal choice, and even when it does find it, it still explores the other sub-optimal options 10% of the time. The only upside to the ε-greedy approach here is that it starts gathering good rewards almost right away, whereas ε-first takes 1000 exploration steps before collecting high rewards.

But what about a non-stationary problem – when the values of the bandit arms change? Can you predict what will happen in this case? Which approach will fare better?

Aha! Now the tables have turned. ε-first often does find a better choice than ε-greedy for a short while right after its 1000 exploration steps, but that choice quickly becomes stale as the values of the arms change. ε-greedy fares much better overall, and continues to increase its average score over the course of the episode! However, ε-greedy seems to exhibit the same staleness problem as ε-first (though not as badly): as the episode goes on it chooses the optimal action less and less often. Can you think of a reason for this? Hint: think about the q-value update rule shown above (we discussed this in my last post too).

The reason is that the update rule puts n in the denominator of the fraction that acts as the step size, so the step size gets smaller and smaller as the episode goes on. Eventually the step size is so small that the q-value estimates barely change on each update, and their ranking (the order of the bandit arms according to our estimates) almost never changes late in the episode. So, eventually, ε-greedy behaves almost exactly like ε-first and gets stuck pulling the same arm when exploiting (while still pulling other arms randomly some percentage of the time). This explains why ε-greedy slowly makes fewer and fewer optimal choices – its arm value estimates stop keeping up with the changes to each arm’s true value q_* the longer the episode goes on!
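To put a rough number on it (the figures here are made up purely for illustration): suppose an arm has already been pulled 1000 times and our estimate sits at Q_{1000} = 0.5, but the arm’s true value has since drifted up to around 1.5. Even if the next pull returns a reward of exactly 1.5, the estimate barely moves:

\begin{aligned}
Q_{1001} &= Q_{1000} + \frac{1}{1000}[R_{1000} - Q_{1000}] \\
&= 0.5 + \frac{1}{1000}(1.5 - 0.5) = 0.501
\end{aligned}

It would take hundreds more pulls of this arm for the estimate to catch up, by which point the true value may have drifted somewhere else entirely.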

Recency Weighted Q-Update:

To truly solve the stale choice issue for non-stationary bandit problems we need to be a bit more intelligent with our q-value estimate update rule. We saw previously how ε-greedy with a sample average update rule has some problems: the step size gradually gets smaller and smaller. The fix is simple: keep this step size constant! The reason why this works is subtle.

Theory warning! The following section will be a bit math heavy. But if you find maths a little dense (I can relate) then I also include a plain English description and accompanying code below. Hopefully that helps! 😊

So we want to change our update rule to use a fixed step size \alpha. That’s easy enough: line (1) below is all we need to implement this in code, and the remaining lines are just there for understanding. The key to understanding why this behaves as a recency-weighted update (newer rewards are given more weight) is the recursive way \alpha is applied to the existing estimate Q_n. Remember that Q_n at any time step is the result of all the previous updates being applied one after another. Lines (2) to (6) demonstrate this below by unrolling Q_n into a sum of contributions from all previous time steps.

\begin{align}
Q_{n+1} &= Q_n + \alpha[R_n - Q_n] \\
&=  \alpha R_n + (1- \alpha)Q_n \\
&= \alpha R_n + (1- \alpha)[\alpha R_{n-1} + (1- \alpha)Q_{n-1}] \\
&= \alpha R_n + (1 - \alpha) \alpha R_{n-1} + (1- \alpha)^2 Q_{n-1} \\
&= \alpha R_n + (1 - \alpha)\alpha R_{n-1} + (1 - \alpha)^2 \alpha R_{n-2} + \\
& \qquad \qquad \qquad \cdots + (1- \alpha)^{n-1}\alpha R_{1} + (1- \alpha)^nQ_1\\
&= (1-\alpha)^nQ_1 + \sum_{i=1}^{n} (1-\alpha)^{n-i}\alpha R_i
\end{align}

On line (7) we express Q_{n+1} in a single line as a weighted sum of all previous rewards, where each reward R_i is weighted by \alpha(1-\alpha)^{n-i} (n-i being the number of steps ago that reward was received), plus the initial estimate Q_1 weighted by (1-\alpha)^n. So as we keep updating Q, the contribution of older rewards to our current estimate gets exponentially smaller and smaller.
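To make the decay concrete, here are the weights for \alpha = 0.1:

\begin{aligned}
\text{weight of } R_n &= \alpha = 0.1 \\
\text{weight of } R_{n-1} &= \alpha(1-\alpha) = 0.09 \\
\text{weight of } R_{n-2} &= \alpha(1-\alpha)^2 = 0.081 \\
\text{weight of } Q_1 \text{ after 50 updates} &= (1-\alpha)^{50} \approx 0.005
\end{aligned}

So the most recent handful of rewards dominate the estimate, while anything more than a few dozen steps old contributes almost nothing.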

It’s this attribute of our new recency-weighted average that allows our estimates to stay fresh as the ε-greedy continually explores and the true values of the bandit arms change over time. This update rule reacts much faster to changes in the values of the bandit arms and can do so for as long as the episode continues. Although, if it experiences an unlucky streak of exploration it may be temporarily misled into believing that the optimal action is not actually the best. Even so, it will usually fix this mistake quite quickly when the unlucky streak breaks.

You can find the code for this update rule in the update_estimate_weighted function in the code snippet above.

Results: ε-First vs ε-Greedy vs Recency-Weighted ε-Greedy

Naturally, we wouldn’t expect the recency-weighted ε-greedy to be any better on stationary problems, and the graphs below confirm this.

It’s clear that recency-weighted averages hurt performance in the stationary setting. Whereas ε-greedy eventually approaches choosing the best action 90% of the time (the remaining steps are exploration), the recency-weighted ε-greedy chooses the optimal action about 80% of the time and does not seem to be improving. In this situation the recency-weighted ε-greedy is limited by the \alpha value, which determines the ‘forgetfulness’ of the update rule, so this could be tuned a bit to improve performance in the stationary setting. But where recency-weighted ε-greedy really shines is in the non-stationary setting:

Much better! Adding the recency-weighted update rule allows the ε-greedy agent to outperform both prior approaches on non-stationary problems. The recency-weighted update gradually ‘forgets’ about old updates so that it can quickly switch to the new optimal choice as the episode progresses and the values of the bandit arms gradually change.

Discussion and future work:

So that’s it, right? We’ve solved the exploration/exploitation problem? Uhm… actually no. In the last two posts we’ve seen some really simple (and quite effective) methods to balance exploration and exploitation, but there’s one more method I want to cover. It addresses a problem with ε-greedy: when it explores it does so totally at random, but couldn’t we focus exploration on the choices that seem most promising? That’s the topic of the next post, where I’ll cover Upper Confidence Bound (UCB) action selection, plus some neat tricks to make ε-greedy exploration more effective too! 🤖
