In this post we’ll walk through some neat tricks to make \epsilon-greedy more effective, and then we’ll dig into a smarter way to handle exploration: upper confidence bound action selection. We’ll be building on what we learned in my last post, and as always the code can be found in this colab notebook so you can follow along and try it out yourself 🙂

## Optimistic initial values:

One problem with \epsilon-greedy is that randomness isn’t a very intelligent way to conduct exploration, especially at the start of an episode when the agent knows nothing about its options. Using our slot machines analogy again: because exploration is random, \epsilon-greedy might not try every machine even once for quite some time. Instead it might explore a few of the arms early on and then spend most of its time exploiting a sub-optimal arm.

A really simple and clever way to make this early exploration more systematic is to give the \epsilon-greedy agent *optimistic* initial estimates of each arm’s rewards. In practice this means setting the initial q-values quite high (rather than zero). This exploits the greediness of \epsilon-greedy: the agent selects each arm in turn, expecting a high reward but receiving a relatively low one, and revises its estimate downwards at each timestep until it starts to converge on the true value. Overall, the agent is encouraged to explore much more effectively, and the optimistic starting estimates are gradually reduced to something more realistic.
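To make this concrete, here’s a minimal sketch of the idea. The arm count, the initial value of 5.0, and the reward distributions are all illustrative assumptions, not values from the notebook — the point is just that a *purely greedy* agent with optimistic initial q-values is forced to try every arm early on:

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 5
true_means = rng.normal(0.0, 1.0, n_arms)  # a hypothetical stationary bandit

q = np.full(n_arms, 5.0)  # optimistic initial q-values, well above any plausible reward
n = np.zeros(n_arms)
alpha = 0.1

first_visits = []
for _ in range(20):
    a = int(np.argmax(q))  # purely greedy selection, no epsilon at all
    if n[a] == 0:
        first_visits.append(a)
    n[a] += 1
    r = rng.normal(true_means[a], 1.0)
    q[a] += alpha * (r - q[a])  # each pull drags the optimistic estimate down
```

Because every untried arm still advertises a q-value of 5.0, the agent cannot settle on one arm until it has sampled all of them, so `first_visits` ends up covering every arm.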

You can see the results after applying this trick below:

The optimistic \epsilon-greedy agent converges much faster than regular \epsilon-greedy. This can make a big difference in problem settings where the number of steps per episode is small. Over longer horizons the advantage mostly disappears, as the impact of the early exploration fades over time. It’s worth keeping in mind that this approach *won’t* help much in a non-stationary setting.

## Unbiased constant step-size:

There is a very subtle issue that we should address with the constant step size update rules (using a constant \alpha for recency-weighting). The issue is that this update rule is biased by the inclusion of the initial estimate. Recall that the recency-weighted update rule for Q-values is essentially a weighted sum of all past rewards, plus a weighting of the initial Q value estimate (see my last post for a deeper analysis):

\begin{aligned} Q_{n+1} &= Q_n + \alpha[R_n - Q_n] \\ &= (1-\alpha)^nQ_1+ \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i \end{aligned}
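We can sanity-check that the closed form matches the iterative updates numerically — a quick sketch, with \alpha = 0.1 and 50 random rewards chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, q1 = 0.1, 0.0
rewards = rng.normal(0.0, 1.0, 50)

# Iterative form: apply the constant step-size update once per reward
q = q1
for r in rewards:
    q += alpha * (r - q)

# Closed form: (1 - alpha)^n * Q_1 + sum of alpha * (1 - alpha)^(n - i) * R_i
n = len(rewards)
closed = (1 - alpha) ** n * q1 + sum(
    alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
)

# The two agree (up to floating point error)
```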

This means our initial Q-value estimate *permanently* alters the subsequent Q-value estimate updates. The good news is that while the impact is permanent, it gradually reduces over more time until it is virtually non-existent. However, we’d like to remove this bias if possible, to make our agent’s early exploration more effective. To do that we need to alter our update rule formulation a little bit. The change involves altering the step size from just \alpha to be:

\begin{aligned} \beta_n \doteq \alpha / \bar\omicron_n \end{aligned}

This \bar\omicron_n is the really interesting part, and it is defined as follows:

\begin{aligned} \bar\omicron_n \doteq \bar\omicron_{n-1} + \alpha(1- \bar\omicron_{n-1}), \space \text{for} \space n > 0, \space \text{with} \space \bar\omicron_{0} = 0 \end{aligned}
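To get a feel for this recursion, note that unrolling it gives \bar\omicron_n = 1 - (1-\alpha)^n, so \beta_1 = 1 and \beta_n decays back towards plain \alpha. A quick numeric sketch (with an illustrative \alpha = 0.1):

```python
alpha = 0.1
omicron = 0.0
betas = []
for n in range(1, 51):
    omicron += alpha * (1 - omicron)  # omicron_n = omicron_{n-1} + alpha * (1 - omicron_{n-1})
    betas.append(alpha / omicron)     # beta_n = alpha / omicron_n

# beta_1 == 1, so the very first update completely overwrites Q_1;
# by n = 50, beta_n has already decayed back to roughly alpha
```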

Okay, so this formula maybe looks a bit confusing. In plain English it just means we keep track of a separate \bar\omicron_{n} for each arm/action, and update it using the rule above each time we pull the corresponding arm. In code this just means a slightly modified update function that looks like:

```
def update_estimate_weighted_unbiased(self, choice_index, reward):
    # Update omicron for this action first: o_n = o_{n-1} + alpha * (1 - o_{n-1})
    self.omicrons[choice_index] += self.alpha * (1 - self.omicrons[choice_index])
    # beta_n = alpha / o_n; on the first pull this is exactly 1,
    # which wipes out the biased initial estimate
    beta = self.alpha / self.omicrons[choice_index]
    self.arm_qs[choice_index] += beta * (reward - self.arm_qs[choice_index])
```

Simple, right? 😊 To see *why* this is an unbiased recency-weighted average we’ll have to do a bit more algebra. I’ll explain it as we go, but if you’re not interested then feel free to skip to the results.

First we’ll start with our original recency-weighted average formula, but we’ll swap in \beta_n for \alpha and rework things a bit to get only one Q_n on the right hand side:

\begin{aligned} Q_{n+1} &= Q_n +\beta_n(R_n - Q_n) \\ &= Q_n + \beta_nR_n - \beta_nQ_n\\ &= \beta_n R_n + (1- \beta_n)Q_n \end{aligned}

The second line is just an expansion of the brackets in the first line, and the third line factors the two terms including Q_n together. We’re mostly interested in Q_2 (because Q_1 is the biased initial estimate) to understand how the bias is eliminated:

\begin{aligned} Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\ \end{aligned}

To evaluate this we first need to work out what \beta_1 is (this is the important part):

\begin{aligned} \beta_1 &= \frac{\alpha}{\bar\omicron_1} \\ &= \frac{\alpha}{\bar\omicron_0 + \alpha(1-\bar\omicron_0)} \\ &= \frac{\alpha}{0 + \alpha(1- 0)} \\ &= \frac{\alpha}{\alpha} \\ &= 1 \end{aligned}

In the first line we use the definition of \beta_n we saw earlier. The second line then follows from the definition of \bar\omicron_n (it is calculated from the previous omicron and \alpha). Finally, the third line follows from that same definition, where \bar\omicron_0 is a special edge case equal to 0. Now we can plug this back into the equation for Q_2 to see how the bias disappears:

\begin{aligned} Q_2 &= \beta_1 R_1 + (1 - \beta_1)Q_1 \\ &= 1 \cdot R_1 + (1 - 1)Q_1 \\ &= R_1 + Q_1 - Q_1 \\ &= R_1 \end{aligned}

Phew, and there you have it! 😅 There is no bias because the initial estimate Q_1 is eliminated when we calculate Q_2! Since all subsequent Q estimates are built on earlier ones, the bias is eliminated permanently. This might seem like a lot of work, but in practice it’s just a couple of extra lines of code.
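You can also verify this bias elimination directly: run the first unbiased update from two wildly different initial estimates and both land on R_1 (the specific numbers here are arbitrary illustrations):

```python
alpha, r1 = 0.1, 0.7
results = []
for q1 in (0.0, 100.0):  # two very different initial estimates
    omicron = 0.0 + alpha * (1 - 0.0)  # omicron_1 = alpha
    beta = alpha / omicron             # beta_1 = 1
    results.append(q1 + beta * (r1 - q1))

# Both runs recover r1 (up to floating point error):
# the initial estimate has no influence on Q_2
```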

Now let’s take a look at how the unbiased version of recency-weighted \epsilon-greedy performs:

The difference is hard to spot on these plots, but if you look at the plot on the right you can see that the unbiased \epsilon-greedy converges slightly faster early on. The difference becomes negligible over time, but it’s a nice little performance boost for the early steps of non-stationary problems!

## Upper confidence bound (UCB) action selection:

So far we’ve still been considering \epsilon-greedy based agents, and we haven’t really tackled the problem that exploring randomly is not a smart way to approach exploration. Now we’re going to change that and discuss UCB – upper confidence bound action selection. This method selects actions by calculating a measure of confidence in the q-value estimate of each arm. The diagram below is a good visualisation to help explain:

What this shows is that the confidence in the estimate for Arm 1 is quite high (blue), while the confidence in the estimate for Arm 2 is low (orange). UCB uses this to calculate an upper confidence bound for each estimate (shown here at roughly the 95% level). In the plot the upper bound for Arm 1 is shown by the green line, and the upper bound for Arm 2 by the red line. When choosing an action, UCB selects the one with the highest upper confidence bound, which in this case would be Arm 2.

This choice of action is controlled by the following formula:

\begin{aligned} A_t \doteq \argmax_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \space \right] \end{aligned}

Here t is the current timestep and \ln t is its natural logarithm. N_t(a) is the number of times action a has been selected before time t, and c is a confidence level parameter we can choose (often set to 2, roughly analogous to a 95% confidence interval). In code this formula completely replaces the action selection logic in *choose_action()* compared to \epsilon-greedy:

```
def choose_action(self):
    action = np.argmax(self.ucb_calc())
    self.arm_ns[action] += 1
    reward, optimal = self.problem.draw_from_arm(action)
    self.update_estimate_incremental(action, reward)
    return reward, optimal

def ucb_calc(self):
    t = np.sum(self.arm_ns)
    arm_ucbs = np.zeros(len(self.arm_qs))
    for i in range(len(self.arm_qs)):
        if self.arm_ns[i] == 0:
            # If we have not explored this arm before, we consider it maximising
            arm_ucbs[i] = np.inf
            continue
        # Q-value estimate plus the exploration bonus c * sqrt(ln(t) / N_t(a))
        arm_ucbs[i] = self.arm_qs[i] + self.confidence_level * np.sqrt(np.log(t) / self.arm_ns[i])
    return arm_ucbs
```
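To get a feel for the bonus term c\sqrt{\ln t / N_t(a)}, here’s a standalone sketch of just that calculation. The values c = 2, t = 100, and the pull counts of 2 and 50 are arbitrary illustrative numbers:

```python
import numpy as np

c, t = 2.0, 100

def bonus(n_a):
    # Exploration bonus c * sqrt(ln(t) / N_t(a))
    return c * np.sqrt(np.log(t) / n_a)

rare, frequent = bonus(2), bonus(50)
# A rarely-pulled arm gets a much larger bonus, so even a mediocre
# q-value estimate can win the argmax until it has been tried enough
```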

For now we still update the q-value estimates using the incremental average update rule. Here are the results on a stationary problem:

The UCB algorithm outperforms both \epsilon-greedy *and* fixed exploration greedy (\epsilon-first) on stationary problems! 📈 We can see that UCB finds the optimal action faster and more often than either of the other two previous best methods. But what about non-stationary problems?

Hmm, that’s disappointing 🤔 But it makes sense, since we are using the incremental average update rule, which does not track estimates well in non-stationary environments! The problem is that UCB is tricky to adapt to the non-stationary setting, and this is still an active area of research. This is mostly because the confidence estimate depends on both the Q-value estimates and the pull counts. However, there is one simple implementation we can use, which follows a similar principle to the recency-weighted average update, called discounted UCB (D-UCB):

```
def update_estimate_discounted(self, choice_index, reward):
    # Discount all reward estimates and pull counts (i.e. gradually forget old steps)
    self.arm_qs *= self.gamma
    self.arm_ns *= self.gamma
    self.arm_qs[choice_index] += reward
    self.arm_ns[choice_index] += 1
```
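One way to see why this keeps exploration alive: a pull count that is discounted every step can never grow past 1/(1-\gamma), so the confidence bonus for an arm never shrinks all the way to zero. A quick numeric check (\gamma = 0.99 is just an illustrative value):

```python
gamma = 0.99
n = 0.0
# Pull the same arm every step: the discounted count saturates instead of growing
for _ in range(1000):
    n = gamma * n + 1.0

limit = 1.0 / (1.0 - gamma)  # the cap on the effective pull count (100 here)
# n approaches but never exceeds the cap, so ln(t) / n stays bounded away from zero
```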

We simply discount every arm’s q-estimate and pull count alike, acting as a recency-weighting on both the q-value estimate and the confidence estimate. The results using this method are much better, and also beat the previous best approaches on non-stationary problems:

## Discussion and future work:

That’s it! We’re finally done with multi-armed bandits. 🐙

We have covered a range of algorithms that make decisions and explore under uncertainty, in both stationary and non-stationary settings. Most of these techniques will be applicable in more advanced reinforcement learning settings, but we’ll cover those in future posts. There are definitely more advanced multi-armed bandit methods out there, like Thompson sampling or KL-UCB, but I won’t write about those yet (and I’m not sure if/when I plan to). Hopefully you now have a good and thorough intuition for how and why all of these multi-armed bandit methods work!