Multi-Armed Bandits 1: ε-first

In this post we’re going to discuss the age old problem of making decisions when there is uncertainty. To illustrate what I mean, we’re going to dive right into multi-armed bandits problems and what they are exactly. You can follow along and run the code yourself using this google colab notebook. I’ll be updating the notebook to include code and experiments for future MAB posts. 🔄

Intro to Bandit Problems:

A bandit problem is a situation where we have a series of choices of actions – some of those choices are better than others – but we don’t know which ones. The choices are totally independent of one another, meaning one choice does not affect the next choice, and we can choose from all the options on each turn. For these problem we want to figure out what the best option is as quickly as possible and then continually exploit that option until we run out of turns.

The classic example is given using a set of slot machines that have different average payouts. We use the term “average payouts” here because there is some randomness (noise) to the payouts, meaning we get a slightly different payout each time. If payouts were certain, we would just need to sample from each machine once to figure out which is best. 🎰

So, ideally we want to figure out which is the best slot machine with the highest average payout as quickly as possible – so we can then focus on playing only on that best slot machine and win as much money as possible. In other words we don’t want to waste turns playing on slot machines where the average payout is lower than the best machine – but first we have to figure out which machine is best.

The example we’ll work with from here on is the 10-armed bandit problem – where we have 10 slot machines to choose from, each with different average payouts, and our job is to figure out which is the highest paying machine. We initialise the true value Q_* of each slot machine (bandit arm) by sampling from a normal distribution with a mean of 0 and a standard deviation of 1. Code for the problem is supplied below 🙂

import numpy as np

class BanditProblem():
  def __init__(self, arms, seed, stationary=True):
    self.stationary = stationary
    self.bandit_arms_values = np.random.normal(loc=0.0, scale=1.0, size=arms)
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

  def draw_from_arm(self, arm_index):
    chose_optimal = 1 if arm_index == self.optimal_action_index else 0
    reward = np.random.normal(loc=self.bandit_arms_values[arm_index], scale=1.0)

    return reward, chose_optimal

Fixed exploration period (ε-first):

Our first and most intuitive solution is to choose some arbitrary fixed period of exploration where we play each machine at random and try to build up some useful estimate of what the average payout is. This strategy is sometimes referred to as fixed-exploration or ε-first (epsilon-first). With this approach we would aim to play on each machine say 100 times, and then average the payouts and use that as our estimate. We would then continually play the machine with the best estimated average payout. Similarly, we could pick randomly 1000 times which machine to play and then base our estimate on the average payouts received. Our estimate is just the sample average from the first ~100 turns on that slot machine.

Theory warning! The following section will be a bit math heavy. But if you find maths a little dense (I can relate) then I also include a plain English description and accompanying code below. Hopefully that helps!😊

The estimate for each bandit arm (slot) is called a q-value estimate, and the true value for each arm is referred to as q_* – this is the thing we’re trying to estimate. We can keep track of this q-value estimate for each arm using a simple update rule derived from the sample average formula:

Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n}R_i \\
&= \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1}R_i \right) \\
&= \frac{1}{n} \left(  R_n +(n-1)\frac{1}{n-1} \sum_{i=1}^{n-1}R_i\right)  \\
&= \frac{1}{n} \left(  R_n + (n - 1)Q_n \right) \\
&= \frac{1}{n} \left( R_n + nQ_n - Qn\right) \\
&= Q_n + \frac{1}{n}[R_n - Q_n] 

where R_i is the reward from the current step from a given slot machine and n is the count of the number of times this slot machine has been played. Line (3) follows because (n-1)\frac{1}{n-1} = 1, and multiplying the sum by 1 does not change its value. Line (4) follows because:

Q_n &= \frac{R_1 + R_2 + \cdots +R_{n-1}}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n-1}R_i

In plain English, line (6) in the first equation above says is that we can update the estimate Q_{n+1} by finding the difference between the received reward and the previous estimate (R_n - Q_n) let’s call this \delta. We then use this difference to update our estimate Q_n, by moving our old estimate a small step towards the most recent reward. This is done by weighting \delta by a step size, defined by dividing it by the number of times this arm has been pulled. We can think of this as taking the form:

NewEstimate \leftarrow OldEstimate + StepSize[Reward- Old Esimate]

Think of this as a gradually reducing the step size the more times we pull a given arm, because we can gradually become more certain of the true value. Notice in equation 1 at line (6), when the denominator n (number of arm pulls) increases, the step size decreases.

ε-first code:

Okay, enough theory for now, let’s put try putting what we’ve learned into code!

class FixedGreedyAgent():
  def __init__(self, exploration_steps, bandit_problem):
    self.problem = bandit_problem
    self.current_step = 0
    self.exploration_steps = exploration_steps

    self.arm_qs = np.zeros(len(bandit_problem.bandit_arms_values))
    self.arm_ns = np.zeros(len(bandit_problem.bandit_arms_values))

  def choose_action(self):
    if self.current_step > self.exploration_steps:
      # greedily pull best arm
      choice = np.argmax(self.arm_qs)
      # explore pull any random arm (still a chance to pull best arm too)
      choice = np.random.randint(0, len(self.arm_qs))

    reward, optimal = self.problem.draw_from_arm(choice)

    self.arm_ns[choice] += 1

    if self.current_step > self.exploration_steps:
      self.update_estimate_incremental(choice, reward)

    self.current_step = self.current_step + 1

    return reward, optimal

  def update_estimate_incremental(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (1/self.arm_ns[choice_index])*(reward - self.arm_qs[choice_index])

You can see line (6) from equation 1 implemented in the above in code in the update_estimate_incremental function. Most of the logic is implemented in the choose_action function, where our agent will randomly explore for some fixed number of exploration_steps and then exploit the best estimated “arm” (slot machine in our example) from then onwards. The FixedGreedyAgent will keep track of the current q-value and number of pulls for each arm, as well as a parameter for number of exploration steps to explore before exploiting.

All of the code for this blog is available in this google colab notebook so you can run it yourself and play around with the parameters 🙂

I chose to run the experiment for 50000 steps per episode and repeated this 2000 times to get an average score for reach step. Our fixed-exploration greedy agent explores for 1000 steps before exploiting. I recommend using a much lower value for steps and episodes when running on google colab or the experiment will take a long time to complete, but feel free to play around with the exploration steps parameter. We can see the results for the average reward per step for the fixed exploration bandit below:

Discussion and future work:

Okay, so it looks like our fixed exploration strategy worked pretty well, right? Well yes, but there are some problems with this strategy:

  1. We have to spend X number of turns exploring randomly before we can exploit and get a decent score. This could be very costly in some scenarios, for example if it costs $1 to play each slot machine, and our average return for the first 1000 steps is $0, then it is going to cost us $1000 out of pocket just to explore, before we see any returns! So we may need deep pockets for this exploration strategy.
  2. Also the more arms we have, the longer we have to spend randomly exploring before we can be confident we have found the optimal one and exploit it. If we had 150 arms or more then 1000 exploration steps may not be adequate. If there are thousands of bandit arms, then this exploration strategy may not be viable at all.
  3. If we don’t confidently find the optimal choice (slot machine) within those first X exploration steps, then we will lock on to a suboptimal choice for the rest of the episode! We could be stuck making bad decisions forever.
  4. If the value of the bandit arms (slots) changes over the course of an episode, our agent will never update its belief about the best arm to choose. It’s fixed decision could get worse and worse as an episode progresses.

This plot makes the problems visible. During the first 1000 steps we pick the optimal choice roughly 10% of the time (makes sense; 10 choices picked at random), and after the exploration phase is done the fixed exploration bandit manages to ‘find’ the best option ~95% of the time. But, once the agent is done exploring it never changes its mind about the best option. It also isn’t guaranteed to have found the best slot machine in the first place, especially if the number of arms is considerably more than 10.

This begs the question: can we explore more effectively? And that is a very, very good question. Actually, that question is so good that people are still trying to figure it out! In the next post we’ll learn about some more simple and effective ways to balance exploration and exploitation, even for non-stationary problems, using an ε-greedy approach (and others), and different q-value estimate update rules 🤓


  1. Russel, S. J., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (Global Edition) (4th ed.). Pearson.
  2. Sutton, R., & Barto, A. (2018). Reinforcement Learning: An Introduction (2nd ed.). The MIT Press.
  3. Fragkiadaki, K. (2019). Exploration/Exploitation in Multi-armed Bandits.

One thought on “Multi-Armed Bandits 1: ε-first

Leave a Reply

Scroll to Top
%d bloggers like this: