Multi-Armed Bandits 2: ε-Greedy and Non-Stationary Problems

Today we’re going to address some of the problems with an ε-first exploration approach for multi-armed bandit problems. In the last post we saw how ε-first can perform very well on stationary problems, where the true value q_* of each bandit arm (slot machine in our example) never changes. But in the real world we are often faced with problems where the true value of a choice changes over time. In these situations the ε-first exploration approach will not adapt to the changing environment, and will ignorantly keep selecting the same suboptimal action over and over. As with my previous post, you can follow along and run the code in this colab notebook. 📝

Non-stationary problems:

A non-stationary problem is one where the underlying true value (q_*) of each bandit arm can gradually change over the course of an episode. Using our slot machine analogy, we can imagine that the slot machines start the day with a random average payout, and that this average payout can slowly increase or decrease throughout the day. We model this by adapting our BanditProblem class to let the arm values drift gradually: we simulate a ‘random walk’ for each arm by drawing a small random number from a normal distribution and adding it to that arm’s true value. This means the true payout of each arm can gradually creep up or down.

import numpy as np

class BanditProblem():
  def __init__(self, arms, seed, stationary=True):
    np.random.seed(seed)  # seed the random number generator so experiments are reproducible
    self.stationary = stationary

    # true value q_* of each arm, drawn from a standard normal distribution
    self.bandit_arms_values = np.random.normal(loc=0.0, scale=1.0, size=arms)
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

  def draw_from_arm(self, arm_index):
    chose_optimal = 1 if arm_index == self.optimal_action_index else 0
    # payouts are noisy: centred on the arm's true value with unit variance
    reward = np.random.normal(loc=self.bandit_arms_values[arm_index], scale=1.0)

    return reward, chose_optimal

  def step_arm_values(self):
    '''
    Step to be called manually in the episode loop.
    '''
    # small random walk: nudge every arm's true value by a tiny normal sample
    q_star_value_shift = np.random.normal(loc=0.0, scale=0.01, size=len(self.bandit_arms_values))
    self.bandit_arms_values += q_star_value_shift
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

This new logic is handled in the step_arm_values method above, which makes small changes to the true arm values and is called after the agents are done drawing from the arms on each step.
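To make the drift concrete, here’s a minimal sketch of stepping the problem manually and watching the true values wander (the specific numbers are just illustrative):

problem = BanditProblem(arms=10, seed=42, stationary=False)
print(problem.bandit_arms_values)    # the true values q_* at the start of the day

for _ in range(1000):
  problem.step_arm_values()          # each arm takes one small random-walk step

print(problem.bandit_arms_values)    # the values have drifted up or down
print(problem.optimal_action_index)  # and the best arm may have changed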

Introducing ε-Greedy:

Our first attempt at tackling the non-stationary bandit problem uses the well-known ε-greedy approach. It is fairly simple and similar to ε-first, but with a small difference: instead of exploring randomly for some fixed number of steps, ε-greedy explores randomly some percentage of the time throughout the entire episode. This means ε-greedy never stops exploring some small portion of the time, determined by the value ε. This could be a small number, like 0.01, meaning the agent explores randomly approximately 1% of the time. It also means that ε-greedy can start exploiting the best option it has found right away – there is no long period of exploration needed before exploitation. The incremental (sample-average) update rule for the q-value estimate stays exactly the same as it was for ε-first; here it is as a reminder:

\begin{aligned}
Q_{n+1} &=  Q_n + \frac{1}{n}[R_n - Q_n] 
\end{aligned}

Adding this one simple change – introducing some small chance of exploring randomly every step – is enough to allow the ε-greedy bandit to adapt to non-stationary problems, because it constantly updates its belief about the best choice by some small amount.
So is ε-greedy ‘better’ than ε-first? Well, it depends. On a stationary problem where the values of the slot machines never change, ε-first is probably better (if you can afford the upfront exploration). On a non-stationary problem, ε-greedy will be better.

Here’s the code for an ε-greedy bandit:

class EpsilonGreedyAgent():
  def __init__(self, epsilon, bandit_problem, alpha=0.1, update_type="incremental"):
    self.epsilon = epsilon
    self.alpha = alpha
    self.problem = bandit_problem
    self.update_type = update_type

    self.arm_qs = np.zeros(len(bandit_problem.bandit_arms_values))
    self.arm_ns = np.zeros(len(bandit_problem.bandit_arms_values))


  def choose_action(self):
    if np.random.rand() > self.epsilon:
      # exploit: greedily pull the arm with the highest estimated value
      choice = np.argmax(self.arm_qs)
    else:
      # explore: pull a random arm (there's still a chance this is the best arm)
      choice = np.random.randint(0, len(self.arm_qs))

    self.arm_ns[choice] += 1

    reward, optimal = self.problem.draw_from_arm(choice)

    if self.update_type == "incremental":
      self.update_estimate_incremental(choice, reward)
    elif self.update_type == "weighted":
      self.update_estimate_weighted(choice, reward)

    return reward, optimal

  def update_estimate_incremental(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (1/self.arm_ns[choice_index])*(reward - self.arm_qs[choice_index])

  def update_estimate_weighted(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (self.alpha*(reward - self.arm_qs[choice_index]))

Once again most of the logic is handled in the choose_action function. Notice how at each step we draw a random real number between 0.0 and 1.0, and exploit the best-estimated arm if that number is greater than ε (epsilon). But if the random number is less than ε, we explore and choose an arm at random. We then take note of the reward/payout received and update our estimate. There are two update types here, but the weighted one can be ignored for now; I’ll explain it later.
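Putting the agent and the problem together, a single episode loop might look something like the rough sketch below (illustrative parameter values, not the exact loop from the notebook):

STEPS = 10000

problem = BanditProblem(arms=10, seed=0, stationary=False)
agent = EpsilonGreedyAgent(epsilon=0.1, bandit_problem=problem, update_type="incremental")

rewards, optimal_choices = [], []
for step in range(STEPS):
  reward, optimal = agent.choose_action()  # pick an arm, observe the payout, update the estimate
  rewards.append(reward)
  optimal_choices.append(optimal)
  problem.step_arm_values()                # let the true arm values drift (non-stationary)

print(np.mean(rewards), np.mean(optimal_choices))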

Results: ε-First vs ε-Greedy

So how does ε-greedy fare against ε-first on a stationary problem? See the graphs below:

As expected, on our stationary 10-armed bandit problem the ε-first agent fares better than the ε-greedy agent. This is because the values of the bandit arms never change, so once the ε-first bandit has locked on to the best choice, it exploits that choice continually. In contrast, ε-greedy takes a while longer to find the optimal choice, and even when it does find it, it still explores the other sub-optimal options 10% of the time. The only upside to the ε-greedy approach here is that it starts gathering good rewards almost right away, whereas ε-first takes 1000 exploration steps before collecting high rewards.

But what about a non-stationary problem – when the values of the bandit arms change? Can you predict what will happen in this case? Which approach will fare better?

Aha! Now the tables have turned. While ε-first often finds, for a brief moment right after its 1000 exploration steps, a better choice than ε-greedy, that choice quickly becomes stale as the values of the arms change. ε-greedy fares much better overall, and continues to increase its average score over the course of the episode! However, ε-greedy seems to exhibit the same staleness problem (though not as badly) as ε-first: as the episode goes on it chooses the optimal action less and less. Can you think of a reason for this? Hint: think about the q-value update rule shown above (we discussed this in my last post too).

The reason is that the update rule puts n in the denominator of the fraction that acts as the step size, so the step size gets smaller and smaller as the episode goes on. Eventually the step size is so small that the q-value estimates barely change on each update, and their rank (the ordering of the bandit arms according to our estimates) almost never changes late in the episode. So, eventually, ε-greedy behaves almost exactly like ε-first and gets stuck pulling the same arm when exploiting (though it still pulls other arms randomly some percentage of the time). This explains why ε-greedy makes fewer and fewer optimal choices over time – its arm value estimates can’t keep up with the changes to each arm’s true value q_* the longer the episode goes on!
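To make this concrete: after an arm has been pulled n = 1000 times, the sample-average rule moves the estimate just 0.1% of the way towards the latest reward, whereas a constant step size of, say, \alpha = 0.1 (which we’ll introduce properly in the next section) always moves it 10% of the way:

\begin{aligned}
Q_{n+1} &= Q_n + \frac{1}{1000}[R_n - Q_n] \\
Q_{n+1} &= Q_n + 0.1[R_n - Q_n]
\end{aligned}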

Recency Weighted Q-Update:

To truly solve the stale-choice issue for non-stationary bandit problems we need to be a bit more intelligent with our q-value estimate update rule. We just saw how ε-greedy with a sample-average update rule has a problem: the step size gradually shrinks towards zero. The fix is simple: keep the step size constant! The reason why this works is subtle.

Theory warning! The following section will be a bit math heavy. But if you find maths a little dense (I can relate) then I also include a plain English description and accompanying code below. Hopefully that helps! 😊

So we want to change our update rule to use a fixed step size \alpha – see line (1) below. Line (1) is all we need to implement this in code; the rest of the lines are just there for understanding. The key to understanding why this behaves as a recency-weighted update (newer rewards are given more weight) is the recursive way \alpha is applied to the existing estimate Q_n. Remember that Q_n at any time step is the result of all the previous updates applied one after another. Lines (2) to (6) demonstrate this below, where we essentially unroll Q_n into a sum of contributions from all previous time steps.

\begin{align}
Q_{n+1} &= Q_n + \alpha[R_n - Q_n] \\
&=  \alpha R_n + (1- \alpha)Q_n \\
&= \alpha R_n + (1- \alpha)[\alpha R_{n-1} + (1- \alpha)Q_{n-1}] \\
&= \alpha R_n + (1 - \alpha) \alpha R_{n-1} + (1- \alpha)^2 Q_{n-1} \\
&= \alpha R_n + (1 - \alpha)\alpha R_{n-1} + (1 - \alpha)^2 \alpha R_{n-2} + \\
& \qquad \qquad \qquad \cdots + (1- \alpha)^{n-1}\alpha R_{1} + (1- \alpha)^nQ_1\\
&= (1-\alpha)^nQ_1 + \sum_{i=1}^{n} (1-\alpha)^{n-i}\alpha R_i
\end{align}

On line (7) we express this idea in a single line: Q_{n+1} is a weighted sum of all previous rewards, where each reward R_i is weighted by \alpha(1-\alpha)^{n-i}, and n-i is the number of steps ago that reward was received. So as we keep updating Q, the contribution of old rewards to our current estimate gets exponentially smaller and smaller.
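If you want to convince yourself that line (7) really is the same thing as applying line (1) over and over, a tiny numerical check does the trick (purely illustrative, not part of the notebook):

import numpy as np

alpha = 0.1
q_1 = 0.0                                 # initial estimate Q_1
rewards = np.random.normal(size=50)       # R_1 ... R_n, any values will do
n = len(rewards)

# recursive form: apply line (1) once per reward
q_recursive = q_1
for r in rewards:
  q_recursive = q_recursive + alpha * (r - q_recursive)

# closed form: line (7), each R_i weighted by alpha * (1 - alpha)^(n - i)
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
q_closed = (1 - alpha) ** n * q_1 + np.sum(weights * rewards)

print(np.isclose(q_recursive, q_closed))  # True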

It’s this property of our new recency-weighted average that allows our estimates to stay fresh as the ε-greedy agent continually explores and the true values of the bandit arms change over time. The update rule reacts much faster to changes in the values of the bandit arms and keeps doing so for as long as the episode continues. Admittedly, if the agent experiences an unlucky streak of exploration it may be temporarily misled into believing that the optimal action is not actually the best, but it will usually correct this mistake quite quickly once the unlucky streak breaks.

You can find the code for this update rule in the update_estimate_weighted function in the code snippet above.

Results: ε-First vs ε-Greedy vs Recency-Weighted ε-Greedy

Naturally, we wouldn’t expect the recency-weighted ε-greedy agent to be any better on stationary problems, and the results in the graphs below bear this out.

It’s clear that recency-weighted averages hurt performance in the stationary setting. Whereas ε-greedy eventually approaches choosing the best action 90% of the time (the remaining steps are exploration), the recency-weighted ε-greedy chooses the optimal action around 80% of the time and does not seem to be improving. In this situation the recency-weighted ε-greedy is limited by the \alpha value, which determines the ‘forgetfulness’ of the update rule, so \alpha could be tuned a bit to improve performance in the stationary setting. But where recency-weighted ε-greedy really shines is in the non-stationary setting:

Much better! Adding the recency-weighted update rule allows the ε-greedy agent to outperform both prior approaches on non-stationary problems. The recency-weighted update gradually ‘forgets’ about old updates so that it can quickly switch to the new optimal choice as the episode progresses and the values of the bandit arms gradually change.

Discussion and future work:

So that’s it, right? We’ve solved the exploration/exploitation problem? Uhm… actually no. In the last two posts we’ve seen some really simple (and quite effective) methods to balance exploration/exploitation, but there’s one more method I want to cover. This method addresses a problem with ε-greedy: when it explores it does so totally at random, but couldn’t there be a way to focus exploration on choices that seem most promising? This is the topic of our next post, where I’ll cover Upper Confidence Bound (UCB) action selection, and also some neat tricks to make ε-greedy exploration more effective too! 🤖

Multi-Armed Bandits 1: ε-first

In this post we’re going to discuss the age-old problem of making decisions under uncertainty. To illustrate what I mean, we’re going to dive right into multi-armed bandit problems and what they are exactly. You can follow along and run the code yourself using this google colab notebook. I’ll be updating the notebook to include code and experiments for future MAB posts. 🔄

Intro to Bandit Problems:

A bandit problem is a situation where we have a set of actions to choose from – some of those choices are better than others – but we don’t know which ones. The choices are totally independent of one another, meaning one choice does not affect the next, and we can choose from all of the options on each turn. For these problems we want to figure out what the best option is as quickly as possible and then continually exploit that option until we run out of turns.

The classic example is given using a set of slot machines that have different average payouts. We use the term “average payouts” here because there is some randomness (noise) to the payouts, meaning we get a slightly different payout each time. If payouts were certain, we would just need to sample from each machine once to figure out which is best. 🎰

So, ideally we want to figure out which is the best slot machine with the highest average payout as quickly as possible – so we can then focus on playing only on that best slot machine and win as much money as possible. In other words we don’t want to waste turns playing on slot machines where the average payout is lower than the best machine – but first we have to figure out which machine is best.

The example we’ll work with from here on is the 10-armed bandit problem – where we have 10 slot machines to choose from, each with different average payouts, and our job is to figure out which is the highest paying machine. We initialise the true value q_* of each slot machine (bandit arm) by sampling from a normal distribution with a mean of 0 and a standard deviation of 1. Code for the problem is supplied below 🙂

import numpy as np

class BanditProblem():
  def __init__(self, arms, seed, stationary=True):
    np.random.seed(seed)  # seed the random number generator so experiments are reproducible
    self.stationary = stationary
    # true value q_* of each arm, drawn from a standard normal distribution
    self.bandit_arms_values = np.random.normal(loc=0.0, scale=1.0, size=arms)
    self.optimal_action_index = np.argmax(self.bandit_arms_values)

  def draw_from_arm(self, arm_index):
    chose_optimal = 1 if arm_index == self.optimal_action_index else 0
    # payouts are noisy: centred on the arm's true value with unit variance
    reward = np.random.normal(loc=self.bandit_arms_values[arm_index], scale=1.0)

    return reward, chose_optimal

Fixed exploration period (ε-first):

Our first and most intuitive solution is to choose some arbitrary fixed period of exploration where we play the machines at random and build up a useful estimate of each machine’s average payout. This strategy is sometimes referred to as fixed exploration or ε-first (epsilon-first). With this approach we might aim to play each machine, say, 100 times, average the payouts, and use that as our estimate, then continually play the machine with the best estimated average payout from that point on. Equivalently, we could pick a machine at random 1000 times in total and base our estimates on the payouts received. Either way, our estimate for each machine is just the sample average of the payouts from our exploratory plays on that machine.

Theory warning! The following section will be a bit math heavy. But if you find maths a little dense (I can relate) then I also include a plain English description and accompanying code below. Hopefully that helps! 😊

The estimate for each bandit arm (slot) is called a q-value estimate, and the true value for each arm is referred to as q_* – this is the thing we’re trying to estimate. We can keep track of this q-value estimate for each arm using a simple update rule derived from the sample average formula:

\begin{align}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n}R_i \\
&= \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1}R_i \right) \\
&= \frac{1}{n} \left(  R_n +(n-1)\frac{1}{n-1} \sum_{i=1}^{n-1}R_i\right)  \\
&= \frac{1}{n} \left(  R_n + (n - 1)Q_n \right) \\
&= \frac{1}{n} \left( R_n + nQ_n - Q_n\right) \\
&= Q_n + \frac{1}{n}[R_n - Q_n] 
\end{align}

where R_i is the reward received on the i-th play of a given slot machine and n is the number of times that machine has been played. Line (3) follows because (n-1)\frac{1}{n-1} = 1, and multiplying the sum by 1 does not change its value. Line (4) follows because:

\begin{aligned}
Q_n &= \frac{R_1 + R_2 + \cdots +R_{n-1}}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n-1}R_i
\end{aligned}

In plain English, line (6) in the first equation above says that we can update the estimate by finding the difference between the received reward and the previous estimate, (R_n - Q_n) – let’s call this \delta. We then use this difference to move our old estimate Q_n a small step towards the most recent reward. This is done by weighting \delta by a step size, which here is one divided by the number of times this arm has been pulled. We can think of the update as taking the form:

NewEstimate \leftarrow OldEstimate + StepSize[Reward - OldEstimate]

Think of this as gradually reducing the step size the more times we pull a given arm, because we gradually become more certain of its true value. Notice in equation 1, line (6), that when the denominator n (the number of arm pulls) increases, the step size decreases.
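And here’s a quick sanity check in code (illustrative only) that the incremental rule on line (6) gives exactly the same answer as taking a plain sample average of the rewards:

import numpy as np

rewards = np.random.normal(loc=1.5, scale=1.0, size=100)  # payouts from a single arm

q = 0.0
for n, r in enumerate(rewards, start=1):
  q = q + (1 / n) * (r - q)           # line (6): Q_{n+1} = Q_n + (1/n)[R_n - Q_n]

print(np.isclose(q, rewards.mean()))  # True – identical to the sample average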

ε-first code:

Okay, enough theory for now, let’s try putting what we’ve learned into code!

class FixedGreedyAgent():
  def __init__(self, exploration_steps, bandit_problem):
    self.problem = bandit_problem
    self.current_step = 0
    self.exploration_steps = exploration_steps

    self.arm_qs = np.zeros(len(bandit_problem.bandit_arms_values))
    self.arm_ns = np.zeros(len(bandit_problem.bandit_arms_values))


  def choose_action(self):
    if self.current_step > self.exploration_steps:
      # exploit: greedily pull the arm with the highest estimated value
      choice = np.argmax(self.arm_qs)
    else:
      # explore: pull a random arm (there's still a chance this is the best arm)
      choice = np.random.randint(0, len(self.arm_qs))

    reward, optimal = self.problem.draw_from_arm(choice)

    self.arm_ns[choice] += 1

    # update the estimate on every step, so the exploration phase actually
    # builds up the sample-average estimates we exploit later
    self.update_estimate_incremental(choice, reward)

    self.current_step = self.current_step + 1

    return reward, optimal

  def update_estimate_incremental(self, choice_index, reward):
    self.arm_qs[choice_index] = self.arm_qs[choice_index] + (1/self.arm_ns[choice_index])*(reward - self.arm_qs[choice_index])

You can see line (6) from equation 1 implemented in the update_estimate_incremental function above. Most of the logic lives in the choose_action function, where our agent explores randomly for a fixed number of exploration_steps and then exploits the best estimated “arm” (slot machine in our example) from then on. The FixedGreedyAgent keeps track of the current q-value estimate and the number of pulls for each arm, as well as a parameter for the number of exploration steps to take before exploiting.

All of the code for this blog is available in this google colab notebook so you can run it yourself and play around with the parameters 🙂

I chose to run the experiment for 50000 steps per episode and repeated this 2000 times to get an average score for each step. Our fixed-exploration greedy agent explores for 1000 steps before exploiting. I recommend using much lower values for steps and episodes when running on google colab or the experiment will take a long time to complete, but feel free to play around with the exploration steps parameter. We can see the results for the average reward per step for the fixed-exploration bandit below:
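If you want to reproduce this experiment yourself, the loop has roughly the following shape (a simplified sketch with smaller numbers, not the exact notebook code):

EPISODES = 200           # the real run used 2000
STEPS = 5000             # the real run used 50000
EXPLORATION_STEPS = 1000

avg_rewards = np.zeros(STEPS)
for episode in range(EPISODES):
  problem = BanditProblem(arms=10, seed=episode)
  agent = FixedGreedyAgent(exploration_steps=EXPLORATION_STEPS, bandit_problem=problem)
  for step in range(STEPS):
    reward, optimal = agent.choose_action()
    avg_rewards[step] += reward

avg_rewards /= EPISODES   # average reward at each step, across all episodes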

Discussion and future work:

Okay, so it looks like our fixed exploration strategy worked pretty well, right? Well yes, but there are some problems with this strategy:

  1. We have to spend X number of turns exploring randomly before we can exploit and get a decent score. This could be very costly in some scenarios, for example if it costs $1 to play each slot machine, and our average return for the first 1000 steps is $0, then it is going to cost us $1000 out of pocket just to explore, before we see any returns! So we may need deep pockets for this exploration strategy.
  2. Also the more arms we have, the longer we have to spend randomly exploring before we can be confident we have found the optimal one and exploit it. If we had 150 arms or more then 1000 exploration steps may not be adequate. If there are thousands of bandit arms, then this exploration strategy may not be viable at all.
  3. If we don’t confidently find the optimal choice (slot machine) within those first X exploration steps, then we will lock on to a suboptimal choice for the rest of the episode! We could be stuck making bad decisions forever.
  4. If the value of the bandit arms (slots) changes over the course of an episode, our agent will never update its belief about the best arm to choose. Its fixed decision could get worse and worse as the episode progresses.

This plot makes the problems visible. During the first 1000 steps we pick the optimal choice roughly 10% of the time (makes sense; 10 choices picked at random), and after the exploration phase is done the fixed exploration bandit manages to ‘find’ the best option ~95% of the time. But, once the agent is done exploring it never changes its mind about the best option. It also isn’t guaranteed to have found the best slot machine in the first place, especially if the number of arms is considerably more than 10.

This begs the question: can we explore more effectively? And that is a very, very good question. Actually, that question is so good that people are still trying to figure it out! In the next post we’ll learn about some more simple and effective ways to balance exploration and exploitation, even for non-stationary problems, using an ε-greedy approach (and others), and different q-value estimate update rules 🤓


Calculus Made Easy: Differentiation

In the previous post we looked at limits, this time we’ll take a look at derivatives in calculus (which utilises limits). If you aren’t confident with limits yet, don’t worry, you can check out my last post to get up to speed super quick. Derivatives allow us to calculate the slope (also known as the gradient) at any point of a function, even if the function is not a straight line! The process is called differentiation, and it’s super useful in real life because most real-world functions are not straight lines, and we often need to know how fast a function is changing (the slope/gradient) 🙂

Finding the slope – straight lines

Recall from algebra (or my last post) that the slope of a straight line is just the rise over run, like in the first graph below. These straight lines usually take the slope-intercept form of y = mx + b, where m is the slope and b is the y-intercept (where the line crosses the y-axis).

Finding the slope for this line is easy: for every 1 unit across the x-axis, the y-axis increases by 2. So the slope is \frac{rise}{run} = \frac{2}{1} = 2. This is all well and good, but what do we do if the function is not a straight line? We could resort to the method proposed in my last post, but that works only for finding the slope at a given point. What we want instead is to be able to find the slope at every point of the function, by deriving some new function which gives us the slope of the original function for any value of x. This is where derivatives come in.

Finding the slope – curves

Remember a slope is just a rate of change, and that is exactly what a derivative gives us: a rate of change of a function f(x) at some point x. Check the graph below for an example.

This is the graph of the function y = x^{2}, which we can see is constantly changing (so it isn’t a straight line). Our approach to this problem uses limits: to see which value the gradient approaches as we make some ‘infinitely small’ change to the input of the function. You can visualise this as us zooming waaaay in on some part of the curve until it looks completely straight, and then drawing a straight tangent line at that point of the graph, which we then use to calculate the slope using our classic rise over run approach. Look at the graph below, I did the visualising for you 🙂

We zoomed in pretty far to the curve (red line) above at the point (1, 1), and the red line almost looks like it’s completely straight, just like the tangent line we drew (blue line). Imagine if we zoomed in as far as we possibly could, those two lines would look the same to us. This is where limits come in, and our familiar rise over run.

First, recall that the rise over run is just the difference in y values of the function when we plug in two different x values. Also recall that the slope/gradient we get out of rise over run is the average gradient between these two x points of the function. The maths below might look scary at first, but stick with me and I’ll explain everything:

\begin{aligned}
Slope &= \frac{rise}{run} \\
&= \frac{y_2 - y_1}{x_2 - x_1} \\
&= \frac{f(x+h) - f(x)}{h}
\end{aligned}

All we are saying here is that the slope or gradient is the difference of the y values divided by some small change in x (represented by h on the last line), when we plug two different x values in. Think of this as taking some small step h along the x-axis, and recording the change in the height of the curve. The slope is the change in y divided by the small step h. So the last line above is the same thing as the second line, just written slightly differently. Just remember they are the same.

Now remember that by using limits we can shrink this step h along the axis smaller and smaller and see what happens to the output of the function:

\begin{aligned}
f'(x) &= \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}
\end{aligned}

The f'(x) part is read: “f prime”, and it’s just some notation to show that this is not the original function f(x) but a function derived from f(x). So f'(x) just means “the derivative of f(x)“. There’s actually an important point here so let’s make it clearer: the above formula is the general definition of the derivative.

Think for a moment about what will happen in the above formula as h gets closer to 0. Let’s work through an example where we use our curve f(x) = x^2 and plug in x = 2 to see what happens. Stick with me, I’ll explain it all in a moment:

\begin{aligned}
f'(2) &= \lim_{h \to 0} \frac{f(2+h)-f(2)}{h} \\
&= \lim_{h \to 0} \frac{(2+h)^2-2^2}{h} \\
&= \lim_{h \to 0} \frac{(4 + 4h + h^2) - 4}{h} \\
&= \lim_{h \to 0} \frac{4h + h^2}{h} \\
&= \lim_{h \to 0} (4+h) \\
&= 4 + 0 = 4
\end{aligned}

Okay, you might be wondering: “what sort of witchcraft is this?!”. Before you grab your torch and pitchfork, allow me to explain – it’s not witchcraft at all and I guarantee you can do it too! 🙂

First we just substituted our x value of 2 into the formula for the slope, and we added the limit part to show that our little step h along the x-axis gets closer and closer to 0 (smaller and smaller). Then we simplified a bit until we ended up with \lim_{h \to 0} (4+h) .
Now this last step is the important part. Since the limit tells us that h approaches 0, we can now just substitute h = 0 to get our final answer. It’s like we’re imagining what happens if we took a step with exactly 0 distance along the x-axis.

So, after all this, the gradient of the curve f(x) = x^2 at x = 2 is 4. Try the same process for yourself using a different value of x and see what you find.
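You can also sanity-check this numerically: plug smaller and smaller values of h into the difference quotient and watch the result close in on 4 (a quick illustrative snippet in Python):

def f(x):
  return x ** 2

x = 2.0
for h in [0.1, 0.01, 0.001, 0.0001]:
  slope = (f(x + h) - f(x)) / h  # rise over run for a small step h
  print(h, slope)                # the slopes get closer and closer to 4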

We can follow this approach to find the derivative of any function, by replacing f(x) with the function we want to differentiate in the general formula given above. Let’s try using x^2 again, but without substituting any value in for f(x) this time:

\begin{aligned}
f(x) &= x^2 \\
f'(x) &= \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} \\
&= \lim_{h \to 0} \frac{(x^2 + hx + hx + h^2) - x^2}{h} \\
&= \lim_{h \to 0} \frac{2hx + h^2}{h} \\
&= \lim_{h \to 0} (2x + h) \\
&= 2x + 0 \\
&= 2x
\end{aligned}

Cool, so the derivative of f(x) = x^2 is: f'(x) = 2x. Simple! 🙂 We can even plot this alongside our original function to see what it looks like (black line below):

Some things to notice about this particular derivative function:

  • For negative values of x (where the original function f(x) = x^2 is heading down towards zero), the gradient given by the derivative function is negative, and it approaches 0 as x approaches 0. You can think of this as meaning that the slope of the curve is negative, i.e. the curve is going down.
  • Then, as x passes 0 and the function begins to increase, the gradient becomes positive (because the curve is now going up).
  • The derivative function f'(x) = 2x is linear, because the gradient of the original function is increasing at a constant rate as we move from left to right on the x-axis. In other words, the original function is getting steeper and steeper.

Now I have some good news for you: this is the complicated way of doing differentiation. If you found things difficult to understand so far, you can breathe a sigh of relief knowing that it all gets easier from here. I explained it this way so you could understand what is actually happening “under the hood” when we differentiate, but there are some shortcut rules you can learn to avoid using limits all the time.

Shortcut rules for differentiation:

Here are eight great shortcut rules you can use to differentiate most functions. You should really memorise these because they’ll save you so much time in the future. For each of the examples below I recommend you plot the original function and its derivative using this handy tool. It will help you understand what the derivative of each function looks like.

1. Constant rule

This one is easy. When differentiating, any constants in the function simply become 0, so:

\begin{aligned}
&f(x) = c \\
&f'(x) = 0
\end{aligned}

(Be careful, this doesn’t mean coefficients become 0, but we’ll get to that in a second).

2. Power rule

This differentiation rule is applicable if the input variable is raised to some power/exponent. For example: f(x) = x^3. For this rule we move the exponent n down to be a multiple of the input variable, then we replace the exponent with n-1. The rule looks like this:

\begin{aligned}
f(x) &= x^n \\
f'(x) &= nx^{n-1}\\
\end{aligned}

Here’s an example:

\begin{aligned}
&f(x) = x^3 \\
&f'(x) = 3x^{3-1} = 3x^2
\end{aligned}

When the exponent is a fraction (remember: fractional exponents are another way of writing roots), some more algebraic manipulation might be required to simplify the resulting derivative:

\begin{aligned}
&f(x) = x^{1/3} = \sqrt[3]{x} \\
&f'(x) = \frac{1}{3}x^{\frac{1}{3} -1 }= \frac{1}{3}x^{-\frac{2}{3}} = \frac{1}{3x^{\frac{2}{3}}}
\end{aligned}

Also be careful if the input variable is not raised to any power (it is understood that it is raised to the power of 1): f(x) = x. In these cases the rule still applies, but the outcome is simple: replace the whole variable with 1. Here’s why (remember: anything raised to the power of 0 is one):

\begin{aligned}
&f(x) = x = x^1 \\
&f'(x) = 1x^{1-1} = 1x^0 = 1\times1 = 1
\end{aligned}

The power rule is probably the most common rule you will need to use when differentiating, so learn it well! 🧠

3. Constant multiple rule

Here’s another easy one once you know the power rule: if the variable is raised to a power but also has a coefficient, then we multiply the coefficient by that power and continue with the power rule. Like this:

\begin{aligned}
&f(x) = 2x^3 \\
&f'(x) = 3\times2x^{3-1} = 6x^2
\end{aligned}

This one gets a little bit more complicated when fractions are involved, but there’s no need to worry. Just take a deep breath, apply the rules you already know and a bit of algebra, work through it step by step, and you’ll be fine:

\begin{align}
f(x) & = \frac{2}{5}x^{1/3} \\
\space f'(x) &= \frac{2}{5}\times \frac{1}{3}x^{\frac{1}{3} -1 }= \frac{2\times\frac{1}{3}x^{-\frac{2}{3}}}{5} \\
&= \frac{2}{5}\times\frac{1}{3}\times \frac{1}{x^{\frac{2}{3}}} \\
&= \frac{2 \times1 \times 1}{5 \times 3 \times x^{\frac{2}{3}}} \\
&= \frac{2}{15x^{\frac{2}{3}}}
\end{align}

In step (2) we applied the power rule. In step (3) we realised that anything to a negative power can be expressed as a fraction instead (and we switch the sign of the exponent too), e.g.: x^{-2} = \frac{1}{x^2}. In step (3) I also expressed each fractional term separately so you can see how it will all fit together more easily. After that, steps (4) and (5) should look pretty self explanatory to you 🙂

4. Sum rule

You know I said those other rules were easy? Well this one is super easy. For any function containing a sum of different terms involving the input variable, just treat all of them separately! So if you’re differentiating a function like this: f(x) = x^3 + x^2 + x + 7 you can use the rules we just learned for each term individually like this:

\begin{aligned}
f(x) & =  x^3 + x^2 + x + 7 \\
f'(x) & = 3x^{3-1} + 2x^{2-1} + 1x^{1-1} + 0 \\
& = 3x^{2} + 2x + 1 \\
\end{aligned}

5. Difference rule

This is exactly the same as the sum rule, but in this case instead of addition there is subtraction in the function. Just do exactly the same thing you did with the sum rule and treat each term separately 🙂
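For example, applying the power rule to each term and keeping the minus sign between them:

\begin{aligned}
f(x) &= x^3 - x^2 \\
f'(x) &= 3x^2 - 2x
\end{aligned}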

6. Product rule

Use this rule if you’re differentiating a function that contains the product of two functions with the input variable in them, like this example: f(x) = x^3 \times 2x^4. Here’s a neat way of remembering what to do in these situations (remember: the prime symbol ‘ indicates the derivative):

\begin{aligned}
y & = this \times that \\
y' & = this' \times that + this \times that' \\
\end{aligned}

Let’s make that more concrete with an example:

\begin{aligned}
f(x) &= x^3 \times 2x^4 \\
f'(x) &= (3x^2 \times2x^4) + (x^3 \times 8x^3)
\end{aligned}
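If you tidy that up with a little algebra, both terms collapse into one:

\begin{aligned}
f'(x) &= 6x^6 + 8x^6 = 14x^6
\end{aligned}

which is exactly what you’d get by first simplifying f(x) = x^3 \times 2x^4 to 2x^7 and then using the constant multiple and power rules.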

7. Quotient rule

Use this rule when the function contains a quotient (fraction) with a function of the input variable in the numerator and the denominator, something like this: f(x) = \frac{x^4}{3x^5}. For this rule we apply the following:

\begin{aligned}
y &= \frac{top}{bottom} \\
y' &= \frac{(top' \times bottom) - (top \times bottom') }{bottom^2} \\
\end{aligned}

Make sure you pay close attention to the prime ‘ symbols above, the subtraction, and the extra power of 2 in the denominator. Let’s try an example to see how this works:

\begin{aligned}
f(x) &= \frac{x^4}{3x^5} \\
f'(x) &= \frac{(4x^3 \times 3x^5) - (x^4 \times 15x^4)}{(3x^5)^2} \\
&= \frac{(4x^3 \times 3x^5) - (x^4 \times 15x^4)}{9x^{10}}
\end{aligned}

The trick with this rule is making sure you remember the correct order in the numerator and the extra power of 2 in the denominator.

8. Chain rule

This is probably the hardest rule for differentiating functions. You use it when there is a function inside another function. The name given to functions-within-a-function is a composite function. Here’s an example of what I mean: f(x) = \sqrt{x^4}. This is because the square root is a function, and the power counts as a function too. Here’s a handy rule for differentiating using the chain rule:

\begin{aligned}
y &= outer(inner) \\
y' &= outer(inner)' \times inner' \\
\end{aligned}

It’s really important to remember that when you differentiate to get outer(inner)', you don’t differentiate the part inside the function at the same time. Let me show you what I mean:

\begin{aligned}
f(x) &= \sqrt{x^4} = (x^4)^{\frac{1}{2}}\\
f'(x) &= \frac{1}{2}(x^4)^{-\frac{1}{2}} \times 4x^3 \\
&=\frac{1}{2} \times \frac{1}{(x^4)^{\frac{1}{2}}} \times 4x^3 \\
&=\frac{1}{2} \times \frac{1}{x^2} \times 4x^3 \\
&=\frac{4x^3}{2x^2} \\
&= 2x
\end{aligned}

Notice in the first differentiation step how I differentiated \sqrt{something}: it doesn’t matter what this something is when differentiating the outer function. We don’t differentiate that inner something at the same time as the outer part; we differentiate it separately and then multiply the two together. I also did some simplifying after applying the chain rule, just so the final result was a bit nicer to read.

That’s really all there is to it. Depending on the situation you might have to do some different algebraic wizardry to simplify the result after using the chain rule, but just remembering the first part y' = outer(inner)' \times inner' is enough, and then you can rely on your algebra skills to carry you home 🙂

The chain rule is especially cool because deep neural networks use it extensively in the backpropagation algorithm – this lets the neural net figure out which connections it needs to adjust in order to do better at some prediction task. This works because deep neural networks are really just big composite functions with many layers of functions-within-a-function. Like an onion of functions 🧅. I’ll cover this in more detail at a later date.

Anyway, if all else fails when you’re differentiating, and you can’t seem to differentiate some function or understand why, you can use this simple online tool to help you. If you do use that tool, I really recommend you study the result and figure out where you were making a mistake. That’s it – now you know most of differentiation.
