Easy Reinforcement Learning - The Multi Armed Bandit

This is another draft that’s been sitting on my laptop and I was sitting on the Eurostar on the way to TradeTech and thought I’d try and formalise it into a blog post. This is all about reinforcement learning and a basic model that can be easily implemented in Julia. This post is me walking through and implementing the 2nd chapter of Reinforcement Learning: An Introduction.

Enjoy these types of posts? Then you should sign up for my newsletter.

Reinforcement learning is a pillar of machine learning and it combines the use of data and learning how to make a better decision automatically. One of the basic models in reinforcement learning is the multi-armed bandit. A bit of an anachronistic name, but the single-armed bandit refers to a casino game where you pull the lever (or push a button), some cassettes roll round and you might win a prize.

The multi-armed bandit is an extension to this type of game and means we have different levers we can pull that lead to a different reward. The reward depends on the lever pulled.

This simple mental model is surprisingly applicable to lots of different problems and it can act as a good approximation to whatever you are trying to solve. For example, let’s use an advertising example. You have multiple adverts that you display to try and get people to click through to your website. Each time a page loads you can load one advertisement, you then record how many people click on that advert and use that to decide which advert to show next. With each page load you decide, do I show the most succesful advert so far or try a new advert to see how that performs? Over time you will find out which advert performs the best and show that as much as possible to get as many clicks.

A Simple Bandit

Imagine we have a multi-armed bandit machine, where we pull a lever and get a reward. The reward depends on the lever pulled, how do we learn what the best lever is?

First let’s build our bandit. We will have 5 levers and the reward will be a sample from a normal distribution where each lever will have a random mean and standard deviation.

using Plots, StatsPlots
using Distributions

nLevers = 5

rewardMeans = rand(Normal(0, 3), nLevers)
rewardSD = rand(Gamma(2, 2), nLevers)

hcat(rewardMeans, rewardSD)

5×2 Matrix{Float64}:
 -4.7724   5.88533
 -4.60967  0.627556
 -5.96987  1.14465
  8.96919  3.80253
  2.11311  4.84983

These are the parameters of our levers in our bandit, so lets look at the distribution of the rewards.

density(rand(Normal(rewardMeans[1], rewardSD[1]), 1000), label = "Lever 1")

for i in 2:nLevers
    density!(rand(Normal(rewardMeans[i], rewardSD[i]), 1000), label = "Lever " * string(i))
end
plot!()

So our levers giving us a sample from a normal distribution is illustrated above. The 4th lever looks like the best as it has the most likely chance of getting a positive value and has the wider tail too. As we are talking about rewards, large positive values are better.

So given we have a process of pulling a lever and getting a reward, how do we learn what the best lever is and importantly as quickly as possible?

Like all good statistics problems, we start with the most basic model and start pulling levers randomly.

The Random Strategy

Just pull a random lever every time. Nothing is being learned here though and we are just demonstrating how the problem setup works. With each play we generate a random integer that corresponds to the lever, pull the lever (draw a random normal variable with mean/deviation of that lever), record what lever was pulled and the reward amount. Then repeat several times.

function random_learner(rewardMeans, rewardSD, nPlays)

    nLevers = length(rewardMeans)
    
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)

    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    
    optimalChoice = Array{Bool}(undef, nPlays)

    bestLever = findmax(rewardMeans)[2]
    
    for i = 1:nPlays
    
        selectedLever[i] = rand(1:nLevers)
        
        optimalChoice[i] = selectedLever[i] == bestLever
        
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
    
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]
    
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end

We run this learner for 1,000 steps and look at the number of times each lever is pulled.

randomStrat = random_learner(rewardMeans, rewardSD, 1000);

histogram(randomStrat[1], label = "Number of Time Lever Pulled")

Each of the levers is pulled a roughly equal amount of times, with no learning, just randomly pulling. Moving on, how do we learn?

Action Value Methods

Reinforcement learning is about balancing the explore/exploit set-up of the problem. We need to sample each of the levers and work out what kind of rewards they provide and then use that information to inform our next decision.

For each iteration, we randomly decide if we will pull any lever or do we use the old information to choose our best guess at the best lever. Our information in this case is the rolling average of the reward each time we pulled the lever. This is called a greedy learner. It’s just doing its best with what it knows and has no real ability to decide whether to explore a new lever.

The probability of choosing a random lever is called the learning rate ($\eta$) and controls how often we make the perceived optimal choice. A high value of $\eta$ means lots of exploring (learning) and a low value restricts the learning and means we pull the (perceived) best lever each time. So if we had many levers and a low learning rate it is possible that we never find the globally optimal lever and instead just stick to the locally optimal lever, hence why it is called a greedy learner, it can get stuck.

function greedy_learner(rewardMeans, rewardSD, nPlays, eta)

    nLevers = length(rewardMeans)
    
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)

    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    
    optimalChoice = Array{Bool}(undef, nPlays)
    
    bestLever = findmax(rewardMeans)[2]

    for i = 1:nPlays

        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else 
            q = cumRewards ./ cumSelection
            q[isnan.(q)] .= 0
            selectedLever[i] = findmax(q)[2]
        end
        
        optimalChoice[i] = selectedLever[i] == bestLever
        
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))

        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]

    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end

Again, we can run it for 1,000 steps and we set our learning rate to 0.5.

greedyStrat = greedy_learner(rewardMeans, rewardSD, 1000, 0.5)

histogram(greedyStrat[1], label = "Number of Time Lever Pulled", legend = :topleft)

This has done what we thought, it has selected the 4th lever that we thought looked the best from the distribution. So we’ve learned something, hooray!

Varying in the Learning Rate

The $\eta$ parameter was set to 0.5 above, but how does varying change the outcome? To explore this we will do multiple runs of multiple plays of the game and also increase the number of levers. For each run, we will generate a new set of reward averages/standard deviations and run the random learner and the greedy learner with different $\eta$.

nRuns = 2000
nPlays = 1000
nLevers = 10

optimalLevel = zeros(nRuns)

randomRes = Array{Tuple}(undef, nRuns)
greedyRes = Array{Tuple}(undef, nRuns)
greedyRes05 = Array{Tuple}(undef, nRuns)
greedyRes01 = Array{Tuple}(undef, nRuns)
greedyRes001 = Array{Tuple}(undef, nRuns)
greedyRes0001 = Array{Tuple}(undef, nRuns)


for i=1:nRuns
    rewardMeans = rand(Normal(0, 1), nLevers)
    rewardSD = ones(nLevers)
   
    randomRes[i] = random_learner(rewardMeans, rewardSD, nPlays)
    greedyRes[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0)
    greedyRes05[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.5)
    greedyRes01[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
    greedyRes001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.01)
    greedyRes0001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.001)
    
    optimalLevel[i] = findmax(rewardMeans)[2]
    
end

For each of the runs we have the evolution of the reward, so we want to take the average of the reward on each time step and see how that evolves with each play of the game.

randomAvg = mapreduce(x-> x[2], +, randomRes) ./ nRuns
greedyAvg = mapreduce(x-> x[2], +, greedyRes) ./ nRuns
greedyAvg01 = mapreduce(x-> x[2], +, greedyRes01) ./ nRuns
greedyAvg09 = mapreduce(x-> x[2], +, greedyRes05) ./ nRuns
greedyAvg001 = mapreduce(x-> x[2], +, greedyRes001) ./ nRuns;
greedyAvg0001 = mapreduce(x-> x[2], +, greedyRes0001) ./ nRuns;

And plotting the average reward over time.

plot(1:nPlays, randomAvg, label="Random", legend = :bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg, label="0")
plot!(1:nPlays, greedyAvg05, label="0.5")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")

Good to see that all the greedy learners outperform the random learner, so that algorithm is doing something. If we focus on the gready learners we see how the learning rates changes performances.

plot(1:nPlays, greedyAvg, label="0", legend=:bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")

This is an interesting result! When $\eta = 0$ we see that it never reaches as high as the other learning rates. So when $\eta = 0$ we never explore the other options, we just select what we think is the best one from history and never stray away from our beliefs. This ultimately hurts us because if we don’t get the best level on the first try then we are stuck in a suboptimal. Likewise, when the learning rate is very low, it doesn’t get much better, so this shows there is always value in exploring the options.

Philosophically, this shows that with any procedure you need to iterate through different configurations and explore the outcomes rather than sticking with what you believe is optimal.

scatter([0, 0.5, 0.1,0.01, 0.001], 
    map(x-> mean(x[750:1000]), [greedyAvg, greedyAvg05, greedyAvg01, greedyAvg001, greedyAvg0001]),
    xlabel="Learning Rate",
    ylabel = "Converged Reward", legend=:none)

The learning rate looks like it is optimal around 0.1. You can do a grid search to see how the overall behaviour changes in terms of both the speed of convergence to the final state and how good that final reward state is.

Speed it Up - Incremental Implementation

We can improve the above implementation by just saving memory and CPU cycles by doing ‘online learning’ of the rewards and using that to drive the selection. We create one matrix $$Q$, update it with the average reward of each lever and use the maximum of each iteration to select our lever if we are not exploring.

function greedy_learner_incremental(rewardMeans, rewardSD, nPlays, eta)

    nLevers = length(rewardMeans)
    
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)

    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)

    Q = zeros((nPlays+1, nLevers))
    rewardsArray = zeros(nLevers)
    
    optimalChoice = Array{Bool}(undef, nPlays)
    
    bestLever = findmax(rewardMeans)[2]
    
    for i = 1:nPlays

        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else 
            selectedLever[i] = findmax(Q[i,:])[2]
        end
        
        optimalChoice[i] = selectedLever[i] == bestLever
        
        reward = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
        rewards[i] = reward
        rewardsArray[selectedLever[i]] = reward
        
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += reward

        Q[i+1, :] = Q[i, :] + (1/i) * (rewardsArray - Q[i,:])
        
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end

Using the normal Julia benchmarking tools we can get a good idea if this rewrite has changed anything materially.

using BenchmarkTools

oldImp = @benchmark greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
newImp = @benchmark greedy_learner_incremental(rewardMeans, rewardSD, nPlays, 0.1)

judge(median(oldImp), median(newImp))

BenchmarkTools.TrialJudgement: 
  time:   -43.91% => improvement (5.00% tolerance)
  memory: -70.15% => improvement (1.00% tolerance)

It’s 50% faster and uses 70% less memory, so a good optimisation.

Conclusion

This is the basic intro to reinforcement learning but a good foundation for how to think about these problems. The main step is going from data to decisions and how to update the decisions you make each time. You need to make sure you explore the problem space as otherwise you never know how much better some other options might be.