Multi-armed Bandit Algorithms

Multi-armed Bandit algorithms are another approach to revealing the performance difference between two algorithms when running campaigns. The article Why Multi-armed Bandit Algorithm is Not “Better” than A/B Testing suggests that the two are not directly comparable: bandits should primarily be used for continuous optimization, whilst A/B testing should be used to draw conclusions about statistical significance with less traffic. This is because incoming traffic is split up equally in an A/B test, whereas a bandit diverts traffic to the currently best-performing variant about 90% of the time (e.g. with epsilon-greedy and epsilon = 0.1).

A handy book on running bandit algorithms for web optimisation is “Bandit Algorithms for Website Optimization”, published by O’Reilly Media.

Drawbacks of the epsilon-greedy algorithm:
Say epsilon = 0.1. At the beginning the algorithm does not explore much: with probability 0.9 it exploits whichever of the barely-explored arms currently looks best. In the later stage it still spends 10% of the traffic on inferior arms, even though it already knows they have not performed well.
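As a minimal sketch (not the book's exact code), an epsilon-greedy agent could look like this; the `counts`/`values` bookkeeping below is an assumed layout:

```python
import random

class EpsilonGreedy:
    """Minimal epsilon-greedy bandit: explore with probability epsilon,
    otherwise exploit the arm with the highest estimated value."""

    def __init__(self, epsilon, n_arms):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))  # explore
        return max(range(len(self.values)), key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental update of the running mean reward
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```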

How do we measure the performance of a particular bandit algorithm itself?
Probability of selecting the best arm: a high epsilon keeps exploring, so this probability peaks quickly, but the peak is lower than with a low epsilon setting, say 0.1.

Average Reward

Cumulative reward: this gives the whole picture, whereas the two metrics above measure performance at a particular point in time.
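A small Monte Carlo harness can produce all three curves; the `simulate` function and Bernoulli-arm setup below are illustrative assumptions, and the algorithm object is expected to expose `select_arm()`/`update()` like the epsilon-greedy sketch above:

```python
import random

def simulate(make_algo, probs, horizon=250, n_sims=1000):
    """Run a bandit algorithm against Bernoulli arms and report, per step,
    the probability of picking the best arm, the average reward, and the
    average cumulative reward."""
    best_arm = max(range(len(probs)), key=lambda a: probs[a])
    best_rate = [0.0] * horizon
    avg_reward = [0.0] * horizon
    cum_reward = [0.0] * horizon

    for _ in range(n_sims):
        algo = make_algo(len(probs))
        total = 0.0
        for t in range(horizon):
            arm = algo.select_arm()
            reward = 1.0 if random.random() < probs[arm] else 0.0
            algo.update(arm, reward)
            total += reward
            best_rate[t] += (arm == best_arm) / n_sims
            avg_reward[t] += reward / n_sims
            cum_reward[t] += total / n_sims
    return best_rate, avg_reward, cum_reward

# e.g. simulate(lambda k: EpsilonGreedy(0.1, k), [0.1, 0.1, 0.9])
```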

In conclusion, the performance of a bandit algorithm depends on the horizon, namely how many steps there are before the evaluation finishes.

Softmax Algorithm
How much worse is the worse arm compared with the better one? 10 vs 13 is a different situation than 10 vs 99. Therefore, instead of choosing uniformly at random when exploring, we choose arm a with probability exp(Score_a) / (exp(Score_a) + exp(Score_b)). This easily outperforms the epsilon-greedy bandit.
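Generalized to any number of arms and written with an explicit temperature `tau` (a standard addition; with tau = 1 it reduces to the formula above), the selection rule might be sketched as:

```python
import math
import random

def softmax_select(values, tau=1.0):
    """Pick an arm with probability proportional to exp(value / tau).
    Lower tau concentrates the choice on the current best arm."""
    z = [math.exp(v / tau) for v in values]
    total = sum(z)
    probs = [x / total for x in z]
    r = random.random()
    cum = 0.0
    for arm, p in enumerate(probs):
        cum += p
        if r <= cum:
            return arm
    return len(probs) - 1
```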

It’s better to explore less in the later stage – Annealing Softmax.
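One common annealing schedule, roughly what the O’Reilly book uses, shrinks the temperature as the total number of pulls grows, so later selections explore less:

```python
import math

def annealed_tau(total_counts):
    """Temperature decreases with the total number of pulls so far."""
    t = total_counts + 1
    return 1.0 / math.log(t + 1e-7)
```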

UCB
Unlike epsilon-greedy and softmax, which are easily misled by noise, UCB is more robust. It looks at how much we know about each arm rather than simply how much reward we have received so far. Mind you, a good arm may initially perform badly by chance.

UCB1 metric: the arm's estimated value plus an exploration bonus of sqrt((2 * log(total_counts)) / counts[arm]).
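A minimal UCB1 selection sketch, reusing the `counts`/`values` layout assumed above:

```python
import math

def ucb1_select(counts, values):
    """UCB1: play each arm once, then pick the arm maximizing
    its estimated value plus an uncertainty bonus."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # ensure every arm is tried at least once
    total_counts = sum(counts)
    ucb = [
        values[arm] + math.sqrt((2 * math.log(total_counts)) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(ucb)), key=lambda a: ucb[a])
```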

The UCB family is very practical in the real world, as it has no free parameters to tune.

Tips for Real-world Deployment
Run an A/A test to make sure the testing method and metrics are correct, in that the same algorithm should produce the same result.

Multiple concurrent tests may add certainty to the results.

A/B tests make sense for short, periodic testing, while bandit algorithms are mostly useful for continuous experiments.

Selected Advanced Topics
Contextual Bandits
Update each arm's value using conventional ML models, for example GLMs or logistic regression, to learn from and adapt to the context.
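As an illustrative sketch (not a particular library's API, and all names here are assumptions), one simple design keeps one online logistic-regression model per arm and predicts the reward probability from the context vector:

```python
import math
import random

class PerArmLogistic:
    """Contextual bandit sketch: one online logistic-regression model per arm
    predicts the reward probability from the context; choose greedily with a
    little epsilon exploration."""

    def __init__(self, n_arms, n_features, lr=0.1, epsilon=0.05):
        self.weights = [[0.0] * n_features for _ in range(n_arms)]
        self.lr = lr
        self.epsilon = epsilon

    def _predict(self, arm, x):
        s = sum(w * xi for w, xi in zip(self.weights[arm], x))
        return 1.0 / (1.0 + math.exp(-s))

    def select_arm(self, x):
        if random.random() < self.epsilon:
            return random.randrange(len(self.weights))
        return max(range(len(self.weights)), key=lambda a: self._predict(a, x))

    def update(self, arm, x, reward):
        # one SGD step on the log loss for the chosen arm's model
        p = self._predict(arm, x)
        for i, xi in enumerate(x):
            self.weights[arm][i] += self.lr * (reward - p) * xi
```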

Bandit Algorithms at Scale
Assign users to arms in advance;
Update arm values in batches (see the sketch below).
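A rough sketch of the batching idea (all names here are assumptions): serve from the current arm values, buffer the observed rewards, and fold them back in periodically rather than on every event:

```python
class BatchedUpdates:
    """Accumulate rewards in a buffer and fold them into the arm
    counts/values once per batch instead of on every observation."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.buffer = []  # (arm, reward) pairs observed since the last flush

    def record(self, arm, reward):
        self.buffer.append((arm, reward))

    def flush(self):
        for arm, reward in self.buffer:
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
        self.buffer.clear()
```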