So, you’ve decided to start testing! Naturally, you’ve established a testing program, done your research — this is something we’ll always mention, since it’s so easy to get caught up in the test mentality without adequate research — and established sound hypotheses.
Only once you’ve defined the problems, devised possible solutions, and prepared hypotheses can you start testing with any hope of success.
You’ll often have multiple ideas for solving any given problem on your website. The simpler the problem, the fewer ideas that make sense to try. However, for complex issues, you will run into a situation where you need to devise a number of potential solutions, since you can’t really be sure what will work best.
As the number of experiments increases, so does the required sample size. This is important for a few reasons.
You can’t ignore the clock…
The larger the sample you need, the more time you will have to allocate for the test to run. For example, if your test includes more than two variations, or is a multivariate test, the required sample size will likely run to five figures or higher, depending on the effect size you want to detect. Unless your site has hundreds of thousands of visitors per month, this will present a problem.
The time a test must run is inversely proportional to the number of people who view it daily: the fewer daily visitors, the longer the test. If you require a sample size running in the tens of thousands and have daily visits of mere hundreds, your test will run for a prohibitively long period of time. And any test that runs for longer than a month risks suffering from sample pollution.
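To get a feel for the math, here's a minimal sketch (the function name and the numbers are illustrative, not taken from any particular tool):

```python
def days_to_run(sample_per_variation, variations, daily_visitors):
    """Days needed for every variation to reach its target sample,
    assuming traffic is split equally across all variations."""
    return sample_per_variation * variations / daily_visitors

# A 4-variation test needing 10,000 visitors per arm,
# on a site receiving 800 visitors a day:
print(days_to_run(10_000, 4, 800))  # 50.0 -- well past the one-month mark
```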
What is sample pollution?
Sample pollution occurs when a significant share of visitors are exposed to your experiment multiple times, potentially seeing different variations. For example, say you run an experiment presenting two versions of your product page. After the test has run for longer than a month, a visitor who viewed the page on the first day returns.

By this time, they may have deleted or refreshed the cookie you left (a cookie is a small piece of data the experiment software stores in the visitor's browser to track which variation was shown to that user). They will see a different version of your page than the one they saw the first time. Once this happens, that visitor's actions can no longer be attributed to a single variation. If this happens to a significant number of visitors, the test results will be unreliable.
Invalidating your test after such a long run would be a serious problem. That is why tests should be designed to keep the required sample size manageable, especially for relatively low-traffic sites.
Iridion shares its testing benchmarks succinctly:
A test should run at least two to four weeks and include at least 1,000 to 2,000 conversions per variation. Activities (newsletters, TV spots, sales, etc.) should take place during this time. There should be as many different channels displayed as possible (total traffic mix, either through targeting in the preliminary stage or segmentation in follow-up). If the test has achieved a minimal statistical significance of 95% (two-tailed, that is positive and negative) and if it is stable, then stop the test.
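The 95%, two-tailed significance check in that stopping rule corresponds to a standard two-proportion z-test. A minimal sketch of the calculation (your testing tool computes this for you; the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_p(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-tailed p-value for the difference between two variations'
    conversion rates, using a pooled two-proportion z-test."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# e.g. 1,000 vs 1,150 conversions on 20,000 visitors each:
# stop only if the p-value is below 0.05 (95% significance) and stable.
```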
Time also poses another problem. If you run a test with three, four, or even more variations, each variation that fails to deliver means lost revenue. The proper, rigorous statistical approach requires you to keep splitting your traffic in equal proportions for the entire run, even once you suspect that a certain variation is a clear loser. In a four-way test, that means throwing 25% of your potential revenue down the drain.
Bandit testing allows you to steal back your time
The solution to this problem appears to be a simple one: observe the test as it runs and eliminate the variations that appear to be losing. Then allocate the remaining traffic to better-performing variations. The problem with this is that it breaks every statistical rule. You cannot possibly be sure that the losing variation is really losing without statistical significance, and that means having to wait until the test runs its course.
Scientists originally encountered this problem when deciding how to allocate effort and resources across different experiments. The problem remained unsolved for decades, and it was only in 1952 that Herbert Robbins devised strategies for sequentially allocating trials among multiple concurrent experiments to select the best-performing one.
The solution quickly found its applications in both scientific research and the selection of the best-performing financial portfolios.
The name “bandit testing” or “multi-armed bandit testing” is derived from a simple analogy. The idea is that you are presented with a gambling machine that has multiple levers (arms). Pulling each of them results in a reward with a certain probability, which is different for every arm.
Your task? Find the arm that provides the most frequent reward.
There are multiple strategies for solving this problem, and each has its own advantages and disadvantages. They involve a fair number of complex statistical calculations — so for those of you who are really interested, check out the multi-armed bandit Wikipedia page.
For now, I’ll briefly explain one common solution: a two-phase approach combining exploration and exploitation.
Exploration is pulling levers to see what happens. Exploitation is pulling the levers that result in the most frequent rewards. The more time you spend on exploration, the more certain you will be as to which machine will provide the reward more frequently — but you’ll receive fewer rewards. The aim is to shorten the exploration phase to a minimum, and engage mainly in the exploitation phase.
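One simple way to balance the two phases is the ε-greedy strategy: explore a random arm with a small probability ε, and exploit the best-known arm the rest of the time. A minimal sketch, not production code:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """With probability epsilon, explore a random arm;
    otherwise exploit the arm with the highest estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update(estimates, counts, arm, reward):
    """Update the running mean reward for the arm that was pulled."""
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

A lower ε means less time spent exploring and more pulls on the current favorite, at the cost of less certainty that the favorite really is the best arm.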
Using bandit testing in CRO became popular relatively recently, and has gained traction by promising to deliver better results faster. But can it really be useful to all optimizers?
When you should use bandit testing
Using bandit testing in CRO is alluring, for obvious reasons. You can deliver a better conversion rate in less time and increase your revenue faster. But you have to be careful.
Why? First, due to the statistical methods used — especially frequentist statistics — interrupting an experiment before it reaches statistical significance carries an inherent risk that the result isn't actually significant. This risk is hard (if not impossible) to eliminate. However, given a large enough sample, we can, with some certainty and a calculated risk, rely on bandit testing.
Here’s how Conductrics explains it:
While there are many different approaches to solving Bandits, one of the simplest is the value weighted selection approach, where the frequency that each optimization option is selected is based on how much it is currently estimated to be worth. So if we have two options, ‘A’ and ‘B’, and ‘A’ has been performing better, we weight the probability of selecting ‘A’ higher than ‘B’. We still randomly select between the two, but we give ‘A’ a better chance of being selected.
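The value-weighted selection Conductrics describes can be sketched like this (the helper is hypothetical, assuming each option's estimated value is a non-negative number):

```python
import random

def value_weighted_choice(values):
    """Pick an option with probability proportional to its estimated
    value: better performers are shown more often, but every option
    with a non-zero value still has a chance of being selected."""
    total = sum(values)
    threshold = random.uniform(0, total)
    cumulative = 0.0
    for option, value in enumerate(values):
        cumulative += value
        if threshold <= cumulative:
            return option
    return len(values) - 1

# If 'A' is estimated at 0.06 and 'B' at 0.03,
# value_weighted_choice([0.06, 0.03]) shows 'A' about twice as often.
```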
With this in mind, there are some guidelines you must follow to properly apply bandit testing. First off, the most appropriate use for bandit testing is multivariate testing. Here, you can use bandit testing to quickly eliminate the worst-performing variation, and reduce the amount of time required to run the test. Multivariate tests require a large sample size, and any reduction in that size is usually a welcome change.
Second, you can use bandit testing to test pages with a short shelf life: for example, landing pages for promotional campaigns. You want to run your campaign for two or three weeks, and you don’t want to spend a majority of that time experimenting on multiple versions of your landing page.
Essentially, if you stand to lose (or miss out on) a large amount of money during a promotional period, and you have the traffic, bandit testing could be the right approach for you. If not, then stick to A/B testing.
And if you have a website with sufficiently large traffic, you can use bandit testing methods for most of your tests, as you will reach statistical significance fairly quickly. That way, you can have more faith in your estimates of which variations are losing.
Bandit test algorithms
Bandit test algorithms are automated pieces of software that solve the problem of selecting the optimal arm. There are several families of algorithms, with names such as:
- ε-greedy algorithm
- Boltzmann exploration
- Pursuit algorithms
- Reinforcement comparison
- Upper Confidence Bounds (UCB)
Each of these families of algorithms performs best on different types of tests. By far the most often used are the UCB and ε-greedy algorithms. For an in-depth exploration of the merits of each, with a sound statistical foundation, we recommend this article.
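As one concrete example from the UCB family, the classic UCB1 rule picks the arm whose mean reward plus an uncertainty bonus is highest; arms that have been tried less often get a larger bonus, so no arm is starved of data. A minimal sketch:

```python
import math

def ucb1_select(total_rewards, counts, total_pulls):
    """UCB1: choose the arm maximizing mean reward plus an
    exploration bonus; rarely tried arms get a larger bonus."""
    for arm, count in enumerate(counts):
        if count == 0:          # play every arm once before anything else
            return arm

    def score(arm):
        mean = total_rewards[arm] / counts[arm]
        bonus = math.sqrt(2 * math.log(total_pulls) / counts[arm])
        return mean + bonus

    return max(range(len(counts)), key=score)
```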
To see how a combination of a Bayesian test and a bandit algorithm performs, check out learnforeverlearn.com.
AI is the future of bandit testing
Lately, there’s been a lot of buzz on the Internet over machine learning and artificial intelligence. One of the most talked-about potential use cases for AI and machine learning (at least in CRO circles) is applying it to A/B testing. The possibilities seem endless and the reward great.
The best-known, and one of the first AI software solutions for CRO, is Sentient Ascend. This piece of software relies on conducting large-scale multivariate tests and automatically discards lower-performing variants. Since it requires human intervention only to identify the variants to be tested once the research is completed and the hypotheses formed, the test program basically runs itself.
In the future, this type of testing will surely make testing easier and faster. It will never eliminate the human factor completely, however. Human input will still be necessary to point the algorithms to what to test and what combinations of variations to test.
Soon, AI will test for us
The point of the entire process of conversion optimization is to increase revenue — i.e. the amount of money ecommerce sites bring to their owners. The best way to increase the likelihood of visitors converting is by researching their behavior and devising changes on your website that you hope will deliver an increase in purchases.
Since you can’t know in advance what exact change(s) will improve performance, you need to test variations against each other and compare results.
The problem with this approach is that it requires following the rules of statistics to be sure that the variation you implement really is the winning one. Proper testing requires time, and in ecommerce, time is literally money. Every day you spend testing, sending traffic to each variation in equal proportion, means revenue lost on the lower-converting variations.
No wonder people went looking for a solution to this problem! They found it in the form of bandit algorithms. The catch was that bandit algorithms require complicated statistical computations to work. But once the computational problem was overcome, it quickly became apparent that in some situations bandit testing is superior to classic testing.
The combination of Bayesian statistics and bandit algorithms is a very happy marriage, since Bayesian statistics does not require the test to reach a significance threshold. Using Bayesian statistics, you can estimate each variation's probability of winning, and eliminate those with the least chance of success.
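One popular way to do this is Thompson sampling with Beta posteriors: draw a plausible conversion rate for each variation from its posterior and show the variation with the highest draw. A minimal sketch, assuming uniform Beta(1, 1) priors:

```python
import random

def thompson_select(successes, failures):
    """Draw a conversion-rate sample from each variation's Beta
    posterior and select the variation with the highest draw."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

# A variation whose posterior probability of winning is tiny will
# almost never be drawn highest, so it is eliminated in practice.
```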
Bandit testing can also be combined with artificial intelligence (AI) to produce quick and successful test programs that use the AI to test many variations — much more than a human ever could. In the near future, a lot of the testing workload will be taken from humans and put on the shoulders of our electronic counterparts.
However, keep in mind that neither bandit testing, Bayesian statistics, nor machine intelligence is a "win automatically" button. None of them can function without optimizers thoroughly understanding what they're doing. Otherwise, we'd be entrusting our fates to a black box.