“Hey, why don’t we just test that?”
A/B testing is an indispensable element of conversion optimization. Backed by extensive and methodical research, testing promises to be the best way to improve website performance in a relatively short time. That’s why conducting A/B tests is considered the pinnacle of the CRO process.
And testing actually sounds like a pretty easy thing to do. You just create your page variations, put them into the experimentation tool of your choice, and that’s it. Sit back, wait for the results to roll in, and implement the winning variation — enjoy!
If it were actually that easy, this post would end right here.
I would have directed your attention to our blog on conversion optimization, and you would have clicked something like “How to properly conduct conversion research,” and that would be it. The reality, though, is a bit more complicated.
What the “black box” concept has to do with A/B testing
“Wait, what? I thought we were talking about testing!”
We are. But first, let’s just briefly examine the concept of the black box.
What is a black box? Simply put, it’s any complex system (device or process) that takes inputs and processes them into outputs (results).
The user does not understand the process used to turn inputs into outputs.
If you use your testing tool in the way described above, you’re using it as a black box. In essence, you do not know why the outputs came out the way they did, or how the system processes them. The box, AKA the process, remains a mystery.
Using any system as a black box is inherently dangerous, since you can’t know if the results are valid. While detailed knowledge of the system is neither necessary nor expected, at least basic knowledge is required.
Put another way, you need to know how a system works to use it effectively. Otherwise, you face a real risk of GIGO (garbage in, garbage out).
To make the experimental tools we use in A/B testing more transparent and reduce the risk of relying on false conclusions, we must learn their basic mechanics. Since A/B testing at its core is a statistical method, that means we need to learn some basic statistics and their role in the A/B testing process.
Hey, wait! Don’t close this tab just yet. I said basic statistics.
… Ah, well. For the two people who didn’t close the window, rest assured that I’ll put this in terms as clear and simple as possible.
I’ll explain some of the basic statistics terms involved in A/B testing. There will be no advanced math or formulas to remember. The tools will take care of that for you.
All you’ll learn here is what these numbers mean, and how they influence your test results, so you can test properly and confidently.
What is statistics?
Let’s start with a quick foundation. Statistics is a branch of mathematics that deals with the properties of large sets of elements.
Think back to your grade school, where you learned about sets and subsets. Those familiar concepts are used in statistics, too.
An element of a set in statistics is called an observation. This name fits nicely, because they are precisely the observations we make of the properties of an object, numerical figure, or person.
For example, imagine you write down the height of every person in a room. These data points together create a set of data that we can examine and use as a basis for larger, more meaningful observations.
Or, say you have 100 people in a room. If you wanted to measure everyone’s height, you’d have to measure each individually. That might be easy enough — but what if it were 1,000 people? Or 1,000,000?
At a certain point, you can’t measure everyone individually. So you have to take a sample.
Samples provide a representation of a larger data set
What you could do instead of measuring everyone is make a limited number of “observations” (measurements). This limited number of observations is known as a “sample” in statistics.
Once you know the properties of your sample, you can use it to deduce the properties of the entire set of elements (called a “population” in statistics).
And you won’t even have to measure all 1,000,000 people!
Another quick example (this one from real life)
Let’s now look at a real example of how statistics are used. One of the most common uses of statistical methods is polling.
If, for example, you want to know the purchasing power of the population of New York City, you would need to gather data from every household in the city. But that is frequently neither possible nor efficient to do.
Instead of relying on the time-consuming process of polling everyone, you could instead poll a number of random inhabitants, and calculate their purchasing power. This limits the number of measurements necessary, saving time, effort, and money that can instead be expended on analyzing the data.
This limited number is the sample, and all the inhabitants of New York City comprise the population.
So, now we poll a sample: say, 1,000 of the more than 8 million people living in NYC. What can we do with it? How do we derive the data we need and project it to the entire population? Let’s see.
First off, you need to be sure your sample is actually representative of the population. This means the ideal sample should resemble the entire population, but at a smaller scale.
To continue with our NYC example, if you took a sample of people living in relatively well-off parts of the city or in suburbs, but avoided polling people in poorer communities, you would not get an accurate sample of the population.
Therefore, sampling must be done with care, but that’s not all. You also need to take a sufficiently large sample to reduce the risk of error. There are methods for sampling without bias, and we will examine them later. For now, all we need to know is that a sample is a limited part of the total population that should ideally represent the entire population.
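To see why a representative sample matters, here is a minimal sketch in Python. The population is made up (simulated household incomes, not real NYC data), and the "biased" sample simply polls only the best-off households. Treat it as an illustration of the idea, not as a sampling method.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 household incomes with a long right tail.
population = [rng.lognormvariate(10.5, 0.6) for _ in range(100_000)]
population_mean = statistics.mean(population)

# Random sample: every household has the same chance of being picked.
random_sample = rng.sample(population, 1_000)

# Biased sample: only the 1,000 best-off households are polled.
biased_sample = sorted(population)[-1_000:]

print(f"population mean:    {population_mean:,.0f}")
print(f"random sample mean: {statistics.mean(random_sample):,.0f}")
print(f"biased sample mean: {statistics.mean(biased_sample):,.0f}")
```

The random sample should land close to the population mean, while the biased sample overshoots it by a wide margin.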
Averages indicate the “central tendency” of your data
Once you have a sample, you can use it to calculate many statistical properties that describe the sample and the total population.
One of the best-known properties is the “average” of the set. The average, or mean, is simply the value of the sum of all the elements divided by the total number of elements in the set.
Average = (Sum of all of the elements) / (Number of total elements in set)
Basically, if you have 5 different values in a sample, you can add them all together and divide that number by 5 to get the “average” value. This is also called the arithmetic mean or simple average.
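If you’d like to see that in code, here is a small Python sketch. The five heights are invented for illustration.

```python
import statistics

# Hypothetical sample: heights (in cm) of five people in a room.
heights = [160, 172, 168, 181, 175]

# Average = (sum of all elements) / (number of elements)
average = sum(heights) / len(heights)

print(average)                   # 171.2
print(statistics.mean(heights))  # same value from the standard library
```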
An important thing to keep in mind when testing is a phenomenon called “regression to the mean.” This essentially means that when you start testing, the results may vary significantly. Sometimes you’ll notice initial values in the test that are far above (or below) the eventual mean of the sample; as more observations come in, the results drift back toward that mean.
Think of flipping a coin
Flipping or tossing a coin is a test with a binary result — the coin can either land on heads or tails, and there’s a probability of 50% either way. But if you start a coin-toss experiment, your first ten tosses might well look like this:
Heads, tails, heads, heads, heads, tails, heads, tails, heads, tails
Notice that heads came up 6 times in 10 tosses. If you stopped the experiment there, you might conclude that heads has a 60% chance of showing up in any random coin toss. But if you continued your experiment, chances are that regression to the mean would kick in, and after 100 tosses, you’d be tempted to conclude that it is 50%.
And after 1,000 tosses, you’d be far more confident that the chance of either side of the coin landing face-up is very close to 50%.
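You can try the coin-toss experiment yourself without wearing out your thumb. The sketch below simulates a fair coin in Python and reports the share of heads at a few checkpoints. The seed and the checkpoints are arbitrary choices, and your exact numbers will depend on them.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

heads = 0
share_of_heads = {}
for toss in range(1, 10_001):
    heads += rng.random() < 0.5  # True counts as 1, False as 0
    if toss in (10, 100, 1_000, 10_000):
        share_of_heads[toss] = heads / toss

for toss, share in share_of_heads.items():
    print(f"after {toss:>6} tosses: {share:.1%} heads")
```

Early checkpoints can sit well away from 50%, while the later ones settle close to it.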
There are other types of averages, such as geometric mean, harmonic mean, and so on. All these are “central tendency estimates,” which serve to determine the basic characteristics of a sample or its distribution.
The distribution of a sample shows us how far each individual element of a sample is from the sample mean. Let’s dig into that a little more.
Depending on the sample size and the distribution type, the number and distance of your data’s outliers can vary.
This is represented by the curve of distribution, plotted on a graph. There are many different distributions; the most frequently used is called a bell curve (for its shape) or “normal” distribution.
The main characteristic of the normal distribution is that it is symmetric around its center, so its arithmetic mean and median (median being the element that lies in the exact middle of all of the data points) are the same. The normal distribution is the one most commonly used in testing, and most of the calculations behind testing tools start from it.
One of the most important properties of distribution is variance.
Variances can indicate the accuracy of your data, or provide a range
Variance is best illustrated and then explained. Look at the following picture:
You see the tallest curve in the graph — the blue one, right? By its shape, you can instantly tell that all the observations in that sample are closely clustered around the mean. Can you guess what its variance is? Large or small?
If you answered small, you’re right.
Variance measures how much each individual data point differs from the sample average. It is used to determine how precise the average is. If we have many outliers, like in the orange curve on the graph, the mean will be a less accurate predictor of each individual element.
Let’s go back to our household purchasing power example.
We’ve collected 1,000 observations in our sample of the population, and discovered that the average purchasing power of a household is, say, $1,000.
-> If we calculate the variance at 9%, we can conclude that the purchasing power of a typical household is between $910 and $1,090.
-> However, if the variance is larger (say 50%), then the purchasing power of an individual household can be anywhere between $500 and $1,500.
If we detect very large variances in our data, it may mean the sample we used was deficient, and we should improve our sampling technique or include more observations.
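Here is a rough Python sketch of the same idea, using two invented samples that share the same $1,000 average but differ in spread. One note: the percentages in the example above describe spread relative to the mean, while the formal variance computed here is in squared units. Its square root, the standard deviation, is back in dollars and easier to read.

```python
import statistics

# Two hypothetical samples of household purchasing power, both averaging $1,000.
tight = [950, 980, 1000, 1020, 1050]
spread = [500, 700, 1000, 1300, 1500]

for name, sample in (("tight", tight), ("spread", spread)):
    mean = statistics.mean(sample)
    variance = statistics.variance(sample)  # sample variance (divides by n - 1)
    std_dev = statistics.stdev(sample)      # square root of the variance
    print(f"{name}: mean={mean}, variance={variance:.0f}, std dev={std_dev:.0f}")
```

Both samples have the same mean, but the mean is a much better guide to any single household in the tight one.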
Now back to A/B testing
So, now that we’ve got the basics out of the way, we can proceed with what we need to do to successfully conduct testing.
What is testing and why do we use it? In a statistical sense, testing serves to distinguish between two samples and determine the difference. In a practical sense, this means taking two variations of a web page and comparing the results they create in terms of engagement, views, conversions, etc.
The point of the test is to find out which of the variations performs better in some key measurement. Usually this is conversion rate. To produce meaningful results, this test must be conducted in a way that creates a meaningful comparison.
We will now see in practice all the theoretical concepts we just discussed, and… gulp… introduce a few new ones.
Significance gives you confidence in your results
In statistical terms, significance means ensuring that the effect we measured is not due to chance.
In terms of A/B tests, significance basically means ensuring that the A/B tests you conduct are meaningful — or in other words, that the difference between the two tested variations is real and reliable.
The usual level of significance we look for in testing is 95%. This means that if there were no real difference between the variations, a result like the one we measured would show up by chance only 5% of the time.
Note that the size of your sample will depend on the level of significance you’re looking for. The higher the significance you’re aiming for, the larger sample size you’ll require to confirm it.
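Your testing tool does this calculation for you, but it helps to see roughly what happens inside the box. The sketch below uses a two-proportion z-test, one common textbook way to check significance. Your tool may use a different method, and the visitor and conversion counts here are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, visitors_a, conv_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 2,500 visitors per variation.
p_value = two_proportion_p_value(conv_a=64, visitors_a=2500, conv_b=100, visitors_b=2500)
print(f"p-value: {p_value:.4f}")
print("significant at the 95% level" if p_value < 0.05 else "not significant")
```

A p-value below 0.05 corresponds to the 95% significance level discussed above.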
Let’s do an example A/B test right here, right now
Say we’re conducting a test on a product page. We want to improve the page’s conversion and get more visitors to actually buy that product.
The existing page converts at a rate of 2.54%, and we want to increase that to 4% (which is the average conversion rate of other products on the website).
To decide how to test, we need to know the following elements:
1. Expected effect
In this case, the expected effect is equal to the difference between 4% (the expected/anticipated conversion rate) and the base 2.54% conversion rate. The difference between those two rates is 1.46 percentage points, but since we are operating in relative terms, we want the percentage of lift achieved.
Expected effect = (Target conversion rate - Base conversion rate) / Base conversion rate * 100
So we have to take one more step, and divide the difference by the base rate and multiply by 100 to get the percentage of relative lift.
Once we calculate that, we know that we’re expecting a lift of 57%. This is the effect we look for.
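The arithmetic, spelled out in a few lines of Python (the variable names are mine, the numbers come from the example):

```python
base_rate = 2.54    # current conversion rate, in percent
target_rate = 4.0   # conversion rate we hope to reach, in percent

absolute_difference = target_rate - base_rate            # percentage points
relative_lift = absolute_difference / base_rate * 100    # percent of the base rate

print(f"absolute difference: {absolute_difference:.2f} percentage points")
print(f"relative lift: {relative_lift:.0f}%")
```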
Generally, the higher the effect, the smaller the sample size we need to detect it. This is the reason to go for bigger effects in testing, especially on sites with lower traffic.
2. Sample size
As we’ve seen, the second element that influences sample size is statistical significance: the higher the significance level you require, the larger the sample you need. We will always aim for tests with 95% significance, as anything lower increases the risk of mistaking chance for a real effect.
You will often read that tests need to be run until they reach 95% significance, and this is true. What you won’t know if you stop reading there is that this will happen multiple times in the process of testing. If you do lots of tests, you will notice that many of them reach significance at one point, only to drop below the 95% mark at other points.
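To get a feel for how often a test can cross the 95% mark by pure chance, here is a small simulation sketch. It runs A/A tests, where both variations have the same true conversion rate, and checks significance every 100 visitors. The rate, traffic, number of runs, and the z-test itself are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided z-test on two arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return False
    z = abs(conv_a - conv_b) / n / std_err
    return 2 * (1 - NormalDist().cdf(z)) < alpha

def simulate(runs=300, visitors=2_000, check_every=100, rate=0.05, seed=1):
    """A/A tests: both arms share the same true rate, so any 'win' is noise."""
    rng = random.Random(seed)
    crossed_at_some_point = 0
    significant_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed = False
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % check_every == 0 and is_significant(conv_a, conv_b, n):
                crossed = True
        crossed_at_some_point += crossed
        significant_at_end += is_significant(conv_a, conv_b, visitors)
    return crossed_at_some_point / runs, significant_at_end / runs

peeking_rate, final_rate = simulate()
print(f"crossed 95% at some check: {peeking_rate:.0%}")
print(f"significant at the end:    {final_rate:.0%}")
```

Since there is no real difference between the arms, every crossing is a false alarm. Stopping at the first one would declare a winner far more often than looking only at the planned end of the test.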
Back to our example test.
When we plug these elements into Evan Miller’s excellent sample size calculator, we get “2,577”. This is how many visits each page variation needs to receive for us to deduce with 95% significance that one variation has won over the other.
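If you’re curious what a calculator like this does under the hood, here is a sketch of a standard textbook formula for comparing two proportions. It assumes 80% statistical power, which is my choice and not something stated above. Calculators differ in their formulas and defaults, so don’t expect this sketch to reproduce the figure quoted above exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(base_rate, target_rate, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    pooled = (base_rate + target_rate) / 2
    spread_null = sqrt(2 * pooled * (1 - pooled))
    spread_alt = sqrt(base_rate * (1 - base_rate) + target_rate * (1 - target_rate))
    delta = abs(target_rate - base_rate)
    return ceil(((z_alpha * spread_null + z_power * spread_alt) / delta) ** 2)

big_effect = sample_size_per_variation(0.0254, 0.04)    # the 57% lift from the example
small_effect = sample_size_per_variation(0.0254, 0.03)  # a smaller, hypothetical lift
print(big_effect, small_effect)
```

The second call shows the point made earlier: the smaller the effect you’re hunting for, the more visitors you need.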
And now, time enters the picture.
How long should you run a test?
Test duration is a function of two variables. The first is, of course, the required sample size. If your website has a large amount of traffic (in the hundreds of thousands of unique visitors per month), you can reach the required sample size for most tests quickly.
Conversely, if the amount of traffic on the site is lower, testing will need to be oriented toward bigger gains or run for longer periods.
To explain the second variable, let’s return to our NYC household purchasing power example from above, where we explained sampling. The same principle applies here: if you start testing with an inadequate or too-small sample, some tests will reach significance almost at once.
If you only test your variations on a sample of 50 visitors, you will not be able to tell with certainty which variation won. An apparent win that is not real is called a false positive. It simply happened that the first observations showed you a certain result, but the reason behind that result may be pure chance or some outside influence you didn’t foresee.
But even if we test the entire required sample size, or all 2,600 or so people, on the first day of testing, we still can’t be sure we won’t end up with a false positive.
The reason may simply be something outside of our control. People may just come to the page on other days. Or we started testing at the end of the month, and the customers were out of money, or something else altogether.
In order to remedy this issue, we need to run our test for long enough to eliminate any outside influence. The changes we made should be the only factor that is different (i.e., the only independent variable), in order to make sure that any test success is due to OUR actual website changes or improvements. Therefore, it’s generally recommended to keep the tests running for at least two to three weeks, and/or through one to two complete sales cycles.
Due to the nature of how testing tools operate, running the test for longer than four weeks exposes us to the risk of sample pollution. Sample pollution is the result of repeating observations in a test, unbeknownst to us.
Why would that happen? Since testing tools rely on browser cookies to mark which visitors have seen which variation, after one month, those cookies may get deleted. If that happens, the same visitor would see a different variation of the page, spoiling the results. For this reason, most tests are called at the end of a four-week period.
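Putting the duration rules together, a planning sketch might look like this. The function name and the daily traffic figure are hypothetical. The two-week minimum and the four-week warning come from the guidelines above.

```python
from math import ceil

def estimate_duration_days(sample_per_variation, variations, daily_visitors, min_days=14):
    """Days needed to fill every variation, rounded up to whole weeks."""
    days_for_sample = ceil(sample_per_variation * variations / daily_visitors)
    days = max(days_for_sample, min_days)
    return ceil(days / 7) * 7  # whole weeks, so each weekday is covered equally

# 2,577 visits per variation comes from the example; 300 visitors a day is hypothetical.
days = estimate_duration_days(sample_per_variation=2577, variations=2, daily_visitors=300)
print(days)  # 21
if days > 28:
    print("longer than four weeks: cookie deletion may pollute the sample")
```

If the estimate comes out well past four weeks, that’s a hint to test for a bigger effect rather than simply run longer.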
Common problems with testing
Some of the most frequent problems related to testing are tied to the testers’ statistical errors. The first is selecting too small a sample. People sometimes assume that 100 visitors is enough, but it is not! You should always determine your sample size in advance to avoid this error.
Calling the test too early is the second, possibly even more common error. Regardless of the sample size you tested, you should run the test for a long enough period of time. As we explained earlier, that way you will eliminate outside influences.
Testing for small gains is an error that is easy to make, and it’s the most frequent reason for inconclusive tests. If you test for a small gain, you increase the chances that your testing tool will not be able to detect the change. This results in an inconclusive test (and you’re left wondering what to do next).
The final thing to keep in mind is the mantra of statistics, which we can’t say enough: “Correlation does not imply causation.”
This means that if two events happened at the same time or close to one another, it does not mean that the second one was caused by the first one. It could be a coincidence. Only sufficient testing can establish causation.
Don’t let this scare you away from testing
A/B testing is seemingly simple and (before you read this post) you might have been tempted to think there isn’t much to it.
The sooner you accept the fact that it is a bit more complicated, the sooner you will get better results. Creating your hypothesis and new page variations and setting up your test is just half of the job. Conducting the test properly is the only way to make sure the results you get will be true and sustainable.
As you’ve seen, most errors in testing come from lack of knowledge of statistics. While you don’t need to be an expert in statistics to conduct tests (though it couldn’t hurt), you do need to know the basics.
Here’s a checklist to help you run better, more reliable tests:
- Determine your necessary sample size.
- Test for at least two or three full weeks.
- Identify possible outside influences (weekends, holidays, shopping patterns, etc.)
- Resist the temptation to call a winner after you see a “trend” emerge. There are no trends in statistics.
- Correlation is not causation! Repeat this before, during, and after the testing is done.