Testing is what makes the conversion optimization process tick – it’s the heart and soul, the raison d’être, the very foundation upon which optimization is built. Test, optimize, win. And it’s not hard to do.
Every online store can and should test to improve its conversion rate. But not every store does. Some are woefully behind the times, giving stores that are actively optimizing the advantage.
But don’t expect that advantage to last for long. CRO is the future. We’ve written this post to help you get there a little faster, so you can get the edge on your competition.
Why you need to test to optimize conversions
Testing is essential to identifying and implementing optimal solutions to amorphous problems (as well as clear-cut problems) like, “Why isn’t my landing page converting as much as I think it should?”
Even if you haven’t done so formally, when you have a problem like a landing page that isn’t performing well, you form a guess – a hypothesis – as to why. Maybe you think it’s taking too long to load, or the “buy” button color or placement isn’t attention-grabbing enough. Maybe your web analytics holds a clue, or the qualitative surveys you’ve sent out hint that the reason could be a lackluster product photo.
Guesses – hypotheses – are often flawed, weak, or plain wrong, but we mostly realize that only in retrospect. The only way to know for sure is to test your theory.
Testing enables you to avoid costly mistakes, like deploying a ‘solution’ that creates new problems or exacerbates existing ones (instead of improving conversion). Testing also allows you to select the top-performing solution out of multiple available alternatives, avoiding results that, while improvements over the original content, are still suboptimal.
All the preliminary research and hypothesis creation you’ve done has led to this point in the process: creating the test. Here are the basic principles involved, so you can begin testing to optimize your conversion rates.
A/B Basics: A Testing Plan
Like embarking on any campaign (in warfare or marketing), you first need a plan of attack. If you read our post on Hypothesis Creation, you already have a foundation that is easy to build upon. (If you haven’t read it yet, read it now – we’ll wait).
Planning begins by choosing the criteria that determine which hypothesis you test first. There are a number of methodologies CROs use to do this, four of the best known being:
- The PIE model
- The PXL model
- The TIR model
- The ICE model
PIE stands for Potential, Importance, Ease. It was developed by the WiderFunnel agency as a means to prioritize tests and create testing plans. This model prioritizes hypotheses based on which have the highest potential for improvement, the greatest possible impact (or importance), AND which require the least amount of effort to implement. Big improvement, big impact, little effort – with these three criteria scored from 1 to 10 (with 10 being the best score).
One variation is to penalize solutions that require great effort by grading effort on an inverted scale of 10 to 1 (with 1 denoting the most effort). The result: a list of hypotheses to test that runs from easiest to hardest, most impactful to least.
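To make the mechanics concrete, here is a minimal sketch of PIE prioritization. The hypotheses and scores below are made-up examples, not recommendations; the model simply averages the three criteria and ranks by the result.

```python
def pie_score(potential, importance, ease):
    """Average of the three PIE criteria, each scored 1-10 (10 is best)."""
    return (potential + importance + ease) / 3

# Hypothetical backlog of hypotheses with illustrative scores.
hypotheses = [
    ("Larger 'buy' button", dict(potential=6, importance=7, ease=9)),
    ("Faster page load", dict(potential=8, importance=9, ease=3)),
    ("New product photos", dict(potential=7, importance=6, ease=6)),
]

# Highest PIE score first: this becomes your testing order.
ranked = sorted(hypotheses, key=lambda h: pie_score(**h[1]), reverse=True)
for name, scores in ranked:
    print(f"{name}: {pie_score(**scores):.1f}")
```

The TIR and ICE models work the same way mechanically – only the criteria (and, for TIR, the 1-to-5 scale) change.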
The PXL model was developed and is used by ConversionXL. It represents an answer to problems inherent in the PIE method, namely the considerable subjectivity involved in making assumptions about Importance and Effort. It introduces a system of grading based upon multiple individual elements, such as the position of the content on the page, the developer effort required to implement the change, the length of time and amount of traffic necessary to test effectively, etc.
The strength of this method is that it is customizable. The only problem is that you may omit some important factors. Use with caution.
The TIR model stands for Time, Impact, Resources. It advocates prioritizing tests according to the time needed to complete the test, the impact of the change on the conversion rate, and the resources necessary to implement the change (most often in terms of man-hours or number of people involved). This model uses a 1 to 5 grading scale.
The ICE model – Impact, Confidence, Effort – was developed by Sean Ellis, founder of GrowthHackers. It prioritizes hypotheses to test by:
- Impact on conversion
- Confidence that the test will actually be successful
- Effort needed to implement the change
ICE seems like an ideal model – aren’t we all after impact on conversion? But it has a drawback: the second element, confidence, is subjective. When it comes to testing, subjectivity isn’t good – it’s where mistakes happen.
Once you choose a way to prioritize your testing order, the next step is to form a plan. Ideally, your plan will include all the hypotheses to be tested, and a deadline by which you’ll be finished. Each test should have a definite time limit, which is typically included as part of your hypothesis.
Your plan will also include which type of test to use for each hypothesis.
Finding the Right Test for the Job
For every hypothesis you must select the most appropriate test type to use.
A/B testing (or split testing) is the most straightforward type of test. It compares just two versions of a page: the existing version and the proposed variation.
The variation can have one major element changed, and possibly one or two minor ones. The two pages are set up in parallel, with A appearing to one audience at the same time B appears to another group of viewers, and both are tracked through a testing tool (such as Optimizely, VWO, or Google Optimize). The traffic reaching the page URL is split in predetermined proportions between the two variations, typically 50/50. The test is allowed to run until it reaches statistical significance.
Statistical significance is the threshold at which test results are considered valid, and not the result of pure chance. In CRO, a 95% confidence level is often used to validate significance.
Once the test has reached statistical significance, we can be reasonably confident that its results are valid. That is the moment when we call a winner between A and B and conclude the test.
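Under the hood, testing tools typically use something like a two-proportion z-test to decide significance. Here is a rough sketch with made-up visitor and conversion numbers (the function name and figures are illustrative, not from any particular tool):

```python
import math

def ab_significance(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for an A/B test.
    conv_*: number of conversions; n_*: number of visitors."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical result: 2.0% vs 2.6% conversion over 10,000 visitors each.
z, p = ab_significance(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
if p < 0.05:  # the 95% confidence threshold commonly used in CRO
    print("Statistically significant – we can call a winner.")
```

A p-value below 0.05 corresponds to the 95% confidence level mentioned above: the observed difference would occur by chance less than 5% of the time if the two pages truly converted equally.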
But, we’re not quite done.
To increase our confidence in the result, we can shift a larger proportion of the traffic to the winning variation to ensure the result holds with larger samples. Or, if we are confident enough, we can transfer all traffic to the new web page, eliminating the losing variation altogether.
When A/B Testing Goes Wrong: A/A Testing
If you strongly suspect that your A/B tests are returning false positives, you can use an A/A test to check whether external pollution (a.k.a. “noise”) is skewing your results. This type of experiment is conducted by splitting the traffic in equal parts between two identical pages. If there is no sample pollution and everything is normal, your A/A test should be inconclusive. However, if you can call a clear winner with statistical significance, something’s not right.
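It’s worth remembering that even a clean A/A setup “finds a winner” about 5% of the time at the 95% confidence level – that’s what the threshold means. This simulation sketch (all parameters are invented for illustration) shows the expected false-positive rate when both variations are truly identical:

```python
import math
import random

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(42)
TRUE_RATE, N, TRIALS = 0.03, 2_000, 500  # identical 3% rate on both "pages"
false_positives = 0
for _ in range(TRIALS):
    # Simulate visitors to two identical pages with the same true rate.
    conv_a = sum(random.random() < TRUE_RATE for _ in range(N))
    conv_b = sum(random.random() < TRUE_RATE for _ in range(N))
    if z_test_p(conv_a, N, conv_b, N) < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / TRIALS:.1%}")
```

If your real A/A tests call winners much more often than this baseline, your traffic split or tracking setup deserves scrutiny.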
The Good & Bad of A & B
The A/B test has certain advantages and limitations. On one hand, it’s relatively simple to set up and can be run effectively with relatively little website traffic. We can also be reasonably sure that the proposed change is actually what is causing the better performance of the page, since the variations are strictly limited. It’s fast to run, results come quickly, and you learn quickly whether your change is an improvement.
However, this type of test suffers from some limitations that narrow its field of use. The most critical limitation is the most obvious: you have to settle on a single alternative to test.
In the real world, it’s far more likely to have multiple variations that can be proposed with equal validity. The more variations we add to the page submitted to A/B testing, the less we understand how much each individual change affects the result. We may end up with a better performing page, but we will never know which change was responsible.
The solution to this is to run multiple A/B tests in sequence to see the effect of each change. This, of course, requires more time.
A/B tests work best when we have two clear alternatives that we want to test. We may also be forced to use A/B tests when the website doesn’t have enough traffic to conduct any other type of test.
In most other circumstances, running a multivariate test is more appropriate.
How Multivariate (A/B/C…/n) Tests Work
Multivariate tests allow us to overcome the main limitation of the straightforward A/B test – that we can only test two variants at a time (and one of those is the control). Multivariate testing lets us identify every possible combination of variations and put each to the test. This allows us to determine which combination of variations impacts the conversion rate most, and implement it.
As in an A/B test, website traffic is allocated proportionally to every variation, preferably in equal shares.
The main limitation of the multivariate test is that a properly conducted test requires significant amounts of traffic. In fact, the amount of traffic required to reach the statistical significance increases exponentially for every additional variation we include. This poses a serious challenge for websites that have relatively low traffic, as the sample size for each test quickly becomes too low to give a decisive conclusion.
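The combination count is what drives the traffic requirement. A quick sketch with hypothetical page elements (the element names and variations are invented for illustration) shows how fast the full grid grows:

```python
from itertools import product

# Hypothetical page elements and their candidate variations.
elements = {
    "headline": ["original", "benefit-led"],
    "button_color": ["green", "orange", "red"],
    "product_photo": ["studio shot", "lifestyle shot"],
}

# Every combination of one choice per element must be tested.
combinations = list(product(*elements.values()))
print(f"Variations to test: {len(combinations)}")  # 2 * 3 * 2 = 12

# If one A/B variation needs, say, ~10,000 visitors to reach
# significance, a full multivariate test of these three elements
# needs on the order of 12x that traffic.
```

Add one more element with three options and the grid jumps from 12 to 36 combinations – which is why low-traffic sites struggle with multivariate testing.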
The limitation of the multivariate test can be overcome by testing over extended periods of time, but that introduces new variables into the experiment that we may not be able to account for, thus polluting our results. For example, if you began your testing in spring and extended it into the holidays, you’d likely get vastly different results that had nothing to do with what was on your website (and everything to do with the season).
Which brings us to another possible solution, the Bandit algorithm test.
Bandit Algorithm Tests
Bandit algorithm tests (or multi-armed bandit tests – an even more intriguing name) let you set up a multivariate experiment and observe it over a limited amount of time. This test works by progressively excluding obvious underperformers until it’s possible to determine the optimal variation with statistical significance.
Bandit tests aren’t perfect. While a bandit algorithm test helps to overcome the basic limitation of the multivariate test, it introduces the risk of terminating some variations prematurely.
Multi-armed bandit tests must be used judiciously, with the number of variations limited to the traffic numbers you know you can count on. The more traffic the website has, the more variations you can safely introduce.
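To illustrate the idea, here is a sketch of one of the simplest bandit strategies, epsilon-greedy (testing tools use more sophisticated algorithms; the conversion rates and parameters below are invented). Most traffic flows to whichever variation looks best so far, while a small fraction keeps exploring the others:

```python
import random

def epsilon_greedy(rates, visitors=20_000, epsilon=0.1, seed=1):
    """Simulated epsilon-greedy multi-armed bandit.
    rates: hidden true conversion rate of each variation (unknown in practice).
    With probability epsilon we explore a random variation; otherwise we
    exploit the variation with the best observed conversion rate so far."""
    random.seed(seed)
    shows = [0] * len(rates)
    wins = [0] * len(rates)
    for _ in range(visitors):
        if random.random() < epsilon:
            arm = random.randrange(len(rates))  # explore
        else:
            # Exploit; unseen arms get an optimistic 1.0 so each is tried once.
            arm = max(range(len(rates)),
                      key=lambda i: wins[i] / shows[i] if shows[i] else 1.0)
        shows[arm] += 1
        wins[arm] += random.random() < rates[arm]  # simulated visitor converts?
    return shows, wins

# Three hypothetical variations with true rates of 2.0%, 2.5%, and 3.2%.
shows, wins = epsilon_greedy([0.020, 0.025, 0.032])
print("Traffic per variation:", shows)
```

Notice the trade-off the text describes: underperformers are starved of traffic quickly, which saves visitors, but a variation that starts unluckily can be demoted before it has enough data – the premature-termination risk.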
Split Path Testing
Split Path Testing comes into play when we need to see which way works best to complete a task. For example, a typical conversion funnel on an e-commerce website looks like this:
Product page → Cart → Shipping info → Billing info → Confirmation page → Thank you
Using split path testing, you could test whether single-page checkout would work better.
Essentially, you’ll create two different paths from the entry point (the product page, in this case) to conversion and split visitors between them equally. The top performer wins.
Split path tests tend to be resource intensive. You may need to develop an entirely different design and code to support the different experience you want to offer to your visitors, and the result may not offer a return that outweighs the effort invested.
However, if the website has hit its local maximum – the plateau at which there are no obvious variations that will result in a significant lift – this may be the one thing left to test.
Testing Done? Great! You’re Not Finished Yet
I know, I know. You’ve hypothesized, you’ve tested, you’ve spent time, money and resources optimizing your site, and by golly, you’re ready for a beer.
But optimization is a journey. A long, rewarding journey, with total optimization as the ever-retreating goal. There will always be ‘one more thing’ to test. Even when you’ve gone through your entire site and optimized to the local maximum, it’s probably just time to consider a site redesign.
Don’t let this discourage you – this is exciting! It means there’s practically no limit to how much you can improve, grow, and profit.
With that in mind, remember a few key things when you’re testing:
- Never end an experiment the moment it reaches statistical significance. Be sure the experiment has been running for at least a week (or the length of your purchase-to-delivery cycle) so that the results cover sample variations due to days of the week. Pay attention to holidays or other periods of decreased (or increased) activity and take them into account when calling the experiment’s winners.
- When you create a test plan, always be aware of opportunity costs. Make an effort to identify all possible variations and judge the effort needed to implement them in order to avoid making suboptimal choices.
- Avoid testing for small-scale changes that have limited potential impact. These will most likely result in inconclusive tests and waste your time.
- Failed tests (tests in which the original variation wins) are still valuable learning moments. That said, you should aim to keep the proportion of winning tests as high as possible.
- In observing test results and calling a winner, you should always check to make sure website performance has not been negatively affected in the testing process. For example, you might increase the performance of the desktop version of the site, but your mobile visitors suffer (and the mobile version of the site becomes unusable).
Unintended consequences and unforeseen results are why we test. And, they’re why we have to keep such tight control over our tests, so we can see the cause/effect relationship and continue on our optimization journeys a little smarter and a little wiser than when we began.