How to Create Effective A/B Tests from Scratch

April 28, 2017

So you want to create effective A/B tests? Did you ever compete in a science fair as a kid? You came up with a hypothesis – a plant grows much better in direct sunlight, for instance – and then created a scientifically sound experiment to test your theory.

You made sure to control for only what you were testing – sunshine – by using all the same types of plants and soil, and the exact same amount of water in every experiment. You grew your plants in various amounts of sunlight and recorded the results. When it was done, you created a three-sided board to display your results and analysis.

The controlled-variable science experiment analogy applies perfectly to creating effective A/B tests for e-commerce stores.

Of course, you’re dealing with way more variables and ambient noise than a kid growing plants for school, but the principles stay the same.

Effective A/B Testing Checklist

Step 1: Where to Begin Testing

We constantly emphasize to our clients that effective testing is a long-term game. It’s not always a popular stance in the era of “move fast and break things,” but the whole point of testing is so that you move in only the right directions and don’t waste time or money breaking big things that matter.

Effective testing begins with you.

Testing begins with your mindset, your resources, your e-commerce store – everything that’s available to you and unique to you.

Strengthening your understanding of the testing process and integrating testing into your team’s workflow gives you the ability to reel off successful tests consistently – raising your revenue through improved conversions.

Testing is a way to build up a systematic approach to optimizing conversions across your entire organization. It’s not something you just “start doing” or “implement” overnight. Like our favorite weightlifting analogy, it’s a process. No one walks into the gym and lifts 300 pounds on their first day. First you need to learn how to lift properly and how to fuel your body for effective lifting.

Step 2: Figuring Out What to Test

Since you’re not jumping in and lifting 300 pounds on day one, you’re also not rolling up your sleeves and testing a complete redesign of your homepage on day one.

Even if you already have a little testing experience, we usually recommend that clients not start with full-page redesigns. If you have no testing experience at all, definitely start with something smaller.

It’s important to start with a test that strikes a balance: something that can give you a true lift but also won’t take forever to set up and run. You’ll want to work quickly through the process of setting up, monitoring, and analyzing test results with your team and testing agency.

Getting through some tests early on also reveals any flaws in your testing approach right away, so you can fix them before you test something huge – crucial design elements on your highly trafficked homepage, for instance.

Here are some ideas on tests you can set up:

  • Customer flow through your checkout, including removing clutter
  • Copywriting – the specific words you use in headlines, buttons, and more
  • Use of social proof
  • Use of security indicators
  • Colors
  • Navigation and search improvements
  • Videos, animations, or neither
  • Call-to-action elements, including color, copy, and location
  • Fonts and text size
  • Identifying and removing distractions from pages

It’s also worth mentioning here that you should be sure your testing tool is capable of and configured to measure the right things – and that you’ve chosen the right measurements for the test you’re running.

For instance, if you’re testing the effectiveness of two landing page headlines and you’re measuring conversions as “clicks on the CTA button,” your results could be skewed by visitors’ reactions to the CTA button itself, which wouldn’t have anything to do with the headlines.

This example from Copyhackers shows this issue in action. Although the instance on the right saw an almost 124% higher conversion rate, it’s hard to know whether to attribute the lift to the page copywriting or the button CTA copy changes:

Copyhackers – Create Effective A/B Tests

Keep your experiments single-variable (just like in your school science experiments) by changing only one thing at a time.

Step 3: Setting Up a Statistically Significant Test

Perhaps the hardest part of A/B testing is making sure your test will mean something.

By that, we mean that once you’ve gone through all the work of setting up, running, and analyzing your test, you need to know the results mean what you think they mean.

For instance, say you run a test to figure out what color to make your homepage button, but you only get 20 visits: 15 visitors respond to blue, while only 5 respond to green. You could conclude it should definitely be blue…except that’s not a lot of visits, and what if the next 7 people pick green? Your results would suddenly feel far less conclusive.

The issue here is a sample size that’s too small. A lot of sample size calculators exist for figuring out how to run a statistically sound and significant test, but many inexperienced testers fall prey to the illusion that they can enter in their parameters and the calculator will spit out perfect, exact answers to the number of observations or length of test necessary.

You need at least 1,000 transactions – or around 25,000 uniques – in order for any kind of testing to make sense.

Testing guru Evan Miller summed up this fallacy and issue so perfectly that we couldn’t say it better:

When an A/B testing dashboard says there is a “95% chance of beating original” or “90% probability of statistical significance,” it’s asking the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like we do in the data just by chance? The answer to that question is called the significance level, and “statistically significant results” mean that the significance level is low, e.g. 5% or 1%. Dashboards usually take the complement of this (e.g. 95% or 99%) and report it as a “chance of beating the original” or something like that.

However, the significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless. This result is completely counterintuitive and all the A/B testing packages out there ignore it, but I’ll try to explain the source of the problem with a simple example.

We generally recommend using Evan Miller’s A/B test sample size calculator to help avoid this issue, and to make sure you’re keeping an eye on statistical significance at all times as you set up your tests.
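For intuition about what such a calculator computes, here is a minimal sketch of the standard two-proportion sample size formula – the same family of math behind Evan Miller’s calculator. The function name and default parameters are our own illustration, not anyone’s official API:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(base_rate, min_detectable_effect,
                            alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect an absolute
    lift of `min_detectable_effect` over `base_rate`."""
    p1 = base_rate
    p2 = base_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Detecting a lift from a 4% to a 5% conversion rate takes
# several thousand visitors per variant.
print(sample_size_per_variant(0.04, 0.01))
```

Notice how quickly the required sample grows as the detectable effect shrinks – roughly quadrupling when you halve the effect – which is exactly why you fix the sample size in advance rather than eyeballing it mid-test.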

Step 4: Monitoring Your Test – Patiently!

This step may actually be the toughest part of testing, because we’re wired by human nature to want to peek at our ongoing tests.

The advice not to peek at your tests is everywhere, and sometimes our clients will assume it’s because they’ll somehow ruin the test by looking early.

Here’s the secret: nothing will happen if you peek at your test early.

The reason a lot of testing consultants and experts will warn you against it is to protect you from yourself. It’s almost impossible to resist acting based on what you see when you look early. Even if you try not to, you’ll have that information in the back of your mind. It’s best to just not look!
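To see why peeking (and then acting on what you see) is so dangerous, you can simulate it. The sketch below runs A/A tests – two identical variants with no real difference – and checks significance after every batch of visitors. The function names and parameters are our own illustration; the point is that repeated checking inflates the false-positive rate well beyond the nominal 5%:

```python
import random
from statistics import NormalDist

def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test at significance level alpha."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return False
    z = (conv_a / n_a - conv_b / n_b) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def false_positive_rate(peeks, batch=200, runs=500, p=0.05, seed=42):
    """Fraction of A/A tests (identical variants) declared 'significant'
    at least once when checked after every batch of visitors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        ca = cb = na = nb = 0
        for _ in range(peeks):
            ca += sum(rng.random() < p for _ in range(batch))
            cb += sum(rng.random() < p for _ in range(batch))
            na += batch
            nb += batch
            if is_significant(ca, na, cb, nb):
                hits += 1
                break
    return hits / runs

# One look stays near the nominal 5%; ten looks declare a
# nonexistent winner far more often.
print(false_positive_rate(peeks=1), false_positive_rate(peeks=10))
```

The variants here are literally identical, yet checking ten times “finds” a winner in a large fraction of runs. Peeking plus acting is how teams ship changes that were never real improvements.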

Step 5: Running Your Tests the Right Duration

This ties in with “Step 3: Setting Up a Statistically Significant Test,” because when you stop a test depends largely on when you’ve collected enough observations to make the results statistically meaningful.

For instance, if you calculate that you need a certain number of observations to reach a sound test, then you can stop your experiment after you’ve reached that number.

This reinforces the need to calculate your sample sizes based on significance and not to simply run your experiments until you see significant change.

Here’s an example from Evan Miller showing why stopping the experiment in the latter situation is problematic:

Example A: Experiment run until a statistically significant number of observations reached

Example A

Example B: Experiment run until statistically significant difference in instances of the observations

Example B

As you can see in the second example, in two of the scenarios the test was stopped too early. This gives you skewed results that overstate the percentage change in conversions – probably leading you to change the thing you’re testing, potentially to your detriment.

Step 6: Analyzing the Results Properly

Your first action when you reach this step? Check your results. Then check them again. And maybe again, for good measure. (Ask our founder Emir about the time he saw a very popular testing software serving the control treatment to 100% of a client’s visitors while the software showed an active test running the entire time.)

Fancy software and shiny calculators can tell us a lot, and they make testing a lot easier than in the olden days, but because we’re optimizing for humans, we need to always give everything a human eye.

At Objeqt, we calculate and validate our results using chi-squared tests like this one:

Success Rate Significance
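A chi-squared check of this sort is easy to reproduce yourself, which is exactly the point of validating tool output by hand. Here is a minimal sketch for a 2×2 table of converted vs. not-converted visitors (the function name is our own illustration, not Objeqt’s actual worksheet):

```python
import math

def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared test on converted/not-converted counts for
    two variants. Returns (statistic, p_value), 1 degree of freedom."""
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the p-value has a closed form
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# 50/1000 vs. 70/1000 conversions: the p-value lands just above 0.05,
# so the apparent lift is not quite significant.
chi2, p = chi_squared_2x2(50, 1000, 70, 1000)
```

If the numbers you compute by hand disagree with what your testing dashboard reports, that’s your cue to dig into how the tool defines a conversion and a visitor.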

If you’re relying solely on software like VWO and Optimizely, then you should know those two tools have hidden all of the information about their actual analyses – presumably to keep their methodologies secret as they engage in a shootout for testing supremacy.

We’re not saying you need to doubt all results from VWO or Optimizely, but it does become impossible to independently verify your results when the actual analysis is hidden behind an opaque black box.

Remember how our overall testing method combines quantitative analytics and qualitative user research before we reach the testing stage? We do that so we always have a human perspective on data – and so we have cold, hard data to counteract human biases. This complementary relationship between data and your creativity and instincts also applies when analyzing your test results.

Your New Motto: Move Deliberately and Improve Things

In e-commerce and business in general these days, we constantly hear the refrain “Move fast and break things.”

We understand the sentiment behind it – this mindset encourages you to experiment and try things, put ideas into action as quickly as possible, and move on to other things if the actions fail to pan out fast. It helps organizations avoid stagnation or endless planning loops without action.

The problem, though, is it also encourages hasty testing and iteration based on potentially incomplete results.

Testing requires patience, but it doesn’t mean stagnation. Applying science rather than gut instinct means making changes proven to produce higher conversions – and thus, higher revenue.

Creating a testing system and integrating a testing mindset and process into everything you do will help you create sustainable gains across the board. Rather than running headlong without a destination in mind, you’ll move steadily and inexorably toward your goals – and those revenue gains.

This article is the fourth in a five-part series, The Foundation of A/B Testing for E-Commerce Growth. To read the rest of the series, click here.

Want to conduct effective A/B tests to grow revenue for your e-commerce store…but not sure how to get started? That’s literally what we do. Schedule a free consultation with us to see if we’re your kind of testing agency and how we can help.

