Now that we’ve covered the basics of Google Optimize, Google’s new tool for A/B and multivariate testing, we thought we’d take it out for a test drive to show it in action. So we created an experiment, all the way from the research and hypothesis stage through to the end results.
Ready to see what this puppy can really do?
We’re going to do this like a real test, so you’ll not only see Optimize in action, but also how the idea-creation process works in practice. Hang on to your hats, because we’ll also get into the difference between the usual frequentist method and Optimize’s Bayesian inference methods.
Before the fun starts – research
Any time we’re A/B testing, our primary objective is to improve the website, and we usually have a few specific pages in mind as starting points. But that’s a vague goal and not very helpful, so our first step is to break this large goal down into specific objectives. Otherwise, all we have is wishful thinking.
Improving a website requires that we first acknowledge there is a problem (usually it’s when we see low conversion rates, or a swift drop in conversion rates), and then diagnose what specifically is wrong and think of ways to fix it.
We will not go into the depths of the research process in this post, since we have a dedicated post on that. Suffice it to say here that research needs to cover every aspect of website performance to surface possible problems. Generally, there are four aspects to the research we do:
- Quantitative research – What the numbers and hard data look like
- Qualitative research – What people are saying
- Heuristics – What our experts think, based on experience
- Technical audit – Kicking the tires and checking nuts and bolts
Each of these helps us to identify and pinpoint different problems that visitors to our website encounter, allowing us to devise solutions to those problems.
For example, qualitative research allows us to peek into the minds of our visitors and see what they perceive as hurdles to conversion. We can use their feedback to adjust our marketing budget, structure our funnel more effectively, and compare different segments of our visitors.
Heuristics help us overcome usability and user-experience issues and structure the website in a user-friendly way.
Finally, a technical audit ensures that the site operates without any technical errors that can cost us dearly in terms of conversion and revenue.
That ideation thing
Once the research process is complete (often the most time-consuming part of the CRO process), we can start “ideation.”
Ideation sounds like something that only happens in the Silicon Valley Googleplex, but don’t let the fancy-pants name fool you: It’s just taking data from your research and using it to think of potential solutions and improvements. You’re brainstorming, essentially. But call it “ideation” and charge more, by all means.
Ideation works best as a team effort, so get everyone involved in digging through the data, unearthing problems, and surfacing solutions.
Hypotheses – so much more than ‘educated’ guesses
Ideas are the raw materials for creating hypotheses.
In high school science, we learned the definition of “hypothesis” as “an educated guess.” In CRO, that would be a near-criminal oversimplification. Here’s what you need to know to establish a functional, productive hypothesis.
A hypothesis contains not only the proposed solution to the problem, but other elements as well: the impact of the problem, specific steps for solving it, and the expected effect of the solution.
It fits into the complete testing picture like this:
Research → Idea → Hypothesis → Test.
Not all hypotheses need to be tested; some are obvious solutions, such as fixes to instrumentation or technical issues, that can be applied immediately.
However, the most complicated problems that don’t have immediately obvious solutions, or have multiple ways of solving the issue, need to be tested. Those hypotheses are the ones that become tests.
Research → Idea → Hypothesis → Test IRL (in real life)
Putting all of this into practice will help us understand this process better. So let’s look at a sample e-commerce store homepage and come up with a few hypotheses, and a test for possible solutions to improve the website and conversions.
Our experiment website sells art prints and decor for children’s rooms.
The first thing that strikes us about the homepage is that there is almost no copy – no stated reason why visitors should buy, nothing fun to hook them in (this is “heuristic” analysis). Unsurprisingly, when we check Google Analytics, we see a high bounce rate with almost zero interaction on the homepage (we are backing up our heuristic analysis with quantitative data analysis).
We also notice that the homepage doesn’t create a sense of urgency or of missing a unique opportunity – two psychological triggers proven to increase conversion and that we can test out. Ideally, this page should emphasize the point that these are unique, one-of-a-kind prints. The website could also attract more sophisticated buyers by offering special, custom-made items and expanding its product and service lines (though these suggestions are entirely up to the owner and out of scope for this article).
Adding a few security indicators and some testimonials, reviews or other kinds of social proof would also help the target audience feel at ease that these are quality products from a legitimate business, and that they can expect positive results from purchasing.
Finally, the call to action (CTA) should be more prominent. It currently takes the form of a ‘mouse-over form’ that appears only once the visitor moves the pointer over the image. That might be confusing.
Every one of these ideas represents a potential test.
We’re going to experiment with improving copy, since the other suggestions involve interventions in page code and a lengthier implementation process for the test.
Heuristic research has found that the copy on the homepage is too short or non-existent. This creates a high bounce rate: visitors see the homepage, find nothing to catch their interest, and leave. We propose adding copy to the homepage, immediately below the title line. The text should contain a few sentences that emphasize a unique value proposition, such as solving the problem of selecting a unique, memorable present for a baby shower or birthday. The expected result of these efforts would be increased visitor engagement and a lift in the conversion rate from the present 1.5% to at least 2.5–3%.
Notice the structure of our hypothesis. We say what problem our research has indicated; we say the effect we believe this has (a high bounce rate); we propose a change to improve the page; and we end with our expected, measurable, result.
But we’ve also done something here that is subtle, but very important. We chose a change that we expect will have the most effect on conversion for the least amount of effort to implement. Prioritizing your solutions will save a great deal of time and effort, and we recommend always looking out for ‘quick wins.’
The scope of this proposed change requires very little effort – it’s just changing the copy above the fold. But it will be seen by every visitor to the home page, which means the effect could be substantial.
Then, the mock-up
So now that we have the idea and a hypothesis, we need an experiment. In order to test our belief that adding copy to the homepage will positively impact conversions and visitor engagement, we need to create a new variation (challenger).
A challenger variation is the new version of the homepage where we’ve implemented the solution we think could solve the conversion problem: in this case, improved copy.
This is an ideal case for an A/B/n split test. We could create a few different challengers and put them to the test. So how do we do this?
We begin by making a mock-up webpage in a tool such as Balsamiq. This allows us to make a design proposal and get the opinions of everyone involved in the process (i.e., the client) before we move forward. You can use any wireframing tool you like.
After we create a mock-up and get everyone to agree to the proposed change, we proceed to the next step of the process – creating an actual experiment.
Finally, the experiment! See Google Optimize in Action
We can implement our variation in Google Optimize in one of two ways.
The first is to use the integrated visual editor and make the changes directly, without touching the code. This option is appropriate for small-scale changes (such as moving elements of the page around) or changes to content (such as what we propose to do right now: changing the copy).

The second is to use the external editor and edit the code directly.

We’ll use the visual editor for our little experiment, although we strongly recommend making big, substantial changes to create a real lift.
This is the layout of the visual editor included in Google Optimize.
Below is our variation, with copy added:
But be cautioned: most content and engagement issues on a website cannot be solved by just moving calls to action around. For best results, make substantial changes that really affect how your visitors perceive the website, increase their motivation, and engender trust.
Defining measurable objectives
Once you have created a variation, but before it is live, you need to select your objectives. As we have seen in the previous article, the objectives can be anything from page views to custom events. But they do need to be measurable.
We need to select the most appropriate objective that will best measure and reflect the improvement we want to make.
In this case, we want to measure the number of products purchased, or macro conversions, on these two pages. This is the metric we want to improve and, while other metrics may be of interest, it is the only one that directly increases the revenue of the website.
Who are we experimenting on?
Once we create the variation and add the objectives to track, we only need to define the audience to whom the experiment will apply. You can select different audiences, either using default settings based on Google Analytics reports, or using your own custom settings, a feature available only in Optimize 360.
For this experiment, we will use ‘all users’ as our target audience, although sometimes you will want to target specific segments of your audience.
Frequentist v. Bayesian (advanced explanation)
It is here that we must reflect a bit more on the difference between ‘traditional’ frequentist inference and the Bayesian inference that Google Optimize uses (and most other testing tools don’t). In our previous article on Google Optimize, we gave a simplified explanation of frequentist vs. Bayesian, but now it’s time to go a bit deeper.
While frequentist statistics rely purely on retrospective analysis of the available data (what happened before) and derive conclusions from it alone, Bayesian inference combines the data we have observed with our prior expectations, updating those expectations as new data comes in.
What does this mean in practice?
Testing using frequentist statistics is based on significance, and we use the term “significance” to measure the likelihood that our data can be explained by something other than pure chance. Practically, it means that if significance is 95%, it is highly likely that the increase in a metric is due to the changes we made and not to some outside interference we did not control. Frequentist statistics do not tell us whether the variation is better or worse, only that the observed result is not the result of chance.
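To make the frequentist side concrete, here is a minimal sketch of the two-proportion z-test that underlies a classic A/B significance check. The function name and the conversion counts (15 sales out of 1,000 sessions for the control, 30 out of 1,000 for the variation) are our own hypothetical illustration, not numbers from the experiment in this article:

```python
import math

def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """One-sided two-proportion z-test: is B's conversion rate higher than A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate assuming no real difference
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Convert the z-score to a p-value via the standard normal CDF
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical data: control converts 15/1,000 sessions, variation 30/1,000
p = one_sided_p_value(15, 1000, 30, 1000)
print(f"p-value: {p:.4f}")  # a p-value below 0.05 is 'significant at the 95% level'
```

Note what the output does and doesn’t say: a small p-value only tells you the observed difference is unlikely under pure chance, exactly as described above.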
Bayesian statistics, on the other hand, tests the result against the expected value of the change, which we set before we start the test.
The test is conducted by comparing the two samples and their results directly: the Bayesian approach estimates the probability that the variation outperforms the baseline, rather than testing against a null hypothesis. In this sense, the Bayesian method does not require an up-front sample size calculation.
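The Bayesian comparison can be sketched with a simple Beta-Binomial simulation. This illustrates the general approach, not Google Optimize’s exact internal model, and the conversion counts (15/1,000 vs. 30/1,000) are hypothetical numbers of our own:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)  # seeded for reproducibility
    wins = 0
    for _ in range(draws):
        # Each posterior is Beta(conversions + 1, non-conversions + 1)
        a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += b > a
    return wins / draws

# Hypothetical data: control converts 15/1,000 sessions, variation 30/1,000
print(f"P(variation beats baseline): {prob_b_beats_a(15, 1000, 30, 1000):.1%}")
```

The output is the kind of statement Optimize reports directly, e.g. a percentage chance that the variation is better than the baseline, rather than a p-value that still needs interpretation.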
The difference also shows in the way the results are interpreted. While a frequentist test will report that your result is significant at the 95% level, which in ordinary language means very little, a Bayesian test result may contain the following line: ‘Variation X has an 83.2% chance of being better than the baseline.’ Notice the difference here.

With the frequentist test, you need a person well versed in statistics to interpret the result and translate it into ordinary language; the Bayesian result allows even a layperson to understand the test result immediately.
So why have frequentist statistics been used when the Bayesian approach has been around since the 18th century? The answer to this question is simple: Frequentist statistics require a much simpler mathematical engine and far fewer operations to execute. However, now that virtually unlimited computing power is available to everyone, there is no excuse to remain bound to frequentist statistics.
In the immortal words of famous statistician Chris Stucchio (what, never heard of him? He was one of the first people to propose Bayesian statistics for A/B testing, in ‘Agile A/B Testing with Bayesian Statistics and Python.’ It was riveting.)…
Hypothesis testing, aka the underlying model of your everyday a/b test, is designed to determine whether a hypothesis is true or false – it cares very little whether it makes small or large mistakes. Furthermore, the hypothesis that you specified may or may not have much to do with the success or failure of your business in the long term.
In contrast, the Bayesian A/B test recognizes that not all errors are created equal. Many small errors might equal a single large one, but then again, they might not. The Bayesian test applies a weight to losses proportional to the lift being lost. This allows for freedom to make mistakes as long as the end goal is higher conversions. These weights are directly the same issues that your business cares about, as opposed to the indirect ones prepared by hypothesis testing. This allows you to apply the same rigor to your tests as you do to your business.
Okay, but what do the results mean exactly?
Now that we have our experiment running, we will examine how and when to expect results. Although, in theory, Bayesian testing does not require an explicit sample size calculation, you still need enough site visitors to be sure that your results aren’t just because your mother forwarded your homepage to 50 of her friends, saying “see what my son did!”
Allow the test to run for at least two weeks and involve at least 1,000 transactions to be really sure of the result.
These are the results of our experiment as they appear in Google Optimize.
As we can see, our variation resulted in improving the conversion rate by 73.6% over the baseline, with 99% certainty, after 559 sessions.
This is in stark contrast with the frequentist answer, which would only tell us that the test is significant at the 95% level. Here we have a 99% probability that our variation beats the control, provided we ran our test long enough to ensure a reliable result.
We conclude: Google Optimize, in all its Bayesian glory, is a significant improvement
Whether we use Bayesian or frequentist methods is of no real consequence, provided we know what the end results mean. Bayesian results are easier to interpret and present to regular people (i.e., not professional CROs or statisticians), but – rest assured – all of these testing methods will give the same result if the hypothesis we test is sound.
Even though you don’t need to explicitly determine the sample size in order to conduct a Bayesian test, you’ll still need to use common sense. Don’t quit the experiment after a day or two, or a few dozen transactions recorded. Bayesian testing or not, this sample will simply not cut it.
With this, we will leave you to test for yourself. Enjoy!