Stopping A/B tests too early is without a doubt the most common, and one of the most dangerous, A/B testing mistakes. If you finish tests too soon, A/B testing may not deliver the benefits you are looking for. Even worse, because the decisions you make are based on invalid results, they could negatively impact your conversion rates.
In this blog post we'll answer the following question: What concepts do I need to understand to avoid stopping my A/B tests too early? We'll look at these factors:
- Significance level
- Sample size
- Duration of test
- Variability of data
None of these elements is, on its own, a rule for when to stop a test. But having a better grasp of them will allow you to make more informed decisions about the length of your tests.
1 When is a significance level reached?
Don’t trust test results below a 95% significance level. But don’t stop a test just because it has reached this level.
What is an A/B testing significance level?
When your A/B testing tool tells you that your variation has an X% chance of beating the control, it’s actually giving you the statistical significance level. At a 95% level, another way to put the same data is that, if there were really no difference between your control and variation, there would be only a 5% (1 in 20) chance of measuring a difference this large through random variation alone. You want a minimum level of 95%, no less.
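To make the idea more concrete, here is a minimal sketch of one common way such a figure can be computed: a two-sided, pooled two-proportion z-test, using only the Python standard library. The function name and the visitor and conversion counts are hypothetical, and your tool may well use a different method (a Bayesian one, for example), so treat this as an illustration rather than a description of any specific tool.

```python
from math import sqrt
from statistics import NormalDist


def significance_level(conv_a, n_a, conv_b, n_b):
    """Return 1 - p_value for a two-sided, pooled two-proportion z-test.

    conv_a / n_a: conversions and visitors for the control.
    conv_b / n_b: conversions and visitors for the variation.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 0.0  # no conversions (or only conversions): nothing to compare
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


# Hypothetical counts: 200 conversions out of 10,000 visitors for the control,
# 250 out of 10,000 for the variation.
level = significance_level(200, 10_000, 250, 10_000)
print(f"Significance level: {level:.1%}")
```

With these made-up numbers the level comes out above 95%, which, as the rest of this section explains, is necessary but not sufficient to stop a test.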
Why shouldn't you stop an A/B test before reaching a 95% significance level?
A significance level of 80% sounds like a solid winner but that’s not why you’re testing. You don’t want just a “winner”.
You want a statistically valid result. Your time and money are at stake, so don't gamble. From experience, it’s not uncommon for a test to have a clear 'winner' at 80% significance, but it then actually loses when you let the test run properly to the end.
Is having a 95% significance level high enough to stop your A/B test?
Statistical significance doesn’t imply statistical validity and isn’t a stopping rule on its own. If you did a fake test, with the same version of a page (an A/A test), you'd have a more than 70% chance that your test would reach a 95% significance level at some point.
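You can get a feel for this with a small simulation. The sketch below runs a batch of simulated A/A tests, checks a two-proportion z-test after every batch of visitors, and compares how often a test looks significant at any check with how often it is significant at the final check only. All parameters (300 tests, 40 checks, 100 visitors per version per check, a 10% conversion rate) are assumptions chosen to keep the run short. The exact share depends heavily on how often you check, so this is not meant to reproduce the 70% figure, only to show the inflation.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold, about 1.96


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each version."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_error = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_error == 0:
        return False
    return abs(conv_b - conv_a) / n / std_error > Z_CRIT


def simulate_aa_tests(n_tests=300, peeks=40, batch=100, rate=0.10, seed=1):
    """Return (share significant at any peek, share significant at the last peek)."""
    rng = random.Random(seed)
    ever = final = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        hit = last = False
        for _peek in range(peeks):
            # Both versions are identical, so they share the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            last = is_significant(conv_a, conv_b, n)
            hit = hit or last
        ever += hit
        final += last
    return ever / n_tests, final / n_tests


ever_rate, final_rate = simulate_aa_tests()
print(f"Significant at some peek: {ever_rate:.0%}")
print(f"Significant at the final peek: {final_rate:.0%}")
```

The first share should come out well above the second, even though there is no real difference between the two versions. Increasing `peeks` pushes it higher still.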
Shoot for a 95%+ significance level BUT don’t stop your test just because it reached it.
2 What sample size do you need to get significant A/B test results?
You need a sample that is representative of your audience and large enough not to be vulnerable to the data's natural variability.
When you do A/B testing, you can’t measure your “true conversion rate” because it’s a continually moving target. You arbitrarily choose a portion of your audience with the following assumption: the selected visitors’ behavior will correlate with what would have happened with your entire audience.
Know your audience before creating your A/B test
Conduct a thorough analysis of your traffic before launching your A/B test. Here are a couple of examples of things you need to know:
- How many of my visitors come from sources such as PPC, direct traffic, organic search, email, referral, etc.?
- What % are returning or new visitors?
The problem is your traffic keeps evolving, so you won’t know everything with 100% accuracy. So, ask this question: Is my sample representative of my entire audience, in its proportion and composition?
Another issue if your sample is too small is the impact that outliers will have on your experiment. The smaller your sample is, the higher the variations between measures will be.
What is the problem with having a small sample for an A/B test?
Here's an analogy from a real-life experiment. Look at the data from tossing a coin 10 times. H is heads, while T stands for tails. We know the “true” probability of our coin landing on either heads or tails is 50%. Here's what you might see if you repeat the ten tosses 5 times and track the % of heads and tails.
The outcomes vary from 30% to 80%. Repeat the experiment, but toss the coin 100 times instead of 10:
The outcomes now vary from 47% to 54%. This highlights that the larger your sample size, the closer your result gets to the “true” value.
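If you'd like to try the coin experiment yourself, here is a minimal simulation sketch. The function name, the seed, and the extra 10,000-toss run are our own assumptions, and the percentages you get will differ from the figures quoted above because every run of a random experiment differs.

```python
import random


def heads_share_range(tosses, repeats=5, seed=0):
    """Toss a fair coin `tosses` times, repeat `repeats` times,
    and return the lowest and highest share of heads observed."""
    rng = random.Random(seed)
    shares = []
    for _ in range(repeats):
        heads = sum(rng.random() < 0.5 for _ in range(tosses))
        shares.append(heads / tosses)
    return min(shares), max(shares)


for tosses in (10, 100, 10_000):
    low, high = heads_share_range(tosses)
    print(f"{tosses:>6} tosses x 5: heads between {low:.0%} and {high:.0%}")
```

The gap between the lowest and highest share typically shrinks as the number of tosses grows, which is the same effect you see with conversion rates on a larger sample.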
With conversion rates, you could have your variation winning by a long way on the first day for a whole variety of reasons. For example, you might have just sent out your newsletter, so the majority of your traffic came from existing clients who are already engaged with your brand and who reacted positively to your experiment. If you stop the test at this point, even with the significance level at 95%, you would have skewed results. The real result could actually be the exact opposite, meaning you'd make a business decision based on false data.
How big should your A/B test sample be?
There is unfortunately no magic number that will answer this question. It comes down to how much of an improvement you want to be able to detect. The bigger a lift in conversions you want to detect, the smaller sample size you’ll need. And even if you have Google-like levels of traffic, it isn’t a stopping condition on its own.
One thing is true for all statistical methods: the more data you collect, the more accurate or “trustworthy” your results will be. But this varies depending on the method your tool uses.
To determine sample size, we advise our clients to use a calculator like this one (we also have one in our solution). It gives an easy-to-read number, without you having to worry about the math too much. And it prevents you from being tempted to stop your test prematurely, as you’ll know that until this sample size is reached, you shouldn’t even look at your data.
You’ll need to input the current conversion rate of your page and the minimum lift you want to track (i.e. the minimum improvement you’d be happy with). We then recommend at least 300 macro conversions (meaning your primary goal) per variation before even considering stopping the test.
We sometimes aim for 1,000 conversions per variation if a client has sufficient traffic. The larger the volume the better, as we saw earlier. The required volume could also be slightly lower if there are considerable differences between the conversion rates of your control and variation.
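For readers curious about what a sample size calculator does with those two inputs, here is a rough sketch based on the standard normal-approximation formula for comparing two proportions. It is not the formula of any particular calculator. The function name is hypothetical, and the 80% statistical power is an assumption on our part (it is a common default, but it is not discussed in this post).

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, min_relative_lift,
                              significance=0.95, power=0.80):
    """Approximate visitors needed per variation for a two-sided test.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%.
    min_relative_lift: smallest relative lift to detect, e.g. 0.20 for +20%.
    power: assumed probability of detecting that lift if it is real.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - significance) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical page converting at 5%: compare a +20% and a +10% minimum lift.
for lift in (0.20, 0.10):
    n = sample_size_per_variation(0.05, lift)
    print(f"Minimum lift {lift:.0%}: about {n:,} visitors per variation")
```

Halving the lift you want to detect roughly quadruples the sample you need, which is why the minimum lift matters so much when you plan a test.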
However, simply having high volumes of traffic, a large enough sample size, and reaching a 95% significance level in 3 days is not enough. There are rules about the duration of A/B tests to take into account too.
3 The duration of A/B tests
How long should you run your A/B tests for?
You should run your tests for full weeks at a time and we recommend you test for at least 2–3 weeks. If you can, make the test last for 1 (or 2) business cycle(s).
This is because, as with emails and social media, there are optimal days (even hours) for traffic. People behave differently on given days and are influenced by a number of external events. Run an analysis of your conversions on a day-by-day basis for a week and you’ll see how much they can vary from one day to another. This means that if you started a test on a Thursday, you should end it on a Thursday.
What are the best practices for A/B test duration?
Test for at least 2-3 weeks, although longer is preferable. As mentioned above, 1 to 2 business cycles would be optimal. This gives a range of traffic - from first-time visitors to those close to buying - from different sources, while accounting for most external factors (we’ll explain more about these in our article on external validity threats). If you need to extend the duration of your test, extend it by a full week.
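The "full weeks, never less than the minimum" rule is easy to turn into a planning helper. The sketch below is only an illustration: the function name, the required sample per variation, and the daily traffic figures are hypothetical, and it assumes your traffic is split evenly across all versions and stays roughly constant.

```python
from math import ceil


def planned_duration_days(required_per_variation, versions, daily_visitors,
                          min_weeks=2):
    """Days to run a test, rounded up to full weeks, with a minimum length.

    versions counts the control as well as the variations.
    """
    total_needed = required_per_variation * versions
    days_for_sample = ceil(total_needed / daily_visitors)
    weeks = max(min_weeks, ceil(days_for_sample / 7))
    return weeks * 7


# Hypothetical plan: 8,158 visitors needed per version, control + 1 variation.
print(planned_duration_days(8158, 2, daily_visitors=1500))  # 14
print(planned_duration_days(8158, 2, daily_visitors=500))   # 35
```

In the first case the sample would be reached in 11 days, but the test still runs for two full weeks. In the second, 33 days of traffic are rounded up to five full weeks.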
4 Check the variability of data during A/B tests
If your significance level and/or the conversion rates of your variations are still fluctuating considerably, leave your test running.
What does variability involve?
There are two factors to consider:
- The novelty effect: When people react to your change just because it’s new. It will fade with time.
- Regression to the mean: The more data you record, the more you approach the “true value”. This is why your tests fluctuate so much at first, as outliers have an outsized impact.
In this example, the results given by the orange curve are too variable to be representative.
Why the significance level alone is not enough when results still vary
This is also why the significance level isn’t enough on its own. During a test, you are likely to reach 95% several times before you can actually stop your test. Make sure your significance curve flattens out before ending the test. The same principle applies to the conversion rates of your variations - wait until the fluctuations are negligible considering the situation and your current rates.
For tools that give you a confidence interval, for example “Variation A has a conversion rate of 18.4% ± 1.2% and Variation B 14.7% ± 0.8%”, make sure you understand what this means. Essentially it is saying that the conversion rate of variation A is between (18.4% - 1.2%) and (18.4% + 1.2%), and the conversion rate of variation B is between (14.7% - 0.8%) and (14.7% + 0.8%). If the 2 intervals overlap, keep testing.
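The overlap check itself is simple arithmetic. Here is a small sketch using the figures from the example above. The function names are our own, and the second pair of intervals is a made-up case with wider margins, added only to show what an overlap looks like.

```python
def interval(rate, margin):
    """Turn 'rate ± margin' into a (low, high) pair, both in percent."""
    return rate - margin, rate + margin


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


variation_a = interval(18.4, 1.2)  # roughly 17.2% to 19.6%
variation_b = interval(14.7, 0.8)  # roughly 13.9% to 15.5%
print(intervals_overlap(variation_a, variation_b))  # False

# Hypothetical earlier reading of the same test, with wider margins.
early_a = interval(18.4, 2.5)  # roughly 15.9% to 20.9%
early_b = interval(14.7, 1.5)  # roughly 13.2% to 16.2%
print(intervals_overlap(early_a, early_b))  # True: keep testing
```

In the example from the text the two intervals do not overlap. With the wider, hypothetical margins they do, which would be the signal to keep the test running.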
Your confidence intervals will get more precise as you gather more data. So, whatever you do, don’t report on a test before it’s actually over. To resist the temptation to stop a test, it’s often best not to look at the results before the end. If you’re unsure, it’s better to let it run a bit longer.
Before stopping an A/B test, consider the following:
- Is your significance level equal to or above 95%?
- Is your sample large enough and representative of your overall audience in composition and proportions?
- Have you run your test for the appropriate length of time?
- Have your significance level and conversion rate curves flattened out?
Only after taking all of these factors into account can you stop your test - remember that ending your test early can invalidate the results, and any decisions you base on them.