
Are You Stopping Your A/B Tests Too Early?

Reading time
8 min
Author
Lauréline Saux
Laureline is Content Manager and is in charge of Kameleoon's content. She writes on best practice within A/B testing and personalization, based on in-depth analysis of the latest digital trends and conversations with Kameleoon's customers and consultants.

Stopping A/B tests too early is without a doubt the most common, and one of the most damaging, A/B testing mistakes. Do it and A/B testing may fail to improve your conversion rate at all. Worse, the decisions you take could actually hurt it.

We will do our best to answer the following question: what concepts do I need to understand so that I don't stop my A/B tests too early? To answer it, we will cover:

  • Significance level
  • Sample size
  • Duration
  • Variability of data

Note: None of these elements are stopping rules on their own. But having a better grasp of them will allow you to make better decisions.

When is significance level reached?

Don’t trust test results below a 95% significance level. But don’t stop a test just because it reached this level.

What is an A/B testing significance level?

When your A/B testing tool tells you something along the lines of “your variation has an X% chance of beating the control”, it is actually giving you the statistical significance level. Another way to put it, at 95% significance: “there is a 5% (1 in 20) chance that the result you see is completely random”, or “there is a 5% chance that the difference in conversion measured between your control and variation is imaginary”. You want 95% at a minimum. Not less.
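To make the idea concrete, here is a minimal sketch (our own illustration, not from the article) of how such a figure can be derived from raw counts with a standard two-proportion z-test. Your tool may well use a different statistical method (Bayesian approaches are common), so treat the function name and numbers as illustrative assumptions.

```python
# Minimal sketch: deriving a "chance to beat control" figure from raw counts
# with a two-proportion z-test. Illustrative only; real tools may use other methods.
from math import sqrt
from statistics import NormalDist

def significance_level(conversions_a, visitors_a, conversions_b, visitors_b):
    """One-sided confidence (in %) that variation B beats control A."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(p_pooled * (1 - p_pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / std_error
    return NormalDist().cdf(z) * 100

# Hypothetical example: 5,000 visitors per variation, 400 vs. 450 conversions
print(round(significance_level(400, 5000, 450, 5000), 1))  # roughly 96
```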


 

Why shouldn't you stop A/B testing before reaching a 95% significance level?

An 80% chance does sound like a solid winner, but that's not why you're testing. You don't want just a “winner”.

You want a statistically valid result. Your time and money are at stake, so let's not gamble! From experience, it's not uncommon for a test to show a clear winner at 80% significance that then turns into a loser once you let the test run properly.

Is a 95% significance level enough to stop your A/B tests?

Statistical significance doesn't imply statistical validity and isn't a stopping rule on its own. If you ran a fake test with two identical versions of a page (an A/A test) and kept checking the results, you'd have more than a 70% chance of reaching the 95% significance level at some point.
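To see why, here is a rough simulation sketch (our own, not from the article): two identical variations, checked for significance after every batch of visitors. The exact percentage depends on how often and how long you peek, but a large share of these A/A runs look “significant” at some point.

```python
# Sketch: A/A tests with repeated peeking. Both variations share the same true
# conversion rate, yet many runs cross the 95% threshold at some point anyway.
import random
from math import sqrt

def aa_test_looks_significant(true_rate=0.05, total_visitors=20000, batch=500):
    conv_a = conv_b = visitors = 0
    for _ in range(total_visitors // batch):
        visitors += batch
        conv_a += sum(random.random() < true_rate for _ in range(batch))
        conv_b += sum(random.random() < true_rate for _ in range(batch))
        p_a, p_b = conv_a / visitors, conv_b / visitors
        p_pooled = (conv_a + conv_b) / (2 * visitors)
        std_error = sqrt(p_pooled * (1 - p_pooled) * 2 / visitors)
        if std_error and abs(p_b - p_a) / std_error > 1.96:  # 95%, two-sided
            return True  # would have been declared a "winner"/"loser" at this peek
    return False

runs = 200
false_alarms = sum(aa_test_looks_significant() for _ in range(runs))
print(f"{false_alarms / runs:.0%} of A/A runs looked significant at some peek")
```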

Shoot for 95%+ significance level BUT don’t stop your test just because it reached it.

What sample size do you need to get significant A/B tests?

You need a sample representative of your audience and large enough not to be vulnerable to the data's natural variability.

When you do A/B testing, you can’t measure your “true conversion rate” because it’s an ever-moving target. You arbitrarily choose a portion of your audience with the following assumption: the selected visitors’ behavior will correlate with what would have happened with your entire audience.

Know your audience before creating your A/B test

Conduct a thorough analysis of your traffic before launching your A/B tests. Here are a couple of examples of things you need to know:

  • How many of my visitors come from PPC, direct traffic, organic search, email, referral, ...
  • What percentage of my visitors are returning vs. new

The problem is your traffic keeps evolving, so you won’t know everything with 100% accuracy. So, ask this: Is my sample representative of my entire audience, in proportions and composition?

Another issue with a sample that is too small is the impact outliers will have on your experiment. The smaller your sample, the more your measurements will vary.

What is the problem with a small sample for A/B tests?

Here is an analogy with a real-life experiment: tossing a coin 10 times and recording H (heads) or T (tails). We know the “true” probability of heads is 50%. We repeat the 10-toss experiment 5 times and track the percentage of heads.

 

Run    1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th   % H
1      T    H    T    H    T    H    T    H    H    T      50
2      T    H    T    T    T    H    T    H    T    T      30
3      H    T    H    T    T    H    H    H    H    H      70
4      H    T    H    T    T    T    T    H    T    H      40
5      H    H    H    T    H    T    H    H    H    H      80

 

The outcomes vary from 30% to 80%. Same experiment, but this time we toss the coin 100 times instead of 10.

Run    Tosses   % H
1      100      50
2      100      49
3      100      50
4      100      54
5      100      47

The outcomes vary from 47% to 54%. The larger your sample size is, the closer your result gets to the “true” value. It’s so much easier to grasp with an actual example :)
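If you want to see this effect for yourself, here is a tiny simulation sketch (our own illustration): it repeats the coin-toss experiment with increasingly large runs and shows the spread shrinking.

```python
# Sketch: the coin-toss example above, simulated. Small runs swing widely
# around the true 50%; larger runs settle much closer to it.
import random

def percent_heads(tosses):
    return 100 * sum(random.random() < 0.5 for _ in range(tosses)) / tosses

for size in (10, 100, 10_000):
    results = [round(percent_heads(size), 1) for _ in range(5)]
    print(f"{size:>6} tosses per run -> % heads: {results}")
```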

The same goes for conversion rates: your variation could win by a wide margin on the first day because you had just sent your newsletter and the majority of your traffic was made up of existing customers, for example. They like you considerably more than average visitors, so they reacted positively to your experiment. If you stopped the test there, even with a significance level of 95%, you would have skewed results. The real result could be the exact opposite for all you know, and you would have made a business decision based on false data...

How big should your A/B test sample be?

There is no magic number that will solve all your problems, sorry. It comes down to how much of an improvement you want to be able to detect: the bigger the lift you want to detect, the smaller the sample size you need. And even if you have Google-like traffic, sample size isn't a stopping condition on its own, as we'll see next. One thing is true for all statistical methods though: the more data you collect, the more accurate or “trustworthy” your results will be. The exact requirements vary depending on the method your tool uses.

To determine sample size, we advise our clients to use a calculator like this one (we have one in our solution, but this one is extremely good too). It gives you an easy-to-read number without you having to worry about the math too much. And it keeps you from being tempted to stop your test prematurely, because you'll know that until this sample size is reached, you shouldn't even look at your data.

You'll need to input the current conversion rate of your page and the minimum lift you want to detect (i.e. the minimum improvement you'd be happy with). We then recommend at least 300 macro conversions (conversions on your primary goal) per variation before even considering stopping the test.

We sometimes shoot for 1,000 conversions per variation if our client's traffic allows it. The larger, the better, as we saw earlier. It could also be a bit less if there is a considerable difference between the conversion rates of your control and variation.
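If you'd rather see the arithmetic behind such calculators, here is a simplified sketch (our own, with hypothetical names) of a classic sample size formula for comparing two proportions, assuming 95% confidence and 80% power. Real calculators may use slightly different formulas and defaults.

```python
# Sketch: rough sample size per variation for detecting a relative lift in a
# conversion rate (two-sided test, 95% confidence, 80% power). Illustrative only.
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """relative_lift is relative, e.g. 0.10 for a 10% improvement over baseline."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Hypothetical example: 3% baseline conversion rate, minimum detectable lift of 10%
print(sample_size_per_variation(0.03, 0.10))  # on the order of 50,000 visitors per variation
```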

Okay, so if I have lots of traffic and reach a large enough sample size with 95% significance in 3 days, that's great, right? Unfortunately not. There are rules about the duration of A/B tests to respect too.

The duration of A/B tests

How long should you run your A/B tests?

You should run your tests for full weeks at a time, and we recommend testing for at least 2-3 weeks. If you can, make the test last for 1 (or 2) business cycle(s).

Why? You already know that for emails and social media there are optimal days (even hours) to post. People behave differently on given days and are influenced by a number of external events. Well, the same is true for your conversion rates. Plot your conversions by day for a week and you'll see how much they can vary from one day to the next. This means that if you started a test on a Thursday, you should end it on a Thursday. (We're not saying you should test for just one week.)

What are the best practices for A/B test duration?

Test for at least 2-3 weeks; more is better. One to two business cycles would be ideal, as you'll capture both people who have just heard of you and people who are close to buying, while accounting for most external factors (we'll talk more about those in our article on external validity threats) and sources of traffic. If you must extend the duration of your test, extend it by a full week.

Check the variability of data during A/B tests

If your significance level and/or the conversion rates of your variations are still fluctuating considerably, keep your test running.

What does variability involve?

Two phenomena to consider here:
  • The novelty effect: people react to your change just because it's new. This effect fades with time.
  • Regression to the mean: this is what we talked about earlier: the more data you record, the closer you get to the “true” value. This is why your tests fluctuate so much at first: you have few measurements, so outliers have a considerable impact.

[Figure: variability of results during an A/B test]

In this example, the results given by the orange curve are too variable to be representative.

Why isn't the significance level enough when results still vary?

This is also why the significance level isn't enough on its own. During a test, you'll most likely hit 95% several times before you can actually stop it. Make sure your significance curve flattens out before calling it. The same goes for the conversion rates of your variations: wait until the fluctuations are negligible given the situation and your current rates.

Some tools give you a confidence interval, for example: “variation A has a conversion rate of 18.4% ± 1.2% and variation B of 14.7% ± 0.8%”. This means the conversion rate of variation A is between (18.4 - 1.2)% and (18.4 + 1.2)%, and the conversion rate of variation B is between (14.7 - 0.8)% and (14.7 + 0.8)%. If the two intervals overlap, keep testing.
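As a quick sanity check, you could encode that overlap rule directly. Here is a trivial sketch (our own helper, not a tool API) using the rates and margins quoted above.

```python
# Sketch: the "do the confidence intervals overlap?" rule from above,
# using whatever rates and margins your testing tool reports.
def intervals_overlap(rate_a, margin_a, rate_b, margin_b):
    low_a, high_a = rate_a - margin_a, rate_a + margin_a
    low_b, high_b = rate_b - margin_b, rate_b + margin_b
    return low_a <= high_b and low_b <= high_a

# Example from the article: A = 18.4% ± 1.2%, B = 14.7% ± 0.8%
if intervals_overlap(18.4, 1.2, 14.7, 0.8):
    print("Intervals overlap: keep testing")
else:
    print("Intervals do not overlap: the difference looks solid")
```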

Your confidence intervals will get more precise as you gather more data. So, whatever you do, don’t report on a test before it’s actually over. To resist the temptation to stop a test, it’s often best not to peek at the results before the end. If you’re unsure, it’s better to let it run a bit longer.

 

To stop an A/B test, consider the following:

  • Is your significance level equal to or greater than 95%?
  • Is your sample large enough and representative of your overall audience in composition and proportions?
  • Have you run your test for the appropriate length of time?
  • Have your significance level and conversion rates curves flattened out?

Only after taking all of these into account can you stop your test. Don't skip them; it could cost you money.

 

