In 2013, Deng, Xu, Kohavi, and Walker published a paper, Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data, in which they demonstrate several ways to improve the sensitivity of online controlled experiments and introduce a technique called Controlled-experiment Using Pre-Experiment Data (CUPED). The goal of CUPED is to introduce a new estimator with lower variance than the usual estimator.
In this technical article, we’ll summarize why reducing variance is important in A/B testing, and how it can lead to faster test results and better effect estimations.
For a shorter overview on the benefits and Kameleoon-specific uses of CUPED, we recommend reading our blog article, What we learned from running 200+ experiments on CUPED.
Variance in A/B testing
Variance is a statistical measure of dispersion around the mean: it quantifies how far a set of values is spread out from their average. In an A/B testing setup, the Central Limit Theorem tells us how our estimator of the variation’s mean effect is distributed around its true value, and the variance of that distribution determines how precise the estimate is.
Typically, we do not work directly with the variance. Instead, we look at the standard deviation, which is the square root of the variance. The standard deviation, denoted as σ, is homogeneous with the data we manipulate (it is expressed in the same units). It directly affects the width of our confidence intervals, as shown in the general formula used to estimate the confidence interval for the difference between the means of samples 1 and 2:
(μ1 − μ2) ± Z · σP · √(1/n1 + 1/n2)
Here, μ1 and μ2 represent the samples’ means, n1 and n2 the samples’ sizes, Z the statistic matching our desired confidence level, and σP the pooled standard deviation of the two samples. We can easily deduce that, as the standard error increases, our interval widens. Accordingly, a reduction in the standard deviation will result in the confidence interval shrinking, which will allow us to get a better estimate. You can read more on this topic in Chapter 10 of Alex Deng’s (2021) publication titled Causal Inference and Its Applications in Online Industry.
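To make this relationship concrete, here is a minimal Python sketch of the interval’s half-width (the function name and the sample values are illustrative, not from the publication):

```python
import math

def ci_halfwidth(sigma_pooled, n1, n2, z=1.96):
    """Half-width of the confidence interval for the difference in means
    (z = 1.96 corresponds to a ~95% confidence level)."""
    return z * sigma_pooled * math.sqrt(1 / n1 + 1 / n2)

wide = ci_halfwidth(sigma_pooled=2.0, n1=5000, n2=5000)
narrow = ci_halfwidth(sigma_pooled=1.0, n1=5000, n2=5000)
# Halving the pooled standard deviation halves the interval width.
```

With the sample sizes held fixed, shrinking σP is the only remaining lever for a narrower interval, which is exactly what variance reduction techniques target.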
Variance reduction techniques are widely studied and crucial in many different fields. The simplest way to reduce variance is by increasing the sample size, since there is a direct link between sample size and variance. However, in the realm of A/B/n testing, increasing the sample size usually means collecting more data and running the test for a longer period of time, which is not always feasible, nor desirable. For this reason, researchers have developed other variance reduction techniques, including CUPED.
Reducing the variance with CUPED
Deng et al. (2013) introduced CUPED while working at Microsoft and exploring ways to reduce the variance of their estimators in order to detect smaller lifts for a given test.
However, as the saying goes, there is no such thing as a free lunch: if we want to improve our estimator, we have to pay a price. With CUPED, that price does not come in the form of a longer test. Additionally, because CUPED is a post-assignment variance reduction technique, it adds no complexity to the experiment setup: we can simply apply it after collecting test data.
CUPED is based on the control variate method, which aims to reduce the error of an estimate of an unknown quantity by exploiting information about the errors in estimates of known quantities. To do this, we need to identify a second variable (covariate) that is correlated with the variable we want to estimate. We will use this covariate to build our CUPED metric and estimate the lift. The choice of covariate is critical, as it directly determines the amount of variance we can explain: the variance reduction is directly proportional to the level of correlation between the covariate and the metric of interest.
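As a sketch of the control variate idea on synthetic data (the helper function is ours, not Kameleoon’s implementation): the CUPED metric is Y − θ(X − mean(X)), with θ = cov(X, Y) / var(X) chosen to minimize the adjusted variance.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return Y - theta * (X - mean(X)), where theta = cov(X, Y) / var(X)
    is the variance-minimizing control variate coefficient."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)            # covariate, e.g. a pre-experiment metric
y = 0.8 * x + rng.normal(size=100_000)  # in-experiment metric, correlated with x
y_cuped = cuped_adjust(y, x)
# Var(Y_cuped) is roughly (1 - rho^2) * Var(Y): the stronger the correlation,
# the larger the reduction, while the mean of the metric is unchanged.
```

Because the adjustment term has mean zero, the adjusted metric estimates the same quantity as the original one, only with less noise.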
A key point in Deng et al. (2013) is that the covariate must be independent of the treatment’s effect to avoid introducing bias. One great solution is to bring in additional information using data already at our disposal prior to the experiment’s launch: pre-experiment behavior cannot be affected by the treatment, so it is guaranteed to have this property.
It’s also important to note the behavior of CUPED in cases where the covariate is categorical. For example, when we use a boolean covariate, we can see CUPED as making one estimation for when our covariate is 0, then a separate one for when our covariate is 1. The combination of those two estimations yields the CUPED estimation. Deng et al. (2013) demonstrate that this is exactly equivalent to applying stratification (another variance reduction technique) with the same covariate.
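A small numerical sketch of this equivalence, on synthetic data and with an assumed known population share w of visitors whose covariate equals 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = rng.binomial(1, 0.3, size=n).astype(float)  # boolean covariate
y = 2.0 + 1.5 * x + rng.normal(size=n)          # metric correlated with x
w = 0.3                                          # known population share of x == 1

# CUPED estimate of the mean, centering x at its population mean w
theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
cuped_mean = np.mean(y - theta * (x - w))

# Stratified estimate: per-stratum means weighted by the population shares
strat_mean = w * y[x == 1].mean() + (1 - w) * y[x == 0].mean()
# For a boolean covariate, the two estimates coincide (up to floating point):
# theta reduces to the difference of the per-stratum means.
```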
How CUPED works at Kameleoon
To improve the accuracy of our experiment, we need to select a variable that is highly correlated with the outcome variable and built from data collected prior to the experiment. Deng et al. (2013) recommend using the same variable from the pre-experiment period, as empirical results showed it yielded the best results.
At Kameleoon, we conducted our own tests using the different variables at our disposal. On average, we found that using the same objective (i.e., experiment goal) evaluated over the two weeks prior to launching the experiment yielded the best results. (In other words, we estimated the correlation between conversions on the main goal of a live experiment and conversions on that same goal during the two weeks preceding its start.) By "best results" we mean that the correlation between the pre-experiment variable and the outcome variable was maximal, while keeping the request complexity low enough.
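A sketch of this kind of check on synthetic conversions (the conversion rates below are invented for illustration): the fraction of variance CUPED can remove is ρ², the squared correlation between the pre-experiment and in-experiment metric.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
# Conversions during the two weeks before the experiment (illustrative rate)
pre = rng.binomial(1, 0.10, size=n).astype(float)
# In-experiment conversions: past converters are more likely to convert again
during = np.where(pre == 1,
                  rng.binomial(1, 0.40, size=n),
                  rng.binomial(1, 0.08, size=n)).astype(float)

rho = np.corrcoef(pre, during)[0, 1]
expected_reduction = rho ** 2  # fraction of variance CUPED would remove
```

If ρ is near zero, as it would be for a site with mostly new visitors, the adjustment buys almost nothing, which motivates the guidance in the sections below.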
Why should you use CUPED?
Using CUPED will have a direct impact on organizations’ experimentation processes and strategies, as it allows for faster iteration on test ideas and provides more confidence in decision-making. Additionally, it gives experimenters better estimates for the effect of variations and reduces the risk of getting a false positive result, without having to collect more data. Better yet, this technique does not add complexity to the experiment setup.
When should you use CUPED?
In order for CUPED to be effective, there needs to be a correlation between goal conversions before the start of the experiment and during the live experiment. This means that we need to have been collecting data for the main goal of a current experiment (for which we want to enable CUPED) prior to launching the experiment. The stronger the correlation between pre-experiment and in-experiment goal conversions, the better we’ll be able to predict the real impact of the current test. Therefore, we recommend applying CUPED only to the key goals involved in your decision-making, like transactions.
Is CUPED suitable for all sites?
After running 200+ internal tests, we have found that CUPED has the most impact on an experiment when it includes returning visitors: CUPED can improve predictions when the experiment is exposed to returning visitors because we have historical data on them. Conversely, we won’t have pre-experiment data for new visitors that could help us explain their behavior.
The estimates of CUPED get more precise with time and use, so the more we use Kameleoon, the more experiment data we’ll have that our algorithm can use to improve the effectiveness of CUPED.
A note on methodology
It’s important to note that the original and the CUPED-adjusted results of an experiment should not be compared. Moreover, after enabling CUPED for an experiment, we advise against switching back to the original results. Comparing or switching between the two results can inflate the risk of getting a false-positive result, as experimenters may be tempted to make decisions based on the more favorable-looking results.
Therefore, we suggest that users commit to applying CUPED to their results prior to running the experiment, then stick to their decisions post-launch.
Some empirical results
As mentioned above, we analyzed over 200 experiments in Q1 of 2023 in order to estimate correlations and how they translate into experimentation speed. In these analyses, we compared the metric of each experiment to the same metric over the two weeks prior to the start of the experiment. This allowed us to compute the correlation between the two, then estimate the required increase in sample size to get a similar effect.
We observed, as expected, that the variance reduction is highly connected to the number of returning visitors, and to their behavior. For experiments with over 5% returning visitors (for all of whom we had pre-experiment data), we observed the highest variance reduction.
To get a more precise estimate on the impact of CUPED, we filtered out experiments that had less than 5% returning visitors. This yielded a 5-40 percent variance reduction, which translates to a potential reduction of 10-60 percent in sample size.
We also estimated the impact of CUPED on experiments that had fewer targeted visitors present during the pre-experiment time window. For these experiments, the variance reduction was substantially smaller, with the potential sample size reduction ranging from only 1 to 10 percent.
In the coming months, we will monitor the performance of CUPED and keep experimenting with its parameters in order to continuously get the most value out of this method.
We are excited to see the difference this powerful technique is going to make for our users! Make sure to check CUPED out on your individual campaigns’ Reporting pages and activate it for your KPIs.