This interview is part of Kameleoon's Expert FAQs series, where we interview leading experts in data-driven CX optimization and experimentation. Georgi Georgiev is an expert on A/B test statistics and the author of “Statistical Methods in Online A/B Testing.” Georgi also developed A/B testing statistical tools available at Analytics-toolkit.com.
What are the implications of running multiple tests at the same time?
Running concurrent tests is one of the best things you can do for your A/B testing program if you are not doing so already. Far too many CROs worry excessively about interaction effects stemming from running tests simultaneously, whether on a single page or across a single website. The more significant issues typically lie elsewhere: statistical rigor, representativeness of results, and the choice of measurements.
Unfortunately, worries about concurrently running tests lead some practitioners to adopt counterproductive ‘solutions’ such as siloing tests or running only a single test at any given time. The latter unnecessarily throttles the whole experimentation program: many changes end up released without any testing whatsoever due to the self-imposed capacity limitation. The former has the same drawbacks and, in addition, results in a high probability of releasing untested user experiences that would otherwise have been part of a test.
While some probability of bad interaction effects exists, it is greatly overblown and overshadows more critical issues. For the most part, you should not worry much about interaction effects from concurrent A/B tests unless the tests do something obviously questionable, such as testing new text and a new color for a button in one test while testing its complete removal in another. It would be better if these two tests did not overlap in time; alternatively, these experiences can be tested together against the current state of things in an A/B/N test.
Is it ok to use different thresholds of confidence/power in an experiment?
I’m completely behind the idea that each A/B test is different, implying that a somewhat different confidence threshold and sample size can be used depending on the test. The question is how these are chosen. The crucial element is that any change in rigor weighs risk against reward.
In my view, a low uncertainty is just one aspect of statistical rigor, and statistical rigor is just one aspect of a rigorously conducted experiment. Think of external validity as another, for example.
I think you should not compromise the overall rigor of experimentation, in business or otherwise. However, designing tests that allow for a high(er) level of uncertainty is definitely on the table if it is justified. Since there is no default level of acceptable uncertainty in business experiments, any chosen confidence threshold and sample size must be justified.
Clinical trials are an exercise in balancing risk and reward, too, but their outcomes can have broad potential implications over an undefined timeframe involving many stakeholders. This makes a rigorous estimation of risks and rewards an impossible task, and the same follows for setting the level of acceptable uncertainty. That’s why the FDA sets a minimum threshold for uncertainty, and most clinical experiments use that threshold.
Contrast the nature of a clinical experiment to that of a business experiment:
- Typically the outcome of a business experiment can be captured in one or two primary metrics.
- The use of the test result is relatively well defined in terms of scope and timeline.
- The stakeholders are limited in number, and their interest is known.
All of the above allow businesses to better estimate the risk/reward relationship in various scenarios for the confidence threshold and sample size (or duration) of a test.
Explore different scenarios, possibly with the help of optimization algorithms, to find the combination that results in an optimal return on investment. In a business experiment you can choose a confidence threshold and sample size that are either more or less rigorous than those of a clinical trial; the confidence level for a given test might be 80%, 90%, or 99.9%, and that alone has no bearing on how good the test is from a business perspective. If that particular level of uncertainty produces the best test ROI, it is the one to use. If the test is rigorous as a whole, the threshold of uncertainty that has to be met simply reflects the optimal balance between risk and reward.
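One way to make the scenario-exploration step concrete is a small script that, for a few candidate confidence thresholds, computes the required sample size and the implied test duration. This is only an illustrative sketch, not Georgi's own ROI model: the baseline rate, effect size, and weekly traffic are hypothetical numbers, and the calculation uses the standard two-sided z-test sample-size formula for two proportions.

```python
from statistics import NormalDist

# Hypothetical inputs -- illustrative assumptions, not values from the interview.
baseline = 0.05        # baseline conversion rate (5%)
mde = 0.10             # minimal detectable effect, relative (a 10% lift)
weekly_users = 20_000  # users available to the test per week

def sample_size_per_arm(alpha: float, power: float) -> float:
    """Required users per arm for a two-sided z-test on two proportions."""
    z = NormalDist().inv_cdf
    p1 = baseline
    p2 = baseline * (1 + mde)
    z_alpha = z(1 - alpha / 2)
    z_beta = z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Each confidence threshold implies a different duration, and therefore
# a different point on the risk/reward curve.
for conf in (0.80, 0.90, 0.95, 0.999):
    n = sample_size_per_arm(alpha=1 - conf, power=0.80)
    weeks = 2 * n / weekly_users  # two arms share the weekly traffic
    print(f"confidence {conf:.1%}: {n:,.0f} users/arm, ~{weeks:.1f} weeks")
```

In a fuller treatment, each row would be paired with an estimate of the cost of a wrong decision and the value of a faster decision, and the threshold with the best expected ROI would win.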
What are the most dangerous misinterpretations people make when analyzing test results?
It is hard to single out just one. Assuming the test was well executed and the statistical analysis is not flawed, I think the most dangerous misinterpretation is forgetting the uncertainty in the data. For example, seeing a 10% increase in average revenue per user in the test’s outcome and confusing it with the actual gain, which could just as well be 10%, 15%, or 2%.
While 10% is the most likely value estimated from the data, a range of uncertainty around it should always be reported. For example, a lower bound of a confidence interval at 1% suggests significant uncertainty, whereas a bound at 8% would imply much lower uncertainty.
It’s also important to carry this uncertainty into derivative estimates such as a longer-term prognosis. Such a prognosis cannot be just a single number; it should come with a lower interval bound. For example, if one extrapolates a $15 million business impact based on the 10% estimate, it should be accompanied by a lower bound of $1.5 million, corresponding to the 1% lower interval bound.
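The arithmetic behind that example can be sketched in a few lines. The baseline revenue figure below is an assumption, chosen so that a 10% lift corresponds to the $15 million impact from the text; the point is simply that the interval's lower bound must be propagated alongside the point estimate.

```python
# Propagating estimation uncertainty into a business projection.
# baseline_revenue is a hypothetical figure, not from the interview.
baseline_revenue = 150_000_000  # annual revenue the lift applies to (assumed)
lift_estimate = 0.10            # observed relative lift in ARPU
lift_lower_bound = 0.01         # lower bound of the confidence interval

projection = baseline_revenue * lift_estimate           # ~$15M point estimate
projection_lower = baseline_revenue * lift_lower_bound  # ~$1.5M lower bound

print(f"Projected impact: ${projection:,.0f} "
      f"(lower bound: ${projection_lower:,.0f})")
```

Reporting both numbers makes clear that the prognosis is a range, not a promise.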
How can you measure the impact of an experiment on longer-term outcomes (rather than an action which happens in a single session), such as the impact on loyalty, return rates, or customer lifetime value?
A single session is too short a timeframe for many A/B tests, so user-based metrics are preferred, alongside tests that last at least one to two weeks.
There are different approaches to capturing longer-term outcomes, and none is perfect. For example, one can use well-correlated proxy metrics which react more quickly. For something like LTV, the proxy metrics will be inputs to the LTV model. All such approaches assume that the historical relationship between model and performance, or between proxy metric and primary metric, will hold for the tested variant just as it has held so far. If the assumption does not hold, the implications can be significant. What makes the exercise more error-prone is that you can only discover whether the historical relationship holds at some point in the future, say two years after running the test.
If you can get stakeholders on board, running tests for much longer can be a solution. Still, even this is not without issues due to problems with identity persistence and various technological challenges, among others. Businesses whose nature leads to persistent identification of their clients and prospective clients are better positioned to run such tests than your typical ecommerce store.
How can scientific rigor be used to support creativity and innovation in business?
True creativity and innovation absolutely need scientific rigor. It’s how you can prove something novel is also useful to an inherently skeptical audience. It is an invaluable tool when creativity and innovation need to break through risk-averse management layers or convince external clients wary of sales pitches.
Even the most skeptical crowd has to settle on a standard of evidence beyond which they have to accept the benefit of an unlikely breakthrough. A well-designed controlled experiment can provide just that standard of evidence for causal impact. Even better, it will not only prove the innovation useful, but it can give an estimate of that utility. With such a measure, creativity and innovation can be rewarded appropriately.
On the flip side, if a test results in an idea being shut down, it probably was not a true innovation. Discarding a poor solution quickly rather than adopting it in the long run likely prevented unnecessary harm to the business and its clients.
If you could ensure everyone in the experimentation industry knew one statistical principle, which would it be?
The severity principle of D. Mayo.* It can be viewed as a principle of statistical inference or, more broadly, as a principle of scientific inference. Simple in its brevity and profound in its implications, it should be at the heart of all experiments.
*It’s best outlined in an accessible manner in her latest book, “Statistical Inference as Severe Testing” (2018).