Has experimentation become an anchor?

Katie Green x Paul Davidson

Published on

July 1, 2026

About the episode

Most teams blame the platform when experimentation stalls. The uncomfortable truth is that it's usually people and process.

This conversation breaks down how a program drifts into an anchor, and the specific moves that cut one team's test cycles from six weeks to two. You walk away knowing how to simplify your metrics, validate earlier, and hold the trust of stakeholders when real money is on the line.

About our guest

Paul Davidson was a Senior Product Manager at Expedia Group and Senior Product Owner at NI (National Instruments). He spent years as a product manager on an internal experimentation platform, working on the roadmap and alongside the analysts and engineers using it every day. He has watched thousands of tests, and he knows the patterns behind the ones that go nowhere.

Paul Davidson

Product Management

Katie Green

Principal Advocate & Host of Unite Voices

Kameleoon

Key takeaways

Trim your secondary metrics and move your primary metric closer to the change you're testing, so readings come back in days instead of weeks.
Validate early and accept imperfect feature work, because a test that fails fast on instrumentation saves the weeks you would otherwise lose validating at the end.
Protect financial goals with circuit breakers and guardrail metrics, then lean on proven proxy metrics so not every test has to move revenue on its own.

Transcript

Welcome & Introductions

Katie Green: Paul, welcome to Unite Voices. I’m really excited to have you on the show. You have incredible experience in experimentation, and I’m excited for the world to hear what you and I have discussed and get to know you a little bit better. So why don’t you get us started and tell the people who you are?

Paul Davidson: Thanks. I’m Paul Davidson. I was a product manager on an internal experimentation platform for a few years. That work had me operating on two tracks — working on platform features, strategy, and roadmap, but also on the practitioner side: working with product managers, analysts, and engineers who were using the platform and helping them experiment better.

When Experimentation Becomes an Anchor

Katie Green: When you and I last spoke, you mentioned that experimentation can become an anchor — something that teams have to do but aren’t necessarily learning from. I’ve personally seen that a ton, where people have really good experimentation engines but don’t have the infrastructure to be learning at the same rate they’re experimenting. I’m curious to get your learnings on that, because people in the community are probably going to listen and say, “Oh my gosh, that sounds like my program.” Have you ever experienced that? What’s your insight?

Paul Davidson: The feeling of it becoming an anchor — for us, it happened gradually through a series of factors. Teams were designing experiments that were too complex. They were waiting way too late to validate. They weren’t really setting experiments up for a good outcome.

What was happening is that teams were spending months developing really complex features, putting all this work in — and then they’d get to the readout and it would be conflicting or inconclusive. They were left frustrated, with a massive sunk cost fallacy weighing on them. That’s where the anchor started to be felt.

Is the Platform the Problem — or Are People and Process?

Katie Green: Something I’ve seen with teams operating at high scale is they come to the conclusion that maybe the platform is the problem. At Kameleoon, we’re obviously a platform — we talk about this a lot. But how does the tool rationalize some of that thinking? What role does the platform play when you’re running into that kind of scenario?

Paul Davidson: For us — and I’d imagine for Kameleoon — you’re working with a pretty mature product. That product is always going to keep maturing: new features, rounded-out edges, and most importantly, new data. But if you’re working with a growing company, business requirements are going to keep evolving too. We were never going to get to perfect, but people wanted that — and were using it as justification to hold off, saying, “We shouldn’t experiment until this thing is perfect.”

My counter to that was: look, for about 70 to 80% of our use cases, the platform was pretty much complete and ready to go. Instead of a platform challenge, my hypothesis was that it was really a people and process challenge.

That’s uncomfortable for an organization to hear. When it’s a platform feature, you can file it in the backlog and come back to it. But a people problem? It can feel like an indictment — and it’s not something you can solve by pushing out a code change.

Katie Green: I feel like we could have an entirely separate episode about that. It’s really tough. It’s easy for teams to change their platform, and it’s a lot harder to change their culture. But as experimenters, we almost don’t let perfect be the enemy of good when things are going well — yet at the same time, we’re striving for perfect through experimentation. Some people actually believe that.

I always say: the biggest threat to experimentation is not doing it at all. And that feels like exactly what happens when teams say, “The platform is fine, we just can’t figure it out.” So — you’ve observed thousands of tests. Are there particular patterns or traits of failure you saw, both at the micro level of individual tests and the macro level of programs?

Paul Davidson: At the micro level, three core areas kept repeating.

First: the scope of the change. Teams would spend months building something complex — changing 10 or 15 things at once. And when you’re trying to understand what moved the needle, some of those changes might be pushing your goal up while others push it down. It’s really hard to untangle what actually happened.

Second: tests came in too late. All the development was done, costs were sunk, and it was really painful to step back. That led to a behavior we called “trying to save the test” — digging through results post hoc, looking for anything that would justify a rollout, because no one wanted to go back through that process again.

Third: the metrics. Over time, the metric list had really grown. Even though we only supported one primary metric, secondary metrics had ballooned to 10 or 15 in a lot of cases — all overlapping and contradicting each other. On top of that, teams often chose a really insensitive primary metric, tied directly to their org’s financial goals. They felt every single experiment had to prove it moved the big thing. But that wasn’t achievable in most cases, especially for teams who were way up funnel from where those financial goals were happening.

The Drift Problem in Decentralized Programs

Katie Green: I want to attach onto that — “teams who are way up funnel from where the financial goals are happening” sounds like a large, decentralized program. You and I have talked about how culture can drift when a program is too decentralized. Is that drift inevitable? Is it preventable? What should teams be looking out for?

Paul Davidson: The challenge with a decentralized org is that over time, it loses the central connective tissue that would pull everyone back together. As the org grew — through leadership changes, shifting financial goals, without a central anchor — people gradually forgot why we were there. The drift just happened quietly.

We tried to address it through a few iterations: champions, heroes, informal groups. But that’s really challenging, because those tend to be informal roles that don’t show up on your performance review. It’s hard to get people interested even if they genuinely love experimentation, because their annual review says they need to go after their product goals. And when they have to choose between joining the experimentation community or driving product outcomes, they’ll always pick the product goals.

Katie Green: And that connects to something I’ve talked about a lot on this show: why would a team test if their OKR is to ship more features? If testing slows shipping, the incentive to experiment disappears. I want to make sure we come back to metrics, but before we do — something you’ve mentioned is that experimentation for learning has been abandoned. Are you seeing it shift toward a validation exercise for decisions that are already made?

Paul Davidson: That’s exactly the trap we’d fallen into. There were financial goals set at the top, and teams were measuring themselves against them — individually, as a team, as an org — all rowing toward the same target. A lot of pressure.

The drift meant every experiment had to serve those financial goals, not generate learning. That led to a bunch of really bad behaviors: running long tests just to try to get enough statistical power to detect a meaningful signal — which could still come back inconclusive at the end. They weren’t learning much. Experimentation had become just the thing you do before you go live. That’s all.

And there were all these assumptions being made about how metrics correlate with each other — and with the core purpose of a feature. That’s what we really started challenging in our pilot: what if a feature’s purpose isn’t always to progress users through a clean, linear path? The reality is the user journey is messy. People loop. What if a feature doesn’t always move users forward — what if it actually serves users better by showing them they’re in the wrong place and getting them back faster?

Blowing Up the Playbook: A Pilot in Redesign

Katie Green: I want everyone to take away from this that the user journey is not A to B. Let’s get into what you actually did with your teams. You worked with a team through the very uncomfortable task of redesigning their approach — whittling 15 key metrics down to three. What happened when you did that? Can you walk us through the tactical experience?

Paul Davidson: The first key was finding the right team. A lot of teams knew they were experiencing some of this pain — but we found a team that was acutely aware of it. I got halfway through my sales pitch and they said, “Yeah, we’re in.” Easier than expected.

Then the real work started. I asked the product manager to share her strategy documents, PRDs, and some recent experiments. That gave me a foundation. From there, I identified three work streams, and pulled together a tight team of three: the product manager, her analyst, and me.

I took everything she had shared and generated the experimentation strategy for the first experiment stream myself. That was harder than I expected; it took me about an hour just to really think it through. I’d seen the problem from a distance, but once I got my hands dirty, I realized how much I’d been influenced by the org’s thinking. Pulling apart 10 or 15 metrics and getting down to three is genuinely uncomfortable. Those metrics are always there; they feel important.

We then got together with the team for a refinement process. They had a ton of questions — which was great — and we discovered some gaps in their understanding of platform features too. We whittled it down, launched our first test, and got good learnings from it.

Katie Green: Something practitioners are really going to value from this episode: your test cycles dropped from six weeks to two weeks. That time savings is going to inspire a lot of people. What do you think was the primary driver? For someone listening who wants to make that same jump — what are the tangible things that moved the needle?

Paul Davidson: It was all the things we talked about. First, we got the change into the experiment much sooner, and we were willing to accept that the feature work wasn’t perfect yet. We also accepted that the metrics weren’t fully instrumented the way we wanted — and because we were moving fast and willing to launch anyway, we actually discovered the test had failed within a couple of days. Not because of the experience or a bug, but because of the metrics.

That was actually really helpful. If we had waited until the end, it still would have failed — we just would have lost more time. By reducing the metrics, the results came back crystal clear. We moved the key metric much closer to the actual change. We cut the conflicting signals. Some of the old metrics we kept around as informational — they’d show up in the readout, but we committed not to make decisions on them.

And here’s a side learning: some things the team was convinced were highly correlated to the change showed no correlation at all — p-values close to one. The metric reduction meant fewer things to interpret, and the metrics we kept were sensitive enough to get a strong reading within days. That opened up the possibility of not even needing a minimum two-week runtime for some tests.
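The sensitivity point can be made concrete with a standard sample-size approximation for comparing two conversion rates. The sketch below is illustrative only: the function name, the baseline rates, and the 5% relative lift are assumptions, not figures from the pilot. It shows roughly how many users each arm needs when the primary metric sits close to the change versus at the bottom of the funnel.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    delta = baseline_rate * relative_lift          # absolute difference to detect
    variance = 2 * baseline_rate * (1 - baseline_rate)  # both arms near the baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Hypothetical numbers, for illustration only.
# A metric close to the change (e.g. interaction with the changed element).
near_metric = users_per_arm(baseline_rate=0.30, relative_lift=0.05)
# A metric far down the funnel (e.g. completed purchase).
far_metric = users_per_arm(baseline_rate=0.02, relative_lift=0.05)

print(f"near the change: {near_metric} users per arm")
print(f"bottom of funnel: {far_metric} users per arm")
```

Under these assumed rates, the bottom-of-funnel metric needs many times more traffic to detect the same relative lift, which is one way a test ends up running for weeks and still reading as inconclusive.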

Building Trust When Executives Want to See Dollars Move

Katie Green: The idea of moving the metric closer to the actual change — that’s something I struggled with early in my CRO career. Decision makers would say, “I don’t care if the navigation change increased navigation clicks, as long as it’s increasing dollars.” And I’d say, “But there are so many steps between those two things.” How do you convince a team to trust a metric closer to the change? What does that trust-building look like?

Paul Davidson: A few things to unpack there. First, we had implemented circuit breakers and canary metrics — protections that fire the moment a test launches to flag large degradations. And we had guardrail metrics tied to the financial goals that ran at the end of the experiment. So the protections were already built in and automatically calculated. Leaders had a safety net.
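As a rough illustration of the circuit-breaker idea, the sketch below halts a test when a canary metric degrades sharply soon after launch. Everything in it is hypothetical: the `ArmStats` type, the error-rate canary, and the thresholds are assumptions, and it is a plain threshold check rather than the statistical method any particular platform uses.

```python
from dataclasses import dataclass


@dataclass
class ArmStats:
    users: int
    errors: int  # hypothetical canary metric: users who hit an error

    @property
    def rate(self) -> float:
        return self.errors / self.users if self.users else 0.0


def should_trip(control: ArmStats, variant: ArmStats,
                max_relative_increase: float = 0.5,
                min_users: int = 500) -> bool:
    """Return True when the variant's canary metric degrades past the threshold."""
    if min(control.users, variant.users) < min_users:
        return False  # too little traffic to judge yet
    if control.rate == 0:
        return variant.rate > 0  # any errors against a clean control trips the breaker
    relative_increase = (variant.rate - control.rate) / control.rate
    return relative_increase > max_relative_increase


# Hypothetical early readings shortly after launch.
control = ArmStats(users=1000, errors=10)
variant = ArmStats(users=1000, errors=30)
print(should_trip(control, variant))
```

A production version would likely add a significance test so that noise in small samples does not stop a healthy experiment; the minimum-traffic check here is only a crude stand-in for that.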

Second: not everything needs to move the financial goal today. One way to address that is working with an analyst to develop a rigorous proxy metric and prove the correlation — showing that when this funnel metric moves, there’s a statistically demonstrated link to what moves at the bottom of the funnel.

The other approach I was advocating for is simpler: collect your learnings from a series of short tests, make your tweaks, and then — if you really need that bottom-line reading — run one bigger experiment at the end to prove you can move the financial goal. In many cases though, I don’t think that’s necessary. A well-developed proxy metric can just get the job done.
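One simple way an analyst might begin checking a candidate proxy metric is to compare, across past experiments, the lift each test produced in the proxy with the lift it produced in the financial metric. The sketch below computes a Pearson correlation over made-up data; the lift values and variable names are hypothetical, and a real validation would need far more experiments and a more careful method than a single correlation coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical relative lifts from six past experiments (illustration only).
proxy_lifts = [0.04, -0.01, 0.06, 0.02, -0.03, 0.05]            # funnel metric near the change
revenue_lifts = [0.010, -0.002, 0.013, 0.004, -0.008, 0.011]    # bottom-of-funnel metric

r = pearson(proxy_lifts, revenue_lifts)
print(f"correlation between proxy and revenue lifts: {r:.2f}")
```

A strong correlation across many experiments is evidence that the proxy tracks the financial goal, but it does not prove that moving the proxy causes revenue to move, which is why the guardrail metrics stay in place.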

The Role of AI in Sustaining Experimentation Culture

Katie Green: I have to ask the AI question — everyone will be looking for it. You did a lot of work breaking this down and rebuilding it. I don’t think AI could have done that heavy lifting. But is there a role for AI in creating sustainability long term, especially for decentralized programs?

Paul Davidson: Absolutely. I think AI mixed with traditional logic and metrics can do a lot here. You can instrument health dashboards. You can layer in insights. You could take the learnings from this kind of pilot and train an AI to look for those same patterns — signs of drift, runaway metric lists, experiments that are too broad.

You can also use it to start showing how things correlate at scale: not just how individual tests are moving the needle, but how the full body of work across teams and sister orgs is shifting the big funnel metrics over time. A lot of that is really unclear right now. AI can start recognizing those connections — bringing clarity to what analytics alone can’t easily surface.

Monday Morning Advice: Don’t Boil the Ocean

Katie Green: We’re coming up on the end of our time, and I want to ask my favorite question: the Monday morning advice. If someone is listening to this and their program feels gated — like an anchor they can’t escape — and you’ve successfully done this work, what’s the first thing they should do tomorrow morning?

Paul Davidson: Find a beachhead. Don’t try to boil the ocean. Find someone who’s bought in, who’s feeling that pain acutely — sit down with them, learn what they’re working on, and bring them an alternate plan. Focus on the fundamentals: don’t test too much at once, reduce the number of metrics, increase metric sensitivity. Work through it together.

Then, if you can show the change and show the outcomes the way we did, go on a promotional tour. We had a monthly newsletter. We had a full piece written by that product manager from her own point of view. We opened a Slack channel, ran office hours, and I built a full training out of it. We ran a version of that with one team, and then we started institutionalizing it within our learning and development so that I wasn’t in the loop anymore.

Simplify, find a beachhead, win it — and then use that momentum to roll it out.

Katie Green: I’m calling this episode “Don’t Boil the Ocean.” That’s perfect advice. And I hope we can do a part two in a year — I want the update on all things Paul. Thank you so much for being on the show.

Paul Davidson: Thank you very much.

Katie Green: We’ll see you next time. Bye!

Paul Davidson: Bye.

When experimentation becomes an anchor: 5 signs your program has stopped learning

Usually, experimentation teams are full of hardworking individuals building complex features, but none of them can say what they’ve learned this quarter because the whole process is on autopilot.

Paul Davidson, an engineer and product leader with experience at Expedia and National Instruments, calls this the anchor, a program that weighs down the teams it’s supposed to strengthen.

In his recent chat with Katie Green on Unite Voices, he noted that there are ways to check the health of your A/B testing program, and ways to solve each problem as it arises.
