How long should you let a split test run?
Just wrote this up based on a question I got yesterday and I thought it would be useful for you guys!
This is always a fun question because there isn’t a clear answer and there's a lot of nuance.
First and foremost, we need to make sure the changes we make don’t HARM conversion rate. That will happen roughly 50% of the time. The trick is we don’t know which times that’s gonna be… so we have to test.
Obviously, the more data we have the better. But we don’t want to run tests for months and months.
Ask any statistician if you have enough data and they’re always going to say more is better. But we can’t let tests run forever, so we need to compromise and be OK with some level of uncertainty.
At the same time, running a test for one single day also doesn’t feel right (for reasons we’ll go over).
So the optimal strategy must be somewhere in the middle.
Let’s go over some of the competing interests:
✅ Volume of visitors in the test - We don’t want to run a test with 20 visitors and declare the variant a winner because it has one more conversion than the control. More data gives us more certainty that a variant really is better than the control.
✅ Difference in conversion rate - A control with a 1% CVR and a variant with a 4% CVR requires less data to be certain we have a real improvement. By the same token, if you have 1% vs. 1.1%, you’re going to need a lot of data to be confident that difference isn’t just random chance (there’s a quick sample size sketch after this list).
✅ Product pricing/AOV - Higher ticket products can have a lot more variability day to day. A more expensive product generally means a longer buying cycle. If your average buying cycle from click to buy is 7 days, you don’t want to make a decision after 4 days - you haven’t even let one full buying cycle run through yet.
✅ Getting a representative sample of traffic (days of week) - similar to the above, when we’re making long term predictions about conversion rate differences, we need a sample that’s close to our long term traffic. Would you poll a random set of Americans to make predictions about the Japanese economy? So when running a split test, make sure you’re running it during a relatively normal time period AND that you account for how traffic differs throughout the week.
If you have awesome performance on Saturday and Sunday, make sure you aren’t just running your test from Monday - Friday.
✅ Funnel stage - Upsell tests can be called with less data than frontend funnel tests. We have a biased sample, in a good way - everyone in an upsell test has already made a purchase, so they are NOT a random subset of visitors.
✅ Accurately estimating the impact a test will have - we want to make sure a test isn’t hurting CVR, but usually we also want to know how much it’s helping. If the variant shows a 100% improvement in CVR after 3 days, I feel decently confident it won’t harm CVR - but I don’t feel confident predicting it will actually lift the baseline by 100%.
To understand the incremental impact, more data and more time are better.
✅ Potential negative impact of the change - if you’re split testing different software/tech stacks, you want to be as CERTAIN as possible that you aren’t implementing tech that will be tough to unwind and could hurt your entire business.
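Here’s the sample size sketch I mentioned on the “difference in conversion rate” point. This is just my own rough illustration using the standard two-proportion sample size formula, assuming 95% confidence and 80% power (not numbers from any particular split testing tool), to show why a big gap needs way less data than a tiny one:

```python
# Rough sketch: visitors needed PER VARIANT to detect a CVR difference.
# Assumptions (mine): 95% confidence, 80% power, standard two-proportion formula.
from math import sqrt, ceil
from statistics import NormalDist

def visitors_per_variant(control_cvr, variant_cvr, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variant to detect the difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (control_cvr + variant_cvr) / 2         # pooled conversion rate
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(control_cvr * (1 - control_cvr)
                                  + variant_cvr * (1 - variant_cvr))) ** 2
    return ceil(numerator / (control_cvr - variant_cvr) ** 2)

# Big gap: 1% vs 4% CVR -> a few hundred visitors per variant
print(visitors_per_variant(0.01, 0.04))
# Tiny gap: 1% vs 1.1% CVR -> well over 100,000 visitors per variant
print(visitors_per_variant(0.01, 0.011))
```

Same baseline, totally different data requirements - that’s why “how long should I run it” depends so much on the size of the lift you’re chasing.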
Now, putting that all together, here are some general guidelines that are good enough to get you started and safe enough to follow:
✅ 14 days to call a winner
✅ 7 days to call a loser
✅ A few thousand sessions per variant (less if testing upsells)
✅ 90% chance to win from your split testing software
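On that “90% chance to win” number: different tools calculate it differently, but here’s a rough sketch of one common approach (Bayesian, comparing Beta posteriors) so you have a feel for what’s behind it. The visitor and conversion counts below are made up for illustration:

```python
# One common way a "chance to win" number gets computed (Bayesian Beta posteriors).
# Illustration only, with made-up numbers - not how any specific tool does it.
import random

def chance_to_beat_control(control_visitors, control_conversions,
                           variant_visitors, variant_conversions,
                           simulations=100_000):
    """Estimate P(variant CVR > control CVR) by sampling Beta posteriors."""
    wins = 0
    for _ in range(simulations):
        control_draw = random.betavariate(1 + control_conversions,
                                          1 + control_visitors - control_conversions)
        variant_draw = random.betavariate(1 + variant_conversions,
                                          1 + variant_visitors - variant_conversions)
        if variant_draw > control_draw:
            wins += 1
    return wins / simulations

# Hypothetical: ~2,000 sessions per variant, 2.0% vs 2.6% CVR
print(chance_to_beat_control(2000, 40, 2000, 52))  # lands around 0.9, i.e. ~90% chance to win
```

Notice that even with a couple thousand sessions per variant, a modest lift only gets you to roughly 90% - which is exactly why the guidelines above ask for a few thousand sessions and a couple of weeks.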
If you have a case that you think might be different, just post it below and I’ll let you know what I think.