How long should you let a split test run?
Just wrote this up based on a question I got yesterday and I thought it would be useful for you guys! This is always a fun question because there isn’t a clear answer and there's a lot of nuance.

First and foremost, we need to make sure the changes we make don’t HARM conversion rate. A change will hurt conversion about 50% of the time. The trick is we don’t know which times that’s gonna be… so we have to test.

Obviously, the more data we have the better. But we don’t want to run tests for months and months. Ask any statistician if you have enough data and they’re always going to say more is better. But we can’t let tests run forever, so we need to compromise and be ok with some level of uncertainty. At the same time, running a test for one single day also doesn’t feel right (for reasons we’ll go over). So the optimal strategy must be somewhere in the middle.

Let’s go over some of the competing interests:
✅ Volume of visitors in the test - We don’t want to run a test to 20 visitors and decide the variant is a winner because it has one more conversion than the control. More data almost always makes us more certain that a variant really is better than the control.
✅ Difference in conversion rate - A control that has 1% CVR against a variant that has 4% CVR requires less data to be certain that we have an improvement in conversion rate. By the same token, if you have a 1% vs. 1.1% conversion rate, you’re going to need a lot of data to be confident that difference isn’t due to random chance.
✅ Product pricing/AOV - Higher-ticket products can have a lot more variability day to day. If you have a product that’s more expensive, generally that means there’s a longer buying cycle. If your average buying cycle from click to buy is 7 days, you don’t want to make a decision after 4 days. You haven’t even let one full buying cycle run through yet.
✅ Getting a representative sample of traffic (days of week) - Similar to above, when we are making long-term predictions about conversion rate differences, we need to make sure our sample looks like our long-term traffic. Would you want to poll a random set of Americans to make predictions on the Japanese economy? So when running a split test, we want to make sure that we are running it during a relatively normal time period AND that we account for different traffic throughout the week.
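
To put rough numbers on those trade-offs, here is a minimal Python sketch. It uses the textbook normal-approximation sample size formula for comparing two conversion rates, which is one common way to do this, not the only one. The 5% significance level and 80% power are my assumptions (common defaults, nothing magic), and the function names `visitors_per_arm` and `days_to_run` are made up for this example.

```python
import math
from statistics import NormalDist


def visitors_per_arm(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH arm of the test.

    Normal-approximation formula for a two-sided comparison of two
    conversion rates. alpha and power are assumed defaults.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def days_to_run(n_per_arm, daily_visitors, buying_cycle_days=7, arms=2):
    """Days to run: enough traffic for every arm, at least one full
    buying cycle, rounded up to whole weeks so each weekday is covered."""
    days_for_traffic = math.ceil(n_per_arm * arms / daily_visitors)
    days = max(days_for_traffic, buying_cycle_days)
    return math.ceil(days / 7) * 7


big_gap = visitors_per_arm(0.01, 0.04)     # 1% vs. 4% CVR
small_gap = visitors_per_arm(0.01, 0.011)  # 1% vs. 1.1% CVR
print(big_gap, small_gap)
print(days_to_run(big_gap, daily_visitors=1000))
```

Under these assumptions, the 1% vs. 4% test needs a few hundred visitors per arm, while the 1% vs. 1.1% test needs well over 100,000 per arm. Notice that even when traffic alone would get you enough visitors in a day, `days_to_run` still holds the test open for a full week, so you cover one buying cycle and every day of the week.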