How long should you let a split test run?
Just wrote this up based on a question I got yesterday and I thought it would be useful for you guys! This is always a fun question because there isn't a clear answer and there's a lot of nuance. First and foremost, we need to make sure the changes we make don't HARM conversion rate. That will happen about 50% of the time. The trick is we don't know which times that's gonna be... so we have to test. Obviously, the more data we have the better. But we don't want to run tests for months and months. Ask any statistician if you have enough data and they're always going to say more is better. But we can't run tests forever, so we need to compromise and be ok with some level of uncertainty. At the same time, running a test for one single day also doesn't feel right (for reasons we'll go over). So the optimal strategy must be somewhere in the middle. Let's go over some of the competing interests:
Volume of visitors in the test - We don't want to run a test on 20 visitors and decide the variant is a winner because it has one more conversion than the control. More data almost always makes us more certain that a variant is indeed better than the control.
Difference in conversion rate - A test where the control has a 1% CVR and the variant has a 4% CVR requires less data to be certain that we have an improvement in conversion rate. By the same token, if you have a 1% vs. 1.1% conversion rate, you're going to need a lot of data to be confident that the difference isn't due to random chance.
Product pricing/AOV - Higher ticket products can have a lot more variability day to day. If you have a product that's more expensive, generally that means there's a longer buying cycle. If your average buying cycle from click to buy is 7 days, you don't want to make a decision after 4 days. You haven't even let one full buying cycle run through yet.
Getting a representative sample of traffic (days of week) - Similar to above, when we are making long-term predictions about conversion rate differences, we need to make sure that our sample looks like our long-term traffic. Would you want to poll a random set of Americans to make predictions on the Japanese economy? So when running a split test we want to make sure that we are running it during a relatively normal time period AND account for different traffic throughout the week.