July 6, A very common scenario: The answer is this: There was no uplift to begin with. Statistical significance and validity are not the same. Two days after starting a test these were the results: Is this a statistically significant result?

Yes, it is. Punch the same numbers into any A/B test calculator and it will say the same. Here are the results using this significance calculator: Or how about we give it some more time instead.
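For readers who want to see what such a calculator does under the hood, here is a minimal sketch of a pooled two-proportion z-test. This is one common approach, and the calculator referenced above may use a different method. The function name and the visitor and conversion counts are made up for illustration and are not the figures from the test discussed here.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: a difference of only 19 conversions.
p = two_proportion_p_value(conv_a=30, n_a=1000, conv_b=49, n_b=1000)
print(f"p-value: {p:.4f}")
```

With these illustrative counts the p-value falls below 0.05, so a calculator would call the result significant even though the two variations differ by only a handful of conversions.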

This is what it looked like 10 days later: In this scenario many most? Even worse than the imaginary lift that you got is the false confidence that you now have. You think you learned something, and go on applying that learning elsewhere on the site.

But the learning is actually invalid, which renders all your effort and time a complete waste. The sample is too small: the absolute difference in conversions is just 19 transactions. That can change in a day. That alone should not determine whether you end a test or not. Statistical significance does not tell us the probability that B is better than A.

Nor does it tell us the probability that we will make a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false.

To learn what the p-values are really about, read this post. One of the difficulties with running tests online is that we are not in control of our user cohorts. This can be an issue if the users distribute differently by time and day of week, and even by season.

Because of this, we probably want to make sure that we collect our data over any relevant data cycles. That way our treatments are exposed to a more representative sample of the average user population.
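One way to act on this is to round the planned test length up to whole data cycles. The sketch below assumes a weekly cycle and a required sample size that you have already worked out elsewhere. The function name, parameters, and numbers are hypothetical.

```python
import math

def planned_duration_days(required_per_variation, daily_visitors_per_variation,
                          cycle_days=7):
    """Days needed to reach the sample size, rounded up to full cycles."""
    # Days needed on traffic alone.
    raw_days = math.ceil(required_per_variation / daily_visitors_per_variation)
    # Round up so every day of the cycle is covered equally often.
    cycles = math.ceil(raw_days / cycle_days)
    return cycles * cycle_days

# Hypothetical inputs: 10,000 visitors needed per variation, 800 per day.
print(planned_duration_days(10_000, 800))
```

Here traffic alone would suggest 13 days, but the plan is extended to 14 so that both variations see two complete weeks.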

First couple of days: B is winning big, typically due to the novelty factor. B is still winning, but the relative difference is smaller. Run your tests longer.

Test duration was 35 days, it targeted desktop visitors only, and had close to transactions per variation. Many people end the test here.

The Stopping Rule

So when is a test cooked? I will not believe any test that has less than conversions per variation. More traffic means you have a higher chance of recognizing your winner at the significance level you're testing at!

Test for a maximum of 4 weeks. What if after 3 or 4 weeks the sample size is less than conversions per variation? I will let the test run longer.

"Significance level" is a misleading term that many researchers do not fully understand.

Ton Wesseling, founder of Testing Agency, had this to say about it. You should know that stopping a test once it’s significant is deadly sin number 1 in A/B-testing land.
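To see why stopping at the first significant reading is so dangerous, it can help to simulate A/A tests, where both variations share the same true conversion rate and any "winner" is a false positive. The sketch below is an illustration under assumed settings (20 daily checks, 100 visitors per variation per day, a 10% conversion rate, a pooled two-proportion z-test). None of these numbers come from the tests described in this post.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # No variation in outcomes, so nothing to detect.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_false_positives(n_tests=300, days=20, daily_visitors=100,
                             rate=0.10, alpha=0.05, seed=1):
    """Return (rate when peeking daily, rate when checking only at the end)."""
    rng = random.Random(seed)
    flagged_when_peeking = 0
    flagged_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = visitors = 0
        ever_significant = False
        p = 1.0
        for _ in range(days):
            visitors += daily_visitors
            # Both arms convert at the same true rate (an A/A test).
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            p = p_value(conv_a, visitors, conv_b, visitors)
            if p < alpha:
                ever_significant = True  # A daily peeker would stop here.
        flagged_when_peeking += ever_significant
        flagged_at_end += p < alpha
    return flagged_when_peeking / n_tests, flagged_at_end / n_tests

peeking_rate, fixed_rate = simulate_false_positives()
print(f"Stop at first significant day: {peeking_rate:.1%} false positives")
print(f"Check only on the final day:   {fixed_rate:.1%} false positives")
```

Because the final day is also one of the daily checks, the peeking rate can never be lower than the fixed-horizon rate, and in practice it is usually several times higher.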

