It's VERY common to see split test results that look fantastic, but then fail to reproduce when rolled out.

It wasn't until I understood bootstrapping that I fully grasped HOW frequently this occurs, and how to predict HOW LIKELY it is to occur for a given result. Bootstrapping is one tool that helps you place intelligent bets on whether, say, an ARPU lift observed in an experiment will repro in the wild.
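To make that concrete, here is a minimal sketch of bootstrapping an ARPU lift. Everything in it is an assumption for illustration: the revenue data is synthetic (5% of users pay, with heavy-tailed spend), and `fake_revenue`, `bootstrap_lift`, and `summarize` are hypothetical helper names, not anything from a particular library. The idea is simply to resample each arm with replacement many times and look at how much the lift moves around.

```python
import random

def fake_revenue(n_users, rng, pay_rate=0.05):
    """Synthetic revenue per user (an assumption): most users pay nothing."""
    return [rng.lognormvariate(3, 1) if rng.random() < pay_rate else 0.0
            for _ in range(n_users)]

def bootstrap_lift(control, variant, n_resamples=1000, seed=0):
    """Resample each arm with replacement; return the relative ARPU lifts."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_resamples):
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        mean_c = sum(c) / len(c)
        mean_v = sum(v) / len(v)
        if mean_c > 0:  # skip degenerate resamples with no control revenue
            lifts.append(mean_v / mean_c - 1)
    return lifts

def summarize(lifts):
    """Share of resamples with a positive lift, plus a 95% percentile interval."""
    ordered = sorted(lifts)
    n = len(ordered)
    return {
        "p_positive": sum(l > 0 for l in ordered) / n,
        "ci_low": ordered[int(0.025 * n)],
        "ci_high": ordered[min(n - 1, int(0.975 * n))],
    }

# Both arms come from the SAME distribution here, so any lift is pure noise.
data_rng = random.Random(42)
control = fake_revenue(2000, data_rng)
variant = fake_revenue(2000, data_rng)

lifts = bootstrap_lift(control, variant)
summary = summarize(lifts)
print(summary)
```

If the interval between `ci_low` and `ci_high` comfortably straddles zero, that's a hint the observed lift may not hold up after rollout. With your own data you'd swap the two synthetic lists for real per-user revenue from each arm.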

It has helped me calibrate my intuition for what sample sizes are necessary, given the size of the result we're projecting in a given experiment.

Here, even if the changes you're testing have ZERO TRUE impact… you'll nevertheless OBSERVE dramatic swings in ARPU (20%+) between the control and variant about 30% of the time.

A phantom +/- 10% result? More than 50% of the time! This means that more than HALF of these experiments would hand you a false win or a false loss, IF you aren't calculating p-values.
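You can check how often phantom swings like these show up with a quick A/A simulation: run lots of pretend experiments where both arms are drawn from the same distribution, and count how often the observed lift crosses a threshold. This is a rough sketch under stated assumptions: the revenue model (5% of users pay, lognormal spend), the sample sizes, and the names `simulate_aa` and `share_beyond` are all hypothetical, so the percentages it prints will not necessarily match the figures above. Those depend on your own sample size and how spiky your revenue is.

```python
import random

def simulate_aa(n_users, n_experiments=300, pay_rate=0.05, seed=0):
    """Run A/A experiments (zero true impact); return observed relative lifts."""
    rng = random.Random(seed)

    def arpu():
        # Synthetic revenue model (an assumption): few payers, heavy-tailed spend.
        total = 0.0
        for _ in range(n_users):
            if rng.random() < pay_rate:
                total += rng.lognormvariate(3, 1)
        return total / n_users

    lifts = []
    for _ in range(n_experiments):
        control, variant = arpu(), arpu()
        if control > 0:  # skip the rare run where control has no revenue
            lifts.append(variant / control - 1)
    return lifts

def share_beyond(lifts, threshold):
    """Fraction of experiments whose lift is at least +/- threshold."""
    return sum(abs(l) >= threshold for l in lifts) / len(lifts)

# Compare a small and a larger experiment (users per arm).
results = {n: simulate_aa(n) for n in (500, 5000)}
for n, lifts in results.items():
    print(f"{n} users/arm: "
          f"+/-10% in {share_beyond(lifts, 0.10):.0%} of runs, "
          f"+/-20% in {share_beyond(lifts, 0.20):.0%} of runs")
```

The useful part is watching how the phantom-result rate drops as you raise `n_users`. That's the calibration: it tells you roughly how big a sample you need before a given lift is worth getting excited about.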

This is why before/after testing doesn't work, unless you are operating at massive scale and driving BIG (10%+) KPI lifts that can rise above the expected noise.