How teams fail at retention testing (without realizing it)

I hear this all the time:

"Our new build has a D1 of 35%."

And perhaps...

"That's 4% better than the previous build! We're headed in the right direction!"

Cool!

Just to be safe, though: how many installs were these tests based on?

Because if the answer is "in the hundreds,"

we are likely getting fooled by statistical noise.
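
To see how noisy "in the hundreds" can get, here is a minimal sketch. The inputs are assumptions for illustration, not numbers from a real test: "4% better" is read as 4 percentage points (31% vs. 35%), each build is given 300 installs, and the interval uses the simple normal approximation. The function name d1_difference_interval is hypothetical.

```python
import math
from statistics import NormalDist


def d1_difference_interval(rate_old, installs_old, rate_new, installs_new,
                           confidence=0.90):
    """Normal-approximation interval for (new D1 - old D1)."""
    # Two-sided z-score, about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of the difference between two independent proportions.
    std_err = math.sqrt(
        rate_old * (1 - rate_old) / installs_old
        + rate_new * (1 - rate_new) / installs_new
    )
    diff = rate_new - rate_old
    return diff - z * std_err, diff + z * std_err


# Hypothetical inputs: 31% vs. 35% D1, 300 installs per build.
low, high = d1_difference_interval(0.31, 300, 0.35, 300)
print(f"Difference: {low * 100:+.1f}pp to {high * 100:+.1f}pp")
```

Under those assumptions the interval runs from about -2 to +10 points. It includes zero, so the "improvement" could easily be nothing at all.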

An example from a client this year

Now, at 140 installs, a 35% D1 read has HUGE error bars.

Within a 90% confidence interval, that "35%" could be anywhere from roughly 28% to 42%.

So, what did we actually learn from the test?

Not much. So $500 (and, critically, time) was wasted.
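
For anyone who wants to check the error bars themselves, here is a rough sketch using the 140 installs and 35% D1 from this example. It assumes the simple normal (Wald) approximation; other interval methods give slightly different edges. The function name d1_confidence_interval is made up for illustration.

```python
import math
from statistics import NormalDist


def d1_confidence_interval(retention_rate, installs, confidence=0.90):
    """Normal-approximation interval for a measured retention rate."""
    # Two-sided z-score, about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of a proportion measured on `installs` users.
    std_err = math.sqrt(retention_rate * (1 - retention_rate) / installs)
    margin = z * std_err
    return retention_rate - margin, retention_rate + margin


low, high = d1_confidence_interval(0.35, 140)
print(f"{low:.1%} to {high:.1%}")  # prints "28.4% to 41.6%"
```

That is a swing of more than 6 points in either direction, larger than the 4-point "improvement" teams tend to celebrate.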

What most teams consistently fail to do

The point of failure is forgetting to ask:

"How precise do we need this test to be?"

And then:

"Given that, how many installs are needed?"

Most teams

What competent testing looks like

  1. Choose an acceptable error bar size (±X pp) based on the test's goals
  2. Choose a confidence level (usually 90%)
  3. Calculate the required installs (use your favorite LLM)
  4. Buy the installs from the cheapest geo that still gives a valid signal

You buy traffic to meet the math, not your gut.
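
If you would rather not trust an LLM with step 3, the math fits in a few lines. This is a minimal sketch assuming the normal approximation and, by default, the worst-case retention rate of 50% (the rate that needs the most installs). The name required_installs and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def required_installs(margin, confidence=0.90, expected_rate=0.5):
    """Installs needed so the error bar is at most +/- `margin`.

    `margin` is a fraction (0.03 means 3 percentage points).
    `expected_rate` defaults to 0.5, the most conservative guess.
    """
    # Two-sided z-score, about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2
    # Round up: a fraction of an install cannot be bought.
    return math.ceil(n)


print(required_installs(0.03))                      # 752
print(required_installs(0.03, expected_rate=0.35))  # 684
```

So ±3pp at 90% confidence lands at roughly 750 installs in the worst case, and somewhat fewer if you already expect D1 to sit near 35%. Halving the error bar roughly quadruples the installs, which is why the precision question has to come first.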

A simple example

  1. A team wants a Tier-1 D1 retention benchmark.
  2. They aren't comparing it to a prior build.
  3. Given this, a ±3pp error seems acceptable (90% confidence).
  4. They calculate the required installs: 750.
  5. The team saves money by buying from Denmark, not the USA.
  6. Total cost is $975.
  7. The test delivers exactly what was requested.

Have you ever been fooled by retention tests?