Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?
Of course there’s nothing morally or legally wrong with looking at interim data, but is there anything statistically wrong? That depends on what you do with your observation.
There are basically two statistical camps, Frequentist and Bayesian. (There are others, but these two account for the lion’s share.) Frequentists say no, you should not look at interim data, unless you apply something called alpha spending. Bayesians, on the other hand, say go ahead. Why shouldn’t you look at your data? I remember one Bayesian colleague mocking alpha spending as being embarrassing.
So who is right, the Frequentists or the Bayesians? Both are right, given their respective criteria.
If you want to achieve success, as defined by Frequentists, you have to play by Frequentist rules, alpha spending and all. Suppose you design a hypothesis test to have a confidence level α, and you look at the data midway through an experiment. If the results are conclusive at the midpoint, you stop early. This procedure does not have a confidence level α. You would have to require stronger evidence for early stopping, as specified by alpha spending.
The Bayesian interprets the data differently. This approach says to quantify what is believed before conducting the experiment in the form of a prior probability distribution. Then after each data point comes in, you update your prior distribution using Bayes’ theorem to produce a posterior distribution, which becomes your new prior distribution. That’s it. At every point along the way, this distribution captures all that is known about what you’re testing. Your planned sample size is irrelevant, and the fact that you’ve looked at your data is irrelevant.
Now regarding your A/B test, why are you looking at the data early? If it’s simply out of curiosity, and cannot affect your actions, then it doesn’t matter. But if you act on your observation, you change the Frequentist operating characteristics of your experiment.
Stepping back a little, why are you conducting your test in the first place? If you want to make the correct decision with probability 1 − α in an infinite sequence of experiments, then you need to take into account alpha spending, or else you will lower that probability.
But if you don’t care about a hypothetical infinite sequence of experiments, you may find the Bayesian approach more intuitive. What do you know right at this point in the experiment? It’s all encapsulated in the posterior distribution.
Suppose your experiment size is a substantial portion of your potential visitors. You want to experiment with graphics, yes, but you also want to make money along the way. You’re ultimately concerned with profit, not publishing a scientific paper on your results. Then you could use a Bayesian approach to maximize your expected profit. This leads to things like “bandits,” so called by analogy to slot machines (“one-armed bandits”).
If you want to keep things simple, and if the experiment size is negligible relative to your expected number of visitors, just design your experiment according to Frequentist principles and don’t look at your data early.
But if you have good business reasons to want to look at the data early, not simply to satisfy curiosity, then you should probably interpret the interim data from a Bayesian perspective. As the next post explains, the Bayesian approach aligns well with common sense.
I’d recommend taking either a Frequentist approach or a Bayesian approach, but not complicating things by hybrid approaches such as alpha spending or designing Bayesian experiments to have desired Frequentist operating characteristics. The middle ground is more complicated and prone to subtle mistakes, though we can help you navigate this middle ground if you need to.
If you need help designing, conducting, or interpreting experiments, we can help. If you want/need to look at interim results, we can show you how to do it the right way.
I vaguely recall sequential statistics which IIRC was a frequentist version of this you set up some rule like choose A if some function of the A sales and Bsale is bigger than x at any point during the sampling
Pharmaceutical drug trials are a case where peeking is inherently necessary. Drug trials are often cut short when the drug turns out to be too dangerous and sometimes cut short when they are too obviously effective. There’s no way of separating the trial from treating the subject as a patient.
Right. Drug trials are the context where I first encountered this.
With early phase chemotherapy drug trials, you’re administering toxic chemicals, though hopefully not too toxic, and so safety monitoring is critical.
We did a lot of simulations, designing Bayesian methods to have the right frequentist operating characteristics.