When statisticians analyze data, they don’t just look at the data you bring to them. They also consider hypothetical data that you could have brought. In other words, they consider what could have happened as well as what actually did happen.
This may seem strange, and sometimes it does lead to strange conclusions. But often it is undeniably the right thing to do. It also leads to endless debates among statisticians. The cause of the debates lies at the root of statistics.
The central dogma of statistics is that data should be viewed as realizations of random variables. This has been a very fruitful idea, but it has its limits. It’s a reification of the world. And like all reifications, it eventually becomes invisible to those who rely on it.
Data are what they are. In order to think of the data as having come from a random process, you have to construct a hypothetical process that could have produced the data. Sometimes there is near universal agreement on how this should be done. But often different statisticians create different hypothetical worlds in which to place the data. This is at the root of such arguments as how to handle multiple testing.
You can debunk any conclusion by placing the data in a large enough hypothetical model. Suppose it’s Jake’s birthday, and when he comes home, there are Scrabble tiles on the floor spelling out “Happy birthday Jake.” You might conclude that someone arranged the tiles to leave him a birthday greeting. But if you are so inclined, you could attribute the apparent pattern to chance. You could argue that there are many people around the world who have dropped bags of Scrabble tiles, and eventually something like this was bound to happen. If that seems to be an inadequate explanation, you could take a “many worlds” approach and posit entire new universes. Not only are people dropping Scrabble tiles in this universe, they’re dropping them in countless other universes too. We’re only remarking on Jake’s apparent birthday greeting because we happen to inhabit the universe in which it happened.
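As a toy illustration of the “bound to happen” argument (the numbers here are made up for the sake of the sketch): even if a single dropped bag has only a one-in-a-billion chance of spelling out the greeting, the chance of it happening somewhere climbs toward certainty as the pool of hypothetical trials grows.

```python
# Toy numbers, not a claim about actual Scrabble spills: if one drop has
# probability p of producing the greeting, the chance of seeing it at
# least once in n independent hypothetical drops is 1 - (1 - p)^n.
p = 1e-9
for n in (1e3, 1e6, 1e9, 1e12):
    at_least_once = 1 - (1 - p) ** n
    print(f"{n:.0e} hypothetical drops -> P(at least one greeting) = {at_least_once:.3g}")
```

So by positing enough hypothetical drops (or universes), the “chance” explanation can be made to look as plausible as you like.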
I used to think that Bayesian statistics provided a way out of this morass. Some Bayesians say “Data are not random. What happened happened. But the model parameters are random (or at least uncertain, which we model as random).” But that’s not true. The left side of Bayes’ theorem treats data as known, but the right side does not. The arbitrariness of probability models is a bigger problem for Bayesians because they need the frequentist probability model plus a probability model on parameters.
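To spell out that point in standard notation (the symbols are mine, not part of the original argument): write y for the data and θ for the parameters. Bayes’ theorem says

p(θ | y) = p(y | θ) p(θ) / p(y).

The left side conditions on the data actually observed, but the likelihood p(y | θ) on the right is a sampling model, a probability distribution over data sets that could have been observed. So the Bayesian still has to posit a hypothetical data-generating process, and then add a prior p(θ) on top of it.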
The paper you cited a while ago, by Hogg, Bovy & Lang, is precisely on topic for this.
I don’t think it’s quite right to say that Bayesians need the frequentist probability model. (If there even is a distinctly frequentist model of probability. The formal frequentist model offers no advice on how to interpret a finite data set. It seems to me that in practice a frequentist is a Bayesian who refuses to tell even himself what his priors are.)
It seems to me that the fundamental difference is not between frequentists and Bayesians, but between people who think that probability statements are about sigma algebras, and those who think they are about information. There is a rough alignment of those categories between frequentist and Bayesian camps, but that’s a contingent historical fact. Laplace (for example) was a frequentist who interpreted probabilities as statements about information.
What’s the probability that Oxford wrote the plays attributed to Shakespeare? Can a frequentist even ask that question?
“data should be viewed as realizations of random variables. This has been a very fruitful idea, but it has its limits. It’s a reification of the world.”
With that phrase, my statistical mind is now permanently blown.
Mike: You’re welcome. :)
Thankfully, the ML community has been moving beyond this tradition into the realm of “prediction of individual sequences”, which I find much more satisfying epistemologically, albeit probably less useful in practical applications.
John: I don’t see why you think the ML approach would be less useful.
Sorry, my last comment was too short to be understandable.
The trouble with making weaker assumptions and not depending upon hypothetical random processes is that you need to make much more conservative predictions. In the multi-armed bandit literature there are algorithms like Exp3 that are able to cope with adversarial processes that generate data, but they do this by making predictions that are (loosely speaking) heavily regularized.
The very strong assumptions in the old statistical theory literature let you define algorithms that are very sure of themselves. Once you make weaker assumptions, you have to hedge your bets quite a lot and will probably underfit a data set in which the real data generating process is some sort of IID sequence.
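For concreteness, here is a minimal sketch of an Exp3-style algorithm (the interface, the choice of gamma, and the toy IID example are my own illustration, not part of the comment above). The uniform exploration term mixed into the sampling distribution is the hedging in question: the algorithm never fully trusts its own weights, which is what protects it against an adversary and what makes it conservative when the world happens to be IID.

```python
import math
import random

def exp3(pull, num_arms, horizon, gamma=0.1):
    """Minimal Exp3 sketch. `pull(arm)` must return a reward in [0, 1].
    Illustrative only; the parameter choices are not tuned."""
    weights = [1.0] * num_arms
    total_reward = 0.0
    for _ in range(horizon):
        total_w = sum(weights)
        # Mix exponential weights with a uniform distribution: this is the
        # explicit hedge against an adversarial data-generating process.
        probs = [(1 - gamma) * w / total_w + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = pull(arm)
        total_reward += reward
        # Importance-weighted estimate: only the chosen arm is updated.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
    return total_reward

# A benign IID world where arm 1 is best; Exp3 still keeps exploring.
random.seed(0)
means = [0.3, 0.7, 0.5]
print(exp3(lambda a: float(random.random() < means[a]), num_arms=3, horizon=5000))
```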
John: You and I talked before about value-laden technical terms in statistics. Confidence is one such term.
Traditional statistical methods give you greater confidence, as mathematically defined. But my psychological confidence is diminished by the existence of strong modeling assumptions that may not be justified. I would have more personal confidence in modest conclusions drawn from modest assumptions.
Oh, I entirely agree with you. This is one of the lessons I really took from David Freedman’s work: the accuracy of claims that rely upon traditional statistical methods is very much dependent upon the validity of assumptions we don’t really know how to test well.
But the opposite end, which seems to be the adversarial data generating process, leads to the opposite weakness: you spend so much time making weak assumptions that you get weak conclusions. For example, in the bandits literature, you get shifts from logarithmic convergence rates to square root convergence rates. You pay a serious price for skepticism.
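To put rough numbers on that shift (standard textbook bounds, not something from the comment above, and glossing over constants): under an IID assumption an algorithm like UCB can achieve expected regret on the order of (K/Δ) log T, where K is the number of arms and Δ the gap to the best arm, while adversarial algorithms like Exp3 can only guarantee regret on the order of sqrt(T K log K). Going from log T to sqrt(T) is the price referred to here.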
If some scientists were more candid, they’d say “I don’t care whether my results are true, I care whether they’re publishable. So I need my p-value less than 0.05. Make whatever strong assumptions you have to.”
My sense of statistical education in the sciences is basically Upton Sinclair’s view of the Gilded Age: “it is difficult to get a man to understand something when his salary depends upon his not understanding it.”