Experience with the normal distribution makes people think all distributions have (useful) sufficient statistics [1]. If you have data from a normal distribution, then the sufficient statistics are the sample mean and sample variance. These statistics are “sufficient” in that the entire data set isn’t any more informative than those two statistics. They effectively condense the data for you. (This is conditional on knowing the data come from a normal. More on that shortly.)
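To make that concrete, here is a minimal sketch (ordinary NumPy, not part of the original post): two data sets with the same sample mean and variance yield exactly the same normal likelihood for any choice of parameters, so once you commit to the normal model, the data add nothing beyond those two numbers.

```python
# Sketch, not from the post: the normal likelihood "sees" only the sample
# mean and variance, so data sets that agree on those are indistinguishable.
import numpy as np

def normal_loglik(x, mu, sigma):
    # log-likelihood of i.i.d. normal observations
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Reflect each point about the sample mean: a different data set with
# exactly the same sample mean and variance.
y = 2 * x.mean() - x

for mu, sigma in [(0.0, 1.0), (0.3, 2.0)]:
    print(normal_loglik(x, mu, sigma), normal_loglik(y, mu, sigma))
    # the two values agree to floating-point precision for every (mu, sigma)
```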
With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.
Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.
Suppose you have data with such large outliers that you seriously doubt they could have come from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, conditional on the normality assumption, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.
***
[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.
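As a sketch of the factorization alluded to, in the standard exponential family notation (not spelled out in the original footnote):

```latex
% Standard exponential family form; T(x) is the sufficient statistic.
\[
  f(x \mid \theta) = h(x)\,\exp\!\big(\eta(\theta)\cdot T(x) - A(\theta)\big),
  \qquad
  \prod_{i=1}^n f(x_i \mid \theta)
  = \Big(\prod_{i=1}^n h(x_i)\Big)
    \exp\!\Big(\eta(\theta)\cdot \sum_{i=1}^n T(x_i) - n\,A(\theta)\Big).
\]
% The joint density depends on the data only through \sum_i T(x_i).
% For the normal, T(x) = (x, x^2), so \sum_i x_i and \sum_i x_i^2
% (equivalently the sample mean and variance) are sufficient.
```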
Thanks for the post. It’s good to keep in mind what you say about assumed distributions. By the way, have you thought about writing an entry on approximate Bayesian computation? Keep up the good blogging!
One interesting thing about sufficiency is that some statistics, like the median, can never be sufficient, no matter what the distribution is. And another thing we found in connection with ABC is that sufficiency for estimation is not the same as sufficiency for testing, i.e., a statistic may be sufficient for estimating a parameter and ancillary for testing one distribution versus another!
The mean and variance still let you do Markov and Chebyshev bounds, though.
(Which is to say nothing of Chernoff bounds, which have much stricter conditions, i.e., independence.)
Thinking of things in terms of bounds is at a minimum complementary to the ‘traditional’ approach of ‘knowing’ the distribution.
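For illustration, a minimal sketch of the bound the commenter has in mind (the function name is my own): Chebyshev’s inequality turns a mean and variance into a tail bound with no further distributional assumptions.

```python
# Illustrative sketch: Chebyshev's inequality gives
# P(|X - mean| >= t) <= variance / t^2, using only the mean and variance.
def chebyshev_tail_bound(mean, variance, threshold):
    """Upper bound on P(|X - mean| >= threshold) from the mean and variance alone."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return min(1.0, variance / threshold**2)

# Example: probability of deviating from the mean by at least 3 standard deviations.
print(chebyshev_tail_bound(mean=0.0, variance=1.0, threshold=3.0))  # 1/9 ≈ 0.111
```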