If two random variables X and Y have the same first few moments, how different can their distributions be?
Suppose E[X^i] = E[Y^i] for i = 0, 1, 2, …, 2p. Then there is a polynomial P(x) of degree 2p such that
|F(x) − G(x)| ≤ 1/P(x)
where F and G are the CDFs of X and Y respectively.
The polynomial P(x) is given by
V' M^-1 V
where V is a vector of dimension p + 1 and M is a (p + 1) × (p + 1) matrix. The ith element of V is x^i and the (i, j) element of M is E(X^(i+j)), with indices starting from 0.
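For concreteness, here is a minimal sketch in Python/NumPy of how one might evaluate P(x) from the shared moments. The function name moment_bound and its interface are my own, not from the paper.

    import numpy as np

    def moment_bound(x, moments):
        # moments[k] = E[X^k] for k = 0, ..., 2p, so moments[0] = 1.
        p = (len(moments) - 1) // 2
        # Hankel matrix of moments: the (i, j) entry is the (i + j)-th moment.
        M = np.array([[moments[i + j] for j in range(p + 1)]
                      for i in range(p + 1)], dtype=float)
        # V = (1, x, x^2, ..., x^p)
        V = np.array([x**i for i in range(p + 1)], dtype=float)
        # P(x) = V' M^-1 V; the bound on |F(x) - G(x)| is then 1/P(x).
        return V @ np.linalg.solve(M, V)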
Reference: “Moments determine the tail of a distribution (but not much else)” by Bruce Lindsay and Prasanta Basak, The American Statistician, Vol. 54, No. 4, pp. 248–251.
The title of that paper really helps put this result in context.
Yes, I agree with Mr Noble: this is neat, but to be appreciated beyond the “priesthood” it needs to be wrapped in explanatory language.
I haven’t read the linked article, but I think the following Wikipedia links on the moment problem and the truncated moment problem are relevant.
The second link shows a connection with orthogonal polynomials.
http://en.wikipedia.org/wiki/Moment_problem
http://en.wikipedia.org/wiki/Chebyshev%E2%80%93Markov%E2%80%93Stieltjes_inequalities
If I read this summary correctly, if two distributions have the same first 2p-1 moments, then if you know the first 4p moments of one of the distributions, you can determine inverse-polynomial bounds for the difference between them. Those bounds tend to infinite width near the shared mean (revealing less than the trivial bound |F(x)-G(x)|≤1). The larger the higher moments of the chosen distribution, the faster the bound narrows. That’s how I read it anyway.
David: Your comment made me realize I’d incorrectly written the dimensions of V and M. The matrix M depends only on the 2p moments in common between X and Y.
I updated the post. Thanks for pointing out the error.
If V is of dimension p+1, wouldn’t this make P a polynomial of degree p?
Jonathan: You multiply by V on the left and the right. That’s why it’s degree 2p. For example, if the matrix in the middle were the identity, you’d get 1 + x^2 + … + x^(2p).
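A quick symbolic check of that example, a SymPy sketch of my own rather than anything from the post:

    import sympy as sp

    x = sp.symbols('x')
    p = 3  # any small p illustrates the point
    V = sp.Matrix([x**i for i in range(p + 1)])
    # With the identity in the middle, V' M^-1 V is a sum of even powers.
    P = (V.T * sp.eye(p + 1) * V)[0, 0]
    print(sp.expand(P))  # x**6 + x**4 + x**2 + 1, degree 2p = 6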
This is an interesting result. But I would like to draw attention to one circumstance: it is assumed that the moments we know are known exactly. In practice, as a rule, they are not. It would be interesting to have estimates that take the errors in the moments into account.
I’m not exactly sure how to interpret this result.
On the one hand, the title of the paper is “Moments determine the tail of a distribution (but not much else).”
On the other hand, the fact that there is a polynomial P(x) where |F(x) – G(x)| ≤ 1/P(x) doesn’t tell us much if we don’t know what P(x) looks like. Moreover, this is an upper bound, which limits the dissimilarity of the distributions.
In other words, the title of the paper implies that moments don’t determine a distribution very well, whereas the result allows you to conclude nothing more than that moments *might* match a distribution quite well.
What’s the takeaway, then?
For a particular set of moments, you can calculate the matrix M and get P(x) exactly. The paper I reference gives specific examples.
But in any case, you know that asymptotically the difference between the two distribution functions is O(1/x^(2p)). The reason moments tell you more in the tails than in the middle is that for sufficiently large values of x, only the leading term in the polynomial matters.
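To illustrate the leading-term point numerically, here is a small sketch of mine (not from the paper): take p = 2 and the moments of a standard normal up to order 2p = 4. Then x^(2p)/P(x) settles down to a constant, so the bound 1/P(x) decays like 1/x^(2p).

    import numpy as np

    # Standard normal moments E[X^k] for k = 0, ..., 4:
    # zero for odd k, (k - 1)!! for even k.
    p = 2
    moments = [1.0, 0.0, 1.0, 0.0, 3.0]

    M = np.array([[moments[i + j] for j in range(p + 1)]
                  for i in range(p + 1)])

    for x in [5.0, 10.0, 20.0, 40.0]:
        V = np.array([x**i for i in range(p + 1)])
        P = V @ np.linalg.solve(M, V)
        # 1/P is the bound on |F(x) - G(x)|; x^(2p)/P approaches a constant (here 2).
        print(x, 1/P, x**(2*p) / P)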
Statistical inference depends on the tails. So how do we get to “moments determine the tail[s] of a distribution (but not much else)”?
The core is not informative with or without moments.
Alas, statistical inference is tail driven. Alpha and Beta tell us how much of the tails we used. Cores tell us nothing.