A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle
- a + b + c
- ab + bc + ac
- abc
are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.
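To see the connection with cubic coefficients explicitly, expanding (x − a)(x − b)(x − c) gives x³ − (a + b + c)x² + (ab + bc + ac)x − abc, so the coefficients are, up to sign, exactly the three symmetric functions above. Here is a small numeric sketch of that fact (the roots 2, 3, 5 are just an illustrative choice, not from the post):

```python
import numpy as np

a, b, c = 2.0, 3.0, 5.0

# np.poly returns the coefficients of the monic polynomial with the given roots:
# x^3 - (a + b + c) x^2 + (ab + bc + ac) x - abc
print(np.poly([a, b, c]))
print([1, -(a + b + c), a*b + b*c + a*c, -a*b*c])
```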
This post looks briefly at symmetric functions in the context of statistics.
Let h be a symmetric function of r variables and suppose we have a set S of n numbers where n ≥ r. If we average h over all subsets of size r drawn from S, the result is another symmetric function, called a U-statistic. The “U” stands for unbiased.
If h(x) = x then the corresponding U-statistic is the sample mean.
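To make the definition concrete, here is a minimal generic sketch. The function name u_statistic and the sample values are my own illustrative choices, not from the post; with h(x) = x and r = 1 it reproduces the sample mean.

```python
import numpy as np
from itertools import combinations

def u_statistic(h, xs, r):
    # Average the symmetric kernel h over all subsets of size r
    return np.mean([h(*c) for c in combinations(xs, r)])

xs = [2, 3, 5, 7, 11]
print(u_statistic(lambda x: x, xs, 1))  # same as np.mean(xs)
print(np.mean(xs))
```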
If h(x, y) = (x − y)²/2 then the corresponding U-statistic is the sample variance. Note that this is the sample variance, not the population variance. You could see this as a justification for why the sample variance has an n − 1 in the denominator while the corresponding term for the population variance has an n.
Here is some Python code that demonstrates that the average of (x − y)²/2 over all pairs in a sample is indeed the sample variance.
```python
import numpy as np
from itertools import combinations

def var(xs):
    n = len(xs)
    bin = n*(n-1)/2  # number of pairs, C(n, 2)
    h = lambda x, y: (x - y)**2/2
    return sum(h(*c) for c in combinations(xs, 2)) / bin

xs = np.array([2, 3, 5, 7, 11])
print(np.var(xs, ddof=1))
print(var(xs))
```
Note the `ddof` argument, which causes NumPy to compute the sample variance rather than the population variance.
Many statistics can be formulated as U-statistics, and so numerous properties of such statistics are corollaries of general results about U-statistics. For example, U-statistics are asymptotically normal, and so the sample variance is asymptotically normal.
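As a rough illustration of that last statement, here is a Monte Carlo sketch (mine, not the post's): compute the sample variance of many independent samples and check that, after standardizing, the empirical quantiles are close to those of a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 20_000

# Sample variance of n standard normal draws, repeated many times
sample_vars = np.array(
    [np.var(rng.standard_normal(n), ddof=1) for _ in range(reps)]
)

# Standardize and compare the 2.5% and 97.5% empirical quantiles with ±1.96;
# they should be close if the distribution is approximately normal
z = (sample_vars - sample_vars.mean()) / sample_vars.std()
print(np.quantile(z, [0.025, 0.975]))
```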