A few days ago I wrote about U-statistics, statistics which can be expressed as the average of a symmetric function over all combinations of elements of a set. V-statistics can be written as an average of a symmetric function over all products of elements of a set.
Let S be a statistical sample of size n and let h be a symmetric function of r elements. The average of h over all subsets of S with r elements is a U-statistic. The average of h over the Cartesian product of S with itself r times is a V-statistic.
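In symbols, the two definitions might be written as follows. The notation x_1, ..., x_n for the elements of S and the names U_n and V_n are assumptions introduced here for illustration; they do not appear in the post.

```latex
% U-statistic: average of h over all r-element subsets of the sample
U_n = \binom{n}{r}^{-1} \sum_{1 \le i_1 < \cdots < i_r \le n} h(x_{i_1}, \ldots, x_{i_r})

% V-statistic: average of h over all n^r tuples in the r-fold Cartesian product
V_n = \frac{1}{n^r} \sum_{i_1 = 1}^{n} \cdots \sum_{i_r = 1}^{n} h(x_{i_1}, \ldots, x_{i_r})
```

The only difference is the index set: the V-statistic allows repeated indices, and the U-statistic does not.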
As in the previous post, let h(x, y) = (x − y)²/2. We can illustrate the V-statistic associated with h with Python code as before.
    import numpy as np
    from itertools import product

    def var(xs):
        n = len(xs)
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in product(xs, repeat=2)) / n**2

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs))
    print(var(xs))
This time, however, we iterate over product rather than over combinations. Note also that at the bottom of the code we print np.var(xs) rather than np.var(xs, ddof=1). This means our code here is computing the population variance, not the sample variance. We could make this more explicit by supplying the default value of ddof:

    np.var(xs, ddof=0)
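To see why the V-statistic for this h is the population variance, it may help to expand the double sum. This is a short sketch of the algebra, writing x_1, ..., x_n for the sample and \bar{x} for its mean (notation assumed here).

```latex
\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(x_i - x_j)^2}{2}
  = \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( x_i^2 - 2 x_i x_j + x_j^2 \right)
  = \frac{1}{2n^2} \left( 2n \sum_{i=1}^{n} x_i^2 - 2 \Big( \sum_{i=1}^{n} x_i \Big)^2 \right)
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2
  = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The last expression is the variance with divisor n, which is what np.var computes when ddof=0.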
The point of V-statistics is not to calculate them as above, but that they could be calculated as above. Knowing that a statistic is an average of a symmetric function is theoretically advantageous, but computing a statistic this way would be inefficient: the sum has n^r terms, whereas the variance, for example, can be computed in time proportional to n.
U-statistics are averages of a function h over all subsamples of S of size r without replacement. V-statistics are averages of h over all subsamples of size r with replacement. The difference between sampling with or without replacement goes away as n increases, and so V-statistics have the same asymptotic properties as U-statistics.
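The convergence of the two statistics can be checked numerically. The following is a minimal sketch for the variance kernel h(x, y) = (x − y)²/2 only; the function names u_stat and v_stat, the seed, and the sample sizes are arbitrary choices made for illustration. For this kernel the diagonal terms h(x, x) are zero, so the V-statistic equals (n − 1)/n times the U-statistic, and the ratio tends to 1 as n grows.

```python
import numpy as np
from itertools import combinations, product

def h(x, y):
    # Variance kernel used in the post
    return (x - y)**2 / 2

def u_stat(xs):
    # Average of h over all 2-element subsets (sampling without replacement)
    pairs = list(combinations(xs, 2))
    return sum(h(x, y) for x, y in pairs) / len(pairs)

def v_stat(xs):
    # Average of h over all ordered pairs, repeats allowed (with replacement)
    n = len(xs)
    return sum(h(x, y) for x, y in product(xs, repeat=2)) / n**2

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
for n in (5, 20, 80):
    xs = rng.normal(size=n)
    u, v = u_stat(xs), v_stat(xs)
    # The ratio v/u should be (n - 1)/n up to floating point error
    print(n, u, v, v / u, (n - 1) / n)
```

On the sample from the code above, u_stat should agree with np.var(xs, ddof=1) and v_stat with np.var(xs), up to rounding.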