Variance of variances. All is variance.

In the previous post, I did a simulation to illustrate a theorem about the number of steps needed in the Euclidean algorithm. The distribution of the number of steps is asymptotically normal, and for numbers 0 < a < b < x the mean is asymptotically

\frac{12 \log 2}{\pi^2} \log x
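Here's a minimal sketch of the kind of simulation the previous post ran, counting division steps for random pairs and comparing the empirical mean to the asymptotic value. (The helper name euclid_steps and the sample size are my choices, not necessarily what the previous post used.)

from random import randrange
from math import log, pi

def euclid_steps(a, b):
    """Count division steps in the Euclidean algorithm for gcd(a, b)."""
    steps = 0
    while b:
        a, b = b, a % b
        steps += 1
    return steps

x = 2**63
samples = []
for _ in range(10_000):
    b = randrange(2, x)
    a = randrange(1, b)               # 0 < a < b < x
    samples.append(euclid_steps(b, a))

print(sum(samples)/len(samples))      # empirical mean
print(12*log(2)*log(x)/pi**2)         # asymptotic mean, about 36.8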

What about the variance? The reference cited in the previous post [1] gives an expression for the variance, but it’s complicated. The article said the variance is close to a constant times log x, where that constant involves the second derivative of “a certain pseudo zeta function associated with continued fractions.”

So let’s estimate the variance empirically. We can easily compute the sample variance to get an unbiased point estimate of the population variance, but how can we judge how accurate this point estimate is? We’d need the variance of our estimate of variance.

When I taught statistics classes, this is where students would struggle. There are multiple levels of randomness going on. We have some random process. (The number of steps the algorithm takes is determined by its inputs, not random, but the inputs are chosen at random, so we'll let that slide.)

We have a statistic S², the sample variance, which is computed from samples from a random process, so our statistic is itself a random variable. As such, it has a mean and a variance. The mean of S² is the population variance; that’s what it means for a statistic to be an unbiased estimator.
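As a quick sanity check on unbiasedness (a sketch of my own, not from the post), we can average many sample variances drawn from a distribution whose variance we know:

import numpy as np

rng = np.random.default_rng(0)
# normal samples with σ = 2, so the population variance is 4
sample_vars = [rng.normal(0, 2, size=10).var(ddof=1) for _ in range(100_000)]
print(np.mean(sample_vars))  # close to 4 because S² is unbiased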

The statistic S² has a known distribution, which we can use to create a confidence interval. Namely,

\frac{(n-1)S^2}{\sigma^2} \sim \chi^2

where the chi squared random variable has n − 1 degrees of freedom. Here n is the number of samples and σ² is the population variance. But wait, isn’t the whole purpose of this discussion to estimate σ² because we don’t know what it is?!

This isn’t quite circular reasoning. More like iterative reasoning. We have an estimate of σ² which we then turn around and use in constructing its confidence interval. It’s not perfect, but it’s useful.

A (1 − α) 100% confidence interval is given by

\left( \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \right)

Notice that the right tail probability appears on the left end of the interval and the left tail probability appears on the right end of the interval. The order flips because they’re in a denominator.
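Wrapped up as a small function (the name var_ci is mine), the interval looks like this, with the quantiles coming from scipy's chi2.ppf:

from scipy.stats import chi2

def var_ci(s2, n, alpha=0.05):
    """Confidence interval for the population variance,
    given sample variance s2 computed from n samples."""
    lower = (n - 1)*s2/chi2.ppf(1 - alpha/2, n - 1)
    upper = (n - 1)*s2/chi2.ppf(alpha/2, n - 1)
    return lower, upper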

I modified the simulation code from the previous post to compute the sample standard deviation for the runs, and got 4.6926, which corresponds to a sample variance of 22.0205. There were 10,000 reps, so our chi squared distribution has 9,999 degrees of freedom. We can compute the confidence interval as follows.

>>> from scipy.stats import chi2
>>> # divide (n-1)s² by the right and left tail quantiles of χ² with 9999 dof
>>> 9999*22.0205/chi2.ppf([0.975, 0.025], 9999)

This says a 95% confidence interval for the variance, in one particular simulation run, was [21.42, 22.64].

The result in [1] said that the variance is proportional to log x, and in our case x = 2⁶³ because that post simulated random numbers up to that limit. We can estimate the proportionality constant as 22.0205/(63 log 2) = 0.5042. In other words, the variance is about half of log x.
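Checking that arithmetic:

from math import log
print(22.0205/(63*log(2)))  # 0.5042...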

The theorem in [1] gives us the asymptotic variance. If we used a bigger value of x, the estimated constant could be a little different, but I expect the variance would be around 0.5 log x.


[1] Doug Hensley. The Number of Steps in the Euclidean Algorithm. Journal of Number Theory 49 (1994), 142–182.
