Simple error function approximation

I recently ran across the fact that

\int_0^x \exp(-t^2)\, dt \approx \sin(\sin(x))

is a remarkably good approximation for −1 ≤ x ≤ 1.

Since the integral above defines the error function erf(x), modulo a constant, this says we have a good approximation for the error function

\text{erf}(x) \approx \frac{2}{\sqrt{\pi}} \sin( \sin(x) )

again provided −1 ≤ x ≤ 1.
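
A quick numerical check of the approximation, as a minimal sketch using SciPy's erf:

    from numpy import sin, sqrt, pi, linspace
    from scipy.special import erf

    # Compare erf(x) with (2/sqrt(pi)) sin(sin(x)) on [-1, 1].
    x = linspace(-1, 1, 201)
    print(abs(erf(x) - 2/sqrt(pi)*sin(sin(x))).max())

The maximum error over the interval is on the order of 10⁻³, occurring at the endpoints.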

The error function is closely related to the Gaussian integral, i.e. the normal probability distribution CDF Φ. The relation between erf and Φ is simple but error-prone. I wrote up a page of notes for myself a few years ago so I wouldn't make a mistake again moving between these functions and their inverses.

Update: This post makes the connection to probability explicit.

You can derive the approximation by writing out the power series for exp(t), substituting −t² for t, and integrating term-by-term from 0 to x. Comparing the result with the power series for sin(sin(x)), you'll see that the two series agree through the x⁵ term, so the error is on the order of x⁷. Here's a plot of the error.

The error is extremely small near 0, which is what you'd expect since the error is on the order of x⁷.
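
If you don't want to grind out the series by hand, a computer algebra system can confirm the order of the error. Here's a sketch using SymPy:

    from sympy import symbols, sin, exp, integrate, series

    x, t = symbols('x t')
    # The integral evaluates to sqrt(pi)/2 * erf(x).
    lhs = integrate(exp(-t**2), (t, 0, x))
    # Expand the error in powers of x; the leading term should be x**7/630.
    print(series(lhs - sin(sin(x)), x, 0, 9))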

A better integral for the normal distribution

For a standard normal random variable Z, the probability that Z exceeds some cutoff z is given by

\text{Prob}(Z \geq z) = Q(z) = \frac{1}{\sqrt{2\pi}} \int_z^\infty \exp(-x^2/2)\, dx

If you wanted to compute this probability, the obvious thing would be to evaluate the defining integral numerically. But as is often the case in numerical analysis, the most obvious approach is not the best one: the range of integration is unbounded, and it changes with the argument z.
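
For reference, here's what the obvious approach might look like, as a sketch using SciPy's adaptive quadrature (which accepts an infinite upper limit):

    import numpy as np
    from scipy import integrate

    def q_naive(z):
        # Integrate the Gaussian density from z to infinity.
        val, _ = integrate.quad(lambda x: np.exp(-x**2/2), z, np.inf)
        return val / np.sqrt(2*np.pi)

    print(q_naive(2))  # roughly 0.02275

This works for moderate arguments, but every call integrates over a different unbounded interval.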

J. W. Craig [1] came up with a better integral representation, better from the perspective of numerical integration. The integration is always over the same finite interval, with the argument appearing inside the integrand. The integrand is smooth and bounded, well suited to numerical integration.

For positive z, Craig's integral representation is

Q(z) = \frac{1}{\pi} \int_0^{\pi/2} \exp\left( -\frac{z^2}{2\sin^2 \theta} \right) \, d\theta

Illustration

To show that Craig's integral is easy to evaluate numerically, we'll compute it using Gaussian quadrature with only 10 integration points.

    from numpy import sin, exp, pi
    from scipy import integrate
    from scipy.stats import norm

    # Evaluate Craig's integral with 10-point Gaussian quadrature
    # and compare with SciPy's survival function norm.sf.
    for x in [0.5, 2, 5]:
        q, _ = integrate.fixed_quad(
            lambda t: exp(-x**2 / (2*sin(t)**2))/pi,
            0.0, pi/2, n=10)
        print(q, norm.sf(x))

(SciPy uses sf (“survival function”) for the CCDF. More on that here.)

The code above produces the following.

    0.30858301 0.30853754
    0.02274966 0.02275013
    2.86638437e-07 2.86651572e-07

So with 10 integration points, we get four correct figures. And the accuracy seems to be consistent for small, medium, and large values of x. (Five standard deviations is pretty far out in the tail of a normal distribution, as evidenced by the small value of the integral.)

[1] J. W. Craig. A new, simple and exact result for calculating the probability of error for two-dimensional signal constellations. IEEE MILCOM '91 Conf. Rec., Boston, MA (1991), pp. 25.5.1–25.5.5.

Too clever Monte Carlo

One way to find the volume of a sphere would be to imagine the sphere in a box, randomly select points in the box, and count how many of these points fall inside the sphere. In principle this would work in any dimension.

The problem with naive Monte Carlo

We could write a program to estimate the volume of a high-dimensional sphere this way. But there's a problem: very few random samples will fall in the sphere. The ratio of the volume of a sphere to the volume of the box it fits in goes to zero as the dimension increases. We might take a large number of samples and have none of them fall inside the sphere. In that case we'd estimate the volume as zero. This estimate would have small absolute error, but 100% relative error.
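
Here's a sketch of what such a naive program might look like:

    import numpy as np

    rng = np.random.default_rng(42)

    def mc_ball_volume(n, samples=100_000):
        # Sample uniformly from the cube [-1, 1]^n and count the
        # fraction of points landing inside the unit sphere.
        pts = rng.uniform(-1.0, 1.0, size=(samples, n))
        inside = (pts**2).sum(axis=1) <= 1.0
        return inside.mean() * 2**n  # fraction times cube volume

    for n in [2, 5, 10, 20]:
        print(n, mc_ball_volume(n))

By n = 20 the sphere occupies so little of the cube that 100,000 samples will typically produce no hits at all, and the estimate is exactly zero.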

A more clever approach

So instead of actually writing a program to randomly sample a high dimensional cube, let’s imagine that we did. Instead of doing a big Monte Carlo study, we could be clever and use theory.

Let n be our dimension. We want to draw uniform random samples from [−1, 1]ⁿ and see whether they land inside the unit sphere. So we'd draw n random samples from [−1, 1] and see whether the sum of their squares is less than or equal to 1.

Let Xᵢ be a uniform random variable on [−1, 1]. We want to know the probability that

X₁² + X₂² + X₃² + … + Xₙ² ≤ 1.

This would be an ugly calculation, but since we're primarily interested in the case of large n, we can approximate the sum using the central limit theorem (CLT). We can show, using the transformation theorem, that each Xᵢ² has mean 1/3 and variance 4/45. The CLT says that the sum has approximately the distribution of a normal random variable with mean n/3 and variance 4n/45.
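
To see where this leads, here's a sketch that turns the CLT approximation into a volume estimate and compares it with the exact volume π^(n/2)/Γ(n/2 + 1) of the unit sphere:

    from math import gamma, pi, sqrt
    from scipy.stats import norm

    def clt_volume(n):
        # P(X1^2 + ... + Xn^2 <= 1) under Normal(n/3, 4n/45),
        # times the volume 2^n of the cube [-1, 1]^n
        p = norm.cdf((1 - n/3) / sqrt(4*n/45))
        return 2**n * p

    def exact_volume(n):
        return pi**(n/2) / gamma(n/2 + 1)

    for n in [5, 10, 20]:
        print(n, clt_volume(n), exact_volume(n))

The two columns diverge quickly: the estimate is off by roughly a factor of 3 at n = 10 and by orders of magnitude at n = 20.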

Too clever by half

The approach above turns out to be a bad idea, though it’s not obvious why.

The CLT does provide a good approximation of the sum above near its mean. But the sum has mean n/3, with n large, and we're asking for the probability that it is less than 1. In other words, we're asking for a probability far out in the tail, where the relative error of the CLT approximation is large. More on this here.

This post turned out not to be about what I thought it would be about. I thought it would lead to an asymptotic approximation for the volume of an n-dimensional sphere; I would compare the approximation to the exact value and see how well it did. Except it did terribly. So instead, this post is a cautionary tale about remembering how convergence works in the CLT.

Pairs in poker

An article by Y. L. Cheung [1] explains why poker is usually played with five cards. The author gives several reasons; here I'll just look at one: pairs don't behave the way you might expect if you have more than five cards.

In five-card poker, the more pairs the better. Better here means less likely. One pair is better than no pair, and two pairs is better than one pair. But in six-card or seven-card poker, a hand with no pair is less likely than a hand with one pair.

For a five-card hand, the probabilities of 0, 1, or 2 pair are 0.5012, 0.4226, and 0.0475 respectively.

For a six-card hand, the same probabilities are 0.3431, 0.4855, and 0.1214.

For a seven-card hand, the probabilities are 0.2091, 0.4728, and 0.2216.
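
The one-pair and two-pair figures are easy to check by counting rank patterns. Here's a sketch (the no-pair figures involve further exclusions, such as straights and flushes, which this doesn't attempt):

    from math import comb

    def one_pair(k):
        # one rank appears exactly twice, the other ranks are distinct
        return 13 * comb(4, 2) * comb(12, k - 2) * 4**(k - 2) / comb(52, k)

    def two_pair(k):
        # two ranks appear exactly twice, the other ranks are distinct
        return comb(13, 2) * comb(4, 2)**2 * comb(11, k - 4) * 4**(k - 4) / comb(52, k)

    for k in [5, 6, 7]:
        print(k, round(one_pair(k), 4), round(two_pair(k), 4))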

[1] Y. L. Cheung. Why Poker Is Played with Five Cards. The Mathematical Gazette, Vol. 73, No. 466 (December 1989), pp. 313–315.

Probability resources

Each Wednesday I post a list of notes on some topic. This week it’s probability.

See also posts tagged probability and statistics.

Last week: Python resources

Next week: Regular expression resources