Suppose you want to hear the top 100 songs in some music genre, so you listen to a radio station that specializes in that kind of music. How long would it take to hear each of the songs at least once?
This is a variation on the coupon collector problem. If a radio station plays the top N songs in some genre, selected at random, the expected number of songs you would need to listen to before hearing each one at least once is
N HN
where HN is the Nth harmonic number. For large N this is approximately equal to
N(log N + γ)
where γ is the Euler-Mascheroni constant. When N = 100, this equals 518.
But radio stations do not choose songs at random. Even a station that only plays the 100 most popular songs will presumably play the most popular songs more often than the less popular ones. They probably also have some rules about waiting a while before playing the same song again, but we’ll ignore that for this post.
Suppose we’re listening to ZIPF, a radio station that plays the top 100 songs according to Zipf’s law: the probability of playing the nth most popular song is proportional to 1/n. How long before you’ve heard all 100 songs?
We know it’ll take more than 518 songs on average. The unequal frequency of the songs will cause us to wait longer to hear the least popular song. The 10oth most popular song, for example, will only play with probability 1/518 rather than probability 1/100.
Does the number 518 look familiar? It’s exactly the same number as above:
100 H100
That’s because the probability of hearing the 100th most popular song is proportional to 1/100, and the proportionality constant is the reciprocal of
1 + 1/2 + 1/3 + … 1/100 = H100 = 5.1873…
So in general the expected number of draws from a set of N items until you’ve seen everything at least once equals the probability of selected the least likely item when the items have draw probabilities according to Zipf’s law.
I don’t know how hard it would be to work out the exact probability for our problem of hearing all 100 songs on ZIPF, but we could estimate the probability by simulation.
When I ran it, the average of 1000 simulation reps was 1900 with a standard deviation of 595.
So it would take roughly 20x longer to hear every song selected according to Zipf’s law than to listen to the list.
The asymptomatics would be n log^2 n, since most songs have probability which is roughly 1/ n log n.
The Zipf version reminds me a bit of Diaconis & Mosteller 1989 https://gwern.net/doc/statistics/bias/1989-diaconis.pdf#page=6