Last week I wrote about a hypothetical radio station that plays the top 100 songs in some genre, with songs being chosen randomly according to Zipf’s law. The nth most popular song is played with probability proportional to 1/n.
This post is a variation on that one, looking at text consisting of the 1,000 most common words in a language, where word frequencies follow Zipf’s law.
How many words of text would you expect to read before you’ve seen all 1,000 words at least once? The math is the same as in the radio station post, and so is the simulation code: I just changed a parameter from 100 to 1,000.
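That code isn’t shown here, but a minimal sketch of this kind of simulation, with the kth most common word drawn with probability proportional to 1/k, might look like the following. (Python with NumPy; the function and variable names are mine, and the loop is not optimized, so a thousand runs take a few minutes.)

```python
import numpy as np

def draws_to_see_all(n, probs, rng, batch=2000):
    """Count Zipf-weighted draws until all n words have been seen at least once."""
    seen = np.zeros(n, dtype=bool)
    remaining = n
    draws = 0
    while remaining:
        # draw in batches so the Python loop isn't dominated by RNG call overhead
        for item in rng.choice(n, size=batch, p=probs):
            draws += 1
            if not seen[item]:
                seen[item] = True
                remaining -= 1
                if remaining == 0:
                    break
    return draws

n, runs = 1000, 1000
probs = 1.0 / np.arange(1, n + 1)   # Zipf: P(word k) proportional to 1/k
probs /= probs.sum()                # normalize to a probability distribution
rng = np.random.default_rng()

results = [draws_to_see_all(n, probs, rng) for _ in range(runs)]
print(np.mean(results), np.std(results))
```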
The result of a thousand simulation runs was an average of 41,246 words with a standard deviation of 8,417.
This has pedagogical implications. Say you were learning a foreign language by studying naturally occurring text with a relatively small vocabulary, such as newspaper articles. You might have to read a lot of text before you’ve seen all of the thousand most common words.
On the one hand, it’s satisfying to read natural text. And it’s good to have the most common words reinforced the most. But it might be more effective to have slightly engineered text, text that has been subtly edited to make sure common words have not been left out. Ideally this would be done with such a light touch that it isn’t noticeable, unlike heavy-handed textbook dialogs.
It’ll depend strongly on the nature of the text you sample. I have three corpora of Japanese text: (1) literary works, 1868-1945 (100k pages); (2) popular works, 1868-1945 (100k pages); and (3) recent writing, 2015-2020 (200k pages). I also have a file of the most common 2-kanji words across these corpora, sorted by frequency. So:
Here are the line numbers (a line is either a blank line, usually separating paragraphs, or a single sentence) of the first occurrence of each of the last 10 of the 1,000 most frequent of those words, in the literary, the popular, and the recent-writing corpora respectively:
Word – Literary – Popular – Recent
観念 – 30572 – 1116 – 1298
青春 – 10043 – 1117 – 6908
家臣 – 971 – 3285 – 10221
方角 – 1856 – 3258 – 101508
建設 – 27328 – 5686 – 1116
資格 – 6837 – 776 – 464
道徳 – 3382 – 1154 – 866
感想 – 77671 – 3003 – 2165
小田 – 4022 – 8647 – 11632
優勝 – 70893 – 18729 – 86
Note that 方角 (“direction”) was used much more commonly before 1945 than after 2015. And there was much less concern with victory in sporting events (優勝) before 1945 than there is nowadays, although line 70893 is still fairly early in the file, since the file runs over 1.8 million lines.
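For anyone curious how numbers like these might be produced, here is a minimal sketch, assuming each corpus is a UTF-8 text file with one sentence (or blank line) per line and that plain substring search is enough to detect a 2-kanji word. The file name, function name, and word list below are illustrative, not the commenter’s actual setup.

```python
def first_occurrence_lines(corpus_path, words):
    """Return the 1-based line number where each word first appears in the corpus."""
    remaining = set(words)
    first_seen = {}
    with open(corpus_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for w in list(remaining):
                if w in line:             # substring match is enough for 2-kanji words
                    first_seen[w] = lineno
                    remaining.remove(w)
            if not remaining:             # stop once every word has been located
                break
    return first_seen

# e.g. a few of the last 10 of the top 1,000 words, run against one corpus
print(first_occurrence_lines("literary.txt", ["観念", "青春", "家臣"]))
```

Running this once per corpus and joining the results by word gives a table like the one above.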
Your conclusion is reminiscent of the kinds of data augmentation techniques we find helpful for training ML models.