The birthday problem is a party trick with serious practical applications. It’s well known to people who have studied probability, but the general public is often amazed by it.
If you have a group of 23 people, there’s a 50-50 chance that at least two people have the same birthday. With a larger group, say 30, its quite likely two birthdays are the same. Not only is this a theoretical result, based on certain modeling assumptions, it actually works in practice, essentially as predicted.
Variations of the birthday problem come up routinely in applications. For example, in cryptography it’s important to know the probability of secure hash collisions. Hash functions are deterministic, but for practical purposes they act random. If you are hashing into a space of N possible hash values, you can expect to compute about √N items before two items have the same hash value.
The square root rule of thumb is very useful. For example, if you’re computing 128-bit hash values, there’s about a 50-50 chance of seeing two duplicate hash values after hashing about 264 items.
The square root heuristic works well for large N, but gives mediocre results for N as small as 365. When applied to the original birthday problem, it predicts even odds for seeing a pair of equal birthdays in a group of 19 people. That’s a little low, but not too far off.
As useful as the square root rule is, it is only good for finding when the probability of duplication is 1/2. What if you’d like to know when the probability of a collision is, say, 0.01?
Let N be the number of possible options and let r be the number of items chosen independently from the set of N options. Let P(N, r) be the probability that all r choices are distinct. Then
P(N, r) ≈ exp( −r²/2N).
This approximation [1] is valid when N is large and r is small relative to N. We could be more precise about the error bounds, but suffice it to say that bigger N is better.
When N = 365 and r = 23, the approximation above computes the probability that all 23 choices are distinct as 0.48, matching the canonical birthday problem and showing an improvement over the square root heuristic.
Related posts
[1] Anthony C. Robin. The Birthday Distribution: Some More Approximations. The Mathematical Gazette, Vol. 68, No. 445 (October, 1984), pp. 204–206
This can be thought of as a Poisson approximation, and is useful for estimating the number of triplets or quadruplets in a larger group of birthdays. #DumbTeacherTricks
Amusing to think that in around half of all soccer matches you watch, there are at least two people on the pitch with the same birthday.
R enables you to “Compute[s] answers to a generalised birthday paradox problem. pbirthday computes the probability of a coincidence and qbirthday computes the smallest number of observations needed to have at least a specified probability of coincidence. “