According to this paper [1], the empirical distribution of real passwords follows a power law [2]. In the authors’ terms, a Zipf-like distribution. The frequency of the rth most common password is proportional to something like 1/r. More precisely,
fr = C r−s
where s is on the order of 1. The value of s that best fit the data depended on the set of passwords, but their estimates of s varied from 0.46 to 0.91.
This means that the most common passwords are very common and easy to guess.
Size of password spaces
If passwords come from an alphabet of size A and have length n, then there are An possibilities. For example, if a password has length 10 and consists of uppercase and lowercase English letters and digits, there are
6210 = 839,299,365,868,340,224
possible such passwords. If users chose passwords randomly from this set, brute force password attacks would be impractical. But brute force attacks are practical because passwords are not chosen uniformly from this large space of possibilities, far from it.
Attackers do not randomly try passwords. They start with the most common passwords and work their way down the list. In other words, attackers use Pareto’s rule.
Rules requiring, say, one upper case letter, don’t help much because most users will respond by using exactly one upper case letter, probably the first letter. If passwords must have one special character, most people will use exactly one special character, most likely at the end of the word. Expanding the alphabet size A exponentially increases the possible passwords, but does little to increase the actual number of passwords.
What’s interesting about the power law distribution is that there’s not a dichotomy between naive and sophisticated users. If there were, there would be a lot of common passwords, and all the rest uniformly distributed. Instead, there’s a continuum between the most naive and most sophisticated. That means a lot of people are not exactly naive, but not as secure as they think they are.
Some math
Under the Zipf model [3], the number of times we’d expect to see the most common password is NC where N is the size of the data set. The constant C is what it has to be for the frequencies to sum to 1. That is, C depends on the number of data points N and the exponent s and is given by
We can bound the sum in the denominator from above and below with integrals, as in the integral test for series convergence. This gives us a way to see more easily how C depends on its parameters.
When s = 1 this reduces to
and otherwise to
Suppose you have N = 1,000,000 passwords. The range of s values found by Wang et al varied from roughly 0.5 to 0.9. Let’s first set s = 0.5. Then C is roughly 0.0005. This mean the most common password appears about 500 times.
If we keep N the same and set s to 0.9, then C is roughly 0.033, and so the most common password would appear about 33,000 times.
How to choose passwords
If you need to come up with a password, randomly generated passwords are best, but hard to remember. So people either use weak but memorable passwords, or use strong passwords and don’t try to remember them. The latter varies in sophistication from password management software down to Post-it notes stuck on a monitor.
One compromise is to concatenate a few randomly chosen words. Something like “thebestoftimes” would be weak because they are consecutive words from a famous novel. Something like “orangemarbleplungersoap” would be better.
Another compromise, one that takes more effort than most people are willing to expend, is to use Manuel Blum’s mental hash function.
Related posts
[1] In case the link rots in the future, the paper is “Zipf’s Law in Passwords” by Ding Wang, Haibo Cheng, Ping Wang, Xinyi Huang, and Gaopeng Jian. IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, November 2017.
[2] Nothing follows a power law distribution exactly and everywhere. But that’s OK: nothing exactly follows any other distribution everywhere: not Gaussian, not exponential, etc. But a lot of things, like user passwords, approximately follow a power law, especially over the middle of their range. Power law’s like Zipf’s law tend to not fit as well at the beginning and the end.
[3] Here I’m using a pure Zipf model for simplicity. The paper [1] used a Zipf-like model that I’m not using here. Related to the footnote [2] above, it’s often necessary to use a minor variation on the pure Zipf model to get a better fit.
https://xkcd.com/936/
Diceware (http://world.std.com/~reinhold/diceware.html) has also been a popular method of passwords with high entropy. With a little bit of effort, the passwords become memorable.
Interesting that the distribution is a power law. When I see power law, I think ‘scale invariance’, so I imagine this means that the ‘common password checker’ that some sites are rolling out will make password guessing more difficult, but only by disallowing 99% of passwords. Still leaves lots of scope for brute-forcing.
I enjoyed this topic *and*, of course, Manuel Blum’s mental hash. Now I’ve seen several approaches which are intended to solve one-side of a long-standing problem: namely that of presenting a password over an unsafe medium and (potentially) being saved to a system which is open to master password list compromise. These are truly great ideas to alleviate this problem!
However, there is another aspect to this that remains unsolved, namely: how the user remembers what his password (perhaps unhashed) is. It may be that until we began to address the first one, that the other one becomes so much more obvious. It is such a serious problem, that it likely interferes with the user being willing to adopt one of these hashing mechanisms.
Though no one (I have run across) has mentioned this particular duality of these challenges presented by the historic use of user “remembered” passwords, per se, many of their consequences have been mentioned (here and elsewhere:
a) different site requirements for password length, strength, etc., and
b) site requiring password change (pre-planned, or by surprise), and
c) multiple “spheres” of interest (personal and work), and
c) multiple platforms (client devices) from which to send, create, host a “password safe”: cell phone, personal computer, work computer(s), different OSs or directories, etc., and
d) too many passwords to remember (my own is well into the hundreds), and everyone keeps trying to increase it, and
e) there are others.
Yes, I have my own way of dealing with it, and no I can’t tell you how. But I will tell you just as soon as I come up with something that will be safe to share.
I think SQRL has the potential to make things better. Sites don’t need your password per se, they need proof that *you* have your password, which can be accomplished via zero-knowledge proofs. If your password never leaves your machine, it can’t be compromised if a company mishandles it or is hacked.
Ah, yes, SQRL. It appears to be a lucid, well-conceived, comprehesive approach to these challenges and more. Additionally it [can] remove(s) the need for servers to “remember” actual personal or financial identifiers (e.g., social security numbers, credit card numbers), instead remembering only the “enHashed” value, created by the site’s per-user/per-site private key, that will only be “unlocked” when the user proves he knows the password. Widespread adoption would eliminate the one of the main drivers of the massive security breaches: the theft of personal (financial) information for fraudulent purposes.
In today’s (not SQRL) world, most of the web sites in which I interact, I have no concern as to whether they are compromised. Yet, I am required to behave as if they were important! This adds to the burden that must be carried along-side of those sites for which would be disasterous (to me) if compromised. Yet I have no choice but to treat them (largely) the same.
The site-by-site patchwork solutions leads to a (password) identity “fatigue” –with often grave results. Such a comprehensive solution as SQRL if in widespread use, would have an economy of effort by the user that could be enjoyed for all high-security sites, and then the low-security ones can take the free ride. I like it.
Where do we sign up?
There’s always a chicken-and-egg element to technology adoption: nobody uses the new thing because nobody uses it, or everybody uses the old thing because everybody uses it. Getting traction with something new is hard.
SQRL might take off because, for one thing, the author has a big podcast audience. A recent Google publication mentioned SQRL, so it’s on their radar.