A secure hash function maps a file to a string of bits in a way that is hard to reverse. Ideally such a function has three properties:
- pre-image resistance
- collision resistance
- second pre-image resistance
Pre-image resistance means that starting from the hash value, it is very difficult to infer what led to that output; it essentially requires a brute force attack, trying many inputs until something hashes to the given value.
Collision resistance means it’s extremely unlikely that two files would map to the same hash value, either by accident or by deliberate attack.
Second pre-image resistance is like collision resistance except one file is fixed. A second pre-image attack is harder than a collision attack because the attacker can only vary one file.
This post explains how to compute hash functions from the Linux command line, from Windows, from Python, and from Mathematica.
Files vs strings
Hash functions are often applied to files. If a website makes a file available for download, and publishes a hash value, you can compute the hash value yourself after downloading the file to make sure they match. A checksum could let you know if a bit was accidentally flipped in transit, but it’s easy to deliberately tamper with files without changing the checksum. But a secure hash function makes such tampering unfeasible.
You can think of a file as a string or a string as a file, but the distinction between files and strings may matter in practice. When you save a string to a file, you might implicitly add a newline character to the end, causing the string and its corresponding file to have different hash values. The problem is easy to resolve if you’re aware of it.
Another gotcha is that text encoding matters. You cannot hash text per se; you hash the binary representation of that text. Different representations will lead to different has values. In the examples below, only Python makes this explicit.
openssl digest
One way to compute hash values is using openssl
. You can give it a file as an argument, or pipe a string to it.
Here’s an example creating a file f
and computing its SHA256 hash.
$ echo "hello world" > f $ openssl dgst -sha256 f SHA256(f)= a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447
We get the same hash value if we pipe the string “hello world” to openssl
.
$ echo "hello world" | openssl dgst -sha256 a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447
However, echo
silently added a newline at the end of our string. To get the hash of “hello world” without this newline, use the -n
option.
$ echo -n "hello world" | openssl dgst -sha256 b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
To see the list of hash functions openssl
supports, use list --digest-commands
. Here’s what I got, though the output could vary with version.
$ openssl list --digest-commands blake2b512 blake2s256 gost md4 md5 mdc2 rmd160 sha1 sha224 sha256 sha3-224 sha3-256 sha3-384 sha3-512 sha384 sha512 sha512-224 sha512-256 shake128 shake256 sm3
A la carte commands
If you’re interested in multiple hash functions, openssl
has the advantage of handling various hashing algorithms uniformly. But if you’re interested in a particular hash function, it may have its only command line utility, such as sha256sum
and md5sum
. But these are not named consistently. For example, the utility to compute BLAKE2 hashes is b2sum
.
hashalot
The hashalot
utility is designed for hashing passphrases. As you type in a string, the characters are not displayed, and the input is hashed without a trailing newline character.
Here’s what I get when I type “hello world” at the passphrase prompt below.
$ hashalot -x sha256 Enter passphrase: b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
The -x
option tells hashalot
to output hexadecimal rather than binary.
Note that this produces the same output as
echo -n "hello world" | openssl dgst -sha256
According to the documentation,
Supported values for HASHTYPE: ripemd160 rmd160 rmd160compat sha256 sha384 sha512
Python hashlib
Python’s hashlib
library supports several hashing algorithms. And unlike the examples above, it makes the encoding of the input and output explicit.
import hashlib print(hashlib.sha256("hello world".encode('utf-8')).hexdigest())
This produces b94d…cde9
as in the examples above.
hashlib
has two attributes that let you know which algorithms are available. The algorithms_available
attribute is the set of hashing algorithms available in your particular instance, and the algorithms_guaranteed
attribute is the set of algorithm guaranteed to be available anywhere the library is installed.
Here’s what I got on my computer.
>>> a = hashlib.algorithms_available >>> g = hashlib.algorithms_guaranteed >>> assert(g.issubset(a)) >>> g {'sha1', 'sha512', 'sha3_224', 'shake_256', 'sha3_256', 'sha256', 'shake_128', 'sha224', 'md5', 'sha384', 'blake2s', 'sha3_512', 'blake2b', 'sha3_384'} >>> a.difference(g) {'md5-sha1', 'mdc2', 'sha3-384', 'ripemd160', 'blake2s256', 'md4', 'sha3-224', 'whirlpool', 'sha512-256', 'blake2b512', 'sha512-224', 'sm3', 'shake128', 'shake256', 'sha3-512', 'sha3-256'}
Hashing on Windows
Windows has a utility fciv
whose name stands for “file checksum integrity verifier”. It only supports the broken hashes MD5 and SHA1 [1].
PowerShell has a function Get-FileHash
that uses SHA256 by default, but also supports SHA1, SHA384, SHA512, and MD5.
Hashing with Mathematica
Here’s our running example, this time in Mathematica.
Hash["hello world", "SHA256", "HexString"]
This returns b94d…cde9
as above. Other hash algorithms supported by Mathematica: Adler32, CRC32, MD2, MD3, MD4, MD5, RIPEMD160, RIPEMD160SHA256, SHA, SHA256, SHA256SHA256, SHA384, SHA512, SHA3-224, SHA3-256, SHA3-384, SHA3-512, Keccak224, Keccak256, Keccak384, Keccak512, Expression.
Names above that concatenate two names are the composition of the two functions. RIPEMD160SHA256 is included because of its use in Bitcoin. Here “SHA” is SHA-1. “Expression” is a non-secure 64-bit hash used internally by Mathematica.
Mathematica also supports several output formats besides hexadecimal: Integer, DecimalString, HexStringLittleEndian, Base36String, Base64Encoding, and ByteArray.
More hash function posts
[1] It’s possible to produce MD5 collisions quickly. MD5 remains commonly used, and is fine as a checksum, though it cannot be considered a secure hash function any more.
Google researchers were able to produce SHA1 collisions, but it took over 6,000 CPU years distributed across many machines.
On windows there is also certutil with the -hashfile option. Doesn’t support everything but has SHA256 SHA384 and SHA512.
I’ve always seen 2nd preimage resistance compared / contrasted with collision resistance, but to me the comparison that is more interesting / important is pre-image vs. 2nd pre-image… what is the difference, really?
I understand the words and the mechanics, but it seems really really like a distinction without much of a difference?
if I’m understanding correctly, for pre-image resistance, the goal is “find any input that hashes to this value”. OK fine. And 2nd pre-image resistance is “find any input that hashes to this value, EXCEPT for this one input, that one we know already.”
I mean, there is a huge number of preimages that hash to any value, so is removing one of them from contention really so different? If you’re brute forcing, it’s very unlikely you’re going to also hit “Pre Image A”.
So again, (I think) I understand the wording and the definitions, I’m just not sure I understand the importance of the distinction. Maybe a use case would help? I often hear “passwords” for pre-image, and “digital signatures” for 2nd preimage, but either way, if the hash is weak enough such that I can find an “m” that hashes to “H”, both passwords and digital signatures are broken.
What am I missing here?
Here’s the difference between collision and second pre-image.
There are two kinds of collision: accidental and intentional. Accidental collisions matter for applications like comparing files in a version control system. It would break git if, for example, two developers edited a file in different ways that happened to lead to the same hash value.
Intentional collisions would occur if someone tinkered with two different files simultaneously until they made them have the same hash value. Someone working on such a problem might start with a file X and say “I can’t make another file that has the same hash as X. I wish X had this property, so I’ll change X to make it have that property and try again.”
What a second pre-image attack implies is that someone could take an ordinary file, one that occurs naturally rather than one that has been carefully constructed to make a collision possible, and create a new file with the same hash. For example, if you’re a software company distributing updates, you don’t want someone to be able to inject a virus without changing the hash value. The hash is your seal to certify that the file hasn’t been tampered with.
Google researchers were able to engineer an artificial collision for SHA1, but that required being able to change both files. They were able to carefully design a file to allow them to create a second file with the same hash value. But that in itself is not a security problem. The security problem is if someone could take an arbitrary file, like a software update, and create a malicious doppelganger without needing to change the original.
Existence of a collision is not a security problem per se, but a proof of concept suggesting that second pre-image attacks may be possible. Maybe the collision creation procedure requires a file to have some specific bit pattern that is extremely unlikely to occur naturally. That wouldn’t be a direct threat, but it would suggest that someone may be able to create an attack that is an threat.