Identifying hash algorithms

Given a hash value, can you determine what algorithm produced it? Or what algorithm probably produced it?

Obviously if a hash value is 128 bits long, then a 128-bit algorithm produced it. Such a hash value might have been produced by MD5, but not by SHA-1, because the former produces 128-bit hashes and the latter a 160-bit hash.
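As a quick check of those output lengths, here is a minimal Python sketch using the standard `hashlib` module. The helper name `digest_bits` is made up for illustration.

```python
import hashlib

def digest_bits(name: str) -> int:
    """Return the output length, in bits, of a hashlib algorithm."""
    # digest_size is reported in bytes, so multiply by 8.
    return hashlib.new(name).digest_size * 8

for name in ("md5", "sha1"):
    print(name, digest_bits(name))
```

The length rules out algorithms, but it cannot single one out among algorithms of the same length.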

The more interesting question is this: given an n-bit hash, can you tell which n-bit hash algorithm it probably came from? You will find contradictory answers online. Some people confidently say no, that a hash value is for practical purposes a random selection from its range. Others say yes, and point to software that reports which algorithm the hash probably came from. Which is it?

Some hashing software has a custom output format, such as sticking a colon at a fixed position inside the hash value. Software such as hashid uses regular expressions to spot these custom formats.
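To illustrate the idea, here is a rough sketch of format detection with regular expressions. The colon-in-the-middle format below is invented for this example and is not a pattern taken from hashid or from any real product.

```python
import re

# Hypothetical custom format, invented for illustration:
# 16 hex digits, a colon, then 16 more hex digits.
CUSTOM_FORMAT = re.compile(r"[0-9a-f]{16}:[0-9a-f]{16}", re.IGNORECASE)

# A bare hexadecimal number with no telltale formatting.
PLAIN_HEX = re.compile(r"[0-9a-f]+", re.IGNORECASE)

def classify(value: str) -> str:
    """Say whether a string has the made-up custom format or is plain hex."""
    if CUSTOM_FORMAT.fullmatch(value):
        return "custom format"
    if PLAIN_HEX.fullmatch(value):
        # Each hex digit encodes 4 bits.
        return f"plain hex, {len(value) * 4} bits"
    return "unrecognized"

print(classify("0123456789abcdef:0123456789abcdef"))
print(classify("98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f"))
```

The formatting carries information about the software that wrote the hash. The digits themselves do not.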

But assuming you have a hash value that is simply a number, you cannot know which hash algorithm produced it, other than narrowing it to a list of known algorithms that produce a number of that length. If you could know, this would indicate a flaw in the hashing algorithm.

So, for example, a 160-bit hash value could come from SHA-1, or it could come from RIPEMD-160, Haval-160, Tiger-160, or any other hash function that produces 160-bit output.

To say which algorithm probably produced the hash, you need context, some sort of modeling assumption. In general SHA-1 is the most popular 160-bit hash algorithm, so if you have nothing else to go on, your best guess for the origin of a 160-bit hash value would be SHA-1. But RIPEMD-160 is used in Bitcoin, where it is part of how addresses are derived, and so if you find a 160-bit hash value in the context of Bitcoin, it’s more likely to be RIPEMD-160. There must be contexts in which Haval-160 and Tiger-160 are more likely, though I don’t know what those contexts would be.

Barring telltale formatting, software that tells you which hash functions most likely produced a hash value is simply reporting the most popular hash functions for the given length.

For example, I produced a 160-bit hash of “hello world” using RIPEMD-160

    echo -n "hello world" | openssl dgst -rmd160

then asked hashid where it came from.

    hashid '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f'
    Analyzing '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f'
    [+] SHA-1
    [+] Double SHA-1
    [+] RIPEMD-160
    [+] Haval-160
    [+] Tiger-160
    [+] HAS-160
    [+] LinkedIn
    [+] Skein-256(160)
    [+] Skein-512(160)

I got exactly the same output when I hashed “Gran Torino” and “Henry V” because the output doesn’t depend on the hashes per se, only their length.
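Here is a minimal sketch of that length-only behavior. It is not hashid's actual implementation: the function name `candidates` is hypothetical, and the table simply copies the 160-bit list from the output above. The example hashes with SHA-1 rather than RIPEMD-160 only because SHA-1 is always available in Python's `hashlib`; since only the length matters, the choice makes no difference.

```python
import hashlib

# Candidate list for 160-bit values, copied from the hashid output above.
CANDIDATES_BY_BITS = {
    160: [
        "SHA-1", "Double SHA-1", "RIPEMD-160", "Haval-160", "Tiger-160",
        "HAS-160", "LinkedIn", "Skein-256(160)", "Skein-512(160)",
    ],
}

def candidates(hex_digest: str) -> list[str]:
    """Return candidate algorithms based only on the digest's length."""
    bits = len(hex_digest) * 4  # each hex digit encodes 4 bits
    return CANDIDATES_BY_BITS.get(bits, [])

h1 = hashlib.sha1(b"Gran Torino").hexdigest()
h2 = hashlib.sha1(b"Henry V").hexdigest()

# Different hash values, identical answers.
print(h1 != h2, candidates(h1) == candidates(h2))
```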

Whether software can tell you where a hash probably came from depends on your notion of “probably.” If you find a 160-bit hash in the wild, it’s more likely to have come from SHA-1 than RIPEMD. But if you were to write a program that generates random text and hashes it with either SHA-1 or RIPEMD, software asked to identify the algorithm from the hash value alone would likely fail badly.
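The experiment might be sketched as follows, under some loud assumptions. RIPEMD-160 is not guaranteed to be available in Python's `hashlib`, so this sketch substitutes BLAKE2b truncated to 160 bits as a stand-in second algorithm. The guesser is a made-up function that always answers with the most popular 160-bit algorithm.

```python
import hashlib
import random
import string

def random_text(rng: random.Random, length: int = 20) -> str:
    """Generate a random string of ASCII letters."""
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))

def popularity_guess(hex_digest: str) -> str:
    """Guess by length alone: the most popular 160-bit algorithm is SHA-1."""
    return "sha1" if len(hex_digest) == 40 else "unknown"

def run_trials(rng: random.Random, trials: int = 2000) -> float:
    """Hash random text with a randomly chosen algorithm and score the guesser."""
    correct = 0
    for _ in range(trials):
        data = random_text(rng).encode()
        algo = rng.choice(["sha1", "other160"])
        if algo == "sha1":
            digest = hashlib.sha1(data).hexdigest()
        else:
            # Stand-in for RIPEMD-160 (an assumption of this sketch):
            # BLAKE2b configured for a 20-byte, i.e. 160-bit, digest.
            digest = hashlib.blake2b(data, digest_size=20).hexdigest()
        correct += popularity_guess(digest) == algo
    return correct / trials

accuracy = run_trials(random.Random(0))
print(f"fraction guessed correctly: {accuracy:.2f}")
```

When the two algorithms are equally likely, a guess based on popularity is right only about half the time, which is no better than flipping a coin.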
