Why eliminate trusted third parties?

Suppose one company would like to buy another company’s client list, but only if the lists don’t overlap too much. Neither company wants to hand over their list to the other before a sale takes place. What can they do?

A low-tech solution would be for both parties to provide their client lists to a trusted third party who will report back how much the lists overlap. That may be the best thing to do.

But it is possible to solve this problem without a trusted third party. With homomorphic encryption, the companies can exchange encrypted versions of their client lists that will allow both to calculate the amount of overlap without revealing any further information.
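
As an illustration of the flavor of such protocols, here is a toy sketch of private set intersection in Python. It uses commutative Diffie–Hellman-style blinding rather than full homomorphic encryption, and the modulus, key sizes, and client lists are made up for the example; a real implementation would use a standard group and a vetted protocol.

import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, far too small for real security

def blind(items, key):
    # Hash each item into the group, then raise it to a secret exponent.
    out = set()
    for item in items:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P
        out.add(pow(h, key, P))
    return out

a_key = secrets.randbelow(P - 2) + 1  # Alice's secret exponent
b_key = secrets.randbelow(P - 2) + 1  # Bob's secret exponent

alice = {"ann@x.com", "bob@y.com", "carol@z.com"}
bob   = {"bob@y.com", "carol@z.com", "dave@w.com"}

# Each party blinds its own list, exchanges it, and blinds the other's.
# Exponentiation commutes, so common items yield matching double-blinded values.
alice_doubly = {pow(v, b_key, P) for v in blind(alice, a_key)}
bob_doubly   = {pow(v, a_key, P) for v in blind(bob, b_key)}

print(len(alice_doubly & bob_doubly))  # prints 2, the size of the overlap

Neither party ever sees the other’s raw identifiers, only blinded values, yet both can compute the size of the overlap.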

But why go to the effort? Many peer-to-peer technologies raise this question. So you’ve eliminated a third party; what’s so great about that? If you can send someone cryptocurrency, for example, without going through an intermediary like a bank or credit card company, what good is that if the transaction fees are no lower?

It’s often not worth using sophisticated technology to eliminate a trusted third party, but sometimes it is. Here are some reasons the technology might be worth using.

Transaction speed

The two companies hiring a third party to compare their lists have to wait on the third party, and the amount of time they need to wait is beyond their control. Maybe that’s acceptable for a one-time transaction, but not for repeated transactions. With homomorphic encryption, transactions could be automated and the only delay would be computation time.

Reproducibility

Sharing limited information via encryption reduces legal risk. If either party later accuses the other of providing incorrect information, the accused party can demonstrate that the software applied to the data gives the reported result.

Trust

To paraphrase Bob Dylan, you gotta trust somebody. Some technologies are labeled “zero-trust” or “trust no one,” but these terms need to be understood in context. When a company asks you to run a particular piece of software on your proprietary data and share the results, you have to trust that the software is not malicious or buggy.

Instead of trusting that a third party holding your data is honest and competent, you trust that a third party developing software is honest and competent. You have to decide that the software product is trustworthy. You might test the software on some sample data. Maybe you inspect the source code if it’s available. But at some point you have to trust the software and the context it runs in.


RSA security in light of news

A recent article reported on the successful factoring of a 512-bit RSA key. The process took $8 worth of rented computing power. What does that say about the security of RSA encryption?

The following Python function estimates the number of computation steps required to factor a number b bits long using the best known algorithm, the general number field sieve. We’re going to take a ratio of two such estimates, so proportionality constants will cancel.

from math import exp, log

def f(b):
    # Heuristic GNFS cost: exp((64/9)^(1/3) (log n)^(1/3) (log log n)^(2/3)).
    # Note that (8/3)^(2/3) equals (64/9)^(1/3).
    logn = b*log(2)
    return exp((8/3)**(2/3) * logn**(1/3) * (log(logn))**(2/3))

The minimum recommended RSA key size is 2048 bits. The cost ratio for factoring a 2048-bit key versus a 512-bit key is f(2048) / f(512), which is on the order of 10¹⁶. So if factoring a 512-bit key costs $8, factoring a 2048-bit key would cost on the order of 80 quadrillion dollars.
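
To reproduce that figure with the function above (the exact value depends on floating point details, but the order of magnitude is what matters):

ratio = f(2048) / f(512)
print(f"{ratio:.1e}")  # roughly 9e15, i.e. on the order of 10^16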

This is sort of a back-of-the-envelope estimate. There are things it doesn’t take into account. If a sophisticated and highly determined entity wanted to factor a 2048-bit number, they could probably do so for less than 80 quadrillion dollars. But it’s safe to say that the people who factored the 512-bit key are unlikely to factor a 2048-bit key by the same approach.

Converse of RSA

The security of RSA encryption depends on the difficulty of factoring the product of two large primes. If you can factor large numbers efficiently, you can break RSA. But if you can break RSA, can you factor large numbers?

Sorta. It’s conceivable that there is a way to break RSA encryption without having to recover the private key. But if you can recover the private key, then you can factor efficiently. That is, if you can start with a public key (N, e), where N is the modulus and e is the encryption key, and recover the private key d, then you can efficiently factor N. See “Fact 1” in [1].
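
Here is a toy sketch of that algorithm in Python. The key used in the demo (p = 61, q = 53, e = 17, d = 2753) is the usual textbook example, vastly smaller than a real key. The algorithm is probabilistic: it hunts for a nontrivial square root of 1 mod N.

import math
import secrets

def factor_from_key(N, e, d):
    # Given an RSA public/private key pair, recover a factor of N.
    # k = de - 1 is a multiple of phi(N), hence even.
    k = d * e - 1
    t, r = 0, k
    while r % 2 == 0:  # write k = 2^t * r with r odd
        r //= 2
        t += 1
    while True:
        g = secrets.randbelow(N - 3) + 2  # random base in [2, N-2]
        x = pow(g, r, N)
        for _ in range(t):
            y = pow(x, 2, N)
            if y == 1 and x not in (1, N - 1):
                # x is a nontrivial square root of 1 mod N
                return math.gcd(x - 1, N)
            x = y
        # unlucky g; each random g succeeds with probability at least 1/2

print(factor_from_key(3233, 17, 2753))  # prints 61 or 53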

Let n = log₂ N. Then the algorithm alluded to above can be run in O(n³) time. But the best known factoring algorithms take more than O(exp(n^(1/3))) time.


[1] Dan Boneh. Twenty Years of Attacks on the RSA Cryptosystem. Notices of the AMS, 1999.

Unicode Steganography

Steganography attempts to prevent messages from being read by unintended recipients by hiding the messages rather than (or in addition to) encrypting them. Steganography is used when you not only want to keep your communication private, you want to hide the fact that you’ve communicated at all.

Fun fact: The words steganography and stegosaurus are related [1].

Famous example

A famous example of steganography was a secret message sent by Jeremiah Denton during the Vietnam War. While a prisoner of war, Denton was forced to participate in a propaganda video. He sent the word “torture” by blinking the Morse code for the letters of the word.


Famous non-example

Some alleged examples of steganography have turned out to be apophenia, finding patterns where none exist. The book The Woman Who Smashed Codes details Elizebeth Smith’s introduction to cryptography: she was tasked with finding messages hidden in minor variations of Shakespeare’s handwriting, messages that were not there. The book goes on to describe her cryptographic work during WWII, deciphering messages that most certainly did exist.

Incidentally, Elizebeth Smith [2] married fellow cryptographer William F. Friedman. I wrote about Friedman’s index of coincidence a while back.

Enter Unicode

Randall Munroe said “I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves.” One of the things that makes their task delightful and hopeless is trying to distinguish semantics from appearance.

For example, the capital letters at the beginning of the Roman and Greek alphabets have different Unicode values even though they look alike. A (U+0041) is a Roman letter and Α (U+0391) is a Greek letter, and so they’re not the same. Also, the Roman letter M (U+004D) is semantically different from the Roman numeral Ⅿ (U+216F) that represents 1,000.

But it quickly becomes impossible to consistently make such distinctions, and so Unicode is full of compromises. Should the letter i and the imaginary unit i have different code points? What about the symbol i for current and the unit basis vector i? You can’t have a different code point for every use of a symbol.

Because Unicode has numerous pairs of characters with identical appearance, it’s possible to hide binary data in Unicode text by using one member of a pair to represent a 0 and the other to represent a 1. So maybe d (U+0064 Latin Small Letter D) represents a 0 and ԁ (U+0501 Cyrillic Small Letter Komi De) represents a 1.
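
Here is a minimal sketch of that idea in Python, using the pair of d’s above. The cover text and the payload bits are made up for the illustration.

LATIN_D = "\u0064"   # d, Latin Small Letter D
KOMI_DE = "\u0501"   # ԁ, Cyrillic Small Letter Komi De

def hide(cover, bits):
    # Replace the i-th 'd' in the cover text with the Cyrillic look-alike
    # when the i-th payload bit is 1. Leftover 'd's encode trailing zeros.
    bits = iter(bits)
    return "".join(KOMI_DE if ch == LATIN_D and next(bits, 0) else ch
                   for ch in cover)

def reveal(text):
    # Read the hidden bits back out, one per 'd'-like character.
    return [int(ch == KOMI_DE) for ch in text if ch in (LATIN_D, KOMI_DE)]

stego = hide("dogs and ducks paddle deftly", [1, 0, 1, 1])
print(stego)          # looks like the original to a casual reader
print(reveal(stego))  # [1, 0, 1, 1, 0, 0] -- one bit per 'd' in the cover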

There is a potential problem with this scheme. Unicode does not dictate appearance, and it’s entirely possible a typographer might create a font that has distinct glyphs for characters that are not distinct in other fonts.

Security

Look-alike characters are often used to create malicious URLs. For instance, someone might take “Microsoft.com” and substitute the Roman numeral Ⅿ for the first letter, or substitute a Greek omicron for one of the o’s.

Text that is expected to be ASCII should be converted to ASCII to prevent mistakes or malice, or the user warned. “Do you really want to visit this URL that contains nine Roman letters and one Cyrillic letter?”
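
Here’s a sketch of such a check in Python. The audit function and the example URL are made up for illustration; unicodedata.name makes the look-alikes easy to call out.

import unicodedata

def audit(url):
    # Flag any character outside ASCII, with its official Unicode name.
    for ch in url:
        if ord(ch) > 127:
            print(f"suspicious: {ch!r} is {unicodedata.name(ch, 'UNNAMED')}")

audit("Ⅿicrosoft.com")  # suspicious: 'Ⅿ' is ROMAN NUMERAL ONE THOUSAND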

When I’m reading, I want fonts with broad Unicode support. No missing symbols, no jarring change in font for foreign words. But when I’m debugging, it would be nice to have the opposite, a xenophobic font that displays non-ASCII characters in some ugly way that makes them jump out. I imagine someone has developed such a font, but it’s hard to find because most people are looking for better Unicode support, not worse.


[1] Both derive from the Greek word for ‘cover’. Steganographic writing is covered in the sense of being hidden. A stegosaurus has armored plates that look like roof tiles, i.e. like the covering of a house.

[2] That’s not a typo. She spelled her name with ‘e’ as the fifth letter rather than the more common ‘a’.

Details of generating primes for cryptography

RSA public key cryptography begins by finding a couple of large primes. You essentially do this by testing random numbers until you find primes, though it’s not quite that simple.

Filippo Valsorda just posted a good article on this.

Suppose you’re looking for a 1024-bit prime number. You generate random 1024-bit numbers and then test until you find one that’s prime. You can immediately make this process twice as fast by setting the last bit to 1: it would be wasteful to generate a new number every time you happened to draw an even number.

A little less obvious is that it’s common to set the top bit as well. When you’re generating a number between 0 and 2¹⁰²⁴ − 1, it’s possible that you could generate a small number. It’s possible that you generate a very small number, like 59, but extremely unlikely. But it’s not so unlikely that you’d generate a number on the order of 2¹⁰²⁰, for example. By setting the top bit, you know you’re generating a number between 2¹⁰²³ and 2¹⁰²⁴.
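
A minimal sketch of generating such a candidate (the function name is made up; secrets is Python’s cryptographically secure source of randomness):

import secrets

def candidate(bits=1024):
    # Random number with the bottom bit set (so it's odd) and the
    # top bit set (so it's exactly `bits` bits long).
    n = secrets.randbits(bits)
    n |= 1
    n |= 1 << (bits - 1)
    return n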

Most composite numbers have small factors, so you check for divisibility by 3, 5, 7 etc. before running more time-consuming tests. Probabilistic tests are far more efficient than deterministic tests, so in practice everyone uses probable primes in RSA. For details of how you apply these tests, and how many tests to run, see Filippo Valsorda’s article.
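
Continuing the sketch, here’s how trial division and a probabilistic test might fit together, building on the candidate function above. This is an illustration, not a vetted implementation; for production use, rely on an established library and the guidance in Valsorda’s article.

import secrets

SMALL_PRIMES = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

def miller_rabin(n, rounds=40):
    # Probabilistic primality test. Write n - 1 = 2^t * r with r odd.
    r, t = n - 1, 0
    while r % 2 == 0:
        r //= 2
        t += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random witness in [2, n-2]
        x = pow(a, r, n)
        if x in (1, n - 1):
            continue
        for _ in range(t - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True  # probably prime

def probable_prime(bits=1024):
    while True:
        n = candidate(bits)
        if any(n % p == 0 for p in SMALL_PRIMES):
            continue  # cheap trial division weeds out most composites
        if miller_rabin(n):
            return n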

Should you be concerned about setting the top bit of prime candidates? There are naive and sophisticated reasons not to worry, and an intermediate reason to at least think about it.

The naive response is that you’re just losing one bit of randomness. How much could that hurt? But in other contexts, such as losing one bit of randomness in an AES key, people do worry about such losses.

The bits in the prime factors of an RSA modulus do not correspond directly to bits of security. A 2048-bit modulus, the product of two 1024-bit primes, has about 112 bits of security. (See NIST SP 800-57.) You could set several bits in your prime factors before losing a bit of security. If this bothers you, move up to a 3072-bit modulus rather than worrying that your 2048-bit modulus is in a sense a 2046-bit modulus.


RNG, PRNG, CSPRNG

Most random number generators are pseudorandom number generators (PRNGs). The distinction may be pedantic or crucial, depending on context. In the context of cryptography, it’s critical.

For this post, RNG will mean a physical, true random number generator.

A PRNG may be suitable for many uses (Monte Carlo simulation, numerical integration, game development, etc.) but not be suitable for cryptography. A PRNG which is suitable for cryptography is called a CSPRNG (cryptographically secure pseudorandom number generator).

A PRNG may have excellent statistical properties, and pass standard test suites for random number generators, and yet be insecure. The output of an insecure generator may have no statistical regularities, and yet have regularities that a sneaky cryptanalyst can exploit. For example, the popular Mersenne Twister is fine for simulations but its output can be predicted after a relatively short run. The prediction depends on a clever set of calculations that would be unnatural from a statistical perspective, which is why its statistical performance is better than its cryptographic performance.
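
For example, CPython’s random module uses MT19937. After observing 624 consecutive 32-bit outputs, you can invert the output tempering, reconstruct the generator’s internal state, and predict everything it produces next. Here’s a sketch of the standard trick; the iteration counts in the inverters are just enough to undo each tempering shift on 32-bit words.

import random

def invert_right(y, shift):
    # Invert the tempering step y ^= y >> shift for 32-bit y.
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x

def invert_left(y, shift, mask):
    # Invert the tempering step y ^= (y << shift) & mask for 32-bit y.
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & 0xFFFFFFFF

def untemper(y):
    # Undo MT19937's output tempering to recover a raw state word.
    y = invert_right(y, 18)
    y = invert_left(y, 15, 0xEFC60000)
    y = invert_left(y, 7, 0x9D2C5680)
    return invert_right(y, 11)

rng = random.Random(1234)
observed = [rng.getrandbits(32) for _ in range(624)]

clone = random.Random()
clone.setstate((3, tuple(untemper(y) for y in observed) + (624,), None))

# The clone now predicts the original generator's future output.
assert [clone.getrandbits(32) for _ in range(10)] == \
       [rng.getrandbits(32) for _ in range(10)]

A CSPRNG is, roughly speaking, a generator for which no shortcut like this exists.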

CSPRNGs tend to be much slower than PRNGs, so you pay for security. And for a non-cryptographic application this cost isn’t worth it.

In general, statistical tests give necessary but not sufficient conditions for a PRNG to be a CSPRNG. If a PRNG fails statistical tests, it has some sort of regularity that potentially could be exploited by cryptanalysis. I had to tell a client once that although his PRNG passed the standard statistical tests, I was pretty sure I could break the cryptographic system he wanted to use it in. This news was not well received.


I suspect that a physical RNG with good statistical properties will have good cryptographic properties as well, in contrast to the situation with PRNGs. Cloudflare famously uses lava lamps to generate random bits for TLS keys. Cryptanalysts have exploited minor flaws in PRNGs, and so the lava lamps give Cloudflare one less thing to worry about. (I’m sure they still have plenty else to worry about.)

A physical RNG might fail statistical tests. Maybe the physical process is truly random but biased, or maybe the process of turning physical phenomena into numbers introduces some bias. But it’s hard to imagine that an RNG with a clean bill of statistical health could harbor a cryptographically exploitable weakness. It’s conceivable, but it seems highly doubtful.


Identifying hash algorithms

Given a hash value, can you determine what algorithm produced it? Or what algorithm probably produced it?

Obviously if a hash value is 128 bits long, then a 128-bit algorithm produced it. Such a hash value might have been produced by MD5, but not by SHA-1, because the former produces 128-bit hashes and the latter a 160-bit hash.

The more interesting question is this: given an n-bit hash, can you tell which n-bit hash algorithm it probably came from? You will find contradictory answers online. Some people confidently say no, that a hash value is for practical purposes a random selection from its range. Others say yes, and point to software that reports which algorithm the hash probably came from. Which is it?

Some hashing software has a custom output format, such as sticking a colon at a fixed position inside the hash value. Software such as hashid uses regular expressions to spot these custom formats.

But assuming you have a hash value that is simply a number, you cannot know which hash algorithm produced it, other than narrowing it to a list of known algorithms that produce a number of that length. If you could know, this would indicate a flaw in the hashing algorithm.

So, for example, a 160-bit hash value could come from SHA-1, or it could come from RIPEMD-160, Haval-160, Tiger-160, or any other hash function that produces 160-bit output.

To say which algorithm probably produced the hash, you need context, some sort of modeling assumption. In general SHA-1 is the most popular 160-bit hash algorithm, so if you have nothing else to go on, your best guess for the origin of a 160-bit hash value would be SHA-1. But RIPEMD is part of the Bitcoin standard, and so if you find a 160-bit hash value in the context of Bitcoin, it’s more likely to be RIPEMD. There must be contexts in which Haval-160 and Tiger-160 are more likely, though I don’t know what those contexts would be.

Barring telltale formatting, software that tells you which hash functions most likely produced a hash value is simply reporting the most popular hash functions for the given length.

For example, I produced a 160-bit hash of “hello world” using RIPEMD-160

   echo -n "hello world" | openssl dgst -rmd160

then asked hashid where it came from.

    hashid '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f'
    Analyzing '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f'
    [+] SHA-1
    [+] Double SHA-1
    [+] RIPEMD-160
    [+] Haval-160
    [+] Tiger-160
    [+] HAS-160
    [+] LinkedIn
    [+] Skein-256(160)
    [+] Skein-512(160)

I got exactly the same output when I hashed “Gran Torino” and “Henry V” because the output doesn’t depend on the hashes per se, only their length.

Whether software can tell you where a hash probably came from depends on your notion of “probably.” If you find a 160-bit hash in the wild, it’s more likely to have come from SHA-1 than from RIPEMD-160. But if you wrote a program that generated random text and hashed it with SHA-1 or RIPEMD-160 chosen at random, software that always guesses SHA-1 would fail badly.


Probability, cryptography, and naïveté

Probability and cryptography have this in common: really smart people can be confidently wrong about both.

I wrote years ago about how striking it was to see two senior professors arguing over an undergraduate probability exercise. As I commented in that post, “Professors might forget how to do a calculus problem, or make a mistake in a calculation, but you wouldn’t see two professors defending incompatible solutions.”

Not only do smart people often get probability wrong, they can be very confident while doing so. The same applies to cryptography.

I recently learned of a cipher J. E. Littlewood invented that he believed was unbreakable. His idea was essentially a stream cipher, simulating a one-time pad by using a pseudorandom number generator. He assumed that since a one-time pad is secure, his simulacrum of a one-time pad would be secure. But it was not, for reasons explained in this paper.

Littlewood was a brilliant mathematician, but he was naive, and even arrogant, about cryptography. Here’s the opening to the paper in which he explained his method.

The legend that every cipher is breakable is of course absurd, though still widespread among people who should know better. I give a sufficient example …

He seems to be saying “Here’s a little example off the top of my head that shows how easy it is to create an unbreakable cipher.” He was the one who should have known better.


Breach Safe Harbor

In the context of medical data, Safe Harbor typically refers to the Safe Harbor provisions of the HIPAA Privacy Rule. Breach Safe Harbor is a little different. It basically means you’re off the hook if breached health data were encrypted. (But not necessarily. More on that below.)

I’m not a lawyer, so this isn’t legal advice. Even HHS, which coined the term “Breach Safe Harbor” in its guidance portal, weasels out of giving legal guidance by saying “The contents of this database lack the force and effect of law, except as authorized by law …”

Quality of encryption

You can’t just say that data were encrypted before they were breached. Weak encryption won’t cut it. You have to use acceptable algorithms and procedures.

How can you know whether you’ve encrypted data well enough to be covered by Breach Safe Harbor? HHS cites four NIST publications for further guidance. (Not that I’m giving legal advice. I’m merely citing HHS, which is also not giving legal advice.)

Here are the four publications: NIST SP 800-111 for data at rest, and NIST SP 800-52, SP 800-77, and SP 800-113 for data in motion.

Maybe encryption isn’t enough

At one point Tennessee law said a breach of encrypted data was still a breach. According to Dempsey and Carlin [1]

In 2016, Tennessee repealed its encryption safe harbor, requiring notice of breach of even encrypted data, but then in 2017, after criticism, the state restored a safe harbor for “information that has been encrypted in accordance with the current version of the Federal Information Processing Standard (FIPS) 140-2 if the encryption key has not been acquired by an unauthorized person.”

This is interesting for a couple of reasons. First, there is precedent for requiring notification of breaches of encrypted data. Second, this underscores the point above that encryption in general is not sufficient to avoid having to give notice of a breach: standard-compliant encryption is sufficient.

Consulting help

If you would like technical or statistical advice on how to prevent or prepare for a data breach, or how to respond to a breach after the fact, we can help.

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition.

MD5 hash collision example

Marc Stevens gave an example of two alphanumeric strings that differ in only one byte that have the same MD5 hash value. It may seem like beating a dead horse to demonstrate weaknesses in MD5, but it’s instructive to study the flaws of broken methods. And despite the fact that MD5 has been broken for years, lawyers still use it.

The example claims that

TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

and

TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

have the same hash value.

This raises several questions.

Are these two strings really different, and if so, how do they differ? If you stare at the strings long enough you can see that they do indeed differ by one character. But how could you compare long strings like this in a more automated way?

How could you compute the MD5 hash values of the strings to verify that they are the same?

The following Python code addresses the questions above.

from hashlib import md5
from difflib import ndiff

def showdiff(a, b):
    # Report character-level differences between two strings.
    for i, s in enumerate(ndiff(a, b)):
        if s[0] == ' ':
            continue
        elif s[0] == '-':
            print(f'Delete "{s[-1]}" from position {i}')
        elif s[0] == '+':
            print(f'Add "{s[-1]}" to position {i}')

a = "TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"
b = "TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"

showdiff(a, b)

ahash = md5(a.encode('utf-8')).hexdigest()
bhash = md5(b.encode('utf-8')).hexdigest()
assert ahash == bhash

The showdiff function is based on an answer to a question on Stack Overflow.

The output of the call to showdiff is as follows.

Delete "A" from position 21
Add "E" to position 22

This means we can form string b from a by changing the ‘A’ in position 21 to an ‘E’. There was only one difference between the two strings in this example, but showdiff could be useful for understanding more complex differences.

The assert statement passes because both strings hash to faad49866e9498fc1719f5289e7a0269.
