Prefix code examples

In many offices, you can dial a three digit number to reach someone else in the office. In such offices, you usually have to dial 9 to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows whether you’ve dialed an in-office extension or the first three digits of an outside call.

This is an example of a prefix code: No valid phone number is a prefix of another valid phone number. We’ll look at a few more examples of prefix codes in the context of phone numbers, then look at Roman numerals, Morse code, Unicode encoding, data compression, and RPN calculators.

It used to be true in the US that you could dial four or five digits for a local call, seven digits for a call within the area code, and ten digits for a long distance phone call. This didn’t cause any ambiguity because no local number would begin with the digits of your area code. You had to dial a 1 before dialing a long distance number, and no local or area code numbers begin with 1.

There are still parts of the US where you can dial either a seven digit number or a ten digit number. In most of the US you always enter 10 digits. This is a trivial form of prefix coding because all fixed-length codes are prefix codes.

A final example of prefix codes related to telephony is the set of country calling codes. These codes have varying lengths, but phone exchanges can know when the country code stops and when the number within a country starts because no prefix of a country code is a valid country code.

Prefix codes are sometimes called self-punctuating codes. This is because you don’t need an additional symbol, a form of punctuation, to mark the end of codes.

We usually think of Morse code as a system of two symbols: dots and dashes. But it’s really a system of four symbols: dots, dashes, short space between letters, and longer space between words. As a system of only dots and dashes, Morse code is not self-punctuating. Without spaces, you couldn’t tell, for example, whether dot dash represented an A (dot dash), or an E (dot) followed by a T (dash). It’s possible to design a prefix code with just two symbols, but that’s not what Morse did.
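The prefix property is easy to test mechanically. The following is a minimal sketch; the function name `is_prefix_free` and the sample dialing plan are made up for illustration.

```python
from itertools import permutations


def is_prefix_free(codewords):
    """Return True if no codeword is a proper prefix of another."""
    return not any(b.startswith(a) for a, b in permutations(set(codewords), 2))


# A small fragment of Morse code, written with dots and dashes only.
morse_fragment = {"E": ".", "T": "-", "A": ".-"}
print(is_prefix_free(morse_fragment.values()))  # False: "." is a prefix of ".-"

# Hypothetical office dialing plan: 9 for an outside line, otherwise
# three-digit extensions that never start with 9.
dial_plan = ["9", "123", "456", "800"]
print(is_prefix_free(dial_plan))  # True
```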

Q codes are not part of Morse code per se but are often used in Morse code. These are three letter codes beginning with Q. For example, QRP is an abbreviation for the question “Should I reduce power?”. Q codes are prefix codes because they have fixed length. If QR were a valid Q code by itself, that would ruin the prefix property; a recipient would not know whether to interpret QR as a complete code until listening for the next letter.

The letter components of Roman numerals are not a prefix code because you can’t tell whether a letter stands for a positive or negative amount until you read the next letter. For example, in CLX the X represents 10 but in CXL the X represents -10. If you wrote Roman numerals backward, the letters would form a prefix code.
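Here is a small sketch of what reading a Roman numeral backward looks like in code. Scanning from the right, the sign of each letter is settled by letters already read, so no look-ahead is needed. The function name is my own, and the code assumes a well-formed numeral.

```python
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Evaluate a Roman numeral by scanning its letters from right to left."""
    total = 0
    largest_seen = 0
    for letter in reversed(numeral):
        value = VALUES[letter]
        if value < largest_seen:
            total -= value          # smaller than a letter to its right: subtract
        else:
            total += value
            largest_seen = value
    return total


print(roman_to_int("CLX"))  # 160, the X counts as +10
print(roman_to_int("CXL"))  # 140, the X counts as -10
```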

In the previous post, I discussed UTF-16 encoding of Unicode. The way UTF-16 encodes characters outside the Basic Multilingual Plane makes it a prefix code; the meaning of a surrogate doesn’t depend on any down-stream information. UTF-8 is also a prefix code, which I discuss in detail here.

You may have seen Huffman coding, a form of data compression that uses a prefix code.
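For readers who have not seen it, here is a compact sketch of Huffman’s construction using a heap. It is one common way to write the algorithm, not a reference implementation, and the sample string is arbitrary.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a binary prefix code from the symbol frequencies in text."""
    # Heap entries: (weight, tie_breaker, {symbol: codeword so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol input
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)    # the two lightest subtrees
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


code = huffman_code("abracadabra")
print(code)  # the most frequent symbol, "a", gets a one-bit codeword
```

No codeword in the result is a prefix of another, so a stream of bits can be decoded without separators.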

Reverse Polish Notation is an example of prefix coding. An RPN calculator doesn’t need parentheses for punctuation. You can enter calculations unambiguously with just digits and arithmetic operators because the meaning of a computation does not depend on any future input. Prefix codes are sometimes called instantaneous codes because of this feature.
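A minimal stack-based evaluator makes the point concrete: each token is acted on as soon as it arrives. This is a sketch with four operators and no error handling.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def eval_rpn(tokens):
    """Evaluate a sequence of RPN tokens, acting on each token as it arrives."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(float(tok))
    return stack.pop()


# (3 + 4) * 5 needs parentheses in infix notation but not in RPN.
print(eval_rpn("3 4 + 5 *".split()))  # 35.0
# 3 + 4 * 5
print(eval_rpn("3 4 5 * +".split()))  # 23.0
```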

Error-correcting codes

Error-correcting codes make digital communication possible. Without some way to detect and correct errors, the corruption of a single bit could wreak havoc. A simple example of an error-detecting code is a check sum. A more sophisticated example would be erasure codes, a method used by data centers to protect customer data against hard drive failures or even entire data centers going offline.

People who work in coding theory are quick to point out that they do not work in cryptography. “No, not that kind of code. Error-correcting codes, not secret codes.” The goal isn’t secrecy. The goal is to maximize the probability of correctly transmitting data while minimizing the amount of extra information added.

Codes and ciphers

You don’t hear the word “code” used in connection with cryptography much anymore. People used to refer to “codes and ciphers” in one breath. Historically, the technical distinction was that a code operated on words, while a cipher operated on characters. Codes in this sense have long been obsolete, but people still speak of codes colloquially.

David Kahn’s classic book on pre-modern cryptography is entitled The Codebreakers, not the Cipherbreakers, because the public at the time was more familiar with the term code than the term cipher. Maybe that’s still the case because, for example, Jason Fagone entitled his biography of Elizebeth Friedman The Woman Who Smashed Codes. Perhaps the author suggested The Woman Who Smashed Ciphers and an editor objected.

Code-based cryptography

If you’re accustomed to the older use of “codes,” the term “code-based cryptography” is redundant. But it means something specific in modern usage: cryptographic systems that incorporate error-correction codes. So error-correcting codes and secret “codes” do have something to do with each other after all!

Robert McEliece had this idea back in 1978. His encryption method starts with a particular error-correcting code, a binary Goppa code, and scrambles it with an invertible linear transformation. At a very high level, McEliece’s method boils down to a secret factorization, sorta like RSA but even more like oil and vinegar. The public key is the product of the Goppa code and the linear transformation, but only the owner knows the factorization of this key.

To encrypt a message with McEliece’s method, the sender adds a specific amount of random noise, noise that the Goppa code can remove. An attacker faces a challenging computational problem to recover the message without knowing how to factor the public key.

Post-quantum cryptography

McEliece’s method did not attract much interest at the time because it requires much larger public keys than other methods, such as RSA. However, there is renewed interest in McEliece’s approach because his scheme is apparently quantum-resistant whereas RSA and other popular public key systems are not.

If and when large quantum computers become practical, they could factor the product of large primes efficiently, and thus break RSA. They could also solve the discrete logarithm and elliptic discrete logarithm problems, breaking Diffie-Hellman and elliptic curve cryptosystems. All public key cryptosystems now in common use would be broken.

Why worry about this now while quantum computers don’t exist? (They exist, but only as prototypes. So far the largest number a quantum computer has been able to factor is 21.) The reason is that it takes a long time to develop, analyze, standardize, and deploy encryption methods. There’s also the matter of forward security: someone could store encrypted messages with the hope of decrypting them in the future. This doesn’t matter for cat photos transmitted over TLS, but it could matter for state secrets; governments may be encrypting documents that they wish to keep secret for decades.

NIST is sponsoring a competition to develop and standardize quantum-resistant encryption methods. Two months ago NIST announced the candidates that advanced to the second round. Seven of these methods use code-based cryptography, including the classic McEliece method and six variations: BIKE, HQC, LEDAcrypt, NTS-KEM, ROLLO, and RQC.

Base85 encoding

I wrote a while back about Base32 and Base64 encoding, and yesterday I wrote about Bitcoin’s Base58 encoding. For completeness I wanted to mention Base85 encoding, also known as Ascii85. Adobe uses it in PostScript and PDF files, and git uses it for encoding patches.

Like Base64, the goal of Base85 encoding is to encode binary data as printable ASCII characters. But it uses a larger set of characters, and so it can be a little more efficient. Specifically, it can encode 4 bytes (32 bits) in 5 characters.

Why 85?

There are 95 printable ASCII characters, and

$\log_{95}(2^{32}) = 4.87$

and so it would take 5 characters to encode 4 bytes if you use all possible printable ASCII characters. Given that you have to use 5 characters, what’s the smallest base that will still work? It’s 85 because

$\log_{85}(2^{32}) = 4.993$

and

$\log_{84}(2^{32}) = 5.006.$

(If you’re not comfortable with logarithms, see an alternate explanation in the footnote [1].)
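The arithmetic above is easy to check numerically. The sketch below computes the three logarithms and then searches for the smallest base whose fifth power covers all 32-bit values.

```python
import math

bits = 32
print(round(math.log(2**bits, 95), 3))  # 4.871
print(round(math.log(2**bits, 85), 3))  # 4.993
print(round(math.log(2**bits, 84), 3))  # 5.006

# Smallest base b such that five base-b digits can hold any 32-bit word.
smallest_base = next(b for b in range(2, 96) if b**5 >= 2**bits)
print(smallest_base)  # 85
```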

Now Base85 is different from the other bases I’ve written about because it only works on 4 bytes at a time. That is, if you have a number larger than 4 bytes, you break it into words of 4 bytes and convert each word to Base 85.

Character set

The 95 printable ASCII characters are 32 through 126. Base 85 uses characters 33 (“!”) through 117 (‘u’). ASCII character 32 is a space, so it makes sense you’d want to avoid that one. Since Base85 uses a consecutive range of characters, you can first convert a number to a pure mathematical radix 85 form, then add 33 to each number to find its Base85 character.


Suppose we start with the word 0x89255d9, equal to 143807961 in decimal.

$143807961 = 2 \times 85^4 + 64 \times 85^3 + 14 \times 85^2 + 18 \times 85 + 31$

and so the radix 85 representation is (2, 64, 14, 18, 31). Adding 33 to each we find that the ASCII values of the characters in the Base85 representation are (35, 97, 47, 51, 64), or (‘#’, ‘a’, ‘/’, ‘3’, ‘@’) and so #a/3@ is the Base85 encoding of 0x89255d9.
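The same calculation in Python, as a sketch: repeated division by 85 gives the radix 85 digits, and adding 33 gives the characters. The helper name `base85_word` is mine; the last line cross-checks against `base64.a85encode` from the standard library.

```python
import base64


def base85_word(word):
    """Encode one 32-bit word as five Ascii85 characters (no 'z' shortcut)."""
    digits = []
    for _ in range(5):
        word, d = divmod(word, 85)
        digits.append(d)
    digits.reverse()                      # most significant digit first
    return digits, "".join(chr(d + 33) for d in digits)


digits, text = base85_word(0x89255D9)
print(digits)  # [2, 64, 14, 18, 31]
print(text)    # #a/3@

# The standard library takes the word as 4 big-endian bytes.
print(base64.a85encode((0x89255D9).to_bytes(4, "big")))  # b'#a/3@'
```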


The Z85 encoding method is also based on a radix 85 representation, but it chose to use a different subset of the 95 printable characters. Compared to Base85, Z85 adds seven characters

    v w x y z { }

and removes seven characters

    ` \ " ' _ , ;

to make the encoding work more easily with programming languages. For example, you can quote Z85 strings with single or double quotes because neither kind of quote is a valid Z85 character. And you don’t have to worry about escape sequences since the backslash character is not part of a Z85 representation.
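The difference between the two alphabets can be confirmed with set arithmetic. Note that the Z85 alphabet string below is an assumption on my part, written out from the ZeroMQ Z85 specification as I understand it; check it against the specification before relying on it.

```python
import string

# Base85 (Ascii85) uses the consecutive ASCII range 33 through 117.
ascii85 = {chr(c) for c in range(33, 118)}

# Assumed Z85 alphabet: digits, lower case, upper case, then punctuation.
z85 = set(string.digits + string.ascii_lowercase + string.ascii_uppercase
          + ".-:+=^!/*?&<>()[]{}@%$#")

print(len(ascii85), len(z85))          # 85 85
print("".join(sorted(z85 - ascii85)))  # vwxyz{}
print(sorted(ascii85 - z85))           # the seven characters Z85 leaves out
```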


There are a couple things that could trip someone up with Base85. First of all, Base 85 only works on 32-bit words, as noted above. For larger numbers it’s not a base conversion in the usual mathematical sense.

Second, the letter z can be used to denote a word consisting of all zeros. Since such words come up disproportionately often, this is a handy shortcut, though it means you can’t just divide characters into groups of 5 when converting back to binary.
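Python’s `base64.a85encode` applies this shortcut, which gives a quick way to see the effect. The sketch reuses the example word from earlier in the post.

```python
import base64

zeros = bytes(4)                          # one all-zero 32-bit word
word = (0x89255D9).to_bytes(4, "big")

print(base64.a85encode(zeros))            # b'z'
print(base64.a85encode(zeros + word))     # b'z#a/3@'

# Six characters decode to eight bytes, so groups of 5 would not line up.
print(base64.a85decode(b"z#a/3@") == zeros + word)  # True
```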

[1] $95^4 = 81450625 < 2^{32} = 4294967296$, so four characters from an alphabet of 95 elements is not enough to represent $2^{32}$ possibilities. So we need at least five characters.

$85^5 = 4437053125 > 2^{32}$, so five characters is enough, and in fact it’s enough for them to come from an alphabet of size 85. But $84^5 = 4182119424 < 2^{32}$, so an alphabet of 84 characters isn’t enough to represent 32 bits with five characters.

Check sums and error detection

The previous post looked at Crockford’s base 32 encoding, a minor variation on the way math conventionally represents base 32 numbers, with concessions for human use. By not using the letter O, for example, it avoids confusion with the digit 0.

Crockford recommends the following check sum procedure, a simple error detection code:

The check symbol encodes the number modulo 37, 37 being the least prime number greater than 32.

That is, we take the remainder when the base 32 number is divided by 37 and append the result to the original encoded number. The remainder could be larger than 31, so we need to expand our alphabet of symbols. Crockford recommends using the symbols *, ~, $, =, and U to represent remainders of 32, 33, 34, 35, and 36.
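The procedure is short enough to write out with only the standard library. This is a sketch, not Crockford’s reference code: the function names are mine, and it assumes upper-case input with no hyphens.

```python
DIGITS = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # the 32 data symbols
CHECK = DIGITS + "*~$=U"                      # 37 possible check symbols


def decode32(s):
    """Convert a Crockford base 32 string to an integer."""
    n = 0
    for ch in s:
        n = 32 * n + DIGITS.index(ch)
    return n


def check_symbol(s):
    """Return the check symbol: the encoded number modulo 37."""
    return CHECK[decode32(s) % 37]


print(check_symbol("H88CMK9BVJ1V"))  # P
print(check_symbol("H88CMK9BVJ10"))  # *
```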

Crockford says his check sum will “detect wrong-symbol and transposed-symbol errors.” We will show that this is the case in the proof below.

Python example

Here’s a little Python code to demonstrate how the checksum works.

    from base32_crockford import encode, decode

    s = "H88CMK9BVJ1V"
    w = "H88CMK9BVJ1W" # wrong last char
    t = "H88CMK9BVJV1" # transposed last chars

    def append_checksum(s):
        return encode(decode(s), checksum=True)

    print(append_checksum(s))
    print(append_checksum(w))
    print(append_checksum(t))

This produces the following output.

    H88CMK9BVJ1VP
    H88CMK9BVJ1WQ
    H88CMK9BVJV1E

The checksum character of the original string is P. When the last character is changed, the checksum changes from P to Q. Similarly, transposing the last two characters changes the checksum from P to E.

The following code illustrates that the check sum can be a non-alphabetic character.

    s = "H88CMK9BVJ10"
    n = decode(s)
    r = n % 37
    print(r)
    print(encode(n, checksum=True))

This produces

    32
    H88CMK9BVJ10*

As we said above, a remainder of 32 is represented by *.


If you change one character in a base 32 number, its remainder by 37 will change as well, and so the check sum changes.

Specifically, if you change the nth digit from the right, counting from 0, by an amount k, then you change the number by $k \cdot 32^n$. Since $0 < |k| < 32$, k is not divisible by 37, nor is $32^n$. Because 37 is prime, $k \cdot 32^n$ is not divisible by 37 [1]. The same argument holds if we replace 37 by any larger prime.

Now what about transpositions? If you swap consecutive digits a and b in a number, you also change the remainder by 37 (or any larger prime) and hence the check sum.

Again, let’s be specific. Suppose we transpose the nth and (n + 1)st digits from the right, again counting from 0. Denote these digits by a and b respectively. Then swapping these two digits changes the number by an amount

$(b \cdot 32^{n+1} + a \cdot 32^n) - (a \cdot 32^{n+1} + b \cdot 32^n) = 31 (b - a) \cdot 32^n$

If a ≠ b, then b − a is a number between −31 and 31, but not 0, and so b − a is not divisible by 37. Neither 31 nor any power of 32 is divisible by 37, so we’ve changed the remainder by 37, i.e. changed the check sum. And as before, the same argument works for any prime larger than 37.
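A brute-force check is a useful complement to the proof. The sketch below tries every wrong-symbol error and every adjacent transposition on all two-digit base 32 numbers and counts the cases where the remainder mod 37 fails to change. The function names are mine.

```python
from itertools import product

P = 37


def value(digits):
    """Interpret a tuple of base 32 digits, most significant first."""
    n = 0
    for d in digits:
        n = 32 * n + d
    return n


def undetected_errors(length):
    """Count single-digit changes and adjacent swaps that leave n mod P unchanged."""
    missed = 0
    for digits in product(range(32), repeat=length):
        r = value(digits) % P
        for i in range(length):
            # Every possible wrong symbol in position i.
            for d in range(32):
                if d != digits[i]:
                    changed = digits[:i] + (d,) + digits[i + 1:]
                    if value(changed) % P == r:
                        missed += 1
            # Swap positions i and i + 1 when the two digits differ.
            if i + 1 < length and digits[i] != digits[i + 1]:
                swapped = digits[:i] + (digits[i + 1], digits[i]) + digits[i + 2:]
                if value(swapped) % P == r:
                    missed += 1
    return missed


print(undetected_errors(2))  # 0
```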

[1] A prime p divides a product ab only if it divides a or it divides b. This isn’t true for composite numbers. For example, 6 divides 4*9 = 36, but 6 doesn’t divide 4 or 9.

Base 32 and base 64 encoding

Math has a conventional way to represent numbers in bases larger than 10, and software development has a couple variations on this theme that are only incidentally mathematical.

Math convention

By convention, math books typically represent numbers in bases larger than 10 by using letters as new digit symbols following 9. For example, base 16 would use 0, 1, 2, …, 9, A, B, C, D, E, and F as its “digits.” This works for bases up to 36; base 36 would use all the letters of the alphabet. There’s no firm convention for whether to use upper or lower case letters.

Base 64 encoding

The common use for base 64 encoding isn’t to represent bits as numbers per se, but to have an efficient way to transmit bits in a context that requires text characters.

There are around 100 possible characters on a keyboard, and 64 is the largest power of 2 less than 100 [1], and so base 64 is the most dense encoding using common characters in a base that is a power of 2.

Base 64 encoding does not follow the math convention of using the digits first and then adding more symbols; it’s free not to because there’s no intention of treating the output as numbers. Instead, the capital letters A through Z represent the numbers 0 through 25, the lower case letters a through z represent the numbers 26 through 51, and the digits 0 through 9 represent the numbers 52 through 61. The symbol + is used for 62 and / is used for 63.
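To see the alphabet in action, here is a sketch that builds it from the description above and encodes three bytes by hand, then compares with `base64.b64encode`. The sample input `b"Cat"` is arbitrary.

```python
import base64
import string

alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
print(len(alphabet))  # 64

# Three bytes are 24 bits, which split into four 6-bit values.
n = int.from_bytes(b"Cat", "big")
by_hand = "".join(alphabet[(n >> shift) & 63] for shift in (18, 12, 6, 0))

print(by_hand)                   # Q2F0
print(base64.b64encode(b"Cat"))  # b'Q2F0'
```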

Crockford’s base 32 encoding

Douglas Crockford proposed an interesting form of base 32 encoding. His encoding mostly follows the math convention: 0, 1, 2, …, 9, A, B, …, except he does not use the letters I, L, O, and U. This eliminates the possibility of confusing i, I, or l with 1, or confusing O with 0. Crockford had one more letter he could eliminate, and he chose U in order to avoid an “accidental obscenity.” [2]

Crockford’s base 32 encoding is a compromise between efficiency and human legibility. It is more efficient than hexadecimal, representing 25% more bits per character. It’s less efficient than base 64, representing 17% fewer bits per character, but is more legible than base 64 encoding because it eliminates commonly confused characters.
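The percentages come from comparing bits per character, which is the base 2 logarithm of the alphabet size. A quick check:

```python
import math

bits = {base: math.log2(base) for base in (16, 32, 64)}
print(bits)                      # {16: 4.0, 32: 5.0, 64: 6.0}

print(bits[32] / bits[16] - 1)   # 0.25, i.e. 25% more than hexadecimal
print(1 - bits[32] / bits[64])   # about 0.167, i.e. 17% fewer than base 64
```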

His encoding is also case insensitive. He recommends using only capital letters for output, but permitting upper or lower case letters in input. This is in the spirit of Postel’s law, also known as the robustness principle:

Be conservative in what you send, and liberal in what you accept.

See the next post for an explanation of Crockford’s check sum proposal.

A password generator

Here’s a Python script to generate passwords using Crockford’s base 32 encoding.

    from secrets import randbits
    from base32_crockford import encode

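    def gen_pwd(numbits):
        # Encode numbits random bits from a cryptographically secure source.
        return encode(randbits(numbits))

For example, gen_pwd(60) would create a 12-character password (occasionally shorter, when the leading bits happen to be zero) with 60 bits of entropy, and this password would be free of commonly confused characters.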

[1] We want to use powers of 2 because it’s easy to convert between base 2 and base $2^n$: start at the right end and convert bits in groups of n. For example, to convert a binary string to hexadecimal (base $2^4 = 16$), convert groups of four bits each to hexadecimal. So to convert the binary number 101111001 to hex, we break it into 1 0111 1001 and convert each piece to hex, with 1 -> 1, 0111 -> 7, and 1001 -> 9, to find 101111001 -> 179. If we used a base that is not a power of 2, the conversion would be more complicated and not so localized.

[2] All the words on George Carlin’s infamous list include either an I or a U, and so none can result from Crockford’s base 32 encoding. If one were willing to risk accidental obscenities, it would be good to put U back in and remove S since the latter resembles 5, particularly in some fonts.
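The grouping trick in footnote [1] can be sketched in a few lines. The function name is mine; the second print cross-checks the result by converting through an integer.

```python
def bin_to_hex(bits):
    """Convert a binary string to hex by working in groups of four bits."""
    pad = -len(bits) % 4
    bits = "0" * pad + bits                      # pad on the left to a multiple of 4
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join("0123456789ABCDEF"[int(g, 2)] for g in groups)


print(bin_to_hex("101111001"))           # 179
print(format(int("101111001", 2), "X"))  # 179
```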

How efficient is Morse code?


Morse code was designed so that the most frequently used letters have the shortest codes. In general, code length increases as frequency decreases.

How efficient is Morse code? We’ll compare letter frequencies based on Google’s research with the length of each code, and make the standard assumption that a dash is three times as long as a dot.

| Letter | Code | Length | Frequency |
| E      | .    |      1 |    12.49% |
| T      | -    |      3 |     9.28% |
| A      | .-   |      4 |     8.04% |
| O      | ---  |      9 |     7.64% |
| I      | ..   |      2 |     7.57% |
| N      | -.   |      4 |     7.23% |
| S      | ...  |      3 |     6.51% |
| R      | .-.  |      5 |     6.28% |
| H      | .... |      4 |     5.05% |
| L      | .-.. |      6 |     4.07% |
| D      | -..  |      5 |     3.82% |
| C      | -.-. |      8 |     3.34% |
| U      | ..-  |      5 |     2.73% |
| M      | --   |      6 |     2.51% |
| F      | ..-. |      6 |     2.40% |
| P      | .--. |      8 |     2.14% |
| G      | --.  |      7 |     1.87% |
| W      | .--  |      7 |     1.68% |
| Y      | -.-- |     10 |     1.66% |
| B      | -... |      6 |     1.48% |
| V      | ...- |      6 |     1.05% |
| K      | -.-  |      7 |     0.54% |
| X      | -..- |      8 |     0.23% |
| J      | .--- |     10 |     0.16% |
| Q      | --.- |     10 |     0.12% |
| Z      | --.. |      8 |     0.09% |

There’s room for improvement. Assigning the letter O such a long code, for example, was clearly not optimal.

But how much difference does it make? If we were to rearrange the codes so that they corresponded to letter frequency, how much shorter would a typical text transmission be?

Multiplying the code lengths by their frequency, we find that an average letter, weighted by frequency, has code length 4.5268.

What if we rearranged the codes? Then we would get 4.1257 which would be about 9% more efficient. To put it another way, Morse code achieved 91% of the efficiency that it could have achieved with the same codes. This is relative to Google’s English corpus. A different corpus would give slightly different results.

Toward the bottom of the table above, letter frequencies correspond poorly to code lengths, though this hardly matters for efficiency. But some of the choices near the top of the table are puzzling. The relative frequency of the first few letters has remained stable over time and was well known long before Google. (See ETAOIN SHRDLU.) Maybe there were factors other than efficiency that influenced how the most frequently used characters were encoded.

Update: Some sources I looked at said that a dash is three times as long as a dot, including the space between dots or dashes. Others said there is a pause as long as a dot between elements. The latter is the official standard of the International Telecommunication Union.

If you use the official timing, it takes an average time equal to 6.0554 dots to transmit an English letter, and this could be improved to 5.6616. By that measure Morse code is about 93.5% efficient. (I only added time for space inside the code for a letter because the space between letters is the same no matter how they are coded.)
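The averages can be recomputed from the table. The sketch below stores frequencies as integers (hundredths of a percent) to avoid rounding noise, and takes a `gap` parameter: 0 for the first timing convention and 1 for the official timing with a one-dot pause between elements. The best rearrangement pairs the shortest codes with the most frequent letters. The function names are mine.

```python
# Code and frequency (hundredths of a percent) for each letter, from the table.
MORSE = {
    "E": (".", 1249), "T": ("-", 928), "A": (".-", 804), "O": ("---", 764),
    "I": ("..", 757), "N": ("-.", 723), "S": ("...", 651), "R": (".-.", 628),
    "H": ("....", 505), "L": (".-..", 407), "D": ("-..", 382), "C": ("-.-.", 334),
    "U": ("..-", 273), "M": ("--", 251), "F": ("..-.", 240), "P": (".--.", 214),
    "G": ("--.", 187), "W": (".--", 168), "Y": ("-.--", 166), "B": ("-...", 148),
    "V": ("...-", 105), "K": ("-.-", 54), "X": ("-..-", 23), "J": (".---", 16),
    "Q": ("--.-", 12), "Z": ("--..", 9),
}


def duration(code, gap):
    """Length in dot units: dot = 1, dash = 3, plus `gap` between elements."""
    return sum(1 if c == "." else 3 for c in code) + gap * (len(code) - 1)


def averages(gap):
    """Return (actual, best possible) frequency-weighted code length."""
    lengths = [duration(code, gap) for code, _ in MORSE.values()]
    freqs = [f for _, f in MORSE.values()]
    actual = sum(n * f for n, f in zip(lengths, freqs))
    # Best rearrangement: shortest codes go to the most frequent letters.
    best = sum(n * f for n, f in zip(sorted(lengths), sorted(freqs, reverse=True)))
    return actual / 10_000, best / 10_000


for gap in (0, 1):
    actual, best = averages(gap)
    print(gap, actual, best, round(best / actual, 3))
```

The ratio in the last column comes out at about 0.911 with no gaps and about 0.935 with the official timing.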

Erasure coding white paper

Last year I worked with Hitachi Data Systems to evaluate the trade-offs of replication and erasure coding as ways to increase data storage reliability while minimizing costs. This led to a white paper that has just been published:

Compare Cost and Performance of Replication and Erasure Coding
Hitachi Review Vol. 63 (July 2014)
John D. Cook
Robert Primmer
Ab de Kwant