In many offices, you can dial a three digit number to reach someone else in the office. In such offices, you usually have to dial 9 to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows whether you’ve dialed an in-office extension or the first three digits of an outside call.
This is an example of a prefix code: No valid phone number is a prefix of another valid phone number. We’ll look at a few more examples of prefix codes in the context of phone numbers, then look at Roman numerals, Morse code, Unicode encoding, data compression, and RPN calculators.
It used to be true in the US that you could dial four or five digits for a local call, seven digits for a call within the area code, and ten digits for a long distance phone call. This didn’t cause any ambiguity because no local number would begin with the digits of your area code. You had to dial a 1 before dialing a long distance number, and no local or area code numbers begin with 1.
There are still parts of the US where you can dial either a seven digit number or a ten digit number. In most of the US you always enter 10 digits. This is a trivial form of prefix coding because all fixed-length codes are prefix codes.
A final example of prefix codes related to telephony are country calling codes. These codes have varying lengths, but phone exchanges can know when the country code stops and when the number within a country starts because no prefix of a country code is a valid country code.
Prefix codes are sometimes called self-punctuating codes. This is because you don’t need an additional symbol, a form of punctuation, to mark the end of codes.
We usually think of Morse code as a system of two symbols: dots and dashes. But it’s really a system of four symbols: dots, dashes, short space between letters, and longer space between words. As a system of only dots and dashes, Morse code is not self-punctuating. Without spaces, you couldn’t tell, for example, whether dot dash represented an A (dot dash) , or an E (dot) followed by a T (dash). It’s possible to design a prefix code with just two symbols, but that’s not what Morse did.
Q codes are not part of Morse code per se but are often used in Morse code. These are three letter codes beginning with Q. For example, QRP is an abbreviation for the question “Should I reduce power?”. Q codes are prefix codes because they have fixed length. If QR were a valid Q code by itself, that would ruin the prefix property; a recipient would not know whether to interpret QR as a complete code until listening for the next letter.
The letter components of Roman numerals are not a prefix code because you can’t tell whether a letter stands for a positive or negative amount until you read the next letter. For example, in CLX the X represents 10 but in CXL the X represents -10. If you wrote Roman numerals backward, the letters would form a prefix code.
In the previous post, I discussed UTF-16 encoding of Unicode. The way UTF-16 encodes characters outside the Basic Multilingual Plane makes it a prefix code; the meaning of a surrogate doesn’t depend on any down-stream information. UTF-8 is also a prefix code, which I discuss in detail here.
You may have seen Huffman coding, a form of data compression that uses a prefix code.
Reverse Polish Notation is an example of prefix coding. An RPN calculator doesn’t need parentheses for punctuation. You can enter calculations unambiguously with just digits and arithmetic operators because the meaning of a computation does not depend on any future input. Prefix codes are sometimes called instantaneous codes because of this feature.
Interesting post. I really like the real world examples of prefix code.
is *dits* intentional short for digits? Or is this a slip of the keyboard
Nope. Typo on my part. Thanks.
Decades ago a friend published a book, leading to a discussion on ISBN encodings. ISBNs have four fields, the first is a single digit (I forget the purpose), the second is variable length and is a number representing the publisher, the third, also variable, is for the book from that publisher, and the last is a single checksum digit. This is convenient, as large publishers get a number with fewer digits representing them and thus have more digits to use for the large numbers of books they publish. Small publishers will have a large series of digits for their publisher number, and fewer digits for the fewer books they publish.
I was confused because ISBNs are often listed as a single 10-digit (at that time) number with no separation between fields, yet there was allegedly a way of discerning the variable-length fields of the publisher digits and the book digits. I since discovered the encoding (it’s on Wikipedia as well as elsewhere), and it’s ‘complicated’ but still easy enough to do by hand. The number of digits in the publisher field is determined by what the first digit(s) in the field are, and there’s a lookup table for the length vs. the digits. Even the first “digit” (always 0 or 1 for English titles) is a variable-length (one or two digits) field encoded in this way.
Before visiting family in Hawaii I knew the area code, 808. While visiting Maui, I was delighted to learn the NXX prefix for my in-law’s town of Haiku is 575.
That’s fantastic. I assume that wasn’t an accident, that some anonymous person thought it would be cool to give Haiku the code 575.