I wrote a while back about Base32 and Base64 encoding, and yesterday I wrote about Bitcoin’s Base58 encoding. For completeness I wanted to mention Base85 encoding, also known as Ascii85. Adobe uses it in PostScript and PDF files, and git uses it for encoding patches.
Like Base64, the goal of Base85 encoding is to encode binary data as printable ASCII characters. But it uses a larger set of characters, and so it can be a little more efficient. Specifically, it can encode 4 bytes (32 bits) in 5 characters.
Why 85?
There are 95 printable ASCII characters, and
$\log_{95}(2^{32}) = 4.87$
and so it would take 5 characters to encode 4 bytes if you use all possible printable ASCII characters. Given that you have to use 5 characters, what’s the smallest base that will still work? It’s 85 because
$\log_{85}(2^{32}) = 4.993$
and
$\log_{84}(2^{32}) = 5.006$.
(If you’re not comfortable with logarithms, see an alternate explanation in the footnote [1].)
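The same bound can be checked with exact integer arithmetic rather than floating point logarithms. Here is a minimal Python sketch; the function name `smallest_base` is made up for this illustration.

```python
import math

def smallest_base(bits: int, length: int) -> int:
    """Smallest base b such that `length` base-b digits can represent 2**bits values."""
    b = 2
    while b ** length < 2 ** bits:
        b += 1
    return b

print(smallest_base(32, 5))    # 85
print(math.log(2 ** 32, 95))   # characters needed if all 95 printable characters were used
```

Searching upward from 2 is crude, but it avoids any rounding questions near the boundary between 84 and 85.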
Now Base85 is different from the other bases I’ve written about because it only works on 4 bytes at a time. That is, if you have a number larger than 4 bytes, you break it into words of 4 bytes and convert each word to Base85.
Character set
The 95 printable ASCII characters are 32 through 126. Base85 uses characters 33 (‘!’) through 117 (‘u’). ASCII character 32 is a space, so it makes sense you’d want to avoid that one. Since Base85 uses a consecutive range of characters, you can first convert a number to a pure mathematical radix 85 form, then add 33 to each digit to find its Base85 character.
Example
Suppose we start with the word 0x89255d9, equal to 143807961 in decimal.
$143807961 = 2 \times 85^4 + 64 \times 85^3 + 14 \times 85^2 + 18 \times 85 + 31$
and so the radix 85 representation is (2, 64, 14, 18, 31). Adding 33 to each we find that the ASCII values of the characters in the Base85 representation are (35, 97, 47, 51, 64), or (‘#’, ‘a’, ‘/’, ‘3’, ‘@’) and so #a/3@
is the Base85 encoding of 0x89255d9.
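The calculation above can be sketched in a few lines of Python. The helper `encode_word` is a name I made up for illustration; it handles a single 32-bit word and ignores the ‘z’ shortcut for all-zero words. The result is cross-checked against `base64.a85encode` from the standard library, assuming the word is laid out as 4 big-endian bytes.

```python
import base64

def encode_word(word: int) -> str:
    """Encode one 32-bit word as five Base85 characters (no 'z' shortcut)."""
    if not 0 <= word < 2 ** 32:
        raise ValueError("word must fit in 32 bits")
    digits = []
    for _ in range(5):
        word, d = divmod(word, 85)
        digits.append(d)
    # Digits come out least significant first: reverse them, then add 33 to each.
    return "".join(chr(d + 33) for d in reversed(digits))

word = 0x89255d9
print(encode_word(word))  # #a/3@
# Cross-check against the standard library's Ascii85 encoder.
print(base64.a85encode(word.to_bytes(4, "big")).decode("ascii"))
```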
Z85
The Z85 encoding method is also based on a radix 85 representation, but it chose to use a different subset of the 95 printable characters. Compared to Base85, Z85 adds seven characters
v w x y z { }
and removes seven characters
` \ " ' _ , ;
to make the encoding work more easily with programming languages. For example, you can quote Z85 strings with single or double quotes because neither kind of quote is a valid Z85 character. And you don’t have to worry about escape sequences since the backslash character is not part of a Z85 representation.
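To see the difference between the two alphabets concretely, here is a small Python sketch. The Z85 alphabet string is written out as I understand it from the ZeroMQ Z85 specification, so treat it as an assumption to verify against the spec. The helper `z85_encode_word` is a made-up name. Unlike Base85, Z85 cannot just add an offset to each digit, because its characters are not a consecutive ASCII range, so the sketch uses a table lookup instead.

```python
import string

# Base85 (Ascii85) alphabet: ASCII 33 ('!') through 117 ('u').
ASCII85 = "".join(chr(c) for c in range(33, 118))

# Z85 alphabet (assumed from the ZeroMQ Z85 specification).
Z85 = (string.digits + string.ascii_lowercase + string.ascii_uppercase
       + ".-:+=^!/*?&<>()[]{}@%$#")

print(sorted(set(Z85) - set(ASCII85)))  # characters Z85 adds
print(sorted(set(ASCII85) - set(Z85)))  # characters Z85 removes

def z85_encode_word(word: int) -> str:
    """Encode one 32-bit word as five Z85 characters via table lookup."""
    if not 0 <= word < 2 ** 32:
        raise ValueError("word must fit in 32 bits")
    chars = []
    for _ in range(5):
        word, d = divmod(word, 85)
        chars.append(Z85[d])
    return "".join(reversed(chars))

print(z85_encode_word(0x864FD26F) + z85_encode_word(0xB559F75B))
```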
Gotchas
There are a couple things that could trip someone up with Base85. First of all, Base85 only works on 32-bit words, as noted above. For larger numbers it’s not a base conversion in the usual mathematical sense.
Second, the letter z can be used to denote a word consisting of all zeros. Since such words come up disproportionately often, this is a handy shortcut, though it means you can’t just divide characters into groups of 5 when converting back to binary.
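Both gotchas show up when decoding. The following Python sketch walks the text one group at a time, treating a lone ‘z’ as a complete all-zero word. The function `decode_ascii85` is a hypothetical helper, and it assumes the input is a whole number of words: it does not handle a short final group, embedded whitespace, or Adobe’s `<~ ~>` delimiters.

```python
import base64

def decode_ascii85(text: str) -> bytes:
    """Decode Base85 text made of full 5-character groups and 'z' shortcuts."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "z":
            # 'z' stands for a whole word of zeros and takes one character, not five.
            out += bytes(4)
            i += 1
            continue
        group = text[i:i + 5]
        if len(group) < 5:
            raise ValueError("incomplete group (padding is not handled in this sketch)")
        value = 0
        for ch in group:
            d = ord(ch) - 33
            if not 0 <= d < 85:
                raise ValueError(f"invalid Base85 character {ch!r}")
            value = value * 85 + d
        if value >= 2 ** 32:
            raise ValueError("group does not fit in 32 bits")
        out += value.to_bytes(4, "big")
        i += 5
    return bytes(out)

encoded = "#a/3@z#a/3@"
print(decode_ascii85(encoded).hex())
# The standard library decoder understands the same shortcut.
print(base64.a85decode(encoded.encode("ascii")).hex())
```

Note that the encoded string above is 11 characters long but decodes to 12 bytes, which is why you can’t simply split the text into groups of 5.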
[1] $95^4 = 81450625 < 2^{32} = 4294967296$, so four characters from an alphabet of 95 elements is not enough to represent $2^{32}$ possibilities. So we need at least five characters.
$85^5 = 4437053125 > 2^{32}$, so five characters is enough, and in fact it’s enough for them to come from an alphabet of size 85. But $84^5 = 4182119424 < 2^{32}$, so an alphabet of 84 characters isn’t enough to represent 32 bits with five characters.
It reminds me of a project I did where space was important. I encoded fixed-length part number strings in RAD50 (which ironically cannot encode 50 characters, just 40). It got us uppercase alphanumerics, space, and three additional characters (I made one of them “-” since that was used in these part numbers). https://en.wikipedia.org/wiki/DEC_Radix-50
“And you don’t have to worry about escape sequences since the backspace character is not part of a Z85 representation.”
Should be “backslash”, not “backspace”.
Nice piece!
Please, note that the link to Base58 encoding post is wrong.
Thanks. Fixed.
Thanks for this, i am gonna switch to Z85. See if you can decode this:
iQ)q:_FruPuU^I"J,g#Y#bh;;ucpn'*+=X'1)_m%^AIALQ1"98E%+92qfUp'h(h(E,+:)XjReuQfa^^p
Ns.t89"p_pP"]D=&24Y4?pG,f/Ge0an);"9\u5&/YO80F31O+FnumJ-H"h0EcNK#RCDAHiP=3[KP[jiP
E#*K<bXcj9c;$2?3^W&u)$FVRm6$D7A5"XS[J%u/:i!]Tn1]_".#%82[66o*P71]HP+9;I-0dJ/o^Dh\
!
A while ago, I proposed a base-120 encoding using two-letter digits formed from one consonant (b-z) and one vowel (a, e, i, o, u, y). This gives us numbers like cewo (945), cipote (123571) or ritufyna (139596420). There are more details and some code at https://github.com/mjambon/base120.