We have seen Base16 (hex), Base32 and Base64 encoding, and each one is more efficient than the previous, so an obvious question is, can we find an even more efficient encoding?
Unfortunately we cannot create a Base128 encoder, because there are only 94 printable characters in ASCII (excluding whitespace). We could create a Base94 encoder, and it would be more efficient than Base64. However, as it turns out Base85 is just as efficient as Base94 would be. There is no point using an alphabet of 94 characters where 85 would do.
ASCII85 is a base 85 encoding scheme used by the PostScript and PDF languages. It is described in the PDF specification. Beware, it is a large document (and only a tiny section relates to ASCII85).
ASCII85 encodes the input data four bytes at a time. Each block of four input bytes is encoded to create a block of five printable characters.
What the encoder does, in effect, is treat each block of four bytes as a single 32 bit number, x (the first byte is taken as the most significant byte). It then represents this number as a 5 digit number using base 85, ie
x = N0*52200625 + N1*614125 + N2*7225 + N3*85 + N4
Where N0…N4 are all in the range 0 to 84. An alternative way of expressing this calculation is:
N0 = (x/52200625) mod 85 N1 = (x/614125 ) mod 85 N2 = (x/7225 ) mod 85 N3 = (x/85) mod 85 N4 = x mod 85
Note that we are using integer division, so that the result is rounded down to the nearest integer. mod is the modulo function. If you are wondering where the magic numbers come from, they are just powers of 85 (7225 is 85 squared, 614125 is 85 cubed, etc).
The second stage is to convert numbers N0…N4 into ASCII characters, C0…C4. This is done by adding 33, and taking the resulting character from the ASCII table. This results in ASCII characters in the range “!” (33) to “u” (117).
In addition, if all 4 original bytes are zero, the result is coded as a single character “z” instead of the five character string “!!!!!”.
The ASCII85 specification does not require you to add line breaks at particular intervals, however you are permitted to do this if you wish (because whitespace characters are ignored in the encoded data). It would probably be sensible to add line breaks regularly (eg every 80 characters).
The original data can be any length, not necessarily a multiple of 4. This means that the last block of binary data could be 1, 2, 3 or 4 bytes long. To code the final block, we add zeros to the final block to make it a multiple of 4, and convert it to 5 characters as usual. However, the special use of the z character does not apply to the final block. We indicate the length of the block in the following way:
if the final block has a length of 1 byte, the encoded characters consist of C0, C1, followed by the 2 characters ~> (C2, C3 and C4 contain no useful information anyway).
- if the final block has a length of 2 bytes, the encoded characters consist of C0..C2 followed by the 2 characters ~>
- if the final block has a length of 3 bytes, the encoded characters consist of C0..C3 followed by the 2 characters ~>
- if the final block has a length of 4 bytes, the encoded characters consist of C0..C4 followed by the 2 characters ~>
As a practical example, consider how we would encode the following sequence of 5 bytes:
0x12 0x34 0x56 0x78 0x9A
First we form each set of 4 bytes into a 32 bit quantity. In this case the final block is only 1 byte long so we pad it with zeros.
0x12345678 (305,419,896 in decimal) 0x9A000000 (2,583,691,264 in decimal)
Now we need to convert these 2 numbers to base 85. Using the formulae above we get:
N0 = 5, N1 = 72, N2 = 27, N3 = 55, N4 = 21 N0 = 49, N1 = 42, N2 = 9, N3 = 27, N4 = 69
Next we add 33 and look it up in the ASCII table to obtain the following characters:
Finally, remembering that the last block contained only 1 byte, the rule says that we retain just the first 2 characters followed by the end marker ~> which gives:
A decoder might encounter data which does not completely conform to the specification above. It is then up to the decoder to decide whether to ignore the discrepancy, or indicate an error. Here are some the main error cases:
Whitespace characters - this is not an error. The specification allows whitespace, but does not require it.
Illegal characters - any character other than whitespace, ASCII range 33 to 117, “z”, “~” and “>” are illegal and should be considered an error.
Illegal use of z character - “z” (representing 5 zeros) is used to replace an entire block “!!!!!”. It cannot legally occur in the middle of a block.
The specification implies that “z” must be used in place of “!!!!!”, however it is not clear if the intention is for the sequence “!!!!!” to actually be illegal. There wouldn’t seem to be any good reason for a decoder to reject it.
Out of range block - the largest 5 digit base 85 number is slightly larger than a 32 bit number. So, it is possible to create a 5 character block which cannot be converted back to a 32 bit integer. This is illegal, because no legal encoder could produce such a sequence.
Invalid last data block - the last block must terminate in the character pair “~>” (and it is also illegal for these characters to appear in any other context). The last block must contain at least 2 characters before the “~>” pair.
Why Base 85?
So what is so special about the number 85? Simply, you can represent a 32 bit quantity as a 5 digit base 85 number. 85 is the smallest base which allows you to do that.
Largest 32 bit number is 4294967295
Largest 5 digit base 85 number is 4437053124
Largest 5 digit base 84 number is 4182119423
Looking at the second digit of those numbers, a 5 digit base 84 number is not quite large enough to represent every possible 32 bit number. A 5 digit base 85 number is.
Is this the most efficient coding possible? Look at the pattern. Base64 encodes 3 bytes as 4 characters, using a 64 character alphabet – this increases data size by 33%. ASCII85 encodes 4 bytes as 5 characters, using an 85 character alphabet, which increases data size by just 20%.
The next logical step would be to encode 5 bytes as 6 characters. Unfortunately if we want to do this we need an alphabet of 102 characters, and there are only 94 printable, non-whitespace, ASCII characters.
In fact it is possible to achieve tiny improvements, by using larger block sizes. You can create a scheme which encodes 16 bytes as 19 characters, which increases the data size by 18.75% (a small improvement on the 20% which ASCII85 achieves). However, it probably isn’t worth the effort – it is inconvenient to have to deal with such large blocks, and the algorithm involves maths based on 128 bit numbers (at the time of writing, this is relatively inefficient). For most practical purposes, ASCII85 is as good as it gets.