Base64 encoding is reasonably efficient (encoded data is roughly one third larger than the original data), and has the advantage of being highly compatible with most operating sytems because it only uses a limited character set in its encoded form.
The algorithm is described as part of the MIME specification. Also, RFC 3548 describes Base64 encoding, but takes a slightly different attitude to line breaks. We will discuss this below.
Base64 encodes the input data three bytes at a time. Each block of three input bytes is encoded to create a block of four printable characters.
The three bytes are ordered into a 24 bit value, starting from the most significant bit of Byte0, and ending with the least significant bit of Byte2. The bits are then arranged as a set of four 6 bit numbers, N0 to N3. The first 6 bits form N0, the next 6 form N1 etc. Each 6 bit number has a range 0 to 63.
The second stage is to convert numbers N0…N3 into ASCII characters, C0…C3. This is done according to the following table:
In addition, the MIME specification states that encoded data should have a CRLF character pair inserted after every 76 characters (or less) of encoded data.
The original data can be any length, not necessarily a multiple of 3. This means that the last block of binary data could be 1, 2 or 3 bytes long. To code the final block, we add zeros to the final block to make it a multiple of 3, and convert it to 4 characters as usual. However, we indicate the length of the block in the following way:
- If the final block has a length of 1 byte, the encoded characters consist of C0, C1, followed by 2 “=” characters (C2 and C3 contain no useful information anyway).
- If the final block has a length of 2 bytes, the encoded characters consist of C0, C1, C2, followed by a single “=” character.
- If the final block has a length of 3 bytes, the encoded characters consist of C0, C1, C2, C3 in the normal way.
As a practical example, consider how we would encode the following sequence of 5 bytes:
0x12 0x34 0x56 0x78 0x9A
First we form each set of 3 bytes into a 24 bit quantity. In this case the final block is only 2 bytes long so we pad it with zeros.
in binary this is:
Dividing this into 6 bit blocks we get:
000100 100011 010001 010110 011110 001001 101000 000000
or, in decimal:
4, 35, 17, 22, 30, 9, 40, 0
Now we look each value up in the table, which gives:
Finally, remember that the final block was only 2 bytes long. This means that the final encoded block should consist of the first 3 encoded characters followed by a single “=” character (ie the final A must be replaced). the completed encoding is:
A decoder might encounter data which does not completely conform to the specification above. It is then up to the decoder to decide whether to ignore the discrepancy, or indicate an error. Here are some the main error cases:
Whitespace characters - the MIME specification of Base64 requires a CRLF pair at least every 76 characters. The reason behind this is the 76 character-per-line limit in SMTP (which is what MIME was designed to support).
RFC 3548 updates this view, and says that Base64 itself shouldn’t require (or even allow) CRLF characters. However, a protocol which uses Base64 (eg MIME) can impose its own requirements. This is sensible, but in practical terms the MIME version has been the de facto standard for a long time and no general purpose decoder would be wise to ignore it.
Perhaps the safest option is to always produce data which obeys this line limit, but accept data whether it has CRLF pairs or not.
Illegal characters - any characters, other than CRLF, which are not part of the Base64 alphabet, probably indicate data corruption. There could be an argument for making a special case for space and tab characters, and ignoring those too (rather than reporting an error).
Incomplete last data block - if the data stream does not terminate correctly, then the data might have been truncated or otherwise corrupted.
Base64 requires 64 ASCII characters to encode data. This set of characters is called its alphabet. There are only 62 alphanumeric characters, so Base64 necessarily uses 2 punctuation characters in addition to the alphanumeric characters. The characters “+” and “/” were chosen (a long time ago) for maximum compatibility with archaic, non-ASCII character encodings such as EBCDIC.
Unfortunately, if you ever need to use Base64 data in a filename or URL, these characters are problematic - they are already used by some filing systems. There is an alternative alphabet, the Filename Safe Alphabet, which uses:
- ”-” (minus) instead of “+” for vaue 62.
- “_” (underscore) instead of “/” for value 63.
Apart from that the algorithm is identical.
The term Base64 always refers to an encoding using the original alphabet. If an encoding uses an alternative alphabet, it must always be made explicit that it is not standard Base64 encoding.