Binary encoding
A common problem in many branches of computing arises when we need to pass binary data through a system which was designed to process text-only data. There are countless examples of applications, file formats, transmission protocols and so on which were specifically designed to handle text, but inevitably someone, somewhere will need to use these systems to process raw binary data.
The solution to this is to use a binary encoding scheme.
Binary, ASCII and Text data
Binary data is a sequence of 8 bit bytes, where each byte can have a value between 0x00 and 0xFF. In general we can’t assume much about this data, except that any byte could potentially have any value.
ASCII data represents text as a sequence of bytes. In the ASCII system, byte values in the range 0x00 to 0x7F are used to represent English language letters (upper and lower case), numerals, punctuation symbols, and various “control characters”. Byte values of 0x80 and above have no defined meaning in ASCII.
Since ASCII data is not expected to contain byte values of 0x80 or greater (ie with the most significant bit set), it is often called 7 bit data.
Printable characters in ASCII are values in the range 0x21 to 0x7E, which includes letters a-z, A-Z, digits 0–9 and all standard punctuation.
Whitespace in ASCII consists of the space character (0x20), carriage return (CR, 0x0D), line feed (LF, 0x0A) and tab (0x09).
Text data is ASCII data which only contains printable and whitespace characters.
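The three definitions above can be sketched as a small Python function. This is only an illustration: the name `classify` and the labels it returns are our own choices, not part of any standard, and the label "binary" here simply means "contains bytes that are not valid ASCII".

```python
# Byte ranges taken from the definitions above.
PRINTABLE = range(0x21, 0x7F)           # 0x21 to 0x7E inclusive
WHITESPACE = {0x20, 0x0D, 0x0A, 0x09}   # space, CR, LF, tab


def classify(data: bytes) -> str:
    """Return 'text', 'ascii' or 'binary' (hypothetical labels) for a byte sequence."""
    # Any byte with the most significant bit set is outside ASCII.
    if any(b >= 0x80 for b in data):
        return "binary"
    # Text data is ASCII restricted to printable and whitespace characters.
    if all(b in PRINTABLE or b in WHITESPACE for b in data):
        return "text"
    # Valid 7 bit data that contains other control characters.
    return "ascii"


print(classify(b"Hello, world!\r\n"))   # text
print(classify(b"beep\x07"))            # ascii
print(classify(b"\x89PNG"))             # binary
```

Note that every text file is also valid ASCII, and every ASCII file is also valid binary data; the function just reports the most restrictive category that applies.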
Problems with Binary Data
If a system is designed to handle text data, it might make certain assumptions about that data. This can easily cause the system to fail if binary data is passed through it. Here are some of the most common problems:
Line endings – different computer operating systems have different conventions for representing line endings. Some use a CR, some use a LF, and some use CR followed by LF. Some systems try to be helpful by automatically substituting these characters. This is great for genuine text data, but absolutely disastrous for binary data.
Tab substitution – in a similar way, some systems automatically substitute tab characters for multiple spaces, or vice versa.
Special characters – some systems assign special meanings to particular non-printable characters. For instance, some text systems use “end of data” control characters, and might terminate the data when they find such a character. Typically NUL (0x00), Ctrl-D (0x04) or Ctrl-Z (0x1A) are used for this purpose. Some systems even emit a beep when they encounter the BEL character (0x07)!
Line length – some systems process text on a line by line basis, and they often make assumptions about how long text lines will be (eg 80 characters maximum). If a file is encountered where the lines are too long, it might lead to data loss, program errors, or even a crash.
As we noted earlier, lines are delimited by either CR, LF or CRLF characters. But in a binary file, there is no reason to suppose that these characters will appear regularly, if at all.
Rejection – some systems scan the data for non-text characters, and simply refuse to process binary data.
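To make the line ending problem concrete, here is a minimal Python sketch. The function `normalize_line_endings` is a hypothetical stand-in for a "helpful" text system that converts every line ending to LF; real systems differ in the details, but the effect on binary data is similar.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Mimic a text system that converts CRLF and lone CR to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Genuine text survives: only the line ending convention changes.
text = b"line one\r\nline two\r\n"
print(normalize_line_endings(text))

# Binary data that happens to contain the byte values 0x0D and 0x0A.
binary = bytes([0x00, 0x0D, 0x0A, 0xFF, 0x0D, 0x10])
damaged = normalize_line_endings(binary)

print(binary.hex())    # 000d0aff0d10
print(damaged.hex())   # 000aff0a10
```

The binary data comes out one byte shorter, and a second byte has changed value. Nothing in the output indicates that anything went wrong, and the original data cannot be recovered from it.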
A Solution – Binary Encoding
We have listed some of the possible problems with processing binary data in a text based system. Of course some systems are more robust than others, but sooner or later you are likely to run into one or more of these problems.
A solution to this problem is to use binary encoding. Before passing our binary data through a text based system, we encode it as a (longer) sequence of text characters. When we get the data back out of the system, we must decode it to obtain our original data.
We obviously need to be careful about whitespace characters, because they might not be transferred reliably. On the other hand they are clearly necessary (CR or LF are needed to split the data into manageable line lengths). Most encoding schemes use only printable characters for encoding but allow line breaks to be present (but ignore them when decoding).
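As a rough sketch of the idea, the Python code below encodes each byte as two hexadecimal digits, splits the result into lines, and ignores all whitespace when decoding. Hexadecimal is used here only because it is the simplest scheme to illustrate; the function names `encode` and `decode` and the 76 character line length are our own assumptions, not part of any particular standard.

```python
def encode(data: bytes, line_length: int = 76) -> str:
    """Encode binary data as lines of hexadecimal digits (printable characters only)."""
    hex_text = data.hex()
    # Split into manageable lines so that line length limits are respected.
    lines = [hex_text[i:i + line_length]
             for i in range(0, len(hex_text), line_length)]
    return "\n".join(lines)


def decode(text: str) -> bytes:
    """Decode hexadecimal text, ignoring any whitespace it contains."""
    # Line breaks may have been changed or added in transit, so drop them all.
    compact = "".join(text.split())
    return bytes.fromhex(compact)


original = bytes(range(256))      # every possible byte value
encoded = encode(original)

# Simulate a system that rewrites LF line endings as CRLF.
mangled = encoded.replace("\n", "\r\n")

print(decode(mangled) == original)   # True
```

The encoded form is twice the size of the original data, which is the price paid for using only 16 of the available printable characters. Because the decoder discards whitespace, the data survives line ending substitution unharmed.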
