TenMinuteTutor

What is binary encoding

What is binary encoding, and why is it useful? All is explained here.

Binary, ASCII and Text data

Binary data is a sequence of 8-bit bytes, where each byte can have a value between 0x00 and 0xFF. In general we can’t assume anything about this data, except that any byte could potentially have any value.

ASCII data represents text as a sequence of bytes. In the ASCII system, byte values in the range 0x00 to 0x7F are used to represent English language letters (upper and lower case), numerals, punctuation symbols, and various “control characters”. Byte values of 0x80 and above have no defined meaning in ASCII.

Since ASCII data is not expected to contain byte values of 0x80 or greater (i.e. with the most significant bit set), it is often called 7-bit data.

Printable characters in ASCII are values in the range 0x21 to 0x7E, which includes letters a-z, A-Z, digits 0-9 and all standard punctuation.

Whitespace in ASCII consists of the space character (0x20), carriage return (CR, 0x0D), line feed (LF, 0x0A) and tab (0x09).

Text data is ASCII data which only contains printable and whitespace characters.
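The definitions above can be turned into a small check. This is only a sketch: the function name classify and the three labels it returns are our own choices for illustration, not part of any standard.

```python
# Byte ranges taken from the definitions above.
PRINTABLE = range(0x21, 0x7F)           # 0x21 to 0x7E inclusive
WHITESPACE = {0x20, 0x0D, 0x0A, 0x09}   # space, CR, LF, tab


def classify(data: bytes) -> str:
    """Return 'text', 'ascii' or 'binary' (labels invented for this sketch)."""
    # Text data: only printable and whitespace characters.
    if all(b in PRINTABLE or b in WHITESPACE for b in data):
        return "text"
    # ASCII data: 7-bit values only, control characters allowed.
    if all(b <= 0x7F for b in data):
        return "ascii"
    # Anything with a byte of 0x80 or above is treated as binary.
    return "binary"


print(classify(b"Hello, world!\r\n"))        # text
print(classify(b"beep\x07"))                 # ascii (contains BEL)
print(classify(bytes([0x00, 0x80, 0xFF])))   # binary
```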

Problems with Binary Data

If a system is designed to handle text data, it might make certain assumptions about that data. This can easily cause the system to fail if binary data is passed through it. Here are some of the most common problems:

Line endings - different computer operating systems have different conventions for representing line endings. Some use a CR, some use a LF, and some use CR followed by LF. Some systems try to be helpful by automatically substituting these characters. This is great for genuine text data, but absolutely disastrous for binary data.

Tab substitution - in a similar way, some systems automatically replace each tab character with several spaces, or replace runs of spaces with a tab.

Special characters - some systems assign special meanings to particular non-printable characters. For instance, some text systems use “end of data” control characters, and might terminate the data when they find such a character. Typically NUL (0x00), Ctrl-D (0x04) or Ctrl-Z (0x1A) are used for this purpose. Some systems even emit a beep when they encounter the BEL character (0x07)!

Line length - some systems process text on a line by line basis, and they often make assumptions about how long text lines will be (e.g. 80 characters maximum). If a file is encountered where the lines are too long, it might lead to data loss, program errors, or even a crash.

As we noted earlier, lines are delimited by either CR, LF or CRLF characters. But in a binary file, there is no reason to suppose that these characters will appear regularly, if at all.

Rejection - some systems scan the data for non-text characters, and simply refuse to process binary data.
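To make the line endings problem concrete, the sketch below simulates a text based system that rewrites every line ending as CR followed by LF. The function normalise_line_endings and the sample payload are made up for this illustration; real systems differ in exactly which substitutions they perform.

```python
def normalise_line_endings(data: bytes) -> bytes:
    """Hypothetical text-mode transfer: rewrite every line ending as CRLF."""
    # Collapse existing CRLF pairs to LF first, then expand every LF to CRLF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Genuine text survives: the line breaks change form but mean the same thing.
print(normalise_line_endings(b"line one\nline two\n"))

# Binary data does not: the 0x0A byte was a value, not a line break.
payload = bytes([0x89, 0x0A, 0x00, 0x0D, 0x0A, 0xFF])
damaged = normalise_line_endings(payload)

print(payload.hex(" "))    # 89 0a 00 0d 0a ff
print(damaged.hex(" "))    # 89 0d 0a 00 0d 0a ff
print(damaged == payload)  # False
```

An extra byte has been inserted, so everything after it is shifted and the original data cannot be reliably recovered.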

A Solution – Binary Encoding

We have listed some of the possible problems with processing binary data in a text based system. Of course some systems are more robust than others, but you are likely to encounter one or more of these types of problems in many cases.

A solution to this problem is to use binary encoding. Before passing our binary data through a text based system, we encode it as a (longer) sequence of text characters. When we get the data back out of the system, we decode it to obtain our original data.
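As a minimal sketch of the idea, here is one very simple scheme: write each byte as two hexadecimal digits. Hexadecimal is just our choice for illustration, and the function names encode and decode are our own; practical schemes are usually more compact.

```python
def encode(data: bytes) -> str:
    """Encode arbitrary bytes as printable text, two hex digits per byte."""
    return data.hex()


def decode(text: str) -> bytes:
    """Recover the original bytes from the hex text."""
    return bytes.fromhex(text)


payload = bytes([0x00, 0x0A, 0x1A, 0xFF])   # values a text system might mangle
encoded = encode(payload)

print(encoded)                     # 000a1aff
print(len(encoded), len(payload))  # 8 4
print(decode(encoded) == payload)  # True
```

The encoded form contains only the characters 0-9 and a-f, so none of the problems listed earlier apply to it. The price is size: the text is twice as long as the original data.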

We obviously need to be careful about whitespace characters, because they might not be transferred reliably. On the other hand, line breaks are clearly necessary (CR or LF are needed to split the data into manageable line lengths). Most encoding schemes therefore use only printable characters for the encoded data, allow line breaks to be present, and ignore those line breaks when decoding.
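The handling of line breaks can be sketched as follows, again using hexadecimal digits as the printable characters. The 76 character line length and the names encode_lines and decode_lines are assumptions made for this example.

```python
LINE_LENGTH = 76  # assumed maximum line length for this sketch


def encode_lines(data: bytes, width: int = LINE_LENGTH) -> str:
    """Hex-encode the data and split it into lines of at most `width` characters."""
    hex_text = data.hex()
    lines = [hex_text[i:i + width] for i in range(0, len(hex_text), width)]
    return "\r\n".join(lines)


def decode_lines(text: str) -> bytes:
    """Discard all whitespace (including line breaks), then decode the hex digits."""
    return bytes.fromhex("".join(text.split()))


payload = bytes(range(256))          # every possible byte value
encoded = encode_lines(payload)

# Simulate a system that converts CRLF line endings to LF in transit.
converted = encoded.replace("\r\n", "\n")

print(max(len(line) for line in encoded.split("\r\n")))  # 76
print(decode_lines(encoded) == payload)                   # True
print(decode_lines(converted) == payload)                 # True
```

Because the decoder throws away whitespace before doing anything else, it does not matter whether the line breaks arrive as CR, LF or CRLF.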