Monday, 25 February 2013

How many ways can you mess up IO?

There are so many interesting ways that people find to make a mess of reading binary data, but actually it doesn’t need to be hard. This post is intended to describe some of the more common misunderstandings people have.

Text encodings: what they are not

Text is simple, right?

Text encodings are great; you use them (whether you know it or not) all the time, whenever your computer opens any kind of text file, or downloads text (such as the ubiquitous html and json) from the internet. If you don’t know what an encoding is, then first: go and read Joel’s treatise on the subject. No really, go read that and then come back.

1So now you know: an encoding is a way of working with character data (you know... words etc) over a binary medium (such as a file on disk, or a network socket). Common encodings are things like UTF-8, UTF-7, UTF-16 (in either endian-ness), UTF-32, or “Windows-1252” - however, there is a wide range of encodings available. Some (in particular the UTF-* encodings) can handle most-any unicode character, but many are restricted to the characters used most commonly in a particular locale. Ultimately, an encoding defines a map between characters and bytes, for example defining that “abc” should be represented as the bytes 00-61-00-62-00-63 (and equally, that the bytes 00-61-00-62-00-63 can be interpreted as the text “abc”).

Here’s the thing an encoding is not: it is not a way to turn arbitrary bytes into text for storage. You would be amazed how often people try this, but: that simply doesn’t work. If an encoding tries to read something that isn’t text data, then at best: it will throw an error. At worst, it will silently corrupt the data without realizing it has made a mistake. If you want to store arbitrary binary data as text, then there are other tools for that: primarily, things like base-n. In this case, it is the text that is specially formatted. For example, we might need to convey the bytes 07-A2-00-B2-EE-02 using only text.The observant will notice that I’ve just done exactly that using hexadecimal (base-16), but that it took 17 characters (12 without the dashes) to represent just 6 bytes of actual data. A common alternative to reduce this overhead is base-64, which uses characters that avoid the “control characters” likely to cause problems, while also staying in the first range 0-127, which is the most reliable region. Our 6 bytes become “B6IAsu4C” when stored as a base-64 string. Most platforms have utility methods that provide translation to and from base-64.

The key in choosing between a text-encoding and something like base-64 is simple: all you need to figure out is: which of these two things can have arbitrary contents, and which is required to follow rules (otherwise it is nonsensical)?

arbitrary* text / rules-based binary: use a text encoding
arbitrary binary / rules-based text: use base-64 (or similar)

*=at least, for the characters that the encoding supports, since not every encoding supports every character.

One last thing to say about encodings: don’t use the “system default” encoding - ever. This goes back to the old days of 8-bit systems where your text console needed a code-page that fitted into a single byte but reached at least most of the characters the user was likely to need. For me, this is code-page 1252, but yours may be different. It is exceptionally rare that this is what you want; if in doubt, explicitly specify something like UTF-8. Sometimes you can use byte-order-mark detection, but this is also pretty unreliable - many files omit a BOM. The best answer, obviously, is: define and document what encoding you are planning to use. Then use it.

Network packets: what you send is not (usually) what you get

3TCP usually works as a stream of data. When you send things, it is usually not guaranteed that it will arrive at the destination in exactly the same chunks that you send. For example, let’s say that
you try to “send” 3 messages over the same socket using your chosen network library- one with 10 bytes, one with 4 bytes, and one with 8 bytes. You might think that the receiving client can call “read” 3 times and get 10 bytes, 4 bytes and 8 bytes - but it is much more interesting than that. Because it is just a stream, this is simply very unlikely to be what happens. The client could find they get 1 message with 22 bytes. Or 22 messages each with 1 byte. Or any other combination. They could get 10 bytes, 10 bytes, and the last 2 bytes never arrive (because of some network issue). All that TCP guarantees is that whatever portion of the data does make it to the receiver, it will be the correct bytes in the correct order.

Because of this, if you want to send multiple messages over the same socket it is necessary to implement some kind of “framing” protocol - i.e. some way of splitting the data back into logical pieces at the other end. How you do this is up to you, and often depends on the data being sent. For example, text-based protocols often detect a “sentinel” value to split messages (possibly a carriage return, line-feed, or some other “control character” rarely seen in regular text; the characters with values 0 (“nul”) and 3 (“etx”) are also popular choices.

If your messages are binary then it is more challenging, as usually there aren’t really any safe “sentinel” bytes to choose from - so commonly some header information is sent before each message that includes (perhaps only includes) the length of the message that follows. This could be as simple as dumping the 4-byte (32-bit) or 8-byte (64-bit) native representation of the length onto the stream (although you also need to decide whether it is “big endian” or “little endian”, obviously). There are also a range of variable-length representations, for when most messages are expected to be small, but the protocol needs to allow for much longer messages. The key thing here is that you must decide how you are going to do this, and clearly document it for consumers. This of course has the unfortunate requirement that you must actually know the length of the data you want to write before you write it - not always convenient. To help with this, some protocols (for example, the more recent web-sockets protocols) allow you to further break down a single message into multiple fragments that the client must stitch back together into a single logical piece - with an additional marker (perhaps a trailing zero-length message, or a special bit set in the header message) to indicate the end of the logical message.

Learning to read

Most frameworks have a “read” API that looks something like:

    int read(byte[] buffer, int offset, int count)

2(or something appropriately similar for asynchronous access). The point being that you supply a buffer (somewhere for it to put the data), tell it where in that buffer to start writing (often, but not always, 0, and how much data you want it to read. It then goes away and fetches some data for you. Here’s the rub: in most cases, the last parameter is just a maximum - it is not “get this much data”; it is “get at most this much data”. The return value tells you what it could do: this could be non-positive if no more data will ever be available (the stream has ended), it could be count - which is to say: it could fetch all the data you wanted, or it could be any other value greater than zero and less than count. Because of this, reading often involves a loop; for example, let’s say we want to write a method to read an exact number of bytes; me might do that as:

    void readExact(byte[] buffer, int offset, int count) {
        int bytesRead;
        if(count < 0) {
            throw new ArgumentOutOfRangeException(“count”);
        }
        while(count != 0 && 
          (bytesRead = source.read(buffer, offset, count)) > 0) {
            offset += bytesRead;
            count -= bytesRead;
        }
        if(count != 0) throw new EndOfStreamException();
    }

Dissecting that:

  • first we check that we aren’t requesting a negative amount of data, which is clearly silly
  • then in a loop, we:
    • check to see if we want more data, then try to read at most count more data
    • if the stream ends, we break out
    • otherwise, we increment out offset, so that if we need to do another read, we don’t over-stamp the data we just fetched) - and decrement count, because we now have less work required
  • finally, we check whether we still have data outstanding, because the stream terminated before we expected to - perhaps raising an error to indicate this


This is a fairly classic pattern for reading data that shows how to process the number of bytes obtained in each iteration.

Gotchas when buffering data in memory

Sometimes it isn’t possible to process all the data in a single buffer; in many cases it becomes necessary to use an in-memory store of data that you have received but still need to process (perhaps because you need yet more data before you can make proper sense of it). A common way to do this is to make use of a “memory-stream” - an in-memory object that acts like a stream, but that simply uses local memory rather than a disk or network socket. You basically “write” to the memory-stream until you think you have something you can process, then “read” from it. But did you spot the deliberate mistake? Most memory-streams acts like an old-fashioned VCR: if you record a show, hit “stop”, and then hit “play” - you will find that you are unexpectedly either watching a black screen, or something that you recorded 3 weeks ago. We forgot to rewind it. Essentially, the memory-stream has a single cursor position; after writing the data, the cursor is at the end of the data: what follows is either all zeros, or garbage from the last thing that happened to be in that space. Fortunately, rewinding a memory-stream is usually trivial:

    stream.Position = 0;

There’s a second very common confusion with memory-streams; sometimes, after writing to it you want to get the contents, but as a raw buffer (byte[]) rather than as a stream. Often, there are two different APIs for this:

  • one which gives you a brand new buffer of the current logical length of the stream, copying the data into the new buffer
  • one which hands you the oversized buffer that it is using internally - oversized so that it doesn’t need to allocate a new backing buffer every time you write to it

They both have uses; the second is often more efficient, as it avoids an allocation - but it is only useful if you also track the logical length of the data (often via stream.Length or stream.getLength()). To go back to our VCR analogy: one creates a brand new cassette of exactly the 37.5 minutes we recorded, and copies the video - the other simply hands us the existing 180 minute cassette: this is only useful if we also say “the bit you want is the first 37.5 minutes”.

Conclusion

This isn’t actually all that complex, but people tend to bring a lot of incorrect preconceptions to the table, which results in the same bugs – over and over and over. So let’s all save some time, and do it right. Please?

4 comments:

Anonymous said...

It would be certainly easier if the examples given in articles such as this were in fact correct. How about while (count > 0... for a start?

Marc Gravell said...

@Anonymous read it again - it explicitly excludes the < 0 case at the start.

Simone said...

I think what anonymous intends is that the first condition in the while loop should read count > 0 as to avoid count becoming negative, but he probably missed that since you are reading at most count and then subtracting that value from count, then count cannot decrease beyond zero. Easy to miss if you're not used to write such code every so often.

Thanks for the article Marc, a nice refresher.

manu said...

I thought you actually could use the "Extended ASCII" encoding to convert arbitrary binary data into text and vice versa w/o corrupting the data.