Hackery: December 2007

Work continues on my data compression library. I have completed:

  • RLE
  • PackBits (including the 16-bit variant)
  • Delta Row
  • Huffman (including the dynamic variant)
  • Arithmetic Encoding (including the dynamic variant)
  • LZW (including support for non-byte alphabets)
  • LZ77 (including the LZSS variant, as well as adding an index hash to speed up pattern searches)

The performance is very good across all the algorithms, particularly when using the index hash with the LZ77 algorithm. For a highly-compressible 10MB data set, the performance improved from 15.5 sec to 0.4 sec with the index hash with a 32768-byte window size and a 31-byte pattern length. My next algorithm will be DEFLATE, an LZ77 variant that utilizes Huffman coding to further compress the character and index/length pairs that the LZ77 algorithms emit. PKZIP and other related applications use DEFLATE to compress the supplied data files. When I finish writing this, I should be able to build my own ZIP file archiver.

One thing that I’ve neglected to mention is that I’m using a buffer object, called DBuffer, as the input(s) to all of my compression routines. DBuffer is an attempt to encapsulate all the usual headaches inherent in managing data buffers in a C/C++ program.

In C/C++, the programmer manages all the memory usage of an application through the use of function calls and pointer variables. Frex, suppose an application needed a canvas upon which to write/draw/paint an image. At some point it would need to ask the OS for some memory. In standard C, this is done by using the malloc() function call and receiving a pointer to that memory as a function return, like so:

void *canvasBuffer = malloc(requestedCanvasBufferSize);
if (canvasBuffer == NULL)
    {
    puts("Uh-oh!");
    return FAIL;
    }

One Achilles Heel of programming at this level is the management of allocated memory. Note that in order to request memory from the OS, we need to know how much memory we need ahead of time. Sometimes, this is very straightforward. In our canvas example, if we know that we’re creating a 100×100 pixel 8-bit RGB image, then the request is one of simple math: #-bytes-needed = 100 * 100 * 1 * 3 = 30,000. (100pixels * 100pixels * 1 byte (per 8 bits) * 3 color components (RGB)). Other times, it’s not so easy.

For instance, when you have a compressed data stream, you don’t know ahead of time what the decompressed size will be. You know that it will be larger than the compressed data size, but that’s about it. Some formats, however, will store hints about the original data stream. Frex, one can envision a data stream consisting of 2 integers followed by a compressed data stream where the first integer is the size of the original data stream (which should also be the size of the decompressed data stream), the second integers is the size of the compressed data stream with that many bytes following as the compressed data stream itself. Something like this:

typedef struct {
    int origSize;
    int comprSize;
    byte comprData[];
} CompressedDataInfo;

CompressedDataInfo cdi;
/*  Comment: cdi is loaded with some real data.  */
byte *decompressedDataBuffer = (byte*) malloc(cdi.origSize);
if (decompressedDataBuffer != NULL) {
    Decompress(cdi.comprSize, cdi.comprData, decompressedData);
}

While the above code works, it is vulnerable to abuse by hackers. How? Well, the CompressedDataInfo structure may be receiving its data over the Internet from a malicious web site. If the supplied cdi.origSize is less than what the decompression function really needs, then data will be written past the end of the allocated memory space and into either other allocated memory space or into unallocated memory space. Anything written into that extra space will be inaccessible by the rest of the program, except through an unintended side effect, which may cause such data to be executed as code, ie: a virus.

Instead of relying on pre-allocated memory outside of the decompression function, DBuffer tries to manage itself. If it needs more memory, then it will try to allocate more space into a new replacement internal buffer, copy the already existing data from the older, now-too-small buffer into the newer one, then free the too-small buffer, giving it back to the OS for other applications or buffer requests.

DBuffer has two primary modes of operation: Raw and Managed. In Raw mode, a DBuffer is similar to a simple chunk of memory. It’s all usable. It has a very simple API for reading and writing: DBPeek() and DBPoke(). In Managed mode, the DBuffer is used as a stream buffer. There are offset indicators to specify how much of the available buffer space is being used. Reading/writing use a special API: DBRead() and DBWrite(), which manage the offset indicators and ultimately call DBPeek() and DBPoke(). When the buffer is full after or during a DBWrite(), the DBuffer will take appropriate action depending on certain configuration flags. One of these, DB_EXPAND_WHEN_FULL, will attempt to reallocate more space so that more write operations can safely occur within the buffer. Another, as yet unsupported flag, DB_USES_SINK_PROC, will call a specified function to drain the contents of the buffer, presumably into a file. When DB_EXPAND_WHEN_FULL is set, other flags will give hints as to how to grow the buffer.

DBuffer also supports other options when it is managed. DB_BIT_IO_FLAG is used to transform the buffer from a byte-oriented buffer into a bit-oriented buffer. This eases the support for bit-wise I/O operations and includes the support for managed buffer expansion. The I/O operations for bitwise I/O include: DBReadBits(), DBWriteBits(), DBPushClearBits(), DBPushSetBits() and DBStampBits().

Once I get DEFLATE written and tested, I will be making another pass through the compression code to add support for re-entrant operation. It is not feasible to only support the compression and decompression of entire memory buffers. I do not wish to compress a 1 GB data stream by reading it into a 1 GB memory buffer first. So, I will be adding to the compression code APIs to support re-entrancy. Following that, I will be building Ruby bindings to the code. I may write a Codec module to support the compression API set and/or write a binding that will extend the String and IO classes to support compression and decompression within each class.

Comments are closed.