Compression Library Status

My new compression API and new, enhanced test procedure have revealed quite a bit about the state of my compression code. Adaptive arithmetic encoding is not working at all, which is a bit of a downer because I was only just able to grasp the theory at the time I wrote it a couple of years ago. LZW is broken, which explains why I can’t write my GIF plug-in and LZ77 is horribly, abysmally slow. Deflate, which is based on LZ77, suffers a similar fate.

I am currently looking at making improvements to LZ77. Decompression is almost instantaneous. Compression, however, is very slow.

In my earlier post, I had incorrectly labelled LZ77 as a dictionary-based compression scheme. It is not. It is a sliding window scheme. Instead of storing common “phrases” in a dictionary, LZ77 uses a negative offset into a “window” that is based on the current position in the input stream. For example, if the window size is 2048 bytes and our current position in the input buffer is byte 4200, then the compressor will look as far back as byte 2152 for a similar phrase.

My first attempt at figuring out the reason behind the slow operation of this compressor targetted the bit I/O API of my DataBuffer class. While it indeed is not an optimal implementation, and I wrote a BitPipeline API to replace it, it was not the prime reason. Instead, the culprit was the common phrase search algorithm, which was a very simplistic algorithm and, like most simple algorithms, worked poorly. For each byte in the input stream, I would scan the preceding bytes in the input buffer for a matching byte, then I would see how many bytes forward of both positions would match. After that, I would continue my search for another matching byte and look for another pattern match. Repeat at the next byte.

It occurred to me as I looked over the code that most of the time was being wasted in looking for matching bytes to check for pattern matches. So, instead of doing repeated scans, I decided to build an IndexDictionary. An IndexDictionary is, conceptually, a 256-entry table that contains lists of the indices of those values’ locations in the data stream.

For example, if the first 5 bytes of the data stream contained the values: { 0, 5, 255, 128, 255, … }, then the IndexDictionary would look like this:

dictionary[0]: { 0 }
dictionary[5]: { 1 }
dictionary[128]: { 3 }
dictionary[255]: { 2, 4 }

So now, instead of repeated scanning the same data over and over looking for byte values that haven’t changed, I can use the IndexDictionary to know exactly where to check for common phrases.

Additionally, the IndexDictionary knows that indices will fall out of favor once the sliding window slides past. It has features to “age” the indices and to cull index vector blocks in order to keep memory use to a manageable level.

Comments are closed.