Weedpacket wrote:Does your scheme start out with a preloaded dictionary?
Uh the LZW implementation I played around with does, since that is standard LZW to preload dictionary entries 0-255. This is why in LZW, the dictionary can be throw away and all you transmit is the compressed data. Then the decomposer can start its own dictionary, pre-loaded, and be able to reverse the process.
The LZ77 does not.
"LZ77-based schemes keep track of the last n bytes of data seen, and when a phrase is encountered that has already been seen, they output a pair of values corresponding to the position of the phrase in the previously-seen buffer of data, and the length of the phrase .... The most commonly used algorithms are derived from the LZSS scheme described by James Storer and Thomas Szymanski in 1982. In this the compressor maintains a window of size N bytes and a lookahead buffer the contents of which it tries to find a match for in the window:"
I'm using a sliding window variant of LZ77.
So basically in effect, there really is no dictionary. The compressed data kind of doubles as its own dictionary too, but I'm not sure what is technically the correct way to say that.
So when a match is found the output is <Pos,Len>. When one is not found, the output is the 1st character of the lookahead buffer.
The way I encode this is a trigger character to tell the decomposer that a <Pos,Len> entry follows. I try to automatically find a character not used in the file to prevent confusing the decompressor. Then the position is actually stored in 2 characters/bytes because I use a large window to get better compression. And then the length is 1 character/byte since my look ahead buffer is only 114 characters, which allows the length to be stored in 1 character/byte.
So that is a length of 4 for each codeword, so I also place a minimum match length of 5, but I'm going to play around with that to see if I can find a more optimal number.
How I record those position and lengths to the data stream, is convert them to their character equivalents. So say it needs to place in the code word <94,120,10>, the data stream might output "Âx
", then slide the window, and look for another match... this is a bit CPU intensive when compressing, but decompression is fairly quick.