Lossless Javascript compression; anyone interested?

dream_scape

I posted this at Sitepoint, but thought I'd post it here as well to get a better survey.

Ok, I know Javascript compressors aren't rare, but they either only strip unneeded characters resulting in very low compression ratios (around 1.3:1), the javascript "crunchers" as many call themselves, or they crunch the script and perform some kind of regex based ASCII encoding to push the ratio to around 2.3:1 - 2.5:1. And while that ratio is a good ratio for text compression, both of these methods tend to be a little brittle because if all the new lines are stripped, then javascript that isn't strict about terminating statements with semi-colons will stop working. There are some crunchers that won't strip all new lines, but just blank new lines, but this results in even poorer ratios than the more brittle crunchers.

So I thought to myself, why can't there be a lossless javascript compressor? And so I set out a couple days ago to make one, testing and trying various proven compression algorithms.

And that's just what I've done. It uses an LZ77 variant algorithm with a quick decompression technique (since pure javascript has to decompress it). I've tested numerous javascript scripts, and only when the uncompressed file is huge (over 300K) do you notice a performance hit when decompressing the script.

I've got a few compression levels. The first is 100% lossless; all data is left in tact. And, what I think is the best part, this level results in a compression ratio that is competitive with the existing compressors that use brittle crunching + regex encoding, about 2.5:1 on average. Plus I've made a few lossy levels, one that does non-brittle crunching and then lossless compression on the result of that, giving about a 2.9:1 ratio, and then a higher lossy level that performs brittle crunching + lossless compression on the result, which yields about 3.2:1 or better.

Ok so to the point of this post. I'm willing to put it online free for use if people are interested in it. I can't find another lossless javascript compressor in existence, so I think this would be a pretty unique and useful tool [unless you count HTTP Gzip, but not everyone has that enabled on their server, and even if they do, not all have it enabled for JS files (I guess there can be some kind of problem with JS and HTTP Gzip?)]. But that does require some work (making the site, cleaning up the code, more testing of code, etc), so I'd thought I'd ask on here to get an idea if people would be interested in this kind of thing.

So, anyone interested?

Weedpacket

Does that compression ratio include the size of the decompression code?

dream_scape

Weedpacket wrote:
Does that compression ratio include the size of the decompression code?

Yes it does. the JS decompression code is pretty small, but yeah I am using the full JS code including that in the ratio and compressed size calculations.

Also the ratios are based on file sizes in Western ISO Latin 1 encoding. The same data as UTF-8 always will result in a slightly larger file because of the UTF-8 format, but not so much so to throw the ratios off too far. For example, compressing a 40KB file using lossless might result in a 2.5:1 or 2.6:1 ratio if you save it as Western ISO, and maybe 2.3:1 if you save it as UTF-8.

Weedpacket

dream.scape wrote:
throw the ratios off too far.

And not at all in the limit, since the codewords would end up being the same whatever the character encoding.

MarkR

Sounds like a somewhat obscure thing to do to me.

Javascript crunchers are potentially useful, because they remove superfluous things (whitespace, comments), hence making the code smaller and hence load faster.

Of course if you used a Javascript function to decompress it, which was lossless, it would still contain all the comments etc, and even more overhead would be added doing the decompression. So it ends up slower.

Javascript isn't a particularly efficient language at the best of times.

And if you've got so much Javascript it's worth compressing, why not use HTTP compression? That probably compresses better than your code, plus the decompressor is built into the browser and written in native code.

Mark

dream_scape

MarkR wrote:
And if you've got so much Javascript it's worth compressing, why not use HTTP compression?

Mark I've addressed this question in the original post, so if you'd like an answer, please read the original post 😉

dream_scape

Weedpacket wrote:
And not at all in the limit, since the codewords would end up being the same whatever the character encoding.

yeah that's right they are the same, but could be different byte sizes. I'm just using ASCII, including extended ASCII, but under some encodings extended ASCII characters are 1 byte, while under other encodings like UTF-8, the extended ASCII characters are 2 bytes, which is why if you save the script in UTF-8 it ends up a little bigger than in ASCII or ISO-8859-1.

Just as an exercise in learning, I also briefly played around with the LZW algorithm, which was a bit tricky in PHP since the easiest way I could see to implement it was to use unicode characters for the dictionary above 255. But even though the character length was much smaller, the byte size only gave about a 1.5:1 ratio on normal text and scripts because the codewords could be up to 3 bytes long (words 128 to 2048 would be 2 bytes, but 2048 to 4096 - the dictionary limit - would be 3 bytes). On the other hand, LZW is really quite good if you have many blocks of repeating data, which works well for certain kinds of images, but not so well for text.

Weedpacket

Oh, right - the compressed data is UTF8 encoded. For some reason I thought you were referring to the uncompressed source.

dream.scape wrote:
since the easiest way I could see to implement it was to use unicode characters for the dictionary above 255.

I was fiddling around with the PHP tokeniser functions, and found that a convenient for having both raw characters and codewords in the same stream was to use ordinals rather than the characters themselves; "foo" would instead be array(102,111,111). An alternate approach is to store characters as one-character strings and everything else as integers, using type-checking to distinguish them (but storing everything as the same type reduces the need to make such distinctions).

Does your scheme start out with a preloaded dictionary?

dream_scape

Weedpacket wrote:
Does your scheme start out with a preloaded dictionary?

Uh the LZW implementation I played around with does, since that is standard LZW to preload dictionary entries 0-255. This is why in LZW, the dictionary can be throw away and all you transmit is the compressed data. Then the decomposer can start its own dictionary, pre-loaded, and be able to reverse the process.

The LZ77 does not.

"LZ77-based schemes keep track of the last n bytes of data seen, and when a phrase is encountered that has already been seen, they output a pair of values corresponding to the position of the phrase in the previously-seen buffer of data, and the length of the phrase .... The most commonly used algorithms are derived from the LZSS scheme described by James Storer and Thomas Szymanski in 1982. In this the compressor maintains a window of size N bytes and a lookahead buffer the contents of which it tries to find a match for in the window:"

I'm using a sliding window variant of LZ77.

So basically in effect, there really is no dictionary. The compressed data kind of doubles as its own dictionary too, but I'm not sure what is technically the correct way to say that.

So when a match is found the output is <Pos,Len>. When one is not found, the output is the 1st character of the lookahead buffer.

The way I encode this is a trigger character to tell the decomposer that a <Pos,Len> entry follows. I try to automatically find a character not used in the file to prevent confusing the decompressor. Then the position is actually stored in 2 characters/bytes because I use a large window to get better compression. And then the length is 1 character/byte since my look ahead buffer is only 114 characters, which allows the length to be stored in 1 character/byte.

So that is a length of 4 for each codeword, so I also place a minimum match length of 5, but I'm going to play around with that to see if I can find a more optimal number.

How I record those position and lengths to the data stream, is convert them to their character equivalents. So say it needs to place in the code word <94,120,10>, the data stream might output "Â^x
", then slide the window, and look for another match... this is a bit CPU intensive when compressing, but decompression is fairly quick.

Weedpacket

Ah; 'cos with a dictionary you could prime the compressor with codewords for typical things you'd find in JS (keywords for example), rather than wait to learn about them. Of course, the decompressor will need a copy of the dictionary to use as a crib.

DioSOne

Hi dream.scape... i'm a php programmer... and i'm really interested in your js compressor... it can also be used to compress html output serverside i think... do you want a new Tester? 😉

Let me know...

arumpfelt

I would give it a try..
I started to play with prototype + some scriptaculous stuff last days and compressing these libs would be a good add.

Just wondering about efficiency.. is this 300K the point at which you can see that "something happening" in the "backgrond" ?
..still, it depends on client machine

comster

I would love to try out what you have. I could see how this could be very useful to keep sizes small for use on my production servers. I would also like to see the file size numbers that you found in testing.

Post it up.

Jay