gzip vs. zlib

4th Mar 2008 | 20:08

I'm baffled. gzip does its thing by calling the zlib deflate() function, though it wraps the compressed stream differently than deflate() normally does. I have an application where I don't need gzip's extra metadata, so I wrote a small program called deflate. They should perform the same apart from a small constant overhead for gzip, but they don't. As far as I can tell the only differences are minor details of buffer handling.
$ jot 40000 | gzip -9c | wc -c
$ jot 40000 | ./deflate | wc -c
$ jot 20000 | sed 's/.*/print "hello, world &"/' | ./lua-5.1.2/src/luac -s -o - - | gzip -9c | wc -c
$ jot 20000 | sed 's/.*/print "hello, world &"/' | ./lua-5.1.2/src/luac -s -o - - | ./deflate | wc -c
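The "small constant overhead" expectation can be checked with zlib alone. The following is a minimal sketch in Python rather than the deflate program from the post: it compresses the same input three times with the same level and window size, changing only the framing through zlib's wbits parameter. The helper names jot and deflate_size are made up for illustration, and jot here only imitates the output of jot(1).

```python
import zlib


def jot(n):
    """Imitate `jot n`: the integers 1..n, one per line (hypothetical helper)."""
    return "".join(f"{i}\n" for i in range(1, n + 1)).encode("ascii")


def deflate_size(data, wbits, level=9):
    """Return the compressed size of data for a given framing.

    wbits selects the wrapper zlib puts around the deflate stream:
      -15  raw deflate, no header or trailer
       15  zlib framing (2-byte header, 4-byte Adler-32 trailer)
       31  gzip framing (10-byte header, 8-byte CRC-32 and length trailer)
    """
    c = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return len(c.compress(data) + c.flush())


if __name__ == "__main__":
    data = jot(40000)
    raw = deflate_size(data, -15)
    print("raw deflate bytes:", raw)
    print("zlib framing adds:", deflate_size(data, 15) - raw)
    print("gzip framing adds:", deflate_size(data, 31) - raw)
```

When one deflate implementation sits under all three framings, the sizes differ only by the fixed header and trailer bytes. A larger or input-dependent gap, as in the post, suggests the two programs are not running the same compressor.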


Comments {4}


from: ewx
date: 4th Mar 2008 21:30 (UTC)

etch's gzip at least (1.3.5) has its own deflate.c - it doesn't use zlib. So I don't think it should necessarily be a surprise that the output differs?


Tony Finch

from: fanf
date: 4th Mar 2008 22:10 (UTC)

I've had another look and I now see that FreeBSD's gzip has been rewritten since RELENG_6 branched, so the zlibified -CURRENT source I was looking at didn't correspond to the GPL-encumbered code I was running. If I do a test on FreeBSD-7 I get what I expected.
$ jot 40000 | gzip -9c | wc -c
$ jot 40000 | ./deflate | wc -c



Lempel-Ziv and jot

from: alsuren
date: 5th Mar 2008 00:27 (UTC)

For future reference: I don't think the output of jot 40000 is the best way to test the performance of an LZ-based compression library. In this special case, the repeated symbols are limited to about 5 chars in length, so it's impossible to build up a decent symbol table. It's interesting to see that (in this special case) the algorithms with smaller symbol tables can actually perform better.

for i in 1 2 3 4 5 6 7 8 9; do
  echo -n "${i} ";
  jot 40000 | gzip -${i}c | wc -c;
done
1 84206
2 69436
3 69850
4 87738
5 87578
6 87733
7 87733
8 87733
9 87733

Maybe cat /usr/share/dasher/training_english_GB.txt would be more representative.
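The same per-level comparison can be run against zlib directly. This is a rough sketch, assuming Python's zlib module as a stand-in for a zlib-based deflate tool; jot and sizes_by_level are illustrative names only. The sizes it prints include zlib's 6 bytes of framing and need not match the gzip figures above, since that gzip may carry its own deflate code.

```python
import zlib


def jot(n):
    """Imitate `jot n`: the integers 1..n, one per line (hypothetical helper)."""
    return "".join(f"{i}\n" for i in range(1, n + 1)).encode("ascii")


def sizes_by_level(data):
    """Map each compression level 1..9 to the zlib-compressed size of data."""
    return {level: len(zlib.compress(data, level)) for level in range(1, 10)}


if __name__ == "__main__":
    # Print "level size" pairs, mirroring the shell loop's output format.
    for level, size in sizes_by_level(jot(40000)).items():
        print(level, size)
```

Swapping jot(40000) for the bytes of an ordinary text file would show whether the ordering of levels changes on more typical input.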


Tony Finch

Re: Lempel-Ziv and jot

from: fanf
date: 5th Mar 2008 11:06 (UTC)

You're right of course. I was using jot because it's a convenient way of producing output of different sizes for testing the buffering code: for example, I had a truncation bug that was triggered when deflate stopped asking for input for a couple of cycles round the loop. It also suffices to show that gzip and zlib were not using the same underlying algorithm inside their framing.
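The kind of size-sweep test described here might look roughly like the sketch below. It is not the original C code: Python's compressobj hides the avail_in and avail_out bookkeeping where that truncation bug lived, so this only illustrates the testing pattern of feeding jot-like input of several sizes through a chunked loop and checking that nothing is lost. All function names and the chosen sizes are assumptions.

```python
import zlib


def jot(n):
    """Imitate `jot n`: the integers 1..n, one per line (hypothetical helper)."""
    return "".join(f"{i}\n" for i in range(1, n + 1)).encode("ascii")


def deflate_chunked(data, chunk_size, level=9):
    """Compress data by feeding it to zlib in fixed-size chunks."""
    c = zlib.compressobj(level)
    out = []
    for start in range(0, len(data), chunk_size):
        # compress() may return nothing while zlib buffers input internally.
        out.append(c.compress(data[start:start + chunk_size]))
    # flush() emits whatever is still buffered and terminates the stream.
    out.append(c.flush())
    return b"".join(out)


def check_round_trips(sizes=(1, 100, 4095, 4096, 40000),
                      chunk_sizes=(7, 512, 4096, 65536)):
    """Check that every input size and chunk size decompresses to the input."""
    for n in sizes:
        data = jot(n)
        for chunk in chunk_sizes:
            assert zlib.decompress(deflate_chunked(data, chunk)) == data, (n, chunk)
    return True


if __name__ == "__main__":
    print(check_round_trips())
```

A truncation bug of the sort mentioned would show up here as a decompression error or a mismatch for some particular combination of input size and chunk size.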
