Fast zlib compression

Description


This is an implementation of the fast deflate algorithm for zlib (to be exact, here is a new implementation of the longest_match function). In my tests it works about 2.5-10x times faster than original method with neraly the same compression ratio (at keast it never compress worse than original zlib).

The library exists as C source code. There's also Assembly version for 32-bit Intel architecture, however C version produces exactly the same output, and works with nearly the same performance as Asm version. C version compiled for 64-bit Intel platform slightly outperforms 32-bit version, so there's no reason to write Assembly version for it.


Source code and binaries


You may find everything on GitHub: https://github.com/gildor2/fast_zlib. Build instructions are available in README.md. Precompiled binaries are in "Releases" section on GitHub.

 

Algorithm


As I mentioned before, only the longest_match function was changed. This function takes the most of compression time, so it defines deflate's performence the most. This function is trying to find the string in history buffer, which matches with currently processed string the most.

Original implementaition was very straightforward. It takes first 3 bytes of the string, computes their hash, and looks over history using hash table to find a match. If it will find a longer string, it remembers it, and continues search using the same hash for first 3 bytes.

The penalty of this approach is when code looks for a string which first 3 bytes (or bytes with the same hash value) are used very often in a file. Optimized version does an attempt to look over history with smaller number of checks. The idea is very simple: when algorithm finds a match (intermediate one), it computes hashes for all matching bytes, and selects a different hash chain - that one which points farther from current position. This is obvious - if we have 2 hash chains, one points at distance 10, and another one at distance 1000, it is not worth checking distance 10 because other bytes will not match - the closest possible match will occur at position 1000 or farther.

This was a brief description of the used algorithm, there's nothing to say more. For better details, please refer to C code - I tried to make it commented as best as possible.

There also were some question-answer discussion regarding how algorithm works in details, you may find this discussion here.

 

Tests


Please note: these tests are obsolete and kept here for historical reasons! You may see new test results here.

So, obsolete test results. Firstly, these tests were made with very old version of the code. The performance was nearly the same like now, but I used different test data sets, and also used pkzip for speed reference.

For testing I have prepared 3 files about 50Mb each. The first file consists of various C/C++ sources (51,923,629 bytes), the second file is Windows executables (50,115,981 bytes of the exe/dll files from the Windows system directory), and 56,481,852 bytes ps (PostScript) file. I have tested compression with the following tools:

  • original zlib, compiled with C and asm versions (contrib/masmx86) of the longest_match function
  • my version of the longest_match function (x86 assembly) 
  • pkzip 4.0

The machine for testing is AMD Athlon 64 X2 4200+.

Zlib is tested using test program from the directory contrib/testzlib. Tested with compression levels 9 (maximal) and 8.

 

What is tested C/C++ source codeWin32 executablesPostScript file

size, bytes

time, ssize, bytestime, ssize, bytestime, s
zlib C version, -9 11,101,198 49.2 21,277,563 27.2 6,116,386 64.5
zlib ASM version -9 11,101,198 17.6 21,277,563 12.3 6,116,386 20.9
zlib MY ASM version -9 11,079,702 7.9 (123%) 21,256,264 8.6 (43%) 6,089,374 11.9 (76%)
zlib C version -8 11,125,415 22.9 21,285,601 18.5 6,162,581 35.9
zlib ASM version -8 11,125,415 10.3 21,285,601 9.9 6,162,581 13.7
zlib MY ASM version -8 11,083,776 7.2 (43%) 21,257,169 8.1 (22%) 6,108,081 10.2 (34%)
pkzip -9 11,116,529 4.5 21,218,554 4.9 6,085,662 9.5

As you can see, my implementation is 20-120% faster than its asm analogue from the zlib distribution, and provides slightly better compression ratio than original zlib. Also notice that -8 compression level with this algorithm is still better than -9 compression level for original algorithm.

 

The story


The idea of this improvement is not mine. I found the algorithm used here in very old MS-DOS archiver "AIN" (it's very hard to find information about it now). The AIN archiver was the best before Win32 era - its compression ratio and speed were much better than PKZIP 2.04 available at that time. Later I found that PKZIP's deflate was used zlib's tricks - actually, PKZIP 2.04 performance and compression ratio were very close to zlib.

I did that research in middle of 90's, during a student time. A friend of mine provided me with a decrypted AIN executable (it was packed with AINEXE), and I started my research. At that time I didn't have a PC, but only a 8-bit ZX Spectrum computer. I've made a 8086 disassembler for ZX Spectrum specially for this research. I wrote everything to paper for easier understanding. 

I have performed search and did not found any patents about match lookup for the LZ77 algorithm, so I suppose it is absolutely legal to use it in other applications. Later I've checked new (2.50) PKZIP's internals - it looks like it's deflate algorithm was also based on reverse engineered AIN code - it uses the same principles and very similar programming tricks everywhere. The difference between PKZIP 4.04 and 2.50 is exactly the same as difference between PKZIP 2.04 and AIN! Everything looked very similar to what I saw in AIN. Interesting thing: better compression ratio (in comparison to zlib) in PKZIP 2.50 was achieved with using adaptive TOO_FAR threshold depending on the file type.

Nearly in 2004 I had a lot of free time, and decided to try this algorithm in zlib. I wasn't a "pro" in C, Assembly coding was much easier for me. I've implemented a new "longest_match" function for zlib using 80386 assembler. The implementation is my own (original version were written on the 8086 assembly). I used the NASM assembler because of its has capabilities to create OBJ files compatible with any x86 compiler (or at least for most of them). I have successfully compiled and tested my Assembly version of the library with WatcomC and Visual C 6.

Later I've ported this code to C, however it generated slightly different compressed files compared to Assembly version. Also some people reported me that in their tests C version produced wrong output.
 
I've got a several requests to make a 64-bit port of my library, including ones from people who are quite famous in zlib community. Some time ago (in 2017) I've dedicated all of my free time to this project. I've fixed several issues in Assembly code, and made C code to match with Asm algorithm exactly. A lot of debugging, lots of experiments and tests. Now I see that algorithm is very (I hope "absolutely") stable. Modern C compiler produces very good code, so Assembly optimization is no longer needed (however, Assembly optimization of original algorithm is noticeably faster that C code!).
 
Current state of the project, test application and test results are available on github - see "Source code" section above.
 

Discussion


 Everything is here