Gildor's Forums

Author Topic: Fast ZLib compression  (Read 21842 times)
Gildor
Fast ZLib compression
« on: March 12, 2011, 16:23 »

I have published my very old project (made in 2004): a fast implementation of the match-searching algorithm for zlib. More details are here:
http://www.gildor.org/en/projects/zlib
roytam1
Re: Fast ZLib compression
« Reply #1 on: March 16, 2012, 10:58 »

I posted a reply on encode.ru:
http://encode.ru/threads/1506-Fast-Zlib-compression?p=28720&viewfull=1#post28720
Gildor
Re: Fast ZLib compression
« Reply #2 on: March 16, 2012, 11:17 »

Thank you for the reply; I've posted an answer on encode.ru (I forgot to subscribe to the thread updates, so I'll reply faster now).
oxyfat
Re: Fast ZLib compression
« Reply #3 on: March 16, 2012, 22:09 »

After viewing the discussion, I think this tool will be useful for comparing files.
It is quite old and may be flagged as a virus, but it is not!
Gildor
Re: Fast ZLib compression
« Reply #4 on: March 17, 2012, 01:30 »

Thanks, I have my own tool written 10 years ago :) I used MS's tool because I thought it was the analogue; it works like an analogue of MS's "fc.exe".
oxyfat
Re: Fast ZLib compression
« Reply #5 on: March 17, 2012, 01:49 »

Quote
I have my own tool written 10 years ago
One could have guessed. ;)
Gildor
Re: Fast ZLib compression
« Reply #6 on: February 20, 2017, 19:28 »

This library got a major update.
https://github.com/gildor2/fast_zlib

The code was polished and all known problems were fixed. The C version is now fully functional: it generates output identical to the Asm version and is only a little bit slower.
32-bit and 64-bit platforms are supported (64-bit via the C version), and a new test application was added. Test results are here: https://github.com/gildor2/fast_zlib/wiki
For those who want to use this library but are too lazy to build it, a precompiled version is here: https://github.com/gildor2/fast_zlib/releases
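For anyone who wants to try it from code: since the library only replaces zlib's internal match search, ordinary zlib calls are all that is needed once the fast build is linked in place of (or patched into) the standard zlib. The sketch below is plain, generic zlib usage; nothing in it is fast_zlib-specific API.

Code:
/* Generic zlib usage sketch - nothing here is fast_zlib-specific.
 * Link against the fast_zlib build instead of the stock zlib and the
 * faster match search is used transparently. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *text = "example example example example example";
    uLong  src_len = (uLong)strlen(text);
    Bytef  dst[256];
    uLongf dst_len = sizeof(dst);

    if (compress2(dst, &dst_len, (const Bytef *)text, src_len,
                  Z_BEST_COMPRESSION) != Z_OK) {
        fprintf(stderr, "compression failed\n");
        return 1;
    }
    printf("%lu bytes -> %lu bytes\n",
           (unsigned long)src_len, (unsigned long)dst_len);
    return 0;
}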
HalfK
Re: Fast ZLib compression
« Reply #7 on: November 15, 2017, 05:17 »

Hi Gildor,

:D I asked these questions on the GitHub fast_zlib repo https://github.com/gildor2/fast_zlib/issues/4, but I saw you mention in another issue that you prefer to discuss things here on this forum, so I copied my questions here.


I have tested your fast longest_match with the data set I need to compress, and it does outperform the original zlib in terms of speed. I want to understand how it works, so I read your brief description of the algorithm and checked the source code. I still have a few questions about your algorithm; could you help me understand it better?

Description of the algorithm: When the algorithm finds a match (an intermediate one), it computes hashes for all matching bytes and selects a different hash chain - the one which points farther from the current position. ...

Q1: Do you mean that when it finds a match, the algorithm computes hashes for every 3 bytes in the matching string? Then I think that when these hash values are inserted into the hash table head[] (or into prev[] if the corresponding head[] entry already exists), they will fall into different positions. Is my understanding correct?

Q2: Regarding "select a different hash chain - the one which points farther from the current position. If we have 2 hash chains, one pointing at distance 10 and another at distance 1000, it is not worth checking distance 10 because other bytes will not match ...":
* What do you mean when you refer to a "hash chain"? Do you mean the elements in prev[], e.g. that checking prev[1000] is better than checking prev[100]?
* When you say "it is not worth checking distance 10 because other bytes will not match", my understanding is the following case:

Code:
current position 1100, string: abcdabcd......
position 1000, string: abcdabcd......
position 100, string: abcdabcd.......

Here we can check the string starting at position 100, because the possible match length is 1000+. If we check position 1000, the possible match length is 100+.
Is this what you mean?

Q3: Regarding this piece of code:

Code:
cur_match -= offset;
offset = 0;
next_pos = cur_match;
for (i = 0; i <= len - MIN_MATCH; i++) {
    pos = prev[(cur_match + i) & wmask];
    if (pos < next_pos) {
        /* this hash chain is more distant, use it */
        next_pos = pos;
        offset = i;
    }
}
/* Switch cur_match to next_pos chain */
cur_match = next_pos;

I think this piece of code implements the magic of "jumping to the hash chain which points to a farther distance". I don't quite understand what it is trying to do, though. It looks to me like it does something like this:

When a matching string "abcdefg" at position 1000 is found, try to find a longer match, such as "abcdefg123456":
- if prev[1000] exists, record prev[1000] as p1 (a farther position with the same match, using the hash value of "abc")
- if prev[1001] ("bcd") exists, record prev[1001] as p2
- if p2 < p1, record next_pos = p2
- do the same for prev[1002] ("cde"), prev[1003] ("def"), prev[1004] ("efg"), and find the *farthest* next_pos

Is this process the one you mentioned as "jumping to the hash chain which points to a farther distance"? What's the fundamental difference between this and the zlib behavior, which is to keep checking p = prev[1000], p1 = prev[p], p2 = prev[p1]?

Q4: UPDATE_HASH is performed for the matched string during lazy evaluation

In your code, when a matching string is found during lazy evaluation, UPDATE_HASH is performed for all positions in the match:

Code:
if (best_len >= MIN_MATCH) {
    ...
    UPDATE_HASH(s, hash, scan[1]);
    UPDATE_HASH(s, hash, scan[2]);
    for (i = 3; i <= best_len; i++) {
        UPDATE_HASH(s, hash, scan[i]);
        ...
    }
}

Do you do this because in the original zlib code's deflate_slow() function, UPDATE_HASH (INSERT_STRING) is not performed for a match found during lazy evaluation, only for a match found during non-lazy evaluation (prev_match_length > curr_match_length)?


My questions might be a bit long, but I really want to understand this algorithm better, and I look forward to your feedback.

Thank you.
Gildor
Re: Fast ZLib compression
« Reply #8 on: November 15, 2017, 21:30 »

Quote
Hi Gildor,

:D I asked these questions on the GitHub fast_zlib repo https://github.com/gildor2/fast_zlib/issues/4, but I saw you mention in another issue that you prefer to discuss things here on this forum, so I copied my questions here.
Hi, welcome to my forum, and thank you for your interest in the algorithm. And thank you for converting the markdown message to bbcode :)
Quote
Description of the algorithm: When the algorithm finds a match (an intermediate one), it computes hashes for all matching bytes and selects a different hash chain - the one which points farther from the current position. ...

Q1: Do you mean that when it finds a match, the algorithm computes hashes for every 3 bytes in the matching string? Then I think that when these hash values are inserted into the hash table head[] (or into prev[] if the corresponding head[] entry already exists), they will fall into different positions. Is my understanding correct?
Yes, you're right. However, if I said "for every 3 bytes", that might be interpreted incorrectly too. I'm using zlib's hash function, which mixes 3 bytes together to compute a hash, and this computation is done for every position in the file (that's what I meant by "for every byte").
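To illustrate, here is a minimal, self-contained sketch of the idea (my own toy code, not the library's or zlib's exact source; the constants are illustrative): a zlib-style hash mixes MIN_MATCH = 3 consecutive bytes, and updating it once per input byte yields one hash value per position, so every 3-byte group gets its own chain.

Code:
/* Minimal sketch of a zlib-style rolling hash, computed for every byte
 * position. Constants are illustrative, not taken from the library. */
#include <stdio.h>

#define MIN_MATCH   3
#define HASH_BITS   15
#define HASH_MASK   ((1u << HASH_BITS) - 1)
/* Shift chosen so a byte influences exactly MIN_MATCH consecutive hashes. */
#define HASH_SHIFT  ((HASH_BITS + MIN_MATCH - 1) / MIN_MATCH)
#define UPDATE_HASH(h, c) ((h) = (((h) << HASH_SHIFT) ^ (c)) & HASH_MASK)

int main(void)
{
    const unsigned char data[] = "abcdabcd";
    unsigned h = 0, pos, len = sizeof(data) - 1;

    /* Prime the hash with the first MIN_MATCH-1 bytes. */
    UPDATE_HASH(h, data[0]);
    UPDATE_HASH(h, data[1]);

    /* One more update per byte position -> hash of data[pos..pos+2]. */
    for (pos = 0; pos + MIN_MATCH <= len; pos++) {
        UPDATE_HASH(h, data[pos + MIN_MATCH - 1]);
        printf("position %u (\"%.3s\") -> hash 0x%04x\n",
               pos, (const char *)data + pos, h);
    }
    /* Note: "abc" at positions 0 and 4 prints the same hash value. */
    return 0;
}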

Quote
Q2: Regarding "select a different hash chain - the one which points farther from the current position. If we have 2 hash chains, one pointing at distance 10 and another at distance 1000, it is not worth checking distance 10 because other bytes will not match ...":
* What do you mean when you refer to a "hash chain"? Do you mean the elements in prev[], e.g. that checking prev[1000] is better than checking prev[100]?
"Hash chain" is what I call the hash table organization. There's the hash head, the beginning of the chain, and the "prev" array, which points to the previous occurrence of the same hash for every file position. Together they construct the "chains" :)
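A minimal, self-contained illustration of that organization (toy code, not the library's): head[h] holds the most recent position whose 3-byte hash is h, prev[pos] holds the previous position with the same hash, and following prev[] walks the chain back through the window. NIL, WSIZE and insert_string() are made-up names for this sketch.

Code:
/* Toy illustration of hash chains built from head[] and prev[]. */
#include <stdio.h>

#define WSIZE  32768u
#define WMASK  (WSIZE - 1)
#define HSIZE  32768u
#define NIL    0xFFFFu              /* "no previous occurrence" marker */

static unsigned short head[HSIZE];  /* hash value -> most recent position */
static unsigned short prev[WSIZE];  /* position -> previous position with same hash */

static void insert_string(unsigned hash, unsigned pos)
{
    prev[pos & WMASK] = head[hash]; /* old chain head becomes our predecessor */
    head[hash] = (unsigned short)pos;
}

int main(void)
{
    unsigned i, p;
    for (i = 0; i < HSIZE; i++) head[i] = NIL;

    /* Pretend hash("abc") == 123 and "abc" occurred at these positions. */
    insert_string(123, 100);
    insert_string(123, 1000);
    insert_string(123, 1100);

    /* Walking the chain visits 1100, 1000, 100 - newest to oldest. */
    for (p = head[123]; p != NIL; p = prev[p & WMASK])
        printf("candidate position: %u\n", p);
    return 0;
}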
Quote
* When you say "it is not worth checking distance 10 because other bytes will not match", my understanding is the following case:

Code:
current position 1100, string: abcdabcd......
position 1000, string: abcdabcd......
position 100, string: abcdabcd.......

Here we can check the string starting at position 100, because the possible match length is 1000+. If we check position 1000, the possible match length is 100+.
Is this what you mean?
The distances will be the same for each position in your example. Consider "abc", "bcd", "cda", etc. at positions 1100, 1101, 1102: all hashes will point 100 bytes back (distance = 100), at the corresponding "abc", "bcd", etc. at positions 1000, 1001, 1002, and so on.

Now, a quote from my code, about "not worth checking other bytes": if we've found an exact match of the string using the hash table, then each hash chain for the string's bytes will definitely point at a location inside the found string, or somewhere closer (with a smaller distance). If we have some "candidate" for a match, and for example the hash at the first byte of the string points at distance 10 (don't mix up "distance" and "position" - distance 10 means 10 bytes back!), while the 2nd hash points at distance 1000, it is not possible that another string 10 bytes back contains 4 bytes matching this string - in the best case it has 3 matching bytes, for one hash value, and its 2nd hash will be different. If those bytes did match, the string's second hash would be included in the 2nd hash chain, and we wouldn't get distance 1000 immediately without passing through distance 10 first.
Gildor
Re: Fast ZLib compression
« Reply #9 on: November 15, 2017, 21:30 »

Quote
Q3: Regarding this piece of code:

Code:
cur_match -= offset;
offset = 0;
next_pos = cur_match;
for (i = 0; i <= len - MIN_MATCH; i++) {
    pos = prev[(cur_match + i) & wmask];
    if (pos < next_pos) {
        /* this hash chain is more distant, use it */
        next_pos = pos;
        offset = i;
    }
}
/* Switch cur_match to next_pos chain */
cur_match = next_pos;

I think this piece of code implements the magic of "jumping to the hash chain which points to a farther distance". I don't quite understand what it is trying to do, though. It looks to me like it does something like this:

When a matching string "abcdefg" at position 1000 is found, try to find a longer match, such as "abcdefg123456":
- if prev[1000] exists, record prev[1000] as p1 (a farther position with the same match, using the hash value of "abc")
- if prev[1001] ("bcd") exists, record prev[1001] as p2
- if p2 < p1, record next_pos = p2
- do the same for prev[1002] ("cde"), prev[1003] ("def"), prev[1004] ("efg"), and find the *farthest* next_pos

Is this process the one you mentioned as "jumping to the hash chain which points to a farther distance"? What's the fundamental difference between this and the zlib behavior, which is to keep checking p = prev[1000], p1 = prev[p], p2 = prev[p1]?
Yes, this code does the "magic", and the code around it makes it work - for example, there are "offsets" everywhere.
The difference from zlib: zlib always works with offset 0. If you force offset 0 in the code, you'll get exactly the same compressed stream as zlib, with the same performance.
Just suppose the algorithm is looking for "abcdefg123456", and "abc" (or something else with the same hash) appears a lot in the file, but "def" doesn't appear anywhere else, or it is very far away and appears, say, once. Zlib will iterate over all occurrences of "abc" trying to fully match the string. My algorithm will quickly detect that "def" is very far away and will jump there, skipping lots of "abc" checks.
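As a concrete toy example of that decision (illustrative numbers only, not the library's code): among the chains of all 3-byte groups inside the current match, the search continues from whichever previous occurrence is farthest back, because a longer match would have to appear on that chain as well.

Code:
/* Toy example of "jump to the most distant chain". The positions in
 * chain_prev[] are made-up data for a match "abcdef" at position 1100. */
#include <stdio.h>

#define MIN_MATCH 3

int main(void)
{
    unsigned len = 6;                                 /* current match length */
    /* Previous occurrence reported by prev[] for the chains of
     * "abc", "bcd", "cde", "def" inside the match (hypothetical values): */
    unsigned chain_prev[] = { 1090, 1060, 1060, 100 };
    unsigned i, offset = 0, next_pos = chain_prev[0];

    for (i = 1; i <= len - MIN_MATCH; i++) {
        if (chain_prev[i] < next_pos) {               /* smaller = farther back */
            next_pos = chain_prev[i];
            offset   = i;
        }
    }
    /* The "def" chain (offset 3) is the most distant one here, so the search
     * resumes at position next_pos, remembering that the whole candidate
     * string starts offset bytes earlier - instead of walking every nearby
     * "abc" occurrence. */
    printf("switch to chain at offset %u, position %u\n", offset, next_pos);
    return 0;
}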

Quote
Q4: UPDATE_HASH is performed for the matched string during lazy evaluation

In your code, when a matching string is found during lazy evaluation, UPDATE_HASH is performed for all positions in the match:

Code:
if (best_len >= MIN_MATCH) {
    ...
    UPDATE_HASH(s, hash, scan[1]);
    UPDATE_HASH(s, hash, scan[2]);
    for (i = 3; i <= best_len; i++) {
        UPDATE_HASH(s, hash, scan[i]);
        ...
    }
}

Do you do this because in the original zlib code's deflate_slow() function, UPDATE_HASH (INSERT_STRING) is not performed for a match found during lazy evaluation, only for a match found during non-lazy evaluation (prev_match_length > curr_match_length)?
Zlib inserts only the hash at the first string position into the hash table, i.e. "bcd" (the string's 2nd 3-byte part) is not inserted. This code fixes that, but does so only when needed - when the string is expected to grow.
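Tying the two toy sketches above together (again illustrative code, not the library's exact source): after a match of length best_len is accepted, the rolling hash is advanced over every byte of the match and each covered position is linked into its chain, so that "bcd", "cde", ... get chain entries too.

Code:
/* Toy sketch: register the hash of every byte position inside a match,
 * so the chains of "bcd", "cde", ... exist as well (illustrative only). */
#include <stdio.h>
#include <string.h>

#define MIN_MATCH   3
#define HASH_BITS   15
#define HASH_MASK   ((1u << HASH_BITS) - 1)
#define HASH_SHIFT  ((HASH_BITS + MIN_MATCH - 1) / MIN_MATCH)
#define UPDATE_HASH(h, c) ((h) = (((h) << HASH_SHIFT) ^ (c)) & HASH_MASK)
#define WSIZE 32768u

static unsigned short head[1u << HASH_BITS];
static unsigned short prev[WSIZE];

int main(void)
{
    const unsigned char *scan = (const unsigned char *)"abcdefg123456";
    unsigned strstart = 1000, best_len = 7, hash = 0, i;

    memset(head, 0xFF, sizeof(head));   /* all chains empty */

    /* Advance the rolling hash over the matched bytes and link every
     * covered position into its chain, not just the first one. */
    UPDATE_HASH(hash, scan[0]);
    UPDATE_HASH(hash, scan[1]);
    for (i = 2; i < best_len; i++) {
        unsigned pos = strstart + i - (MIN_MATCH - 1); /* start of 3-byte group */
        UPDATE_HASH(hash, scan[i]);
        prev[pos & (WSIZE - 1)] = head[hash];
        head[hash] = (unsigned short)pos;
        printf("inserted position %u (\"%.3s\")\n",
               pos, (const char *)scan + i - 2);
    }
    return 0;
}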
HalfK
Re: Fast ZLib compression
« Reply #10 on: November 16, 2017, 07:58 »

Thanks for the quick reply, Gildor :D Your explanations are very helpful to me.
I'm continuing to read your code now.
