Re: Optimizing pglz compressor - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Optimizing pglz compressor |
Date | |
Msg-id | 5135F3FB.7040706@vmware.com Whole thread Raw |
In response to | Re: Optimizing pglz compressor (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: Optimizing pglz compressor
Re: Optimizing pglz compressor |
List | pgsql-hackers |
I spent some more time on this, and came up with the attached patch. It includes the changes I posted earlier, to use indexes instead of pointers in the hash table. In addition, it makes the hash table size variable, depending on the length of the input. This further reduces the startup cost on small inputs. I changed the hash method slightly, because the old method would not use any bits from the 3rd byte with a small hash table size, but fortunately that didn't seem to negative impact with larger hash table sizes either. I wrote a little C extension to test this. It contains a function, which runs pglz_compress() on a bytea input, N times. I ran that with different kinds of inputs, and got the following results: unpatched: testname | auto -------------------+----------- 5k text | 1425.750 512b text | 1287.413 2k random | -1053.160 100k random | -1056.379 512b random | -1018.474 64b of text | -2547.106 64b random | -3731.700 100k of same byte | 1093.885 100k text | 849.430 (9 rows) pglz-replace-pointers-with-indexes.patch (the patch I posted earlier): testname | auto -------------------+----------- 5k text | 1251.576 512b text | 946.348 2k random | -815.095 100k random | -808.356 512b random | -614.418 64b of text | -744.382 64b random | -1060.758 100k of same byte | 1192.397 100k text | 796.530 (9 rows) pglz-variable-size-hash-table.patch: testname | auto -------------------+---------- 5k text | 1276.905 512b text | 823.796 2k random | -844.484 100k random | -848.080 512b random | -615.039 64b of text | -393.448 64b random | -568.685 100k of same byte | 1186.759 100k text | 799.978 (9 rows) These values are from a single run of the test, but I did repeat them several times to make sure there isn't too much variability in them to render the results meaningless. The negative values mean that pglz_compress gave up on the compression in the test, ie. it shows how long it took for pglz_compress to realize that it can't compress the input. Compare the abs() values. With the variable-size hash table, I'm not sure how significant the earlier patch to use indexes instead of pointers is. But I don't think it makes things any worse, so it's included in this. On 01.03.2013 17:42, Heikki Linnakangas wrote: > On 01.03.2013 17:37, Alvaro Herrera wrote: >> My take on this is that if this patch is necessary to get Amit's patch >> to a commitable state, it's fair game. > > I don't think it's necessary for that, but let's see.. With these tweaks, I was able to make pglz-based delta encoding perform roughly as well as Amit's patch. So I think that's the approach we should take, as it's a bit simpler and more versatile. I'll follow up on that in the other thread. - Heikki
Attachment
pgsql-hackers by date: