Re: pluggable compression support - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: pluggable compression support |
Date | |
Msg-id | CA+TgmoZdWLNS-+61woU68GqFFS47y837tkWAm2_jB5S3b594nQ@mail.gmail.com |
In response to | Re: pluggable compression support (Andres Freund <andres@2ndquadrant.com>) |
Responses | Re: pluggable compression support |
List | pgsql-hackers |
On Thu, Jun 20, 2013 at 8:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-15 12:20:28 +0200, Andres Freund wrote:
>> On 2013-06-14 21:56:52 -0400, Robert Haas wrote:
>> > I don't think we need it. I think what we need to decide is which
>> > algorithm is legally OK to use. And then put it in.
>> >
>> > In the past, we've had a great deal of speculation about that legal
>> > question from people who are not lawyers. Maybe it would be valuable
>> > to get some opinions from people who ARE lawyers. Tom and Heikki both
>> > work for real big companies which, I'm guessing, have substantial
>> > legal departments; perhaps they could pursue getting the algorithms of
>> > possible interest vetted. Or, I could try to find out whether it's
>> > possible to do something similar through EnterpriseDB.
>>
>> I personally don't think the legal arguments hold all that much water
>> for snappy and lz4. But then the opinion of a european non-lawyer doesn't
>> hold much either.
>> Both are widely used by a large number of open and closed projects, some
>> of which have patent grant clauses in their licenses. E.g. hadoop,
>> cassandra use lz4, and I'd be surprised if the companies behind those
>> have opened themselves to litigation.
>>
>> I think we should preliminarily decide which algorithm to use before we
>> get lawyers involved. I'd be surprised if they can make such an analysis
>> faster than we can rule out one of them via benchmarks.
>>
>> Will post an updated patch that includes lz4 as well.
>
> Attached.
>
> Changes:
> * add lz4 compression algorithm (2 clause bsd)
> * move compression algorithms into own subdirectory
> * clean up compression/decompression functions
> * allow 258 compression algorithms, uses 1byte extra for any but the
>   first three
> * don't pass a varlena to pg_lzcompress.c anymore, but data directly
> * add pglz_long as a test fourth compression method that uses the +1
>   byte encoding
> * use postgres' endian detection in snappy for compatibility with osx
>
> Based on the benchmarks I think we should go with lz4 only for now. The
> patch provides the infrastructure should somebody else want to add more
> or even proper configurability.
>
> Todo:
> * windows build support
> * remove toast_compression_algo guc
> * remove either snappy or lz4 support
> * remove pglz_long support (just there for testing)
>
> New benchmarks:
>
> Table size:
>                          List of relations
>  Schema |        Name        | Type  | Owner  |  Size  | Description
> --------+--------------------+-------+--------+--------+-------------
>  public | messages_pglz      | table | andres | 526 MB |
>  public | messages_snappy    | table | andres | 523 MB |
>  public | messages_lz4       | table | andres | 522 MB |
>  public | messages_pglz_long | table | andres | 527 MB |
> (4 rows)
>
> Workstation (2xE5520, enough s_b for everything):
>
> Data load:
> pglz:           36643.384 ms
> snappy:         24626.894 ms
> lz4:            23871.421 ms
> pglz_long:      37097.681 ms
>
> COPY messages_* TO '/dev/null' WITH BINARY;
> pglz:           3116.083 ms
> snappy:         2524.388 ms
> lz4:            2349.396 ms
> pglz_long:      3104.134 ms
>
> COPY (SELECT rawtxt FROM messages_*) TO '/dev/null' WITH BINARY;
> pglz:           1609.969 ms
> snappy:         1031.696 ms
> lz4:             886.782 ms
> pglz_long:      1606.803 ms
>
>
> On my elderly laptop (core 2 duo), too low shared buffers:
>
> Data load:
> pglz:           39968.381 ms
> snappy:         26952.330 ms
> lz4:            29225.472 ms
> pglz_long:      39929.568 ms
>
> COPY messages_* TO '/dev/null' WITH BINARY;
> pglz:           3920.588 ms
> snappy:         3421.938 ms
> lz4:            3311.540 ms
> pglz_long:      3885.920 ms
>
> COPY (SELECT rawtxt FROM messages_*) TO '/dev/null' WITH BINARY;
> pglz:           2238.145 ms
> snappy:         1753.403 ms
> lz4:            1638.092 ms
> pglz_long:      2227.804 ms

Well, the performance of both snappy and lz4 seems to be significantly
better than pglz. On these tests lz4 has a small edge but that might
not be true on other data sets.

I still think the main issue is legal review: are there any license or
patent concerns about including either of these algorithms in PG? If
neither of them has issues, we might need to experiment a little more
before picking between them. If one does and the other does not, well,
then it's a short conversation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
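
To put the quoted timings in relative terms, here is a small Python sketch that computes how much faster snappy and lz4 are than pglz on each test. The millisecond figures are copied from the benchmarks quoted in this message (pglz_long is left out, since it only exists to exercise the extra-byte encoding); the dictionary layout and the function name `speedup_vs_pglz` are assumptions made for illustration and are not part of the patch.

```python
# Timings in milliseconds, copied from the benchmarks quoted in this message.
# The nesting (machine -> test -> algorithm) is an illustrative choice.
TIMINGS_MS = {
    "workstation": {
        "data load": {"pglz": 36643.384, "snappy": 24626.894, "lz4": 23871.421},
        "COPY table": {"pglz": 3116.083, "snappy": 2524.388, "lz4": 2349.396},
        "COPY rawtxt": {"pglz": 1609.969, "snappy": 1031.696, "lz4": 886.782},
    },
    "laptop": {
        "data load": {"pglz": 39968.381, "snappy": 26952.330, "lz4": 29225.472},
        "COPY table": {"pglz": 3920.588, "snappy": 3421.938, "lz4": 3311.540},
        "COPY rawtxt": {"pglz": 2238.145, "snappy": 1753.403, "lz4": 1638.092},
    },
}


def speedup_vs_pglz(results):
    """Return {algorithm: pglz_time / algorithm_time} for every non-pglz entry.

    A value above 1.0 means the algorithm finished faster than pglz.
    """
    baseline = results["pglz"]
    return {algo: baseline / ms for algo, ms in results.items() if algo != "pglz"}


if __name__ == "__main__":
    for machine, tests in TIMINGS_MS.items():
        for test, results in tests.items():
            ratios = speedup_vs_pglz(results)
            summary = ", ".join(f"{algo} {r:.2f}x" for algo, r in ratios.items())
            print(f"{machine:12s} {test:12s} {summary}")
```

Note that on the laptop data load snappy comes out ahead of lz4, which fits the caveat that lz4's small edge might not hold on other data sets or hardware.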