Re: pluggable compression support - Mailing list pgsql-hackers

From	Robert Haas
Subject	Re: pluggable compression support
Date	June 25, 2013 16:22:38
Msg-id	CA+TgmoZdWLNS-+61woU68GqFFS47y837tkWAm2_jB5S3b594nQ@mail.gmail.com
In response to	Re: pluggable compression support (Andres Freund <andres@2ndquadrant.com>)
Responses	Re: pluggable compression support
List	pgsql-hackers

On Thu, Jun 20, 2013 at 8:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-15 12:20:28 +0200, Andres Freund wrote:
>> On 2013-06-14 21:56:52 -0400, Robert Haas wrote:
>> > I don't think we need it.  I think what we need to do is decide which
>> > algorithm is legally OK to use.  And then put it in.
>> >
>> > In the past, we've had a great deal of speculation about that legal
>> > question from people who are not lawyers.  Maybe it would be valuable
>> > to get some opinions from people who ARE lawyers.  Tom and Heikki both
>> > work for real big companies which, I'm guessing, have substantial
>> > legal departments; perhaps they could pursue getting the algorithms of
>> > possible interest vetted.  Or, I could try to find out whether it's
>> > possible to do something similar through EnterpriseDB.
>>
>> I personally don't think the legal arguments hold all that much water
>> for snappy and lz4. But then the opinion of a European non-lawyer doesn't
>> hold much either.
>> Both are widely used by a large number of open and closed projects, some of
>> which have patent grant clauses in their licenses. E.g. hadoop,
>> cassandra use lz4, and I'd be surprised if the companies behind those
>> have opened themselves to litigation.
>>
>> I think we should preliminarily decide which algorithm to use before we
>> get lawyers involved. I'd be surprised if they can make such an analysis
>> faster than we can rule out one of them via benchmarks.
>>
>> Will post an updated patch that includes lz4 as well.
>
> Attached.
>
> Changes:
> * add lz4 compression algorithm (2-clause BSD)
> * move compression algorithms into own subdirectory
> * clean up compression/decompression functions
> * allow 258 compression algorithms, using 1 extra byte for any but the
>   first three
> * don't pass a varlena to pg_lzcompress.c anymore, but data directly
> * add pglz_long as a test fourth compression method that uses the +1
>   byte encoding
> * use postgres' endian detection in snappy for compatibility with OS X
>
> Based on the benchmarks I think we should go with lz4 only for now. The
> patch provides the infrastructure should somebody else want to add more
> or even proper configurability.
>
> Todo:
> * windows build support
> * remove toast_compression_algo guc
> * remove either snappy or lz4 support
> * remove pglz_long support (just there for testing)
>
> New benchmarks:
>
> Table size:
>                           List of relations
>  Schema |        Name        | Type  | Owner  |  Size  | Description
> --------+--------------------+-------+--------+--------+-------------
>  public | messages_pglz      | table | andres | 526 MB |
>  public | messages_snappy    | table | andres | 523 MB |
>  public | messages_lz4       | table | andres | 522 MB |
>  public | messages_pglz_long | table | andres | 527 MB |
> (4 rows)
>
> Workstation (2xE5520, enough s_b for everything):
>
> Data load:
> pglz:           36643.384 ms
> snappy:         24626.894 ms
> lz4:            23871.421 ms
> pglz_long:      37097.681 ms
>
> COPY messages_* TO '/dev/null' WITH BINARY;
> pglz:           3116.083 ms
> snappy:         2524.388 ms
> lz4:            2349.396 ms
> pglz_long:      3104.134 ms
>
> COPY (SELECT rawtxt FROM messages_*) TO '/dev/null' WITH BINARY;
> pglz:           1609.969 ms
> snappy:         1031.696 ms
> lz4:             886.782 ms
> pglz_long:      1606.803 ms
>
>
> On my elderly laptop (Core 2 Duo), shared buffers too small to hold everything:
>
> Data load:
> pglz:           39968.381 ms
> snappy:         26952.330 ms
> lz4:            29225.472 ms
> pglz_long:      39929.568 ms
>
> COPY messages_* TO '/dev/null' WITH BINARY;
> pglz:           3920.588 ms
> snappy:         3421.938 ms
> lz4:            3311.540 ms
> pglz_long:      3885.920 ms
>
> COPY (SELECT rawtxt FROM messages_*) TO '/dev/null' WITH BINARY;
> pglz:           2238.145 ms
> snappy:         1753.403 ms
> lz4:            1638.092 ms
> pglz_long:      2227.804 ms

Well, the performance of both snappy and lz4 seems to be significantly
better than pglz.  On these tests lz4 has a small edge but that might
not be true on other data sets.  I still think the main issue is legal
review: are there any license or patent concerns about including
either of these algorithms in PG?  If neither of them has issues, we
might need to experiment a little more before picking between them.
If one does and the other does not, well, then it's a short
conversation.
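
For readers following the change list quoted above, here is a rough sketch of how an "extra byte for any but the first three" scheme might look. The patch itself is not reproduced in this message, so the 2-bit field, the escape value, the offset, and the names encode_algo_id / decode_algo_id are all assumptions made for illustration, not the patch's actual layout. Note that this variant admits 3 + 256 = 259 ids; the quoted figure of 258 suggests the patch may reserve one value, which is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the low 2 bits of a header byte select the algorithm. */
#define ALGO_BITS_MASK   0x03u
#define ALGO_ESCAPE      0x03u  /* assumed marker: real id is in the next byte */
#define ALGO_DIRECT_MAX  3u     /* ids 0..2 fit directly in the 2-bit field */

/*
 * Store an algorithm id in buf and return the number of bytes used:
 * 1 for ids 0..2, 2 for larger ids. The upper 6 bits of buf[0] are
 * preserved so they can carry other header information.
 */
static size_t
encode_algo_id(uint8_t *buf, unsigned id)
{
    assert(id < ALGO_DIRECT_MAX + 256u);

    if (id < ALGO_DIRECT_MAX)
    {
        buf[0] = (uint8_t) ((buf[0] & ~ALGO_BITS_MASK) | id);
        return 1;
    }

    /* Escape: flag the 2-bit field and spill the id into one extra byte. */
    buf[0] = (uint8_t) ((buf[0] & ~ALGO_BITS_MASK) | ALGO_ESCAPE);
    buf[1] = (uint8_t) (id - ALGO_DIRECT_MAX);
    return 2;
}

/* Read an algorithm id back; returns the number of bytes consumed. */
static size_t
decode_algo_id(const uint8_t *buf, unsigned *id)
{
    unsigned field = buf[0] & ALGO_BITS_MASK;

    if (field != ALGO_ESCAPE)
    {
        *id = field;
        return 1;
    }

    *id = ALGO_DIRECT_MAX + buf[1];
    return 2;
}
```

The point of a scheme like this is that the common algorithms cost no extra space per datum, and only the rarely used ids pay the additional byte.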

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
