Thread: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;

RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;

From
"Dawid Kuroczko"
Date:
Hello!

Currently the TOASTing code does its magic when whole tuple is
larger than TOAST_TUPLE_TARGET which happens to be around 2KB.

There are times though when one is willing to trade using (fast) CPU to
reduce amount of (slow) I/O.  A data warehousing types of workload
most notably.  Rarely used large columns which are likely to compress
well but are not large enough to trigger inline compression.

As we already have four types of ALTER COLUMN .. SET STORAGE
{ PLAIN | EXTERNAL | EXTENDED | MAIN } I would like to add
"COMPRESSED" which would force column compression (if column is
smaller than some minimun, I guess somwehwere between 16 and 32 bytes).

First of all, would such a feature be desirable? [1]


...as for implementation idea, so far I see it more or less like this:
* src/backend/access/common/heaptuple.c:  for tuples with COMPRESSED attributes, we set the infomask bit
HEAP_HASEXTERNAL,so that tuple will trigger TOAST regardless  of size.
 
* src/backend/access/heap/tuptoaster.c: - add a bool "need_compress = false;" around line 425. - while scanning the
attributes(lines 472-575), mark the ones which should be COMPRESSED - if (need_compress), compress every marked column.
-perhaps refactor inline compression code (639-659) as a static  funcion shared with need_compress part above.
 

Does this sound reasonable?

PS: as a side note: I wonder if perhaps we could try compression erarlier,
at 1KB or event at 0.5KB, but leave TOASTing at 2KB limit)?

[1]: Actually some time ago I did write a system which stores tons of
real[0:59] (an hour's worth of every minute readings) data.  Such column
takes approximately 246 bytes.  For fun and experiment I did transform
the data into real[0:23][0:59] storing whole day's data.  To my surprise
such column stores between 64 (!) and 5968.  Also 66% of values were
taking less than 254 bytes (and 55% < 128 bytes)...  And as the data
is much larger than RAM and read randomly, having it shrunk by more
than 25% is tempting.  Hence the idea of SET STORAGE COMPRESSED.
I know such schema is flawed by design, but I guess there are other types
of data which would also see benefit from such an option.
-- 
Solving [site load issues] with [more database replication] is a lot
like solving your own personal problems with heroin - at first it
sorta works, but after a while things just get out of hand.


Re: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;

From
Zdenek Kotala
Date:
Dawid Kuroczko napsal(a):
> Hello!
> 
> Currently the TOASTing code does its magic when whole tuple is
> larger than TOAST_TUPLE_TARGET which happens to be around 2KB.
> 
> There are times though when one is willing to trade using (fast) CPU to
> reduce amount of (slow) I/O.  A data warehousing types of workload
> most notably.  Rarely used large columns which are likely to compress
> well but are not large enough to trigger inline compression.
> 
> As we already have four types of ALTER COLUMN .. SET STORAGE
> { PLAIN | EXTERNAL | EXTENDED | MAIN } I would like to add
> "COMPRESSED" which would force column compression (if column is
> smaller than some minimun, I guess somwehwere between 16 and 32 bytes).

I think TOAST_TUPLE_TRESHOLD and TOAST_TUPLE_TARGEST should help you. Look int 
tuptoaster.h
    Zdenek


Re: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;

From
Tom Lane
Date:
"Dawid Kuroczko" <qnex42@gmail.com> writes:
> As we already have four types of ALTER COLUMN .. SET STORAGE
> { PLAIN | EXTERNAL | EXTENDED | MAIN } I would like to add
> "COMPRESSED" which would force column compression (if column is
> smaller than some minimun, I guess somwehwere between 16 and 32 bytes).

Please see previous discussions about per-column toasting parameters,
for instance
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php
http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php

I think the general consensus was that we want more flexible access to
the compression knobs than just another STORAGE setting.
        regards, tom lane


Re: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;

From
Alvaro Herrera
Date:
Dawid Kuroczko escribió:

> Currently the TOASTing code does its magic when whole tuple is
> larger than TOAST_TUPLE_TARGET which happens to be around 2KB.
> 
> There are times though when one is willing to trade using (fast) CPU to
> reduce amount of (slow) I/O.  A data warehousing types of workload
> most notably.  Rarely used large columns which are likely to compress
> well but are not large enough to trigger inline compression.

I proposed in PGCon to have an option to select different compression
algorithms, for example gzip instead of our own PGLZ algorithm.  In
testing I've found that for certain types of data, PGLZ compresses to
50% while gzip -8 compresses to 33%, albeit at a measurable CPU cost.
This is a tradeoff that some people might be interested in doing,
particularly on systems where the system is bottlenecked on I/O and you
have plenty of CPU to spare.

We are also considering bzip2 compression, which has the disadvantage
that the required library is not already linked in the backend, which
means we need ways to set that up too (loading shared libraries, etc).


This doesn't have a lot to do with what you are proposing, but I think
we need to come up with a reasonable user interface for all of this.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;

From
"Dawid Kuroczko"
Date:
On Tue, Jun 10, 2008 at 5:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Dawid Kuroczko" <qnex42@gmail.com> writes:
>> As we already have four types of ALTER COLUMN .. SET STORAGE
>> { PLAIN | EXTERNAL | EXTENDED | MAIN } I would like to add
>> "COMPRESSED" which would force column compression (if column is
>> smaller than some minimun, I guess somwehwere between 16 and 32 bytes).
>
> Please see previous discussions about per-column toasting parameters,
> for instance
> http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php
> http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php
>
> I think the general consensus was that we want more flexible access to
> the compression knobs than just another STORAGE setting.

Sounds like a right way to do it.  Perhaps the syntax should be something like:

ALTER TABLE tab ALTER COLUMN x WITH (storage_parameter = value, ...);

With storage parameters like:  compress -- enable/disable compression (like PLAIN or EXTERNAL)  min_input_size -- don't
compressif smaller than size  min_comp_rate -- leave uncompressed if rate is smaller than  toast -- for out-of-line
storageparameters?  compression_algo -- for specifying alternative algorithms if any
 
(per Alvaro's suggestion).

Perhaps it would be wise to introduce GUCs with default values (as we have now
ALTER COLUMN .. SET STATISTICS and default_statistics_target), named
for example: default_column_min_input_size (and so on).

ALTER COLUMN .. SET STORAGE ... should be aliases for WITH (...) and be
deprecated I guess.

The HEAP_HASEXTERNAL infomask bit should probably be used to "trigger"
TOASTing code.  Perhaps it should be renamed then?  I am worried if storage
parameters wouldn't introduce overhead in PostgreSQL's key parts.

...as for compression_algo, perhaps it could be an oid of compression
function(s)
(we need to decompress too).  Also we would need to store information which
algo was used to compress the column.  Perhaps a byte between varvarlena
herader and actual compressed data (this way we could have multiple algos
simultaneousley).

Speaking of algorithms, I think that e2compr (ext2 filesystem with transparent
compression) could be a nice source of input in this area. http://e2compr.sourceforge.net/
(Having algos as plugins would allow us to use foreign licenses (gzip) or
event patented algos in countries where software patents are prohibited
without risking anything in core PostgreSQL)

OK, enough for today.  Good night.
 Regards,    Dawid
-- 
Solving [site load issues] with [more database replication] is a lot
like solving your own personal problems with heroin - at first it
sorta works, but after a while things just get out of hand.