Thread: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED
Hello!

Currently the TOASTing code does its magic when the whole tuple is
larger than TOAST_TUPLE_TARGET, which happens to be around 2KB.

There are times, though, when one is willing to trade (fast) CPU to
reduce the amount of (slow) I/O. Data-warehousing types of workloads,
most notably: rarely used large columns which are likely to compress
well but are not large enough to trigger inline compression.

As we already have four types of ALTER COLUMN .. SET STORAGE
{ PLAIN | EXTERNAL | EXTENDED | MAIN }, I would like to add
"COMPRESSED", which would force column compression (unless the column
is smaller than some minimum, I guess somewhere between 16 and 32 bytes).

First of all, would such a feature be desirable? [1]

...as for the implementation idea, so far I see it more or less like this:

* src/backend/access/common/heaptuple.c: for tuples with COMPRESSED
  attributes, set the infomask bit HEAP_HASEXTERNAL, so that the tuple
  will trigger TOAST regardless of size.

* src/backend/access/heap/tuptoaster.c:
  - add a bool "need_compress = false;" around line 425.
  - while scanning the attributes (lines 472-575), mark the ones which
    should be COMPRESSED.
  - if (need_compress), compress every marked column.
  - perhaps refactor the inline compression code (lines 639-659) as a
    static function shared with the need_compress part above.

Does this sound reasonable?

PS: as a side note, I wonder if perhaps we could try compression
earlier, at 1KB or even at 0.5KB, but leave out-of-line TOASTing at
the 2KB limit?

[1]: Actually, some time ago I wrote a system which stores tons of
real[0:59] (an hour's worth of every-minute readings) data. Such a
column takes approximately 246 bytes. For fun and experiment I
transformed the data into real[0:23][0:59], storing a whole day's data.
To my surprise such a column takes between 64 (!) and 5968 bytes. Also,
66% of the values took less than 254 bytes (and 55% < 128 bytes)...
And as the data is much larger than RAM and is read randomly, having it
shrunk by more than 25% is tempting.

Hence the idea of SET STORAGE COMPRESSED. I know such a schema is
flawed by design, but I guess there are other types of data which would
also benefit from such an option.

-- 
Solving [site load issues] with [more database replication] is a lot
like solving your own personal problems with heroin - at first it
sorta works, but after a while things just get out of hand.
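[The per-attribute decision sketched above can be modelled in a few lines. This is a toy illustration in Python, not the actual tuptoaster.c logic; the MIN_COMPRESS_INPUT constant and the "did it actually shrink" check are my assumptions about how the 16-32 byte minimum would behave:]

```python
import zlib

# Illustrative minimum, standing in for the proposed 16-32 byte floor.
MIN_COMPRESS_INPUT = 32

def maybe_compress(value: bytes) -> tuple[bytes, bool]:
    """Return (stored_bytes, was_compressed) for one COMPRESSED attribute."""
    if len(value) < MIN_COMPRESS_INPUT:
        return value, False          # too small to be worth the header overhead
    compressed = zlib.compress(value)
    if len(compressed) >= len(value):
        return value, False          # compression did not help; store as-is
    return compressed, True

# A repetitive value (like the real[] arrays above) compresses well;
# a tiny value is left alone.
stored, done = maybe_compress(b"0.0,0.0,0.0," * 50)
assert done and len(stored) < 600
assert maybe_compress(b"short") == (b"short", False)
```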
Dawid Kuroczko wrote:
> Hello!
>
> Currently the TOASTing code does its magic when the whole tuple is
> larger than TOAST_TUPLE_TARGET, which happens to be around 2KB.
>
> There are times, though, when one is willing to trade (fast) CPU to
> reduce the amount of (slow) I/O. Data-warehousing types of workloads,
> most notably: rarely used large columns which are likely to compress
> well but are not large enough to trigger inline compression.
>
> As we already have four types of ALTER COLUMN .. SET STORAGE
> { PLAIN | EXTERNAL | EXTENDED | MAIN }, I would like to add
> "COMPRESSED", which would force column compression (unless the column
> is smaller than some minimum, I guess somewhere between 16 and 32 bytes).

I think TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET should help you.
Look in tuptoaster.h.

Zdenek
"Dawid Kuroczko" <qnex42@gmail.com> writes:
> As we already have four types of ALTER COLUMN .. SET STORAGE
> { PLAIN | EXTERNAL | EXTENDED | MAIN }, I would like to add
> "COMPRESSED", which would force column compression (unless the column
> is smaller than some minimum, I guess somewhere between 16 and 32 bytes).

Please see previous discussions about per-column toasting parameters,
for instance:

http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php
http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php

I think the general consensus was that we want more flexible access to
the compression knobs than just another STORAGE setting.

regards, tom lane
Dawid Kuroczko wrote:
> Currently the TOASTing code does its magic when the whole tuple is
> larger than TOAST_TUPLE_TARGET, which happens to be around 2KB.
>
> There are times, though, when one is willing to trade (fast) CPU to
> reduce the amount of (slow) I/O. Data-warehousing types of workloads,
> most notably: rarely used large columns which are likely to compress
> well but are not large enough to trigger inline compression.

I proposed at PGCon to have an option to select different compression
algorithms, for example gzip instead of our own PGLZ algorithm. In
testing I've found that for certain types of data, PGLZ compresses to
50% while gzip -8 compresses to 33%, albeit at a measurable CPU cost.
This is a tradeoff that some people might be interested in,
particularly on systems that are bottlenecked on I/O and have plenty
of CPU to spare.

We are also considering bzip2 compression, which has the disadvantage
that the required library is not already linked into the backend, which
means we need ways to set that up too (loading shared libraries, etc.).

This doesn't have a lot to do with what you are proposing, but I think
we need to come up with a reasonable user interface for all of this.

-- 
Alvaro Herrera                       http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
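[The ratio-versus-CPU tradeoff described above is easy to reproduce with stock libraries. A quick sketch, using Python's zlib as a stand-in for gzip and bz2 for bzip2 — the sample data and the resulting percentages are illustrative only; real ratios depend entirely on the column contents:]

```python
import bz2
import zlib

# Mildly repetitive sample data, standing in for a compressible column.
data = b"sensor=42 reading=0.0031 ts=2008-06-10\n" * 256

for name, compress in (("zlib (gzip family)", zlib.compress),
                       ("bz2 (bzip2)", bz2.compress)):
    out = compress(data)
    print(f"{name}: {len(out)}/{len(data)} bytes "
          f"({100 * len(out) / len(data):.1f}%)")
```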
On Tue, Jun 10, 2008 at 5:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Dawid Kuroczko" <qnex42@gmail.com> writes:
>> As we already have four types of ALTER COLUMN .. SET STORAGE
>> { PLAIN | EXTERNAL | EXTENDED | MAIN }, I would like to add
>> "COMPRESSED", which would force column compression (unless the column
>> is smaller than some minimum, I guess somewhere between 16 and 32 bytes).
>
> Please see previous discussions about per-column toasting parameters,
> for instance
> http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php
> http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php
>
> I think the general consensus was that we want more flexible access to
> the compression knobs than just another STORAGE setting.

Sounds like the right way to do it. Perhaps the syntax should be
something like:

  ALTER TABLE tab ALTER COLUMN x WITH (storage_parameter = value, ...);

With storage parameters like:

  compress          -- enable/disable compression (like PLAIN or EXTERNAL)
  min_input_size    -- don't compress if smaller than this size
  min_comp_rate     -- leave uncompressed if the compression rate is
                       smaller than this
  toast             -- for out-of-line storage parameters?
  compression_algo  -- for specifying alternative algorithms, if any
                       (per Alvaro's suggestion)

Perhaps it would be wise to introduce GUCs with default values (as we
now have ALTER COLUMN .. SET STATISTICS and default_statistics_target),
named for example default_column_min_input_size (and so on).

ALTER COLUMN .. SET STORAGE ... should become aliases for WITH (...)
and be deprecated, I guess.

The HEAP_HASEXTERNAL infomask bit should probably be used to "trigger"
the TOASTing code. Perhaps it should be renamed then? I am worried that
storage parameters might introduce overhead in performance-critical
parts of PostgreSQL.

...as for compression_algo, perhaps it could be the OID of a
compression function (we need to decompress too, so a pair of
functions). Also, we would need to store information about which
algorithm was used to compress the column. Perhaps a byte between the
varvarlena header and the actual compressed data (this way we could
have multiple algorithms simultaneously).

Speaking of algorithms, I think that e2compr (the ext2 filesystem with
transparent compression) could be a nice source of input in this area:
http://e2compr.sourceforge.net/

(Having algorithms as plugins would allow us to use foreign licenses
(gzip) or even patented algorithms, in countries where software patents
are prohibited, without risking anything in core PostgreSQL.)

OK, enough for today. Good night.

Regards,
Dawid

-- 
Solving [site load issues] with [more database replication] is a lot
like solving your own personal problems with heroin - at first it
sorta works, but after a while things just get out of hand.
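[The "one byte before the compressed data" idea above amounts to a small codec registry. A toy Python sketch — the id values, the registry shape, and the use of zlib/bz2 are all made up for illustration, not a proposed on-disk format:]

```python
import bz2
import zlib

# Hypothetical algorithm-id registry: id -> (compress, decompress).
CODECS = {
    0x01: (zlib.compress, zlib.decompress),
    0x02: (bz2.compress, bz2.decompress),
}

def pack(value: bytes, algo_id: int) -> bytes:
    """Prefix the compressed payload with a one-byte algorithm id."""
    compress, _ = CODECS[algo_id]
    return bytes([algo_id]) + compress(value)

def unpack(stored: bytes) -> bytes:
    """Read the id byte, then decompress with the matching codec."""
    _, decompress = CODECS[stored[0]]
    return decompress(stored[1:])

# The same payload round-trips through either codec; readers never need
# to know in advance which algorithm a given column value used.
payload = b"the same payload survives either codec " * 8
for algo in CODECS:
    assert unpack(pack(payload, algo)) == payload
```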