Re: [HACKERS] compression in LO and other fields - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: [HACKERS] compression in LO and other fields |
| Date | |
| Msg-id | 26512.942417843@sss.pgh.pa.us |
| In response to | Re: [HACKERS] compression in LO and other fields (wieck@debis.com (Jan Wieck)) |
| Responses | Re: [HACKERS] compression in LO and other fields; Re: [HACKERS] compression in LO and other fields |
| List | pgsql-hackers |
wieck@debis.com (Jan Wieck) writes:
> Html input might be somewhat optimal for Adisak's storage
> format, but taking into account that my source implementing
> the type input and output functions is smaller than 600
> lines, I think 11% difference to a gzip -9 is a good result
> anyway.

These strike me as very good results.  I'm not at all sure that using
gzip or bzip would give much better results in practice in Postgres,
because those compressors are optimized for relatively large files,
whereas a compressed-field datatype would likely be getting relatively
small field values to work on.  (So your test data set is probably a
good one for our purposes --- do the numbers change if you exclude all
the files over, say, 10K?)

> Bruce suggested the contrib area, but I'm not sure if that's
> the right place.  If it goes into the distribution at all, I'd
> like to use this data type for rule plan strings and function
> source text in the system catalogs.

Right, if we are going to bother with it at all, we should put it into
the core so that we can use it for rule plans.

> I don't expect we'll have
> a general solution for tuples split across multiple blocks
> for v7.0.

I haven't given up hope of that yet --- but even if we do, compressing
the data is an attractive choice to reduce the frequency with which
tuples must be split across blocks.

It occurred to me last night that applying compression to individual
fields might not be the best approach.  Certainly a "bytez" data type
is the easiest thing to fit into the existing system, but it's leaving
some space savings on the table.  What about compressing the *whole*
data contents of a tuple on-disk, as a single entity?  That should save
more space than field-by-field compression.  It could be triggered in
the tuple storage routines whenever the uncompressed size exceeds some
threshold.  (We'd need a flag in the tuple header to indicate
compressed data, but I think there are bits to spare.)  When we get
around to having split tuples, the code would still be useful because
it'd be applied as a first resort before splitting a large tuple; it'd
reduce the frequency of splits and the number of sections big tuples
get split into.  All automatic and transparent, too --- the user
doesn't have to change data declarations at all.

Also, if we do it that way, then it would *automatically* apply to both
regular tuples and LO, because the current LO implementation is just
tuples.  (Tatsuo's idea of a non-transaction-controlled LO would need
extra work, of course, if we decide that's a good idea...)

			regards, tom lane
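As a rough sketch of the scheme described above --- compress a tuple's whole data area as one unit only when it exceeds some size threshold, keep the original bytes if compression doesn't actually shrink them (likely for the small values mentioned earlier), and record the decision in a header flag --- the following hypothetical C fragment uses zlib for concreteness. The struct, flag, threshold, and function names are invented for illustration and are not PostgreSQL source code:

```c
/*
 * Illustrative sketch only (not PostgreSQL code): compress a tuple's
 * data area as a single blob when it exceeds a size threshold, and
 * mark the result with a hypothetical header bit.
 */
#include <stdlib.h>
#include <zlib.h>

#define COMPRESS_THRESHOLD    512       /* arbitrary trigger size, bytes */
#define TUPLE_FLAG_COMPRESSED 0x0001    /* hypothetical header flag bit */

typedef struct
{
    unsigned short flags;       /* would live in the on-disk tuple header */
    unsigned long  rawlen;      /* uncompressed length, needed to inflate */
    unsigned long  len;         /* length of data[] as actually stored */
    unsigned char *data;        /* whole tuple data contents, one blob */
} StoredTuple;

/*
 * Compress the data area in place if it is large enough and the
 * compressed form is smaller.  Returns 1 if compressed, 0 if left
 * alone, -1 on allocation failure.
 */
static int
maybe_compress_tuple(StoredTuple *tup)
{
    if (tup->len < COMPRESS_THRESHOLD)
        return 0;               /* small tuple: store as-is */

    uLongf         outlen = compressBound(tup->len);
    unsigned char *out = malloc(outlen);

    if (out == NULL)
        return -1;

    if (compress2(out, &outlen, tup->data, tup->len, Z_BEST_SPEED) != Z_OK ||
        outlen >= tup->len)
    {
        free(out);              /* incompressible: keep original bytes */
        return 0;
    }

    free(tup->data);
    tup->rawlen = tup->len;     /* remember size for decompression */
    tup->data = out;
    tup->len = outlen;
    tup->flags |= TUPLE_FLAG_COMPRESSED;
    return 1;
}
```

The threshold check and the "keep the original if it didn't shrink" fallback are the two points that make this transparent to the user: small or incompressible tuples are stored unchanged, and only the flag bit tells the storage routines whether to decompress on the way back out.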