Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [HACKERS] Custom compression methods
Msg-id: CA+TgmoZFEhqxXBOONMQeCZ09AJGqE=H2UMySF4zHZsatA0_E2Q@mail.gmail.com
In response to: Re: [HACKERS] Custom compression methods (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> OK. I think it's a nice use case (and nice gains on the compression
> ratio), demonstrating the datatype-aware compression. The question is
> why shouldn't this be built into the datatypes directly?

Tomas, thanks for running benchmarks of this. I was surprised to see how little improvement there was from other modern compression methods, although lz4 did appear to be a modest win on both size and speed. But I share your intuition that a lot of the interesting work is in datatype-specific compression algorithms. I have noticed in a number of papers that I've read that teaching other parts of the system to operate directly on the compressed data, especially for column stores, is a critical performance optimization; of course, that only makes sense if the compression is datatype-specific. I don't know exactly what that means for the design of this patch, though.

As a general point, no matter which way you go, you have to somehow deal with on-disk compatibility. If you want to build compression into the datatype itself, you need to find at least one bit someplace to mark the fact that you applied built-in compression. If you want to build it in as a separate facility, you need to denote the compression used someplace else. I haven't looked at how this patch does it, but the proposal in the past has been to add a value to vartag_external. One nice thing about the latter method is that it can be used for any data type generically, regardless of how much bit space is available in the data type representation itself. Realistically, it's hard to think of a data type that has no bit space available anywhere but is still subject to datatype-specific compression; bytea, by definition, has no bit space to spare but also can't benefit from special-purpose compression, whereas even something like text could be handled by starting the varlena with a NUL byte to indicate compressed data following. However, you'd have to come up with a different trick for each data type. Piggybacking on the TOAST machinery avoids that. It also implies that we only try to compress values that are "big", which is probably desirable if we're talking about a kind of compression that makes comprehending the value slower. Not all types of compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38, and for those that don't it probably makes more sense to just build it into the data type.

All of that is a somewhat separate question from whether we should have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using the ACCESS METHOD stuff instead). Even if we agree that piggybacking on TOAST is a good way to implement pluggable compression methods, it doesn't follow that the compression method is something that should be attached to the datatype from the outside; it could be built into it in a deep way. For example, "packed" varlenas (1-byte header) are a form of compression, and the default functions for detoasting always produce unpacked values, but the operators for the text data type know how to operate on the packed representation. That's sort of a trivial example, but it might well be that there are other cases where we can do something similar. Maybe jsonb, for example, can compress data in such a way that some of the jsonb functions can operate directly on the compressed representation -- perhaps the number of keys is easily visible, for example, or maybe more.
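To make the packed-varlena point a little more concrete, here is roughly what that looks like at the C level. This is only a sketch -- text_first_byte is an invented example function, not anything in core or in the patch -- but PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR are the real macros that let a function read a short-header value without unpacking it first:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/*
 * Toy example: return the first byte of a text value, reading the
 * datum in whatever form it arrives.  PG_GETARG_TEXT_PP() does not
 * force the value into the unpacked 4-byte-header form, and
 * VARDATA_ANY() / VARSIZE_ANY_EXHDR() cope with both the packed
 * (1-byte header) and unpacked representations, so the "compression"
 * is invisible to the function.
 */
PG_FUNCTION_INFO_V1(text_first_byte);

Datum
text_first_byte(PG_FUNCTION_ARGS)
{
    text       *t = PG_GETARG_TEXT_PP(0);
    const char *data = VARDATA_ANY(t);
    int         len = VARSIZE_ANY_EXHDR(t);

    if (len == 0)
        PG_RETURN_NULL();

    PG_RETURN_INT32((int32) (unsigned char) data[0]);
}

Nothing here ever asks for the fully detoasted, 4-byte-header form; that is the sense in which the text operators "know how to operate on the packed representation".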
In this view of the world, each data type should get to define its own compression method (or methods), but they are hard-wired into the datatype and you can't add more later -- or if you do, you lose the advantages of the hard-wired stuff.

BTW, another related concept that comes up a lot in discussions of this area is that we could do a lot better compression of columns if we had some place to store a per-column dictionary. I don't really know how to make that work. We could have a catalog someplace that stores an opaque blob for each column configured to use a compression method, and let the compression method store whatever it likes in there. That's probably fine if you are compressing the whole table at once and the blob is static thereafter. But if you want to update that blob as you see new column values, there seem to be almost insurmountable problems.

To be clear, I'm not trying to load this patch down with a requirement to solve every problem in the universe. On the other hand, I think it would be easy to beat a patch like this into shape in a fairly mechanical way and then commit-and-forget. That might be leaving a lot of money on the table; I'm glad you are thinking about the bigger picture and hope that my thoughts here somehow contribute.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company