Re: Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Re: Re: [HACKERS] Custom compression methods
Msg-id: CAFiTN-vcbfy5ScKVUp16c1N_wzP0RL6EkPBAg_Jm3eDK0ftO5Q@mail.gmail.com
In response to: Re: Re: [HACKERS] Custom compression methods (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Re: [HACKERS] Custom compression methods
List: pgsql-hackers
On Tue, Aug 25, 2020 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Aug 24, 2020 at 2:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > IIUC, the main reason for using this flag is for taking the decision
> > whether we need any detoasting for this tuple. For example, if we are
> > rewriting the table because the compression method is changed, then if
> > the HEAP_HASCUSTOMCOMPRESSED bit is not set in the tuple header and
> > tup->t_len is not > TOAST_TUPLE_THRESHOLD, then we don't need to call
> > heap_toast_insert_or_update for this tuple. Whereas if this flag is
> > set then we need to, because we might need to uncompress and compress
> > back using a different compression method. The same is the case with
> > INSERT INTO ... SELECT * FROM ...
>
> This doesn't really seem worth it to me. I don't see how we can
> justify burning an on-disk bit just to save a little bit of overhead
> during a rare maintenance operation. If there's a performance problem
> here we need to look for another way of mitigating it. Slowing CLUSTER
> and/or VACUUM FULL down by a large amount for this feature would be
> unacceptable, but is that really a problem? And if so, can we solve it
> without requiring this bit?

Okay, if we want to avoid keeping the bit then there are multiple ways to handle this; the only thing is that none of them will be specific to those scenarios.

Approach 1: In ExecModifyTable, we can process the source tuple, and if any of its varlena attributes is compressed with a stored compression method different from the target table attribute's, we can decompress it.

Approach 2: In heap_prepare_insert, always call heap_toast_insert_or_update; therein, we can check whether any of the source tuple's attributes are compressed with a compression method different from the target table's and, if so, decompress them.
With either approach, we have to do this in a generic path because the source of the tuple is not known; it can be the output of a function, a join tuple, or a subquery. So in the attached patch I have implemented Approach 1. For testing, I implemented both Approach 1 and Approach 2 and ran pgbench to check whether they impact the performance of the generic paths, but I did not see any impact.

> > I have already extracted these 2 patches from the main patch set.
> > But, in these patches, I am still storing the am_oid in the toast
> > header. I am not sure whether we can get rid of that, at least for
> > these 2 patches. Wherever we try to uncompress the tuple, we need to
> > know the tuple descriptor to get the am_oid, but I think that is not
> > possible in all cases. Am I missing something here?
>
> I think we should instead use the high bits of the toast size word for
> patches #1-#4, as discussed upthread.
>
> > > > Patch #3. Add support for changing the compression method associated
> > > > with a column, forcing a table rewrite.
> > > >
> > > > Patch #4. Add support for PRESERVE, so that you can change the
> > > > compression method associated with a column without forcing a table
> > > > rewrite, by including the old method in the PRESERVE list, or with a
> > > > rewrite, by not including it in the PRESERVE list.
> >
> > Does it make sense to have Patch #3 and Patch #4 without having
> > Patch #5? I mean, why do we need to support rewrite or preserve
> > unless we have the custom compression methods? Because the built-in
> > compression methods cannot be dropped, why do we need to preserve
> > them?
> I think that patch #3 makes sense because somebody might have a table
> that is currently compressed with pglz and they want to switch to lz4,
> and I think patch #4 also makes sense because they might want to start
> using lz4 for future data but not force a rewrite to get rid of all
> the pglz data they've already got. Those options are valuable as soon
> as there is more than one possible compression algorithm, even if
> they're all built in. Now, as I said upthread, it's also true that you
> could do #5 before #3 and #4. I don't think that's insane. But I
> prefer it in the other order, because I think having #5 without #3 and
> #4 wouldn't be too much fun for users.

Details of the attached patch set:

0001: Provides syntax to set the compression method to one of the built-in compression methods (pglz or zlib). pg_attribute stores the compression method (as a char), and there are conversion functions to convert that compression method to an index into the built-in compression method array. As discussed upthread, the first 2 bits will store the compression method index, using which we can directly get the handler routine from the built-in compression method array.

0002: Provides an option to change the compression method for an existing column, which rewrites the table. Next, I will work on an option to alter the compression method without rewriting the whole table; basically, we can provide a PRESERVE list to preserve the old compression methods.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com