Re: Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Msg-id: CAA4eK1+GvYOBDpVYZBBfvkRw3yFzK9Rz-nGriKR2Hag=gFK=9A@mail.gmail.com
In response to: Re: Performance Improvement by reducing WAL for Update Operation (Robert Haas <robertmhaas@gmail.com>)
Responses:
  Re: Performance Improvement by reducing WAL for Update Operation
  Re: Performance Improvement by reducing WAL for Update Operation
List: pgsql-hackers
On Fri, Jan 10, 2014 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 2. Provide a new reloption to specify Wal compression
>> for update operation on table
>> Create table tbl(c1 char(100)) With (compress_wal = true);
>>
>> Alternative options:
>> a. compress_wal can take input as operation, e.g. 'insert', 'update',
>> b. use alternate syntax:
>> Create table tbl(c1 char(100)) Compress Wal For Update;
>> c. anything better?
>
> I think WITH (compress_wal = true) is pretty good.  I don't understand
> your point about taking the operation as input, because this only
> applies to updates.  But we could try to work "update" into the name
> of the setting somehow, so as to be less likely to conflict with
> future additions, like maybe wal_compress_update.  I think the
> alternate syntax you propose is clearly worse, because it would
> involve adding new keywords, something we try to avoid.

Changed the name to wal_compress_update in the attached version of the patch.

>> Points to consider
>> -----------------------------
>>
>> 1. As the current algorithm stores the entry for same chunks at the head
>> of the list, it will always find the last-but-one chunk (we don't store
>> the last 4 bytes) for a long matching string during the match phase in
>> encoding (pgrb_delta_encode).
>>
>> We can improve it either by storing same chunks at the end of the list
>> instead of at the head, or by trying the good_match technique used in
>> the lz algorithm. The good_match technique can have overhead in some of
>> the cases when there is actually no match.
>
> I don't see what the good_match thing has to do with anything in the
> Rabin algorithm.  But I do think there might be a bug here, which is
> that, unless I'm misinterpreting something, hp is NOT the end of the
> chunk.  After calling pgrb_hash_init(), we've looked at the first FOUR
> bytes of the input.  If we find that we have a zero hash value at that
> point, shouldn't the chunk size be 4, not 1?  And similarly if we find
> it after sucking in one more byte, shouldn't the chunk size be 5, not
> 2?  Right now, we're deciding where the chunks should end based on the
> data in the chunk plus the following 3 bytes, and that seems wonky.  I
> would expect us to include all of those bytes in the chunk.

Okay, I have modified the patch to include the data plus the following
3 bytes inside the chunk. To resolve it, after calling pgrb_hash_init(),
we initialize the chunk size to 4, and once we find a chunk boundary, we
advance the next chunk's start position by the chunk size. Similarly,
during the match phase, while copying unmatched data, we make sure to
copy the 3 bytes ahead of the current position, as those will not be
considered for a new chunk.

> In the Rabin algorithm, we shouldn't try to find a longer match.  The
> match should end at the chunk end, period.  Otherwise, you lose the
> shift-resistant property of the algorithm.

Okay; for now, I have commented out the code in pgrb_find_match() that
tries to find a longer match past the chunk boundary. The reason for
commenting it out rather than removing it is that I fear it might have a
negative impact on WAL reduction, at least for the cases where most of
the data is repetitive. I have done some performance tests and the data
is at the end of this mail; if you are okay with it, I will remove this
code altogether.

>> 2. Another optimization that we can do in pgrb_find_match() is that
>> currently, if it doesn't find the first chunk (the chunk got by hash
>> index) matching, it continues to find the match in other chunks. I am
>> not sure if there is any benefit in searching other chunks if the first
>> one is not matching.
>
> Well, if you took that out, I suspect it would hurt the compression
> ratio.  Unless the CPU savings are substantial, I'd leave it alone.

I kept this code intact.
>> 3. We can move code from pg_lzcompress.c to some new file
>> pg_rbcompress.c. If we move it, then we need to either duplicate some
>> common macros like pglz_out_tag, or keep them common but perhaps change
>> the name.
>
> +1 for a new file.

Done; after moving the code to a new file, it looks better.

>> 4. Decide on the min and max chunk size (currently kept as 2 and 4
>> respectively). The point to consider is that if we keep bigger chunk
>> sizes, it can save us CPU cycles but give less reduction in WAL; on the
>> other side, if we keep it small, it can give better reduction in WAL
>> but consume more CPU cycles.
>
> Whoa.  That seems way too small.  Since PGRB_PATTERN_AFTER_BITS is 4,
> the average length of a chunk is about 16 bytes.  It makes little
> sense to have the maximum chunk size be 25% of the expected chunk
> length.  I'd recommend making the maximum chunk length something like
> 4 * PGRB_CONST_NUM, and the minimum chunk length maybe something like
> 4.

Okay, changed as per the suggestion.

>> 5. Kept a GUC variable 'wal_update_compression_ratio' for test
>> purposes; we can remove it before commit.
>
> Let's remove it now.

Done.

>> 7. Docs need to be updated; tab completion needs some work.
>
> Tab completion can be skipped for now, but documentation is important.

Updated the CREATE TABLE documentation.
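The boundary-detection fix discussed under point 1 (prime the hash over the first four bytes, count those bytes as part of the chunk, then advance by the full chunk size) can be sketched as follows. This is a sketch, not the patch's code: the constants, the function name rb_chunkify, and the simple multiplicative running hash are all invented for illustration. A real Rabin scheme uses a fixed-window rolling hash to stay shift-resistant; this sketch only shows the chunk-size bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative constants -- NOT the patch's values.  The real code derives
 * the boundary test from PGRB_PATTERN_AFTER_BITS and caps chunks at
 * 4 * PGRB_CONST_NUM, as discussed above. */
#define RB_WINDOW    4          /* bytes folded in by the initial priming */
#define RB_MASK      0x0F      /* 4 low bits => ~16-byte expected chunks */
#define RB_MIN_CHUNK 4
#define RB_MAX_CHUNK 16

/*
 * Split data[0..len) at content-defined boundaries, storing each chunk
 * length in sizes[] and returning the chunk count.  The key detail from
 * the discussion: after priming the hash over the first 4 bytes, the
 * chunk already covers 4 bytes, so a boundary found immediately yields a
 * chunk of size 4, not 1.
 */
static int
rb_chunkify(const uint8_t *data, size_t len, size_t *sizes)
{
    int     nchunks = 0;
    size_t  start = 0;

    while (start < len)
    {
        uint32_t hash = 0;
        size_t   csize = 0;

        /* Prime the hash: the first RB_WINDOW bytes belong to the chunk. */
        while (csize < RB_WINDOW && start + csize < len)
            hash = hash * 31 + data[start + csize++];

        /* Extend until the low bits are zero (a boundary) or the cap. */
        while (start + csize < len && csize < RB_MAX_CHUNK &&
               ((hash & RB_MASK) != 0 || csize < RB_MIN_CHUNK))
            hash = hash * 31 + data[start + csize++];

        sizes[nchunks++] = csize;
        start += csize;
    }
    return nchunks;
}
```

Because the next chunk starts exactly where the previous one ended, no bytes are examined twice and none are silently excluded, which is the behavior the fix above restores.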
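Point 2 (whether to search beyond the first hash-index hit) can be made concrete with a toy version of the match-phase lookup. Everything here is invented for illustration: the bucket layout, the names build_index/find_match, and the fixed 4-byte aligned chunks; the patch's pgrb_find_match() operates on content-defined chunks. The point is that the head of a collision chain may be a false hit, so giving up after the first candidate can miss a real match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK   4               /* fixed chunk size, for brevity only */
#define NBUCKET 64

typedef struct { size_t off[8]; int n; } bucket;

static uint32_t
chunk_hash(const uint8_t *p)
{
    uint32_t h = 0;
    for (int i = 0; i < CHUNK; i++)
        h = h * 31 + p[i];
    return h;
}

/* Index every CHUNK-aligned chunk of the old tuple. */
static void
build_index(const uint8_t *old, size_t oldlen, bucket *tab)
{
    memset(tab, 0, NBUCKET * sizeof(bucket));
    for (size_t o = 0; o + CHUNK <= oldlen; o += CHUNK)
    {
        bucket *b = &tab[chunk_hash(old + o) % NBUCKET];
        if (b->n < 8)
            b->off[b->n++] = o;
    }
}

/* Find an old-tuple offset whose bytes equal newp[0..CHUNK).  Walk the
 * whole collision chain: the head entry may merely share a hash bucket. */
static int
find_match(const uint8_t *old, const bucket *tab,
           const uint8_t *newp, size_t *match_off)
{
    const bucket *b = &tab[chunk_hash(newp) % NBUCKET];

    for (int i = 0; i < b->n; i++)
    {
        if (memcmp(old + b->off[i], newp, CHUNK) == 0)
        {
            *match_off = b->off[i];
            return 1;
        }
    }
    return 0;
}
```

In this toy setup, checking only the head of the chain would miss a match whenever two different chunks land in the same bucket, which lines up with Robert's concern that skipping the rest of the chain would hurt the compression ratio.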
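Robert's chunk-size arithmetic under point 4 can be sanity-checked directly: a boundary fires whenever the PGRB_PATTERN_AFTER_BITS low hash bits are zero, which on random input happens with probability 1/2^bits per byte, so chunks average about 2^4 = 16 bytes. The macro value 4 is taken from his message; the helper function is mine.

```c
#include <assert.h>

/* From the message above; used here only to check the arithmetic. */
#define PGRB_PATTERN_AFTER_BITS 4

/* Expected chunk length on random data: a boundary fires with
 * probability 1/2^bits per byte, so chunks average 2^bits bytes. */
static int
rb_expected_chunk_len(void)
{
    return 1 << PGRB_PATTERN_AFTER_BITS;
}
```

This is where the "25%" figure comes from: the originally proposed maximum chunk size of 4 is a quarter of the 16-byte expected chunk length.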
Performance Data
-----------------------------
Non-default settings:
autovacuum = off
checkpoint_segments = 128
checkpoint_timeout = 10min

Unpatched
-------------------
                testname                  | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |    1054923224 |  33.101135969162

After pgrb_delta_encoding_v4
---------------------------------------------
                testname                  | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |     877859144 |  30.6749138832092

Temporary changes (revert max chunk size = 4 and the longer-match logic)
---------------------------------------------------------------------------------------------
                testname                  | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |     677337304 |  25.4048750400543

Summarization of test results:
1. If we don't try to find a longer match, the encoded tuple will contain
   more tags, which increases its overall length.

Note:
a. I have taken data for just one case, to check whether the effect of
   the changes is acceptable.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
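As a convenience for reading the table, the relative savings can be computed directly from the reported numbers. The helper function is mine, not part of the patch; the inputs are the wal_generated values above.

```c
#include <assert.h>

/* Percentage reduction of "after" relative to "before". */
static double
pct_saved(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* With the numbers from the table above:
 *   pct_saved(1054923224, 877859144) ~= 16.8  (pgrb_delta_encoding_v4)
 *   pct_saved(1054923224, 677337304) ~= 35.8  (reverted variant)
 * so the reverted variant roughly doubles the WAL savings, at the cost of
 * the concerns about shift resistance discussed earlier in the thread. */
```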