Re: Compression of full-page-writes - Mailing list pgsql-hackers
From: Fujii Masao
Subject: Re: Compression of full-page-writes
Msg-id: CAHGQGwGBsYFHptNBiWOLcUfcQ4k8RZphSqMHv1vW1xQMUycJPQ@mail.gmail.com
In response to: Re: Compression of full-page-writes (Fujii Masao <masao.fujii@gmail.com>)
Responses: Re: Compression of full-page-writes
List: pgsql-hackers
On Fri, Oct 11, 2013 at 3:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Oct 11, 2013 at 1:20 AM, Dimitri Fontaine
> <dimitri@2ndquadrant.fr> wrote:
>> Hi,
>>
>> I did a partial review of this patch, wherein I focused on the patch and
>> the code itself, as I saw other contributors already did some testing on
>> it, so that we know it applies cleanly and works to some good extent.
>
> Thanks a lot!
>
>> In full_page_writes_str() why are you returning "unrecognized" rather
>> than doing an ELOG(ERROR, …) for this unexpected situation?
>
> It's because the similar functions 'wal_level_str' and 'dbState' also return
> 'unrecognized' in the unexpected situation. I just implemented
> full_page_writes_str() in the same manner.
>
> If we do an elog(ERROR) in that case, pg_xlogdump would fail to dump
> the 'broken' (i.e., unrecognized fpw is set) WAL file. I think that some
> users want to use pg_xlogdump to investigate the broken WAL file, so
> doing an elog(ERROR) seems not good to me.
>
>> The code switches to compression (or trying to) when the following
>> condition is met:
>>
>> +   if (fpw <= FULL_PAGE_WRITES_COMPRESS)
>> +   {
>> +       rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));
>>
>> We have
>>
>> + typedef enum FullPageWritesLevel
>> + {
>> +   FULL_PAGE_WRITES_OFF = 0,
>> +   FULL_PAGE_WRITES_COMPRESS,
>> +   FULL_PAGE_WRITES_ON
>> + } FullPageWritesLevel;
>>
>> + #define FullPageWritesIsNeeded(fpw)  (fpw >= FULL_PAGE_WRITES_COMPRESS)
>>
>> I don't much like using the <= test against an ENUM and I'm not sure I
>> understand the intention you have here. It somehow looks like a typo and
>> disagrees with the macro.
>
> I thought that FPW should be compressed only when full_page_writes is
> set to 'compress' or 'off'. That is, 'off' implies a compression. When it's set
> to 'off', FPW is basically not generated, so there is no need to call
> CompressBackupBlock() in that case. But only during online base backup,
> FPW is forcibly generated even when it's set to 'off'. So I used the check
> "fpw <= FULL_PAGE_WRITES_COMPRESS" there.
>
>> What about using the FullPageWritesIsNeeded
>> macro, and maybe rewriting the macro as
>>
>> #define FullPageWritesIsNeeded(fpw) \
>>   (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)
>
> I'm OK to change the macro so that the <= test is not used.
>
>> Also, having "on" imply "compress" is a little funny to me. Maybe we
>> should just finish our testing and be happy to always compress the full
>> page writes. What would the downside be exactly (on a busy IO system,
>> writing less data even if needing more CPU will be the right trade-off)?
>
> "on" doesn't imply "compress". When full_page_writes is set to "on",
> FPW is not compressed at all.
>
>> I like that you're checking the savings of the compressed data with
>> respect to the uncompressed data and cancel the compression if there's
>> no gain. I wonder if your test accounts for enough padding and headers,
>> though, given the results we saw in other tests made in this thread.
>
> I'm afraid that the patch has only limited effects in WAL reduction and
> performance improvement unless the database contains highly-compressible
> data like large blank-character columns. It really depends on the contents
> of the database. So, obviously FPW compression should not be the default.
> Maybe we can treat it as just a tuning knob.
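For readers following along, here is a minimal, self-contained sketch that puts the quoted fragments together. It is not the patch itself; the FullPageWritesNeedsCompression macro is invented here purely to spell out the intent behind the "<=" test Fujii describes (compress at 'compress', and also at 'off' because the only FPWs generated at 'off' are the ones forced during an online base backup).

#include <stdio.h>

/* Levels as quoted from the patch; the ordering is what the comparisons rely on. */
typedef enum FullPageWritesLevel
{
	FULL_PAGE_WRITES_OFF = 0,	/* FPW only when forced, e.g. during online base backup */
	FULL_PAGE_WRITES_COMPRESS,	/* FPW taken and compressed */
	FULL_PAGE_WRITES_ON			/* FPW taken, not compressed */
} FullPageWritesLevel;

/* Dimitri's suggested spelling: FPW is normally needed at 'compress' and 'on'. */
#define FullPageWritesIsNeeded(fpw) \
	((fpw) == FULL_PAGE_WRITES_COMPRESS || (fpw) == FULL_PAGE_WRITES_ON)

/*
 * Hypothetical macro (not in the patch) making the intent of
 * "fpw <= FULL_PAGE_WRITES_COMPRESS" explicit: compression applies at
 * 'compress' and at 'off', since FPWs at 'off' exist only during base backup.
 */
#define FullPageWritesNeedsCompression(fpw) \
	((fpw) == FULL_PAGE_WRITES_COMPRESS || (fpw) == FULL_PAGE_WRITES_OFF)

int
main(void)
{
	static const char *names[] = {"off", "compress", "on"};
	int			lvl;

	for (lvl = FULL_PAGE_WRITES_OFF; lvl <= FULL_PAGE_WRITES_ON; lvl++)
		printf("%-8s  fpw needed: %d  compress fpw: %d\n",
			   names[lvl],
			   FullPageWritesIsNeeded(lvl),
			   FullPageWritesNeedsCompression(lvl));
	return 0;
}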
>
>> Why do we have both the static function full_page_writes_str() and the
>> macro FullPageWritesStr, with two different implementations issuing
>> either "true" and "false" or "on" and "off"?
>
> First I was thinking to use "on" and "off" because they are often used
> as the setting value of a boolean GUC. But unfortunately the existing
> pg_xlogdump uses "true" and "false" to show the value of full_page_writes
> in WAL. To avoid breaking backward compatibility, I implemented
> the "true/false" version of the function. I'm really not sure how many
> people want such compatibility of pg_xlogdump, though.
>
>> !    unsigned    hole_offset:15, /* number of bytes before "hole" */
>> !                flags:2,        /* state of a backup block, see below */
>> !                hole_length:15; /* number of bytes in "hole" */
>>
>> I don't understand that. I wanted to use that patch as a leverage to
>> smoothly discover the internals of our WAL system but won't have the
>> time to do that here.
>
> We need the flag indicating whether each FPW is compressed or not.
> If no such flag exists in WAL, the standby cannot determine whether
> it should decompress each FPW or not, and then cannot replay
> the WAL containing FPW properly. That is, I just used a 'space' in
> the header of FPW to have such a flag.
>
>> That said, I don't even know that C syntax.
>
> The struct 'ItemIdData' uses the same C syntax.
>
>> + #define BKPBLOCK_UNCOMPRESSED   0   /* uncompressed */
>> + #define BKPBLOCK_COMPRESSED     1   /* comperssed */
>>
>> There's a typo in the comment above.
>
> Yep.
>
>>> [time required to replay WAL generated during running pgbench]
>>>   61s (on) .... 1209911 transactions were replayed,
>>>                 recovery speed: 19834.6 transactions/sec
>>>   39s (compress) .... 1445446 transactions were replayed,
>>>                       recovery speed: 37062.7 transactions/sec
>>>   37s (off) .... 1629235 transactions were replayed,
>>>                  recovery speed: 44033.3 transactions/sec
>>
>> How did you get those numbers? pg_basebackup before the test and
>> archiving, then a PITR maybe? Is it possible to do the same test with
>> the same number of transactions to replay, I guess using the -t
>> parameter rather than the -T one for this testing?
>
> Sure. To be honest, when I received the same request from Andres,
> I did that benchmark. But unfortunately, because of machine trouble,
> I could not report it yet. Will do that again.
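As an aside on the bit-field syntax Dimitri asks about: the quoted fragment packs three fields (15 + 2 + 15 bits) into a single 32-bit word, so the new per-FPW flag costs no extra header space. The sketch below is assembled only from the fragments quoted in this thread; it is not the patch code, and the real backup-block header also identifies the relation, fork, and block.

#include <stdio.h>

#define BKPBLOCK_UNCOMPRESSED	0	/* uncompressed */
#define BKPBLOCK_COMPRESSED		1	/* compressed */

/*
 * Bit-field sketch of the FPW header fragment quoted above.  The 2-bit
 * "flags" field is the 'space' Fujii mentions; recovery reads it to decide
 * whether a backup block must be decompressed before the page is restored.
 */
typedef struct BkpBlockHeaderSketch
{
	unsigned	hole_offset:15,	/* number of bytes before "hole" */
				flags:2,		/* BKPBLOCK_UNCOMPRESSED or BKPBLOCK_COMPRESSED */
				hole_length:15;	/* number of bytes in "hole" */
} BkpBlockHeaderSketch;

int
main(void)
{
	BkpBlockHeaderSketch hdr = {.hole_offset = 128,
								.flags = BKPBLOCK_COMPRESSED,
								.hole_length = 4096};

	/* All three fields typically share one 4-byte word despite the added flag. */
	printf("sizeof(BkpBlockHeaderSketch) = %zu\n", sizeof(hdr));
	printf("compressed? %d\n", hdr.flags == BKPBLOCK_COMPRESSED);
	return 0;
}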
Here is the benchmark result:

* Result

[tps]
1317.306391 (full_page_writes = on)
1628.407752 (compress)

[the amount of WAL generated during running pgbench]
1319 MB (on)
 326 MB (compress)

[time required to replay WAL generated during running pgbench]
19s (on)
  2013-10-11 12:05:09 JST LOG:  redo starts at F/F1000028
  2013-10-11 12:05:28 JST LOG:  redo done at 10/446B7BF0
12s (on)
  2013-10-11 12:06:22 JST LOG:  redo starts at F/F1000028
  2013-10-11 12:06:34 JST LOG:  redo done at 10/446B7BF0
12s (on)
  2013-10-11 12:07:19 JST LOG:  redo starts at F/F1000028
  2013-10-11 12:07:31 JST LOG:  redo done at 10/446B7BF0
8s (compress)
  2013-10-11 12:17:36 JST LOG:  redo starts at 10/50000028
  2013-10-11 12:17:44 JST LOG:  redo done at 10/655AE478
8s (compress)
  2013-10-11 12:18:26 JST LOG:  redo starts at 10/50000028
  2013-10-11 12:18:34 JST LOG:  redo done at 10/655AE478
8s (compress)
  2013-10-11 12:19:07 JST LOG:  redo starts at 10/50000028
  2013-10-11 12:19:15 JST LOG:  redo done at 10/655AE478

[benchmark]
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 32
number of threads: 4
number of transactions per client: 10000
number of transactions actually processed: 320000/320000

Regards,

--
Fujii Masao
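Taking only the figures reported above, and assuming the [benchmark] settings apply to both runs: compression cuts the WAL volume from 1319 MB to 326 MB (326/1319 ≈ 0.25, roughly a fourfold reduction), pgbench throughput rises from about 1317 tps to about 1628 tps (≈ 24% higher), and since both replays cover the same 320000 transactions, finishing redo in 8s instead of 12-19s is a per-transaction speedup as well, not only a consequence of having less WAL to read.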