Re: double writes using "double-write buffer" approach [WIP] - Mailing list pgsql-hackers
From: Dan Scales
Subject: Re: double writes using "double-write buffer" approach [WIP]
Date:
Msg-id: 1377299018.1457787.1328745629741.JavaMail.root@zimbra-prod-mbox-4.vmware.com
In response to: Re: double writes using "double-write buffer" approach [WIP] (Amit Kapila <amit.kapila@huawei.com>)
Responses: Re: double writes using "double-write buffer" approach [WIP]
List: pgsql-hackers
> Is there any problem if the double-write happens only by the bgwriter or checkpoint?
> Something like: whenever a backend process has to evict a buffer, it will do the same as you have described (write into the double-write buffer), but the bgwriter will check this double-write buffer and flush from it.
> Also, whenever any backend sees that the double-write buffer is more than 2/3rd full (or some threshold value), it will tell the bgwriter to flush from the double-write buffer.
> This can ensure very little I/O by any backend.

Yes, I think this is a good idea. I could make changes so that the backends hand off the responsibility to flush batches of the double-write buffer to the bgwriter whenever possible. This would avoid some long IO waits in the backends, though the backends may of course eventually wait anyway for the bgwriter if IO is not fast enough. I did write the code so that any process can write a completed batch if the batch is not currently being flushed (so as to deal with crashes by backends). Having the backends flush the batches as they fill them up was just simpler for a first prototype.

Dan

----- Original Message -----
From: "Amit Kapila" <amit.kapila@huawei.com>
To: "Dan Scales" <scales@vmware.com>, "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

>> I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions.

I don't understand this point, because from the data in your mail, it appears that when shared buffers are smaller, meaning more evictions can happen, the performance is lower.
ISTM that the performance is lower in the case of a smaller shared buffers size because I/O might be done by the backend process, which can degrade performance.

Is there any problem if the double-write happens only by the bgwriter or checkpoint?
Something like: whenever a backend process has to evict a buffer, it will do the same as you have described (write into the double-write buffer), but the bgwriter will check this double-write buffer and flush from it.
Also, whenever any backend sees that the double-write buffer is more than 2/3rd full (or some threshold value), it will tell the bgwriter to flush from the double-write buffer.
This can ensure very little I/O by any backend.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using "double-write buffer" approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patch that only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about.

The idea is that all page writes generally are copied to a double-write buffer, rather than being immediately written. Note that a full copy of the page is required, but this can be folded in with a checksum calculation. Periodically (e.g. every time a certain-size batch of writes has accumulated), some writes are pushed out using double writes -- the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Double writes then allow torn pages to be fixed, so full_page_writes can be turned off (thus greatly reducing the size of the WAL log).
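[Editor's sketch] A minimal sketch of the write/fsync ordering just described, using plain POSIX I/O; this is not code from the patch, and the DwEntry layout, constants, and function names are illustrative placeholders only.

#include <unistd.h>     /* pwrite, fsync */
#include <sys/types.h>  /* off_t */

#define DW_PAGE_SIZE 8192

typedef struct DwEntry
{
    int   data_fd;              /* fd of the data file the page belongs to */
    off_t data_offset;          /* byte offset of the page in that file */
    char  page[DW_PAGE_SIZE];   /* copy of the page taken at write time */
} DwEntry;

/*
 * Flush one completed batch using the double-write protocol: the batch is
 * made durable in the double-write file before any data file is touched,
 * so a torn data-file write can always be repaired from the double-write
 * file at recovery time.
 */
static int
dw_flush_batch(int dw_fd, DwEntry *batch, int nentries)
{
    int i;

    /* Step 1: write the whole batch to the double-write file and fsync it. */
    for (i = 0; i < nentries; i++)
        if (pwrite(dw_fd, batch[i].page, DW_PAGE_SIZE,
                   (off_t) i * DW_PAGE_SIZE) != DW_PAGE_SIZE)
            return -1;
    if (fsync(dw_fd) != 0)
        return -1;

    /* Step 2: only now write each page to its real location. */
    for (i = 0; i < nentries; i++)
        if (pwrite(batch[i].data_fd, batch[i].page, DW_PAGE_SIZE,
                   batch[i].data_offset) != DW_PAGE_SIZE)
            return -1;

    /* Step 3: fsync the data files (naively, once per entry here). */
    for (i = 0; i < nentries; i++)
        if (fsync(batch[i].data_fd) != 0)
            return -1;

    return 0;
}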
The key changes are conceptually simple:

1. In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batch using double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.]

2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk.

3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer are flushed.

4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventually this would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.)

5. Deal with duplicates in the double-write buffer appropriately (this happens very rarely).

To get good performance, I needed to have two double-write buffers, one for the checkpointer and one for all other processes. The double-write buffers are circular buffers. The checkpointer double-write buffer is just a single batch of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches of 64 pages each. Each batch goes to a different double-write file, so that they can be issued independently as soon as each batch is completed. Also, I need to sort the buffers being checkpointed by file/offset (see ioseq.c), so that the checkpointer batches will most likely only have to write and fsync one data file.

Interestingly, I find that the plot of tpm for DBT2 is much smoother (though it still has wiggles) with double writes enabled, since there are no unpredictable long fsyncs at the end of (or during) a checkpoint.

Here are performance numbers for the double-write buffer (same configs as the previous numbers), for a 2-processor, 60-minute, 50-warehouse DBT2 run. On the right are the size of shared_buffers and the size of the RAM in the virtual machine. FPW stands for full_page_writes, DW for double_writes. 'two disk' means the WAL log is on a separate ext3 filesystem from the data files.

              FPW off   FPW on   DW on, FPW off
  one disk:    15488    13146        11713       [5G buffers, 8G VM]
  two disk:    18833    16703        18013

  one disk:    12908    11159         9758       [3G buffers, 6G VM]
  two disk:    14258    12694        11229

  one disk:    10829     9865         5806       [1G buffers, 8G VM]
  two disk:    13605    12694         5682

  one disk:     6752     6129         4878       [1G buffers, 2G VM]
  two disk:     7253     6677         5239

The performance of DW in the small cache cases (1G shared_buffers) is now much better, though still not as good as FPW on. In the medium cache case (3G buffers), where there are significant backend dirty evictions, the performance of DW is close to that of FPW on. In the large cache case (5G buffers), where the checkpointer can do all the work and there are minimal dirty evictions, DW is much better than FPW in the two disk case.

In the one disk case, it is somewhat worse than FPW on. However, interestingly, if you just move the double-write files to a separate ext3 filesystem on the same disk as the data files, the performance goes to 13107 -- now on par with FPW on. We are obviously getting hit by the ext3 fsync slowness issues. (I believe that an fsync on a filesystem can stall on other unrelated writes to the same filesystem.)

Let me know if you have any thoughts/comments, etc. The patch is enclosed, and the README.doublewrites is updated a fair bit.

Thanks,

Dan
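[Editor's sketch] A minimal sketch of the sort-by-file/offset step mentioned in the message above, assuming a simplified buffer-tag layout (the patch's ioseq.c operates on PostgreSQL's real buffer tags, which differ); sorting the pages of a checkpoint batch this way makes it likely that each 64-page batch writes to, and therefore fsyncs, a single data file.

#include <stdint.h>
#include <stdlib.h>

/* Simplified placeholder for a buffer tag identifying a page on disk. */
typedef struct BufTag
{
    uint32_t fileid;    /* which data file the page lives in */
    uint32_t blockno;   /* block number within that file */
} BufTag;

/* Order tags by data file first, then by block number within the file. */
static int
buftag_cmp(const void *a, const void *b)
{
    const BufTag *ta = (const BufTag *) a;
    const BufTag *tb = (const BufTag *) b;

    if (ta->fileid != tb->fileid)
        return (ta->fileid < tb->fileid) ? -1 : 1;
    if (ta->blockno != tb->blockno)
        return (ta->blockno < tb->blockno) ? -1 : 1;
    return 0;
}

/* Sort the tags of the pages about to be pushed out by the checkpointer. */
static void
sort_checkpoint_batch(BufTag *tags, size_t ntags)
{
    qsort(tags, ntags, sizeof(BufTag), buftag_cmp);
}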