From: Andres Freund
Subject: Re: checkpointer continuous flushing
Msg-id: 20160120101326.rvao4mcuntxxf7wf@alap3.anarazel.de
In response to: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > This seems like a problem with the WAL writer quite independent of
> > anything else. It seems likely to be inadvertent fallout from this
> > patch:
> >
> > Author: Simon Riggs <simon@2ndQuadrant.com>
> > Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
> >
> >     Wakeup WALWriter as needed for asynchronous commit performance.
> >     Previously we waited for wal_writer_delay before flushing WAL. Now
> >     we also wake WALWriter as soon as a WAL buffer page has filled.
> >     Significant effect observed on performance of asynchronous commits
> >     by Robert Haas, attributed to the ability to set hint bits on tuples
> >     earlier and so reducing contention caused by clog lookups.
>
> In addition to that the "powersaving" effort also plays a role - without
> the latch we'd not wake up at any meaningful rate at all atm.

The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com

What I didn't remember is that I voiced concern back then about exactly
this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Simon: CCed you, as the author of the above commit.

Quick summary: The frequent wakeups of the wal writer can lead to
significant performance regressions in workloads that are bigger than
shared_buffers, because the super-frequent fdatasync()s issued by the
wal writer dramatically slow down concurrent writes (bgwriter,
checkpointer, individual backend writes) - to the point that SIGSTOPing
the wal writer gets a pgbench workload from 2995 to 10887 tps. The
reason fdatasync()s cause a slowdown is that they prevent real use of
queuing to the storage devices.

On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > If I understand correctly, prior to that commit, WAL writer woke up 5
> > times per second and flushed just that often (unless you changed the
> > default settings). But as the commit message explained, that turned
> > out to suck - you could make performance go up very significantly by
> > radically decreasing wal_writer_delay. This commit basically lets it
> > flush at maximum velocity - as fast as we finish one flush, we can
> > start the next. That must have seemed like a win at the time from the
> > way the commit message was written, but you seem to now be seeing the
> > opposite effect, where performance is suffering because flushes are
> > too frequent rather than too infrequent. I wonder if there's an ideal
> > flush rate and what it is, and how much it depends on what hardware
> > you have got.
>
> I think the problem isn't really that it's flushing too much WAL in
> total, it's that it's flushing WAL in too granular a fashion. I suspect
> we want something where we attempt a minimum number of flushes per
> second (presumably tied to wal_writer_delay) and, once that is
> exceeded, a minimum number of pages per flush. I think we could even
> continue to write() the data at the same rate as today; we would just
> need to reduce the number of fdatasync()s we issue. And we could
> possibly make the eventual fdatasync()s cheaper by hinting the kernel
> to write the data out earlier.
>
> Now the question of what minimum number of pages we want to flush
> (setting wal_writer_delay-triggered flushes aside) isn't easy to
> answer. A simple model would be to statically tie it to the size of
> wal_buffers; say, don't flush unless at least 10% of XLogBuffers have
> been written since the last flush. More complex approaches would be to
> measure the continuous WAL writeout rate.
>
> By tying it to both a minimum rate under activity (ensuring things go
> to disk fast) and a minimum number of pages to sync (ensuring a
> reasonable number of cache flush operations) we should be able to
> mostly accommodate the different types of workloads. I think.
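To make that decision rule concrete, a rough C sketch - purely
illustrative, not actual PostgreSQL code, and all names (FlushPolicy,
should_flush, min_pages_per_flush, ...) are made up:

#include <stdbool.h>
#include <stdint.h>

typedef struct FlushPolicy
{
    int64_t     last_flush_time_us;    /* time of the previous fdatasync() */
    int64_t     delay_us;              /* wal_writer_delay, in microseconds */
    int         pages_since_flush;     /* pages write()n but not yet synced */
    int         min_pages_per_flush;   /* e.g. 10% of XLogBuffers */
} FlushPolicy;

static bool
should_flush(FlushPolicy *p, int64_t now_us)
{
    /* Guarantee a minimum flush rate, tied to wal_writer_delay. */
    if (now_us - p->last_flush_time_us >= p->delay_us)
        return p->pages_since_flush > 0;

    /* Otherwise batch: don't fdatasync() until enough pages accumulated. */
    return p->pages_since_flush >= p->min_pages_per_flush;
}

The wal_writer_delay clock puts a floor under the flush rate, while the
page threshold caps how many cache flush operations we pay for in
between.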
This unfortunately leaves out part of the reasoning for the above
commit: we want WAL to be flushed quickly, so we can set hint bits
immediately.

One, relatively extreme, approach would be to continue *writing* WAL in
the background writer as today, but to use rules like the ones
suggested above to guide the actual flushing, additionally using
operations like sync_file_range() (and equivalents on other OSs). Then,
to address the regression of SetHintBits() having to bail out more
often, actually trigger a WAL flush whenever WAL is already written,
but not flushed. That has the potential to be bad in a number of other
cases tho :(
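As a purely illustrative, Linux-specific sketch of that idea (all names
and thresholds are made up, and other OSs would need their own
equivalent of sync_file_range()):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void
wal_write_range_hinted(int fd, off_t offset, off_t nbytes,
                       int *pending_pages, int page_size,
                       int flush_threshold_pages)
{
    /*
     * Start kernel writeback for the range we just write()d. This does
     * not force a device cache flush, so it is comparatively cheap, and
     * it should make the eventual fdatasync() faster.
     */
    sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);

    *pending_pages += nbytes / page_size;

    /* Pay for the real cache flush only once enough pages accumulated. */
    if (*pending_pages >= flush_threshold_pages)
    {
        (void) fdatasync(fd);
        *pending_pages = 0;
    }
}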
Andres