Re: limiting hint bit I/O - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: limiting hint bit I/O |
| Date | |
| Msg-id | AANLkTikPDXtY8P7QmGZ4VNe2c6feHKCQJKaJ54CDATqB@mail.gmail.com |
| In response to | Re: limiting hint bit I/O (Merlin Moncure <mmoncure@gmail.com>) |
| Responses | Re: limiting hint bit I/O |
| List | pgsql-hackers |
On Wed, Jan 19, 2011 at 11:18 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jan 19, 2011 at 10:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Here's a new version of the patch based on some experimentation with
>> ideas I posted yesterday. At least on my Mac laptop, this is pretty
>> effective at blunting the response time spike for the first table
>> scan, and it converges to steady-state after about 20 table scans.
>> Rather than write every 20th page, what I've done here is make every
>> 2000th buffer allocation grant an allowance of 100 "hint bit only"
>> writes. All dirty pages and the next 100 pages that are
>> dirty-only-for-hint-bits get written out. Then we stop writing the
>> dirty-only-for-hint-bits pages until we get our next allowance of
>> writes. The idea is to try to avoid creating a lot of random writes
>> on each scan through the table. At least here, that seems to work
>> pretty well - the initial scan is only about 25% slower than the
>> steady state (rather than 6x or more slower).
>
> does this only impact the scan case? in oltp scenarios you want to
> write out the bits asap, i would imagine. what about time based
> flushing, so that only x dirty hint bit pages can be written out per
> time unit y?

No, it doesn't only affect the scan case. But I don't think that's bad. The goal is for the background writer to provide enough clean pages that backends don't have to write anything at all. If that's not happening, the backends will be slowed by the need to write out pages themselves in order to create a sufficient supply of clean pages to satisfy their allocation needs. The easiest way for that situation to occur is if the backend is doing a large sequential scan of a table - in that case, it's by definition cycling through pages at top speed, and the fact that it's cycling through them in a ring buffer rather than using all of shared_buffers makes the loop even tighter.
But if it's possible under some other set of circumstances, the behavior is still reasonable. This behavior kicks in if more than 100 out of some set of 2000 page allocations would require a write only for the purpose of flushing hint bits.

Time-based flushing would be problematic in several respects. First, it would require a kernel call, which would be vastly more expensive than what I'm doing now, and might have undesirable performance implications for that reason. Second, I don't think it would be the right way to tune it even if that were not an issue. It doesn't really matter whether the system takes a millisecond or a microsecond or a nanosecond to write each buffer - what matters is that writing all the buffers is a lot slower than writing none of them. So what we want to do is write a percentage of them, in a way that guarantees that they'll all eventually get written if people continue to access the same data. This does that, and a time-based setting would not; it would also almost certainly require tuning based on the I/O capacities of the system it's running on, which isn't necessary with this approach.

Before we get too deeply involved in theory, can you give this a test drive on your system and see how it looks?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company