Re: limiting hint bit I/O - Mailing list pgsql-hackers
From | Cédric Villemain |
---|---|
Subject | Re: limiting hint bit I/O |
Date | |
Msg-id | AANLkTi==bJdFbNr0GgHXX-paT1f4_SudX4s18OPuW86q@mail.gmail.com Whole thread Raw |
In response to | Re: limiting hint bit I/O (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: limiting hint bit I/O
|
List | pgsql-hackers |
2011/2/7 Robert Haas <robertmhaas@gmail.com>: > On Mon, Feb 7, 2011 at 10:48 AM, Bruce Momjian <bruce@momjian.us> wrote: >> Robert Haas wrote: >>> On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <bruce@momjian.us> wrote: >>> > Uh, in this C comment: >>> > >>> > + ? ? ? ?* or not we want to take the time to write it. ?We allow up to 5% of >>> > + ? ? ? ?* otherwise-not-dirty pages to be written due to hint bit changes, >>> > >>> > 5% of what? ?5% of all buffers? ?5% of all hint-bit-dirty ones? ?Can you >>> > clarify this in the patch? >>> >>> 5% of buffers that are hint-bit-dirty but not otherwise dirty. ISTM >>> that's exactly what the comment you just quoted says on its face, but >>> I'm open to some other wording you want to propose. >> >> How about: >> >> otherwise-not-dirty -> only-hint-bit-dirty >> >> So 95% of your hint bit modificates are discarded if the pages is not >> otherwise dirtied? That seems pretty radical. > > No, it's more subtle than that, although I admit it *is* radical. > There are three ways that pages can get written out to disk: > > 1. Checkpoints. > 2. Background writer activity. > 3. Backends writing out dirty buffers because there are no clean > buffers available to allocate. > > What the latest version of the patch implements is: > > 1. Checkpoints no longer write only-hint-bit-dirty pages to disk. > Since a checkpoint doesn't evict pages from memory, the hint bits are > still there to be written out (or not) by (2) or (3), below. > > 2. When the background writer's cleaning scan hits an > only-hint-bit-dirty page, it writes it, same as before. This > definitely doesn't result in the loss of any hint bits. > > 3. When a backend writes out a dirty buffer itself, because there are > no clean buffers available to allocate, it initially writes them. But > if there are more than 100 such pages per block of 2000 allocations, > it recycles any after the first 100 without writing them. > > In normal operation, I suspect that there will be very little impact > from this change. The change described in #1 may slightly reduce the > size of some checkpoints, but it's unclear that it will be enough to > be material. The change described in #3 will probably also not > matter, because, in a well-tuned system, the background writer should > be set aggressively enough to provide a supply of clean pages, and > therefore backends shouldn't be doing many writes themselves, and > therefore most buffer allocations will be of already-clean pages, and > the logic described in #3 will probably never kick in. Even if they > are writing a lot of buffers themselves, the logic in #3 still won't > kick in if many of the pages being written are actually dirty - it > will only matter if the backends are writing out lots and lots of > pages *solely because they are only-hint-bit-dirty*. > > Where I expect this to make a big difference is on sequential scans of > just-loaded tables. In that case, the BufferAccessStrategy machinery > will force the backend to reuse the same buffers over and over again, > and all of those pages will be only-hint-bit-dirty. So the backend > has to do a write for every page it allocates, and even though those > writes are being absorbed by the OS cache, it's still slow. With this > patch, what will happen is that the backend will write about 100 > pages, then perform the next 1900 allocations without writing, then > write another 100 pages, etc. So at the end of the scan, instead of > having written an amount of data equal to the size of the table, we > will have written 5% of that amount, and 5% of the hint bits will be > on disk. Each subsequent scan will get another 5% of the hint bits on > disk until after 20 scans they are all set. So the work of setting > the hint bits is spread out across the first 20 table scans instead of > all being done the first time through. > > Clearly, there's further jiggering that can be done here. But the > overall goal is simply that some of our users don't seem to like it > when the first scan of a newly loaded table generates a huge storm of > *write* traffic. Given that the hint bits appear to be quite > important from a performance perspective (see benchmark numbers > upthread), those are not real benchmarks, just quick guess to check behavior. (and I agree it looks good, but I also got inconsistent results, the patched postgresql hardly reach the same speed of the original 9.1devel even after 200 hundreds select of your testcase) > we don't really have the option of just not writing them - > but we can try to not to do it all at once, if we think that's an > improvement, which I think is likely. > > Overall, I'm inclined to move this patch to the next CommitFest and > forget about it for now. I don't think we're going to get enough > testing of this in the next week to be really confident that it's > right. I might be willing to commit with some more moderate amount of > testing if we were right at the beginning of a development cycle, > figuring that we'd shake out any warts as the cycle went along, but > this isn't seeming like the right time for this kind of a change. I agree. I think it might be better to do the hint_bit_allowance decrement when we write something (dirty or dirtyhint). And so we can have something like : 100% writte : write dirty + hint 5 % write : write 5 % of (dirty + hint) (instead of write 5% of the hint only). So come a simple Bandwith/IOrequest limiter. Open for next commitfest :) -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
pgsql-hackers by date: