Thread: Set hint bits upon eviction from BufMgr
Maybe I'm being overly simplistic or incorrect here, but I was thinking that there might be a route to reducing hint bit impact on the main sufferers of the feature without adding too much pain in the general case. I'm unfortunately convinced there is no getting rid of them -- in fact, their utility will become even more apparent as storage gets faster and the pendulum of optimization swings back to the CPU side.
My idea is to reserve a bit in the page header, say PD_ALL_SAME_XMIN, that indicates all the tuples on the page are from the same transaction. It would be set when the first inserted tuple hits the page and unset when a tuple is added from another xmin or any tuple is touched or deleted. The point here is to set up a cheap check at the page level that we can make when a page is getting evicted from the bufmgr. If the bit is set, we grab the xmin of the first tuple on the page and test it for visibility (assuming the hint bit is not already set). If we get a thumbs up on the transaction, we can lock the page and set all tuple hints during the page evict/sync process. We don't worry about logging/crash safety on the 'all same' hint because it's only interesting to this bufmgr check (it can even be cleared when the page is loaded).
Without this bit, the only way to set hint bits going during bufmgr eviction is to do a visibility check on every tuple, which would probably be prohibitively expensive. Since OLTP environments would rarely see this bit, they would not have to pay for the check. Also, we can maybe tweak the bufmgr to prefer not to evict pages with this bit set if it's known they are not yet written out to primary storage. Maybe this is impossible or not logical...just thinking out loud.
Anyways, if this actually works, shared buffers can start to play a role in mitigating hint bit I/O, as long as the transaction resolves before pages start jumping out into storage. If you couple this with a facility to do bulk loads that break up transactions at regular intervals, you have a good shot at getting all your hint bits written out properly in large load situations. You might be able to do similar tricks with deletes -- I haven't thought about that. Also, there might be some interplay with vacuum or some other deal breaker -- curious to see if I have something worth further thought here. merlin
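A minimal sketch of the page-level check proposed above, assuming a toy page layout rather than PostgreSQL's real one. PD_ALL_SAME_XMIN is the name from the proposal; every other name here (ToyPage, toy_xid_committed, and so on) is hypothetical, and the clog lookup is replaced by a simple stand-in. Clearing the flag on update or delete is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: not PostgreSQL's real page or tuple layout. */
#define TOY_MAX_TUPLES     8
#define PD_ALL_SAME_XMIN   0x01  /* hypothetical flag named in the proposal */
#define TOY_XMIN_COMMITTED 0x01  /* stand-in for a per-tuple hint bit */

typedef struct { uint32_t xmin; uint8_t hints; } ToyTuple;
typedef struct { uint8_t flags; int ntuples; ToyTuple tuples[TOY_MAX_TUPLES]; } ToyPage;

/* Stand-in for a clog lookup: xids below this value count as committed. */
uint32_t toy_first_uncommitted_xid = 0;

bool toy_xid_committed(uint32_t xid)
{
    return xid < toy_first_uncommitted_xid;
}

/* Insert a tuple and maintain the page-level flag. */
bool toy_page_insert(ToyPage *page, uint32_t xmin)
{
    assert(page->ntuples >= 0);
    if (page->ntuples >= TOY_MAX_TUPLES)
        return false;
    if (page->ntuples == 0)
        page->flags |= PD_ALL_SAME_XMIN;            /* first tuple sets it */
    else if (page->tuples[0].xmin != xmin)
        page->flags &= (uint8_t) ~PD_ALL_SAME_XMIN; /* a second xmin clears it */
    page->tuples[page->ntuples].xmin = xmin;
    page->tuples[page->ntuples].hints = 0;
    page->ntuples++;
    return true;
}

/* Eviction-time check. Returns the number of commit-status lookups made. */
int toy_set_hints_on_evict(ToyPage *page)
{
    if (!(page->flags & PD_ALL_SAME_XMIN) || page->ntuples == 0)
        return 0;                       /* mixed page: skip, pay nothing */
    if (page->tuples[0].hints & TOY_XMIN_COMMITTED)
        return 0;                       /* hints already set */
    if (!toy_xid_committed(page->tuples[0].xmin))
        return 1;                       /* one lookup, transaction unresolved */
    for (int i = 0; i < page->ntuples; i++)
        page->tuples[i].hints |= TOY_XMIN_COMMITTED;
    return 1;                           /* one lookup covered the whole page */
}
```

The point of the sketch is that the cost is at most one lookup per page, and pages with mixed xmins never reach the lookup at all.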
On Mar 25, 2011, at 9:52 AM, Merlin Moncure wrote: > Without this bit, the only way to set hint bits going during bufmgr > eviction is to do a visibility check on every tuple, which would > probably be prohibitively expensive. Since OLTP environments would > rarely see this bit, they would not have to pay for the check. IIRC one of the biggest costs is accessing the CLOG, but what if the bufmgr.c/bgwriter didn't use the same CLOG lookup mechanism as backends did? Unlike when a backend is inspecting visibility, it's not necessary for something like bgwriter to know exact visibility as long as it doesn't mark something as visible when it shouldn't. If it uses a different CLOG caching/accessing method that lags behind the real CLOG then the worst-case scenario is that there's a delay on setting hint bits. But getting bgwriter to do this would likely still be a huge win over forcing backends to worry about it. It's also possible that the visibility check itself could be simplified. BTW, I don't think you want to play these games when a backend is evicting a page because you'll be slowing a real backend down. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
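A rough sketch of the lagging-lookup idea, assuming a toy commit log held in a small array. All names (ToyXidStatus, LaggingClog, the refresh function) are hypothetical, and xid wraparound is ignored. The property being illustrated is the one stated above: a stale copy can answer "unknown" for a transaction that has since committed, which only delays the hint, but it never reports a commit that has not happened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SNAP_XIDS 64

/* TOY_XID_UNKNOWN covers both "in progress" and "not covered by the copy". */
typedef enum { TOY_XID_UNKNOWN = 0, TOY_XID_COMMITTED, TOY_XID_ABORTED } ToyXidStatus;

typedef struct {
    uint32_t     horizon;               /* xids >= horizon are not covered */
    ToyXidStatus status[TOY_SNAP_XIDS];
} LaggingClog;

/* Copy the live status array into the private, possibly stale, snapshot. */
void lagging_clog_refresh(LaggingClog *snap, const ToyXidStatus *live,
                          uint32_t live_next_xid)
{
    uint32_t n = live_next_xid < TOY_SNAP_XIDS ? live_next_xid : TOY_SNAP_XIDS;

    assert(snap != 0 && live != 0);
    for (uint32_t i = 0; i < n; i++)
        snap->status[i] = live[i];
    snap->horizon = n;
}

/* Conservative answer: true only if the stale copy already saw the commit. */
bool may_set_committed_hint(const LaggingClog *snap, uint32_t xid)
{
    if (xid >= snap->horizon)
        return false;                   /* newer than the copy: wait */
    return snap->status[xid] == TOY_XID_COMMITTED;
}
```

Because commit and abort are final once recorded, refreshing the copy less often trades hint-bit latency for fewer lookups against the shared structure.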
On Fri, Mar 25, 2011 at 10:34 AM, Jim Nasby <jim@nasby.net> wrote: > On Mar 25, 2011, at 9:52 AM, Merlin Moncure wrote: >> Without this bit, the only way to set hint bits going during bufmgr >> eviction is to do a visibility check on every tuple, which would >> probably be prohibitively expensive. Since OLTP environments would >> rarely see this bit, they would not have to pay for the check. > > IIRC one of the biggest costs is accessing the CLOG, but what if the bufmgr.c/bgwriter didn't use the same CLOG lookup mechanism as backends did? Unlike when a backend is inspecting visibility, it's not necessary for something like bgwriter to know exact visibility as long as it doesn't mark something as visible when it shouldn't. If it uses a different CLOG caching/accessing method that lags behind the real CLOG then the worst-case scenario is that there's a delay on setting hint bits. But getting bgwriter to do this would likely still be a huge win over forcing backends to worry about it. It's also possible that the visibility check itself could be simplified. > > BTW, I don't think you want to play these games when a backend is evicting a page because you'll be slowing a real backend down. Well, I'm not so sure -- as noted above, you only pay for the check when all the records in a page are new, and only once per page, not once per tuple. Basically, only when you are bulk jamming records through the buffers. The amortized cost of the clog lookup is going to be near zero (maybe you could put in a fuse that would get tripped if there weren't enough tuples in the page to justify the check). If you are bulk loading more data than you have shared buffers, then you get zero benefit. However, you might have the makings of a strategy for dealing with hint bit I/O in user land (by breaking up transactions, tweaking shared buffers, etc.). merlin
On 25.03.2011 16:52, Merlin Moncure wrote: > Without this bit, the only way to set hint bits going during bufmgr > eviction is to do a visibility check on every tuple, which would > probably be prohibitively expensive. I don't think the naive approach of scanning all tuples would be too bad, actually. The hint bits only need to be set once, and it'd be bgwriter shouldering the overhead. The problem with setting hint bits when a buffer is evicted is that it doesn't help with the bulk load case. The hint bits can't be set for a bulk load until the load is finished and the transaction commits. Maybe it would still be worthwhile to have bgwriter set hint bits, to reduce I/O caused by hint bit updates in an OLTP workload, but that's not what people usually complain about. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Mar 25, 2011 at 2:32 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 25.03.2011 16:52, Merlin Moncure wrote: >> >> Without this bit, the only way to set hint bits going during bufmgr >> eviction is to do a visibility check on every tuple, which would >> probably be prohibitively expensive. > > I don't think the naive approach of scanning all tuples would be too bad, > actually. The hint bits only need to be set once, and it'd be bgwriter > shouldering the overhead. > > The problem with setting hint bits when a buffer is evicted is that it > doesn't help with the bulk load case. The hint bits can't be set for a bulk > load until the load is finished and the transaction commits. Not the true bulk load case. However, if you can break up a load into multiple transactions and sneak out 10-100 MB of pages into the buffer per transaction, you have a good chance of getting most/all the bits written out correctly before bgwriter eats them up. I was thinking to also teach bgwriter to keep xmin flagged pages in a separate lower priority pool so that it didn't race to them before the transaction had a chance to go in. Long term, I'm imagining more direct transaction control in the backend, either via autonomous transactions, or stored procedures with explicit transaction control, so we don't have to load N gigabytes in a single transaction. > Maybe it would still be worthwhile to have bgwriter set hint bits, to reduce > I/O caused by hint bit updates in an OLTP workload, but that's not what > people usually complain about. Well, if bgwriter does it, you lose the ability to bail out of the clog check via TransactionIdIsCurrentTransactionId, right? If it's done in the bufmgr you at least have a chance to not have to go all the way out. Either way though, you at least have to teach bgwriter to be more cooperative. merlin
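One way to picture the "separate lower priority pool" idea is a victim scan that prefers dirty buffers without the same-xmin flag and only falls back to flagged ones. This is a sketch under that assumption; ToyBuf and toy_pick_buffer_to_write are hypothetical names, and the real bgwriter scan is not structured like this.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool dirty;           /* needs writing */
    bool all_same_xmin;   /* page carries the proposed same-xmin flag */
} ToyBuf;

/*
 * Return the index of the next buffer to write, or -1 if nothing is dirty.
 * Unflagged dirty buffers win; flagged ones are written only when nothing
 * else is available, which gives their transaction more time to commit.
 */
int toy_pick_buffer_to_write(const ToyBuf *bufs, int nbufs)
{
    int fallback = -1;

    assert(nbufs >= 0);
    for (int i = 0; i < nbufs; i++)
    {
        if (!bufs[i].dirty)
            continue;
        if (!bufs[i].all_same_xmin)
            return i;             /* ordinary dirty page: write it first */
        if (fallback < 0)
            fallback = i;         /* remember the first flagged candidate */
    }
    return fallback;
}
```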
On Fri, Mar 25, 2011 at 3:32 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 25.03.2011 16:52, Merlin Moncure wrote: >> >> Without this bit, the only way to set hint bits going during bufmgr >> eviction is to do a visibility check on every tuple, which would >> probably be prohibitively expensive. > > I don't think the naive approach of scanning all tuples would be too bad, > actually. The hint bits only need to be set once, and it'd be bgwriter > shouldering the overhead. I was thinking the same thing. The only thing I'm worried about is whether it'd make the bgwriter less responsive; we already have some issues in that department. > The problem with setting hint bits when a buffer is evicted is that it > doesn't help with the bulk load case. The hint bits can't be set for a bulk > load until the load is finished and the transaction commits. > > Maybe it would still be worthwhile to have bgwriter set hint bits, to reduce > I/O caused by hint bit updates in an OLTP workload, but that's not what > people usually complain about. Yeah. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 25, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 25, 2011 at 3:32 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 25.03.2011 16:52, Merlin Moncure wrote: >>> >>> Without this bit, the only way to set hint bits going during bufmgr >>> eviction is to do a visibility check on every tuple, which would >>> probably be prohibitively expensive. >> >> I don't think the naive approach of scanning all tuples would be too bad, >> actually. The hint bits only need to be set once, and it'd be bgwriter >> shouldering the overhead. > > I was thinking the same thing. The only thing I'm worried about is > whether it'd make the bgwriter less responsive; we already have some > issues in that department. I'd like to experiment on this and see what comes out. If the bgwriter was to be granted the ability to inspect buffers and set hints, it needs to be able to peek in and inspect the buffer itself which it currently doesn't do FWICT. I was thinking about setting a flag in the buffer (BM_HEAP) that gets set by the loader which flags the buffer for later inspection. Is there a simpler way to do this? It may turn out to be a dud, but I'd still like to play with the all visible bit and see how that interacts with data loading, both with and without special bgwriter logic (i'm going to kludge in a crude mechanism to try to prefer non all visible pages). The reason why I like it is the optimization is narrow and the risk of downside is low, although it's up a notch on the complexity level. If you do end up retooling the bgwriter to set hint bits broadly, there are some tricks you can do to reduce the number of useless clog checks you do (that is, you fault through to an in progress transaction). They involve changing the way the scan works, maybe even organizing buffers into multiple priority pools, so it's complicated and has to be done very carefully. I think you guys are correct: the logic belongs in the bgwriter. 
Generally speaking, it looks like the best route to minimizing hint bit pain is to, if at all possible, write them out already set so they don't have to be rewritten later (Stephen's approach of leveraging in-transaction table creation is another way of attempting to do that). merlin
On Mon, Mar 28, 2011 at 9:48 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > I'd like to experiment on this and see what comes out. Great! > If the > bgwriter was to be granted the ability to inspect buffers and set > hints, it needs to be able to peek in and inspect the buffer itself > which it currently doesn't do FWICT. That matches my understanding. > I was thinking about setting a > flag in the buffer (BM_HEAP) that gets set by the loader which flags > the buffer for later inspection. Is there a simpler way to do this? Hmm. That's slightly crufty, but it might be OK. At least, I don't have a better idea. > I think you guys are correct: the logic belongs in the bgwriter. > Generally speaking, it looks like the best route to minimizing hint > bit pain is to if at all possible write them out set so they don't > have to be rewritten later (Stephen's approach to leverage in > transaction table creation is another way of attempting to do that). Yeah. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Mar 28, 2011 at 9:48 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> I was thinking about setting a >> flag in the buffer (BM_HEAP) that gets set by the loader which flags >> the buffer for later inspection. Is there a simpler way to do this? > Hmm. That's slightly crufty, but it might be OK. At least, I don't > have a better idea. The major problem with all of this is that the bgwriter has no idea which buffers contain heap pages. And I'm not convinced it's a good idea to try to let it know that. If we get to the point where bgwriter is trying to do catalog accesses, we are in for a world of pain. (Can you say "modularity violation"? How about "deadlock"?) regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > The major problem with all of this is that the bgwriter has no > idea which buffers contain heap pages. And I'm not convinced it's > a good idea to try to let it know that. If we get to the point > where bgwriter is trying to do catalog accesses, we are in for a > world of pain. (Can you say "modularity violation"? How about > "deadlock"?) How about having a BackgroundPrepareForWriteFunction variable associated with each page the bgwriter might see, which would be a pointer to a function to call (if the variable is not NULL) before writing? The bgwriter would still have no idea what kind of page it was or what the function did.... -Kevin
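The callback idea might look roughly like the sketch below. BackgroundPrepareForWriteFunction is the name suggested in the message; the buffer struct, the heap callback, and the writer loop are hypothetical stand-ins, assuming whoever loads the page installs the pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyBuffer ToyBuffer;

/* Hook called just before a buffer is written; NULL means "nothing to do". */
typedef void (*BackgroundPrepareForWriteFunction)(ToyBuffer *buf);

struct ToyBuffer {
    BackgroundPrepareForWriteFunction prepare_for_write;
    int hints_set;   /* toy stand-in for "hint bits were set on this page" */
    int writes;      /* how many times the buffer was written */
};

/* Hypothetical heap-specific hook, installed by the code that loaded the page. */
void toy_heap_prepare_for_write(ToyBuffer *buf)
{
    buf->hints_set = 1;
}

/* The writer calls the hook blindly; it never learns what kind of page it is. */
void toy_bgwriter_write(ToyBuffer *buf)
{
    assert(buf != NULL);
    if (buf->prepare_for_write != NULL)
        buf->prepare_for_write(buf);
    buf->writes++;
}
```

The writer only sees an opaque pointer, so no catalog access is needed on its side; what the hook itself is allowed to do is a separate question the sketch does not answer.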
On Mon, Mar 28, 2011 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Mar 28, 2011 at 9:48 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> I was thinking about setting a >>> flag in the buffer (BM_HEAP) that gets set by the loader which flags >>> the buffer for later inspection. Is there a simpler way to do this? > >> Hmm. That's slightly crufty, but it might be OK. At least, I don't >> have a better idea. > > The major problem with all of this is that the bgwriter has no idea > which buffers contain heap pages. And I'm not convinced it's a good > idea to try to let it know that. If we get to the point where bgwriter > is trying to do catalog accesses, we are in for a world of pain. > (Can you say "modularity violation"? How about "deadlock"?) Well, that's why Merlin was suggesting having the backends that read the buffers in flag the heap pages as BM_HEAP. Then the background writer can just examine that bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
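A small sketch of the flag-based alternative, assuming a toy buffer descriptor. BM_HEAP is the flag name proposed in the thread; the TOY_ prefixed names and both functions are hypothetical. The backend, which already knows what relation it is reading, records that fact in the descriptor, and the background writer only tests bits.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BM_DIRTY 0x01u
#define TOY_BM_HEAP  0x02u   /* stands in for the BM_HEAP flag from the thread */

typedef struct { unsigned flags; } ToyBufferDesc;

/* Backend side: it knows the relation kind, so it tags the buffer on read. */
void toy_backend_read_page(ToyBufferDesc *desc, bool is_heap_relation)
{
    assert(desc != 0);
    desc->flags = is_heap_relation ? TOY_BM_HEAP : 0u;
}

/* Backend side: any modification marks the buffer dirty. */
void toy_mark_dirty(ToyBufferDesc *desc)
{
    desc->flags |= TOY_BM_DIRTY;
}

/* Writer side: a pure bit test, no catalog lookup involved. */
bool toy_bgwriter_should_inspect(const ToyBufferDesc *desc)
{
    const unsigned want = TOY_BM_DIRTY | TOY_BM_HEAP;

    return (desc->flags & want) == want;
}
```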
On Mon, Mar 28, 2011 at 9:29 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> The major problem with all of this is that the bgwriter has no >> idea which buffers contain heap pages. And I'm not convinced it's >> a good idea to try to let it know that. If we get to the point >> where bgwriter is trying to do catalog accesses, we are in for a >> world of pain. (Can you say "modularity violation"? How about >> "deadlock"?) > > How about having a BackgroundPrepareForWriteFunction variable > associated with each page the bgwriter might see, which would be a > pointer to a function to call (if the variable is not NULL) before > writing? The bgwriter would still have no idea what kind of page it > was or what the function did.... Well, that is much cleaner from abstraction point of view but you lose the ability to adjust scan priority before flushing out the page...I'm assuming by the time this function is called, you've already made the decision to write it out. (maybe priority is necessary and maybe it isn't, but I don't like losing the ability to tune at that level). You could though put a priority inspection facility behind a similar abstraction fence (BackgroundGetWritePriority) though. Maybe that's more trouble than it's worth though. merlin
On Mar 28, 2011, at 9:48 AM, Merlin Moncure wrote: > On Mon, Mar 28, 2011 at 9:29 AM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >>> The major problem with all of this is that the bgwriter has no >>> idea which buffers contain heap pages. And I'm not convinced it's >>> a good idea to try to let it know that. If we get to the point >>> where bgwriter is trying to do catalog accesses, we are in for a >>> world of pain. (Can you say "modularity violation"? How about >>> "deadlock"?) >> >> How about having a BackgroundPrepareForWriteFunction variable >> associated with each page the bgwriter might see, which would be a >> pointer to a function to call (if the variable is not NULL) before >> writing? The bgwriter would still have no idea what kind of page it >> was or what the function did.... > > Well, that is much cleaner from abstraction point of view but you lose > the ability to adjust scan priority before flushing out the page...I'm > assuming by the time this function is called, you've already made the > decision to write it out. (maybe priority is necessary and maybe it > isn't, but I don't like losing the ability to tune at that level). > > You could though put a priority inspection facility behind a similar > abstraction fence (BackgroundGetWritePriority) though. Maybe that's > more trouble than it's worth though. Merlin, does your new work on CLOG caching negate anything in this thread? I think there are some ideas here worth further investigation and want to make sure they don't get lost. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, Apr 5, 2011 at 9:49 AM, Jim Nasby <jim@nasby.net> wrote: > On Mar 28, 2011, at 9:48 AM, Merlin Moncure wrote: >> On Mon, Mar 28, 2011 at 9:29 AM, Kevin Grittner >> <Kevin.Grittner@wicourts.gov> wrote: >>> Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>>> The major problem with all of this is that the bgwriter has no >>>> idea which buffers contain heap pages. And I'm not convinced it's >>>> a good idea to try to let it know that. If we get to the point >>>> where bgwriter is trying to do catalog accesses, we are in for a >>>> world of pain. (Can you say "modularity violation"? How about >>>> "deadlock"?) >>> >>> How about having a BackgroundPrepareForWriteFunction variable >>> associated with each page the bgwriter might see, which would be a >>> pointer to a function to call (if the variable is not NULL) before >>> writing? The bgwriter would still have no idea what kind of page it >>> was or what the function did.... >> >> Well, that is much cleaner from abstraction point of view but you lose >> the ability to adjust scan priority before flushing out the page...I'm >> assuming by the time this function is called, you've already made the >> decision to write it out. (maybe priority is necessary and maybe it >> isn't, but I don't like losing the ability to tune at that level). >> >> You could though put a priority inspection facility behind a similar >> abstraction fence (BackgroundGetWritePriority) though. Maybe that's >> more trouble than it's worth though. > > Merlin, does your new work on CLOG caching negate anything in this thread? I think there's some ideas here worth further investigation and want to make sure they don't get lost. No, it doesn't -- and I plan to work on this independently. The performance tradeoffs here are much more complicated and will require extensive benchmarking to analyze. A process local clog cache, if it can be made to work (and that's by no means certain), is going to affect how this is put together.
In particular, I'd be even more disinclined to adjust scan priority or do anything fancy like that -- and more amenable to checking every tuple. I'm particularly interested in setting the PD_ALL_VISIBLE bit at eviction time if it's available to be set and the page is already dirty. merlin
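A toy illustration of that last idea: check every tuple at eviction time, and set a page-level all-visible flag only if the page is already dirty, so the check never causes an extra write by itself. The struct layout, the oldest_xmin cutoff, and the function name are assumptions made for this sketch; PostgreSQL's real rules for PD_ALL_VISIBLE are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_TUPLES 4

typedef struct { uint32_t xmin; bool xmin_committed; bool deleted; } ToyHeapTuple;

typedef struct {
    bool dirty;          /* page is going to be written anyway */
    bool all_visible;    /* toy stand-in for PD_ALL_VISIBLE */
    int  ntuples;
    ToyHeapTuple tuples[TOY_PAGE_TUPLES];
} ToyHeapPage;

/*
 * Per-tuple check at eviction. Returns true if the flag was newly set.
 * A tuple counts as visible to everyone here when its inserter committed,
 * it is older than oldest_xmin, and it has not been deleted.
 */
bool toy_set_all_visible_on_evict(ToyHeapPage *page, uint32_t oldest_xmin)
{
    assert(page->ntuples >= 0 && page->ntuples <= TOY_PAGE_TUPLES);
    if (!page->dirty || page->all_visible)
        return false;    /* never dirty a clean page just for this */
    for (int i = 0; i < page->ntuples; i++)
    {
        const ToyHeapTuple *tup = &page->tuples[i];

        if (!tup->xmin_committed || tup->deleted || tup->xmin >= oldest_xmin)
            return false;
    }
    page->all_visible = true;
    return true;
}
```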