Thread: really lazy vacuums?
For historical reasons, what we now think of as VACUUM is referred to in some portions of the code as "lazy vacuum", to distinguish it from pre-9.0 VACUUM FULL. As I understand it, VACUUM works like this:

- Scan the relation, accumulating a list of tuples to kill.
- When you get to the end of the relation or when you fill up maintenance_work_mem, scan each index and kill all index entries pointing to those tuples.
- Scan the relation a second time and kill the heap tuples.

I'm wondering if there might be some benefit in having an even lazier type of vacuum that makes only a single scan over the relation and ignores the indexes. If it hits a tuple that could otherwise be killed, it marks it LP_DEAD and defragments the page. If it hits a page with only all-visible tuples, it marks the page PD_ALL_VISIBLE and sets the visibility map bit.

This would be significantly cheaper than what we normally do right now because it wouldn't touch the indexes at all, and it would only scan the heap once rather than twice. The first part is particularly significant for relations where a high percentage of the visibility map bits are already set, because we always scan every index in its entirety even if we only need to kill a handful of tuples, but we use the visibility map to avoid scanning portions of the heap where no dead tuples can exist. A further advantage of this approach is that it is very easy to do incrementally - for example, it'd be perfectly reasonable to scan a 128MB chunk of the relation and then stop, expecting to do the rest later. That's a lot less reasonable with our existing approach, because you have to scan all the indexes in their entirety every time.

On the downside, this approach doesn't get rid of any index tuples, nor does it allow any CTIDs to be reclaimed. But it does reclaim most of the heap space, and it allows index tuples to be reclaimed opportunistically. Also, it gets PD_ALL_VISIBLE bits set, which makes scans cheaper and reduces the cost of later vacuuming.

I'm not quite sure how we'd decide whether to do a "really lazy" vacuum or the kind we do now. The case where this approach wins big is when there are few or no dead tuples. In that case, we do a lot of work looking at the indexes and we don't get much out of it; plus we scan the heap twice instead of just once. If there are a lot of dead tuples, then we have to bite the bullet and do the whole thing. It'd be really nice to find a way to avoid needing to scan the entire index to reclaim the dead tuples, but unless we're willing to assume that we can always refind the relevant index tuples based on the heap tuple (an assumption I believe we have not been willing to make in the past), it doesn't work - and even if we did make that assumption, it's not going to be ideal when cleaning out large numbers of index tuples, because the index I/O will be random rather than sequential as it is currently.

Thoughts? Does this sound at all feasible/useful? Any ideas on how to tune it?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
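A rough sketch of what such a single-pass, index-ignoring scan could look like. The helpers marked "placeholder" are hypothetical, not existing backend functions, and locking, WAL-logging, and error handling are omitted; the page-level macros (PageGetItemId, ItemIdMarkDead, PageRepairFragmentation, and so on) are the real ones.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Hypothetical single-pass "really lazy" vacuum over one chunk of a
 * relation.  It never touches the indexes: dead tuples are only marked
 * LP_DEAD (their line pointers stay reserved, since index entries still
 * point at them), the page is defragmented, and all-visible pages get
 * PD_ALL_VISIBLE plus their visibility map bit.
 */
static void
really_lazy_vacuum_chunk(Relation rel, BlockNumber start, BlockNumber nblocks)
{
    BlockNumber blkno;

    for (blkno = start; blkno < start + nblocks; blkno++)
    {
        Buffer      buf = read_and_lock_buffer(rel, blkno);     /* placeholder */
        Page        page = BufferGetPage(buf);
        OffsetNumber off;
        bool        all_visible = true;
        bool        dirtied = false;

        for (off = FirstOffsetNumber;
             off <= PageGetMaxOffsetNumber(page);
             off = OffsetNumberNext(off))
        {
            ItemId      itemid = PageGetItemId(page, off);

            if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
                continue;

            if (tuple_is_dead_to_all(rel, page, off))           /* placeholder */
            {
                ItemIdMarkDead(itemid);     /* keep the line pointer */
                dirtied = true;
            }
            else if (!tuple_is_visible_to_all(rel, page, off))  /* placeholder */
                all_visible = false;
        }

        if (dirtied)
            PageRepairFragmentation(page);  /* reclaim the tuple space */

        if (all_visible && !PageIsAllVisible(page))
        {
            PageSetAllVisible(page);
            set_visibility_map_bit(rel, blkno);                 /* placeholder */
            dirtied = true;
        }

        release_buffer(buf, dirtied);                           /* placeholder */
    }
}

Because each chunk is self-contained, the loop can stop after any block and pick up again later, which is where the incremental angle comes from.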
Robert Haas <robertmhaas@gmail.com> writes: > I'm not quite sure how we'd decide whether to do a "really lazy" > vacuum or the kind we do now. The case where this approach wins big > is when there are few or no dead tuples. In that case, we do a lot of > work looking at the indexes and we don't get much out of it; plus we > scan the heap twice instead of just once. Um, if there are *no* dead tuples then we don't look at the indexes anyway, except for the vacuum-cleanup pass which I don't think you get to decide you don't need. (You certainly don't get to decide that unilaterally without the index AM's cooperation.) I'm less than convinced that there's much gold to be mined here. regards, tom lane
On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I'm not quite sure how we'd decide whether to do a "really lazy" >> vacuum or the kind we do now. The case where this approach wins big >> is when there are few or no dead tuples. In that case, we do a lot of >> work looking at the indexes and we don't get much out of it; plus we >> scan the heap twice instead of just once. > > Um, if there are *no* dead tuples then we don't look at the indexes > anyway ... But you do still have to scan the heap twice. > except for the vacuum-cleanup pass which I don't think you get > to decide you don't need. (You certainly don't get to decide that > unilaterally without the index AM's cooperation.) I'm less than > convinced that there's much gold to be mined here. I'm not sure about that either, although I'm not sure of the reverse either. But before I invest any time in it, do you have any other good ideas for addressing the "it stinks to scan the entire index every time we vacuum" problem? Or for generally making vacuum cheaper? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 14, 2011 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm not sure about that either, although I'm not sure of the reverse > either. But before I invest any time in it, do you have any other > good ideas for addressing the "it stinks to scan the entire index > every time we vacuum" problem? Or for generally making vacuum > cheaper? You could imagine an index am that instead of scanning the index just accumulated all the dead tuples in a hash table and checked it before following any index link. Whenever the hash table gets too big it could do a sequential scan and prune any pointers to those tuples and start a new hash table. That would work well if there are frequent vacuums finding a few tuples per vacuum. It might even allow us to absorb dead tuples from "retail" vacuums so we could get rid of line pointers earlier. But it would involve more WAL-logged operations and incur an extra overhead on each index lookup. -- greg
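A toy illustration of that shape, using the backend's dynahash API. This is a per-backend, non-crash-safe sketch only; the shared-memory and WAL-logging parts, which as noted above are where the real overhead would come from, are left out entirely.

#include "postgres.h"
#include "storage/itemptr.h"
#include "utils/hsearch.h"

static HTAB *dead_tids = NULL;

static void
dead_tid_table_init(void)
{
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(ItemPointerData);
    ctl.entrysize = sizeof(ItemPointerData);
    ctl.hash = tag_hash;
    dead_tids = hash_create("dead heap TIDs", 1024, &ctl,
                            HASH_ELEM | HASH_FUNCTION);
}

/* VACUUM hands over its dead-TID list instead of forcing a full index scan. */
static void
dead_tid_remember(ItemPointer tid)
{
    bool        found;

    (void) hash_search(dead_tids, tid, HASH_ENTER, &found);
}

/* Index scan path: skip any entry whose heap TID is known to be dead. */
static bool
dead_tid_lookup(ItemPointer tid)
{
    bool        found;

    (void) hash_search(dead_tids, tid, HASH_FIND, &found);
    return found;
}

/*
 * When hash_get_num_entries(dead_tids) crosses some threshold, fall back to
 * one conventional full index scan, remove the matching index tuples, and
 * start over with an empty table.
 */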
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Um, if there are *no* dead tuples then we don't look at the indexes >> anyway ... > But you do still have to scan the heap twice. Seems like that should be fixable ... is the second pass actually going to do anything? regards, tom lane
On Mon, Mar 14, 2011 at 7:40 PM, Greg Stark <gsstark@mit.edu> wrote: > On Mon, Mar 14, 2011 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I'm not sure about that either, although I'm not sure of the reverse >> either. But before I invest any time in it, do you have any other >> good ideas for addressing the "it stinks to scan the entire index >> every time we vacuum" problem? Or for generally making vacuum >> cheaper? > > You could imagine an index am that instead of scanning the index just > accumulated all the dead tuples in a hash table and checked it before > following any index link. Whenever the hash table gets too big it > could do a sequential scan and prune any pointers to those tuples and > start a new hash table. Hmm. For something like a btree, you could also remove each TID from the hash table when you kill the corresponding index tuple. > That would work well if there are frequent vacuums finding a few > tuples per vacuum. It might even allow us to absorb dead tuples from > "retail" vacuums so we could get rid of line pointers earlier. But it > would involve more WAL-logged operations and incur an extra overhead > on each index lookup. Yeah, that seems deeply unfortunate. It's hard to imagine us wanting to go there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
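Continuing the toy sketch above, that would just be a removal from the same hypothetical dead-TID table, hooked into the existing kill-prior-tuple path:

/* When an index scan kills an index tuple, its heap TID no longer needs
 * to be remembered; drop it from the hypothetical dead-TID table. */
static void
dead_tid_forget(ItemPointer tid)
{
    bool        found;

    (void) hash_search(dead_tids, tid, HASH_REMOVE, &found);
}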
On Mon, Mar 14, 2011 at 8:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Um, if there are *no* dead tuples then we don't look at the indexes >>> anyway ... > >> But you do still have to scan the heap twice. > > Seems like that should be fixable ... is the second pass actually > going to do anything? Err, I'm wrong. There's no second pass if there are exactly zero tuples to prune. So the possible problem case is when there are some dead tuples, but not all that many compared to the size of the indexes. In theory, you could be better off just reclaiming free space, and waiting for dead tuples to accumulate before cleaning the indexes. I'm not certain whether that target is wide enough to be worth aiming at, but it seems plausible that at least in some workloads the opportunistic index cleaning stuff would keep index bloat under control, so the only reason to ever make a full scan of the indexes during vacuuming would be to avoid leaking CTIDs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mar 14, 2011, at 2:36 PM, Robert Haas wrote:
> I'm not quite sure how we'd decide whether to do a "really lazy" vacuum or the kind we do now. The case where this approach wins big is when there are few or no dead tuples. In that case, we do a lot of work looking at the indexes and we don't get much out of it; plus we scan the heap twice instead of just once. If there are a lot of dead tuples, then we have to bite the bullet and do the whole thing.
<snip>
> Thoughts? Does this sound at all feasible/useful? Any ideas on how to tune it?

One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).

Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.

From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.

Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby <jim@nasby.net> wrote:
> One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).

I mostly agree with this, but you also can't postpone vacuuming indefinitely just because you're too busy; that's going to blow up in your face.

> Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.

Right. Really-lazy vacuum could freeze tuples. Unlike regular vacuum, it can also sensibly be done incrementally. One thing I was thinking about is counting the number of times that we fetched a tuple that was older than RecentGlobalXmin and had a committed xmin and an invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's happening a lot, it probably means that some vacuuming would speed things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you could work out some formula where you do a variable amount of super-lazy vacuuming depending on the number of such tuple fetches. The trick would be to avoid overdoing it (so that you swamp the I/O system) or underdoing it (so that the system never converges). It would be really nice (for this and for other things) if we had some way of measuring the I/O saturation of the system, so that we could automatically adjust the aggressiveness of background processes accordingly.

Note also that if and when we get index-only scans, making sure the PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get set is going to be a lot more important.

> From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.

I can see that.

> Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.

I'm a bit doubtful about this part.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
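A sketch of where that counter could be bumped, in the heap-fetch path. The macro names are as in the 9.x sources; pgstat_count_missed_allvisible() is a hypothetical statistics counter, not an existing function.

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Called when a heap fetch returns a tuple: if the tuple is visible to
 * everyone (committed xmin older than RecentGlobalXmin, invalid xmax) but
 * its page is not yet PD_ALL_VISIBLE, count it as evidence that some
 * super-lazy vacuuming would pay off.
 */
static void
count_missed_allvisible(Relation rel, Page page, HeapTupleHeader tup)
{
    TransactionId xmin = HeapTupleHeaderGetXmin(tup);

    if (PageIsAllVisible(page))
        return;                 /* nothing to gain on this page */

    if ((tup->t_infomask & HEAP_XMAX_INVALID) &&
        TransactionIdDidCommit(xmin) &&
        TransactionIdPrecedes(xmin, RecentGlobalXmin))
        pgstat_count_missed_allvisible(rel);    /* hypothetical counter */
}

The per-table counter would then feed whatever formula autovacuum uses to decide how much super-lazy work to schedule.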
Robert Haas wrote:
> Right. Really-lazy vacuum could freeze tuples. Unlike regular vacuum, it can also sensibly be done incrementally. One thing I was thinking about is counting the number of times that we fetched a tuple that was older than RecentGlobalXmin and had a committed xmin and an invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's happening a lot, it probably means that some vacuuming would speed things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you could work out some formula where you do a variable amount of super-lazy vacuuming depending on the number of such tuple fetches. The trick would be to avoid overdoing it (so that you swamp the I/O system) or underdoing it (so that the system never converges). It would be really nice (for this and for other things) if we had some way of measuring the I/O saturation of the system, so that we could automatically adjust the aggressiveness of background processes accordingly.
>
> Note also that if and when we get index-only scans, making sure the PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get set is going to be a lot more important.

Is it obvious that the visibility map bits should track complete pages and not individual tuples? If the visibility map tracks at page level, the benefit would fall off on "slim tables" where you squeeze 200 tuples into each page, and having an update rate of 1% would lower the likelihood even more. (It may be that for slim tables the index-only scans are not as beneficial as for wide tables.)

In connection with the vacuuming discussion, I don't know if it is there already, but how about "opportunistic vacuuming"? Say you have a page that, due to changes in one of the tuples, is being written out; will it, while being written out anyway, get the other tuples on the page vacuumed?

It actually doesn't have to hook into the process directly to benefit the IO usage; if it can just get the opportunity to do it before the page gets evicted from the OS cache, then it would save a second read on that page, but it seems way harder to do something sane around that assumption.

Really lazy vacuums would "only" benefit "really static tables", where vacuuming is not that big a problem in the first place?

-- Jesper - Demonstrating a total lack of insight, I would assume.
2011/3/17 Robert Haas <robertmhaas@gmail.com>:
> On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby <jim@nasby.net> wrote:
>> One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).
>
> I mostly agree with this, but you also can't postpone vacuuming indefinitely just because you're too busy; that's going to blow up in your face.
>
>> Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.
>
> Right. Really-lazy vacuum could freeze tuples. Unlike regular vacuum, it can also sensibly be done incrementally. One thing I was thinking about is counting the number of times that we fetched a tuple that was older than RecentGlobalXmin and had a committed xmin and an invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's happening a lot, it probably means that some vacuuming would speed things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you could work out some formula where you do a variable amount of super-lazy vacuuming depending on the number of such tuple fetches. The trick would be to avoid overdoing it (so that you swamp the I/O system) or underdoing it (so that the system never converges). It would be really nice (for this and for other things) if we had some way of measuring the I/O saturation of the system, so that we could automatically adjust the aggressiveness of background processes accordingly.

Yes. I am thinking of something like that (the IO saturation measurement) to let the background writer try to work on hint bits when it does not have so much to do, if IO resources are OK.

> Note also that if and when we get index-only scans, making sure the PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get set is going to be a lot more important.
>
>> From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.
>
> I can see that.
>
>> Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.
>
> I'm a bit doubtful about this part.
-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> Is it obvious that the visibility map bits should track complete pages and not individual tuples? If the visibility map tracks at page level, the benefit would fall off on "slim tables" where you squeeze 200 tuples into each page, and having an update rate of 1% would lower the likelihood even more. (It may be that for slim tables the index-only scans are not as beneficial as for wide tables.)

I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but say it's 200. If you track visibility info per tuple rather than per page, then the size of the visibility map is going to expand by a factor of 200. That might decrease contention, but otherwise it's a bad thing - the whole point of having the visibility map in the first place is that it's much, much smaller than the heap. If it were the same size as the heap, we could just read the heap. What the map attempts to accomplish is to allow us, by reading a small number of pages, to check whether the tuples we're thinking of reading are likely to be all-visible without actually looking at them.

> In connection with the vacuuming discussion, I don't know if it is there already, but how about "opportunistic vacuuming"? Say you have a page that, due to changes in one of the tuples, is being written out; will it, while being written out anyway, get the other tuples on the page vacuumed?

The really lazy kind of vacuuming I'm talking about could be done this way. Regular vacuuming cannot, because you can't actually prune a dead tuple until you've scanned all the indexes for references to that CTID. The obvious place to do this would be the background writer: if the page is dirty anyway and we're about to evict it, we could decide to (1) set hint bits, (2) set visibility map bits, (3) freeze tuples that need freezing, and (4) identify dead tuples and reclaim the space they use, but not the associated line pointer.

Sadly, I'm not sure this would help much. If we have, say, a 4MB relation, you're not even going to notice it when vacuum comes along and does its thing. Even a vacuum freeze is chump change. The problem is with big relations, like say 1GB+. Processing the whole table at once can have a material impact on system performance, so it'd be nice to do some work incrementally. But it's likely that doing it opportunistically as you evict things from shared buffers is only going to help here and there. Even if you optimistically assumed that we could opportunistically do 10% of the vacuuming that way, that's still not much of a dent. And I think it'd probably be less than that in most real-world cases. A further problem is that the background writer is already a pretty busy process, and giving it more things to do isn't very appealing from the standpoint of overall system performance.

> It actually doesn't have to hook into the process directly to benefit the IO usage; if it can just get the opportunity to do it before the page gets evicted from the OS cache, then it would save a second read on that page, but it seems way harder to do something sane around that assumption.

Yeah.

> Really lazy vacuums would "only" benefit "really static tables", where vacuuming is not that big a problem in the first place?
I think the benefit would be tables that are either quite large (where the ability to do this incrementally would be an advantage over regular VACUUM) or insert-only (where we currently have no way to get PD_ALL_VISIBLE bits set without user intervention). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2011-03-17 15:02, Robert Haas wrote:
> On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>> Is it obvious that the visibility map bits should track complete pages and not individual tuples? If the visibility map tracks at page level, the benefit would fall off on "slim tables" where you squeeze 200 tuples into each page, and having an update rate of 1% would lower the likelihood even more. (It may be that for slim tables the index-only scans are not as beneficial as for wide tables.)
>
> I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but say it's 200. If you track visibility info per tuple rather than per page, then the size of the visibility map is going to expand by a factor of 200. That might decrease contention, but otherwise it's a bad thing - the whole point of having the visibility map in the first place is that it's much, much smaller than the heap. If it were the same size as the heap, we could just read the heap. What the map attempts to accomplish is to allow us, by reading a small number of pages, to check whether the tuples we're thinking of reading are likely to be all-visible without actually looking at them.

Yes, that was sort of the math I was trying to make. I do believe, though, that you have a way better feeling about it.

But according to this: http://wiki.postgresql.org/wiki/FAQ#How_much_database_disk_space_is_required_to_store_data_from_a_typical_text_file.3F the bulk row overhead is around 24 bytes, which with 1 bit per row gives a size reduction of 1:(24x8) ~ 1:192, worst case. That gives at best 341 tuples/page (where each tuple does not contain any data at all). With that ratio, the visibility map of a 10GB relation would fill 52MB on disk (still worst case), and that by itself would by all means be awesome. (With tuples that small, a 10GB relation would have around 420 million tuples.)

With 1 bit per page, the "best case" would be 341 times better than the above, reducing the size of the visibility map on a 10GB table to around 152KB, which is extremely small (and thus also awesome). But the consequence of a single update would be that you lose the visibility map benefit on 341 tuples in one shot.

Worst-case situations are where we approach 4 tuples per page, just before we hit TOAST; there the ratio of space reduction for 1 bit per tuple would be 1:(2048x8) ~ 1:16384, and 1 bit per page is 4 times better. With 1 bit per tuple, the visibility map of a 10GB relation would be around 610KB; 1 bit per page would then drop it to around 160KB.

Can we drag out some average-case numbers on row size in the heap from some real production systems?

I may have gotten something hugely wrong in the above calculations and/or have missed some important points.

-- Jesper
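The arithmetic above, written out as a trivial standalone program so the assumptions (10GB heap in decimal units, 8KB blocks, 24-byte versus roughly 2048-byte rows) are explicit; nothing PostgreSQL-specific is involved, and the results land in the same ballpark as the figures quoted above.

#include <stdio.h>

int
main(void)
{
    const double heap_bytes = 10e9;             /* 10GB heap, decimal */
    const double block_size = 8192;             /* default BLCKSZ */
    const double row_bytes[] = {24, 2048};      /* slim rows vs. ~4 rows/page */
    int          i;

    printf("1 bit per page : %.0f KB of visibility map\n",
           heap_bytes / block_size / 8 / 1024);

    for (i = 0; i < 2; i++)
    {
        double ntuples = heap_bytes / row_bytes[i];

        printf("1 bit per tuple, %4.0f-byte rows: %.0f million tuples, %.0f KB of map\n",
               row_bytes[i], ntuples / 1e6, ntuples / 8 / 1024);
    }
    return 0;
}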
On Thu, Mar 17, 2011 at 4:02 PM, Jesper Krogh <jesper@krogh.cc> wrote:
> With 1 bit per page, the "best case" would be 341 times better than the above, reducing the size of the visibility map on a 10GB table to around 152KB, which is extremely small (and thus also awesome). But the consequence of a single update would be that you lose the visibility map benefit on 341 tuples in one shot.

True, but I'm not really sure that matters very much. Keep in mind also that would increase the frequency with which visibility map bits would need to be flipped, which would carry its own costs.

> Worst-case situations are where we approach 4 tuples per page, just before we hit TOAST; there the ratio of space reduction for 1 bit per tuple would be 1:(2048x8) ~ 1:16384, and 1 bit per page is 4 times better. With 1 bit per tuple, the visibility map of a 10GB relation would be around 610KB; 1 bit per page would then drop it to around 160KB.
>
> Can we drag out some average-case numbers on row size in the heap from some real production systems?

Well, unless you went to some significantly more complicated data structure, you'd need to allow room for the maximum number of tuples per page on every page, whether the slots were all in use or not.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mar 16, 2011, at 7:44 PM, Robert Haas wrote:
> It would be really nice (for this and for other things) if we had some way of measuring the I/O saturation of the system, so that we could automatically adjust the aggressiveness of background processes accordingly.

Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that, not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim@nasby.net> wrote:
> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that, not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.

I've done this -- actually better, I used mincore to actually check whether the block was in cache before issuing the read -- but it turns out you can't get what you're looking for this way.

It turns out when you do this you see one block being read from disk followed by n blocks that all appear to be cache hits. Because they've been prefetched by the kernel. What you end up with is actually something like the number of iops, which is also an interesting measure but not really what you were looking for.

My getrusage patch, which I should still dig out though it's rather too late to be committing now unless someone tells me otherwise, would tell you how much i/o a plan node actually did. But you won't know which blocks did the i/o since I was only tracking totals for the plan node. That's probably what you're looking for here.

-- greg
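For reference, the mincore()-based check looks roughly like this on Linux. It is plain mmap()/mincore() code, nothing backend-specific, and it comes with all the caveats above about prefetched blocks showing up as hits.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns 1 if the 8KB block of fd is fully resident in the OS cache,
 * 0 if any part of it is not, -1 on error. */
static int
block_in_cache(int fd, off_t blkno)
{
    long        pagesz = sysconf(_SC_PAGESIZE);
    size_t      npages = (BLCKSZ + pagesz - 1) / pagesz;
    unsigned char vec[BLCKSZ / 512];    /* more entries than we can need */
    void       *map;
    int         resident = 1;
    size_t      i;

    map = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, blkno * BLCKSZ);
    if (map == MAP_FAILED)
        return -1;

    if (mincore(map, BLCKSZ, vec) != 0)
        resident = -1;
    else
    {
        for (i = 0; i < npages; i++)
            if (!(vec[i] & 1))
                resident = 0;
    }

    munmap(map, BLCKSZ);
    return resident;
}

int
main(int argc, char **argv)
{
    int         fd;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file> <blkno>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;
    printf("block %s in cache: %d\n", argv[2],
           block_in_cache(fd, (off_t) atol(argv[2])));
    close(fd);
    return 0;
}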
2011/3/22 Greg Stark <gsstark@mit.edu>:
> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim@nasby.net> wrote:
>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that, not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>
> I've done this -- actually better, I used mincore to actually check whether the block was in cache before issuing the read -- but it turns out you can't get what you're looking for this way.

The Linux fincore() syscall never got into the kernel; maybe something to revive...

> It turns out when you do this you see one block being read from disk followed by n blocks that all appear to be cache hits. Because they've been prefetched by the kernel.

I did the same; I now believe that it is not very important to have the very exact numbers. Prefetched blocks *are* in memory when we request them; the first read access reads more than one block because the cost is the same.

> What you end up with is actually something like the number of iops which is also an interesting measure but not really what you were looking for.
>
> My getrusage patch, which I should still dig out though it's rather too late to be committing now unless someone tells me otherwise, would tell you how much i/o a plan node actually did. But you won't know which blocks did the i/o since I was only tracking totals for the plan node. That's probably what you're looking for here.

Please show us the patch :)

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
> 2011/3/22 Greg Stark <gsstark@mit.edu>:
>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim@nasby.net> wrote:
>>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that, not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>
>> I've done this -- actually better, I used mincore to actually check whether the block was in cache before issuing the read -- but it turns out you can't get what you're looking for this way.
>
> The Linux fincore() syscall never got into the kernel; maybe something to revive...

Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?

>> It turns out when you do this you see one block being read from disk followed by n blocks that all appear to be cache hits. Because they've been prefetched by the kernel.
>
> I did the same; I now believe that it is not very important to have the very exact numbers. Prefetched blocks *are* in memory when we request them; the first read access reads more than one block because the cost is the same.

Yeah... there are places in the planner where we make guesses as to the likelihood of something being in cache. If we could actually track the complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.

And having this info in pg_stats would be extremely valuable.

>> What you end up with is actually something like the number of iops which is also an interesting measure but not really what you were looking for.
>>
>> My getrusage patch, which I should still dig out though it's rather too late to be committing now unless someone tells me otherwise, would tell you how much i/o a plan node actually did. But you won't know which blocks did the i/o since I was only tracking totals for the plan node. That's probably what you're looking for here.
>
> Please show us the patch :)

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
2011/3/24 Jim Nasby <jim@nasby.net>:
> On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
>> 2011/3/22 Greg Stark <gsstark@mit.edu>:
>>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim@nasby.net> wrote:
>>>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that, not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>>
>>> I've done this -- actually better, I used mincore to actually check whether the block was in cache before issuing the read -- but it turns out you can't get what you're looking for this way.
>>
>> The Linux fincore() syscall never got into the kernel; maybe something to revive...
>
> Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?

The fincore() syscall is a shortcut for mmap+mincore calls, suggested by people working on libprefetch; see http://lwn.net/Articles/371538/

The alternative via time measurement is interesting; it should be easy to output both measures in pg_statio_* and see what happens...

>>> It turns out when you do this you see one block being read from disk followed by n blocks that all appear to be cache hits. Because they've been prefetched by the kernel.
>>
>> I did the same; I now believe that it is not very important to have the very exact numbers. Prefetched blocks *are* in memory when we request them; the first read access reads more than one block because the cost is the same.
>
> Yeah... there are places in the planner where we make guesses as to the likelihood of something being in cache. If we could actually track the complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.
>
> And having this info in pg_stats would be extremely valuable.

Yes; also, Robert wrote some interesting items to keep in mind when thinking about that in another thread recently. A FS-cache snapshot, or just a 'percent_in_cache' per relation/file, is easy to do and add to some auto-analyze daemon, *but* making good use of it in the planner is not as trivial as it looks (i.e. without breaking what is working well). Once I get time to add hooks in costsize.c, a simple extension can do the trick (it just needs some shared memory to keep the FS-level stats, and hooks to use them in some places).

>>> What you end up with is actually something like the number of iops which is also an interesting measure but not really what you were looking for.
>>>
>>> My getrusage patch, which I should still dig out though it's rather too late to be committing now unless someone tells me otherwise, would tell you how much i/o a plan node actually did. But you won't know which blocks did the i/o since I was only tracking totals for the plan node. That's probably what you're looking for here.
>>
>> Please show us the patch :)

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
2011/3/24 Jim Nasby <jim@nasby.net>:
> On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
>> 2011/3/22 Greg Stark <gsstark@mit.edu>:
>>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim@nasby.net> wrote:
>>>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that, not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>>
>>> I've done this -- actually better, I used mincore to actually check whether the block was in cache before issuing the read -- but it turns out you can't get what you're looking for this way.
>>
>> The Linux fincore() syscall never got into the kernel; maybe something to revive...
>
> Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?

I made a quick test with time measurement and quickly found the main bottleneck of this strategy: how do you know whether a block was fetched from OS memory, SAN memory, a fast RAID, slow SATA, ...?

I just added a gettimeofday() around the read() call, and adjusted the threshold (some number of ms or µs) used to separate disk fetches from memory fetches. By manually adjusting this duration I get good results, but I wonder how it could be adjusted automatically on other systems; also, the method used for measuring may itself affect the measurement. Maybe use it just to track 'slow' accesses, and define 'slow access' in a GUC...

>>> It turns out when you do this you see one block being read from disk followed by n blocks that all appear to be cache hits. Because they've been prefetched by the kernel.
>>
>> I did the same; I now believe that it is not very important to have the very exact numbers. Prefetched blocks *are* in memory when we request them; the first read access reads more than one block because the cost is the same.
>
> Yeah... there are places in the planner where we make guesses as to the likelihood of something being in cache. If we could actually track the complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.
>
> And having this info in pg_stats would be extremely valuable.
>
>>> What you end up with is actually something like the number of iops which is also an interesting measure but not really what you were looking for.
>>>
>>> My getrusage patch, which I should still dig out though it's rather too late to be committing now unless someone tells me otherwise, would tell you how much i/o a plan node actually did. But you won't know which blocks did the i/o since I was only tracking totals for the plan node. That's probably what you're looking for here.
>>
>> Please show us the patch :)

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
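A minimal version of that measurement, with the threshold as the knob being described; the 200µs cutoff below is only an illustration of such a setting, not a recommended value.

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* The "slow access" knob: reads faster than this are counted as cache
 * hits, slower ones as real I/O.  The value is illustrative only. */
#define SLOW_ACCESS_THRESHOLD_USEC 200

static ssize_t
timed_read(int fd, void *buf, size_t len, int *looks_cached)
{
    struct timeval before, after;
    ssize_t     nread;
    long        elapsed_usec;

    gettimeofday(&before, NULL);
    nread = read(fd, buf, len);
    gettimeofday(&after, NULL);

    elapsed_usec = (after.tv_sec - before.tv_sec) * 1000000L +
        (after.tv_usec - before.tv_usec);
    *looks_cached = (elapsed_usec < SLOW_ACCESS_THRESHOLD_USEC);
    return nread;
}

The classification is only as good as the threshold, which is exactly the calibration problem described above.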
On Thursday, March 24, 2011 06:32:10 PM Jim Nasby wrote: > Is there an equivalent in other OSes? Some have mincore which can be used for that in combination with mmap. Andres