Re: decoupling table and index vacuum - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: decoupling table and index vacuum
Msg-id: CAH2-Wzk_h132i7bg5wD9qhiTj9i_q5GZzVBVvyQ2Fu_2=L=uWQ@mail.gmail.com
In response to: decoupling table and index vacuum (Robert Haas <robertmhaas@gmail.com>)
Responses:
  Re: decoupling table and index vacuum
  Re: decoupling table and index vacuum
List: pgsql-hackers
On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
> We are used to thinking about table vacuum and index vacuum as parts of a
> single, indivisible operation. You vacuum the table -- among other things by
> performing HOT pruning and remembering dead TIDs -- and then you vacuum the
> indexes -- removing the remembered TIDs from the index -- and then you vacuum
> the table some more, setting those dead TIDs unused -- and then you're done.
> And along the way you do some other things too like considering truncation
> that aren't relevant to the point I want to make here. Now, the problem with
> this is that every index has its own needs, which are separate from the needs
> of the tables, as I think Peter Geoghegan and Masahiko Sawada were also
> discussing recently.

I'm very happy to see that you've taken an interest in this work! I believe it's an important area. It's too important to be left to only two contributors. I welcome your participation as an equal partner in the broader project to fix problems with VACUUM.

Masahiko and I have had plenty of ideas about where this could go next -- way too many ideas, in fact. Maybe that kind of partnership sounds unnecessary or at least seems premature, but it turns out that this area is extremely broad and far reaching, if you really think it through -- you end up having to negotiate rather a lot all at once. Apart from anything else, I simply don't have the authority to commit a bunch of stuff that implicitly makes Postgres do things a certain way in a huge number of different subsystems. (Whether or not I'd be right in each case is beside the point.)

My most ambitious goal is finding a way to remove the need to freeze or to set hint bits. I think that we can do this by inventing a new kind of VACUUM just for aborted transactions, which doesn't do index vacuuming. You'd need something like an ARIES-style dirty page table to make this cheap -- so it's a little like UNDO, but not very much. The basic idea is that eagerly cleaning up aborted transactions in an autovacuum worker allows you to broadly assume that most blocks contain definitely-committed heap tuples, or else LP_DEAD stubs (which of course don't contain any XIDs).

You'd still have something like conventional VACUUM, which wouldn't change that much. Freezing is largely implicit, but maybe you freeze tuples the old way if and only if a backend dirties a "known-all-committed" block -- that can still be expensive. The visibility map still has an all-visible bit, but now it also has an all-committed bit (or maybe it's a separate data structure). The combination of all-visible and all-committed is precisely the same as frozen, so you don't need a separate VM bit for that anymore.
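To spell out that last point, here is a minimal sketch of the rule I have in mind (the VM_ALL_COMMITTED flag and every other name here are invented for illustration -- nothing like this exists today):

/*
 * Hypothetical sketch only: there is no all-committed bit in PostgreSQL
 * today. The point is just that "all-visible" plus "all-committed"
 * together can be treated exactly like the page being frozen, so no
 * separate all-frozen bit would be needed.
 */
#include <stdbool.h>
#include <stdint.h>

#define VM_ALL_VISIBLE    0x01   /* bit we already have */
#define VM_ALL_COMMITTED  0x02   /* invented for this sketch */

typedef uint8_t vm_status;

static inline bool
page_is_effectively_frozen(vm_status status)
{
    return (status & VM_ALL_VISIBLE) != 0 &&
           (status & VM_ALL_COMMITTED) != 0;
}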
Notice that this design doesn't change much about our basic approach to transaction management. It just further decouples things. Conventional VACUUMs are now only about garbage collection, and so can be further optimized with that goal in mind. It's much easier to do clever scheduling if VACUUM really only has to do garbage collection.

> Opportunistic index cleanup strategies like kill_prior_tuple and bottom-up
> deletion may work much better for some indexes than others, meaning that you
> could have some indexes that badly need to be vacuumed because they are full
> of garbage, and other indexes on the same table where the opportunistic
> cleanup has worked perfectly and there is no need for vacuuming at all.

I know I say this all the time these days, but it seems worth repeating now: it is a qualitative difference, not a quantitative difference. Bottom-up index deletion will frequently stop most indexes on a table from growing by even one single block, while indexes that cannot use the optimization (indexes that are logically modified by UPDATE statements) might be hugely bloated. If this is the case during one VACUUM operation, it's probably going to work like that with all future VACUUM operations. It's abundantly clear that the current quantitative approach just cannot be pushed much further.

> But, as things stand today, strategy options to deal with such situations are
> limited. Leaving aside what the code actually does right now, let's talk about
> what options we have in theory with the technology as it exists now. They
> basically all boil down to stopping early and then redoing the work later. We
> must always start with a pass over the heap to collect dead TIDs, because
> otherwise there's nothing else to do. Now we can choose to stop, but then the
> next VACUUM will have to collect all those TIDs again. It may get to skip more
> all-visible pages than the current vacuum did, but the pages that still have
> dead TIDs will all have to be visited again. If we don't stop, then we can
> choose to vacuum all of the indexes or just some of them, and then afterwards
> stop. But if we do this, the next VACUUM will have to scan all indexes again
> for the same TIDs. Here, we don't even have the visibility map to allow
> skipping known-clean pages, so it's *really* a lot of work we have to redo.
> Thus what we normally do is press on to the third step, where we mark dead
> line pointers unused after scanning every index in its entirety, and now
> they're gone and we don't have to worry about them again. Barring emergency
> escape valves, as things stand today, the frequency of table vacuuming is the
> same as the frequency of index vacuuming, even though the *required* frequency
> of vacuuming is not the same, and also varies from index to index.

I'm 100% in agreement about all of this.

> Now, the reason for this is that when we discover dead TIDs, we only record
> them in memory, not on disk. So, as soon as VACUUM ends, we lose all knowledge
> of what those TIDs were and must rediscover them. Suppose we didn't do this,
> and instead had a "dead TID" fork for each table.

I had a similar idea myself recently -- clearly remembering the TIDs that you haven't vacuumed to save work later on makes a lot of sense. I didn't get very far with it, even in my own head, but I like the direction you're taking it. Having it work a little like a queue makes a lot of sense.

> Suppose further that this worked like a conveyor belt, similar to WAL, where
> every dead TID we store into the fork is assigned an identifying 64-bit number
> that is never reused. Then, suppose that for each index, we store the number
> of the oldest entry that might still need to be vacuumed from the index. Every
> time you perform what we now call the first heap pass of a VACUUM, you add the
> new TIDs you find to the dead TID fork.

Maybe we can combine this known-dead-tid structure with the visibility map. Index-only scans might be able to reason about blocks that are mostly all-visible, but still have stub LP_DEAD line pointers that this other structure knows about -- so you can get index-only scans without requiring a full round of traditional vacuuming.
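Roughly the shape I have in mind, as a purely hypothetical sketch -- none of these functions or flags exist, and the real bookkeeping would surely be more subtle:

/*
 * Hypothetical: assume the visibility map can say "all-visible apart from
 * stub LP_DEAD items", and that the dead-TID structure can answer point
 * lookups cheaply. Invented names throughout.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct TidRef { uint32_t block; uint16_t offset; } TidRef;

extern bool vm_all_visible_except_known_dead(uint32_t block);
extern bool dead_tid_structure_contains(TidRef tid);
extern bool heap_fetch_and_check_visibility(TidRef tid);

/* Can an index-only scan return this index entry without a heap fetch? */
static bool
index_only_entry_is_returnable(TidRef tid)
{
    if (vm_all_visible_except_known_dead(tid.block))
    {
        /*
         * The page is all-visible apart from stub line pointers that the
         * dead-TID structure remembers; filter those out here rather than
         * demanding a full round of heap and index vacuuming first.
         */
        return !dead_tid_structure_contains(tid);
    }

    /* Otherwise visit the heap, as index-only scans must today. */
    return heap_fetch_and_check_visibility(tid);
}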
Maybe there is some opportunity like that, but not sure how to fit it in to everything else right now.

> Every time you vacuum an index, the TIDs that need to be removed are those
> between the oldest-entry pointer for that index and the current end of the TID
> fork. You remove all of those and then advance your oldest-entry pointer
> accordingly. If that's too many TIDs to fit in maintenance_work_mem, you can
> just read as many as will fit and advance your oldest-entry pointer less far.
> Every time you perform what we now call the second heap pass of a VACUUM, you
> find all the TIDs that precede every index's oldest-entry pointer and set them
> unused. You then throw away the associated storage at the OS level. This
> requires a scheme where relations can be efficiently truncated from the
> beginning rather than only at the end, which is why I said "a conveyor belt"
> and "similar to WAL". Details deliberately vague since I am just brainstorming
> here.

This amounts to adding yet more decoupling -- which seems great to me. Anything that gives us the option but not the obligation to perform work either more lazily or more eagerly (whichever makes sense) seems highly desirable to me. Especially if we can delay our decision until the last possible point, when we can have relatively high confidence that we know what we're doing. And especially if holding it open as an option is pretty cheap (that's the point of remembering dead TIDs).
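To make sure I'm following the bookkeeping, here is how I picture it -- just a sketch with invented names, not a claim about how a real patch would structure it:

/*
 * Sketch of the conveyor-belt bookkeeping only. Every dead TID appended to
 * the fork gets a 64-bit sequence number that is never reused; each index
 * just remembers the oldest sequence number it still needs.
 */
#include <stdint.h>

typedef struct DeadTidFork
{
    uint64_t next_seqno;      /* next identifier to hand out; never reused */
} DeadTidFork;

typedef struct IndexVacState
{
    uint64_t oldest_needed;   /* oldest fork entry not yet removed from
                               * this particular index */
} IndexVacState;

/* First heap pass: newly found dead TIDs are appended and numbered. */
static uint64_t
fork_append(DeadTidFork *fork, uint64_t ntids)
{
    uint64_t first = fork->next_seqno;

    fork->next_seqno += ntids;
    return first;
}

/*
 * After vacuuming one index for the entries in [oldest_needed, upto),
 * record how far it got. "upto" can stop short of the current end of the
 * fork if maintenance_work_mem can't hold everything.
 */
static void
index_vacuumed_up_to(IndexVacState *idx, uint64_t upto)
{
    if (upto > idx->oldest_needed)
        idx->oldest_needed = upto;
}

/*
 * Second heap pass / truncation: TIDs below every index's oldest_needed
 * can be set LP_UNUSED in the heap, and that prefix of the fork thrown
 * away at the OS level.
 */
static uint64_t
fork_truncation_point(const IndexVacState *indexes, int nindexes)
{
    uint64_t min = UINT64_MAX;

    for (int i = 0; i < nindexes; i++)
        if (indexes[i].oldest_needed < min)
            min = indexes[i].oldest_needed;
    return min;
}

The nice property is that the heap-side truncation point is just the minimum of the per-index pointers, so any one index can fall arbitrarily far behind without forcing extra work on the others.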
> Furthermore, all of these operations can start in any order, and any of them
> can be repeated any number of times during a single run of any particular
> other one, or indeed, without that particular one ever being run at all. Both
> heap phases can also now be done in smaller chunks, if desired.
>
> But is this worthwhile? I think it depends a lot on what you think the
> comparative required frequencies are for the various operations.

There is a risk that you'll never think that any optimization is worth it because each optimization seems marginal in isolation. Sometimes a diversity of strategies is the real strategy. Let's say you have a bunch of options that you can pick and choose from, with fallbacks and with ways of course correcting even halfway through the VACUUM. It's possible that that will work wonderfully well for a given complex user workload, but if you subtract away *any one* of the strategies suddenly things get much worse in some obvious high-level way.

It's entirely possible for a single table to have different needs in different parts of the table. Certainly works that way with indexes -- that much I can say for sure.

> If index A needs to be vacuumed every 40 minutes and index B needs to be
> vacuumed every 55 minutes and the table that owns both of them needs to be
> vacuumed every 70 minutes, I am not sure there is a whole lot here. I think
> you will be pretty much equally well off if you just do what we do today
> every 40 minutes and call it good.

That's probably all true, but I think that an excellent heuristic is to work hard to avoid really terrible outcomes, rather than trying hard to get good outcomes. The extremes don't just matter -- they may even be the only thing that matters. If index A needs to be vacuumed about as frequently as index B anyway, then the user happens to naturally be in a position where the current simplistic scheduling works for them. Which is fine, as far as it goes, but that's not really where we have problems. If you consider the "qualitative, not quantitative" perspective, things change.

It's now pretty unlikely that all of the indexes on the same table will have approximately the same needs -- except when there is very little to do with indexes anyway, which is not an interesting case -- because workloads generally don't logically modify all indexed columns within each UPDATE. They just don't tend to look like that in practice.

> Also, you will not benefit very much if the driving force is reclaiming dead
> line pointers in the table itself. If that has to happen frequently, then the
> indexes have to be scanned frequently, and this whole thing is a lot of work
> for not much. But, maybe that's not the case. Suppose index A needs to be
> vacuumed every hour to avoid bloat, index B needs to be vacuumed every 4 hours
> to avoid bloat, and the table needs dead line pointers reclaimed every 5.5
> hours. Well, now you can gain a lot. You can vacuum index A frequently while
> vacuuming index B only as often as it needs, and you can reclaim dead line
> pointers on their own schedule based on whatever index vacuuming was already
> done for bloat avoidance. Without this scheme, there's just no way to give
> everybody what they need without some of the participants being "dragged along
> for the ride" and forced into work that they don't actually need done simply
> because "that's how it works."

Right. And, the differences between index A and index B will tend to be pretty consistent and often much larger than this. Many indexes would never have to be vacuumed, even with non-HOT UPDATEs, due to bottom-up index deletion -- because they literally won't even have one single page split for hours, while maybe one index gets 3x larger in the same timeframe. Eventually you'll need to vacuum the indexes all the same (not just the bloated index), but that's only required to enable safely performing heap vacuuming. It's not so bad if the non-bloated indexes won't be dirtied and if it's not so frequent (dirtying pages is the real cost to keep under control here).
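In other words, something like this -- hypothetical again, with an invented struct and threshold, reusing the per-index pointer idea from the earlier sketch just to show the asymmetry:

/*
 * Hypothetical per-index scheduling. Index vacuuming is driven by each
 * index's own bloat; reclaiming dead line pointers in the heap is a
 * separate decision that only needs every index to have caught up past
 * the TIDs we want to set LP_UNUSED.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct IndexVacNeeds
{
    double   growth_since_last_vacuum;  /* e.g. 3.0 means tripled in size */
    uint64_t oldest_needed;             /* conveyor-belt pointer, as above */
} IndexVacNeeds;

/* Vacuum an index when *it* is bloated, not when some other index is. */
static bool
index_vacuum_due(const IndexVacNeeds *idx, double bloat_threshold)
{
    return idx->growth_since_last_vacuum >= bloat_threshold;
}

/*
 * Heap line pointer reclamation becomes possible once every index has been
 * vacuumed past the TIDs in question -- which may have happened purely as
 * a side effect of bloat-driven index vacuuming.
 */
static bool
heap_lp_reclaim_possible(const IndexVacNeeds *indexes, int nindexes,
                         uint64_t reclaim_upto)
{
    for (int i = 0; i < nindexes; i++)
        if (indexes[i].oldest_needed < reclaim_upto)
            return false;
    return true;
}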
> One thing I don't know is whether the kind of scenario that I describe above
> is common, i.e. is the main reason we need to vacuum to control index bloat,
> where this kind of approach seems likely to help, or is it to reclaim dead
> line pointers in the heap, where it's not? I'd be interested in hearing from
> people who have some experience in this area, or at least better intuition
> than I do.

The paradox here is:

1. Workload characteristics are important and must be exploited to get optimal performance.

2. Workloads are too complicated and unpredictable to ever truly understand.

Roughly speaking, the solution that I think has the most promise is to make it okay for your heuristics to be wrong. You do this by keeping the costs simple, fixed and low, and by doing things that have multiple benefits (not just one). This is why it's important to give VACUUM a bunch of strategies that it can choose from and switch back and forth from, with minimal commitment -- you let VACUUM figure out what to do about the workload through trial and error. It has to try and fail on occasion -- you must be willing to pay the cost of negative feedback (though the cost must be carefully managed). This approach is perhaps sufficient to cover all of the possible extremes with all workloads. I think that the extremes are where our problems all are, or close to it. The cost shouldn't be terribly noticeable because you have the flexibility to change your mind at the first sign of an issue.

So you never pay an extreme cost (you pay a pretty low fixed cost incrementally, at worst), but you do sometimes (and maybe even often) get an extreme benefit -- the benefit of avoiding current pathological performance issues. We know that the cost of bloat is very non-linear in a bunch of ways that can be pretty well understood, so that seems like the right thing to focus on -- this is perhaps the only thing that we can expect to understand with a relatively high degree of confidence. We can live with a lot of uncertainty about what's going on with the workload by managing it continually, ramping up and down, etc.

> Clearly, we do not want to vacuum each partition by scanning the 1GB partition
> + the 50MB local index + the 50GB global index. That's insane. With the above
> system, since everything's decoupled, you can vacuum the partition tables
> individually as often as required, and similarly for their local indexes, but
> put off vacuuming the global index until you've vacuumed a bunch, maybe all,
> of the partitions, so that the work of cleaning up the global index cleans up
> dead TIDs from many or all partitions instead of just one at a time.

I can't think of any other way of sensibly implementing global indexes.

> Now, the fly in the ointment here is that this supposes that we don't get
> forced into vacuuming the global index too quickly because of dead line
> pointer accumulation. But, I think if that does happen, with careful
> scheduling, we might not really be worse off than we would have been without
> partitioning. If we scan the table for just one partition and, say, exhaust
> maintenance_work_mem, we have some expensive index vacuuming to do
> immediately, but that would've also happened in pretty much the same way with
> an unpartitioned table.

But you can at least drop the partitions with a global index. It shouldn't be too hard to make that work without breaking things.

> It's probably tricky to get the autovacuum algorithm right here, but there
> seems to be some room for optimism.

I think that it's basically okay if global indexes suck when you do lots of UPDATEs -- this is a limitation that users can probably live with. What's not okay is if they suck when you do relatively few UPDATEs. It's especially not okay to have to scan the global index to delete one index tuple that points to one LP_DEAD item, since you tend to get a tiny number of LP_DEAD items even when the DBA bends over backwards to make all UPDATEs use HOT. Getting that to happen 99%+ of the time is so much easier than getting it to happen 100% of the time. There can be enormous asymmetry with this stuff.

Long term, I see VACUUM evolving into something that can only be configured in an advisory way. It's too hard to tune this stuff because what we really want here is to structure many things as an optimization problem, and to have a holistic view that considers how the table changes over time -- over multiple VACUUM operations. We can safely be very lazy if we have some basic sense of proportion about what the risk is. For example, maybe we limit the number of newly dirtied pages during VACUUM by being lazy about pruning pages that don't happen to be dirty when encountered within VACUUM. We still have some sense of how much work we've put off, so as to never get in over our head with debt. We might also have a sense of how many dirty pages in total there are in the system currently -- maybe if the DB is not busy right now we can be extra aggressive. In short, we apply holistic thinking.
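As a strawman, the bookkeeping for that last example might be as simple as this (invented names and fields, purely illustrative -- not a proposal for actual code):

/*
 * Hypothetical "don't dirty too many pages" budget for VACUUM. Pruning a
 * page that is already dirty costs no extra write, so just do it;
 * otherwise only spend from a budget of newly dirtied pages, and remember
 * how much pruning was deferred so we never get in over our head with debt.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct VacuumDirtyBudget
{
    uint64_t newly_dirtied_limit;   /* clean pages we are allowed to dirty */
    uint64_t newly_dirtied;         /* dirtied so far in this VACUUM */
    uint64_t pruning_deferred;      /* prunable pages we chose to skip */
} VacuumDirtyBudget;

static bool
vacuum_should_prune_page(VacuumDirtyBudget *budget, bool page_already_dirty,
                         bool page_is_prunable)
{
    if (!page_is_prunable)
        return false;

    if (page_already_dirty)
        return true;                /* pruning is effectively free here */

    if (budget->newly_dirtied < budget->newly_dirtied_limit)
    {
        budget->newly_dirtied++;    /* spend budget to dirty a clean page */
        return true;
    }

    budget->pruning_deferred++;     /* stay lazy, but track the debt */
    return false;
}

The deferred count (plus some system-wide sense of how many dirty pages exist right now) is what would let a later VACUUM, or a quieter moment, decide to be more aggressive.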
> Even if global indexes never happened, though, I think this could have other
> benefits. For example, the wraparound failsafe mechanism recently added by
> Masahiko Sawada and Peter Geoghegan bypasses index vacuuming when wraparound
> danger is imminent. The only problem is that making that decision means
> throwing away the accumulated list of dead TIDs, which then need to be
> rediscovered whenever we get around to vacuuming the indexes. But that's
> avoidable, if they're stored on disk rather than in RAM.

Yeah, that's not ideal.

> One rather serious objection to this whole line of attack is that we'd ideally
> like VACUUM to reclaim disk space without using any more, in case a shortage
> of disk space is the motivation for running VACUUM in the first place. A
> related objection is that if it's sometimes agreeable to do everything all at
> once as we currently do, the I/O overhead could be avoided.

Of course it's possible that what we currently do will be optimal. But it's pretty much a question of mostly-independent things all going the same way. So I expect that it will be rare.

> I think we'd probably have to retain a code path that buffers the dead TIDs in
> memory to account, at least, for the low-on-disk-space case, and maybe that
> can also be used to avoid I/O in some other cases, too. I haven't thought
> through all the details here. It seems to me that the actual I/O avoidance is
> probably not all that much - each dead TID is much smaller than the deleted
> tuple that gave rise to it, and typically you don't delete all the tuples at
> once - but it might be material in some cases, and it's definitely material if
> you don't have enough disk space left for it to complete without error.

All true.

--
Peter Geoghegan