Re: decoupling table and index vacuum - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: decoupling table and index vacuum
Msg-id: 20210422200118.ag3cnhtyfuvztdoo@alap3.anarazel.de
In response to: Re: decoupling table and index vacuum (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2021-04-22 12:15:27 -0400, Robert Haas wrote:
> On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm not sure that's the only way to deal with this. While some form of
> > generic "conveyor belt" infrastructure would be a useful building block,
> > and it'd be sensible to use it here if it existed, it seems feasible to
> > store dead tids in a different way here. You could e.g. have
> > per-heap-vacuum files with a header containing LSNs that indicate the
> > age of the contents.
>
> That's true, but I have some reservations about being overly reliant on
> the filesystem to provide structure here. There are good reasons to be
> worried about bloating the number of files in the data directory. Hmm,
> but maybe we could mitigate that. First, we could skip this for small
> relations. If you can vacuum the table and all of its indexes using
> the naive algorithm in <10 seconds, you probably shouldn't do anything
> fancy. That would *greatly* reduce the number of additional files
> generated. Second, we could forget about treating them as separate
> relation forks and make them some other kind of thing entirely, in a
> separate directory.

I'm not *too* worried about this issue. IMO the big difference to the
cost of additional relation forks is that such files would only exist
when the table is modified to a somewhat meaningful degree. IME the
practical issues with the number of files due to forks are cases where a
huge number of tables exist that are practically never modified.

That's not to say that I am sure that some form of "conveyor belt"
storage *wouldn't* be the right thing. How were you thinking of dealing
with the per-relation aspects of this? One conveyor belt per relation?

> especially if we adopted Sawada-san's proposal to skip WAL logging. I
> don't know if that proposal is actually a good idea, because it
> effectively adds a performance penalty when you crash or fail over,
> and that sort of thing can be an unpleasant surprise.
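To make the per-heap-vacuum file idea above concrete, here is a minimal sketch (purely illustrative - the field names and on-disk layout are my own invention, not anything from the patch) of a dead-tid file whose header carries the LSN range covering its contents, so a reader can judge how stale the file is:

```python
import struct

# Hypothetical on-disk layout: a fixed header followed by packed TIDs.
# All names and sizes here are invented for illustration only.
HEADER_FMT = "<QQI"  # start_lsn (u64), end_lsn (u64), ntids (u32)
TID_FMT = "<IH"      # block number (u32), offset number (u16)

def write_deadtid_file(path, start_lsn, end_lsn, tids):
    """Write dead TIDs with an LSN range indicating the contents' age."""
    with open(path, "wb") as f:
        f.write(struct.pack(HEADER_FMT, start_lsn, end_lsn, len(tids)))
        for block, offset in tids:
            f.write(struct.pack(TID_FMT, block, offset))

def read_deadtid_file(path):
    """Read the header and TID list back; a consumer could compare the
    LSN range against the current horizon to decide whether the
    contents are still usable."""
    with open(path, "rb") as f:
        hdr = f.read(struct.calcsize(HEADER_FMT))
        start_lsn, end_lsn, ntids = struct.unpack(HEADER_FMT, hdr)
        tid_size = struct.calcsize(TID_FMT)
        tids = [struct.unpack(TID_FMT, f.read(tid_size))
                for _ in range(ntids)]
    return start_lsn, end_lsn, tids
```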
> But it's something to think about.

I'm doubtful about skipping WAL logging entirely - I'd have to think
harder about it, but I think that'd mean we'd restart from scratch after
crashes / immediate restarts as well, because we couldn't rely on the
contents of the "dead tid" files to be accurate. In addition to the
replication issues you mention.

> > One thing that you didn't mention so far is that this'd allow us to
> > add dead TIDs to the "dead tid" file outside of vacuum too. In some
> > workloads most of the dead tuple removal happens as part of on-access
> > HOT pruning. While some indexes are likely to see that via the
> > killtuples logic, others may not. Being able to have more aggressive
> > index vacuum for the one or two bloated indexes, without needing to
> > rescan the heap, seems like it'd be a significant improvement.
>
> Oh, that's a very interesting idea. It does impose some additional
> requirements on any such system, though, because it means you have to
> be able to efficiently add single TIDs. For example, you mention a
> per-heap-VACUUM file above, but you can't get away with creating a new
> file per HOT prune no matter how you arrange things at the FS level.

I agree that it'd be an issue, even though I think it's not too common
that only one tuple gets pruned. It might be possible to have a
per-relation file per backend or such... But yes, we'd definitely have
to think about it.

I've previously pondered adding some cross-page batching and deferring
of HOT pruning in the read case, which I guess might be more
advantageous with this. The main reason for thinking about batching &
deferring of HOT pruning is that I found during the AIO work that there
are speed gains to be had if we pad xlog pages instead of partially
filling them - obviously risking increased WAL usage. One idea to reduce
the cost of that was to fill the padded space with actually useful
things, like FPIs or HOT pruning records.
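The "per-relation file per backend or such" idea - amortizing single-TID additions from HOT pruning rather than touching shared storage per prune - could look roughly like this sketch (my own illustration, all names invented):

```python
class DeadTidBatcher:
    """Illustrative per-backend accumulator: single TIDs produced by
    HOT pruning are buffered locally and appended to the shared
    dead-tid storage in batches, so no file creation or shared-state
    update happens per individual prune."""

    def __init__(self, flush_fn, batch_size=1024):
        self.flush_fn = flush_fn    # e.g. appends one sorted run to storage
        self.batch_size = batch_size
        self.buffer = []

    def add(self, block, offset):
        self.buffer.append((block, offset))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # Sort before flushing so each batch forms a sorted run,
            # which keeps later merging cheap.
            self.flush_fn(sorted(self.buffer))
            self.buffer.clear()
```

A backend would call `add()` from pruning and `flush()` at transaction end (or when vacuum wants to consume the data); the trade-off is that recently pruned TIDs become visible to index vacuuming only after a flush.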
A related speedup opportunity with AIO is to perform useful work while
waiting for WAL flushes during commit (i.e. after initiating IO to flush
the commit record, but before that IO has completed).

> Actually, though, I think the big problem here is deduplication. A
> full-blown VACUUM can perhaps read all the already-known-to-be-dead
> TIDs into some kind of data structure and avoid re-adding them, but
> that's impractical for a HOT prune.

What is there to deduplicate during HOT pruning? It seems that HOT
pruning would need to log all items that it marks dead, but nothing
else? And VACUUM can't yet have put those items onto the dead tuple map,
because they weren't dead yet?

This actually brings a question I vaguely had to the fore: How are you
assuming indexes would access the list of dead tids? As far as I can see
the on-disk data would not be fully sorted even without adding things
during HOT pruning - the dead tids from a single heap pass will be, but
there'll be tids from multiple passes, right? Are you assuming that we'd
read the data into memory and then merge-sort between each of the
pre-sorted "runs"? Or that we'd read and cache parts of the on-disk data
during index checks?

> > Have you thought about how we would do the scheduling of vacuums for
> > the different indexes? We don't really have useful stats for the
> > number of dead index entries to be expected in an index. It'd not be
> > hard to track how many entries are removed in an index via
> > killtuples, but e.g. estimating how many dead entries there are in a
> > partial index seems quite hard (at least without introducing
> > significant overhead).
>
> No, I don't have any good ideas about that, really. Partial indexes
> seem like a hard problem, and so do GIN indexes or other kinds of
> things where you may have multiple index entries per heap tuple. We
> might have to accept some known-to-be-wrong approximations in such
> cases.

The GIN case seems a bit easier than the partial index case.
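The merge-sort-between-runs option mentioned above can be sketched as a streaming k-way merge, assuming each heap pass (or flushed batch) leaves behind a sorted run. This is my own illustration of the access pattern, not code from the thread; Python's `heapq.merge` does the k-way merge lazily, so the runs never need to be fully in memory at once:

```python
import heapq

def merged_dead_tids(runs):
    """Streaming k-way merge of pre-sorted dead-tid runs (one run per
    heap pass), yielding a globally sorted, de-duplicated TID stream
    that an index vacuum pass could consume sequentially."""
    prev = None
    for tid in heapq.merge(*runs):
        if tid != prev:          # drop duplicates appearing across runs
            yield tid
            prev = tid
```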
Keeping stats about the number of new entries in a GIN index doesn't
seem too hard, nor does tracking the number of cleaned-up index entries.
But knowing which indexes are affected when a heap tuple becomes dead
seems harder. I guess we could just start doing a stats-only version of
ExecInsertIndexTuples() for deletes, but obviously the cost of that is
not enticing. Perhaps it'd not be too bad if we only did it when there's
an index with predicates?

> > > One rather serious objection to this whole line of attack is that
> > > we'd ideally like VACUUM to reclaim disk space without using any
> > > more, in case the motivation for running VACUUM in the first place
> > > is that we're running out of disk space.
> >
> > I suspect we'd need a global limit of space used for this data. If
> > above that limit we'd switch to immediately performing the work
> > required to remove some of that space.
>
> I think that's entirely the wrong approach. On the one hand, it
> doesn't prevent you from running out of disk space during emergency
> maintenance, because the disk overall can be full even though you're
> below your quota of space for this particular purpose. On the other
> hand, it does subject you to random breakage when your database gets
> big enough that the critical information can't be stored within the
> configured quota.

What random breakage are you thinking of? I'm not thinking of a hard
limit that may not be crossed at any cost, by even a single byte, but
that [auto]VACUUMs would start to be more aggressive about performing
index vacuums once the limit is reached.

> I think we'd end up with pathological cases very much like what used
> to happen with the fixed-size free space map. What happened there was
> that your database got big enough that you couldn't track all the free
> space any more and it just started bloating out the wazoo. What would
> happen here is that you'd silently lose the well-optimized version of
> VACUUM when your database gets too big. That does not seem like
> something anybody wants.
I don't think the consequences would really be that comparable. Once the
FSM size limit was reached in the bad old days, we'd just lose track of
free space. Whereas here we'd start to be more aggressive about cleaning
up once the "dead tids" data reaches a certain size. Of course that
would have efficiency impacts, but I think "global free space wasted" is
a valid input in deciding when to perform index vacuums. I think
max_wal_size has worked out pretty well, even if not perfect.

Greetings,

Andres Freund
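The soft-limit behavior described above - not a hard cap, but a signal that makes [auto]vacuum increasingly eager about index vacuums, analogous to how checkpoints become more eager as WAL approaches max_wal_size - might be sketched like this (an illustrative policy of my own, all names invented):

```python
def index_vacuum_urgency(deadtid_bytes, soft_limit_bytes):
    """Illustrative soft-limit policy: below the limit, index vacuuming
    proceeds on its normal schedule (urgency 0.0); past it, urgency
    ramps up smoothly toward 1.0 instead of anything hard-failing."""
    if deadtid_bytes <= soft_limit_bytes:
        return 0.0                                   # normal scheduling
    overshoot = deadtid_bytes - soft_limit_bytes
    return min(1.0, overshoot / soft_limit_bytes)    # ramp up, capped
```

A scheduler could use the returned urgency to shorten naptimes or to pick the most bloated indexes first, so exceeding the limit degrades scheduling gradually rather than breaking anything outright.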