should vacuum's first heap pass be read-only? - Mailing list pgsql-hackers
From: Robert Haas
Subject: should vacuum's first heap pass be read-only?
Date:
Msg-id: CA+TgmoY18RzQqDm2jE2WDkiA8ngTEDHp7uLtHb3a-ABs+wbY_g@mail.gmail.com
Responses: Re: should vacuum's first heap pass be read-only?
           Re: should vacuum's first heap pass be read-only?
List: pgsql-hackers
VACUUM's first pass over the heap is implemented by a function called lazy_scan_heap(), while the second pass is implemented by a function called lazy_vacuum_heap_rel(). This seems to imply that the first pass is primarily an examination of what is present, while the second pass does the real work. This used to be more true than it now is. In PostgreSQL 7.2, the first release that implemented concurrent vacuum, the first heap pass could set hint bits as a side effect of calling HeapTupleSatisfiesVacuum(), and it could freeze old xmins. However, neither of those things wrote WAL, and you had a reasonable chance of escaping without dirtying the page at all. By the time PostgreSQL 8.2 was released, it had been understood that making critical changes to pages without writing WAL was not a good plan, and so freezing now wrote WAL, but no big deal: most vacuums wouldn't freeze anything anyway.

Things really changed a lot in PostgreSQL 8.3. With the addition of HOT, lazy_scan_heap() was made to prune the page, meaning that the first heap pass would likely dirty a large fraction of the pages that it touched, truncating dead tuples to line pointers and defragmenting the page. The second heap pass would then have to dirty the page again to mark dead line pointers unused. In the absolute worst case, that's a very large increase in WAL generation: VACUUM could write full-page images for all of those pages while HOT-pruning them, then a checkpoint could happen, and then VACUUM could write full-page images of all of them again while marking the dead line pointers unused. I don't know whether anyone spent time and energy worrying about this problem, but considering how much HOT improves performance overall, it would be entirely understandable if this didn't seem like a terribly important thing to worry about.

But maybe we should reconsider. What benefit do we get out of dirtying the page twice like this, writing WAL each time? What if we went back to the idea of having the first heap pass be read-only? In fact, I'm thinking we might want to go even further and try to prevent even hint bit changes to the page during the first pass, especially because now we have checksums and wal_log_hints. If our vacuum cost settings are to be believed (and I am not sure that they are), dirtying a page is 10 times as expensive as reading one from the disk. So on a large table, we're paying 44 vacuum cost units per heap page vacuumed twice, when we could be paying only 24 such cost units. What a bargain!
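(Spelling out that arithmetic, assuming the current defaults of vacuum_cost_page_miss = 2 and vacuum_cost_page_dirty = 20:

    today:    2 passes * (miss + dirty) = 2 * (2 + 20) = 44
    proposed: miss + (miss + dirty)     = 2 + (2 + 20) = 24

That is, the second pass still reads and dirties each page, but the first pass only reads it.)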
The downside is that we would be postponing, perhaps substantially, the work that can be done immediately, namely freeing up space in the page and updating the free space map. The former doesn't seem like a big loss, because it can be done by anyone who visits the page anyway, and skipped if nobody does. The latter might be a loss, because getting the page into the freespace map sooner could prevent bloat by allowing space to be recycled sooner. I'm not sure how serious a problem this is. I'm curious what other people think. Would it be worth the delay in getting pages into the FSM if it means we dirty the pages only once? Could we have our cake and eat it too by updating the FSM with the amount of free space that the page WOULD have if we pruned it, but not actually do so?

I'm thinking about this because of the "decoupling table and index vacuuming" thread, which I was discussing with Dilip this morning. In a world where table vacuuming and index vacuuming are decoupled, it feels like we want to have only one kind of heap vacuum. It pushes us in the direction of unifying the first and second pass, and doing all the cleanup work at once.

However, I don't know that we want to use the approach described there in all cases. For a small table that is, let's just say, not part of any partitioning hierarchy, I'm not sure that using the conveyor belt approach makes a lot of sense, because the total amount of work we want to do is so small that we should just get it over with and not clutter up the disk with more conveyor belt forks -- especially for people who have large numbers of small tables, the inode consumption could be a real issue. And we won't really save anything either. The value of decoupling operations has to do with improving concurrency and error recovery and allowing global indexes and a bunch of stuff that, for a small table, simply doesn't matter. So it would be nice to fall back to an approach more like what we do now. But then you end up with two fairly distinct code paths, one where you want the heap phases combined and another where you want them separated. If the first pass were a strictly read-only pass, you could do that if there's no conveyor belt, or else read from the conveyor belt if there is one, and then the phase where you dirty the heap looks about the same either way.
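Very roughly, and with completely made-up names (nothing below exists; it's only meant to illustrate the shape I have in mind, not actual code), that unified path could look something like this:

    /* Hypothetical sketch only: every identifier here is invented. */
    if (conveyor_belt_exists(rel))
    {
        /* Dead TIDs were stashed earlier; take only the ones for which
         * every index has already been vacuumed. */
        dead_tids = read_dead_tids_from_conveyor(rel);
    }
    else
    {
        /* Strictly read-only scan: no pruning, no hint bits, no WAL. */
        dead_tids = collect_dead_tids_read_only(rel);
        vacuum_all_indexes(rel, dead_tids);
    }

    /* The only phase that dirties heap pages: prune, mark dead line
     * pointers unused, and update the FSM, all in one go. */
    vacuum_heap_pages(rel, dead_tids);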
Aside from the question of whether this is a good idea at all, I'm also wondering what kinds of experiments we could run to try to find out. What would be a best case workload for the current strategy vs. this? What would be a worst case for the current strategy vs. this? I'm not sure. If you have ideas, I'd love to hear them.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com