Re: crash-safe visibility map, take three - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: crash-safe visibility map, take three |
Date | |
Msg-id | AANLkTi=m063OeTLPWzixSDahsO9Aep9uRSs0DHYNMdgp@mail.gmail.com |
In response to | Re: crash-safe visibility map, take three (Bruce Momjian <bruce@momjian.us>) |
Responses | Re: crash-safe visibility map, take three |
List | pgsql-hackers |
On Wed, Dec 1, 2010 at 10:36 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag?  OK,
> please let me think some more.  Thanks.

As far as I can tell, there are basically two viable solutions on the table here.

1. Every time we observe a page as all-visible, (a) set the PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the bit in the visibility map page, bumping the LSN as usual; and (c) emit a WAL record indicating the relation and block number.  On redo of this record, set both the page-level bit and the visibility map bit.  The heap page may hit the disk before the WAL record, but that's OK; it just might result in a little extra work until some subsequent operation gets the visibility map bit set.  The visibility map page may hit the disk before the heap page, but that's OK too, because the WAL record will already be on disk due to the LSN interlock.  If a crash occurs before the heap page is flushed, redo will fix the heap page.  (The heap page will get flushed as part of the next checkpoint, if not sooner, so by the time the redo pointer advances past the WAL record, there's no longer a risk.)

2. Every time we observe a page as all-visible, (a) set the PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the bit in the visibility map page, bumping the LSN if a WAL record is issued (which only happens sometimes, read on); and (c) emit a WAL record indicating the "chunk" of 128 visibility map bits which contains the bit we just set - but only if we're now dealing with a new group of 128 visibility map bits, or if a checkpoint has intervened since the last such record we emitted.  On redo of this record, clear all the visibility map bits in the chunk.  The heap page may hit the disk before the WAL record, but that's OK for the same reasons as in plan #1.  The visibility map page may hit the disk before the heap page, but that's OK too, because the WAL record will already be on disk due to the LSN interlock.  If a crash occurs before the heap page makes it to disk, then redo will clear the visibility map bits, leaving them to be reset by a subsequent VACUUM.

As is typical with good ideas, neither of these seems terribly complicated in retrospect.  Kudos to Heikki for thinking them up and explaining them.

After some thought, I think that approach #1 is probably better, because it propagates visibility map bits to the standby.  During HS operation, the standby will have to ignore them for index-only scans, just as it currently ignores the PD_ALL_VISIBLE page-level bit.  But if and when the standby is promoted to master, it's important to have those bits already set - both for index-only scans, and also because, absent that, the first autovacuum on each table will end up scanning the whole thing and dirtying tremendous gobs of data setting all those bits, which is just the sort of ugly surprise that we don't want to give people right after they've been forced to perform a failover.
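To pin down the ordering rules in approach #1, here's a minimal toy model of the write path.  Every name in it is a hypothetical stand-in, not actual PostgreSQL code; the one assumption it relies on is the standard WAL-before-data interlock, i.e. the buffer manager may not write a page to disk until WAL has been flushed through that page's LSN:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
	uint64_t	lsn;			/* LSN of last WAL record that bumped it */
	bool		all_visible;	/* stand-in for PD_ALL_VISIBLE / VM bit */
} ToyPage;

static uint64_t wal_insert_lsn = 0;	/* toy WAL insert pointer */

/* Emit a toy "block is all-visible" WAL record; return its LSN. */
static uint64_t
xlog_all_visible(uint32_t blkno)
{
	wal_insert_lsn++;
	printf("WAL %llu: block %u all-visible\n",
		   (unsigned long long) wal_insert_lsn, (unsigned) blkno);
	return wal_insert_lsn;
}

static void
observe_all_visible(ToyPage *heap_page, ToyPage *vm_page, uint32_t blkno)
{
	/*
	 * (a) Set the heap page's bit WITHOUT bumping its LSN, so the heap
	 * page is allowed to reach disk before the WAL record - harmless,
	 * since redo just sets the bit again.
	 */
	heap_page->all_visible = true;

	/* (c) WAL-log the relation/block ... */
	uint64_t	lsn = xlog_all_visible(blkno);

	/*
	 * (b) ... then set the VM bit and bump the VM page's LSN, so the
	 * interlock guarantees the WAL record reaches disk before the VM
	 * page can.
	 */
	vm_page->all_visible = true;
	vm_page->lsn = lsn;
}

int
main(void)
{
	ToyPage		heap = {0, false};
	ToyPage		vm = {0, false};

	observe_all_visible(&heap, &vm, 42);
	printf("heap: lsn=%llu bit=%d; vm: lsn=%llu bit=%d\n",
		   (unsigned long long) heap.lsn, heap.all_visible,
		   (unsigned long long) vm.lsn, vm.all_visible);
	return 0;
}
```

The crucial asymmetry is which page's LSN gets bumped: leaving the heap page's LSN alone lets it be flushed early, while bumping the VM page's LSN forces the WAL record out first, so redo of that record can always repair a heap page that never made it to disk before the crash.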
I think we can improve this a bit further by also introducing a HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with FrozenXID.  This allows us to freeze tuples aggressively - if we want - without losing any forensic information.  We can then modify the above algorithm slightly, so that when we observe that a page is all-visible, we not only set PD_ALL_VISIBLE on the page but also HEAP_XMIN_FROZEN on each tuple.  The WAL record marking the page as all-visible then doubles as a WAL record marking it frozen, eliminating the need to dirty the page yet again at anti-wraparound vacuum time.  It'll still be a net increase in WAL volume (as Heikki pointed out), but the added WAL volume is small compared with the I/O involved in writing out the dirty heap pages (as Tom pointed out), so it should hopefully be OK.
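A minimal sketch of what the tuple-side test might look like.  The bit name and value here are made up for illustration, on the assumption that a free infomask bit exists; only the frozen XID value (PostgreSQL's FrozenTransactionId, 2) is real:

```c
#include <stdbool.h>
#include <stdint.h>

#define HEAP_XMIN_FROZEN	0x0200	/* hypothetical free infomask bit */
#define FROZEN_XID			2		/* FrozenTransactionId */

static bool
xmin_frozen(uint16_t infomask, uint32_t xmin)
{
	/*
	 * A tuple counts as frozen either if the flag bit is set (new
	 * style: the original xmin is preserved for forensics) or if xmin
	 * was overwritten with the frozen XID (old style).
	 */
	return (infomask & HEAP_XMIN_FROZEN) != 0 || xmin == FROZEN_XID;
}
```

Because visibility checks would test the flag before consulting xmin, the original xmin stays readable for debugging, and the same WAL record that marks the page all-visible can mark it frozen.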
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company