Re: FPI - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: FPI |
Date | |
Msg-id | AANLkTinutHWmZo8L9KzXqZCuN6tRMz=CO_Y4Yfb=zPiJ@mail.gmail.com |
In response to | Re: FPI (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: FPI |
List | pgsql-hackers |
On Mon, Jan 31, 2011 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 28, 2011 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> What happens if we (a) keep the current rule after reaching
>> consistency and (b) apply any such updates *unconditionally* - that
>> is, without reference to the LSN - prior to reaching consistency?
>> Under that rule, if we encounter an FPI before reaching consistency,
>> we're OK. So let's suppose we don't. No matter how many times we
>> replay any initial prefix of any such updates between the redo pointer
>> and the point at which we reach consistency, the state of the page
>> when we finally reach consistency will be identical. But we could get
>> hosed if replay progressed *past* the minimum recovery point and then
>> started over at the previous redo pointer. If we forced an immediate
>> restartpoint on reaching consistency, that seems like it might prevent
>> that scenario.
>
> Actually, I'm wrong, and this doesn't work at all. At the time of the
> crash, there could already be pages on disk with LSNs greater than the
> minimum recovery point. Duh.
>
> It was such a good idea in my head...

I should mention that most of this idea was Heikki's, originally.
Except for the crappy parts that don't work - those are all me.

But I'm back to thinking this can work. Heikki pointed out to me on IM
today that in crash recovery, we always replay to end-of-WAL before
opening for connections, and for Hot Standby every block we write
advances the minimum recovery point to its LSN. This implies that if
we're accepting connections (either regular or Hot Standby) or at a
valid stopping point for PITR, there are no unreplayed WAL records
whose changes are reflected in blocks on disk.

So I'm back to proposing that we just apply FPI-free WAL records
unconditionally, without regard to the LSN. This could potentially
corrupt the page, of course.
Consider delete (no FPI) - vacuum (with FPI) - crash, leaving the
vacuum page half on disk. Now the replay of the delete is probably
going to do the wrong thing, because the page is torn. But it doesn't
matter, because the vacuum's FPI will overwrite the page anyway, and
whatever stupid thing the delete replay did will become irrelevant -
BEFORE we can begin processing any queries. On the other hand, if the
delete record *isn't* followed by an FPI, but just by, say, a bunch
more deletes, then it should all Just Work (TM). As long as the page
header (excluding LSN and TLI, which we're ignoring by stipulation) and
item pointer list are intact, we can redo those deletes and clean
things up. And if they're not intact, then we must've done something
that emits an FPI, and so any temporary page corruption will get
overwritten when we get to that point in the WAL stream...

That is a bit ugly, though, because it means the XLOG replay of
FPI-free records would have to be prepared to just punt if it
encounters any sort of corruption, in the sure hope that any such
corruption must imply the presence of a future FPI that will be
replayed - since if there is no such future FPI, it should be
impossible for the page to be corrupted in the first place. But that
might reduce our chances of being able to detect real corruption.

Heikki also came up with another idea that might be worth exploring:
at the point when we currently emit FPIs, emit an image of just the
part of the page that precedes pd_lower - the page header and item
IDs. To make this work, we'd have to make a rule that redo isn't
allowed to rely for correctness on any bits following the pd_lower
boundary - it can write those bits, but it can't read them. But most
of the XLOG_HEAP records follow that rule already - we look at the
item pointers to figure out where we're putting a new tuple or to
locate an existing tuple and unconditionally overwrite some of its
bits.
The obvious exception is XLOG_HEAP2_CLEAN, emitted by VACUUM, which
would probably need to just log the entire page. Also, we'd again need
to apply records unconditionally, without reference to the page LSN,
until we reached the minimum recovery point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
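
A minimal sketch of the redo rule proposed in this message, written as
a toy Python model rather than PostgreSQL code. The names Page, Record,
and redo are hypothetical, a "page" is reduced to an LSN plus a set of
live item numbers, and the only FPI-free change modelled is a delete.
Keeping the larger of the two LSNs after an unconditional replay is a
choice of this sketch, since the message stipulates only that the page
LSN is ignored before consistency is reached.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Page:
    """Toy heap page: an LSN plus the set of live item numbers."""
    lsn: int = 0
    live_items: Set[int] = field(default_factory=set)


@dataclass
class Record:
    """Toy WAL record: either an FPI-free delete or a full-page image."""
    lsn: int
    delete_item: Optional[int] = None   # FPI-free change: remove one item
    fpi: Optional[Set[int]] = None      # full-page image: complete item set


def redo(page: Page, rec: Record, reached_consistency: bool) -> bool:
    """Replay one record against the page; return True if it was applied."""
    if rec.fpi is not None:
        # Assumption of this model: a full-page image always overwrites
        # the page, whatever state (torn or not) it was in.
        page.live_items = set(rec.fpi)
        page.lsn = rec.lsn
        return True
    if reached_consistency and page.lsn >= rec.lsn:
        # Current rule, kept after consistency: the page already
        # reflects this record, so skip it.
        return False
    # Proposed rule: before consistency, apply without looking at the
    # page LSN. Removing an already-removed item is a no-op, so replaying
    # the same prefix of deletes repeatedly yields the same page.
    page.live_items.discard(rec.delete_item)
    page.lsn = max(page.lsn, rec.lsn)
    return True
```

In this model, a page that reached disk ahead of replay (say LSN 20,
with both deletes already reflected) and a stale copy of the same page
end up with identical contents once the same delete records have been
replayed with reached_consistency set to False.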