Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs
From: Noah Misch
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Msg-id: 20240304230503.a4.nmisch@google.com
In response to: Re: FSM Corruption (was: Could not read block at end of the relation) (Ronan Dunklau <ronan.dunklau@aiven.io>)
Responses: Re: FSM Corruption (was: Could not read block at end of the relation)
List: pgsql-bugs
On Mon, Mar 04, 2024 at 11:21:07PM +0100, Ronan Dunklau wrote:
> On Monday, 4 March 2024 at 20:03:12 CET, Noah Misch wrote:
> > Is this happening after an OS crash, a replica promote, or a PITR
> > restore?
>
> I need to double-check all occurrences, but I wouldn't be surprised if
> they all went through a replica promotion.
>
> > If so, I think I see the problem.  We have an undocumented rule that
> > FSM shall not contain references to pages past the end of the
> > relation.  To facilitate that, relation truncation WAL-logs FSM
> > truncate.  However, there's no similar protection for relation
> > extension, which is not WAL-logged.  We break the rule whenever we
> > write FSM for block X before some WAL record initializes block X.
> > data_checksums makes the trouble easier to hit, since it creates
> > FPI_FOR_HINT records for FSM changes.  A replica promote or PITR
> > ending just after the FSM FPI_FOR_HINT would yield this broken state.
> > While v16 RelationAddBlocks() made this easier to hit, I suspect it's
> > reproducible in all supported branches.  For example,
> > lazy_scan_new_or_empty() and multiple index AMs break the rule via
> > RecordPageWithFreeSpace() on a PageIsNew() page.
>
> Very interesting.  I understand the reasoning, and am now able to craft
> a very crude test case, which I will refine into something more
> actionable:
>
> - Start session 1, which just vacuums our table continuously.
> - Start session 2, which issues checkpoints continuously.
> - In session 3, COPY enough data that we preallocate too many pages.
>   In my example, I'm copying 33150 single-integer rows, which fit on a
>   bit more than 162 pages.  The idea behind that is to make sure we
>   don't land on a round number of pages, and that several iterations of
>   the extend mechanism run, so one of them issues a full-page write of
>   the FSM.
>
> So to trigger the bug, it seems to me one needs to run in parallel:
> - COPY
> - VACUUM
> - CHECKPOINT
>
> Which would explain why a surprisingly large number of the occurrences
> I've seen seemed to involve a pg_restore invocation.

Does that procedure alone reproduce the ERROR, without any replica
promote, postmaster restart, etc.?  That, I do not have reason to
expect.

> > I think the fix is one of:
> >
> > - Revoke the undocumented rule.  Make FSM consumers resilient to the
> >   FSM returning a now-too-large block number.
>
> Performance wise, that seems to be the better answer

I would guess this one is more risky from a performance perspective,
since we'd be adding to a hotter path under RelationGetBufferForTuple().
Still, it's likely fine.

> but won't this kind of built-in resilience encourage subtle bugs
> creeping in?

Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
subtle bugs.  I bet an unlucky torn FSM page could reach the ERROR you
reached, even if we introduced a "main-fork WAL before FSM" rule.  We'll
have plenty of need for resilience.  Hence, I'd lean toward this
approach.
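To make that concrete, the check I have in mind would look roughly like
this.  Untested sketch only; the helper name is made up, and the real
change would live in or near the hio.c code that consumes
GetPageWithFreeSpace():

/*
 * Hypothetical sketch: discard FSM answers that point past the current
 * end of the relation, instead of letting the subsequent ReadBuffer()
 * raise "could not read block" ERRORs.
 */
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "utils/rel.h"

static BlockNumber
GetInBoundsPageWithFreeSpace(Relation relation, Size spaceNeeded)
{
	BlockNumber targetBlock = GetPageWithFreeSpace(relation, spaceNeeded);

	/*
	 * A stale or corrupt FSM can name a block beyond the end of the
	 * main fork.  Treat that like "no page found", so the caller falls
	 * through to extending the relation.
	 */
	if (targetBlock != InvalidBlockNumber &&
		targetBlock >= RelationGetNumberOfBlocks(relation))
		return InvalidBlockNumber;

	return targetBlock;
}

The RelationGetNumberOfBlocks() call is the part I'd want to measure,
since it can add an lseek() on that path; that's the performance caveat
above.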
> > - Enforce a new "main-fork WAL before FSM" rule for logged rels.
> >   For example, in each PageIsNew() case, either don't update FSM or
> >   WAL-log an init (like lazy_scan_new_or_empty() does when
> >   PageIsEmpty()).
>
> The FSM is not updated by the caller: in the bulk insert case, we
> intentionally don't add the new pages to the FSM.  So having VACUUM
> WAL-log the page in lazy_scan_new_or_empty() in the PageIsNew() case,
> as you propose, seems a very good idea.
>
> Performance wise, that would keep the WAL logging outside of the
> backend performing the preventive work for itself or others, and it
> should not be so many pages that it gives VACUUM too much work.

A drawback is that this approach touches several access methods, so
out-of-tree access methods would likely need changes, too.  Still, this
is a good backup if the first option measurably slows INSERT throughput.
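For the record, the shape I'd expect is roughly the following.  Again an
untested sketch; the helper name is made up, and a real patch would have
to fit vacuumlazy.c's existing locking and recovery rules:

/*
 * Hypothetical sketch: before advertising a PageIsNew() page in the
 * FSM, initialize it and WAL-log the init, so no FSM entry can precede
 * the block's first main-fork WAL record.
 */
#include "postgres.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/freespace.h"
#include "utils/rel.h"

static void
vacuum_log_init_and_record_fsm(Relation rel, Buffer buf, BlockNumber blkno)
{
	Page		page = BufferGetPage(buf);

	/* Caller holds exclusive lock on buf and has seen PageIsNew(page). */
	START_CRIT_SECTION();
	PageInit(page, BufferGetPageSize(buf), 0);
	MarkBufferDirty(buf);
	if (RelationNeedsWAL(rel))
		log_newpage_buffer(buf, true);
	END_CRIT_SECTION();

	/* Only now is it safe to tell the FSM about the page. */
	RecordPageWithFreeSpace(rel, blkno, PageGetFreeSpace(page));
}

That mirrors the usual PageInit + MarkBufferDirty + log_newpage_buffer
pattern for WAL-logging whole pages, so the FSM update always follows
main-fork WAL for the block.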
> I can try my hand at such a patch if it looks like a good idea.

Please do.

> Thank you very much for your insights.

Thanks for the detailed report that made it possible.