Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae - Mailing list pgsql-bugs
From | Robert Haas |
---|---|
Subject | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae |
Date | |
Msg-id | CA+TgmoaHcqdTNFuSZJrBAhqV9p+Gp9A3w2eN-YMSu_vHd3N3+g@mail.gmail.com |
In response to | relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Noah Misch <noah@leadboat.com>) |
Responses | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae |
List | pgsql-bugs |
On Sun, Mar 3, 2024 at 7:07 PM Noah Misch <noah@leadboat.com> wrote:
> I figure Matthias's upthread theory is more likely than not to hold.  If it
> does hold, commit 1ccc1e05ae created a new corruption route.  Hence, I'm
> adding a v17 open item for commit 1ccc1e05ae.

I need some help understanding what's going on here. I became aware of this thread because I took a look at the open items list. This email seems to have branched off of the thread for bug #17257, reported 2021-10-29. The antecedent of "Matthias's upthread theory" is unclear to me. These emails seem like the most relevant ones:

https://www.postgresql.org/message-id/CAEze2Wj7O5tnM_U151Baxr5ObTJafwH%3D71_JEmgJV%2B6eBgjL7g%40mail.gmail.com
https://www.postgresql.org/message-id/CAEze2WhxhEQEx%2Bc%2BCXoDpQs1H1HgkYUK4BW-hFw5_eQxuVWqRw%40mail.gmail.com
https://www.postgresql.org/message-id/20240106202413.e5%40rfd.leadboat.com

But I'm having a hard time piecing it all together. The general picture seems to be that pruning and vacuum disagree about whether a particular tuple is prunable; before 1ccc1e05ae, that caused the retry loop in heap_page_prune() to retry forever. Now, it causes relfrozenxid to be set to too new a value, which is a data-corruption scenario. If that's right, I'm slightly miffed to find this being labeled as an open item, since that makes it seem like 1ccc1e05ae didn't create any new problem but only caused existing defects in the GlobalVisTest machinery to have different consequences. Perhaps it's all for the best, though. It's kind of embarrassing that we haven't fixed whatever the problem is here yet.

But what exactly is the problem, and what's the fix? In the first of the emails linked above, Matthias argues that the problem is that GlobalVisState->maybe_needed can move backward. Peter Geoghegan seems to agree with that here:

https://www.postgresql.org/message-id/CAH2-Wzk_L7Z7LREHTtg5vY08eeWdnHO70m98eWx4U1uwvW%3D0sA%40mail.gmail.com

And Peter seems to have been trying to make sense of Andres's remarks here, which I think are saying the same thing:

https://www.postgresql.org/message-id/20210616192202.6q63mu66h4uyn343%40alap3.anarazel.de

So it seems like Matthias, Peter, and Andres all agree that GlobalVisState->maybe_needed going backward is bad and causes this problem. Unfortunately, I don't understand the mechanism.

vacuumlazy.c's LVRelState has this:

    /* VACUUM operation's cutoffs for freezing and pruning */
    struct VacuumCutoffs cutoffs;
    GlobalVisState *vistest;
    /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
    TransactionId NewRelfrozenXid;
    MultiXactId NewRelminMxid;

This looks scary, because we've got an awful lot of different variables serving rather closely-related purposes. cutoffs actually contains three separate XID/MXID pairs; GlobalVisTest contains XIDs; and then we've got another XID/MXID pair in the form of NewRel{frozenXid,minMxid}. It's certainly understandable how there could be a bug here: you just need whichever of these things ultimately goes into relfrozenxid to disagree with whichever of these things actually controls the pruning behavior. But it is not clear to me exactly which of these things disagree with each other in exactly what way, and, perhaps as a result, it is also not clear to me why preventing GlobalVisState->maybe_needed from moving backward would fix anything.

Can someone explain, or point me to the relevant previous email?

--
Robert Haas
EDB: http://www.enterprisedb.com
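To make the shape of the concern concrete, here is a toy model of the general failure described above: the value that ends up in relfrozenxid is derived from one cutoff, while pruning is governed by a different horizon. This is only a sketch under stated assumptions and is not a description of what vacuumlazy.c actually does. Every name in it (ToyXid, ToyTuple, toy_prune, toy_new_relfrozenxid, toy_relfrozenxid_is_safe, toy_vacuum_is_safe) is hypothetical, XIDs are plain integers with no wraparound, and each tuple carries only a deleting XID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy XID: a plain integer compared with '<'. Real XIDs wrap around. */
typedef uint32_t ToyXid;

/* Hypothetical tuple: only the deleting XID and whether it is still on the page. */
typedef struct ToyTuple {
    ToyXid xmax;
    bool   present;
} ToyTuple;

/* Pruning removes every tuple whose deleter precedes the pruning horizon. */
static void toy_prune(ToyTuple *tuples, size_t n, ToyXid prune_horizon)
{
    for (size_t i = 0; i < n; i++)
        if (tuples[i].present && tuples[i].xmax < prune_horizon)
            tuples[i].present = false;
}

/*
 * Second pass: track the oldest XID that remains. Assumption of this model:
 * anything older than vacuum_cutoff is treated as already pruned and is
 * therefore not tracked.
 */
static ToyXid toy_new_relfrozenxid(const ToyTuple *tuples, size_t n,
                                   ToyXid vacuum_cutoff, ToyXid next_xid)
{
    ToyXid newfrozen = next_xid;

    for (size_t i = 0; i < n; i++) {
        if (!tuples[i].present)
            continue;
        if (tuples[i].xmax < vacuum_cutoff)
            continue;               /* assumed gone, so not tracked */
        if (tuples[i].xmax < newfrozen)
            newfrozen = tuples[i].xmax;
    }
    return newfrozen;
}

/* Invariant: no remaining tuple may carry an XID older than relfrozenxid. */
static bool toy_relfrozenxid_is_safe(const ToyTuple *tuples, size_t n,
                                     ToyXid relfrozenxid)
{
    for (size_t i = 0; i < n; i++)
        if (tuples[i].present && tuples[i].xmax < relfrozenxid)
            return false;
    return true;
}

/* Run one toy vacuum with a fixed cutoff of 110 and the given pruning horizon. */
static bool toy_vacuum_is_safe(ToyXid prune_horizon)
{
    ToyTuple     tuples[] = { {90, true}, {105, true}, {120, true} };
    const size_t n = sizeof tuples / sizeof tuples[0];
    const ToyXid vacuum_cutoff = 110;
    const ToyXid next_xid = 130;

    assert(prune_horizon <= next_xid);
    toy_prune(tuples, n, prune_horizon);
    return toy_relfrozenxid_is_safe(
        tuples, n, toy_new_relfrozenxid(tuples, n, vacuum_cutoff, next_xid));
}
```

In this model, toy_vacuum_is_safe(110) holds because the pruning horizon matches the cutoff, while toy_vacuum_is_safe(100) does not: the tuple with XID 105 survives pruning but is skipped when the new relfrozenxid is computed. Whether a backward-moving maybe_needed produces exactly this disagreement in the real code is the open question in the email, not something this sketch settles.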