On Mon, May 20, 2024 at 4:48 PM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, May 20, 2024 at 11:58:23AM -0400, Melanie Plageman wrote:
> > On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah@leadboat.com> wrote:
> > > Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
> > > this? We'll need to get confident about 1ccc1e05ae before v17, and that
> > > sounds potentially easier than getting confident about both 1ccc1e05ae and
> > > this other approach.
> >
> > I haven't tried back-patching 1ccc1e05ae yet, but I don't understand
> > why we would want to use stable back branches to get comfortable with
> > an approach committed to an unreleased version of Postgres.
>
> I wouldn't say we want to use stable back branches to get comfortable with an
> approach. I wanted to say that it's less work to be confident about one new
> way of doing things than two new ways of doing things.
I think I understand your point better now.
> > The small fix proposed in this thread is extremely minimal and
> > straightforward. It seems much less risky as a backpatch.
>
> Here's how I model the current and proposed code:
>
> 1. v15 VACUUM removes tuples that are HEAPTUPLE_DEAD according to VisTest.
> OldestXmin doesn't cause tuple removal, but there's a hang when OldestXmin
> rules DEAD after VisTest ruled RECENTLY_DEAD.
>
> 2. With 1ccc1e05ae, v17 VACUUM still removes tuples that are HEAPTUPLE_DEAD
> according to VisTest only. OldestXmin doesn't come into play.
>
> 3. fix_hang_15.patch would make v15 VACUUM remove tuples that are
> HEAPTUPLE_DEAD according to _either_ VisTest or OldestXmin.
>
> Since (3) is the only list entry that removes tuples that VisTest ruled
> RECENTLY_DEAD, I find it higher risk. For all three, the core task of
> certifying the behavior is confirming that its outcome is sound when VisTest
> and OldestXmin disagree. How do you model it?
Okay, I see your point. In 1 and 2, tuples that would have been considered HEAPTUPLE_DEAD per the OldestXmin taken at the beginning of vacuum, but that VisTest considers HEAPTUPLE_RECENTLY_DEAD at pruning time, are not removed. In 3, tuples that would have been considered HEAPTUPLE_DEAD at the beginning of vacuum are always removed, regardless of whether VisTest would consider them HEAPTUPLE_RECENTLY_DEAD when pruning them.
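To make the disagreement case concrete, here is a minimal, self-contained C model of the two removal rules. It is not PostgreSQL source: remove_tuple_vistest_only and remove_tuple_either are invented names, and the plain "<" comparison stands in for TransactionIdPrecedes(), ignoring xid wraparound.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int xid_t;     /* toy transaction id, no wraparound */

/* Behaviors (1) and (2): only VisTest drives removal. */
static bool
remove_tuple_vistest_only(bool vistest_says_removable)
{
    return vistest_says_removable;
}

/* Behavior (3), fix_hang_15.patch: either horizon suffices. */
static bool
remove_tuple_either(bool vistest_says_removable,
                    xid_t dead_after, xid_t oldest_xmin)
{
    /* "<" stands in for TransactionIdPrecedes() */
    return vistest_says_removable || dead_after < oldest_xmin;
}

int
main(void)
{
    /*
     * The disagreement case: VisTest rules the tuple
     * HEAPTUPLE_RECENTLY_DEAD (not removable), but the xid after which
     * it became dead precedes the OldestXmin captured at the start of
     * vacuum.
     */
    bool  vistest_says_removable = false;
    xid_t dead_after = 100;
    xid_t oldest_xmin = 200;

    printf("(1)/(2) remove? %d\n",
           remove_tuple_vistest_only(vistest_says_removable));   /* 0: kept */
    printf("(3)     remove? %d\n",
           remove_tuple_either(vistest_says_removable,
                               dead_after, oldest_xmin));        /* 1: removed */
    return 0;
}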
One option is to add the logic in fix_hang_15.patch to master as well (always remove tuples whose xids precede OldestXmin). That would leave a single new behavior to certify, which addresses your concern about gaining confidence in one solution rather than two.
However, I can see how removing more tuples could be concerning. If the horizon moves backwards because a standby reconnects, I think the worst case is that removing such a tuple causes a recovery conflict on the standby (depending on the value of max_standby_streaming_delay and related settings).
What would happen if we simply skipped the current page when we found that the vacuum process had entered the infinite loop (detected with a retry counter)?
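Below is a minimal, runnable C sketch of that idea; it is not the shape of lazy_scan_prune() itself, and MAX_PRUNE_RETRIES, prune_page(), and the skip behavior are all hypothetical stand-ins.

#include <stdbool.h>
#include <stdio.h>

#define MAX_PRUNE_RETRIES 10    /* hypothetical cap on retries */

/*
 * Stand-in for one pruning pass over a page.  Returns true once the
 * horizons agree, i.e. no tuple is DEAD per OldestXmin while VisTest
 * still rules it RECENTLY_DEAD.  Here we pretend they agree on the
 * third attempt.
 */
static bool
prune_page(int attempt)
{
    return attempt >= 2;
}

int
main(void)
{
    bool settled = false;

    for (int attempt = 0; attempt < MAX_PRUNE_RETRIES; attempt++)
    {
        if (prune_page(attempt))
        {
            settled = true;
            break;
        }
    }

    if (settled)
        printf("page pruned after bounded retries\n");
    else
        printf("giving up: skip this page, leave its tuples for the "
               "next vacuum\n");
    return 0;
}

The tuple that could not be removed would then just remain until the next vacuum of the relation, rather than stalling the current one indefinitely.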