Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers

From Arseniy Mukhin
Subject Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Msg-id CAE7r3M+=oOhDSmSihqGvdFzfgekF+6KibEXUJdCK7DdFTA8uPQ@mail.gmail.com
In response to Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue  (Matheus Alcantara <matheusssilv97@gmail.com>)
Responses Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
List pgsql-hackers
Hi,

On Fri, Sep 19, 2025 at 12:35 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
>
> On Mon Sep 15, 2025 at 2:40 PM -03, Masahiko Sawada wrote:
> > While the WAL-based approach discussed on another thread is promising,
> > I think it would not be acceptable for back branches as it requires
> > quite a lot of refactoring. Given that this is a long-standing bug in
> > listen/notify, I think we can continue discussing how to fix the issue
> > on backbranches on this thread.
> >
> Please see the new attached patch, it has a different implementation
> that I've previously posted which is based on the idea that Arseniy
> posted on [1].
>

Thank you for the new version.

> This new version include the "committed" field on AsyncQueueEntry struct
> so that we can use this info when processing the notification instead of
> call TransactionIdDidCommit()
>
> The "committed" field is set to true when the AsyncQueueEntry is being
> added on the SLRU page buffer when the PreCommit_Notify() is called. If
> an error occurs between the PreCommit_Notify() and AtCommit_Notify() the
> AtAbort_Notify() will be called and will set the "committed" field to
> false for the notifications inside the aborted transaction.
>
> It's a bit tricky to know at AtAbort_Notify() which notifications were
> added on the SLRU page buffer by the aborted transaction, so I created a
> new data structure and a global variable to keep track of this
> information. See the commit message for more information.
>

I like this approach. It removes the dependency on clog and doesn't
restrict vacuum. A few points about the fix:

Is it correct to remember and reuse SLRU slot numbers here? IIUC we
can't do that unless we hold the SLRU bank lock the whole time, because
by the time we get to AtAbort_Notify() the queue page could already
have been evicted. We probably need to call SimpleLruReadPage() after
acquiring the lock in AtAbort_Notify()?
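
To illustrate the concern, here is a toy model in C. It is not
PostgreSQL code: ToyPool, toy_read_page() and toy_slot_still_valid()
are hypothetical names, and the round-robin eviction is an assumption
made only to keep the sketch small. It shows why a remembered slot
number can go stale once other pages are read in, and why the page has
to be looked up again by page number.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_SLOTS 2
#define TOY_NO_PAGE   (-1L)

/* Toy buffer pool: each slot remembers which page it currently holds. */
typedef struct ToyPool
{
    long page_in_slot[TOY_NUM_SLOTS];
    int  next_victim;           /* round-robin eviction, for simplicity */
} ToyPool;

void
toy_pool_init(ToyPool *pool)
{
    for (int slot = 0; slot < TOY_NUM_SLOTS; slot++)
        pool->page_in_slot[slot] = TOY_NO_PAGE;
    pool->next_victim = 0;
}

/*
 * Return the slot holding 'pageno', loading the page (and evicting
 * another one) if it is not resident. This stands in for resolving a
 * page by number while the relevant lock is held.
 */
int
toy_read_page(ToyPool *pool, long pageno)
{
    for (int slot = 0; slot < TOY_NUM_SLOTS; slot++)
    {
        if (pool->page_in_slot[slot] == pageno)
            return slot;
    }

    int victim = pool->next_victim;

    pool->next_victim = (victim + 1) % TOY_NUM_SLOTS;
    pool->page_in_slot[victim] = pageno;
    return victim;
}

/* True only if a previously remembered slot still holds 'pageno'. */
bool
toy_slot_still_valid(const ToyPool *pool, int slot, long pageno)
{
    return pool->page_in_slot[slot] == pageno;
}
```

In this model, remembering the slot returned for page 7, then reading
pages 8 and 9, leaves the remembered slot holding page 9. Calling
toy_read_page() for page 7 again returns a slot that is valid at that
moment, which is the pattern I'd expect AtAbort_Notify() to need.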

I think adding a boolean 'committed' is a good approach, but what do
you think about moving the queue head back to the position where the
aborted transaction's notifications start? We can do that reset in
AtAbort_Notify(). So instead of marking notifications as
'committed=false', we erase them from the queue entirely by moving the
queue head back. From the listeners' perspective, any notification in
the queue that belongs to a completed transaction then belongs to a
committed one, so again we get rid of the TransactionIdDidCommit()
call. It seems like a simpler approach because we don't need to
remember the positions of all notifications in the queue and don't
need the additional 'committed' field. All we need is to remember the
head position before we write anything to the queue, and reset it if
there is an abort. IIUC listeners will never send such erased
notifications:
- while the aborted transaction still looks 'in progress', listeners
can't send its notifications.
- by the time the aborted transaction is completed, the head has
already been moved back, so the erased notifications are located after
the queue head and listeners can't read them.
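
A minimal sketch of the head-reset idea, as a toy model in C rather
than actual async.c code. ToyQueue and all toy_* functions are
hypothetical names. The model assumes only one transaction writes to
the queue at a time, and it represents "in progress" by passing the
xid of the single running writer to the listener (0 meaning none).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_CAPACITY 16

/* Toy notification: the sender's xid plus an integer payload. */
typedef struct ToyEntry
{
    unsigned xid;
    int      payload;
} ToyEntry;

/* Toy queue: only entries in [0, head) exist as far as listeners know. */
typedef struct ToyQueue
{
    ToyEntry entries[TOY_QUEUE_CAPACITY];
    size_t   head;
} ToyQueue;

/* Remember the head before the transaction writes anything. */
size_t
toy_save_head(const ToyQueue *q)
{
    return q->head;
}

/* Append one notification at the head; returns false if the queue is full. */
bool
toy_queue_append(ToyQueue *q, unsigned xid, int payload)
{
    if (q->head >= TOY_QUEUE_CAPACITY)
        return false;
    q->entries[q->head].xid = xid;
    q->entries[q->head].payload = payload;
    q->head++;
    return true;
}

/* On abort: move the head back, erasing what the transaction wrote. */
void
toy_abort_reset_head(ToyQueue *q, size_t saved_head)
{
    q->head = saved_head;
}

/*
 * Listener: copy payloads from *pos up to the head, stopping at the
 * first entry of the in-progress transaction. No commit-status lookup
 * is needed, because aborted entries are never left below the head
 * once their transaction is complete.
 */
size_t
toy_listener_read(const ToyQueue *q, size_t *pos, unsigned in_progress_xid,
                  int *out, size_t out_cap)
{
    size_t n = 0;

    while (*pos < q->head && n < out_cap)
    {
        const ToyEntry *e = &q->entries[*pos];

        if (in_progress_xid != 0 && e->xid == in_progress_xid)
            break;              /* retry later, once the sender finishes */
        out[n++] = e->payload;
        (*pos)++;
    }
    return n;
}
```

In this model a listener that reads while the writer is still in
progress stops in front of its entries, and after the abort those
entries are beyond the head, so the next transaction's notifications
simply overwrite them.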

> On the previously patch that I've posted I've created a TAP test to
> reproduce the issue with the VACUUM FREEZE, this new version also
> include this test and also a new test case that use the injection points
> extension to force an error between the PreCommit_Notify() and
> AtCommit_Notify() so that we can ensure that these notifications of an
> aborted transaction are not visible to other listener backends.
>

I think it's a good test to have. FWIW there is a way to reproduce the
test condition without the injection point. We can use the fact that
serializable conflicts are checked after the transaction adds its
notifications to the queue. Please find the attached patch with an
example TAP test. I'm not sure whether using injection points is
preferable?


Best regards,
Arseniy Mukhin
