Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers
From | Arseniy Mukhin |
---|---|
Subject | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
Date | |
Msg-id | CAE7r3M+=oOhDSmSihqGvdFzfgekF+6KibEXUJdCK7DdFTA8uPQ@mail.gmail.com |
In response to | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue (Matheus Alcantara <matheusssilv97@gmail.com>) |
Responses | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
List | pgsql-hackers |
Hi,

On Fri, Sep 19, 2025 at 12:35 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
>
> On Mon Sep 15, 2025 at 2:40 PM -03, Masahiko Sawada wrote:
> > While the WAL-based approach discussed on another thread is promising,
> > I think it would not be acceptable for back branches as it requires
> > quite a lot of refactoring. Given that this is a long-standing bug in
> > listen/notify, I think we can continue discussing how to fix the issue
> > on backbranches on this thread.
> >
> Please see the new attached patch, it has a different implementation
> that I've previously posted which is based on the idea that Arseniy
> posted on [1].
>

Thank you for the new version.

> This new version include the "committed" field on AsyncQueueEntry struct
> so that we can use this info when processing the notification instead of
> call TransactionIdDidCommit()
>
> The "committed" field is set to true when the AsyncQueueEntry is being
> added on the SLRU page buffer when the PreCommit_Notify() is called. If
> an error occurs between the PreCommit_Notify() and AtCommit_Notify() the
> AtAbort_Notify() will be called and will set the "committed" field to
> false for the notifications inside the aborted transaction.
>
> It's a bit tricky to know at AtAbort_Notify() which notifications were
> added on the SLRU page buffer by the aborted transaction, so I created a
> new data structure and a global variable to keep track of this
> information. See the commit message for more information.
>

I like this approach. We get rid of the dependency on clog and don't
limit vacuum.

Several points about the fix:

Is it correct to remember and reuse SLRU slots here? IIUC we can't do
that unless we hold the SLRU bank lock the whole time, because by the
time we get to AtAbort_Notify() the queue page could already have been
evicted, and the remembered slot may hold a different page. We probably
need to call SimpleLruReadPage() after we acquire the lock in
AtAbort_Notify()?

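To illustrate the concern about remembered slots, here is a minimal, self-contained sketch. It is a toy model, not PostgreSQL code: `ToyCache`, `toy_read_page()` and `toy_slot_still_valid()` are hypothetical names, and the round-robin eviction is an assumption chosen only to keep the example short. It shows that a slot number saved earlier can end up pointing at a different page after an eviction, while looking the page up again by page number yields a valid slot.

```c
#include <assert.h>

#define NUM_SLOTS 2
#define INVALID_PAGE (-1)

/* Toy page cache: each slot records which page number it currently holds. */
typedef struct {
    int page_no[NUM_SLOTS];
    int next_victim;            /* round-robin eviction pointer (assumption) */
} ToyCache;

static void toy_cache_init(ToyCache *c)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        c->page_no[i] = INVALID_PAGE;
    c->next_victim = 0;
}

/*
 * Hypothetical stand-in for "read the page and return its slot".
 * If the page is not cached, the next victim slot is evicted and reused.
 */
static int toy_read_page(ToyCache *c, int page_no)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        if (c->page_no[i] == page_no)
            return i;

    int slot = c->next_victim;
    c->next_victim = (c->next_victim + 1) % NUM_SLOTS;
    c->page_no[slot] = page_no;
    return slot;
}

/* Does a previously remembered slot still hold the page we expect? */
static int toy_slot_still_valid(const ToyCache *c, int slot, int page_no)
{
    return c->page_no[slot] == page_no;
}
```

In this model, a caller that remembers the slot for page 10, lets two other pages be read, and then reuses the slot would touch the wrong page; calling `toy_read_page()` again for page 10 returns a slot that really holds it.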
I think adding a boolean 'committed' is a good approach, but what do you
think about setting the queue head back to the position where the
aborted transaction's notifications start? We can do such a reset in
AtAbort_Notify(). So instead of marking notifications as
'committed=false', we completely erase them from the queue by moving the
queue head back. From a listener's perspective, if there is a
notification of a completed transaction in the queue, it always belongs
to a committed transaction, so again we get rid of the
TransactionIdDidCommit() call. It seems like a simpler approach because
we don't need to remember the positions of all notifications in the
queue and don't need the additional field 'committed'. All we need is to
remember the head position before we write anything to the queue, and
reset it back if there is an abort. IIUC listeners will never send such
erased notifications:

- while the aborted transaction still looks 'in progress', listeners
  can't send its notifications.
- by the time the aborted transaction is completed, the head is already
  set back, so the erased notifications are located after the queue head
  and listeners can't read them.

> On the previously patch that I've posted I've created a TAP test to
> reproduce the issue with the VACUUM FREEZE, this new version also
> include this test and also a new test case that use the injection points
> extension to force an error between the PreCommit_Notify() and
> AtCommit_Notify() so that we can ensure that these notifications of an
> aborted transaction are not visible to other listener backends.
>

I think it's a good test to have. FWIW there is a way to reproduce the
test condition without the injection point. We can use the fact that
serializable conflicts are checked after the transaction adds its
notifications to the queue. Please find attached a patch with an example
TAP test. I'm not sure whether using injection points is preferable.

Best regards,
Arseniy Mukhin
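The head-reset idea from the message above can be sketched as a small toy model. This is not the patch and not PostgreSQL code: `ToyQueue` and the `toy_*` functions are hypothetical, the queue is a fixed array rather than SLRU pages, and the sketch assumes a single writer owns the queue between writing its entries and committing or aborting. The `writer_in_progress` flag stands in for "the writing transaction still looks in progress to listeners".

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAPACITY 16

typedef struct {
    int  entries[QUEUE_CAPACITY]; /* payloads; stale data may remain past head */
    int  head;                    /* next write position */
    int  saved_head;              /* head before the current writer appended */
    bool writer_in_progress;      /* stand-in for "xid still in progress" */
} ToyQueue;

static void toy_queue_init(ToyQueue *q)
{
    q->head = 0;
    q->saved_head = 0;
    q->writer_in_progress = false;
}

/* Stand-in for the pre-commit step: remember the head, then append. */
static bool toy_precommit(ToyQueue *q, const int *payloads, int n)
{
    if (q->head + n > QUEUE_CAPACITY)
        return false;
    q->saved_head = q->head;
    for (int i = 0; i < n; i++)
        q->entries[q->head++] = payloads[i];
    q->writer_in_progress = true;
    return true;
}

/* Commit: the appended entries become readable. */
static void toy_commit(ToyQueue *q)
{
    q->writer_in_progress = false;
}

/* Abort: erase the appended entries by moving the head back. */
static void toy_abort(ToyQueue *q)
{
    q->head = q->saved_head;
    q->writer_in_progress = false;
}

/*
 * Listener: deliver entries from *pos onward. Entries of an in-progress
 * writer are skipped for now, and nothing at or past the head is read.
 */
static int toy_listen(const ToyQueue *q, int *pos, int *out, int out_cap)
{
    int limit = q->writer_in_progress ? q->saved_head : q->head;
    int n = 0;

    while (*pos < limit && n < out_cap)
        out[n++] = q->entries[(*pos)++];
    return n;
}
```

In this model a listener never needs a per-entry commit check: while the writer is in progress its entries are not delivered, and after an abort they sit past the head, where the next writer overwrites them.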
Attachment