Re: Optimize LISTEN/NOTIFY - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Optimize LISTEN/NOTIFY
Date
Msg-id 9eba307f-f2fb-48f0-9507-2e197f39ef9e@app.fastmail.com
In response to Re: Optimize LISTEN/NOTIFY  ("Joel Jacobson" <joel@compiler.org>)
Responses Re: Optimize LISTEN/NOTIFY
List pgsql-hackers
On Fri, Oct 10, 2025, at 20:46, Joel Jacobson wrote:
> On Wed, Oct 8, 2025, at 20:46, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>>>> 5. ChannelHashAddListener: "already registered" case is not reached,
>>>> which surprises me a bit, and neither is the "grow the array" stanza.
>>
>>> I've added a test for the "grow the array" stanza.
>>
>>> The "already registered" case seems impossible to reach, since the
>>> caller, Exec_ListenCommit, returns early if IsListeningOn.
>>
>> Maybe we should remove the check for "already registered" then,
>> or reduce it to an Assert?  Seems pointless to check twice.
>>
>> Or thinking a little bigger: why are we maintaining the set of
>> channels-listened-to both as a list and a hash?  Could we remove
>> the list form?
>
> Yes, it was indeed possible to remove the list form.
>
> Some functions got a bit more complex, but I think it's worth it since a
> single source of truth seems like an important design goal.
>
> This also made LISTEN faster when a backend is listening on many
> channels, since we can now look up the channel in the hash, instead of
> having to go through the list as before. The additional linear scan of
> the listenersArray didn't add any noticeable extra cost, even with
> thousands of listening backends for the channel.
>
> I also tried to keep listenersArray sorted and binary-search it, but
> even with thousands of listening backends, I couldn't measure any
> overall latency difference of LISTEN, so I kept the linear scan to keep
> it simple.
>
> In Exec_ListenCommit, I've now inlined code that is similar to
> IsListeningOn. I didn't want to use IsListeningOn since it felt wasteful
> having to do dshash_find, when we instead can just use
> dshash_find_or_insert, to handle both cases.
>
> I also added a static int numChannelsListeningOn variable, to avoid the
> possibly expensive operation of going through the entire hash, to be
> able to check `numChannelsListeningOn == 0` instead of the now removed
> `listenChannels == NIL`. It's of course critical to keep
> numChannelsListeningOn in sync, but I think it should be safe? Would of
> course be better to avoid this variable. Maybe the extra cycles that
> would cost would be worth it?

In addition to the previously suggested optimization, there is another
major one that seems doable, which would be a great improvement for
workloads with large traffic differences between channels, i.e. some
low-traffic and some high-traffic.

I'm not entirely sure this approach is correct though; I might have
misunderstood the guarantees of the heavyweight lock. My assumption is
that only one backend at a time can be running the code in
PreCommit_Notify after having acquired the heavyweight lock. If that's
not true, then this doesn't work. What made me worried is the exclusive
lock we also take inside the same function; I don't see the point of it
since we're already holding the heavyweight lock, but maybe it's just
there to "allow deadlocks to be detected", like the comment says?

---

Patches:

* 0001-optimize_listen_notify-v14.patch:
Just adds additional test coverage of async.c

* 0002-optimize_listen_notify-v14.patch:
Adds the shared channel hash.
Unchanged since 0002-optimize_listen_notify-v13.patch.

* 0003-optimize_listen_notify-v14.patch:

Optimize LISTEN/NOTIFY by advancing idle backends directly

Building on the previous channel-specific listener tracking
optimization, this patch further reduces context switching by detecting
idle listening backends that don't listen to any of the channels being
notified and advancing their queue positions directly without waking
them up.

When a backend commits notifications, it now saves both the queue head
position before and after writing. In SignalBackends(), backends that
are at the old queue head and weren't marked for wakeup (meaning they
don't listen to any of the notified channels) are advanced directly to
the new queue head. This eliminates unnecessary wakeups for these
backends, which would otherwise wake up, scan through all the
notifications, skip each one, and advance to the same position anyway.

The implementation carefully handles the race condition where other
backends may write notifications after the heavyweight lock is released
but before SignalBackends() is called. By saving queueHeadAfterWrite
immediately after writing (before releasing the lock), we ensure
backends are only advanced over the exact notifications we wrote, not
notifications from other concurrent backends.

---

Benchmark:

% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=10
pgbench_patched: starting LISTEN/NOTIFY round-trip benchmark
pgbench_patched: round-trips per iteration: 10000
pgbench_patched: idle listeners added per iteration: 10

master:

idle_listeners  round_trips_per_sec     max_latency_usec
             0              33592.9                 2278
            10              14251.1                 1041
            20               9258.7                 1367
            30               6144.2                 2277
            40               4653.1                 1690
            50               3780.7                 2869
            60               3234.9                 3215
            70               2818.9                 3652
            80               2458.7                 3219
            90               2203.1                 3505
           100               1951.9                 1739

0002-optimize_listen_notify-v14.patch:

idle_listeners  round_trips_per_sec     max_latency_usec
             0              33936.2                  889
            10              30631.9                 1233
            20              22404.7                 7862
            30              19446.2                 9539
            40              16013.3                13963
            50              14310.1                16983
            60              12827.0                21363
            70              11271.9                24775
            80              10764.4                28703
            90               9568.1                31693
           100               9241.3                32724

0003-optimize_listen_notify-v14.patch:

idle_listeners  round_trips_per_sec     max_latency_usec
             0              33236.8                 1090
            10              34681.0                 1338
            20              34530.4                 1372
            30              34061.6                 1339
            40              33084.5                  913
            50              33847.5                  955
            60              33675.8                 1239
            70              28857.4                20443
            80              33324.9                  786
            90              33612.3                  758
           100              31259.2                 7706

As we can see, with 0002 the ping-pong round-trips per second degrade
much more slowly than on master, but the wakeups of idle listening
backends still need to happen at some point. There are far fewer
wakeups, staggered over time, but throughput still drops from 33k to 9k
with 100 idle listening backends. With 0003, the round-trips per second
are sustained, unaffected by additional idle listening backends.

I've also attached the pgbench patch as a .txt in
pgbench-listen-notify-benchmark-patch.txt, since it's not part of this
patch set; it's just provided to help others verify the results.

/Joel
