Re: Optimize LISTEN/NOTIFY - Mailing list pgsql-hackers
From: Joel Jacobson
Subject: Re: Optimize LISTEN/NOTIFY
Msg-id: 9eba307f-f2fb-48f0-9507-2e197f39ef9e@app.fastmail.com
In response to: Re: Optimize LISTEN/NOTIFY ("Joel Jacobson" <joel@compiler.org>)
List: pgsql-hackers
On Fri, Oct 10, 2025, at 20:46, Joel Jacobson wrote:
> On Wed, Oct 8, 2025, at 20:46, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>>>> 5. ChannelHashAddListener: "already registered" case is not reached,
>>>> which surprises me a bit, and neither is the "grow the array" stanza.
>>
>>> I've added a test for the "grow the array" stanza.
>>
>>> The "already registered" case seems impossible to reach, since the
>>> caller, Exec_ListenCommit, returns early if IsListeningOn.
>>
>> Maybe we should remove the check for "already registered" then,
>> or reduce it to an Assert?  Seems pointless to check twice.
>>
>> Or thinking a little bigger: why are we maintaining the set of
>> channels-listened-to both as a list and a hash?  Could we remove
>> the list form?
>
> Yes, it was indeed possible to remove the list form.
>
> Some functions got a bit more complex, but I think it's worth it, since a
> single source of truth seems like an important design goal.
>
> This also made LISTEN faster when a backend is listening on many
> channels, since we can now look the channel up in the hash instead of
> having to walk the list as before. The additional linear scan of
> the listenersArray didn't add any noticeable extra cost, even with
> thousands of backends listening on the channel.
>
> I also tried keeping listenersArray sorted and binary-searching it, but
> even with thousands of listening backends I couldn't measure any
> overall latency difference for LISTEN, so I kept the linear scan to
> keep it simple.
>
> In Exec_ListenCommit, I've now inlined code that is similar to
> IsListeningOn. I didn't want to use IsListeningOn, since it felt
> wasteful to do a dshash_find when we can instead use
> dshash_find_or_insert to handle both cases.
> I also added a static int numChannelsListeningOn variable, to avoid the
> possibly expensive operation of walking the entire hash, so we can
> check `numChannelsListeningOn == 0` instead of the now removed
> `listenChannels == NIL`. It's of course critical to keep
> numChannelsListeningOn in sync, but I think it should be safe? It would
> of course be better to avoid this variable; maybe the extra cycles that
> would cost would be worth it?

In addition to the previously suggested optimization, there is another
major one that seems doable, which would mean a great improvement for
workloads with large traffic differences between channels, i.e. some
low-traffic and some high-traffic.

I'm not entirely sure this approach is correct, though; I might have
misunderstood the guarantees of the heavyweight lock. My assumption is
that there can only be one backend currently running the code in
PreCommit_Notify after having acquired the heavyweight lock. If this is
not true, then it doesn't work. What made me worried is the exclusive
lock we also take inside the same function; I don't see the point of it
since we're already holding the heavyweight lock, but maybe it's just to
"allow deadlocks to be detected", like the comment says?

---

Patches:

* 0001-optimize_listen_notify-v14.patch:
  Just adds additional test coverage of async.c.

* 0002-optimize_listen_notify-v14.patch:
  Adds the shared channel hash. Unchanged since
  0002-optimize_listen_notify-v13.patch.

* 0003-optimize_listen_notify-v14.patch:
  Optimize LISTEN/NOTIFY by advancing idle backends directly.

  Building on the previous channel-specific listener tracking
  optimization, this patch further reduces context switching by
  detecting idle listening backends that don't listen to any of the
  channels being notified, and advancing their queue positions directly
  without waking them up.

  When a backend commits notifications, it now saves the queue head
  position both before and after writing.
  In SignalBackends(), backends that are at the old queue head and
  weren't marked for wakeup (meaning they don't listen to any of the
  notified channels) are advanced directly to the new queue head. This
  eliminates unnecessary wakeups for these backends, which would
  otherwise wake up, scan through all the notifications, skip each one,
  and advance to the same position anyway.

  The implementation carefully handles the race condition where other
  backends may write notifications after the heavyweight lock is
  released but before SignalBackends() is called. By saving
  queueHeadAfterWrite immediately after writing (before releasing the
  lock), we ensure backends are only advanced over the exact
  notifications we wrote, not over notifications from other concurrent
  backends.

---

Benchmark:

% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=10
pgbench_patched: starting LISTEN/NOTIFY round-trip benchmark
pgbench_patched: round-trips per iteration: 10000
pgbench_patched: idle listeners added per iteration: 10

master:

idle_listeners  round_trips_per_sec  max_latency_usec
             0              33592.9              2278
            10              14251.1              1041
            20               9258.7              1367
            30               6144.2              2277
            40               4653.1              1690
            50               3780.7              2869
            60               3234.9              3215
            70               2818.9              3652
            80               2458.7              3219
            90               2203.1              3505
           100               1951.9              1739

0002-optimize_listen_notify-v14.patch:

idle_listeners  round_trips_per_sec  max_latency_usec
             0              33936.2               889
            10              30631.9              1233
            20              22404.7              7862
            30              19446.2              9539
            40              16013.3             13963
            50              14310.1             16983
            60              12827.0             21363
            70              11271.9             24775
            80              10764.4             28703
            90               9568.1             31693
           100               9241.3             32724

0003-optimize_listen_notify-v14.patch:

idle_listeners  round_trips_per_sec  max_latency_usec
             0              33236.8              1090
            10              34681.0              1338
            20              34530.4              1372
            30              34061.6              1339
            40              33084.5               913
            50              33847.5               955
            60              33675.8              1239
            70              28857.4             20443
            80              33324.9               786
            90              33612.3               758
           100              31259.2              7706

As we can see, with 0002 the ping-pong round-trips per second degrade
much more slowly than on master, but the wakeup of idle listening
backends still has to happen at some point; even with far fewer wakeups,
staggered over time, the rate still drops from 33k to 9k with 100 idle
listening backends.

With 0003, the round-trips per second is sustained, unaffected by
additional idle listening backends.

I've also attached the pgbench patch as a .txt file
(pgbench-listen-notify-benchmark-patch.txt), since it's not part of this
patch series; it's just provided to help others verify the results.

/Joel