When doing logical replication, a large transaction can prevent the postgres process from shutting down until all of the WAL has been processed and the client has reported back. This is obviously less than ideal, as it means a pg_ctl stop -m fast can take minutes or hours to complete. I would expect all backends to be signalled so that they can exit cleanly.
I found this thread that reports something very similar (but without the infinite looping):
Subject: walsender bug: stuck during shutdown
I have cc'd Alvaro in case he has any progress on this, or ideas. I tried applying the patch from that thread, but the behavior remained unchanged. Wanted to raise this in -bugs for added visibility, and also see if anyone had thoughts before I dig deeper.
My test case (tested with latest, as of commit b8ea0f675f35c3f0c2cf62175517ba0dacad4abd):
* Spin up a cluster, port 5555, using wal_level logical
* pg_recvlogical --create-slot -d postgres -p 5555 --slot=foo
* pg_recvlogical --start -d postgres -p 5555 --slot=foo --file /tmp/tmp
* If all is well, ctrl-z, bg 1, watch -n 3 tail /tmp/tmp
Other session:
* psql -p 5555 postgres
* create table t (id int generated always as identity, foo text);
* insert into t(foo) select 'abcdefghijklmnopqrstuvwxyz' from generate_series(1,10_000_000);
Once the commit finishes, and as soon as pg_recvlogical starts processing it:
* time pg_ctl stop -m fast -w -t 10000
I found 10 million rows to be a nice test size on my system - shutdown takes an additional 50 seconds or so, as it waits for pg_recvlogical to respond.
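For anyone who wants to measure the delay rather than eyeball it, here is a minimal Python sketch that times the same pg_ctl stop invocation used above. The helper names (stop_command, insert_sql, time_command, main) are mine and purely illustrative. It assumes pg_ctl is on PATH and that PGDATA points at the test cluster, since the command above passes no -D. It does not handle the timing of the steps; you still need to start it once pg_recvlogical has begun processing the transaction.

```python
import shlex
import subprocess
import sys
import time


def stop_command(timeout_s=10000):
    # Same invocation as in the steps above; with no -D, pg_ctl relies on PGDATA.
    return ["pg_ctl", "stop", "-m", "fast", "-w", "-t", str(timeout_s)]


def insert_sql(rows=10_000_000):
    # The single large transaction from the steps above, with a tunable row count.
    return (
        "insert into t(foo) select 'abcdefghijklmnopqrstuvwxyz' "
        f"from generate_series(1,{rows});"
    )


def time_command(argv):
    """Run argv to completion and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    result = subprocess.run(argv, check=False)
    return result.returncode, time.monotonic() - start


def main():
    # Call this by hand once pg_recvlogical is busy with the big transaction.
    cmd = stop_command()
    print("running:", shlex.join(cmd))
    code, elapsed = time_command(cmd)
    print(f"pg_ctl exited with {code} after {elapsed:.1f}s")
    return code
```

Calling main() prints the elapsed wall-clock time for the shutdown. Comparing runs with and without pg_recvlogical attached, or with different row counts passed to insert_sql, should show how much of the shutdown time is spent waiting on the walsender.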
Cheers,
Greg