Re: failures in t/031_recovery_conflict.pl on CI

From:           Andres Freund
Subject:        Re: failures in t/031_recovery_conflict.pl on CI
Msg-id:         20220409231002.mddrr7mtca7p55v5@alap3.anarazel.de
In response to: Re: failures in t/031_recovery_conflict.pl on CI (Andres Freund <andres@anarazel.de>)
Responses:      Re: failures in t/031_recovery_conflict.pl on CI
List:           pgsql-hackers
Hi,

On 2022-04-09 15:00:54 -0700, Andres Freund wrote:
> What are we expecting to wake the startup process up, once it does
> SendRecoveryConflictWithBufferPin()?
>
> It's likely not the problem here, because we never seem to have even reached
> that path, but afaics once we've called disable_all_timeouts() at the bottom
> of ResolveRecoveryConflictWithBufferPin() and then re-entered
> ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind,
> so clear a path as quickly as possible." path, there's no guarantee for any
> timeout to be pending anymore?
>
> If there's either no backend that we're still conflicting with (an entirely
> possible race condition), or if there's e.g. a snapshot or database conflict,
> there's afaics nobody setting the startup processes' latch.

It's not that (although I still suspect it's a problem). It's a self-deadlock:
StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin()
*explicitly enables*, calls SendRecoveryConflictWithBufferPin(), which does
CancelDBBackends(), which ResolveRecoveryConflictWithBufferPin() also calls
itself if the deadlock timeout is reached.
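To spell that cycle out in isolation, here's a minimal, self-contained sketch
of the pattern (generic names, a toy POSIX semaphore standing in for
ProcArrayLock; this is obviously not the actual PostgreSQL code): a SIGALRM
handler re-enters a function that takes a lock the interrupted code already
holds, so the process ends up waiting on itself:

/*
 * Toy illustration only. cancel_backends() stands in for CancelDBBackends(),
 * timeout_handler() for StandbyTimeoutHandler(); nothing here is
 * async-signal-safe, which is exactly the point.
 */
#include <signal.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t proc_array_lock;           /* stand-in for ProcArrayLock */

/* takes the lock, does "work", releases */
static void cancel_backends(const char *who)
{
    sem_wait(&proc_array_lock);         /* blocks forever if this process
                                         * already holds the lock */
    fprintf(stderr, "%s: signalling backends\n", who);
    sleep(5);                           /* widen the window, like the
                                         * pg_usleep() mentioned below */
    sem_post(&proc_array_lock);
}

/* the "timeout handler" re-enters the lock-taking function */
static void timeout_handler(int signo)
{
    (void) signo;
    cancel_backends("timeout handler");
}

int main(void)
{
    sem_init(&proc_array_lock, 0, 1);
    signal(SIGALRM, timeout_handler);

    alarm(1);                           /* "standby timeout" fires in 1s */
    cancel_backends("startup process"); /* alarm fires inside the sleep; the
                                         * handler re-enters cancel_backends()
                                         * and waits on the lock we hold */
    puts("never reached");
    return 0;
}

Run it and it hangs in the handler's sem_wait(), which is the futex wait at
the top of the backtrace below.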
To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and
boom:

(gdb) bt
#0  __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=true)
    at ../sysdeps/nptl/futex-internal.c:87
#1  0x00007fd4ce5a215b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fd4cb001138,
    expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>)
    at ../sysdeps/nptl/futex-internal.c:123
#2  0x00007fd4ce59e44f in do_futex_wait (sem=sem@entry=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#3  0x00007fd4ce59e4e8 in __new_sem_wait_slow64 (sem=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#4  0x00007fd4cf20d823 in PGSemaphoreLock (sema=0x7fd4cb001138) at pg_sema.c:327
#5  0x00007fd4cf2dd675 in LWLockAcquire (lock=0x7fd4cb001600, mode=LW_EXCLUSIVE)
    at /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1324
#6  0x00007fd4cf2c36e7 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, conflictPending=false)
    at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3638
#7  0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)
    at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#8  0x00007fd4cf2cc69d in StandbyTimeoutHandler () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:911
#9  0x00007fd4cf4e68d7 in handle_sig_alarm (postgres_signal_arg=14)
    at /home/andres/src/postgresql/src/backend/utils/misc/timeout.c:421
#10 <signal handler called>
#11 0x00007fd4cddfffc4 in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e5561c0)
    at ../sysdeps/unix/sysv/linux/select.c:71
#12 0x00007fd4cf52ea2a in pg_usleep (microsec=10000) at /home/andres/src/postgresql/src/port/pgsleep.c:56
#13 0x00007fd4cf2c36f1 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, conflictPending=false)
    at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3640
#14 0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)
    at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#15 0x00007fd4cf2cc50f in ResolveRecoveryConflictWithBufferPin ()
    at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:820
#16 0x00007fd4cf2a996f in LockBufferForCleanup (buffer=43)
    at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:4336
#17 0x00007fd4ceec911c in XLogReadBufferForRedoExtended (record=0x7fd4d106a618, block_id=0 '\000', mode=RBM_NORMAL,
    get_cleanup_lock=true, buf=0x7ffc6e556394)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogutils.c:427
#18 0x00007fd4cee1aa41 in heap_xlog_prune (record=0x7fd4d106a618)
    at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:8634

It's reproducible on Linux.

I'm lacking words I dare to put in an email to describe how bad an idea it is
to call CancelDBBackends() from within a timeout handler, particularly when
the function enabling the timeout also calls that function itself, before
disabling timers. I ...

Greetings,

Andres Freund
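For contrast, the boring-but-safe shape would be for the timeout handler to do
nothing beyond recording that it fired (and waking the waiter, e.g. by setting
its latch), with any lock-taking work done back in the normal code path once
the wait returns. Again just a generic toy, not a proposed patch; the names
and structure are made up:

/*
 * Toy counterpart to the sketch above: the handler only sets a flag, which
 * is async-signal-safe; cancel_backends() runs outside the handler.
 */
#include <signal.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t proc_array_lock;
static volatile sig_atomic_t timeout_pending = 0;

static void cancel_backends(const char *who)
{
    sem_wait(&proc_array_lock);
    fprintf(stderr, "%s: signalling backends\n", who);
    sem_post(&proc_array_lock);
}

/* handler: no locks, no I/O, just record the fact */
static void timeout_handler(int signo)
{
    (void) signo;
    timeout_pending = 1;
}

int main(void)
{
    sem_init(&proc_array_lock, 0, 1);
    signal(SIGALRM, timeout_handler);

    alarm(1);
    sleep(5);                           /* interrupted by SIGALRM */

    if (timeout_pending)
    {
        timeout_pending = 0;
        cancel_backends("startup process");  /* lock-taking work happens
                                              * here, never in the handler */
    }
    puts("done");
    return 0;
}

That keeps everything non-async-signal-safe, LWLockAcquire() very much
included, out of the signal handler entirely.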