Re: failures in t/031_recovery_conflict.pl on CI - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: failures in t/031_recovery_conflict.pl on CI |
| Date | |
| Msg-id | 20220409231002.mddrr7mtca7p55v5@alap3.anarazel.de |
| In response to | Re: failures in t/031_recovery_conflict.pl on CI (Andres Freund <andres@anarazel.de>) |
| Responses | Re: failures in t/031_recovery_conflict.pl on CI |
| List | pgsql-hackers |
Hi,
On 2022-04-09 15:00:54 -0700, Andres Freund wrote:
> What are we expecting to wake the startup process up, once it does
> SendRecoveryConflictWithBufferPin()?
>
> It's likely not the problem here, because we never seem to have even reached
> that path, but afaics once we've called disable_all_timeouts() at the bottom
> of ResolveRecoveryConflictWithBufferPin() and then re-entered
> ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind,
> so clear a path as quickly as possible." path, there's no guarantee for any
> timeout to be pending anymore?
>
> If there's either no backend that we're still conflicting with (an entirely
> possible race condition), or if there's e.g. a snapshot or database conflict,
> there's afaics nobody setting the startup processes' latch.
It's not that (although I still suspect it's a problem). It's a self-deadlock,
because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin()
*explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does
CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls,
if the deadlock timeout is reached.
To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and boom:
```
(gdb) bt
#0  __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:87
#1  0x00007fd4ce5a215b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ../sysdeps/nptl/futex-internal.c:123
#2  0x00007fd4ce59e44f in do_futex_wait (sem=sem@entry=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#3  0x00007fd4ce59e4e8 in __new_sem_wait_slow64 (sem=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#4  0x00007fd4cf20d823 in PGSemaphoreLock (sema=0x7fd4cb001138) at pg_sema.c:327
#5  0x00007fd4cf2dd675 in LWLockAcquire (lock=0x7fd4cb001600, mode=LW_EXCLUSIVE) at /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1324
#6  0x00007fd4cf2c36e7 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, conflictPending=false) at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3638
#7  0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN) at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#8  0x00007fd4cf2cc69d in StandbyTimeoutHandler () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:911
#9  0x00007fd4cf4e68d7 in handle_sig_alarm (postgres_signal_arg=14) at /home/andres/src/postgresql/src/backend/utils/misc/timeout.c:421
#10 <signal handler called>
#11 0x00007fd4cddfffc4 in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e5561c0) at ../sysdeps/unix/sysv/linux/select.c:71
#12 0x00007fd4cf52ea2a in pg_usleep (microsec=10000) at /home/andres/src/postgresql/src/port/pgsleep.c:56
#13 0x00007fd4cf2c36f1 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, conflictPending=false) at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3640
#14 0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK) at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#15 0x00007fd4cf2cc50f in ResolveRecoveryConflictWithBufferPin () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:820
#16 0x00007fd4cf2a996f in LockBufferForCleanup (buffer=43) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:4336
#17 0x00007fd4ceec911c in XLogReadBufferForRedoExtended (record=0x7fd4d106a618, block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=true, buf=0x7ffc6e556394) at /home/andres/src/postgresql/src/backend/access/transam/xlogutils.c:427
#18 0x00007fd4cee1aa41 in heap_xlog_prune (record=0x7fd4d106a618) at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:8634
```
It's reproducible on Linux.
I'm lacking words I dare to put in an email to describe how bad an idea it is
to call CancelDBBackends() from within a timeout function, particularly when
the function enabling that timeout also calls CancelDBBackends() itself,
before disabling the timers.
I ...
Greetings,
Andres Freund