Re: Failed assertion on standby while shutdown - Mailing list pgsql-hackers
From | Maxim Orlov |
---|---|
Subject | Re: Failed assertion on standby while shutdown |
Date | |
Msg-id | 14d8bf0615489b5c92818974f56c373e@postgrespro.ru Whole thread Raw |
In response to | Re: Failed assertion on standby while shutdown (Fujii Masao <masao.fujii@oss.nttdata.com>) |
Responses |
Re: Failed assertion on standby while shutdown
|
List | pgsql-hackers |
On 2021-03-22 16:40, Fujii Masao wrote: > On 2021/03/20 2:25, Maxim Orlov wrote: >> Hi, haсkers! >> >> Recently, I was doing some experiments with primary/standby instances >> interaction. In certain conditions I’ve got and was able to reproduce >> crash on failed assertion. >> >> The scenario is the following: >> 1. start primary server >> 2. start standby server by pg_basebackup -P -R -X stream -c fast >> -p5432 -D data >> 3. apply some load to the primary server by pgbench -p5432 -i -s 150 >> postgres >> 4. kill primary server (with kill -9) and keep it down >> 5. stop standby server by pg_ctl >> 6. run standby server >> >> Then any standby server termination will result in a failed assertion. >> >> The log with a backtrace is following: >> >> 2021-03-19 18:54:25.352 MSK [3508443] LOG: received fast shutdown >> request >> 2021-03-19 18:54:25.379 MSK [3508443] LOG: aborting any active >> transactions >> TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", >> File: >> "/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c", >> Line: 592, PID: 3508452) >> postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f] >> postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31] >> postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32] >> postgres: walreceiver (+0x530bff)[0x555555a84bff] >> postgres: walreceiver (+0x531044)[0x555555a85044] >> postgres: walreceiver (+0x530959)[0x555555a84959] >> /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0] >> /lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da] >> postgres: walreceiver (+0x52bea4)[0x555555a7fea4] >> postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1] >> postgres: walreceiver (+0x41ff1f)[0x555555973f1f] >> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3] >> postgres: walreceiver (_start+0x2e)[0x55555561abfe] >> >> After a brief investigation I found out that I can get this assert >> with 100% probability if I insert a sleep for about 5 sec into >> InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c: >> >> diff --git a/src/backend/storage/lmgr/proc.c >> b/src/backend/storage/lmgr/proc.c >> index 897045ee272..b5f365f426d 100644 >> --- a/src/backend/storage/lmgr/proc.c >> +++ b/src/backend/storage/lmgr/proc.c >> @@ -525,7 +525,7 @@ InitAuxiliaryProcess(void) >> >> if (MyProc != NULL) >> elog(ERROR, "you already exist"); >> - >> + pg_usleep(5000000L); >> /* >> * We use the ProcStructLock to protect assignment and >> releasing of >> * AuxiliaryProcs entries. > > Thanks for the report! I could reproduce this issue by adding that > sleep > into InitAuxiliaryProcess(). > >> Maybe, this kinda behaviour would appear if a computer hosting >> instances is under significant side load, which cause delay to start >> db-instances under a heavy load. >> >> Configuration for a primary server is default with "wal_level = >> logical" >> >> Configuration for a standby server is default with "wal_level = >> logical" and "primary_conninfo = 'port=5432'" >> >> I'm puzzled with this behavor. I'm pretty sure it is not what should >> be. Any ideas how this can be fixed? > > ISTM that the cause of this issue is that the startup process exits > without releasing the locks that it was holding when shutdown is > requested. To address this issue, IMO the startup process should > call ShutdownRecoveryTransactionEnvironment() at its exit. > Attached is the POC patch that changes the startup process that way. > > I've not tested the patch enough yet.. > > Regards, Thank you for reply! As far as I understand, this is really the case. I've test your patch a bit. This annoying failed assertion is gone now. I think I should test more and report later about results. Should we put this patch to CF? --- Best regards, Maxim Orlov.
pgsql-hackers by date: