Re: Checkpointer crashes on slave in 9.4 on windows - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Checkpointer crashes on slave in 9.4 on windows |
Date | |
Msg-id | CA+TgmoZeFqBFtyCZTQSA+gqcyspvQ4KNui1Ggw7ab_bQ32qzdw@mail.gmail.com |
In response to | Checkpointer crashes on slave in 9.4 on windows (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: Checkpointer crashes on slave in 9.4 on windows |
List | pgsql-hackers |
On Mon, Jul 21, 2014 at 4:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> During internals tests, it is observed that checkpointer
> is getting crashed on slave with below log on slave in
> windows:
>
> LOG: checkpointer process (PID 4040) was terminated by exception 0xC0000005
> HINT: See C include file "ntstatus.h" for a description of the hexadecimal
> value.
> LOG: terminating any other active server processes
>
> I debugged and found that it is happening when checkpointer
> tries to update shared memory config and below is the
> call stack.
>
> postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000,
>     LWLockMode mode=LW_EXCLUSIVE,
>     unsigned __int64 * valptr=0x0000000000000020,
>     unsigned __int64 val=18446744073709551615) Line 579 + 0x14 bytes C
> postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000,
>     unsigned __int64 * valptr=0x0000000000000020,
>     unsigned __int64 val=18446744073709551615) Line 510 C
> postgres.exe!WALInsertLockAcquireExclusive() Line 1627 C
> postgres.exe!UpdateFullPageWrites() Line 9037 C
> postgres.exe!UpdateSharedMemoryConfig() Line 1364 C
> postgres.exe!CheckpointerMain() Line 359 C
> postgres.exe!AuxiliaryProcessMain(int argc=2,
>     char * * argv=0x00000000007d2180) Line 427 C
> postgres.exe!SubPostmasterMain(int argc=4,
>     char * * argv=0x00000000007d2170) Line 4635 C
> postgres.exe!main(int argc=4,
>     char * * argv=0x00000000007d2170) Line 207 C
>
> Basically, here the issue is that during startup when
> checkpointer tries to acquire WAL Insertion Locks to
> update the value of fullPageWrites, it crashes because
> the same is still not initialized. It will be initialized in
> InitXLOGAccess() which will get called via RecoveryInProgress()
> in case recovery is in progress before doing actual checkpoint.
> However we are trying to access it before that which leads to
> crash.
>
> I think the reason why it occurs only on windows is that
> on linux fork will ensure that WAL Insertion Locks get
> initialized with same values as postmaster.
>
> To fix this issue, we need to ensure that WAL Insertion
> Locks should get initialized before we use them, so one of
> the ways is to call InitXLOGAccess() before calling
> CheckpointerMain() as I have done in attached patch, other
> could be to call RecoveryInProgress() much earlier in path
> than now.

So, this problem was introduced by Heikki's commit,
68a2e52bbaf98f136a96b3a0d734ca52ca440a95, to replace XLogInsert slots
with regular LWLocks. I think the problem here is that the
initialization code really doesn't belong in InitXLOGAccess at all:

1. I think WALInsertLocks is just another global variable that needs
to be saved and restored in EXEC_BACKEND mode and that it therefore
ought to participate in the save_backend_variables() mechanism instead
of having its own special-purpose mechanism to save and restore the
value.

2. And I think that the LWLockRegisterTranche call belongs in
XLOGShmemInit(), so that it's parallel to the other call in
CreateLWLocks.

I think that would be more robust, because while your fix will
definitely work, we could easily reintroduce a similar
platform-specific bug for some other auxiliary process. Using the
mechanisms described above will mean that this is set up properly for
everything that's attached to shared memory at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
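
For readers following the thread, the save/restore idea in point 1 can be
sketched roughly as below. This is a simplified, self-contained model and
not PostgreSQL source: every name prefixed with demo_ or Demo is a
hypothetical stand-in, the "exec" step is simulated by resetting a global
to its initial value, and the sketch assumes the shared area is visible at
the same address in the child. It only shows why a process-local pointer
into shared memory is NULL after an exec-style start unless it is carried
across explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one lock stored in shared memory. */
typedef struct DemoLock
{
    int held;
} DemoLock;

/* Stand-in for the shared memory segment (assumed visible to all processes). */
static DemoLock demo_shared_area[4];

/* Process-local pointer into shared memory, playing the role of WALInsertLocks. */
static DemoLock *demo_insert_locks = NULL;

/* Hypothetical parameter block handed from the parent to an exec'd child. */
typedef struct DemoBackendParameters
{
    DemoLock *insert_locks;
} DemoBackendParameters;

/* Parent-side initialization: point the global at the shared area. */
void demo_parent_init(void)
{
    demo_insert_locks = demo_shared_area;
}

/* Parent side: copy the global into the parameter block. */
void demo_save_backend_variables(DemoBackendParameters *param)
{
    assert(param != NULL);
    param->insert_locks = demo_insert_locks;
}

/* Child side: copy the value back into the global. */
void demo_restore_backend_variables(const DemoBackendParameters *param)
{
    assert(param != NULL);
    demo_insert_locks = param->insert_locks;
}

/* Model an exec-style start: the child sees the global's initial value. */
void demo_simulate_exec(void)
{
    demo_insert_locks = NULL;
}

/* Returns 0 on success, -1 where real code would dereference NULL and crash. */
int demo_acquire_first_lock(void)
{
    if (demo_insert_locks == NULL)
        return -1;
    demo_insert_locks[0].held = 1;
    return 0;
}

/* Walk through the scenario; returns 0 if each step behaves as described. */
int demo_run(void)
{
    DemoBackendParameters param;

    demo_parent_init();
    demo_save_backend_variables(&param);

    demo_simulate_exec();
    if (demo_acquire_first_lock() != -1)
        return 1;               /* pointer unexpectedly survived */

    demo_restore_backend_variables(&param);
    if (demo_acquire_first_lock() != 0)
        return 2;               /* restore did not take effect */

    return 0;
}
```

Calling demo_run() from a main() exercises the whole sequence. In the
sketch, any process that runs the restore step gets a valid pointer no
matter which code path it takes afterwards, which is the robustness
argument made above for using the general mechanism rather than
initializing in one particular process's startup path.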