Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. - Mailing list pgsql-bugs
From | Michael Paquier |
---|---|
Subject | Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. |
Date | |
Msg-id | CAB7nPqTAHrSKsxu2PwbXwcnQs_PBv-KPxS5roRBbZDsd=AucmA@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. (Jeff Janes <jeff.janes@gmail.com>) |
Responses |
Re: BUG #13657: Some kind of undetected deadlock between query
and "startup process" on replica.
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. |
List | pgsql-bugs |
On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> >> >> >> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote: >> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting for >> >> something. >> >> And query waiting for AccessShareLock on the same relation. >> >> >> > gdb backtrace from stuck startup process: >> > (gdb) bt >> > #0 0x00007f04ad862633 in select () from /lib/x86_64-linux-gnu/libc.so.6 >> > #1 0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53 >> > #2 0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at >> > >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171 >> > #3 ResolveRecoveryConflictWithVirtualXIDs >> > (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0) at >> > >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232 >> > #4 ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0, >> > reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at >> > >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191 >> > #5 0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80, >> > lsn=107351881751648) at >> > >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329 >> >> This backtrace is not indicating that this process is waiting on a >> relation lock, it is resolving a recovery conflict while removing tuples, >> killing the virtual transaction depending on if max_standby_streaming_delay >> or max_standby_archive_delay are set if the conflict gets longer. Did you >> change the default of those parameters, which is 30s, to -1? This would mean >> that the standby waits indefinitely. > > While setting it to -1 gives the startup process permission to wait > indefinitely for another back-end which is doing something, I don't think > that that means it should have permission to deadlock indefinitely. A > deadlock is a different kind of thing that "someone started a transaction > and then left on vacation for a month without closing it" Looking at that more closely... We know that a startup process is the only process in a standby that can take an exclusive lock on a relation, per LockAcquireExtended in lock.c and that this lock is taken after replaying XLOG_STANDBY_LOCK. The deadlock reported by Maxim indicates that the deadlock is showing up in heap_xlog_clean, which may happen in the case of a VACUUM for example. It seems to me that we could have a deadlock if some WAL records had their order swapped. For example, the EXCLUSIVE lock is replayed first and the cleanup of relation pages happens before the lock has been released, freezing the other backend using pg_relation_size on a standby waiting for the access share lock indefinitely. This would indeed prevent recovery conflict to work properly, because we should not hold an exclusive lock on the relation while replaying heap_xlog_clean. Hence it seems to me that we may have an issue with 68a2e52 (XLogInsert multi-slot), and it has been introduced in 9.4. We could perhaps think about adding some safeguards in ResolveRecoveryConflictWithSnapshot. Maxim, are you using commit_delay? It may be possible that this is fixed by bab9599 that has been pushed after 9.4.4 was released. Could you also check if HEAD or the soon-to-be-released 9.4.5 fix your problem? Regards, -- Michael
pgsql-bugs by date: