Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. - Mailing list pgsql-bugs
From | Jeff Janes |
---|---|
Subject | Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. |
Date | |
Msg-id | CAMkU=1zxgAf846Lrjk9z1iBAP3MrwHvr9p-XLvC1-N9_uNoKfw@mail.gmail.com |
In response to | Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. (Michael Paquier <michael.paquier@gmail.com>) |
Responses | Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. |
List | pgsql-bugs |
On Mon, Oct 5, 2015 at 1:15 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

> On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> >> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> >> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting for
> >> >> something.
> >> >> And query waiting for AccessShareLock on the same relation.
> >> >>
> >> > gdb backtrace from stuck startup process:
> >> > (gdb) bt
> >> > #0  0x00007f04ad862633 in select () from /lib/x86_64-linux-gnu/libc.so.6
> >> > #1  0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53
> >> > #2  0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171
> >> > #3  ResolveRecoveryConflictWithVirtualXIDs (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232
> >> > #4  ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191
> >> > #5  0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80, lsn=107351881751648) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329
> >>
> >> This backtrace is not indicating that this process is waiting on a
> >> relation lock; it is resolving a recovery conflict while removing tuples,
> >> killing the virtual transaction depending on whether
> >> max_standby_streaming_delay or max_standby_archive_delay are set, if the
> >> conflict gets longer. Did you change the default of those parameters,
> >> which is 30s, to -1? This would mean that the standby waits indefinitely.
> >
> > While setting it to -1 gives the startup process permission to wait
> > indefinitely for another back-end which is doing something, I don't think
> > that that means it should have permission to deadlock indefinitely. A
> > deadlock is a different kind of thing than "someone started a transaction
> > and then left on vacation for a month without closing it".
>
> Looking at that more closely... We know that a startup process is the
> only process in a standby that can take an exclusive lock on a
> relation, per LockAcquireExtended in lock.c, and that this lock is
> taken after replaying XLOG_STANDBY_LOCK. The deadlock reported by
> Maxim indicates that the deadlock is showing up in heap_xlog_clean,
> which may happen in the case of a VACUUM for example. It seems to me
> that we could have a deadlock if some WAL records had their order
> swapped. For example, the EXCLUSIVE lock is replayed first and the
> cleanup of relation pages happens before the lock has been released,
> freezing the other backend using pg_relation_size on a standby, waiting
> for the access share lock indefinitely. This would indeed prevent
> recovery conflict resolution from working properly, because we should
> not hold an exclusive lock on the relation while replaying
> heap_xlog_clean. Hence it seems to me that we may have an issue with
> 68a2e52 (XLogInsert multi-slot), which was introduced in 9.4.

I was assuming that the table being AccessExclusive locked and the table
being VACUUMed were two different tables, which were locked and vacuumed
by two different processes on the master, which means that no re-ordering
of WAL records would be necessary to hit this. Although I also haven't
yet been able to reproduce the deadlock under this assumption.

Cheers,

Jeff