Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. - Mailing list pgsql-bugs
From | Maxim Boguk |
---|---|
Subject | Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. |
Date | |
Msg-id | CAK-MWwS=MvYN=4VQFYbxoMiqUsnCXpc0JF9ChyP+opvhkcrD6Q@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. (Michael Paquier <michael.paquier@gmail.com>) |
Responses |
Re: BUG #13657: Some kind of undetected deadlock between query
and "startup process" on replica.
|
List | pgsql-bugs |
On Mon, Oct 5, 2015 at 7:15 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > > On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier < > michael.paquier@gmail.com> > > wrote: > >> > >> > >> > >> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com> > wrote: > >> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting > for > >> >> something. > >> >> And query waiting for AccessShareLock on the same relation. > >> >> > >> > gdb backtrace from stuck startup process: > >> > (gdb) bt > >> > #0 0x00007f04ad862633 in select () from > /lib/x86_64-linux-gnu/libc.so.6 > >> > #1 0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at > >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53 > >> > #2 0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at > >> > > >> > > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171 > >> > #3 ResolveRecoveryConflictWithVirtualXIDs > >> > (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0) > at > >> > > >> > > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232 > >> > #4 ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0, > >> > reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at > >> > > >> > > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191 > >> > #5 0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80, > >> > lsn=107351881751648) at > >> > > >> > > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329 > >> > >> This backtrace is not indicating that this process is waiting on a > >> relation lock, it is resolving a recovery conflict while removing > tuples, > >> killing the virtual transaction depending on if > max_standby_streaming_delay > >> or max_standby_archive_delay are set if the conflict gets longer. Did > you > >> change the default of those parameters, which is 30s, to -1? This would > mean > >> that the standby waits indefinitely. > > > > While setting it to -1 gives the startup process permission to wait > > indefinitely for another back-end which is doing something, I don't think > > that that means it should have permission to deadlock indefinitely. A > > deadlock is a different kind of thing that "someone started a transaction > > and then left on vacation for a month without closing it" > > Looking eat that more closely... We know that a startup process is the > only process in a standby that can take an exclusive lock on a > relation, per LockAcquireExtended in lock.c and that this lock is > taken after replaying XLOG_STANDBY_LOCK. The deadlock reported by > Maxim indicates that the deadlock is showing up in heap_xlog_clean, > which may happen in the case of a VACUUM for example. It seems to me > that we could have a deadlock if some WAL records had their order > swapped. For example, the EXCLUSIVE lock is replayed first and the > cleanup of relation pages happens before the lock has been released, > freezing the other backend using pg_relation_size on a standby waiting > for the access share lock indefinitely. This would indeed prevent > recovery conflict to work properly, because we should not hold an > exclusive lock on the relation while replaying heap_xlog_clean. Hence > it seems to me that we may have an issue with 68a2e52 (XLogInsert > multi-slot), and it has been introducd in 9.4. > > We could perhaps think about adding some safeguards in > ResolveRecoveryConflictWithSnapshot. > > Maxim, are you using commit_delay? It may be possible that this is > fixed by bab9599 that has been pushed after 9.4.4 was released. Could > you also check if HEAD or the soon-to-be-released 9.4.5 fix your > problem? > Regards, > -- > Michael > Hi Michael, commit_delay isn't used on master server or replicas. I would like to check 9.4.5 but I prefer wait for release and not build it by hands, and from my experience building from git tree could be tricky as I don't have root access to install additional packages on the server. However, I will try it in next few days (if I get an permission). -- Maxim Boguk Senior Postgresql DBA http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>
pgsql-bugs by date: