Re: [BUGS] BUG #14721: Assertion of synchronous replication - Mailing list pgsql-bugs
| From | Thomas Munro |
|---|---|
| Subject | Re: [BUGS] BUG #14721: Assertion of synchronous replication |
| Date | |
| Msg-id | CAEepm=3k=gTAn5=X_Qv=hWw9JnUxUMXCzBxTKPaHHXxKkF0+iw@mail.gmail.com Whole thread Raw |
| In response to | [BUGS] BUG #14721: Assertion of synchronous replication (const_sunny@126.com) |
| Responses |
Re: [BUGS] BUG #14721: Assertion of synchronous replication
|
| List | pgsql-bugs |
On Thu, Jun 29, 2017 at 2:36 PM, <const_sunny@126.com> wrote:
> The following bug has been logged on the website:
>
> Bug reference: 14721
> Logged by: Const Zhang
> Email address: const_sunny@126.com
> PostgreSQL version: 9.6.2
> Operating system: CentOS7
Hi Const,
Thanks for the detailed report. What type of CPU is this running on?
> I have found a bug about synchronous replication.
> At first, see the stack of the core file.
>
> 1
> (gdb) bt
> 2
> #0 0x00007fe9aab2e1d7 in raise () from /lib64/libc.so.6
> 3
> #1 0x00007fe9aab2f8c8 in abort () from /lib64/libc.so.6
> 4
> #2 0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111
> "!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))", errorType=0xb6c443
> "FailedAssertion",
> 5
> fileName=0xcdc140
> "/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c",
> lineNumber=294) at
I assume this is line 298 in master today. There is some interesting
IPC going on here. SyncRepWakeQueue() does three things while holding
SyncRepLock in some other process:
thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE; SHMQueueDelete(&(thisproc->syncRepLinks));
SetLatch(&(thisproc->procLatch));
Meanwhile SyncRepWaitForLSN() in your process does this:
/* * Acquiring the lock is not needed, the latch ensures proper * barriers. If it looks like we're
done,we must really be done, * because once walsender changes the state to SYNC_REP_WAIT_COMPLETE, * it
willnever update it again, so we can't be seeing a stale value * in that case. */ if
(MyProc->syncRepState== SYNC_REP_WAIT_COMPLETE) break;
... OK, then outside the loop:
/* * WalSender has checked our LSN and has removed us from queue. Clean up * state and leave. It's OK to
resetthese shared memory fields without * holding SyncRepLock, because any walsenders will ignore us anyway when
*we're not on the queue. */ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
I wonder your CPU core was able to see syncRepState ==
SYNC_REP_WAIT_COMPLETE, but not yet see MyProc->syncRepLinks's next
and previous members as 0. When you look at them in your debugger you
see them as zero, but that's a bit later:
> (gdb) p MyProc->syncRepLinks
> 2
> $1 = {prev = 0x0, next = 0x0}
I may be way off base here, and haven't studied all of your report
yet. But my first thought is: shouldn't SyncRepWakeQueue() do things
the other way around, with a barrier in between, like this:
SHMQueueDelete(&(thisproc->syncRepLinks)); pg_write_barrier(); thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
SetLatch(&(thisproc->procLatch));
... and then shouldn't SyncRepWaitForLSN() have a pg_read_barrier()
inserted before Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)))?
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
pgsql-bugs by date: