Re: [BUGS] BUG #14721: Assertion of synchronous replication - Mailing list pgsql-bugs
From | Thomas Munro |
---|---|
Subject | Re: [BUGS] BUG #14721: Assertion of synchronous replication |
Date | |
Msg-id | CAEepm=3k=gTAn5=X_Qv=hWw9JnUxUMXCzBxTKPaHHXxKkF0+iw@mail.gmail.com |
In response to | [BUGS] BUG #14721: Assertion of synchronous replication (const_sunny@126.com) |
Responses | Re: [BUGS] BUG #14721: Assertion of synchronous replication |
List | pgsql-bugs |
On Thu, Jun 29, 2017 at 2:36 PM, <const_sunny@126.com> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14721
> Logged by:          Const Zhang
> Email address:      const_sunny@126.com
> PostgreSQL version: 9.6.2
> Operating system:   CentOS7

Hi Const,

Thanks for the detailed report. What type of CPU is this running on?

> I have found a bug about synchronous replication.
> At first, see the stack of the core file.
>
> (gdb) bt
> #0  0x00007fe9aab2e1d7 in raise () from /lib64/libc.so.6
> #1  0x00007fe9aab2f8c8 in abort () from /lib64/libc.so.6
> #2  0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111
> "!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))", errorType=0xb6c443
> "FailedAssertion",
> fileName=0xcdc140
> "/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c",
> lineNumber=294) at

I assume this is line 298 in master today. There is some interesting
IPC going on here. SyncRepWakeQueue() does three things while holding
SyncRepLock in some other process:

    thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
    SHMQueueDelete(&(thisproc->syncRepLinks));
    SetLatch(&(thisproc->procLatch));

Meanwhile SyncRepWaitForLSN() in your process does this:

    /*
     * Acquiring the lock is not needed, the latch ensures proper
     * barriers. If it looks like we're done, we must really be done,
     * because once walsender changes the state to SYNC_REP_WAIT_COMPLETE,
     * it will never update it again, so we can't be seeing a stale value
     * in that case.
     */
    if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
        break;

... OK, then outside the loop:

    /*
     * WalSender has checked our LSN and has removed us from queue. Clean up
     * state and leave.  It's OK to reset these shared memory fields without
     * holding SyncRepLock, because any walsenders will ignore us anyway when
     * we're not on the queue.
     */
    Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));

I wonder if your CPU core was able to see syncRepState ==
SYNC_REP_WAIT_COMPLETE, but not yet see MyProc->syncRepLinks's next
and previous members as 0. When you look at them in your debugger you
see them as zero, but that's a bit later:

> (gdb) p MyProc->syncRepLinks
> $1 = {prev = 0x0, next = 0x0}

I may be way off base here, and haven't studied all of your report
yet. But my first thought is: shouldn't SyncRepWakeQueue() do things
the other way around, with a barrier in between, like this:

    SHMQueueDelete(&(thisproc->syncRepLinks));
    pg_write_barrier();
    thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
    SetLatch(&(thisproc->procLatch));

... and then shouldn't SyncRepWaitForLSN() have a pg_read_barrier()
inserted before Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)))?

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
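
For anyone who wants to experiment with the ordering proposed in the message above outside of PostgreSQL, here is a minimal standalone sketch. It is not PostgreSQL code: DemoProc, demo_waker(), demo_waiter() and demo_run_handshake() are hypothetical names invented for illustration, and C11 fences plus pthreads are assumed as rough stand-ins for pg_write_barrier(), pg_read_barrier() and the latch. The waker detaches the links first, issues a release fence, and only then publishes the state. The waiter issues an acquire fence after seeing the state and before checking the links.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PGPROC fields discussed above. */
enum { DEMO_WAITING, DEMO_WAIT_COMPLETE };

typedef struct DemoProc
{
    void       *prev;           /* plays the role of syncRepLinks.prev */
    void       *next;           /* plays the role of syncRepLinks.next */
    atomic_int  state;          /* plays the role of syncRepState */
    int         saw_detached;   /* waiter's observation, read after join */
} DemoProc;

static void *
demo_waker(void *arg)
{
    DemoProc   *proc = arg;

    /* Detach first (what SHMQueueDelete() does to the links)... */
    proc->prev = NULL;
    proc->next = NULL;
    /* ...then a release fence, standing in for pg_write_barrier()... */
    atomic_thread_fence(memory_order_release);
    /* ...and only then publish the state change. */
    atomic_store_explicit(&proc->state, DEMO_WAIT_COMPLETE,
                          memory_order_relaxed);
    return NULL;
}

static void *
demo_waiter(void *arg)
{
    DemoProc   *proc = arg;

    /* Spin until the state flips; the real code sleeps on a latch instead. */
    while (atomic_load_explicit(&proc->state, memory_order_relaxed)
           != DEMO_WAIT_COMPLETE)
        sched_yield();
    /* Acquire fence, standing in for pg_read_barrier(), before the check. */
    atomic_thread_fence(memory_order_acquire);
    proc->saw_detached = (proc->prev == NULL && proc->next == NULL);
    return NULL;
}

/* Run the handshake `rounds` times; return how many rounds saw stale links. */
int
demo_run_handshake(int rounds)
{
    int         stale = 0;

    for (int i = 0; i < rounds; i++)
    {
        DemoProc    proc;
        pthread_t   waiter_thread, waker_thread;
        int         rc;

        proc.prev = &proc;      /* "on the queue": links are non-NULL */
        proc.next = &proc;
        proc.saw_detached = 0;
        atomic_init(&proc.state, DEMO_WAITING);

        rc = pthread_create(&waiter_thread, NULL, demo_waiter, &proc);
        assert(rc == 0);
        rc = pthread_create(&waker_thread, NULL, demo_waker, &proc);
        assert(rc == 0);
        (void) rc;
        pthread_join(waiter_thread, NULL);
        pthread_join(waker_thread, NULL);

        if (!proc.saw_detached)
            stale++;
    }
    return stale;
}
```

With both fences in place, demo_run_handshake() should return 0 for any number of rounds, since the C11 memory model guarantees the waiter sees the cleared links once it has seen the new state. Removing the fences, or storing the state before clearing the links as the current code does, removes that guarantee, although the failure may not reproduce on a strongly ordered CPU.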