Re: BUG #7500: hot-standby replica crash after an initial rsync - Mailing list pgsql-bugs
From | Maxim Boguk |
---|---|
Subject | Re: BUG #7500: hot-standby replica crash after an initial rsync |
Date | |
Msg-id | CAK-MWwTfoG_HcUhFk2KVm1R8XPWEq1AEw3JWuodWNq76GUBm8A@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #7500: hot-standby replica crash after an initial rsync (Maxim Boguk <maxim.boguk@gmail.com>) |
Responses |
Re: BUG #7500: hot-standby replica crash after an initial rsync
|
List | pgsql-bugs |
On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote: > > > On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund <andres@2ndquadrant.com>wrote: > >> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote: >> > Hi Andres, >> >> I would add something akin to >> >> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid()); >> sleep(10); >> > > Hi Andres, > > There is it: > > gdb backtrace: > > #0 0x0000000800bd1ab0 in exit () from /lib/libc.so.7 > #1 0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137 > #2 0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501 > #3 0x0000000000488956 in SlruReportIOError (ctl=<optimized out>, > pageno=<optimized out>, xid=361788644) at slru.c:857 > #4 0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654, > write_ok=1 '\001', xid=361788644) at slru.c:427 > #5 0x0000000000489f7f in RecordNewMultiXact (multi=361788644, > offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764 > #6 0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at > multixact.c:2017 > #7 0x0000000000484bea in StartupXLOG () at xlog.c:6277 > #8 0x00000000004870a8 in StartupProcessMain () at xlog.c:9335 > #9 0x000000000049cab2 in AuxiliaryProcessMain (argc=2, > argv=0x7fffffffe450) at bootstrap.c:412 > #10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at > postmaster.c:4442 > #11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at > postmaster.c:1093 > #12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at main.c:188 > > relevant database log: > 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOG: 00000: > redo starts at 925/2C0004C8 > 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION: > StartupXLOG, xlog.c:6202 > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []FATAL: > XX000: could not access status of transaction 361788644 > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []DETAIL: > Could not read from file "pg_multixact/offsets/1590" at offset 114688: > Unknown error: 0. > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []CONTEXT: > xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078 > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION: > SlruReportIOError, slru.c:840 > > > Kind Regards, > Maksym > And finally gdb backtrace on build without optimization ( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug --enable-cassert CFLAGS='-O0' ): #0 0x0000000800d42ab0 in exit () from /lib/libc.so.7 #1 0x00000000006aafc6 in proc_exit (code=1) at ipc.c:137 #2 0x00000000007bcbb1 in errfinish (dummy=0) at elog.c:501 #3 0x00000000004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654, xid=361788644) at slru.c:836 #4 0x00000000004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654, write_ok=1 '\001', xid=361788644) at slru.c:427 #5 0x00000000004b85dd in RecordNewMultiXact (multi=361788644, offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764 #6 0x00000000004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at multixact.c:2017 #7 0x00000000004aef0c in StartupXLOG () at xlog.c:6277 #8 0x00000000004b446d in StartupProcessMain () at xlog.c:9335 #9 0x00000000004d9e36 in AuxiliaryProcessMain (argc=2, argv=0x7fffffffe8b0) at bootstrap.c:412 #10 0x0000000000683398 in StartChildProcess (type=StartupProcess) at postmaster.c:4442 #11 0x000000000067ef60 in PostmasterMain (argc=5, argv=0x7fffffffea38) at postmaster.c:1093 #12 0x00000000005fbf5a in main (argc=5, argv=0x7fffffffea38) at main.c:188 It seems failing around function SlruPhysicalReadPage at block errno = 0; if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ) { slru_errcause = SLRU_READ_FAILED; slru_errno = errno; close(fd); return false; } It seems that it read less than BLCKSZ bytes. And it is not an error condition for read call (from man 2 read: Note that reading less than size bytes is not an error. ). Actual size of pg_multixact/offsets/1590 file is: [pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la /db/data_back/pg_multixact/offsets/1590 -rw------- 1 pgsql pgsql 81920 Aug 21 07:46 /db/data_back/pg_multixact/offsets/1590 Now it seems that the (lseek(fd, (off_t) offset, SEEK_SET) < 0) call moving seek position to outside of an actual file size. And after that, read call actually could read only 0 bytes from it. Does that mean that file was damaged during rsync? Kind Regards, Maksym What next I should examine? Kind Regards, Maksym
pgsql-bugs by date: