
Re: Recovery inconsistencies, standby much larger than primary - Mailing list pgsql-hackers

From	Greg Stark
Subject	Re: Recovery inconsistencies, standby much larger than primary
Date	January 31, 2014 20:28:37
Msg-id	CAM-w4HObtoH7vekEP6W5C-CCie26CDNyAXK8G3vPcVTWxZdGtw@mail.gmail.com
In response to	Re: Recovery inconsistencies, standby much larger than primary (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Recovery inconsistencies, standby much larger than primary
List	pgsql-hackers


One thing I keep coming back to is a bad RAM chip setting a bit in the block number. But I just can't seem to get it to add up. The difference is not a power of two, it has happened on two different machines, and we don't see other weirdness on those machines. It seems like a strange coincidence that it would happen to the same variable twice and not to other variables.
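To make the "doesn't add up" part concrete, here is a small Python sketch comparing the block number in the WAL record quoted below (3634978) with the block the page appears to have been written to (7141472). The helper names are made up for illustration. A single flipped bit would show up as exactly one differing bit position and a power-of-two difference.

```python
def bit_difference(expected, observed):
    """Return (xor mask, count of differing bits) for two block numbers."""
    mask = expected ^ observed
    return mask, bin(mask).count("1")


def is_power_of_two(n):
    """True only if n has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


wal_block = 3634978      # block number from the WAL record
written_block = 7141472  # block the page apparently landed on

mask, flipped = bit_difference(wal_block, written_block)
print(hex(wal_block), hex(written_block), hex(mask), flipped)
print(is_power_of_two(written_block - wal_block))
```

As far as I can tell the two values differ in 12 bit positions, which is hard to square with one stuck or flipped bit.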

Unless there's some unrelated code writing through a wild pointer, possibly to a stack-allocated object that just happens to often be that variable?

--
greg

On 31 Jan 2014 20:21, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Greg Stark <stark@mit.edu> writes:
> So just to summarize, this xlog record:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> 3634978/282
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072

> Appears to have been written to [ block 7141472 ]

I've been staring at the code for a bit trying to guess how that could
have happened. Since the WAL record has a backup block, btree_xlog_insert
would have passed control to RestoreBackupBlock, which would call
XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
about writing past the end of the relation. Now, you can imagine some
very low-level error causing a write to go to the wrong page due to a seek
problem or some such, but it's hard to credit that that would've resulted
in creation of all the intervening segment files. Some level of our code
had to have thought it was being told to extend the relation.
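For a sense of scale, here is a rough Python sketch of the segment arithmetic. It assumes the default build settings (BLCKSZ of 8192 bytes and RELSEG_SIZE of 131072 blocks, i.e. 1 GB segment files); a custom build could differ, and the helper name is only illustrative.

```python
BLCKSZ = 8192         # assumed default block size in bytes
RELSEG_SIZE = 131072  # assumed default blocks per segment file (1 GB)


def segment_of(blkno):
    """Segment file number that holds the given block number."""
    return blkno // RELSEG_SIZE


wal_seg = segment_of(3634978)  # block named in the WAL record
bad_seg = segment_of(7141472)  # block that was apparently written

# Segment files strictly between the two
intervening = list(range(wal_seg + 1, bad_seg))
print(wal_seg, bad_seg, len(intervening))
print(len(intervening) * RELSEG_SIZE * BLCKSZ // 2**30, "GB if fully allocated")
```

Under those assumptions the record's block sits in segment 27 and block 7141472 in segment 54, so segments 28 through 53 are the intervening files.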

However, on closer inspection I was a bit surprised to realize that there
are two possible candidates for doing that! XLogReadBufferExtended will
extend the relation, a block at a time, if told to write a page past
the current nominal EOF. And in md.c, _mdfd_getseg will *also* extend
the relation if we're InRecovery, even though it normally would not do
so when called from mdwrite().

Given the behavior in XLogReadBufferExtended, I rather think that the
InRecovery special case in _mdfd_getseg is dead code and should be
removed. But for the purpose at hand, it's more interesting to try to
confirm which of these code levels did the extension. I notice that
_mdfd_getseg only bothers to write the last physical page of each segment,
whereas XLogReadBufferExtended knows nothing of segments and will
ploddingly write every page. So on a filesystem that supports "holes"
in files, I'd expect that the added segments would be fully allocated
if XLogReadBufferExtended did the deed, but they'd be quite small if
_mdfd_getseg did so. The du results you started with suggest that the
former is the case, but could you verify that the filesystem this is
on supports holes and that du will report only the actually allocated
space when there's a hole?
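One way to check the filesystem is a sketch along these lines, in Python. It only imitates the "write the last page of the file" pattern; it is not PostgreSQL code. It assumes a Unix-like system where st_blocks is counted in 512-byte units (true on Linux, not guaranteed everywhere), and the function names are invented for the example.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for a file.

    Assumes st_blocks is in 512-byte units, as on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def write_last_block_only(path, nblocks, blcksz=8192):
    """Create a file of nblocks pages but write only the final page."""
    with open(path, "wb") as f:
        f.seek((nblocks - 1) * blcksz)
        f.write(b"\0" * blcksz)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "segment")
    write_last_block_only(path, nblocks=1024)
    apparent, allocated = apparent_and_allocated(path)
    print("apparent:", apparent, "allocated:", allocated)
```

If the filesystem supports holes, the allocated figure should come out far below the apparent size; if the two are about equal, du cannot distinguish the two extension paths there. Run it in a directory on the same filesystem as the data directory, since the default temp location may be a different one.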

Assuming that the extension was done in XLogReadBufferExtended, we are
forced to the conclusion that XLogReadBufferExtended was passed a bad
block number (viz 7141472); and it's pretty hard to see how that could
happen. RestoreBackupBlock is just passing the value it got out of the
WAL record. I thought about the idea that it was wrong about exactly
where the BkpBlock struct was in the record, but that would presumably
lead to garbage relnode and fork numbers not just a bad block number.

So I'm still baffled ...

regards, tom lane
