Re: Recovery inconsistencies, standby much larger than primary - Mailing list pgsql-hackers
From | Greg Stark |
---|---|
Subject | Re: Recovery inconsistencies, standby much larger than primary |
Date | |
Msg-id | CAM-w4HObtoH7vekEP6W5C-CCie26CDNyAXK8G3vPcVTWxZdGtw@mail.gmail.com |
In response to | Re: Recovery inconsistencies, standby much larger than primary (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Recovery inconsistencies, standby much larger than primary |
List | pgsql-hackers |
<p dir="ltr">One thing I keep coming back to is a bad RAM chip setting a bit in the block number. But I just can't seem to get it to add up. The difference is not a power of two, it has happened on two different machines, and we don't see other weirdness on the machine. It seems like a strange coincidence that it would happen to the same variable twice and not to other variables.
<p dir="ltr">Unless there's some unrelated code writing through a wild pointer, possibly to a stack-allocated object that just happens to often be that variable?
<p dir="ltr">-- <br />
greg
<div class="gmail_quote">On 31 Jan 2014 20:21, "Tom Lane" &lt;<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>&gt; wrote:<br type="attribution" />
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Greg Stark &lt;<a href="mailto:stark@mit.edu">stark@mit.edu</a>&gt; writes:<br />
> So just to summarize, this xlog record:<br />
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,<br />
> info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid<br />
> 3634978/282<br />
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,<br />
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982<br />
> blk:3634978 hole_off/len:1240/2072<br />
<br />
> Appears to have been written to [ block 7141472 ]<br />
<br />
I've been staring at the code for a bit trying to guess how that could<br />
have happened. Since the WAL record has a backup block, btree_xlog_insert<br />
would have passed control to RestoreBackupBlock, which would call<br />
XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint<br />
about writing past the end of the relation. Now, you can imagine some<br />
very low-level error causing a write to go to the wrong page due to a seek<br />
problem or some such, but it's hard to credit that that would've resulted<br />
in creation of all the intervening segment files.
Some level of our code<br />
had to have thought it was being told to extend the relation.<br />
<br />
However, on closer inspection I was a bit surprised to realize that there<br />
are two possible candidates for doing that! XLogReadBufferExtended will<br />
extend the relation, a block at a time, if told to write a page past<br />
the current nominal EOF. And in md.c, _mdfd_getseg will *also* extend<br />
the relation if we're InRecovery, even though it normally would not do<br />
so when called from mdwrite().<br />
<br />
Given the behavior in XLogReadBufferExtended, I rather think that the<br />
InRecovery special case in _mdfd_getseg is dead code and should be<br />
removed. But for the purpose at hand, it's more interesting to try to<br />
confirm which of these code levels did the extension. I notice that<br />
_mdfd_getseg only bothers to write the last physical page of each segment,<br />
whereas XLogReadBufferExtended knows nothing of segments and will<br />
ploddingly write every page. So on a filesystem that supports "holes"<br />
in files, I'd expect that the added segments would be fully allocated<br />
if XLogReadBufferExtended did the deed, but they'd be quite small if<br />
_mdfd_getseg did so. The du results you started with suggest that the<br />
former is the case, but could you verify that the filesystem this is<br />
on supports holes and that du will report only the actually allocated<br />
space when there's a hole?<br />
<br />
Assuming that the extension was done in XLogReadBufferExtended, we are<br />
forced to the conclusion that XLogReadBufferExtended was passed a bad<br />
block number (viz 7141472); and it's pretty hard to see how that could<br />
happen. RestoreBackupBlock is just passing the value it got out of the<br />
WAL record.
I thought about the idea that it was wrong about exactly<br />
where the BkpBlock struct was in the record, but that would presumably<br />
lead to garbage relnode and fork numbers not just a bad block number.<br />
<br />
So I'm still baffled ...<br />
<br />
regards, tom lane<br />
</blockquote></div>
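
The arithmetic behind both messages can be checked with a short Python sketch. It takes the two block numbers quoted in the thread (3634978 from the WAL record, 7141472 as the block apparently written) and asks two things: do they differ by a single flipped bit, and which segment files would each block fall into? The segment math assumes a default build (8 kB pages, 1 GB segment files, so 131072 blocks per segment); the thread does not state the build settings, and the helper names are purely illustrative.

```python
# Block numbers as quoted in the thread.
EXPECTED_BLOCK = 3634978   # blk in the bkpblock[1] WAL record
OBSERVED_BLOCK = 7141472   # block the page appears to have been written to

# Assumption: default build settings, not confirmed anywhere in the thread.
BLCKSZ = 8192              # bytes per page
RELSEG_SIZE = 131072       # blocks per segment file (1 GB / 8 kB)


def flipped_bits(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def is_power_of_two(n: int) -> bool:
    """True if n has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def segment_of(blkno: int) -> int:
    """Segment file number that would hold the given block."""
    return blkno // RELSEG_SIZE


if __name__ == "__main__":
    diff = OBSERVED_BLOCK - EXPECTED_BLOCK
    print("difference in blocks:", diff)
    print("difference is a power of two:", is_power_of_two(diff))
    print("differing bit positions:", flipped_bits(EXPECTED_BLOCK, OBSERVED_BLOCK))
    print("expected block lives in segment", segment_of(EXPECTED_BLOCK))
    print("observed block lives in segment", segment_of(OBSERVED_BLOCK))
    print("bytes between the two blocks:", diff * BLCKSZ)
```

Under those assumptions the two block numbers differ in 12 bit positions rather than one, which is consistent with the single-bit-flip theory not adding up, and the observed block lands 27 segment files beyond the expected one, which matches the point about intervening segment files having to be created.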