Re: FSM corruption leading to errors - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: FSM corruption leading to errors |
Date | |
Msg-id | CAB7nPqSd4L_P7NOGQ1CMeE8ecS8kyUyBP8Rv+9cMxFyJR=_EcQ@mail.gmail.com |
In response to | Re: FSM corruption leading to errors (Pavan Deolasee <pavan.deolasee@gmail.com>) |
Responses | Re: FSM corruption leading to errors |
List | pgsql-hackers |
On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> +   /*
>> +    * See comments in GetPageWithFreeSpace about handling outside the valid
>> +    * range blocks
>> +    */
>> +   nblocks = RelationGetNumberOfBlocks(rel);
>> +   while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> +   {
>> +       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> +                                                    spaceNeeded);
>> +   }
>>
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
> No, because the code above updates the FSM of those out-of-the range blocks.
> But now that I look at it again, may be this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call for
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

>> And isn't that going to break once the
>> relation is extended again?
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

>> I'd suggest instead putting in the release
>> notes a query that allows one to analyze what are the relations broken
>> and directly have them fixed. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
--
Michael
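
The diagnostic hinted at in the last paragraph could look roughly like the query below. This is only a sketch under several assumptions that are not stated in the thread: the pg_freespacemap extension is installed, the check is limited to permanent heap-like relations (so that it can also run on a standby), and only the first block past the end of each relation is probed, so stale entries for later blocks would go unnoticed. The column aliases are illustrative.

```sql
-- Sketch only: requires the pg_freespacemap extension
-- (CREATE EXTENSION pg_freespacemap;), which provides pg_freespace().
-- Run after restarting the server so that FSM pages are read from disk.
SELECT c.oid::regclass AS relation,
       s.nblocks       AS first_out_of_range_block,
       pg_freespace(c.oid::regclass, s.nblocks) AS recorded_free_space
FROM pg_class AS c
CROSS JOIN LATERAL (
    -- Blocks are numbered from 0, so size / block_size is the first
    -- block number that lies beyond the end of the main fork.
    SELECT pg_relation_size(c.oid::regclass)
           / current_setting('block_size')::bigint AS nblocks
) AS s
WHERE c.relkind IN ('r', 't', 'm')   -- tables, TOAST tables, matviews
  AND c.relpersistence = 'p'         -- skip temp and unlogged relations
  AND pg_freespace(c.oid::regclass, s.nblocks) <> 0;
```

Any row returned names a relation whose FSM still advertises free space in a block that does not exist. How to repair such a relation is not settled in this message, so the query is best read as a detection aid only.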