Re: FSM corruption leading to errors - Mailing list pgsql-hackers

From	Michael Paquier
Subject	Re: FSM corruption leading to errors
Date	October 10, 2016 23:50:23
Msg-id	CAB7nPqSd4L_P7NOGQ1CMeE8ecS8kyUyBP8Rv+9cMxFyJR=_EcQ@mail.gmail.com
In response to	Re: FSM corruption leading to errors (Pavan Deolasee <pavan.deolasee@gmail.com>)
Responses	Re: FSM corruption leading to errors
List	pgsql-hackers

On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>>
>>
>> +   /*
>> +    * See comments in GetPageWithFreeSpace about handling outside the valid
>> +    * range blocks
>> +    */
>> +   nblocks = RelationGetNumberOfBlocks(rel);
>> +   while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> +   {
>> +       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> +               spaceNeeded);
>> +   }
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
>
> No, because the code above updates the FSM of those out-of-range blocks.
> But now that I look at it again, maybe this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call for
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.
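
To make the record-then-search step concrete, here is a minimal toy model in C. It is only a sketch under stated assumptions: ToyFsm, toy_search(), toy_record_and_search() and toy_get_valid_block() are hypothetical names, the FSM is a flat array rather than PostgreSQL's tree, and nothing here is taken from the actual patch beyond the shape of the loop quoted above. It also does not model concurrent relation extension, which is the case Pavan worries about.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX   /* stand-in for InvalidBlockNumber */
#define TOY_FSM_SLOTS 8                /* hypothetical fixed-size toy FSM */

/* Toy FSM: one advertised free-space value per block slot. */
typedef struct ToyFsm
{
    uint32_t avail[TOY_FSM_SLOTS];
} ToyFsm;

/* Return the first slot advertising at least `needed` bytes, if any. */
static uint32_t
toy_search(const ToyFsm *fsm, uint32_t needed)
{
    for (uint32_t blk = 0; blk < TOY_FSM_SLOTS; blk++)
        if (fsm->avail[blk] >= needed)
            return blk;
    return TOY_INVALID_BLOCK;
}

/* Record `space` for `blk`, then search again for `needed` bytes. */
static uint32_t
toy_record_and_search(ToyFsm *fsm, uint32_t blk, uint32_t space,
                      uint32_t needed)
{
    fsm->avail[blk] = space;
    return toy_search(fsm, needed);
}

/*
 * Skip entries at or beyond `nblocks`, zeroing each stale entry on the way.
 * With needed > 0 and a fixed nblocks, every iteration clears one slot,
 * so the loop ends after at most TOY_FSM_SLOTS iterations.
 */
static uint32_t
toy_get_valid_block(ToyFsm *fsm, uint32_t nblocks, uint32_t needed)
{
    uint32_t target;

    assert(needed > 0);     /* a zeroed slot must never match again */
    target = toy_search(fsm, needed);
    while (target != TOY_INVALID_BLOCK && target >= nblocks)
        target = toy_record_and_search(fsm, target, 0, needed);
    return target;
}
```

In this model, a 4-block relation whose toy FSM still advertises space in slots 5 and 6 ends up with both slots zeroed and TOY_INVALID_BLOCK returned, meaning the caller would have to extend the relation. The termination argument relies on nblocks staying fixed during the loop.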

>> And isn't that going to break once the
>> relation is extended again?
>
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code mostly to deal with already corrupt FSMs. Maybe we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overhead for
> everybody, irrespective of whether they have or will ever have a corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past, for example for the
multixact issues.

>> I'd suggest instead putting in the release
>> notes a query that lets one find which relations are broken
>> and have them fixed directly. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>>
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
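
For clarity, the arithmetic behind that shorthand can be written out as a small C sketch. Block numbers start at 0, so the relation size divided by the block size is the block count, which is also the number of the first block past the end. The helper names below are hypothetical and only illustrate the check; the real test would go through pg_freespace() from pg_freespacemap, which takes the relation and a block number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * First block number past the end of a relation of `rel_size_bytes` bytes.
 * Blocks are numbered from 0, so this equals the number of blocks.
 */
static uint64_t
first_out_of_range_block(uint64_t rel_size_bytes, uint32_t block_size)
{
    assert(block_size > 0);
    return rel_size_bytes / block_size;
}

/*
 * Hypothetical check: an FSM entry is suspicious when it advertises
 * free space for a block that does not exist in the relation.
 */
static bool
fsm_entry_is_suspicious(uint64_t blkno, uint64_t nblocks, int fsm_avail)
{
    return blkno >= nblocks && fsm_avail != 0;
}
```

For example, an 81920-byte relation with 8192-byte blocks has blocks 0 to 9, so block 10 is the one to probe. Note that this only looks at the first block past the end.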
-- 
Michael
