Home > mailing lists
Re: Error with index on unlogged table - Mailing list pgsql-hackers

From	Michael Paquier
Subject	Re: Error with index on unlogged table
Date	December 4, 2015 08:00:21
Msg-id	CAB7nPqTHs-2qvb4YWeqX2hid25tBAfPj=guXUi3gB0ooQi_GFQ@mail.gmail.com Whole thread Raw
In response to	Re: Error with index on unlogged table (Andres Freund <andres@anarazel.de>)
Responses	Re: Error with index on unlogged table
List	pgsql-hackers
Tree view
Andres Freud wrote:
> On 2015-11-20 16:11:15 +0900, Michael Paquier wrote:
>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
>> index cc845d2..4883697 100644
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
>>               data += sizeof(BkpBlock);
>>
>>               RestoreBackupBlockContents(lsn, bkpb, data, false, false);
>> +
>> +             if (bkpb.fork == INIT_FORKNUM)
>> +             {
>> +                     SMgrRelation srel;
>> +                     srel = smgropen(bkpb.node, InvalidBackendId);
>> +                     smgrimmedsync(srel, INIT_FORKNUM);
>> +                     smgrclose(srel);
>> +             }
>>       }
>>       else if (info == XLOG_BACKUP_END)
>>       {
>
> A smgrwrite() instead of a smgrimmedsync() should be sufficient here.

Check.

On Fri, Dec 4, 2015 at 6:35 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-11-27 16:59:20 +0900, Michael Paquier wrote:
>> Attached is a patch that fixes the issue for me in master and 9.5.
>> Actually in the last patch I forgot a call to smgrwrite to ensure that
>> the INIT_FORKNUM is correctly synced to disk when those pages are
>> replayed at recovery, letting the reset routines for unlogged
>> relations do their job correctly. I have noticed as well that we need
>> to do the same for gin and brin relations. In this case I think that
>> we could limit the flush to unlogged relations, my patch does it
>> unconditionally though to generalize the logic. Thoughts?
>
> I think it's a good idea to limit this to unlogged relations. For a
> dataset that fits into shared_buffers and creates many relations, the
> additional disk writes could be noticeable.

OK, then I have switched the patch this way to limit the flush to
unlogged relations where needed.

>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
>> index 99337b0..d7964ac 100644
>> --- a/src/backend/access/brin/brin.c
>> +++ b/src/backend/access/brin/brin.c
>> @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS)
>>       brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>>                                          BRIN_CURRENT_VERSION);
>>       MarkBufferDirty(metabuf);
>> -     log_newpage_buffer(metabuf, false);
>> +     /*
>> +      * When this full page image is replayed, there is no guarantee that
>> +      * this page will be present to disk when replayed particularly for
>> +      * unlogged relations, hence enforce it to be flushed to disk.
>> +      */
>
> The grammar seems a bit off here.

Check.

>> +     /*
>> +      * Initialize and xlog metabuffer and root buffer. When those full
>> +      * pages are replayed, it is not guaranteed that those relation
>> +      * init forks will be flushed to disk after replaying them, hence
>> +      * enforce those pages to be flushed to disk at replay, only the
>> +      * last record will enforce a flush for performance reasons and
>> +      * because it is actually unnecessary to do it multiple times.
>> +      */
>
> That comment needs some love too. I think it really only makes sense if
> you know the current state. There's also some language polishing needed.

Check.

>> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS)
>>       START_CRIT_SECTION();
>>       GinInitMetabuffer(MetaBuffer);
>>       MarkBufferDirty(MetaBuffer);
>> -     log_newpage_buffer(MetaBuffer, false);
>> +     log_newpage_buffer(MetaBuffer, false, false);
>>       GinInitBuffer(RootBuffer, GIN_LEAF);
>>       MarkBufferDirty(RootBuffer);
>> -     log_newpage_buffer(RootBuffer, false);
>> +     log_newpage_buffer(RootBuffer, false, true);
>>       END_CRIT_SECTION();
>>
> Why isn't the metapage flushed to disk?

I was not sure if we should only flush only at the last page, as pages
just before would be already replayed. Switched.

>> --- a/src/backend/access/spgist/spginsert.c
>> +++ b/src/backend/access/spgist/spginsert.c
>> @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
>>                         (char *) page, true);
>>       if (XLogIsNeeded())
>>               log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
>> -                                     SPGIST_METAPAGE_BLKNO, page, false);
>> +                                     SPGIST_METAPAGE_BLKNO, page, false, false);
>>
>>       /* Likewise for the root page. */
>>       SpGistInitPage(page, SPGIST_LEAF);
>> @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
>>                         (char *) page, true);
>>       if (XLogIsNeeded())
>>               log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
>> -                                     SPGIST_ROOT_BLKNO, page, true);
>> +                                     SPGIST_ROOT_BLKNO, page, true, false);
>>
>
> I don't think it's correct to not flush in these cases. Yes, the primary
> does an smgrimmedsync() - but that's not done on the standby.

OK. Switched.

>> @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record)
>>                * when checksums are enabled. There is no difference in handling
>>                * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
>>                * code just to distinguish them for statistics purposes.
>> +              *
>> +              * XLOG_FPI_FOR_SYNC records are generated when a relation needs to
>> +              * be sync'ed just after replaying a full page. This is important
>> +              * to match the master behavior in the case where a page is written
>> +              * directly to disk without going through shared buffers, particularly
>> +              * when writing init forks for index relations.
>>                */
>
> How about
>
> FPI_FOR_SYNC records indicate that the page immediately needs to be
> written to disk, not just to shared buffers. This is important if the
> on-disk state is to be the authoritative, not the state in shared
> buffers. E.g. because on-disk files may later be copied directly.

OK.

>> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
>> index 0ddde72..eb22a51 100644
>> --- a/src/backend/commands/tablecmds.c
>> +++ b/src/backend/commands/tablecmds.c
>> @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
>>                * space.
>>                */
>>               if (use_wal)
>> -                     log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
>> +                     log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page,
>> +                                             false, false);
>
> Shouldn't this flush data if it's an init fork? Currently that's an
> academic distinction, given there'll never be a page, but it still seems
> prudent.

Yeah it should do that for all the INIT_FORKNUM because we don't know
the page type here. Note that use_wal is broken as well. We had better
log and flush the INIT_FORKNUM as well for unlogged relations.

>>  extern void InitXLogInsert(void);
>> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
>> index ad1eb4b..91445f1 100644
>> --- a/src/include/catalog/pg_control.h
>> +++ b/src/include/catalog/pg_control.h
>> @@ -73,6 +73,7 @@ typedef struct CheckPoint
>>  #define XLOG_END_OF_RECOVERY                 0x90
>>  #define XLOG_FPI_FOR_HINT                            0xA0
>>  #define XLOG_FPI                                             0xB0
>> +#define XLOG_FPI_FOR_SYNC                            0xC0
>
>
> I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too
> ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or
> instead adding actual record data and a 'flags' field in there? I
> slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are
> different, XLOG_FPI_FOR_SYNC not so much.

Let's go for XLOG_FPI_FLUSH.

Attached is an updated patch. For back branches, I am still not sure
what would be a good idea though. I don't see any other initiative
than flushing unconditionally INIT_FORKNUM when replaying an FPI...
But that would impact some systems severely.
Regards,
--
Michael
Attachment

0001-Ensure-consistent-on-disk-state-of-UNLOGGED-indexes-.patch
pgsql-hackers by date:
From: Etsuro Fujita
Date: 04 December 2015, 07:37:18
Subject: Re: Remaining 9.5 open items
From: Amit Kapila
Date: 04 December 2015, 08:07:29
Subject: Re: parallel joins, and better parallel explain
Re: Error with index on unlogged table - Mailing list pgsql-hackers

Attachment

Previous

Next