Re: Error with index on unlogged table - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Error with index on unlogged table |
Date | |
Msg-id | CAB7nPqTHs-2qvb4YWeqX2hid25tBAfPj=guXUi3gB0ooQi_GFQ@mail.gmail.com Whole thread Raw |
In response to | Re: Error with index on unlogged table (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Error with index on unlogged table
|
List | pgsql-hackers |
Andres Freud wrote: > On 2015-11-20 16:11:15 +0900, Michael Paquier wrote: >> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c >> index cc845d2..4883697 100644 >> --- a/src/backend/access/transam/xlog.c >> +++ b/src/backend/access/transam/xlog.c >> @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record) >> data += sizeof(BkpBlock); >> >> RestoreBackupBlockContents(lsn, bkpb, data, false, false); >> + >> + if (bkpb.fork == INIT_FORKNUM) >> + { >> + SMgrRelation srel; >> + srel = smgropen(bkpb.node, InvalidBackendId); >> + smgrimmedsync(srel, INIT_FORKNUM); >> + smgrclose(srel); >> + } >> } >> else if (info == XLOG_BACKUP_END) >> { > > A smgrwrite() instead of a smgrimmedsync() should be sufficient here. Check. On Fri, Dec 4, 2015 at 6:35 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-27 16:59:20 +0900, Michael Paquier wrote: >> Attached is a patch that fixes the issue for me in master and 9.5. >> Actually in the last patch I forgot a call to smgrwrite to ensure that >> the INIT_FORKNUM is correctly synced to disk when those pages are >> replayed at recovery, letting the reset routines for unlogged >> relations do their job correctly. I have noticed as well that we need >> to do the same for gin and brin relations. In this case I think that >> we could limit the flush to unlogged relations, my patch does it >> unconditionally though to generalize the logic. Thoughts? > > I think it's a good idea to limit this to unlogged relations. For a > dataset that fits into shared_buffers and creates many relations, the > additional disk writes could be noticeable. OK, then I have switched the patch this way to limit the flush to unlogged relations where needed. >> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c >> index 99337b0..d7964ac 100644 >> --- a/src/backend/access/brin/brin.c >> +++ b/src/backend/access/brin/brin.c >> @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS) >> brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), >> BRIN_CURRENT_VERSION); >> MarkBufferDirty(metabuf); >> - log_newpage_buffer(metabuf, false); >> + /* >> + * When this full page image is replayed, there is no guarantee that >> + * this page will be present to disk when replayed particularly for >> + * unlogged relations, hence enforce it to be flushed to disk. >> + */ > > The grammar seems a bit off here. Check. >> + /* >> + * Initialize and xlog metabuffer and root buffer. When those full >> + * pages are replayed, it is not guaranteed that those relation >> + * init forks will be flushed to disk after replaying them, hence >> + * enforce those pages to be flushed to disk at replay, only the >> + * last record will enforce a flush for performance reasons and >> + * because it is actually unnecessary to do it multiple times. >> + */ > > That comment needs some love too. I think it really only makes sense if > you know the current state. There's also some language polishing needed. Check. >> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) >> START_CRIT_SECTION(); >> GinInitMetabuffer(MetaBuffer); >> MarkBufferDirty(MetaBuffer); >> - log_newpage_buffer(MetaBuffer, false); >> + log_newpage_buffer(MetaBuffer, false, false); >> GinInitBuffer(RootBuffer, GIN_LEAF); >> MarkBufferDirty(RootBuffer); >> - log_newpage_buffer(RootBuffer, false); >> + log_newpage_buffer(RootBuffer, false, true); >> END_CRIT_SECTION(); >> > Why isn't the metapage flushed to disk? I was not sure if we should only flush only at the last page, as pages just before would be already replayed. Switched. >> --- a/src/backend/access/spgist/spginsert.c >> +++ b/src/backend/access/spgist/spginsert.c >> @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS) >> (char *) page, true); >> if (XLogIsNeeded()) >> log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, >> - SPGIST_METAPAGE_BLKNO, page, false); >> + SPGIST_METAPAGE_BLKNO, page, false, false); >> >> /* Likewise for the root page. */ >> SpGistInitPage(page, SPGIST_LEAF); >> @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS) >> (char *) page, true); >> if (XLogIsNeeded()) >> log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, >> - SPGIST_ROOT_BLKNO, page, true); >> + SPGIST_ROOT_BLKNO, page, true, false); >> > > I don't think it's correct to not flush in these cases. Yes, the primary > does an smgrimmedsync() - but that's not done on the standby. OK. Switched. >> @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record) >> * when checksums are enabled. There is no difference in handling >> * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info >> * code just to distinguish them for statistics purposes. >> + * >> + * XLOG_FPI_FOR_SYNC records are generated when a relation needs to >> + * be sync'ed just after replaying a full page. This is important >> + * to match the master behavior in the case where a page is written >> + * directly to disk without going through shared buffers, particularly >> + * when writing init forks for index relations. >> */ > > How about > > FPI_FOR_SYNC records indicate that the page immediately needs to be > written to disk, not just to shared buffers. This is important if the > on-disk state is to be the authoritative, not the state in shared > buffers. E.g. because on-disk files may later be copied directly. OK. >> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c >> index 0ddde72..eb22a51 100644 >> --- a/src/backend/commands/tablecmds.c >> +++ b/src/backend/commands/tablecmds.c >> @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, >> * space. >> */ >> if (use_wal) >> - log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false); >> + log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, >> + false, false); > > Shouldn't this flush data if it's an init fork? Currently that's an > academic distinction, given there'll never be a page, but it still seems > prudent. Yeah it should do that for all the INIT_FORKNUM because we don't know the page type here. Note that use_wal is broken as well. We had better log and flush the INIT_FORKNUM as well for unlogged relations. >> extern void InitXLogInsert(void); >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h >> index ad1eb4b..91445f1 100644 >> --- a/src/include/catalog/pg_control.h >> +++ b/src/include/catalog/pg_control.h >> @@ -73,6 +73,7 @@ typedef struct CheckPoint >> #define XLOG_END_OF_RECOVERY 0x90 >> #define XLOG_FPI_FOR_HINT 0xA0 >> #define XLOG_FPI 0xB0 >> +#define XLOG_FPI_FOR_SYNC 0xC0 > > > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or > instead adding actual record data and a 'flags' field in there? I > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are > different, XLOG_FPI_FOR_SYNC not so much. Let's go for XLOG_FPI_FLUSH. Attached is an updated patch. For back branches, I am still not sure what would be a good idea though. I don't see any other initiative than flushing unconditionally INIT_FORKNUM when replaying an FPI... But that would impact some systems severely. Regards, -- Michael
Attachment
pgsql-hackers by date: