Re: [HACKERS] WAL logging problem in 9.4.3? - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: [HACKERS] WAL logging problem in 9.4.3? |
Date | |
Msg-id | 20191118045434.GA1173436@rfd.leadboat.com Whole thread Raw |
In response to | Re: [HACKERS] WAL logging problem in 9.4.3? (Noah Misch <noah@leadboat.com>) |
Responses |
Re: [HACKERS] WAL logging problem in 9.4.3?
Re: [HACKERS] WAL logging problem in 9.4.3? Re: [HACKERS] WAL logging problem in 9.4.3? |
List | pgsql-hackers |
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote: > I started pre-commit editing on 2019-10-28, and comment+README updates have > been the largest part of that. I'll check my edits against the things you > list here, and I'll share on-list before committing. I've now marked the CF > entry Ready for Committer. Having dedicated many days to that, I am attaching v24nm. I know of two remaining defects: === Defect 1: gistGetFakeLSN() When I modified pg_regress.c to use wal_level=minimal for all suites, src/test/isolation/specs/predicate-gist.spec failed the assertion in gistGetFakeLSN(). One could reproduce the problem just by running this sequence in psql: begin; create table gist_point_tbl(id int4, p point); create index gist_pointidx on gist_point_tbl using gist(p); insert into gist_point_tbl (id, p) select g, point(g*10, g*10) from generate_series(1, 1000) g; I've included a wrong-in-general hack to make the test pass. I see two main options for fixing this: (a) Introduce an empty WAL record that reserves an LSN and has no other effect. Make GiST use that for permanent relations that are skipping WAL. Further optimizations are possible. For example, we could use a backend-local counter (like the one gistGetFakeLSN() uses for temp relations) until the counter is greater a recent real LSN. That optimization is probably too clever, though it would make the new WAL record almost never appear. (b) Exempt GiST from most WAL skipping. GiST index build could still skip WAL, but it would do its own smgrimmedsync() in addition to the one done at commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly other AM-independent code that skips WAL. Overall, I like the cleanliness of (a). The main argument for (b) is that it ensures we have all the features to opt-out of WAL skipping, which could be useful for out-of-tree index access methods. (I think we currently have the features for a tableam to do so, but not for an indexam to do so.) Overall, I lean toward (a). Any other ideas or preferences? === Defect 2: repetitive work when syncing many relations For deleting relfilenodes, smgrDoPendingDeletes() collects a list for smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is sophisticated about optimizing the shared buffers scan. Commit 279628a introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to further reduce the chance of causing performance regressions. (One could, however, work around the problem by raising wal_skip_threshold.) Kyotaro, if you agree, could you modify v24nm to implement that? Notable changes in v24nm: - Wrote section "Skipping WAL for New RelFileNode" in src/backend/access/transam/README to be the main source concerning the new coding rules. - Updated numerous comments and doc sections. - Eliminated the pendingSyncs list in favor of a "sync" field in pendingDeletes. I mostly did this to eliminate the possibility of the lists getting out of sync. This removed considerable parallel code for managing a second list at end-of-xact. We now call smgrDoPendingSyncs() only when committing or preparing a top-level transaction. - Whenever code sets an rd_*Subid field of a Relation, it must call EOXactListAdd(). swap_relation_files() was not doing so, so the field remained set during the next transaction. I introduced RelationAssumeNewRelfilenode() to handle both tasks, and I located the call so it also affects the mapped relation case. - In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, rd_createSubid remained set. (That happened before this patch, but it has been harmless.) I fixed this in heap_create(). - Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary when checksums are enabled. Observe the precedent that RelationCopyStorage() has not been exempting FSM_FORKNUM. - Pass log_newpage_range() a "false" for page_std, for the same reason RelationCopyStorage() does. - log_newpage_range() ignored its forkNum and page_std arguments, so we logged the wrong data for non-main forks. Before this patch, callers always passed MAIN_FORKNUM and "true", hence the lack of complaints. - Restored table_finish_bulk_insert(), though heapam no longer provides a callback. The API is still well-defined, and other table AMs might have use for it. Removing it feels like a separate proposal. - Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit itself in light of this patch. - Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation; it was overcounting. - Made us skip WAL after SET TABLESPACE, like we do after CLUSTER. - Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to "Write Ahead Log" -> "Settings", between similar settings wal_writer_flush_after and commit_delay. The other place I considered was "Resource Consumption" -> "Asynchronous Behavior", due to the similarity of backend_flush_after. - Gave each test a unique name. Changed test table names to be descriptive, e.g. test7 became trunc_trig. - Squashed all patches into one. Split patches are good when one could reasonably choose to push a subset, but that didn't apply here. I wouldn't push a GUC implementation without its documentation. Since the tests fail without the main bug fix, I wouldn't push tests separately. By the way, based on the comment at zheap_prepare_insert(), I expect zheap will exempt itself from skipping WAL. It may stop calling RelationNeedsWAL() and instead test for RELPERSISTENCE_PERMANENT. nm
Attachment
pgsql-hackers by date: