Re: Handing off SLRU fsyncs to the checkpointer - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Date:
Msg-id: VI1PR0701MB6960EEB838D53886D8A180E3F6520@VI1PR0701MB6960.eurprd07.prod.outlook.com
In response to: Re: Handing off SLRU fsyncs to the checkpointer (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Handing off SLRU fsyncs to the checkpointer
           Re: Handing off SLRU fsyncs to the checkpointer
List: pgsql-hackers
Hi Thomas, hackers,

>> > To move these writes out of recovery's way, we should probably just
>> > run the bgwriter process during crash recovery.  I'm going to look
>> > into that.
>>
>> Sounds awesome.
>
> I wrote a quick and dirty experimental patch to try that.  I can't see
> any benefit from it on pgbench with default shared buffers, but maybe
> it would do better with your append test due to locality, especially
> if you can figure out how to tune bgwriter to pace itself optimally.
> https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery

OK, so I've quickly tested those two PoC patches together, under the following conditions:
- a similar append-only workload generated by pgbench (to eliminate other, already-known WAL bottlenecks, e.g. sorting),
- 4.3GB of WAL to be applied (mostly Btree/INSERT_LEAF),
- the same system as last time (ext4 on NVMe, 1s8c16, 4.14 kernel),
- 14master with the SLRU-fsync-to-checkpointer / pg_qsort patches already applied.

TEST bgwriterPOC1:
- under severe dirty-memory conditions (artificially simulated here via a small shared_buffers) --> i.e. for workloads with very high FlushBuffer activity in StartupXLOG,
- with fsync=off/fpw=off by default, on NVMe (example scenario: I want to perform some PITR as fast as I can, to see how production data looked in the past, before some user deleted some data).

baseline s_b@128MB: 140.404, 0.123 (2nd value small, as there is only a small region to checkpoint)

    22.49%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--14.72%--copyin
               |          __pwrite_nocancel
               |          FileWrite
               |          mdwrite
               |          FlushBuffer
               |          ReadBuffer_common
               |           --14.52%--btree_xlog_insert
                --7.77%--copyout
                          __pread_nocancel
                           --7.57%--FileRead
                                     mdread
                                     ReadBuffer_common
     6.13%  postgres  [kernel.kallsyms]  [k] do_syscall_64
            |--1.64%--__pwrite_nocancel
             --1.23%--__pread_nocancel
     3.68%  postgres  postgres           [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--1.02%--smgropen

After applying:
patch -p1 < ../0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch
patch -p1 < ../0002-Optionally-don-t-wait-for-end-of-recovery-checkpoint.patch

0001+0002 s_b@128MB: similar result to above
0001+0002 s_b@128MB: 108.871, 0.114, bgwriter_delay = 10ms / bgwriter_lru_maxpages = 1000
0001+0002 s_b@128MB:  85.392, 0.103, bgwriter_delay = 10ms / bgwriter_lru_maxpages = 50000  #~390MB max?

    18.40%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--17.79%--copyout
               |          __pread_nocancel
               |          |--16.56%--FileRead
               |          |          mdread
               |          |          ReadBuffer_common
                --0.61%--copyin        // WOW
                          __pwrite_nocancel
                          FileWrite
                          mdwrite
                          FlushBuffer
                          ReadBuffer_common
     9.20%  postgres  postgres           [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--4.70%--smgropen

Of course there is another WOW moment during recovery ("61.9%"):

USER       PID    %CPU %MEM    VSZ  RSS TTY STAT START TIME COMMAND
postgres   120935  0.9  0.0 866052 3824 ?   Ss   09:47 0:00 postgres: checkpointer
postgres   120936 61.9  0.0 865796 3824 ?   Rs   09:47 0:22 postgres: background writer
postgres   120937 97.4  0.0 865940 5228 ?   Rs   09:47 0:36 postgres: startup recovering 000000010000000000000089

That's a speedup of 1.647x when dirty memory is in the way.  When it's not:

baseline  s_b@24000MB: 39.199, 1.448 (2x patches off)
0001+0002 s_b@24000MB: 39.383, 1.442, bgwriter_delay = 10ms / bgwriter_lru_maxpages = 50000  #~390MB/s max

yay, there's no regression.
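In case anyone wants to reproduce a comparable run, here is a minimal shell sketch of how such a crash-recovery benchmark can be set up.  It assumes a build with the two WIP patches applied; the table definition, client counts, durations and exact GUC values below are only illustrative, not the precise ones behind the numbers above.

# illustrative sketch -- generate an append-only WAL stream, crash, and time redo
initdb -D "$PGDATA"
cat >> "$PGDATA/postgresql.conf" <<EOF
shared_buffers = 128MB            # small s_b to simulate dirty-memory pressure
fsync = off
full_page_writes = off
bgwriter_delay = 10ms             # bgwriter pacing as in the 0001 runs
bgwriter_lru_maxpages = 50000
EOF
pg_ctl -D "$PGDATA" -w start
createdb test
psql -d test -c "CREATE TABLE t (id bigserial PRIMARY KEY, payload text);"   # btree => INSERT_LEAF records
echo "INSERT INTO t (payload) VALUES (repeat('x', 100));" > insert_only.sql
pgbench -n -f insert_only.sql -c 16 -j 8 -T 300 test        # append-only load
kill -9 "$(head -1 "$PGDATA/postmaster.pid")"               # simulate a crash mid-load
time pg_ctl -D "$PGDATA" -w start                           # time redo + end-of-recovery checkpoint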
I have only one comment about those two WIP patches: bgwriter_lru_maxpages should perhaps be called standby_bgwriter_lru_maxpages in this scenario, or, even better, there should be no maximum at all while the DB is closed and recovering.

TEST bgwriterPOC2a, to showcase the 2nd patch, which opens the DB to read-write users before the final checkpoint after redo recovery finishes.  The DBA may make that decision via the parameter end_of_recovery_checkpoint_wait=off (a rough sketch of the settings is in the PS below).
- on slow storage (xvda, fsync=on), and even with plenty of memory:

s_b@24000MB: 39.043, 15.639 -- even with WAL recovery being 100% CPU bound (mostly on hash_search_with_hash_value() for Buffers / __memmove_ssse3_back), it took an additional 15s to perform the checkpoint before the DB was open for users (it had to write 269462 buffers =~ 2GB =~ 140MB/s, which is close to the xvda device speed).  The complete output in 14master looks similar to this:

1598609928.620 startup 22543 LOG:  redo done at 1/12201C88
1598609928.624 checkpointer 22541 LOG:  checkpoint starting: end-of-recovery immediate wait
1598609944.908 checkpointer 22541 LOG:  checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.145 s, sync=0.138 s, total=16.285 s; sync files=11, longest=0.133 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB
1598609944.912 postmaster 22538 LOG:  database system is ready to accept connections

s_b@24000MB: 39.96, 0 -- with end_of_recovery_checkpoint_wait = off, the DB is open 15s earlier:

1598610331.556 startup 29499 LOG:  redo done at 1/12201C88
1598610331.559 checkpointer 29497 LOG:  checkpoint starting: immediate force
1598610331.562 postmaster 29473 LOG:  database system is ready to accept connections
1598610347.202 checkpointer 29497 LOG:  checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.092 s, sync=0.149 s, total=15.643 s; sync files=12, longest=0.142 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB

I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a lot of time, and this 0002 patch bypasses that.  I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage snapshots / storage replication, maybe with DWH-after-crash too).

TEST bgwriterPOC2b: FYI, I was also testing the hot_standby code path -- to check whether it would reduce the time to start / open a fresh standby for read-only queries, but this parameter doesn't seem to influence that in my tests.  As I've learned, reproducing what I'm after is apparently much more complex and involves a lot of reading about LogStandbySnapshot() / standby recovery points on my side.

Now, back to the smgropen() hash_search_with_hash_value() reproducer...

-Jakub Wartak.
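PS: For completeness, the POC2a runs boil down to a configuration roughly like the one below (parameter name taken from the WIP 0002 patch; the values are illustrative, not the exact production-like settings):

cat >> "$PGDATA/postgresql.conf" <<EOF
shared_buffers = 24000MB                # large s_b => large end-of-recovery checkpoint
fsync = on                              # slow storage (xvda), real fsyncs this time
end_of_recovery_checkpoint_wait = off   # WIP 0002 patch: accept connections before the
                                        # end-of-recovery checkpoint completes
EOF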