Re: Refactoring the checkpointer's fsync request queue - Mailing list pgsql-hackers
| From | Shawn Debnath |
|---|---|
| Subject | Re: Refactoring the checkpointer's fsync request queue |
| Date | |
| Msg-id | 20190220232739.GA8280@f01898859afd.ant.amazon.com |
| In response to | Re: Refactoring the checkpointer's fsync request queue (Shawn Debnath <sdn@amazon.com>) |
| Responses | Re: Refactoring the checkpointer's fsync request queue |
| List | pgsql-hackers |
As promised, here's a patch that addresses the points discussed by Andres
and Thomas at FOSDEM. As a result of how we want the checkpointer to track
which files to fsync, the pending ops table now integrates the forknum and
segno as part of the hash key, eliminating the need for the bitmapsets or
vectors from the previous iterations. We re-construct the pathnames from
the RelFileNode, ForkNumber and SegmentNumber and use PathNameOpenFile to
get the file descriptor to use for fsync.

Apart from that, this patch moves the system for requesting and processing
fsyncs out of md.c into smgr.c, allowing us to call smgr component
specific callbacks to retrieve metadata like relation and segment paths.
This allows smgr components to maintain how relfilenodes, forks and
segments map to specific files without exposing this knowledge to smgr.
It redefines smgrsync() behavior to be closer to that of smgrimmedsync(),
i.e., if a regular sync is required for a particular file, enqueue it
locally or forward it to the checkpointer. smgrimmedsync() retains the
existing behavior and fsyncs the file right away. The processing of fsync
requests has been moved from mdsync() to a new ProcessFsyncRequests()
function. (A few illustrative sketches of these pieces follow after the
testing notes below.)

Testing
-------

Checkpointer stats didn't cover what I wanted to verify, i.e., the time
spent dealing with the pending operations table. So I added temporary
instrumentation to get the numbers by timing the code in
ProcessFsyncRequests, which starts by absorbing fsync requests from the
checkpointer queue, then processes them and finally issues syncs on the
files. Similarly, I added the same instrumentation to the mdsync code on
the master branch. The time to actually execute FileSync is irrelevant
for this patch.

I did two separate runs for 30 mins, both with scale=10,000 on i3.8xlarge
instances [1] with default params to force frequent checkpoints:

1. A single pgbench run with 1000 clients updating 4 tables; as a result
   we get 4 relations, their forks, and several segments in each being
   synced.

2. 10 parallel pgbench runs on 10 separate databases with 200 clients
   each. This results in more relations and more segments being touched,
   letting us better compare against the bitmapset optimizations.
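To make the hash-key change above concrete, here is a minimal sketch of
what such a key and entry could look like; the struct and field names are
my own illustration, not necessarily the patch's actual definitions:

#include "postgres.h"

#include "common/relpath.h"			/* ForkNumber */
#include "storage/block.h"			/* BlockNumber */
#include "storage/relfilenode.h"	/* RelFileNode */

/*
 * Illustrative only: with the fork and segment number folded into the
 * hash key, each entry in the pending-ops table identifies exactly one
 * file on disk, so no per-relation bitmapsets are needed.
 */
typedef struct PendingFsyncKey
{
	RelFileNode rnode;			/* tablespace / database / relation */
	ForkNumber	forknum;		/* main fork, FSM, VM, ... */
	BlockNumber segno;			/* segment number within the fork */
} PendingFsyncKey;

typedef struct PendingFsyncEntry
{
	PendingFsyncKey key;		/* hash key; must be first */
	uint16		cycle_ctr;		/* checkpoint cycle the request belongs to */
	bool		canceled;		/* relation dropped since the request? */
} PendingFsyncEntry;

The table itself would typically be a local dynahash created with
hash_create(), keyed on the whole struct.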
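Likewise, a hedged sketch of the pathname reconstruction and sync step:
relpathperm() yields the base path for a relation fork, segments after
the first get a ".segno" suffix, and the file is opened with
PathNameOpenFile() and synced with FileSync(). The helper name and error
handling here are illustrative only, not the patch's code:

#include "postgres.h"

#include <fcntl.h>

#include "common/relpath.h"
#include "pgstat.h"
#include "storage/block.h"
#include "storage/fd.h"
#include "storage/relfilenode.h"

/*
 * Illustrative sketch: rebuild a segment's path from RelFileNode +
 * ForkNumber + segment number, open it, and fsync it.  Segment 0 carries
 * no suffix; later segments are named "<relpath>.<segno>", following
 * md.c's on-disk convention.
 */
static void
sync_one_segment(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
{
	char	   *relpath = relpathperm(rnode, forknum);
	char	   *path;
	File		file;

	if (segno > 0)
		path = psprintf("%s.%u", relpath, segno);
	else
		path = pstrdup(relpath);

	file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
	if (file < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));

	if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
		ereport(data_sync_elevel(ERROR),
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m", path)));

	FileClose(file);
	pfree(path);
	pfree(relpath);
}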
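The redefined smgrsync() behaviour, enqueue locally or forward to the
checkpointer, would look roughly like the existing register_dirty_segment()
logic in md.c. This is a sketch under assumed names (OwnsPendingOpsTable()
and sync_one_segment() are hypothetical stand-ins), not the patch's code:

#include "postgres.h"

#include "postmaster/bgwriter.h"	/* ForwardFsyncRequest() */
#include "storage/block.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"			/* RememberFsyncRequest() */

/* Hypothetical: does this process own the local pending-ops hash table? */
extern bool OwnsPendingOpsTable(void);
/* Hypothetical helper from the previous sketch */
extern void sync_one_segment(RelFileNode rnode, ForkNumber forknum,
							 BlockNumber segno);

/*
 * Sketch of the "enqueue locally or forward to checkpointer" decision.
 * The checkpointer (or a standalone backend) records requests directly;
 * everyone else pushes them into the shared request queue, falling back
 * to an immediate sync if the queue is full.
 */
static void
remember_or_forward_fsync(RelFileNode rnode, ForkNumber forknum,
						  BlockNumber segno)
{
	if (OwnsPendingOpsTable())
		RememberFsyncRequest(rnode, forknum, segno);
	else if (!ForwardFsyncRequest(rnode, forknum, segno))
		sync_one_segment(rnode, forknum, segno);
}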
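The temporary instrumentation is conceptually just a timer around the
whole pass, along these lines (a rough sketch only; the actual throwaway
patches are attached):

#include "postgres.h"

#include "portability/instr_time.h"

/*
 * Rough sketch of the throwaway instrumentation: bracket the whole
 * absorb/process/sync pass with instr_time timers and write the elapsed
 * time to the server log, where it can be grepped out after a run.
 */
static void
process_fsync_requests_with_timing(void)
{
	instr_time	start;
	instr_time	duration;

	INSTR_TIME_SET_CURRENT(start);

	/* ... absorb queued requests, walk the pending-ops table, FileSync ... */

	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);

	ereport(LOG,
			(errmsg("processing fsync requests took %.3f ms",
					INSTR_TIME_GET_MILLISEC(duration))));
}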
Results
--------

The important metric to look at would be the total time spent absorbing
and processing the fsync requests, as that's what the changes revolve
around; the other metrics are here for posterity. The new code is about
6% faster in total time taken to process the queue for the single pgbench
run. For the 10x parallel pgbench run, we are seeing drops of up to 70%
with the patch. It would be great if some other folks could verify this.

The temporary instrumentation patches are attached: one for the master
branch and one that applies on top of the main patch. Enable
log_checkpoints and then use grep and cut to extract the numbers from the
log file after the runs.

[Requests Absorbed]

single pgbench run

              Min      Max    Average   Median     Mode   Std Dev
-------- -------- -------- ---------- -------- -------- ---------
patch       15144   144961   78628.84    76124    58619  24135.69
master      25728   138422   81455.04    80601    25728  21295.83

10 parallel pgbench runs

              Min      Max    Average   Median     Mode   Std Dev
-------- -------- -------- ---------- -------- -------- ---------
patch       45098   282158   155969.4   151603   153049  39990.91
master     191833   602512  416533.86   424946   191833  82014.48

[Files Synced]

single pgbench run

              Min      Max    Average   Median     Mode   Std Dev
-------- -------- -------- ---------- -------- -------- ---------
patch         153      166     158.11      158      159      1.86
master        154      166     158.29      159      159     10.29

10 parallel pgbench runs

              Min      Max    Average   Median     Mode   Std Dev
-------- -------- -------- ---------- -------- -------- ---------
patch        1540     1662    1556.42     1554     1552     11.12
master       1546     1546       1546     1559     1553     12.79

[Total Time in ProcessFsyncRequest/mdsync]

single pgbench run

              Min      Max    Average   Median     Mode   Std Dev
-------- -------- -------- ---------- -------- -------- ---------
patch         500  3833.51    2305.22     2239      500    510.08
master        806  4430.32    2458.77     2382      806    497.01

10 parallel pgbench runs

              Min      Max    Average   Median     Mode   Std Dev
-------- -------- -------- ---------- -------- -------- ---------
patch         908     6927    3022.58     2863      908    939.09
master       4323    17858   10982.15    11154     4322   2760.47

[1] https://aws.amazon.com/ec2/instance-types/i3/

--
Shawn Debnath
Amazon Web Services (AWS)