Thread: Support tid range scan in parallel?
Hello
When using ctid as a restriction clause with lower and upper bounds, PostgreSQL's planner will use a TID range scan plan to handle such a query. This works and is generally fine. However, if the ctid range covers a huge amount of data, the planner will not use parallel workers to perform the ctid range scan because that is not supported. It could, however, still choose a parallel sequential scan to complete the scan if it costs less.
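For illustration, a query of the following shape (the table name and ctid bounds here are made up) is the kind the planner currently handles with a serial Tid Range Scan:

EXPLAIN SELECT * FROM big_table
WHERE ctid >= '(0,1)' AND ctid < '(100000,1)';
-- today this produces a single-process plan along the lines of:
--   Tid Range Scan on big_table
--     TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(100000,1)'::tid))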
In one of our migration scenarios, we rely on tid range scan to migrate a huge table from one database to another once the lower and upper ctid bounds are determined. With support for parallel ctid range scan, this process could be done much more quickly.
The attached patch is my approach to adding parallel ctid range scan to PostgreSQL's planner and executor. In my tests, I do see a performance increase using parallel tid range scan over the single-worker tid range scan, and it is also faster than a parallel sequential scan covering similar ranges. Of course, the table needs to be large enough to show the performance increase.
Below are the timings to complete a SELECT query covering all the records in a simple 2-column table with 40 million records:
- tid range scan takes 10216ms
- tid range scan with 2 workers takes 7109ms
- sequential scan with 2 workers takes 8499ms
Having support for parallel ctid range scan is definitely helpful in our migration case, and I am sure it could be useful in other cases as well. I am sharing the patch here; if someone could provide quick feedback or a review, that would be greatly appreciated.
Thank you!
Attachment
On Tue, 30 Apr 2024 at 10:36, Cary Huang <cary.huang@highgo.ca> wrote:
> In one of our migration scenarios, we rely on tid range scan to migrate a huge
> table from one database to another once the lower and upper ctid bounds are
> determined. With support for parallel ctid range scan, this process could be
> done much more quickly.

I would have thought that the best way to migrate would be to further divide the TID range into N segments and run N queries, one per segment, to get the data out.

From a CPU point of view, I'd find it hard to imagine that a SELECT * query without any other items in the WHERE clause other than the TID range quals would run faster with multiple workers than with 1. The problem is the overhead of pushing tuples to the main process often outweighs the benefits of the parallelism. However, from an I/O point of view on a server with slow enough disks, I can imagine there'd be a speedup.

> The attached patch is my approach to adding parallel ctid range scan to
> PostgreSQL's planner and executor. In my tests, I do see a performance
> increase using parallel tid range scan over the single-worker tid range scan,
> and it is also faster than a parallel sequential scan covering similar ranges.
> Of course, the table needs to be large enough to show the performance increase.
>
> Below are the timings to complete a SELECT query covering all the records in a
> simple 2-column table with 40 million records:
>
> - tid range scan takes 10216ms
> - tid range scan with 2 workers takes 7109ms
> - sequential scan with 2 workers takes 8499ms

Can you share more details about this test? i.e. the query, what the times are that you've measured (EXPLAIN ANALYZE, or SELECT, COPY?). Also, which version/commit did you patch against? I was wondering if the read stream code added in v17 would result in the serial case running faster because the parallelism just resulted in more I/O concurrency.

Of course, it may be beneficial to have parallel TID Range for other cases when more row filtering or aggregation is being done, as that requires pushing fewer tuples over from the parallel worker to the main process. It would just be good to get to the bottom of whether there's still any advantage to parallelism when no filtering other than the ctid quals is being done, now that we have less chance of having to wait for I/O coming from disk with the read streams code.

David
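As a rough sketch of that segmented approach (the table name and block boundaries below are hypothetical), the range could be split into independent queries, each run from its own session:

-- session 1
SELECT * FROM big_table WHERE ctid >= '(0,1)' AND ctid < '(100000,1)';
-- session 2
SELECT * FROM big_table WHERE ctid >= '(100000,1)' AND ctid < '(200000,1)';
-- session 3
SELECT * FROM big_table WHERE ctid >= '(200000,1)' AND ctid < '(300000,1)';
-- session 4, up to the end of the table
SELECT * FROM big_table WHERE ctid >= '(300000,1)';

The parallelism then comes from running the segments concurrently across sessions rather than from parallel workers inside one plan.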
Hi David

Thank you for your reply.

> From a CPU point of view, I'd find it hard to imagine that a SELECT * query
> without any other items in the WHERE clause other than the TID range
> quals would run faster with multiple workers than with 1. The problem
> is the overhead of pushing tuples to the main process often outweighs
> the benefits of the parallelism. However, from an I/O point of view
> on a server with slow enough disks, I can imagine there'd be a
> speedup.

Yeah, this is generally true. With everything set to default, the planner would not choose a parallel sequential scan if the scan range covers mostly all tuples of a table (to reduce the overhead of pushing tuples to the main proc, as you mentioned). It is preferred when the target data is small but the table is huge. In my case it is the same: the planner by default uses a normal tid range scan, so I had to alter cost parameters to influence the planner's decision. This is where I found that a WHERE clause containing only TID ranges that cover the entire table would run faster with parallel workers, at least in my environment.

> Of course, it may be beneficial to have parallel TID Range for other
> cases when more row filtering or aggregation is being done, as that
> requires pushing fewer tuples over from the parallel worker to the
> main process. It would just be good to get to the bottom of whether there's
> still any advantage to parallelism when no filtering other than the
> ctid quals is being done, now that we have less chance of having to wait
> for I/O coming from disk with the read streams code.

I believe so too. I shared my test procedure below, with ctid being the only quals.

>> Below are the timings to complete a SELECT query covering all the records in a simple 2-column table with 40 million records:
>>
>> - tid range scan takes 10216ms
>> - tid range scan with 2 workers takes 7109ms
>> - sequential scan with 2 workers takes 8499ms
>
> Can you share more details about this test? i.e. the query, what the
> times are that you've measured (EXPLAIN ANALYZE, or SELECT, COPY?).
> Also, which version/commit did you patch against? I was wondering if
> the read stream code added in v17 would result in the serial case
> running faster because the parallelism just resulted in more I/O
> concurrency.

Yes, of course. These numbers were obtained earlier this year on master with the patch applied, most likely without the read stream code you mentioned. The patch attached here is rebased to commit dd0183469bb779247c96e86c2272dca7ff4ec9e7 on master, which is quite recent and should have the read stream code for v17, as I can immediately tell that the serial scans run much faster now in my setup. I increased the records in the test table from 40 to 100 million because serial scans are much faster now. Below are the summary and details of my test. Note that I only include the EXPLAIN ANALYZE details of the round 1 test. Round 2 is the same except for different execution times.
[env]
- OS: Ubuntu 18.04
- CPU: 4 cores @ 3.40 GHz
- MEM: 16 GB

[test table setup]
initdb with all default values

CREATE TABLE test (a INT, b TEXT);
INSERT INTO test VALUES(generate_series(1,100000000), 'testing');

SELECT min(ctid), max(ctid) from test;
  min  |     max
-------+--------------
 (0,1) | (540540,100)
(1 row)

[summary]
round 1:
tid range scan: 14915ms
tid range scan 2 workers: 12265ms
seq scan with 2 workers: 12675ms

round 2:
tid range scan: 12112ms
tid range scan 2 workers: 10649ms
seq scan with 2 workers: 11206ms

[details of EXPLAIN ANALYZE below]

[default tid range scan]
EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Tid Range Scan on test  (cost=0.01..1227029.81 rows=68648581 width=4) (actual time=0.188..12280.791 rows=99999815 loops=1)
   TID Cond: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
 Planning Time: 0.817 ms
 Execution Time: 14915.035 ms
(4 rows)

[parallel tid range scan with 2 workers]
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;

EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.01..511262.43 rows=68648581 width=4) (actual time=1.322..9249.197 rows=99999815 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Tid Range Scan on test  (cost=0.01..511262.43 rows=28603575 width=4) (actual time=0.332..4906.262 rows=33333272 loops=3)
         TID Cond: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
 Planning Time: 0.213 ms
 Execution Time: 12265.873 ms
(7 rows)

[parallel seq scan with 2 workers]
set enable_tidscan = 'off';

EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..969595.42 rows=68648581 width=4) (actual time=4.489..9713.299 rows=99999815 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test  (cost=0.00..969595.42 rows=28603575 width=4) (actual time=0.995..5541.178 rows=33333272 loops=3)
         Filter: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
         Rows Removed by Filter: 62
 Planning Time: 0.129 ms
 Execution Time: 12675.681 ms
(8 rows)

Best regards

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
On Wed, 1 May 2024 at 07:10, Cary Huang <cary.huang@highgo.ca> wrote:
> Yes, of course. These numbers were obtained earlier this year on master with
> the patch applied, most likely without the read stream code you mentioned.
> The patch attached here is rebased to commit
> dd0183469bb779247c96e86c2272dca7ff4ec9e7 on master, which is quite recent and
> should have the read stream code for v17, as I can immediately tell that the
> serial scans run much faster now in my setup. I increased the records in the
> test table from 40 to 100 million because serial scans are much faster now.
> Below are the summary and details of my test. Note that I only include the
> EXPLAIN ANALYZE details of the round 1 test. Round 2 is the same except for
> different execution times.

It would be good to see the EXPLAIN (ANALYZE, BUFFERS) with SET track_io_timing = 1;

Here's a quick review

1. Does not produce correct results:

-- serial plan
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
 count
-------
  2260
(1 row)

-- parallel plan
postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
 count
-------
     0
(1 row)

I've not really looked into why, but I see you're not calling heap_setscanlimits() in parallel mode. You need to somehow restrict the block range of the scan to the range specified in the quals. You might need to do more work to make the scan limits work with parallel scans.

If you look at heap_scan_stream_read_next_serial(), it's calling heapgettup_advance_block(), where there's "if (--scan->rs_numblocks == 0)". But no such equivalent code in table_block_parallelscan_nextpage() called by heap_scan_stream_read_next_parallel(). To make Parallel TID Range work, you'll need heap_scan_stream_read_next_parallel() to abide by the scan limits.

2. There's a 4 line comment you've added to cost_tidrangescan() which is just a copy and paste from cost_seqscan(). If you look at the seqscan costing, the comment is true in that scenario, but not true in where you've pasted it. The I/O cost is all tied in to run_cost.

+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+  * It may be possible to amortize some of the I/O cost, but probably
+  * not very much, because most operating systems already do aggressive
+  * prefetching.  For now, we assume that the disk run cost can't be
+  * amortized at all.
+  */

3. Calling TidRangeQualFromRestrictInfoList() once for the serial path and again for the partial path isn't great. It would be good to just call that function once and use the result for both path types.

4. create_tidrangescan_subpaths() seems like a weird name for a function. That seems to imply that scans have subpaths. Scans are always leaf paths and have no subpaths.

This isn't a complete review. It's just that this seems enough to keep you busy for a while. I can look a bit harder when the patch is working correctly. I think you should have enough feedback to allow that now.

David
> This isn't a complete review. It's just that this seems enough to keep
> you busy for a while. I can look a bit harder when the patch is
> working correctly. I think you should have enough feedback to allow
> that now.

Thanks for the test, review and feedback. They are greatly appreciated! I will polish the patch some more following your feedback and share new results / patch when I have them.

Thanks again!

Cary
Hello

> -- parallel plan
> postgres=# set max_parallel_workers_per_gather=2;
> SET
> postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
>  count
> -------
>      0
> (1 row)
>
> I've not really looked into why, but I see you're not calling
> heap_setscanlimits() in parallel mode. You need to somehow restrict
> the block range of the scan to the range specified in the quals. You
> might need to do more work to make the scan limits work with parallel
> scans.

I found that a select count(*) using parallel tid range scan would return the correct result the very first time, but the same subsequent queries would result in 0, as you stated. This is due to "pscan->phs_syncscan" being set to true in ExecTidRangeScanInitializeDSM(), inherited from the parallel seq scan case. With syncscan enabled, table_block_parallelscan_nextpage() would return the next block following the end of the first tid range scan instead of the correct start block that should be scanned. I see that a single tid range scan does not have SO_ALLOW_SYNC set, so I figure syncscan should also be disabled in the parallel case. With this change, it is then okay to call heap_setscanlimits() in the parallel case, so I added this call back to heap_set_tidrange() in both serial and parallel cases.

> 2. There's a 4 line comment you've added to cost_tidrangescan() which
> is just a copy and paste from cost_seqscan(). If you look at the
> seqscan costing, the comment is true in that scenario, but not true in
> where you've pasted it. The I/O cost is all tied in to run_cost.

Thanks for pointing that out; I have removed these incorrect comments.

> 3. Calling TidRangeQualFromRestrictInfoList() once for the serial path
> and again for the partial path isn't great. It would be good to just
> call that function once and use the result for both path types.

Good point. I moved the adding of the tid range scan partial path inside create_tidscan_paths(), where it makes a TidRangeQualFromRestrictInfoList() call for the serial path, so I can just reuse tidrangequals if it is appropriate to consider a parallel tid range scan.

> 4. create_tidrangescan_subpaths() seems like a weird name for a
> function. That seems to imply that scans have subpaths. Scans are
> always leaf paths and have no subpaths.

I removed this function with the weird name; it is not needed because the logic inside has been moved to create_tidscan_paths(), where it can reuse tidrangequals.

> It would be good to see the EXPLAIN (ANALYZE, BUFFERS) with SET
> track_io_timing = 1;

The v2 patch is attached, which should address the issues above. Below are the EXPLAIN outputs with track_io_timing = 1 in my environment. Generally, parallel tid range scan results in higher I/O timings but shorter execution times.
SET track_io_timing = 1;

===serial tid rangescan===
EXPLAIN (ANALYZE, BUFFERS) select a from test where ctid >= '(0,0)' and ctid < '(216216,40)';
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Tid Range Scan on test  (cost=0.01..490815.59 rows=27459559 width=4) (actual time=0.072..10143.770 rows=39999999 loops=1)
   TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
   Buffers: shared hit=298 read=215919 written=12972
   I/O Timings: shared read=440.277 write=58.525
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.289 ms
 Execution Time: 12497.081 ms
(8 rows)

set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;

===parallel tid rangescan===
EXPLAIN (ANALYZE, BUFFERS) select a from test where ctid >= '(0,0)' and ctid < '(216216,40)';
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.01..256758.88 rows=40000130 width=4) (actual time=0.878..7083.705 rows=39999999 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=216217
   I/O Timings: shared read=1224.153
   ->  Parallel Tid Range Scan on test  (cost=0.01..256758.88 rows=16666721 width=4) (actual time=0.256..3980.770 rows=13333333 loops=3)
         TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
         Buffers: shared read=216217
         I/O Timings: shared read=1224.153
 Planning Time: 0.258 ms
 Execution Time: 9731.800 ms
(11 rows)

===serial tid rangescan with aggregate===
set max_parallel_workers_per_gather=0;

EXPLAIN (ANALYZE, BUFFERS) select count(a) from test where ctid >= '(0,0)' and ctid < '(216216,40)';
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=716221.63..716221.64 rows=1 width=8) (actual time=12931.695..12931.696 rows=1 loops=1)
   Buffers: shared read=216217
   I/O Timings: shared read=599.331
   ->  Tid Range Scan on test  (cost=0.01..616221.31 rows=40000130 width=4) (actual time=0.079..6800.482 rows=39999999 loops=1)
         TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
         Buffers: shared read=216217
         I/O Timings: shared read=599.331
 Planning:
   Buffers: shared hit=1 read=2
   I/O Timings: shared read=0.124
 Planning Time: 0.917 ms
 Execution Time: 12932.348 ms
(12 rows)

===parallel tid rangescan with aggregate===
set max_parallel_workers_per_gather=2;

EXPLAIN (ANALYZE, BUFFERS) select count(a) from test where ctid >= '(0,0)' and ctid < '(216216,40)';
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=298425.70..298425.71 rows=1 width=8) (actual time=4842.512..4847.863 rows=1 loops=1)
   Buffers: shared read=216217
   I/O Timings: shared read=1155.321
   ->  Gather  (cost=298425.68..298425.69 rows=2 width=8) (actual time=4842.020..4847.851 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=216217
         I/O Timings: shared read=1155.321
         ->  Partial Aggregate  (cost=298425.68..298425.69 rows=1 width=8) (actual time=4824.730..4824.731 rows=1 loops=3)
               Buffers: shared read=216217
               I/O Timings: shared read=1155.321
               ->  Parallel Tid Range Scan on test  (cost=0.01..256758.88 rows=16666721 width=4) (actual time=0.098..2614.108 rows=13333333 loops=3)
                     TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
                     Buffers: shared read=216217
                     I/O Timings: shared read=1155.321
 Planning:
   Buffers: shared read=3
   I/O Timings: shared read=3.323
 Planning Time: 4.124 ms
 Execution Time: 4847.992 ms
(20 rows)

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
Attachment
On Sat, 4 May 2024 at 06:55, Cary Huang <cary.huang@highgo.ca> wrote:
> With syncscan enabled, table_block_parallelscan_nextpage() would
> return the next block following the end of the first tid range scan instead of the
> correct start block that should be scanned. I see that a single tid range scan
> does not have SO_ALLOW_SYNC set, so I figure syncscan should also be
> disabled in the parallel case. With this change, it is then okay to call
> heap_setscanlimits() in the parallel case, so I added this call back to
> heap_set_tidrange() in both serial and parallel cases.

This now calls heap_setscanlimits() for the parallel version, it's just that table_block_parallelscan_nextpage() does nothing to obey those limits.

The only reason the code isn't reading the entire table is due to the optimisation in heap_getnextslot_tidrange() which returns false when the ctid goes out of range. i.e, this code:

/*
 * When scanning forward, the TIDs will be in ascending order.
 * Future tuples in this direction will be higher still, so we can
 * just return false to indicate there will be no more tuples.
 */
if (ScanDirectionIsForward(direction))
    return false;

If I comment out that line, I see all pages are accessed:

postgres=# explain (analyze, buffers) select count(*) from a where ctid >= '(0,1)' and ctid < '(11,0)';
                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=18.80..18.81 rows=1 width=8) (actual time=33.530..36.118 rows=1 loops=1)
   Buffers: shared read=4425
   ->  Gather  (cost=18.78..18.79 rows=2 width=8) (actual time=33.456..36.102 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=4425
         ->  Partial Aggregate  (cost=18.78..18.79 rows=1 width=8) (actual time=20.389..20.390 rows=1 loops=3)
               Buffers: shared read=4425
               ->  Parallel Tid Range Scan on a  (cost=0.01..16.19 rows=1035 width=0) (actual time=9.375..20.349 rows=829 loops=3)
                     TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
                     Buffers: shared read=4425    <---- this is all pages in the table instead of 11 pages.

With that code still commented out, the non-parallel version still won't read all pages due to the setscanlimits being obeyed.

postgres=# set max_parallel_workers_per_gather=0;
SET
postgres=# explain (analyze, buffers) select count(*) from a where ctid >= '(0,1)' and ctid < '(11,0)';
                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=45.07..45.08 rows=1 width=8) (actual time=0.302..0.302 rows=1 loops=1)
   Buffers: shared hit=11
   ->  Tid Range Scan on a  (cost=0.01..38.86 rows=2485 width=0) (actual time=0.019..0.188 rows=2486 loops=1)
         TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
         Buffers: shared hit=11

If I put that code back in, how many pages are read depends on the number of parallel workers, as workers will keep running with higher page numbers and heap_getnextslot_tidrange() will just (inefficiently) filter those out.

max_parallel_workers_per_gather=2;
   ->  Parallel Tid Range Scan on a  (cost=0.01..16.19 rows=1035 width=0) (actual time=0.191..0.310 rows=829 loops=3)
         TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
         Buffers: shared read=17

max_parallel_workers_per_gather=3;
   ->  Parallel Tid Range Scan on a  (cost=0.01..12.54 rows=802 width=0) (actual time=0.012..0.114 rows=622 loops=4)
         TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
         Buffers: shared hit=19

max_parallel_workers_per_gather=4;
   ->  Parallel Tid Range Scan on a  (cost=0.01..9.72 rows=621 width=0) (actual time=0.014..0.135 rows=497 loops=5)
         TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
         Buffers: shared hit=21

To fix this you need to make table_block_parallelscan_nextpage obey the limits imposed by heap_setscanlimits().

The equivalent code in the non-parallel version is in heapgettup_advance_block().

/* check if the limit imposed by heap_setscanlimits() is met */
if (scan->rs_numblocks != InvalidBlockNumber)
{
    if (--scan->rs_numblocks == 0)
        return InvalidBlockNumber;
}

I've not studied exactly how you'd get the rs_numblock information down to the parallel scan descriptor. But when you figure that out, just remember that you can't do the --scan->rs_numblocks from table_block_parallelscan_nextpage() as that's not parallel safe. You might be able to add an or condition to: "if (nallocated >= pbscan->phs_nblocks)" to make it "if (nallocated >= pbscan->phs_nblocks || nallocated >= pbscan->phs_numblocks)", although the field names don't seem very intuitive there. It would be nicer if the HeapScanDesc field was called rs_blocklimit rather than rs_numblocks. It's not for this patch to go messing with that, however.

David
Thank you very much for the test and review. Greatly appreciated!

> This now calls heap_setscanlimits() for the parallel version, it's
> just that table_block_parallelscan_nextpage() does nothing to obey
> those limits.

Yes, you are absolutely right. Though heap_setscanlimits() is now called by parallel tid range scan, table_block_parallelscan_nextpage() does nothing to obey these limits, resulting in more blocks being inefficiently filtered out by the optimization code you mentioned in heap_getnextslot_tidrange().

> I've not studied exactly how you'd get the rs_numblock information
> down to the parallel scan descriptor. But when you figure that out,
> just remember that you can't do the --scan->rs_numblocks from
> table_block_parallelscan_nextpage() as that's not parallel safe. You
> might be able to add an or condition to: "if (nallocated >=
> pbscan->phs_nblocks)" to make it "if (nallocated >=
> pbscan->phs_nblocks || nallocated >= pbscan->phs_numblocks)",
> although the field names don't seem very intuitive there. It would be
> nicer if the HeapScanDesc field was called rs_blocklimit rather than
> rs_numblocks. It's not for this patch to go messing with that,
> however.

rs_numblock was not passed down to the parallel scan context, and table_block_parallelscan_nextpage() had no logic to limit the block scan range set by heap_setscanlimits() in a parallel scan. Also, I noticed that rs_startblock was not passed to the parallel scan context either, which causes the parallel scan to always start from block 0 even when a lower ctid bound is specified. So I added logic in heap_set_tidrange() to pass these 2 values to the parallel scan descriptor as "phs_startblock" and "phs_numblock". These are available in table_block_parallelscan_nextpage() during a parallel scan. I followed your recommendation and modified the condition to:

if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
    nallocated >= pbscan->phs_numblock))

so that the parallel tid range scan will stop when the upper scan limit is reached.

With these changes, I see that the number of buffer reads is consistent between single and parallel ctid range scans. The v3 patch is attached.
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=39.43..39.44 rows=1 width=8) (actual time=1.007..1.008 rows=1 loops=1)
   Buffers: shared read=11
   ->  Tid Range Scan on test  (cost=0.01..34.35 rows=2034 width=0) (actual time=0.076..0.639 rows=2035 loops=1)
         TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
         Buffers: shared read=11

postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=16.45..16.46 rows=1 width=8) (actual time=14.329..16.840 rows=1 loops=1)
   Buffers: shared hit=11
   ->  Gather  (cost=16.43..16.44 rows=2 width=8) (actual time=3.197..16.814 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=11
         ->  Partial Aggregate  (cost=16.43..16.44 rows=1 width=8) (actual time=0.705..0.706 rows=1 loops=3)
               Buffers: shared hit=11
               ->  Parallel Tid Range Scan on test  (cost=0.01..14.31 rows=848 width=0) (actual time=0.022..0.423 rows=678 loops=3)
                     TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
                     Buffers: shared hit=11

postgres=# set max_parallel_workers_per_gather=3;
SET
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=12.74..12.75 rows=1 width=8) (actual time=16.793..19.053 rows=1 loops=1)
   Buffers: shared hit=11
   ->  Gather  (cost=12.72..12.73 rows=3 width=8) (actual time=2.827..19.012 rows=4 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         Buffers: shared hit=11
         ->  Partial Aggregate  (cost=12.72..12.73 rows=1 width=8) (actual time=0.563..0.565 rows=1 loops=4)
               Buffers: shared hit=11
               ->  Parallel Tid Range Scan on test  (cost=0.01..11.08 rows=656 width=0) (actual time=0.018..0.338 rows=509 loops=4)
                     TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
                     Buffers: shared hit=11

thank you!

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
Attachment
On Thu, 9 May 2024 at 10:23, Cary Huang <cary.huang@highgo.ca> wrote:
> The v3 patch is attached.

I've not looked at the patch, but please add it to the July CF. I'll try and look in more detail then.

David
> I've not looked at the patch, but please add it to the July CF. I'll
> try and look in more detail then.

Thanks David, I have added this patch to the July commitfest under the server features category.

I understand that the regression tests for parallel ctid range scan are a bit lacking now. They only have a few EXPLAIN clauses to ensure parallel workers are used when tid ranges are specified, added as part of the select_parallel.sql test. I am not sure if it would be more appropriate to have them as part of the tidrangescan.sql test instead. So basically re-run the same test cases in tidrangescan.sql but in parallel?

thank you

Cary
On Fri, 10 May 2024 at 05:16, Cary Huang <cary.huang@highgo.ca> wrote:
> I understand that the regression tests for parallel ctid range scan are a
> bit lacking now. They only have a few EXPLAIN clauses to ensure parallel
> workers are used when tid ranges are specified, added as part of the
> select_parallel.sql test. I am not sure if it would be more appropriate
> to have them as part of the tidrangescan.sql test instead. So basically
> re-run the same test cases in tidrangescan.sql but in parallel?

I think tidrangescan.sql is a more suitable location than select_parallel.sql. I don't think you need to repeat all the tests, as many of them are testing the tid qual processing, which is the same code in the parallel version.

You should add a test that creates a table with a very low fillfactor, low enough so only 1 tuple can fit on each page, and insert a few dozen tuples. The test would do SELECT COUNT(*) to ensure you find the correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in there too for extra coverage to ensure that the correct tuples are being counted. Just make sure to EXPLAIN (COSTS OFF) the query first in the test to ensure that it's always testing the plan you're expecting to test.

David
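A minimal sketch of the kind of test described above (the table name, fillfactor, row count and ctid bounds here are illustrative only, not taken from the patch) might look like:

CREATE TABLE parallel_tidrange_test (a int) WITH (fillfactor = 10);
-- a low fillfactor keeps only a handful of tuples on each heap page
INSERT INTO parallel_tidrange_test SELECT i FROM generate_series(1, 50) i;

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 2;

-- confirm the plan shape first; with the patch applied this is expected to
-- show a Gather over a Parallel Tid Range Scan
EXPLAIN (COSTS OFF)
SELECT count(*), min(ctid), max(ctid) FROM parallel_tidrange_test
WHERE ctid >= '(1,0)' AND ctid < '(5,0)';

-- then verify the actual subset of tuples being counted
SELECT count(*), min(ctid), max(ctid) FROM parallel_tidrange_test
WHERE ctid >= '(1,0)' AND ctid < '(5,0)';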
> You should add a test that creates a table with a very low fillfactor,
> low enough so only 1 tuple can fit on each page, and insert a few dozen
> tuples. The test would do SELECT COUNT(*) to ensure you find the
> correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
> there too for extra coverage to ensure that the correct tuples are
> being counted. Just make sure to EXPLAIN (COSTS OFF) the query first
> in the test to ensure that it's always testing the plan you're
> expecting to test.

Thank you for the test suggestion. I moved the regression tests from select_parallel to tidrangescan instead. I followed the existing test table creation in tidrangescan with the lowest fillfactor of 10; with that I was able to get a consistent 5 tuples per page instead of 1. It should be okay as long as it is consistently 5 tuples per page, so the tuple count results from the parallel tests would be in multiples of 5.

The attached v4 patch includes the improved regression tests.

Thank you very much!

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
Attachment
Hi Cary,

On Wed, May 15, 2024 at 5:33 AM Cary Huang <cary.huang@highgo.ca> wrote:
>
> > You should add a test that creates a table with a very low fillfactor,
> > low enough so only 1 tuple can fit on each page, and insert a few dozen
> > tuples. The test would do SELECT COUNT(*) to ensure you find the
> > correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
> > there too for extra coverage to ensure that the correct tuples are
> > being counted. Just make sure to EXPLAIN (COSTS OFF) the query first
> > in the test to ensure that it's always testing the plan you're
> > expecting to test.
>
> Thank you for the test suggestion. I moved the regression tests from select_parallel
> to tidrangescan instead. I followed the existing test table creation in tidrangescan
> with the lowest fillfactor of 10; with that I was able to get a consistent 5 tuples per page
> instead of 1. It should be okay as long as it is consistently 5 tuples per page, so
> the tuple count results from the parallel tests would be in multiples of 5.
>
> The attached v4 patch includes the improved regression tests.
>
> Thank you very much!
>
> Cary Huang
> -------------
> HighGo Software Inc. (Canada)
> cary.huang@highgo.ca
> www.highgo.ca

+++ b/src/backend/access/heap/heapam.c
@@ -307,6 +307,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
  * results for a non-MVCC snapshot, the caller must hold some higher-level
  * lock that ensures the interesting tuple(s) won't change.)
  */
+

I see no reason why you added a blank line here; is it a typo?

+/* ----------------------------------------------------------------
+ * ExecSeqScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+                                 ParallelWorkerContext *pwcxt)
+{
+    ParallelTableScanDesc pscan;

The function name in the comment is not consistent.

@@ -81,6 +81,7 @@ typedef struct ParallelBlockTableScanDescData
     BlockNumber phs_startblock; /* starting block number */
     pg_atomic_uint64 phs_nallocated;    /* number of blocks allocated to
                                          * workers so far. */
+    BlockNumber phs_numblock;   /* max number of blocks to scan */
 } ParallelBlockTableScanDescData;

Can this be reorganized by putting phs_numblock after phs_startblock?

--
Regards
Junwang Zhao
* ExecTidRangeScanInitializeDSM
*
* Set up a parallel heap scan descriptor.
* ----------------------------------------------------------------
*/
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
Rafia Sabih