Parallel heap vacuum - Mailing list pgsql-hackers
| From | Masahiko Sawada | 
|---|---|
| Subject | Parallel heap vacuum | 
| Date | |
| Msg-id | CAD21AoAEfCNv-GgaDheDJ+s-p_Lv1H24AiJeNoPGCmZNSwL1YA@mail.gmail.com | 
| Responses | Re: Parallel heap vacuum RE: Parallel heap vacuum | 
| List | pgsql-hackers | 
Hi all,

The parallel vacuum we have today supports only index vacuuming. Therefore, while multiple workers can work on different indexes in parallel, the heap table is always processed by a single process. I'd like to propose $subject, which enables us to have multiple workers running on a single heap table. This would help speed up vacuuming for tables without indexes or tables with INDEX_CLEANUP = off.

I've attached a PoC patch for this feature. It implements only parallel heap scans in lazy vacuum. We can extend this feature to support parallel heap vacuum as well in the future or in the same patch.

# Overall idea (for parallel heap scan in lazy vacuum)

At the beginning of vacuum, we determine how many workers to launch based on the table size, like other parallel query operations. The number of workers is capped by max_parallel_maintenance_workers. Once we decide to use parallel heap scan, we prepare DSM to share data among the parallel workers and the leader. The shared information includes at least the vacuum options such as aggressive, the counters collected during lazy vacuum such as scanned_pages, the vacuum cutoffs such as VacuumCutoffs and GlobalVisState, and the parallel scan description.

Before starting the heap scan in lazy vacuum, we launch parallel workers, and then each worker (and the leader) processes different blocks. Each worker does HOT-pruning on pages and collects dead tuple TIDs. When adding dead tuple TIDs, workers need to hold an exclusive lock on the TidStore. At the end of the heap scan phase, workers exit and the leader waits for all workers to exit. After that, the leader process gathers the counters collected by the parallel workers and computes the oldest relfrozenxid (and relminmxid). Then, if parallel index vacuum is also enabled, we launch other parallel workers for parallel index vacuuming.

When it comes to parallel heap scan in lazy vacuum, I think we can use the table_block_parallelscan_XXX() family.
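As a rough illustration of the block-distribution idea behind the table_block_parallelscan_XXX() family, the following Python sketch has several workers (plus the leader) claim chunks of block numbers from one shared cursor. It is a simplified model and not PostgreSQL code: the names SharedScanState, claim_chunk, scan_worker, and parallel_scan, as well as the fixed chunk size, are assumptions made for illustration. Threads stand in for parallel worker processes, and a lock stands in for the shared counter.

```python
import threading


class SharedScanState:
    """Hypothetical stand-in for the shared parallel scan descriptor."""

    def __init__(self, nblocks, chunk_size):
        self.nblocks = nblocks
        self.chunk_size = chunk_size
        self.next_block = 0              # first block not yet handed out
        self._lock = threading.Lock()

    def claim_chunk(self):
        """Return a half-open (start, end) block range, or None when done."""
        with self._lock:
            if self.next_block >= self.nblocks:
                return None
            start = self.next_block
            end = min(start + self.chunk_size, self.nblocks)
            self.next_block = end
            return start, end


def scan_worker(state, scanned):
    """Claim chunks until the table is exhausted; record each block seen."""
    while True:
        chunk = state.claim_chunk()
        if chunk is None:
            break
        scanned.extend(range(*chunk))


def parallel_scan(nblocks, nworkers, chunk_size=8):
    """Run the scan with nworkers threads plus the leader (index 0)."""
    state = SharedScanState(nblocks, chunk_size)
    per_process = [[] for _ in range(nworkers + 1)]
    threads = [threading.Thread(target=scan_worker, args=(state, blocks))
               for blocks in per_process[1:]]
    for t in threads:
        t.start()
    scan_worker(state, per_process[0])   # the leader participates too
    for t in threads:
        t.join()
    return per_process
```

Calling parallel_scan(100, 3) returns four lists of block numbers. Which process gets which chunk varies from run to run, but together the lists cover every block exactly once.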
One tricky thing we need to deal with is that if the TidStore memory usage reaches the limit, we stop the parallel scan, do index vacuum and table vacuum, and then resume the parallel scan from the previous state. In order to do that, in the patch, we store ParallelBlockTableScanWorker, the per-worker parallel scan state, in DSM so that different parallel workers can resume the scan using the same parallel scan state.

In addition to that, since we could end up launching fewer workers than requested, it could happen that some ParallelBlockTableScanWorker data is used once and never used again while it still has unprocessed blocks. To handle this case, in the patch, the leader process checks at the end of the parallel scan whether there is an uncompleted parallel scan. If so, the leader process does the scan using the worker's ParallelBlockTableScanWorker data on behalf of the workers.

# Discussions

I'm somewhat convinced of the overall design of this feature, but there are some points regarding the implementation we need to discuss.

In the patch, I extended vacuumparallel.c to support parallel table scan (and vacuum in the future). So I was required to add some table AM callbacks such as DSM size estimation, DSM initialization, and the actual table scan. We need to verify that these APIs are appropriate. Specifically, if we want to support both parallel heap scan and parallel heap vacuum, do we want to add separate callbacks for them? It could be overkill since such a 2-pass vacuum strategy is specific to the heap AM.

As another implementation idea, we might want to implement parallel heap scan/vacuum in vacuumlazy.c while minimizing changes to vacuumparallel.c. That way, we would not need to add table AM callbacks. However, we would end up with duplicated code related to parallel operation in vacuum, such as vacuum delays. Also, we might need to add some functions to share GlobalVisState among parallel workers, since GlobalVisState is a private struct.
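The pause-and-resume handling described above can be sketched as a small sequential simulation. This is only a model under stated assumptions, not the patch's code: WorkerScanState loosely plays the role of the per-worker ParallelBlockTableScanWorker data kept in DSM, the `budget` argument stands in for the TidStore memory limit, and all class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkerScanState:
    """Hypothetical per-worker slot that survives across scan rounds."""
    next_block: int = 0   # next block to process in the claimed chunk
    chunk_end: int = 0    # end (exclusive) of the claimed chunk

    def has_unprocessed(self):
        return self.next_block < self.chunk_end


class SharedScan:
    def __init__(self, nblocks, chunk_size, nslots):
        self.nblocks = nblocks
        self.chunk_size = chunk_size
        self.next_chunk_start = 0
        self.slots = [WorkerScanState() for _ in range(nslots)]

    def next_block_for(self, slot):
        """Next block for this slot, claiming a new chunk when needed."""
        st = self.slots[slot]
        if not st.has_unprocessed():
            if self.next_chunk_start >= self.nblocks:
                return None
            st.next_block = self.next_chunk_start
            st.chunk_end = min(st.next_block + self.chunk_size, self.nblocks)
            self.next_chunk_start = st.chunk_end
        block = st.next_block
        st.next_block += 1
        return block


def run_round(scan, active_slots, budget, scanned):
    """Scan round-robin until the budget (memory limit) or the table runs out."""
    progressed = True
    while budget > 0 and progressed:
        progressed = False
        for slot in active_slots:
            if budget == 0:
                break
            block = scan.next_block_for(slot)
            if block is not None:
                scanned.append(block)
                budget -= 1
                progressed = True


def leader_finish_leftovers(scan, scanned):
    """Leader drains chunks left half-done by slots that were not relaunched."""
    for slot, st in enumerate(scan.slots):
        while st.has_unprocessed():
            scanned.append(scan.next_block_for(slot))


def demo():
    scan = SharedScan(nblocks=40, chunk_size=8, nslots=3)
    scanned = []
    run_round(scan, [0, 1, 2], budget=10, scanned=scanned)  # limit reached
    run_round(scan, [0], budget=1000, scanned=scanned)      # fewer workers
    leader_finish_leftovers(scan, scanned)                  # slots 1 and 2
    return scanned
```

In demo(), the first round stops after 10 blocks, the second round resumes with only slot 0, and the leader then completes the partially processed chunks still recorded in slots 1 and 2, so every block is scanned exactly once.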
Other points that I'm somewhat uncomfortable with, or that need to be discussed, remain in the code with XXX comments.

# Benchmark results

* Test-1: parallel heap scan on the table without indexes

I created a 20GB table, made garbage on the table, and ran vacuum while changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s

* Test-2: parallel heap scan on the table with one index

I used a table similar to the one in test-1 but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each vacuum phase (from left: heap scan time, index vacuum time, and heap vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has decent scalability; in both test-1 and test-2, the heap scan got ~4x faster with 4 parallel workers. On the other hand, when it comes to the total vacuum execution time, I could not see much performance improvement in test-2 (45.11 vs. 35.84). Comparing PARALLEL 0 with PARALLEL 1 in test-2, the heap scan got faster (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), and with 1 parallel worker the heap scan in test-2 was not as fast as in test-1 (12.75 vs. 11.39).
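To make the ratios behind these observations explicit, here is a short Python sketch that derives speedups from the timings reported above. The numbers are copied from the two result lists; the variable and function names are made up for this example.

```python
# Timings in seconds, copied from the benchmark results above.
# Test-1: total time (equal to heap scan time) per parallel degree.
test1_total = {0: 21.99, 1: 11.39, 2: 8.36, 3: 6.14, 4: 5.08}

# Test-2: degree -> (total, heap scan, index vacuum, heap vacuum).
test2 = {
    0: (45.11, 21.89, 16.74, 6.48),
    1: (42.13, 12.75, 22.04, 7.23),
    2: (39.27, 8.93, 22.78, 7.45),
    3: (36.53, 6.76, 22.00, 7.65),
    4: (35.84, 5.85, 22.04, 7.83),
}


def speedups_vs_serial(times):
    """Speedup of each parallel degree relative to PARALLEL 0."""
    base = times[0]
    return {degree: base / t for degree, t in times.items()}


test1_speedup = speedups_vs_serial(test1_total)
test2_scan_speedup = speedups_vs_serial({d: v[1] for d, v in test2.items()})
test2_total_speedup = speedups_vs_serial({d: v[0] for d, v in test2.items()})

# Share of the PARALLEL 4 total in test-2 that is spent outside the
# heap scan, i.e. in index vacuum plus heap vacuum.
total4, scan4, index4, heapvac4 = test2[4]
non_scan_share = (index4 + heapvac4) / total4

for degree in sorted(test2):
    print(f"PARALLEL {degree}: "
          f"test-1 {test1_speedup[degree]:.2f}x, "
          f"test-2 heap scan {test2_scan_speedup[degree]:.2f}x, "
          f"test-2 total {test2_total_speedup[degree]:.2f}x")
print(f"non-heap-scan share at PARALLEL 4: {non_scan_share:.0%}")
```

With 4 workers the heap scan speedup is about 4.3x in test-1 and about 3.7x in test-2, while the total time in test-2 improves by only about 1.26x, because index vacuum and heap vacuum together account for roughly 83% of the remaining time.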
I think the reason is that the shared TidStore is not very scalable, since we have a single lock on it. In all cases in test-1, we don't use the shared TidStore since all dead tuples are removed during heap pruning. So the scalability was better overall than in test-2. In the parallel 0 case in test-2, we use the local TidStore, and from a parallel degree of 1 onward in test-2, we use the shared TidStore and the parallel workers update it concurrently. Also, I guess that the lookup performance of the local TidStore is better than that of the shared TidStore because of the differences between a bump context and a DSA area. I think that this difference contributed to the fact that index vacuuming got slower (16.74 vs. 22.04).

There are two obvious ideas to improve the overall vacuum execution time: (1) improve the shared TidStore scalability and (2) support parallel heap vacuum. For (1), several ideas are proposed by the ART authors[1]. I've not tried these ideas, but they might be applicable to our ART implementation. But I prefer to start with (2) since it would be easier.

Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com