Re: Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers
From | Heikki Linnakangas
Subject | Re: Vacuum: allow usage of more than 1GB of work mem
Date |
Msg-id | cd8f7b62-17e1-4307-9f81-427922e5a1f6@iki.fi
In response to | Re: Vacuum: allow usage of more than 1GB of work mem (Claudio Freire <klaussfreire@gmail.com>)
Responses | Re: Vacuum: allow usage of more than 1GB of work mem
 | Re: Vacuum: allow usage of more than 1GB of work mem
 | Re: Vacuum: allow usage of more than 1GB of work mem
List | pgsql-hackers
On 06/04/18 01:59, Claudio Freire wrote:
> The iteration interface, however, seems quite specific for the use
> case of vacuumlazy, so it's not really a good abstraction.

Can you elaborate? It does return the items one block at a time. Is that what you mean by being specific for vacuumlazy? I guess that's a bit special, but if you imagine some other users for this abstraction, it's probably not that unusual. For example, if we started using it in bitmap heap scans, a bitmap heap scan would also want to get the TIDs one block number at a time.

> It also copies stuff a lot, so it's quite heavyweight. I'd suggest
> trying to go for a lighter weight interface with less overhead that
> is more general at the same time.

Note that there was similar copying, to construct an array of OffsetNumbers, happening in lazy_vacuum_page() before this patch. So the net amount of copying is the same.

I'm envisioning that this data structure will sooner or later be optimized further, so that when you have a lot of TIDs pointing to the same block, we would pack them more tightly, storing the block number just once, with an array of offset numbers. This interface that returns an array of offset numbers matches that future well, as the iterator could just return a pointer to the array of offset numbers, with no copying. (If we end up doing something even more dense, like a bitmap, then it doesn't help, but that's ok too.)
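To make that concrete, here is a rough sketch of the kind of per-block interface I mean. All the names below are made up for illustration, not taken from the patch:

#include "postgres.h"
#include "storage/block.h"      /* BlockNumber */
#include "storage/off.h"        /* OffsetNumber */

/* One block's worth of dead items, as handed out by the iterator. */
typedef struct DeadTidsBlock
{
    BlockNumber   blkno;        /* heap block the offsets belong to */
    int           noffsets;     /* number of dead items on this block */
    OffsetNumber *offsets;      /* points into the structure itself */
} DeadTidsBlock;

typedef struct DeadTidsIter DeadTidsIter;   /* opaque iterator state */

/* Fills *result with the next block's dead TIDs; false when exhausted. */
extern bool dead_tids_iterate_next(DeadTidsIter *iter, DeadTidsBlock *result);

With a packed representation, "offsets" can point straight at the array stored in the tree, which is what makes the interface cheap.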
> About the B-tree, however, I don't think a B-tree is a good idea.
>
> Trees' main benefit is that they can be inserted to efficiently. When
> all your data is loaded sequentially, in-order, in-memory and
> immutable, the tree is pointless, more costly to build, and harder to
> maintain - in terms of code complexity.
>
> In this use case, the only benefit of B-trees would be that they're
> optimized for disk access.

Those are not the reasons for which I'd prefer a B-tree. A B-tree has good cache locality, and when you don't need to worry about random insertions, page splits, deletions etc., it's also very simple to implement. This patch is not much longer than the segmented multi-array.

> On the other side, using B-trees incurs memory overhead due to the
> need for internal nodes, can fragment memory because internal nodes
> aren't the same size as leaf nodes, is easier to get wrong and
> introduce bugs... I don't see a gain.

The memory overhead incurred by the internal nodes is quite minimal, and can be adjusted by changing the node sizes. After some experimentation, I settled on 2048 items per leaf node, and 64 items per internal node. With those values, the overhead caused by the internal nodes is minimal, below 0.5%. That seems fine, but we could increase the node sizes to bring it further down, if we'd prefer that tradeoff.

I don't understand what memory fragmentation problems you're worried about. The tree grows one node at a time, as new TIDs are added, until it's all released at the end. I don't see how the size of internal vs leaf nodes matters.
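For what it's worth, here's the back-of-the-envelope arithmetic behind that 0.5% figure, using hypothetical node layouts (not the patch's actual definitions):

#include "postgres.h"
#include "storage/itemptr.h"    /* ItemPointerData, 6 bytes */

#define LEAF_ITEMS      2048
#define INTERNAL_ITEMS  64

typedef struct TidTreeLeaf
{
    int             nitems;
    ItemPointerData itemptrs[LEAF_ITEMS];       /* ~12 kB of TIDs */
} TidTreeLeaf;

typedef struct TidTreeInternal
{
    int             nitems;
    ItemPointerData firstkeys[INTERNAL_ITEMS];  /* first TID of each child */
    void           *children[INTERNAL_ITEMS];
} TidTreeInternal;

/*
 * Each leaf holds 2048 six-byte TIDs, about 12 kB. Each internal-node
 * entry (key + child pointer, roughly 14-16 bytes) covers one whole
 * leaf, so the first internal level adds about 16 / (2048 * 6) = 0.13%
 * on top of the leaves, and each level above that another 1/64th of
 * that. Doubling LEAF_ITEMS would halve the overhead again.
 */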
> If you propose its use, at least benchmark it to show some gain.

Sure. I used the attached script to test this. It's inspired by the test script you posted. It creates a pgbench database with scale factor 100, deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() more heavily, the test script creates a number of extra indexes. Half of them are on the primary key, just to get more repetitions without having to re-initialize in between, and the rest are like this:

    create index random_1 on pgbench_accounts((hashint4(aid)))

to stress lazy_tid_reaped() with a random access pattern, rather than the sequential one that you get with the primary key index.
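The script itself is in the attachment rather than reproduced here, but from that description a reconstruction would look roughly like this (the index names and the exact delete condition are mine, and the database is assumed to be initialized with "pgbench -i -s 100" first):

    -- hypothetical reconstruction; the real attached script may differ
    create index pk_copy_1 on pgbench_accounts (aid);   -- more repetitions per run
    create index pk_copy_2 on pgbench_accounts (aid);
    create index random_1 on pgbench_accounts ((hashint4(aid)));  -- random access
    create index random_2 on pgbench_accounts ((hashint4(aid)));

    delete from pgbench_accounts where random() < 0.8;  -- drop ~80% of the rows
    vacuum verbose pgbench_accounts;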
I ran the test script on my laptop, with unpatched master, with your latest multi-array patch, and with the attached version of the b-tree patch. The results are quite noisy, unfortunately, so I wouldn't draw very strong conclusions from it, but it seems that the performance of all three versions is roughly the same. I looked in particular at the CPU time spent in the index vacuums, as reported by VACUUM VERBOSE.

> Furthermore, among the 200-ish messages this thread has accumulated,
> better ideas have been proposed, better because they do use less
> memory and are faster (like using bitmaps when possible), but if we
> can't push a simple refactoring first, there's no chance a bigger
> rewrite will fare better. Remember, in this use case, using less
> memory far outweighs any other consideration. Less memory directly
> translates to less iterations over the indexes, because more can be
> crammed into m_w_m, which is a huge time saving. Far more than any
> micro-optimization.
>
> About 2 years ago, I chose to try to push this simple algorithm
> first, then try to improve on it with better data structures. Nobody
> complained at the time (I think, IIRC), and I don't think it fair to
> go and revisit that now. It just delays getting a solution for this
> issue for the pursuit of "the perfect implementation" that might
> never arrive. Or even if it does, there's nothing stopping us from
> pushing another patch in the future with that better implementation
> if we wish. Let's get something simple and proven first.

True, all that. My point is that the multi-segmented array isn't all that simple and proven, compared to an also straightforward B-tree. It's pretty similar to a B-tree, actually, except that it has exactly two levels, and the node (= segment) sizes grow exponentially. I'd rather go with a true B-tree than something homegrown that resembles a B-tree, but not quite.

> I'm attaching again one version of them (I've been modifying it to
> suit my purposes at each review round), you'll probably want to tweak
> it to build test cases good for your purpose here.

Thanks! Attached is a new version of my b-tree patch. Compared to yesterday's version, I fixed a bunch of bugs that turned up in testing.

Looking at the changes to the regression test in this, I don't quite understand what it's all about. What are the "wait_barriers" for? If I understand correctly, they're added so that the VACUUMs can remove the tuples that are deleted in the test. But why are they needed now? Was that an orthogonal change we should've done anyway? Rather than add those wait_barriers, should we stop running the 'vacuum' test in parallel with the other tests? Or maybe it's a good thing to run it in parallel, to test some other things? What are the new tests supposed to cover? The test comment says "large mwm vacuum runs", but it sets maintenance_work_mem to 1 MB, which isn't very large.

- Heikki

Attachment