Re: Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers
From | Heikki Linnakangas
Subject | Re: Vacuum: allow usage of more than 1GB of work mem
Date |
Msg-id | cd8f7b62-17e1-4307-9f81-427922e5a1f6@iki.fi
In response to | Re: Vacuum: allow usage of more than 1GB of work mem (Claudio Freire <klaussfreire@gmail.com>)
Responses | Re: Vacuum: allow usage of more than 1GB of work mem
 | Re: Vacuum: allow usage of more than 1GB of work mem
 | Re: Vacuum: allow usage of more than 1GB of work mem
List | pgsql-hackers
On 06/04/18 01:59, Claudio Freire wrote:
> The iteration interface, however, seems quite specific for the use
> case of vacuumlazy, so it's not really a good abstraction.

Can you elaborate? It does return the items one block at a time. Is that what you mean by being specific for vacuumlazy? I guess that's a bit special, but if you imagine some other users for this abstraction, it's probably not that unusual. For example, if we started using it in bitmap heap scans, a bitmap heap scan would also want to get the TIDs one block number at a time.

> It also copies stuff a lot, so it's quite heavyweight. I'd suggest
> trying to go for a lighter weight interface with less overhead that
> is more general at the same time.

Note that there was similar copying, to construct an array of OffsetNumbers, happening in lazy_vacuum_page() before this patch. So the net amount of copying is the same.

I'm envisioning that this data structure will sooner or later be optimized further, so that when you have a lot of TIDs pointing to the same block, we would pack them more tightly, storing the block number just once, with an array of offset numbers. This interface that returns an array of offset numbers matches that future well, as the iterator could just return a pointer to the array of offset numbers, with no copying. (If we end up doing something even more dense, like a bitmap, then it doesn't help, but that's ok too.)
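To make that concrete, here is a rough sketch of the kind of per-block interface I mean. All the names below are made up for illustration, not taken from the patch:

#include "postgres.h"
#include "storage/block.h"      /* BlockNumber */
#include "storage/off.h"        /* OffsetNumber */

/* One block's worth of dead items, as handed out by the iterator. */
typedef struct DeadTidsBlock
{
    BlockNumber   blkno;        /* heap block the offsets belong to */
    int           noffsets;     /* number of dead items on this block */
    OffsetNumber *offsets;      /* points into the structure itself */
} DeadTidsBlock;

typedef struct DeadTidsIter DeadTidsIter;   /* opaque iterator state */

/* Fills *result with the next block's dead TIDs; false when exhausted. */
extern bool dead_tids_iterate_next(DeadTidsIter *iter, DeadTidsBlock *result);

With a packed representation, "offsets" can point straight at the array stored in the tree, which is what makes the interface cheap.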
> About the B-tree, however, I don't think a B-tree is a good idea.
>
> Trees' main benefit is that they can be inserted to efficiently. When
> all your data is loaded sequentially, in-order, in-memory and
> immutable, the tree is pointless, more costly to build, and harder to
> maintain - in terms of code complexity.
>
> In this use case, the only benefit of B-trees would be that they're
> optimized for disk access.

Those are not the reasons for which I'd prefer a B-tree. A B-tree has good cache locality, and when you don't need to worry about random insertions, page splits, deletions etc., it's also very simple to implement. This patch is not much longer than the segmented multi-array.

> On the other side, using B-trees incurs memory overhead due to the
> need for internal nodes, can fragment memory because internal nodes
> aren't the same size as leaf nodes, is easier to get wrong and
> introduce bugs... I don't see a gain.

The memory overhead incurred by the internal nodes is quite minimal, and can be adjusted by changing the node sizes. After some experimentation, I settled on 2048 items per leaf node, and 64 items per internal node. With those values, the overhead caused by the internal nodes is minimal, below 0.5%. That seems fine, but we could increase the node sizes to bring it further down, if we'd prefer that tradeoff.

I don't understand what memory fragmentation problems you're worried about. The tree grows one node at a time, as new TIDs are added, until it's all released at the end. I don't see how the size of internal vs leaf nodes matters.
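For what it's worth, here's the back-of-the-envelope arithmetic behind that 0.5% figure, using hypothetical node layouts (not the patch's actual definitions):

#include "postgres.h"
#include "storage/itemptr.h"    /* ItemPointerData, 6 bytes */

#define LEAF_ITEMS      2048
#define INTERNAL_ITEMS  64

typedef struct TidTreeLeaf
{
    int             nitems;
    ItemPointerData itemptrs[LEAF_ITEMS];       /* ~12 kB of TIDs */
} TidTreeLeaf;

typedef struct TidTreeInternal
{
    int             nitems;
    ItemPointerData firstkeys[INTERNAL_ITEMS];  /* first TID of each child */
    void           *children[INTERNAL_ITEMS];
} TidTreeInternal;

/*
 * Each leaf holds 2048 six-byte TIDs, about 12 kB. Each internal-node
 * entry (key + child pointer, roughly 14-16 bytes) covers one whole
 * leaf, so the first internal level adds about 16 / (2048 * 6) = 0.13%
 * on top of the leaves, and each level above that another 1/64th of
 * that. Doubling LEAF_ITEMS would halve the overhead again.
 */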
> If you propose its use, at least benchmark it to show some gain.

Sure. I used the attached script to test this. It's inspired by the test script you posted. It creates a pgbench database with scale factor 100, deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() more heavily, the test script creates a number of extra indexes. Half of them are on the primary key, just to get more repetitions without having to re-initialize in between, and the rest are like this:

    create index random_1 on pgbench_accounts((hashint4(aid)))

to stress lazy_tid_reaped() with a random access pattern, rather than the sequential one that you get with the primary key index.
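The script itself is in the attachment rather than reproduced here, but from that description a reconstruction would look roughly like this (the index names and the exact delete condition are mine, and the database is assumed to be initialized with "pgbench -i -s 100" first):

    -- hypothetical reconstruction; the real attached script may differ
    create index pk_copy_1 on pgbench_accounts (aid);   -- more repetitions per run
    create index pk_copy_2 on pgbench_accounts (aid);
    create index random_1 on pgbench_accounts ((hashint4(aid)));  -- random access
    create index random_2 on pgbench_accounts ((hashint4(aid)));

    delete from pgbench_accounts where random() < 0.8;  -- drop ~80% of the rows
    vacuum verbose pgbench_accounts;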
I ran the test script on my laptop, with unpatched master, with your latest multi-array patch, and with the attached version of the b-tree patch. The results are quite noisy, unfortunately, so I wouldn't draw very strong conclusions from it, but it seems that the performance of all three versions is roughly the same. I looked in particular at the CPU time spent in the index vacuums, as reported by VACUUM VERBOSE.

> Furthermore, among the 200-ish messages this thread has accumulated,
> better ideas have been proposed, better because they do use less
> memory and are faster (like using bitmaps when possible), but if we
> can't push a simple refactoring first, there's no chance a bigger
> rewrite will fare better. Remember, in this use case, using less
> memory far outweighs any other consideration. Less memory directly
> translates to less iterations over the indexes, because more can be
> crammed into m_w_m, which is a huge time saving. Far more than any
> micro-optimization.
>
> About 2 years ago, I chose to try to push this simple algorithm
> first, then try to improve on it with better data structures. Nobody
> complained at the time (I think, IIRC), and I don't think it fair to
> go and revisit that now. It just delays getting a solution for this
> issue for the pursuit of "the perfect implementation" that might
> never arrive. Or even if it does, there's nothing stopping us from
> pushing another patch in the future with that better implementation
> if we wish. Let's get something simple and proven first.

True, all that. My point is that the multi-segmented array isn't all that simple and proven, compared to an also straightforward B-tree. It's pretty similar to a B-tree, actually, except that it has exactly two levels, and the node (= segment) sizes grow exponentially. I'd rather go with a true B-tree than something homegrown that resembles a B-tree, but not quite.

> I'm attaching again one version of them (I've been modifying it to
> suit my purposes at each review round), you'll probably want to tweak
> it to build test cases good for your purpose here.

Thanks! Attached is a new version of my b-tree patch. Compared to yesterday's version, I fixed a bunch of bugs that turned up in testing.

Looking at the changes to the regression test in this, I don't quite understand what it's all about. What are the "wait_barriers" for? If I understand correctly, they're added so that the VACUUMs can remove the tuples that are deleted in the test. But why are they needed now? Was that an orthogonal change we should've done anyway? Rather than add those wait_barriers, should we stop running the 'vacuum' test in parallel with the other tests? Or maybe it's a good thing to run it in parallel, to test some other things? What are the new tests supposed to cover? The test comment says "large mwm vacuum runs", but it sets maintenance_work_mem to 1 MB, which isn't very large.

- Heikki

Attachment