Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From: Tomas Vondra
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Msg-id: 493bd9a1-788b-41ec-8848-f64949e52b33@vondra.me
In response to: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock (Greg Burd <greg@burd.me>)
List: pgsql-hackers

Hi,

I took a look at this thread / patches, mostly because it's somewhat related to the NUMA patch series [1]. The freelist is one of the things the NUMA patches tweak, splitting it into per-node partitions. So removing the freelist would affect the NUMA patches, mostly simplifying them by making the freelist part irrelevant.

On the whole the idea seems reasonable - I'm not against removing the freelist, assuming Andres is right and the freelist is not very useful anyway. So I decided to validate this assumption by running a couple of benchmarks ...

benchmarks
----------

I did two tests:

1) run-pgbench.sh

   - read-only pgbench
   - scale set to different fractions of shared buffers (50%, 90%, 110%)
   - tracks tps

2) run-seqscan.sh

   - concurrent evict+seqscans on different tables
   - shared buffers set to different fractions of the total dataset size
     (50%, 90%, 110%)
   - similar to the workload Andres suggested in [2], except that it uses
     a simple SELECT instead of pg_prewarm (the patches change how
     pg_prewarm decides to stop, and I didn't want to be affected by that)
   - tracks tps, and the average latency of the evict + select

The attached test scripts are "complete" - they set up an instance, disable checksums, set a couple of GUCs, etc.

I wanted to see the effect of each patch, so the tests were done on master, and then with the patches applied one by one (after fixing the compile failures in 0001). I did this on two EPYC machines with 96 cores (lscpu attached), and the results are virtually the same. Attached are PDFs with results from one of the machines.

On the whole, the impact of the patches is negligible - it's within 1% in either direction, and I don't see any particular trend / systemic change in the behavior (e.g. regressions for some cases). Seems like noise, which supports the assumption that the freelist is not very useful.

The one positive exception is "evict latency", which tracks the latency of pg_buffercache_evict_relation(). That got consistently a little better, which makes sense as it no longer needs to maintain the freelist.

freelist statistics?
--------------------

There's one caveat, though. I find it tricky to "know" whether the workload actually uses the freelist. Is there a good way to determine if a buffer was acquired from the freelist, or through the clock-sweep? I'm worried the tests might actually have empty freelists, or that the freelists won't be used very much, in which case removing the freelist understandably has no impact. But that says nothing about cases that get buffers from the freelist often ...

I can probably think about each workload and deduce whether there will be freelists. For the benchmarks described earlier, I think the situation is this:

  pgbench, shared buffers < 100%  -> no freelists, uses clock-sweep
  pgbench, shared buffers > 100%  -> freelists, but unused (no evictions)
  seqscan, shared buffers < 100%  -> freelists, but clock-sweep too
  seqscan, shared buffers > 100%  -> freelists, no clock-sweep

But maybe I got it wrong for some cases? It'd be good to have a way to collect some stats for each test, to confirm this and also to quantify the effects. A test that gets 10% of its buffers from the freelist may behave differently from a test that gets 99% of them from the freelist.

I propose we add a system view showing interesting information about freelists and buffer allocation, or perhaps extend an existing one (e.g. pg_stat_bgwriter, which already has buffers_alloc). In the NUMA patches I simply added a new view into pg_buffercache. I'm not suggesting this would get committed - at this point I'm more focused on the development, and whether the view would be useful outside the development is an open question. Also, if it adds stats about freelists (for current master), that'd be useless once we remove them.
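To make that a bit more concrete, here's roughly the kind of thing I have in mind - just a sketch with made-up names, not code from any of the patches:

#include "port/atomics.h"

/*
 * Hypothetical counters for the two buffer allocation paths, kept in
 * shared memory (next to StrategyControl, say). A system view or a
 * pg_buffercache function could expose them, so each test run can
 * report its freelist vs. clock-sweep allocation mix.
 */
typedef struct BufferAllocStats
{
    pg_atomic_uint64 freelist_allocs;   /* buffers taken from the freelist */
    pg_atomic_uint64 clocksweep_allocs; /* buffers found by the clock-sweep */
} BufferAllocStats;

static BufferAllocStats *AllocStats;

/* bumped in StrategyGetBuffer() when a victim comes off the freelist */
static inline void
CountFreelistAlloc(void)
{
    pg_atomic_fetch_add_u64(&AllocStats->freelist_allocs, 1);
}

/* bumped in StrategyGetBuffer() when the clock-sweep finds a victim */
static inline void
CountClockSweepAlloc(void)
{
    pg_atomic_fetch_add_u64(&AllocStats->clocksweep_allocs, 1);
}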
Now, a couple of review comments about the individual patches:

0001
----

1) nitpick: adds a blank line at the end of buffer/README

2) gcc complains about a missing have_free_buffer prototype, but it's no longer needed and can be removed

3) freelist.c fails to compile, because of:

freelist.c: In function ‘have_free_buffer’:
freelist.c:166:51: error: passing argument 1 of ‘pg_atomic_read_u64’ from incompatible pointer type [-Wincompatible-pointer-types]
  166 |     uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
      |                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                      |
      |                                      pg_atomic_uint32 *

That is, it should be accessed using pg_atomic_read_u32. CI/cfbot does not report this, because it applies all the patches, and 0002 fixes this.

0002
----

1) I think the buffer/README needs a bit more work. It now suddenly mentions completePasses, without explaining what it's for and how it works with nextVictimBuffer. And then it's not mentioned again, so why even mention it? We don't mention various other fields ... It's true completePasses is not a new thing, and probably should have already been explained by the README. Still, it seems a bit confusing.

2) It'd be good if the README explained how we coordinate access to multiple atomic values (with the spinlock removed), and why the current approach is safe, or how it's made safe. Or at least mention that we thought about it, and where to look for details (e.g. in a comment for a particular function).

3) I have no opinion on how "clock-sweep" should be spelled, but if we want to unify the spelling, it should be done in a separate preparatory patch, so that it does not mix with the actual changes. It adds quite a bit of churn.

4) I'm not sure about:

    Assert(prev_strategy_passes <= strategy_passes);

Either it's unlikely but possible, and then it should be elog(ERROR), or we consider it impossible (and then the assert is fine). Maybe just rephrase the comment to say we consider the overflow impossible. Also, no need to split the passes_delta assignment, I think. It's fine to subtract before the assert - the value will be bogus/negative, but so what? It's not an undefined / invalid memory access or anything like that. (A short sketch follows at the end of this section.)

5) nitpick: freelist.c has incorrect formatting of "{" in a couple of places, e.g. it should be on a new line in a struct definition, etc. pgindent would eventually fix this, but it bugs me.

6) I don't like the CLOCK_HAND_POSITION() naming very much, partially because it's not really clear what a "clock hand" is. Intuitively I know what it's supposed to mean, also because I've worked with this code before. But not everyone has that benefit, and may assume something else. IMO if nextVictimBuffer is no longer meant to be a "buffer", but a counter that only ever increases, we should call it something different - say, "clockSweepCounter". Then CLOCK_HAND_POSITION() could be CLOCKSWEEP_NEXT_BUFFER(), and why not have a separate CLOCKSWEEP_COMPLETE_PASSES() macro too? (See the second sketch at the end of this section.)

7) I don't understand why the comment mentions the BufferDescriptor array at all. Sure, we may look at the descriptor, but why is this detail relevant in this place? Also, shouldn't it say "divide by NBuffers" at the end?

8) CLOCK_HAND_POSITION() should probably have a comment explaining why it does the masking, etc., and why it's better than simply doing the modulo on the counter directly. I mean, why is this better than just doing (counter % NBuffers)? It's going to work on uint64 anyway.

9) There was a discussion about doing the modulo in a better way, but I don't see any message explaining it clearly enough for me. And then there were some issues with it, and I guess the idea was abandoned. I'm asking because my NUMA patches do a modulo in a couple of places in the clock-sweep part too, so if there's a better way, I'd like to know.

10) Why initialize "hand" to UINT64_MAX? Seems unnecessary.

11) nitpick: StrategySyncStart formatting is a bit off
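To illustrate 4), this is the shape I had in mind - just a sketch, reusing the variable names quoted above:

    /*
     * Subtract first, assert after. If the overflow is considered
     * impossible, the assert is enough; otherwise make it elog(ERROR).
     */
    passes_delta = strategy_passes - prev_strategy_passes;
    Assert(prev_strategy_passes <= strategy_passes);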
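And for 6), roughly this kind of renaming - the names are just a strawman, of course:

    /*
     * nextVictimBuffer becomes a monotonically increasing counter
     * (clockSweepCounter), and each derived value gets an explicit macro.
     * Shown with plain modulo / division - see 8) for the question of
     * whether the masking approach is actually better.
     */
    #define CLOCKSWEEP_NEXT_BUFFER(counter)     ((int) ((counter) % NBuffers))
    #define CLOCKSWEEP_COMPLETE_PASSES(counter) ((counter) / NBuffers)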
0003
----

1) I don't quite understand the purpose of this part. How is this abstracting anything, compared to what we already have (or what the two earlier parts did)? And if it "abstracts the algorithm", what would be the purpose? Do we expect some alternative algorithms? And would 0003 really make that easier, compared to just changing the code directly? I don't get it.

2) I don't understand why we need to explicitly pass the "clock" to every ClockSweepCycles/ClockSweepPosition call. I mean, there's only ever one instance anyway - at least without "my" NUMA patches. And even with the NUMA patches, the partition is determined inside those functions, otherwise every caller would have to repeat that. I think this is unnecessary/inconvenient.

regards

[1] https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017%40vondra.me
[2] https://www.postgresql.org/message-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3%40euyewe6tf3z3

-- 
Tomas Vondra