Re: Scaling shared buffer eviction - Mailing list pgsql-hackers
From:           Amit Kapila
Subject:        Re: Scaling shared buffer eviction
Msg-id:         CAA4eK1KSeY78HYwjtPJkhLCrM9ugyjQC_6_mzmW85Bg1Rp-T7w@mail.gmail.com
In response to: Re: Scaling shared buffer eviction (Robert Haas <robertmhaas@gmail.com>)
List:           pgsql-hackers
On Tue, Sep 9, 2014 at 12:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Apart from the above, I think this patch requires a catversion bump,
> >> as I have modified the system catalog. However, I have not done that
> >> in the patch, as otherwise it would be a bit difficult to take
> >> performance data.
> >
> > One regression test failed on Linux due to a spacing issue; that is
> > fixed in the attached patch.
>
> I took another read through this patch. Here are some further review comments:
>
> 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
> both num_to_free and tmp_num_to_free.
num_to_free is used to accumulate the total number of buffers freed in
one cycle of BgMoveBuffersToFreelist(), which is reported as debug info
(BGW_DEBUG); tmp_num_to_free is a temporary count of the buffers to be
freed in one sub-cycle (the inner while loop).
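For reference, the old structure was roughly as below (a simplified
sketch with placeholder conditions, not the exact patch code):

    int     num_to_free = 0;    /* total freed this cycle, for BGW_DEBUG */

    while (/* freelist is still below the high water mark */)
    {
        /* count of buffers to free in this sub-cycle */
        int     tmp_num_to_free = /* batch size for this sub-cycle */;

        while (tmp_num_to_free > 0)
        {
            /* ... try to move one buffer to the freelist ... */
            tmp_num_to_free--;
            num_to_free++;
        }
    }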
> I'd get rid of tmp_num_to_free
> and move the declaration of num_to_free inside the outer loop. I'd
> also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
> tmp_recent_backend_clocksweep into the innermost scope in which they
> are used.
Okay, I have moved the tmp_* variables into the innermost scope.
> 2. Also in that function, I think the innermost bit of logic could be
> rewritten more compactly, and in such a way as to make it clearer for
> what set of instructions the buffer header will be locked.
>     LockBufHdr(bufHdr);
>     if (bufHdr->refcount == 0)
>     {
>         if (bufHdr->usage_count > 0)
>             bufHdr->usage_count--;
>         else
>             add_to_freelist = true;
>     }
>     UnlockBufHdr(bufHdr);
>     if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
>         num_to_free--;
Changed as per suggestion.
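One detail the compressed snippet above omits: add_to_freelist needs to
be reset at the top of each iteration, i.e.

    bool    add_to_freelist = false;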
> 3. This comment is now obsolete:
>
> + /*
> + * If bgwriterLatch is set, we need to waken the bgwriter, but we should
> + * not do so while holding freelist_lck; so set it after releasing the
> + * freelist_lck. This is annoyingly tedious, but it happens at most
> + * once per bgwriter cycle, so the performance hit is minimal.
> + */
> +
>
> We're not actually holding any lock in need of releasing at that point
> in the code, so this can be shortened to "If bgwriterLatch is set, we
> need to waken the bgwriter."
Changed as per suggestion.
> * Ideally numFreeListBuffers should get called under freelist spinlock,
>
> That doesn't make any sense. numFreeListBuffers is a variable, so you
> can't "call" it. The value should be *read* under the spinlock, but
> it is. I think this whole comment can be deleted and replaced with
> "If the number of free buffers has fallen below the low water mark,
> awaken the bgreclaimer to repopulate it."
Changed as per suggestion.
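To make the mechanism concrete, the check is along these lines (a
minimal sketch; freelistLowWaterMark and bgreclaimerLatch are
placeholder names, not necessarily the ones in the patch):

    /*
     * After handing out a buffer from the freelist: if the number of
     * free buffers has fallen below the low water mark, awaken the
     * bgreclaimer to repopulate it.
     */
    if (StrategyControl->numFreeListBuffers < freelistLowWaterMark)
        SetLatch(StrategyControl->bgreclaimerLatch);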
> 4. StrategySyncStartAndEnd() is kind of a mess. One, it can return
> the same victim buffer that's being handed out - at almost the same
> time - to a backend running the clock sweep; if it does, they'll fight
> over the buffer. Two, the *end out parameter actually returns a
> count, not an endpoint. I think we should have
> BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
> top of the inner loop rather than the bottom, and change
> StrategySyncStartAndEnd() so that it knows nothing about
> victimbuf_lck. Let's also change StrategyGetBuffer() to call
> StrategySyncNextVictimBuffer() so that the logic is centralized in one
> place, and rename StrategySyncStartAndEnd() to something that better
> matches its revised purpose.
Changed as per suggestion. I have also updated
StrategySyncNextVictimBuffer() such that it increments completePasses
on completion of a cycle, as I think it is appropriate to update it
even when the clock sweep is done by bgreclaimer.
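A minimal sketch of what the centralized function now does (using the
existing StrategyControl fields; not the exact patch code):

    static int
    StrategySyncNextVictimBuffer(void)
    {
        int     victim;

        SpinLockAcquire(&StrategyControl->victimbuf_lck);
        victim = StrategyControl->nextVictimBuffer;
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            /* clock hand wrapped around: one full pass is complete */
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        SpinLockRelease(&StrategyControl->victimbuf_lck);

        return victim;
    }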
> Maybe StrategyGetReclaimInfo().
I have changed it to StrategyGetFreelistAccessInfo(), since most other
functions in freelist.c have names that relate to buffers or the
freelist.
> 5. Have you tested that this new bgwriter statistic is actually
> working? Because it looks to me like BgMoveBuffersToFreelist is
> changing BgWriterStats but never calling pgstat_send_bgwriter(), which
> I'm thinking will mean the counters accumulate forever inside the
> reclaimer but never get forwarded to the stats collector.
pgstat_send_bgwriter() is called in the bgreclaimer loop (the caller of
BgMoveBuffersToFreelist()), similar to what bgwriter does. I did a few
tests with it before sending the previous patch.
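Roughly, the bgreclaimer main loop has this shape (simplified; latch
waits, signal handling, and shutdown logic elided):

    for (;;)
    {
        /* move buffers to the freelist; updates BgWriterStats counters */
        BgMoveBuffersToFreelist();

        /* forward the accumulated counters to the stats collector */
        pgstat_send_bgwriter();

        /* ... wait on our latch until a backend wakes us ... */
    }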
> 6. StrategyInitialize() uses #defines for
> HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
> LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
> for clamping. Let's have constants for all of those (and omit mention
> of the specific values in the comments).
Changed as per suggestion.
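For illustration, the clamping now uses named constants along these
lines (the names here are only examples, not necessarily the ones used
in the patch):

    /* bounds that were previously written inline as 5 and 2000 */
    #define MIN_FREELIST_BUFFERS    5
    #define MAX_FREELIST_BUFFERS    2000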
> 7. The line you've added to the definition of view pg_stat_bgwriter
> doesn't seem to be indented the same way as all the others. Tab vs.
> space problem?
Fixed.
Performance Data:
-------------------------------
Configuration and DB details:
IBM POWER-7: 16 cores, 64 hardware threads
RAM = 64GB
Database locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client count = number of concurrent sessions and threads (e.g. -c 8 -j 8)
Duration of each individual run = 5 mins
All the data is in tps and taken using a pgbench read-only load.
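Each data point corresponds to a pgbench run along these lines (the
exact invocation may have differed):

    pgbench -S -c <clients> -j <threads> -T 300 <dbname>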
Client Count/Patch_ver |     8 |     16 |     32 |     64 |    128
HEAD                   | 58614 | 107370 | 140717 | 104357 |  65010
Patch                  | 61825 | 115152 | 170952 | 217389 | 220994
Observations
--------------------
1. The scalability/performance is similar to the previous patch, and
slightly better at higher client counts.
2. I have taken performance data for just one configuration, as there
doesn't seem to be any fundamental change that could impact
performance.