Thread: Misaligned BufferDescriptors causing major performance problems on AMD
Hi,

In the nearby thread at http://archives.postgresql.org/message-id/20140202140014.GM5930%40awork2.anarazel.de Peter and I discovered that there is a large performance difference between different max_connections settings on a larger machine (4x Opteron 6272, 64 cores together) in a read-only pgbench test... Just as a reference, we're talking about a performance degradation from 475963.613865 tps to 197744.913556 in a pgbench -S -cj64 just by setting max_connections to 90, from 91...

On 2014-02-02 15:00:14 +0100, Andres Freund wrote:
> On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote:
> > Here are the results of a benchmark on Nathan Boley's 64-core, 4
> > socket server:
> > http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/
>
> That's interesting. The maximum number of what you see here (~293125)
> is markedly lower than what I can get.
>
> ... poke around ...
>
> Hm, that's partially because you're using pgbench without -M prepared if
> I see that correctly. The bottleneck in that case is primarily memory
> allocation. But even after that I am getting higher numbers: ~342497.
>
> Trying to nail down the difference, it oddly seems to be your
> max_connections=80 vs my 100. The profile in both cases is markedly
> different, with way more spinlock contention at 80. All in
> Pin/UnpinBuffer().
>
> I think =80 has to lead to some data being badly aligned. I can
> reproduce that =91 has *much* better performance than =90: 170841.844938
> vs 368490.268577 in a 10s test. Reproducible both with and without the
> test. That's certainly worth some investigation.
> This is *not* reproducible on the Intel machine, so it might be the
> associativity of the L1/L2 cache on the AMD.

So, I looked into this, and I am fairly certain it's because of the (mis-)alignment of the buffer descriptors. With certain max_connections settings InitBufferPool() happens to get 64-byte-aligned addresses, with others not. I checked the alignment with gdb to confirm that.
A quick hack (attached) making BufferDescriptor 64-byte aligned indeed restored performance across all max_connections settings. It's not surprising that a misaligned buffer descriptor causes problems - there'll be plenty of false sharing of the spinlocks otherwise. Curious that the Intel machine isn't hurt much by this.

Now all this hinges on the fact that by a mere accident BufferDescriptors are 64 bytes in size:

struct sbufdesc {
        BufferTag       tag;                  /*  0  20 */
        BufFlags        flags;                /* 20   2 */
        uint16          usage_count;          /* 22   2 */
        unsigned int    refcount;             /* 24   4 */
        int             wait_backend_pid;     /* 28   4 */
        slock_t         buf_hdr_lock;         /* 32   1 */

        /* XXX 3 bytes hole, try to pack */

        int             buf_id;               /* 36   4 */
        int             freeNext;             /* 40   4 */

        /* XXX 4 bytes hole, try to pack */

        LWLock *        io_in_progress_lock;  /* 48   8 */
        LWLock *        content_lock;         /* 56   8 */

        /* --- cacheline 1 boundary (64 bytes) --- */
        /* size: 64, cachelines: 1, members: 10 */
        /* sum members: 57, holes: 2, sum holes: 7 */
};

We could polish up the attached patch and apply it to all the branches; the cost in memory is minimal. But I wonder if we shouldn't instead make ShmemInitStruct() always return cacheline-aligned addresses. That will require some fiddling, but it might be a good idea nonetheless?

I think we should also consider some more reliable measures to have BufferDescriptors cacheline sized, rather than relying on the happy accident. Debugging alignment issues isn't fun, too much of a guessing game...

Thoughts?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Peter Geoghegan
Date:
On Sun, Feb 2, 2014 at 7:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Just as reference, we're talking about a performance degradation from
> 475963.613865 tps to 197744.913556 in a pgbench -S -cj64 just by setting
> max_connections to 90, from 91...

That's pretty terrible.

> So, I looked into this, and I am fairly certain it's because of the
> (mis-)alignment of the buffer descriptors. With certain max_connections
> settings InitBufferPool() happens to get 64byte aligned addresses, with
> others not. I checked the alignment with gdb to confirm that.

I find your diagnosis to be quite plausible.

> A quick hack (attached) making BufferDescriptor 64byte aligned indeed
> restored performance across all max_connections settings. It's not
> surprising that a misaligned buffer descriptor causes problems -
> there'll be plenty of false sharing of the spinlocks otherwise. Curious
> that the Intel machine isn't hurt much by this.

I think that is explained here:

http://www.agner.org/optimize/blog/read.php?i=142&v=t

With Sandy Bridge, "Misaligned memory operands [are] handled efficiently".

> Now all this hinges on the fact that by a mere accident
> BufferDescriptors are 64byte in size:

Are they 64 bytes in size on REL9_*_STABLE? How about on win64? I think we're reasonably disciplined here already, but long is 32 bits in length even on win64. Looks like it would probably be okay, but as you say, it doesn't seem like something to leave to chance.

> We could polish up the attached patch and apply it to all the branches,
> the costs of memory are minimal. But I wonder if we shouldn't instead
> make ShmemInitStruct() always return cacheline aligned addresses. That
> will require some fiddling, but it might be a good idea nonetheless?

What fiddling are you thinking of?

> I think we should also consider some more reliable measures to have
> BufferDescriptors cacheline sized, rather than relying on the happy
> accident. Debugging alignment issues isn't fun, too much of a guessing
> game...

+1. Maybe make code that isn't appropriately aligned fail to compile?

--
Peter Geoghegan
Re: Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Andres Freund
Date:
On 2014-02-03 15:17:13 -0800, Peter Geoghegan wrote:
> On Sun, Feb 2, 2014 at 7:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Just as reference, we're talking about a performance degradation from
> > 475963.613865 tps to 197744.913556 in a pgbench -S -cj64 just by setting
> > max_connections to 90, from 91...
>
> That's pretty terrible.

Yea, I was scared myself.

> > A quick hack (attached) making BufferDescriptor 64byte aligned indeed
> > restored performance across all max_connections settings. It's not
> > surprising that a misaligned buffer descriptor causes problems -
> > there'll be plenty of false sharing of the spinlocks otherwise. Curious
> > that the Intel machine isn't hurt much by this.
>
> I think that is explained here:
>
> http://www.agner.org/optimize/blog/read.php?i=142&v=t
>
> With Sandy Bridge, "Misaligned memory operands [are] handled efficiently".

No, I don't think so. Those improvements afair refer to unaligned accesses, as in accessing a 4-byte variable at address % 4 != 0.

> > Now all this hinges on the fact that by a mere accident
> > BufferDescriptors are 64byte in size:
>
> Are they 64 bytes in size on REL9_*_STABLE? How about on win64? I
> think we're reasonably disciplined here already, but long is 32-bits
> in length even on win64. Looks like it would probably be okay, but as
> you say, it doesn't seem like something to leave to chance.

I haven't checked any branches yet, but I don't remember any recent changes to sbufdesc. And I think we're lucky enough that all the used types are the same width across common 64bit OSs. But I absolutely agree we shouldn't leave this to chance.

> > We could polish up the attached patch and apply it to all the branches,
> > the costs of memory are minimal. But I wonder if we shouldn't instead
> > make ShmemInitStruct() always return cacheline aligned addresses. That
> > will require some fiddling, but it might be a good idea nonetheless?
>
> What fiddling are you thinking of?
Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before returning from ShmemAlloc() (and thereby ShmemInitStruct()).

After I'd written the email I came across an interesting bit of code:

void *
ShmemAlloc(Size size)
{
    ...
    /* extra alignment for large requests, since they are probably buffers */
    if (size >= BLCKSZ)
        newStart = BUFFERALIGN(newStart);

/*
 * Preferred alignment for disk I/O buffers.  On some CPUs, copies between
 * user space and kernel space are significantly faster if the user buffer
 * is aligned on a larger-than-MAXALIGN boundary.  Ideally this should be
 * a platform-dependent value, but for now we just hard-wire it.
 */
#define ALIGNOF_BUFFER  32

#define BUFFERALIGN(LEN)    TYPEALIGN(ALIGNOF_BUFFER, (LEN))

So, we're already trying to make large allocations (which we'll be using here) fit a 32-byte alignment. Which unfortunately isn't enough here, but it explains why max_connections has to be changed by exactly 1 to achieve adequate performance...

So, the absolutely easiest thing would be to just change ALIGNOF_BUFFER to 64. The current value was accurate in 2003 (a common cacheline size back then), but it really isn't anymore today. Anything relevant is 64 bytes.

But I really think we should also remove the BLCKSZ limit. There are relatively few granular/dynamic usages of ShmemAlloc(), basically just the shared memory hashes. And I'd be rather unsurprised if aligning all allocations for hashes resulted in a noticeable speedup. We have frightening bottlenecks in those today.

> > I think we should also consider some more reliable measures to have
> > BufferDescriptors cacheline sized, rather than relying on the happy
> > accident. Debugging alignment issues isn't fun, too much of a guessing
> > game...
>
> +1. Maybe make code that isn't appropriately aligned fail to compile?

I was wondering about using some union trickery to make sure the array only consists of properly aligned types. That'd require some changes, but luckily code directly accessing the BufferDescriptors array isn't too far spread.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Andres Freund
Date:
On 2014-02-04 00:38:19 +0100, Andres Freund wrote:
> > > A quick hack (attached) making BufferDescriptor 64byte aligned indeed
> > > restored performance across all max_connections settings. It's not
> > > surprising that a misaligned buffer descriptor causes problems -
> > > there'll be plenty of false sharing of the spinlocks otherwise. Curious
> > > that the Intel machine isn't hurt much by this.
> >
> > I think that is explained here:
> >
> > http://www.agner.org/optimize/blog/read.php?i=142&v=t
> >
> > With Sandy Bridge, "Misaligned memory operands [are] handled efficiently".
>
> No, I don't think so. Those improvements afair refer to unaligned
> accesses as in accessing a 4 byte variable at address % 4 != 0.

So, Christian did some benchmarking on the Intel machine, and his results were also lower than mine, and I've since confirmed that it's also possible to reproduce the alignment problems on the Intel machine. Which imo means fixing this got more important...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Peter Geoghegan
Date:
On Tue, Feb 4, 2014 at 4:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Which imo means fixing this got more important...

I strongly agree.

--
Peter Geoghegan
Re: Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Peter Geoghegan
Date:
On Mon, Feb 3, 2014 at 3:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > A quick hack (attached) making BufferDescriptor 64byte aligned indeed
>> > restored performance across all max_connections settings. It's not
>> > surprising that a misaligned buffer descriptor causes problems -
>> > there'll be plenty of false sharing of the spinlocks otherwise. Curious
>> > that the intel machine isn't hurt much by this.
>
>> What fiddling are you thinking of?
>
> Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before
> returning from ShmemAlloc() (and thereby ShmemInitStruct).

There is something you have not drawn explicit attention to that is very interesting. If we take REL9_3_STABLE tip to be representative (built with full -O2 optimization, no assertions, just debugging symbols), setting max_connections to 91 from 90 does not have the effect of making the BufferDescriptors array aligned; it has the effect of making it *misaligned*. You reported that 91 was much better than 90. I think that the problem actually occurs when the array *is* aligned!

I suspect that the scenario described in this article accounts for the quite noticeable effect reported: http://danluu.com/3c-conflict

--
Peter Geoghegan
Re: Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Andres Freund
Date:
On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
> On Mon, Feb 3, 2014 at 3:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > A quick hack (attached) making BufferDescriptor 64byte aligned indeed
> >> > restored performance across all max_connections settings. It's not
> >> > surprising that a misaligned buffer descriptor causes problems -
> >> > there'll be plenty of false sharing of the spinlocks otherwise. Curious
> >> > that the intel machine isn't hurt much by this.
>
> >> What fiddling are you thinking of?
> >
> > Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before
> > returning from ShmemAlloc() (and thereby ShmemInitStruct).
>
> There is something you have not drawn explicit attention to that is
> very interesting. If we take REL9_3_STABLE tip to be representative
> (built with full -O2 optimization, no assertions just debugging
> symbols), setting max_connections to 91 from 90 does not have the
> effect of making the BufferDescriptors array aligned; it has the
> effect of making it *misaligned*. You reported that 91 was much better
> than 90. I think that the problem actually occurs when the array *is*
> aligned!

I don't think you can learn much from the alignment in 9.3 vs. HEAD. Lots has changed since, most prominently and recently Robert's LWLock work. That certainly has changed allocation patterns. It will also depend on some other parameters; e.g. changing max_wal_senders or max_background_workers will also change the offset. It's not that 91 is intrinsically better, it just happened to give an aligned BufferDescriptors array when the other parameters weren't changed at the same time.

> I suspect that the scenario described in this article accounts for the
> quite noticeable effect reported: http://danluu.com/3c-conflict

I don't think that's applicable here. What's described there is relevant for access patterns that are a larger multiple of the cacheline size - but ours is exactly cacheline sized.

What can happen in such scenarios is that all your accesses map to the same set of cachelines, so instead of using most of the cache, you end up using only 8 or so (8 is a common associativity of set-associative caches these days). Theoretically we could see something like that for shared_buffers itself, but I *think* our accesses are too far spread around in them for that to be a significant issue.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
>> I suspect that the scenario described in this article accounts for the
>> quite noticeable effect reported: http://danluu.com/3c-conflict

> I don't think that's applicable here.

Maybe, or maybe not, but I think it does say that we should be very wary of proposals to force data structure alignment without any testing of the consequences.

			regards, tom lane
Re: Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Andres Freund
Date:
On 2014-02-05 09:57:11 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
> >> I suspect that the scenario described in this article accounts for the
> >> quite noticeable effect reported: http://danluu.com/3c-conflict
>
> > I don't think that's applicable here.
>
> Maybe, or maybe not, but I think it does say that we should be very wary
> of proposals to force data structure alignment without any testing of the
> consequences.

I agree it needs testing, but what the page is talking about really, really doesn't have *anything* to do with this. What the author is talking about is not storing things *page* aligned (i.e. 4k or a multiple). The problem is that the beginning of each such page falls into the same cacheline set and thus doesn't utilize the entire L1/L2/L3, but only the single set it maps into.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 5, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe, or maybe not, but I think it does say that we should be very wary
> of proposals to force data structure alignment without any testing of the
> consequences.

Sure. See for instance http://igoro.com/archive/gallery-of-processor-cache-effects/

From what I understand, what Andres is suggesting is ensuring that each BufferDescriptor sits entirely in one cache line. That seems very unlikely to be worse than having a BufferDescriptor spanning two cache lines and sharing a cache line with the adjacent BufferDescriptors.

But this all depends on knowing the length of the cache lines. I see a lot of confusion online over whether cache lines are 64 bytes, 128 bytes, or another length, even just on Intel architectures, let alone others.

I wonder if there are any generic tools to optimize array/structure layouts based on cachegrind profiling or something like that. Then we wouldn't need to know the oddities ourselves and optimize manually. We could maybe even do it on the build farm and select the right profile at build time by matching build target information.

--
greg
On 2014-02-05 16:14:01 +0100, Greg Stark wrote:
> I see a lot of confusion online over whether cache lines
> are 64 bytes, 128 bytes, or other length even just on Intel
> architectures, let alone others.

All current x86 processors use 64. But even if it were bigger/smaller, it would be either 32 or 128. Neither benefits from touching more cachelines than necessary; e.g. in the 128-byte case, we could still touch two with the current code. The effects referred to upthread only affect code accessing larger multiples of the cacheline size, not what we have here.

> I wonder if there are any generic tools to optimize array/structure
> layouts based on cachegrind profiling or something like that. Then we
> wouldn't need to know the oddities ourselves and optimize manually. We
> could maybe even do it on the build farm and select the right profile
> at build time by matching build target information.

There are profiling tools (e.g. perf's -e stalled-cycles-(frontend|backend)), but I don't think there's more than that. And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it just wasn't updated in the last 10 years.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it
> just wasn't updated in the last 10 years.

No, ALIGNOF_BUFFER is there because we read something that said that I/O transfers between userspace and kernel disk cache would be faster with aligned buffers. There's been no particular thought given to alignment of other data structures, AFAIR.

It may well be that your proposal is spot on. But I'd like to see some data-structure-by-data-structure measurements, rather than assuming that alignment must be a good thing.

			regards, tom lane
On 2014-02-05 11:23:29 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it
> > just wasn't updated in the last 10 years.
>
> No, ALIGNOF_BUFFER is there because we read something that said that I/O
> transfers between userspace and kernel disk cache would be faster with
> aligned buffers. There's been no particular thought given to alignment
> of other data structures, AFAIR.

But it's not aligned anymore on at least half a decade's hardware, and it's what we already align *all* bigger ShmemAlloc() values with. And BufferDescriptors surely counts as larger in its entirety.

> It may well be that your proposal is spot on. But I'd like to see some
> data-structure-by-data-structure measurements, rather than assuming that
> alignment must be a good thing.

I am fine with just aligning BufferDescriptors properly. That has clearly shown massive improvements.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 5, 2014 at 11:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-02-05 11:23:29 -0500, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>> > And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it
>> > just wasn't updated in the last 10 years.
>>
>> No, ALIGNOF_BUFFER is there because we read something that said that I/O
>> transfers between userspace and kernel disk cache would be faster with
>> aligned buffers. There's been no particular thought given to alignment
>> of other data structures, AFAIR.
>
> But it's not aligned anymore on at last half a decade's hardware, and
> it's what we already align *all* bigger ShmemAlloc() values with. And
> BufferDescriptors surely counts as larger in its entirety.
>
>> It may well be that your proposal is spot on. But I'd like to see some
>> data-structure-by-data-structure measurements, rather than assuming that
>> alignment must be a good thing.
>
> I am fine with just aligning BufferDescriptors properly. That has
> clearly shown massive improvements.

I thought your previous idea of increasing BUFFERALIGN to 64 bytes had a lot to recommend it. But that doesn't mean it doesn't need testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Peter Geoghegan
Date:
On Wed, Feb 5, 2014 at 7:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> All current x86 processors use 64. But even if it were bigger/smaller,
> they will be either 32, or 128. Neither benefits from touching more
> cachelines than necessary. E.g. in the 128 case, we could still touch
> two with the current code.

That's true, but I believe that a hardware optimization called Adjacent Cache Line Prefetch is widely supported by recent microarchitectures, and is sometimes enabled by default. I'm not suggesting that it would necessarily radically alter the outcome, but it is a consideration.

--
Peter Geoghegan
Re: Re: Misaligned BufferDescriptors causing major performance problems on AMD
From
Peter Geoghegan
Date:
On Tue, Feb 4, 2014 at 4:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> There is something you have not drawn explicit attention to that is
> very interesting. If we take REL9_3_STABLE tip to be representative
> (built with full -O2 optimization, no assertions just debugging
> symbols), setting max_connections to 91 from 90 does not have the
> effect of making the BufferDescriptors array aligned; it has the
> effect of making it *misaligned*.

I spoke too soon; the effect is indeed reversed on master (i.e. "bad" max_connections settings are misaligned, and vice-versa).

--
Peter Geoghegan
On 2014-02-05 12:36:42 -0500, Robert Haas wrote:
> >> It may well be that your proposal is spot on. But I'd like to see some
> >> data-structure-by-data-structure measurements, rather than assuming that
> >> alignment must be a good thing.
> >
> > I am fine with just aligning BufferDescriptors properly. That has
> > clearly shown massive improvements.
>
> I thought your previous idea of increasing BUFFERALIGN to 64 bytes had
> a lot to recommend it.

Good.

I wonder if we shouldn't move that bit of logic:

	if (size >= BLCKSZ)
		newStart = BUFFERALIGN(newStart);

out of ShmemAlloc() and instead have a ShmemAllocAligned() and ShmemInitStructAligned() that do it. So we can sensibly control it per struct.

> But that doesn't mean it doesn't need testing.

I feel the need here to say that I never said it doesn't need testing, and never thought it didn't...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 6, 2014 at 09:40:32AM +0100, Andres Freund wrote:
> On 2014-02-05 12:36:42 -0500, Robert Haas wrote:
> > >> It may well be that your proposal is spot on. But I'd like to see some
> > >> data-structure-by-data-structure measurements, rather than assuming that
> > >> alignment must be a good thing.
> > >
> > > I am fine with just aligning BufferDescriptors properly. That has
> > > clearly shown massive improvements.
> >
> > I thought your previous idea of increasing BUFFERALIGN to 64 bytes had
> > a lot to recommend it.
>
> Good.
>
> I wonder if we shouldn't move that bit of logic:
> 	if (size >= BLCKSZ)
> 		newStart = BUFFERALIGN(newStart);
> out of ShmemAlloc() and instead have a ShmemAllocAligned() and
> ShmemInitStructAligned() that does it. So we can sensibly control it
> per struct.
>
> > But that doesn't mean it doesn't need testing.
>
> I feel the need here, to say that I never said it doesn't need testing
> and never thought it didn't...

Where are we on this?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On 2014-04-16 19:18:02 -0400, Bruce Momjian wrote:
> On Thu, Feb 6, 2014 at 09:40:32AM +0100, Andres Freund wrote:
> > On 2014-02-05 12:36:42 -0500, Robert Haas wrote:
> > > >> It may well be that your proposal is spot on. But I'd like to see some
> > > >> data-structure-by-data-structure measurements, rather than assuming that
> > > >> alignment must be a good thing.
> > > >
> > > > I am fine with just aligning BufferDescriptors properly. That has
> > > > clearly shown massive improvements.
> > >
> > > I thought your previous idea of increasing BUFFERALIGN to 64 bytes had
> > > a lot to recommend it.
> >
> > Good.
> >
> > I wonder if we shouldn't move that bit of logic:
> > 	if (size >= BLCKSZ)
> > 		newStart = BUFFERALIGN(newStart);
> > out of ShmemAlloc() and instead have a ShmemAllocAligned() and
> > ShmemInitStructAligned() that does it. So we can sensibly control it
> > per struct.
> >
> > > But that doesn't mean it doesn't need testing.
> >
> > I feel the need here, to say that I never said it doesn't need testing
> > and never thought it didn't...
>
> Where are we on this?

It needs somebody with time to evaluate possible performance regressions - I personally won't have time to look into this in detail before pgcon.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 17, 2014 at 11:23:24AM +0200, Andres Freund wrote:
> On 2014-04-16 19:18:02 -0400, Bruce Momjian wrote:
> > On Thu, Feb 6, 2014 at 09:40:32AM +0100, Andres Freund wrote:
> > > On 2014-02-05 12:36:42 -0500, Robert Haas wrote:
> > > > >> It may well be that your proposal is spot on. But I'd like to see some
> > > > >> data-structure-by-data-structure measurements, rather than assuming that
> > > > >> alignment must be a good thing.
> > > > >
> > > > > I am fine with just aligning BufferDescriptors properly. That has
> > > > > clearly shown massive improvements.
> > > >
> > > > I thought your previous idea of increasing BUFFERALIGN to 64 bytes had
> > > > a lot to recommend it.
> > >
> > > Good.
> > >
> > > I wonder if we shouldn't move that bit of logic:
> > > 	if (size >= BLCKSZ)
> > > 		newStart = BUFFERALIGN(newStart);
> > > out of ShmemAlloc() and instead have a ShmemAllocAligned() and
> > > ShmemInitStructAligned() that does it. So we can sensibly control it
> > > per struct.
> > >
> > > > But that doesn't mean it doesn't need testing.
> > >
> > > I feel the need here, to say that I never said it doesn't need testing
> > > and never thought it didn't...
> >
> > Where are we on this?
>
> It needs somebody with time to evaluate possible performance regressions
> - I personally won't have time to look into this in detail before pgcon.

Again, has anyone made any headway on this? Is it a TODO item?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On Thu, Apr 17, 2014 at 11:23:24AM +0200, Andres Freund wrote:
> On 2014-04-16 19:18:02 -0400, Bruce Momjian wrote:
> > On Thu, Feb 6, 2014 at 09:40:32AM +0100, Andres Freund wrote:
> > > On 2014-02-05 12:36:42 -0500, Robert Haas wrote:
> > > > >> It may well be that your proposal is spot on. But I'd like to see some
> > > > >> data-structure-by-data-structure measurements, rather than assuming that
> > > > >> alignment must be a good thing.
> > > > >
> > > > > I am fine with just aligning BufferDescriptors properly. That has
> > > > > clearly shown massive improvements.
> > > >
> > > > I thought your previous idea of increasing BUFFERALIGN to 64 bytes had
> > > > a lot to recommend it.
> > >
> > > Good.
> > >
> > > I wonder if we shouldn't move that bit of logic:
> > > 	if (size >= BLCKSZ)
> > > 		newStart = BUFFERALIGN(newStart);
> > > out of ShmemAlloc() and instead have a ShmemAllocAligned() and
> > > ShmemInitStructAligned() that does it. So we can sensibly control it
> > > per struct.
> > >
> > > > But that doesn't mean it doesn't need testing.
> > >
> > > I feel the need here, to say that I never said it doesn't need testing
> > > and never thought it didn't...
> >
> > Where are we on this?
>
> It needs somebody with time to evaluate possible performance regressions
> - I personally won't have time to look into this in detail before pgcon.

I am doing performance testing to try to complete this item.
I used the first attached patch to report which structures are 64-byte aligned:

	64-byte shared memory alignment of Control File: 0
	64-byte shared memory alignment of XLOG Ctl: 1
	64-byte shared memory alignment of CLOG Ctl: 0
	64-byte shared memory alignment of CommitTs Ctl: 0
	64-byte shared memory alignment of CommitTs shared: 0
	64-byte shared memory alignment of SUBTRANS Ctl: 1
	64-byte shared memory alignment of MultiXactOffset Ctl: 1
	64-byte shared memory alignment of MultiXactMember Ctl: 1
	64-byte shared memory alignment of Shared MultiXact State: 1
	64-byte shared memory alignment of Buffer Descriptors: 1
	64-byte shared memory alignment of Buffer Blocks: 1
	64-byte shared memory alignment of Shared Buffer Lookup Table: 1
	64-byte shared memory alignment of Buffer Strategy Status: 1
	64-byte shared memory alignment of LOCK hash: 0
	64-byte shared memory alignment of PROCLOCK hash: 0
	64-byte shared memory alignment of Fast Path Strong Relation Lock Data: 0
	64-byte shared memory alignment of PREDICATELOCKTARGET hash: 0
	64-byte shared memory alignment of PREDICATELOCK hash: 0
	64-byte shared memory alignment of PredXactList: 0
	64-byte shared memory alignment of SERIALIZABLEXID hash: 1
	64-byte shared memory alignment of RWConflictPool: 1
	64-byte shared memory alignment of FinishedSerializableTransactions: 0
	64-byte shared memory alignment of OldSerXid SLRU Ctl: 1
	64-byte shared memory alignment of OldSerXidControlData: 1
	64-byte shared memory alignment of Proc Header: 0
	64-byte shared memory alignment of Proc Array: 0
	64-byte shared memory alignment of Backend Status Array: 0
	64-byte shared memory alignment of Backend Application Name Buffer: 0
	64-byte shared memory alignment of Backend Client Host Name Buffer: 0
	64-byte shared memory alignment of Backend Activity Buffer: 0
	64-byte shared memory alignment of Prepared Transaction Table: 0
	64-byte shared memory alignment of Background Worker Data: 0
	64-byte shared memory alignment of shmInvalBuffer: 1
	64-byte shared memory alignment of PMSignalState: 0
	64-byte shared memory alignment of ProcSignalSlots: 0
	64-byte shared memory alignment of Checkpointer Data: 0
	64-byte shared memory alignment of AutoVacuum Data: 0
	64-byte shared memory alignment of Wal Sender Ctl: 0
	64-byte shared memory alignment of Wal Receiver Ctl: 0
	64-byte shared memory alignment of BTree Vacuum State: 0
	64-byte shared memory alignment of Sync Scan Locations List: 0
	64-byte shared memory alignment of Async Queue Control: 0
	64-byte shared memory alignment of Async Ctl: 0

Many of these are 64-byte aligned, including Buffer Descriptors. I tested pgbench with these commands:

	$ pgbench -i -s 95 pgbench
	$ pgbench -S -c 95 -j 95 -t 100000 pgbench

on a 16-core Xeon server and got 84k tps. I then applied another patch, attached, which causes all the structures to be non-64-byte aligned, but got the same tps number.

Can someone test these patches on an AMD CPU and see if you see a difference? Thanks.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On 2014-12-23 22:51:22 -0500, Bruce Momjian wrote: > Many of these are 64-byte aligned, including Buffer Descriptors. In that case you need to change max_connections, some settings will lead to unaligned BufferDescriptors. > I > tested pgbench with these commands: > > $ pgbench -i -s 95 pgbench > $ pgbench -S -c 95 -j 95 -t 100000 pgbench > > on a 16-core Xeon server and got 84k tps. I then applied another patch, > attached, which causes all the structures to be non-64-byte aligned, but > got the same tps number. 'Xeon' itself doesn't say much. It's been applied to wildly different CPUs over the years. I guess that was a single socket server? You're much more likely to see significant problems on multi-node NUMA servers where the penalties for cache misses/false sharing are a magnitude or three higher. > Can someone test these patches on an AMD CPU and see if you see a > difference? Thanks. I don't think you'll see a bigger difference there. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 24, 2014 at 10:30:19AM +0100, Andres Freund wrote: > On 2014-12-23 22:51:22 -0500, Bruce Momjian wrote: > > Many of these are 64-byte aligned, including Buffer Descriptors. > > In that case you need to change max_connections, some settings will lead > to unaligned BufferDescriptors. Well, isn't my second patch that misaligns the buffers sufficient for testing? > > I tested pgbench with these commands: > > > > $ pgbench -i -s 95 pgbench > > $ pgbench -S -c 95 -j 95 -t 100000 pgbench > > > > on a 16-core Xeon server and got 84k tps. I then applied another patch, > > attached, which causes all the structures to be non-64-byte aligned, but > > got the same tps number. > > 'Xeon' itself doesn't say much. It's been applied to widly different > CPUs over the years. I guess that was a single socket server? You're > much more likely to see significant problems on a multi node NUMA > servers where the penalties for cache misses/false sharing are a > magnitude or three higher. Sorry, the server has 2 x Intel Xeon E5620 2.4GHz Quad-Core Processors; the full details are here: http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012 > > Can someone test these patches on an AMD CPU and see if you see a > > difference? Thanks. > > I don't think you'll see a bigger difference there. Uh, I thought AMD showed a huge difference for misalignment: http://www.postgresql.org/message-id/20140202151319.GD32123@awork2.anarazel.de and that email is from you. I ended up running pgbench using 16-scale and got 90k tps: pgbench -S -c 16 -j 16 -t 100000 pgbench but again could not see any difference between aligned and misaligned. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2014-12-24 10:00:05 -0500, Bruce Momjian wrote:
> On Wed, Dec 24, 2014 at 10:30:19AM +0100, Andres Freund wrote:
> > On 2014-12-23 22:51:22 -0500, Bruce Momjian wrote:
> > > Many of these are 64-byte aligned, including Buffer Descriptors.
> >
> > In that case you need to change max_connections, some settings will lead
> > to unaligned BufferDescriptors.
>
> Well, isn't my second patch that misaligns the buffers sufficient for
> testing?

I hadn't looked at it. Note that you're quite likely to overlap the allocated region into the next one with the trivial method you're using.

I just verified that I can still reproduce the problem:

# aligned case (max_connections=401)
afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 -j 96 -T 100 -S
progress: 1.0 s, 405170.2 tps, lat 0.195 ms stddev 0.928
progress: 2.0 s, 467011.1 tps, lat 0.204 ms stddev 0.140
progress: 3.0 s, 462832.1 tps, lat 0.205 ms stddev 0.154
progress: 4.0 s, 471035.5 tps, lat 0.202 ms stddev 0.154
progress: 5.0 s, 500329.0 tps, lat 0.190 ms stddev 0.132

BufferDescriptors is at 0x7f63610a6960 (which is 32byte aligned)

# unaligned case (max_connections=400)
afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 -j 96 -T 100 -S
progress: 1.0 s, 202271.1 tps, lat 0.448 ms stddev 1.232
progress: 2.0 s, 223823.4 tps, lat 0.427 ms stddev 3.007
progress: 3.0 s, 227584.5 tps, lat 0.414 ms stddev 4.760
progress: 4.0 s, 221095.6 tps, lat 0.410 ms stddev 4.390
progress: 5.0 s, 217430.6 tps, lat 0.454 ms stddev 7.913
progress: 6.0 s, 210275.9 tps, lat 0.411 ms stddev 0.606

BufferDescriptors is at 0x7f1718aeb980 (which is 64byte aligned)

This is on a quad E5-4620 with 256GB RAM on a scale 100 pgbench instance.

> > > Can someone test these patches on an AMD CPU and see if you see a
> > > difference? Thanks.
> >
> > I don't think you'll see a bigger difference there.
> > Uh, I thought AMD showed a huge difference for misalignment: > > http://www.postgresql.org/message-id/20140202151319.GD32123@awork2.anarazel.de Ugh, yes. Forgot that... There was another patch that wasn't showing big differences on AMD, and I mixed it up. > I ended up running pgbench using 16-scale and got 90k tps: > > pgbench -S -c 16 -j 16 -t 100000 pgbench > > but again could not see any difference between aligned and misaligned. At the very least you should use -M prepared, otherwise you'll be bottlenecked by the parser. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 24, 2014 at 11:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I just verified that I can still reproduce the problem: > > # aligned case (max_connections=401) > afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 -j 96 -T 100 -S > progress: 1.0 s, 405170.2 tps, lat 0.195 ms stddev 0.928 > progress: 2.0 s, 467011.1 tps, lat 0.204 ms stddev 0.140 > progress: 3.0 s, 462832.1 tps, lat 0.205 ms stddev 0.154 > progress: 4.0 s, 471035.5 tps, lat 0.202 ms stddev 0.154 > progress: 5.0 s, 500329.0 tps, lat 0.190 ms stddev 0.132 > > BufferDescriptors is at 0x7f63610a6960 (which is 32byte aligned) > > # unaligned case (max_connections=400) > afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 -j 96 -T 100 -S > progress: 1.0 s, 202271.1 tps, lat 0.448 ms stddev 1.232 > progress: 2.0 s, 223823.4 tps, lat 0.427 ms stddev 3.007 > progress: 3.0 s, 227584.5 tps, lat 0.414 ms stddev 4.760 > progress: 4.0 s, 221095.6 tps, lat 0.410 ms stddev 4.390 > progress: 5.0 s, 217430.6 tps, lat 0.454 ms stddev 7.913 > progress: 6.0 s, 210275.9 tps, lat 0.411 ms stddev 0.606 > BufferDescriptors is at 0x7f1718aeb980 (which is 64byte aligned) So, should we increase ALIGNOF_BUFFER from 32 to 64? Seems like that's what these results are telling us. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Dec 27, 2014 at 08:05:42PM -0500, Robert Haas wrote: > On Wed, Dec 24, 2014 at 11:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I just verified that I can still reproduce the problem: > > > > # aligned case (max_connections=401) > > afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 -j 96 -T 100 -S > > progress: 1.0 s, 405170.2 tps, lat 0.195 ms stddev 0.928 > > progress: 2.0 s, 467011.1 tps, lat 0.204 ms stddev 0.140 > > progress: 3.0 s, 462832.1 tps, lat 0.205 ms stddev 0.154 > > progress: 4.0 s, 471035.5 tps, lat 0.202 ms stddev 0.154 > > progress: 5.0 s, 500329.0 tps, lat 0.190 ms stddev 0.132 > > > > BufferDescriptors is at 0x7f63610a6960 (which is 32byte aligned) > > > > # unaligned case (max_connections=400) > > afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 -j 96 -T 100 -S > > progress: 1.0 s, 202271.1 tps, lat 0.448 ms stddev 1.232 > > progress: 2.0 s, 223823.4 tps, lat 0.427 ms stddev 3.007 > > progress: 3.0 s, 227584.5 tps, lat 0.414 ms stddev 4.760 > > progress: 4.0 s, 221095.6 tps, lat 0.410 ms stddev 4.390 > > progress: 5.0 s, 217430.6 tps, lat 0.454 ms stddev 7.913 > > progress: 6.0 s, 210275.9 tps, lat 0.411 ms stddev 0.606 > > BufferDescriptors is at 0x7f1718aeb980 (which is 64byte aligned) > > So, should we increase ALIGNOF_BUFFER from 32 to 64? Seems like > that's what these results are telling us. I am glad someone else considers this important. Andres reported the above 2x pgbench difference in February, but no action was taken as everyone felt there needed to be more performance testing, but it never happened: http://www.postgresql.org/message-id/20140202151319.GD32123@awork2.anarazel.de I have now performance tested this by developing the attached two patches which both increase the Buffer Descriptors allocation by 64 bytes. 
The first patch causes each 64-byte Buffer Descriptor struct to align on a 32-byte boundary but not a 64-byte boundary, while the second patch aligns it with a 64-byte boundary. I tried many tests, including this:

	$ pgbench --initialize --scale 1 pgbench
	$ pgbench --protocol prepared --client 16 --jobs 16 --transactions 100000 --select-only pgbench

I cannot measure any difference on my dual-CPU-socket, 16-vcore server (http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012). I thought this test would cause the most Buffer Descriptor contention between the two CPUs. Can anyone else see a difference when testing these two patches? (The patch reports alignment in the server logs.)

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 2014-12-29 16:59:05 -0500, Bruce Momjian wrote: > I am glad someone else considers this important. I do consider it important. I just considered the lwlock scalability more important... > Andres reported the above 2x pgbench difference in February, but no > action was taken as everyone felt there needed to be more performance > testing, but it never happened: FWIW, I have no idea what exactly should be tested there. Right now I think what we should do is to remove the BUFFERALIGN from shmem.c and instead add explicit alignment code in a couple callers (BufferDescriptors/Blocks, proc.c stuff). > $ pgbench --initialize --scale 1 pgbench Scale 1 isn't particularly helpful in benchmarks, not even read-only ones. > $ pgbench --protocol prepared --client 16 --jobs 16 --transactions 100000 --select-only pgbench I'd suspect you're more likely to see differences with a higher client count. Also, I seriously doubt 100k xacts is enough to get stable results - even on my laptop I get 100k+ TPS. I'd suggest using something like -P 1 -T 100 or so. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 29, 2014 at 6:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Andres reported the above 2x pgbench difference in February, but no >> action was taken as everyone felt there needed to be more performance >> testing, but it never happened: > > FWIW, I have no idea what exactly should be tested there. Right now I > think what we should do is to remove the BUFFERALIGN from shmem.c and > instead add explicit alignment code in a couple callers > (BufferDescriptors/Blocks, proc.c stuff). That seems like a strange approach. I think it's pretty sensible to try to ensure that allocated blocks of shared memory have decent alignment, and we don't have enough of them for aligning on 64-byte boundaries (or even 128-byte boundaries, perhaps) to eat up any meaningful amount of memory. The BUFFERALIGN() stuff, like much else about the way we manage shared memory, has also made its way into the dynamic-shared-memory code. So if we do adjust the alignment that we guarantee for the main shared memory segment, we should perhaps adjust DSM to match. But I guess I don't understand why you'd want to do it that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-01-01 11:55:03 -0500, Robert Haas wrote: > On Mon, Dec 29, 2014 at 6:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> Andres reported the above 2x pgbench difference in February, but no > >> action was taken as everyone felt there needed to be more performance > >> testing, but it never happened: > > > > FWIW, I have no idea what exactly should be tested there. Right now I > > think what we should do is to remove the BUFFERALIGN from shmem.c and > > instead add explicit alignment code in a couple callers > > (BufferDescriptors/Blocks, proc.c stuff). > > That seems like a strange approach. I think it's pretty sensible to > try to ensure that allocated blocks of shared memory have decent > alignment, and we don't have enough of them for aligning on 64-byte > boundaries (or even 128-byte boundaries, perhaps) to eat up any > meaningful amount of memory. The BUFFERALIGN() stuff, like much else > about the way we manage shared memory, has also made its way into the > dynamic-shared-memory code. So if we do adjust the alignment that we > guarantee for the main shared memory segment, we should perhaps adjust > DSM to match. But I guess I don't understand why you'd want to do it > that way. The problem is that just aligning the main allocation to some boundary doesn't mean the hot part of the allocation is properly aligned. shmem.c in fact can't really do much about that - so fully moving the responsibility seems more likely to ensure that future code thinks about alignment. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 1, 2015 at 05:59:25PM +0100, Andres Freund wrote: > > That seems like a strange approach. I think it's pretty sensible to > > try to ensure that allocated blocks of shared memory have decent > > alignment, and we don't have enough of them for aligning on 64-byte > > boundaries (or even 128-byte boundaries, perhaps) to eat up any > > meaningful amount of memory. The BUFFERALIGN() stuff, like much else > > about the way we manage shared memory, has also made its way into the > > dynamic-shared-memory code. So if we do adjust the alignment that we > > guarantee for the main shared memory segment, we should perhaps adjust > > DSM to match. But I guess I don't understand why you'd want to do it > > that way. > > The problem is that just aligning the main allocation to some boundary > doesn't mean the hot part of the allocation is properly aligned. shmem.c > in fact can't really do much about that - so fully moving the > responsibility seems more likely to ensure that future code thinks about > alignment. Yes, there is shared memory allocation alignment and object alignment. Since there are only about 50 cases of these, a worst-case change to force 64-byte alignment would only cost 3.2k of shared memory. It might make sense to make them all 64-byte aligned to reduce CPU cache contention, but we have to have actual performance numbers to prove that. My two patches allow individual object alignment to be tested. I have not been able to see any performance difference (<1%) with:

	$ pgbench --initialize --scale 100 pgbench
	$ pgbench --protocol prepared --client 32 --jobs 16 --time=100 --select-only pgbench

on my dual-socket 16 vcore server.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, Jan 1, 2015 at 11:59 AM, Andres Freund <andres@2ndquadrant.com> wrote: > The problem is that just aligning the main allocation to some boundary > doesn't mean the hot part of the allocation is properly aligned. shmem.c > in fact can't really do much about that - so fully moving the > responsibility seems more likely to ensure that future code thinks about > alignment. That's true, but if you don't align the beginnings of the allocations, then it's a lot more complicated for the code to properly align stuff within the allocation. It's got to insert a variable amount of padding based on the alignment it happens to get. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On January 1, 2015 8:49:06 PM CET, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 1, 2015 at 11:59 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The problem is that just aligning the main allocation to some boundary
> > doesn't mean the hot part of the allocation is properly aligned. shmem.c
> > in fact can't really do much about that - so fully moving the
> > responsibility seems more likely to ensure that future code thinks about
> > alignment.
>
> That's true, but if you don't align the beginnings of the allocations,
> then it's a lot more complicated for the code to properly align stuff
> within the allocation. It's got to insert a variable amount of padding
> based on the alignment it happens to get.

Hm? Allocate +PG_CACHELINE_SIZE and do var = CACHELINEALIGN(var).

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 1, 2015 at 09:04:48PM +0100, Andres Freund wrote:
> On January 1, 2015 8:49:06 PM CET, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Jan 1, 2015 at 11:59 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > The problem is that just aligning the main allocation to some boundary
> > > doesn't mean the hot part of the allocation is properly aligned. shmem.c
> > > in fact can't really do much about that - so fully moving the
> > > responsibility seems more likely to ensure that future code thinks about
> > > alignment.
> >
> > That's true, but if you don't align the beginnings of the allocations,
> > then it's a lot more complicated for the code to properly align stuff
> > within the allocation. It's got to insert a variable amount of padding
> > based on the alignment it happens to get.
>
> Hm? Allocate +PG_CACHELINE_SIZE and do var = CACHELINEALIGN(var).

Yeah, I am afraid we have to do that anyway --- you can see it in my patch too. I guess if you had the shmem allocation aligned on 64-byte boundaries and required all allocations to be a multiple of 64, that would work too.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 2014-12-29 16:59:05 -0500, Bruce Momjian wrote:
> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> new file mode 100644
> index ff6c713..c4dce5b
> *** a/src/backend/storage/buffer/buf_init.c
> --- b/src/backend/storage/buffer/buf_init.c
> *************** InitBufferPool(void)
> *** 67,72 ****
> --- 67,73 ----
>   	bool		foundBufs,
>   				foundDescs;
>
> + 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
>   	BufferDescriptors = (BufferDesc *)
>   		ShmemInitStruct("Buffer Descriptors",
>   						NBuffers * sizeof(BufferDesc), &foundDescs);
> diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
> new file mode 100644
> index 2ea2216..669c07f
> *** a/src/backend/storage/ipc/shmem.c
> --- b/src/backend/storage/ipc/shmem.c
> *************** ShmemInitStruct(const char *name, Size s
> *** 327,332 ****
> --- 327,335 ----
>   	ShmemIndexEnt *result;
>   	void	   *structPtr;
>
> + 	if (strcmp(name, "Buffer Descriptors") == 0)
> + 		size = BUFFERALIGN(size) + 64;
> +
>   	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
>
>   	if (!ShmemIndex)
> *************** ShmemInitStruct(const char *name, Size s
> *** 413,418 ****
> --- 416,432 ----
>   					 " \"%s\" (%zu bytes requested)",
>   					 name, size)));
>   	}
> + 	if (strcmp(name, "Buffer Descriptors") == 0)
> + 	{
> + 		/* align on 32 */
> + 		if ((int64) structPtr % 32 != 0)
> + 			structPtr = (void *) ((int64) structPtr + 32 - (int64) structPtr % 32);
> + 		/* align on 32 but not 64 */
> + 		if ((int64) structPtr % 64 == 0)
> + 			structPtr = (void *) ((int64) structPtr + 32);
> + 	}
> + 	fprintf(stderr, "shared memory alignment of %s: %ld-byte\n", name,
> + 			(int64) structPtr % 64 == 0 ? 64 : (int64) structPtr % 64);
>   	result->size = size;
>   	result->location = structPtr;
>   	}
> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> new file mode 100644
> index ff6c713..c4dce5b
> *** a/src/backend/storage/buffer/buf_init.c
> --- b/src/backend/storage/buffer/buf_init.c
> *************** InitBufferPool(void)
> *** 67,72 ****
> --- 67,73 ----
>   	bool		foundBufs,
>   				foundDescs;
>
> + 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
>   	BufferDescriptors = (BufferDesc *)
>   		ShmemInitStruct("Buffer Descriptors",
>   						NBuffers * sizeof(BufferDesc), &foundDescs);
> diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
> new file mode 100644
> index 2ea2216..50f836e
> *** a/src/backend/storage/ipc/shmem.c
> --- b/src/backend/storage/ipc/shmem.c
> *************** ShmemInitStruct(const char *name, Size s
> *** 327,332 ****
> --- 327,335 ----
>   	ShmemIndexEnt *result;
>   	void	   *structPtr;
>
> + 	if (strcmp(name, "Buffer Descriptors") == 0)
> + 		size = BUFFERALIGN(size) + 64;
> +
>   	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
>
>   	if (!ShmemIndex)
> *************** ShmemInitStruct(const char *name, Size s
> *** 413,418 ****
> --- 416,429 ----
>   					 " \"%s\" (%zu bytes requested)",
>   					 name, size)));
>   	}
> + 	if (strcmp(name, "Buffer Descriptors") == 0)
> + 	{
> + 		/* align on 64 */
> + 		if ((int64) structPtr % 64 != 0)
> + 			structPtr = (void *) ((int64) structPtr + 64 - (int64) structPtr % 64);
> + 	}
> + 	fprintf(stderr, "shared memory alignment of %s: %ld-byte\n", name,
> + 			(int64) structPtr % 64 == 0 ? 64 : (int64) structPtr % 64);
>   	result->size = size;
>   	result->location = structPtr;
>   }

I can't run tests right now...

What exactly do you want to see with these tests? that's essentially what I've already benchmarked + some fprintfs?

Greetings,

Andres Freund

-- 
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 1, 2015 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: >>That's true, but if you don't align the beginnings of the allocations, >>then it's a lot more complicated for the code to properly align stuff >>within the allocation. It's got to insert a variable amount of >>padding based on the alignment it happens to get. > > Hm? Allocate +PG_CACHELINE_SIZE and do var = CACHELINEALIGN(var). Meh. I guess that will work, but I see little advantage in it. Cache-line aligning the allocations is simple and has long precedent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 2, 2015 at 06:25:52PM +0100, Andres Freund wrote: > I can't run tests right now... > > What exactly do you want to see with these tests? that's essentially > what I've already benchmarked + some fprintfs? I want to test two patches that both allocate an extra 64 bytes but shift the alignment of just the buffer descriptor memory allocation. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2015-01-05 21:43:04 -0500, Bruce Momjian wrote:
> On Fri, Jan 2, 2015 at 06:25:52PM +0100, Andres Freund wrote:
> > I can't run tests right now...
> >
> > What exactly do you want to see with these tests? that's essentially
> > what I've already benchmarked + some fprintfs?
>
> I want to test two patches that both allocate an extra 64 bytes but
> shift the alignment of just the buffer descriptor memory allocation.

pgbench scale 300, with postgres configured:

-c shared_buffers=48GB
-c log_line_prefix="[%m %p] "
-c log_min_messages=debug1
-p 5440
-c checkpoint_segments=600
-c fsync=off
-c synchronous_commit=off
-c maintenance_work_mem=4GB
-c huge_pages=off
-c log_autovacuum_min_duration='10s'

test:
pgbench -n -M prepared -c 96 -j 96 -T 30 -S

master as of 4b2a254793be50e31d43d4bfd813da8d141494b8:
-c max_connections=400
tps = 193748.454044 (high run vs. run variability)
-c max_connections=401
tps = 484238.551349

master + 32align.patch:
-c max_connections=400
tps = 183791.872359 (high run vs. run variability)
-c max_connections=401
tps = 185494.98244 (high run vs. run variability)

master + 64align.patch:
-c max_connections=400
tps = 489257.195570
-c max_connections=401
tps = 490496.520632

Pretty much as expected, right?

Greetings,

Andres Freund

-- 
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 27, 2015 at 01:43:41AM +0100, Andres Freund wrote: > master + 32align.patch: > -c max_connections=400 > tps = 183791.872359 (high run vs. run variability) > -c max_connections=401 > tps = 185494.98244 (high run vs. run variability) > > master + 64align.patch: > -c max_connections=400 > tps = 489257.195570 > -c max_connections=401 > tps = 490496.520632 > > Pretty much as expected, rigth? Yes, I am convinced. Let's work on a patch now. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2015-01-26 19:58:25 -0500, Bruce Momjian wrote: > On Tue, Jan 27, 2015 at 01:43:41AM +0100, Andres Freund wrote: > > master + 32align.patch: > > -c max_connections=400 > > tps = 183791.872359 (high run vs. run variability) > > -c max_connections=401 > > tps = 185494.98244 (high run vs. run variability) > > > > master + 64align.patch: > > -c max_connections=400 > > tps = 489257.195570 > > -c max_connections=401 > > tps = 490496.520632 > > > > Pretty much as expected, right? > > Yes, I am convinced. Let's work on a patch now. Since I can reproduce some minor (1-3%) performance *regressions* at low client counts when aligning every shmem allocation, I'm inclined to just add special case code to BufferShmemSize()/InitBufferPool() to align descriptors to PG_CACHE_LINE_SIZE. That's really unlikely to regress anything as it basically can't be a bad idea to align buffer descriptors. Contrary opinions? Robert? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 26, 2015 at 8:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2015-01-26 19:58:25 -0500, Bruce Momjian wrote: >> On Tue, Jan 27, 2015 at 01:43:41AM +0100, Andres Freund wrote: >> > master + 32align.patch: >> > -c max_connections=400 >> > tps = 183791.872359 (high run vs. run variability) >> > -c max_connections=401 >> > tps = 185494.98244 (high run vs. run variability) >> > >> > master + 64align.patch: >> > -c max_connections=400 >> > tps = 489257.195570 >> > -c max_connections=401 >> > tps = 490496.520632 >> > >> > Pretty much as expected, rigth? >> >> Yes, I am convinced. Let's work on a patch now. > > Since I can reproduce some minor (1-3%) performance *regressions* at low > client counts when aligning every shmem allocation, I'm inclined to just > add special case code to BufferShmemSize()/InitBufferPool() to align > descriptors to PG_CACHE_LINE_SIZE. That's really unlikely to regress > anythign as it basically can't be a bad idea to align buffer > descriptors. > > Contrary opinions? Robert? I'm totally OK with further aligning just that one allocation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jan 26, 2015 at 9:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Contrary opinions? Robert? > > I'm totally OK with further aligning just that one allocation. Of course, now that I think about it, aligning it probably works mostly because the size is almost exactly one cache line. If it were any bigger or smaller, aligning it more wouldn't help. So maybe we should also do something like what LWLocks do, and make a union between the actual structure and an appropriate array of padding bytes - say either 64 or 128 of them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-01-26 21:13:31 -0500, Robert Haas wrote: > On Mon, Jan 26, 2015 at 9:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> Contrary opinions? Robert? > > > > I'm totally OK with further aligning just that one allocation. > > Of course, now that I think about it, aligning it probably works > mostly because the size is almost exactly one cache line. Not just almost - it's precisely 64 bytes on x86-64 linux. And I think it'll also be that on other 64bit platforms. The only architecture-dependent thing in there is the spinlock and that already can be between 1 and 4 bytes without changing the size of the struct. > If it were any bigger or smaller, aligning it more wouldn't help. Yea. > So maybe we should also do something like what LWLocks do, and make a > union between the actual structure and an appropriate array of padding > bytes - say either 64 or 128 of them. Hm. That's a bit bigger patch. I'm inclined to just let it slide for the moment. I still have plans to downsize some of sbufdesc's content (move the io lock out) and move the content lwlock inline. Then we're not going to have much choice but do this... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2015-01-26 21:13:31 -0500, Robert Haas wrote: >> So maybe we should also do something like what LWLocks do, and make a >> union between the actual structure and an appropriate array of padding >> bytes - say either 64 or 128 of them. > Hm. That's a bit bigger patch. I'm inclined to just let it slide for the > moment. I still have plans to downsize some of sbufdesc's content (move > the io lock out) and move the content lwlock inline. Then we're not > going to have much choice but do this... Even if you didn't have plans like that, it would be entire folly to imagine that buffer headers will be exactly 64 bytes without some forcing function for that. Accordingly, I vote against applying any patch that pretends to improve their alignment unless it also does something to ensure that the size is a power of 2. Any notational changes that are forced by that would be much better done in a patch that does only that than in a patch that also makes functional changes to the header contents. regards, tom lane
On Wed, Jan 28, 2015 at 10:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> On 2015-01-26 21:13:31 -0500, Robert Haas wrote: >>> So maybe we should also do something like what LWLocks do, and make a >>> union between the actual structure and an appropriate array of padding >>> bytes - say either 64 or 128 of them. > >> Hm. That's a bit bigger patch. I'm inclined to just let it slide for the >> moment. I still have plans to downsize some of sbufdesc's content (move >> the io lock out) and move the content lwlock inline. Then we're not >> going to have much choice but do this... > > Even if you didn't have plans like that, it would be entire folly to > imagine that buffer headers will be exactly 64 bytes without some forcing > function for that. Accordingly, I vote against applying any patch that > pretends to improve their alignment unless it also does something to > ensure that the size is a power of 2. Any notational changes that are > forced by that would be much better done in a patch that does only that > than in a patch that also makes functional changes to the header contents. I entirely agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-01-28 10:35:28 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2015-01-26 21:13:31 -0500, Robert Haas wrote:
> > >> So maybe we should also do something like what LWLocks do, and make a
> > >> union between the actual structure and an appropriate array of padding
> > >> bytes - say either 64 or 128 of them.
>
> > Hm. That's a bit bigger patch. I'm inclined to just let it slide for the
> > moment. I still have plans to downsize some of sbufdesc's content (move
> > the io lock out) and move the content lwlock inline. Then we're not
> > going to have much choice but do this...
>
> Even if you didn't have plans like that, it would be sheer folly to
> imagine that buffer headers will be exactly 64 bytes without some forcing
> function for that.

Meh. The 128 bytes additionally used by the alignment don't hurt in any
case. But forcing all buffer descriptors to 64 bytes on a 32bit platform
isn't guaranteed to be performance neutral.

So, no, I don't think it's a "folly" to do so. I'd rather make actual
progress that improves the absurd situation today (a factor of 2-3 by
changing max_connections by one...) than argue whether the impact on
32bit platforms is acceptable before doing so.

If we additionally decide to pad the struct, fine. But I don't see why
this has to be done at the same time.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2015-01-28 10:35:28 -0500, Tom Lane wrote:
>> Even if you didn't have plans like that, it would be sheer folly to
>> imagine that buffer headers will be exactly 64 bytes without some forcing
>> function for that.

> Meh. The 128 bytes additionally used by the alignment don't hurt in any
> case. But forcing all buffer descriptors to 64 bytes on a 32bit platform
> isn't guaranteed to be performance neutral.

> So, no, I don't think it's a "folly" to do so.

Once we have the mechanism in place, it is a policy decision whether to
apply it on 32-bit builds. If you don't think there is evidence to
support aligning headers on 32-bit builds, we don't have to do that.
But I firmly object to applying a patch that claims to align the headers
on 64-bit platforms unless it includes something to ensure that it
*actually* does that, regardless of platform variations or subsequent
additions or subtractions of fields. I think that's a necessary component
of any such patch and should not be allowed to slide. If you don't want
to do that work now, then drop the topic until you do.

			regards, tom lane
On 2015-01-28 10:58:40 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2015-01-28 10:35:28 -0500, Tom Lane wrote:
> >> Even if you didn't have plans like that, it would be sheer folly to
> >> imagine that buffer headers will be exactly 64 bytes without some forcing
> >> function for that.
>
> > Meh. The 128 bytes additionally used by the alignment don't hurt in any
> > case. But forcing all buffer descriptors to 64 bytes on a 32bit platform
> > isn't guaranteed to be performance neutral.
>
> > So, no, I don't think it's a "folly" to do so.
>
> Once we have the mechanism in place, it is a policy decision whether to
> apply it on 32-bit builds. If you don't think there is evidence to
> support aligning headers on 32-bit builds, we don't have to do that.

I just have no idea whether it'd be beneficial to use more space on
32bit to pad the individual entries. Since this mostly is beneficial on
multi-socket, highly concurrent workloads, I doubt it really matters.

> But I firmly object to applying a patch that claims to align the headers
> on 64-bit platforms unless it includes something to ensure that it
> *actually* does that, regardless of platform variations or subsequent
> additions or subtractions of fields.

Well, the patch claims to align the buffer descriptor *array*, not the
individual descriptors... I personally still think that a comment above
sbufdesc's definition would be sufficient for now. But whatever. I'll
enforce 64byte padding on 64bit platforms, and do nothing on 32bit
platforms.

> I think that's a necessary component
> of any such patch and should not be allowed to slide. If you don't want
> to do that work now, then drop the topic until you do.

Man. This style of comment really is just sad.

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-01-28 17:08:46 +0100, Andres Freund wrote:
> I just have no idea whether it'd be beneficial to use more space on
> 32bit to pad the individual entries. Since this mostly is beneficial on
> multi-socket, highly concurrent workloads, I doubt it really matters.

> I personally still think that a comment above sbufdesc's definition
> would be sufficient for now. But whatever. I'll enforce 64byte padding
> on 64bit platforms, and do nothing on 32bit platforms.

Patch doing that attached. I've for now dealt with the 32bit issue by:

#define BUFFERDESC_PADDED_SIZE (sizeof(void*) == 8 ? 64 : sizeof(BufferDesc))

and

 * XXX: As this primarily matters in highly concurrent workloads which
 * probably all are 64bit these days, and the space wastage would be a bit
 * more noticeable on 32bit systems, we don't force the stride to be cache
 * line sized. If somebody does actual performance testing, we can reevaluate.

I've replaced direct accesses to the (Local)BufferDescriptors array by
wrapper macros to hide the union indirection and allow potential future
changes to be made more easily.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
>> I personally still think that a comment above sbufdesc's definition
>> would be sufficient for now. But whatever. I'll enforce 64byte padding
>> on 64bit platforms, and do nothing on 32bit platforms.

> Patch doing that attached.

Surely the sizeof() in BufferShmemSize needs to be
sizeof(BufferDescPadded) now?

Also, the macro is unnecessarily unsafe for use in #if tests; why not
something like

#define BUFFERDESC_PADDED_SIZE (SIZEOF_VOID_P == 8 ? 64 : 32)

Otherwise it looks reasonably sane to me, just in a quick once-over;
I didn't test.

			regards, tom lane
On 2015-01-28 13:38:51 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> >> I personally still think that a comment above sbufdesc's definition
> >> would be sufficient for now. But whatever. I'll enforce 64byte padding
> >> on 64bit platforms, and do nothing on 32bit platforms.
>
> > Patch doing that attached.
>
> Surely the sizeof() in BufferShmemSize needs to be
> sizeof(BufferDescPadded) now?

Good catch. It'd surely fail nastily later, once somebody actually made
the size change away from 64 bytes.

> Also, the macro is unnecessarily unsafe for use in #if tests; why not
> something like

Out of curiosity: What's the danger you're seeing? I'm perfectly fine
with using SIZEOF_VOID_P, I actually looked for a macro like it and
didn't find anything.

> #define BUFFERDESC_PADDED_SIZE (SIZEOF_VOID_P == 8 ? 64 : 32)

Hm, did you intentionally put a 32 in there, or was that just the
logical continuation of 64? Because there's no way it'll ever fit into
32 bytes in the near future. That's why I had put the sizeof(BufferDesc)
there. We could just make it 1 as well, to indicate that we don't want
any padding...

> Otherwise it looks reasonably sane to me, just in a quick once-over;
> I didn't test.

I tested it on both 32/64 bit linux and it indeed fixes the issue of
changing performance due to max_connections+-1.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2015-01-28 13:38:51 -0500, Tom Lane wrote:
>> #define BUFFERDESC_PADDED_SIZE (SIZEOF_VOID_P == 8 ? 64 : 32)

> Hm, did you intentionally put a 32 in there, or was that just the
> logical continuation of 64? Because there's no way it'll ever fit into
> 32 bytes in the near future. That's why I had put the sizeof(BufferDesc)
> there. We could just make it 1 as well, to indicate that we don't want
> any padding...

Yeah, 1 would be fine too. Maybe better to call it BUFFERDESC_MIN_SIZE,
because as this stands it's enforcing a min size, not an exact size.

(I'm not going to whinge about that aspect of it; the main point here is
to put in the union and fix the ensuing notational fallout. We can worry
about exactly what size to pad to as a separate discussion.)

			regards, tom lane
Pushed a version (hopefully) fixing Tom's complaints.

On 2015-01-28 13:52:30 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2015-01-28 13:38:51 -0500, Tom Lane wrote:
> >> #define BUFFERDESC_PADDED_SIZE (SIZEOF_VOID_P == 8 ? 64 : 32)
>
> > Hm, did you intentionally put a 32 in there, or was that just the
> > logical continuation of 64? Because there's no way it'll ever fit into
> > 32 bytes in the near future. That's why I had put the sizeof(BufferDesc)
> > there. We could just make it 1 as well, to indicate that we don't want
> > any padding...
>
> Yeah, 1 would be fine too. Maybe better to call it BUFFERDESC_MIN_SIZE,
> because as this stands it's enforcing a min size, not an exact size.

It's _PAD_TO_SIZE now.

> (I'm not going to whinge about that aspect of it; the main point here is
> to put in the union and fix the ensuing notational fallout. We can worry
> about exactly what size to pad to as a separate discussion.)

Yea. The details can be changed in a concise way now. I hope ;)

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services