Thread: LWLocks in DSM memory
Hi,

As already noted[1], LWLocks don't currently work in DSM segments, because they use dlist for the list of waiters. Even though all of the waiter nodes are in PGPROC and therefore have stable addresses, the dlist code internally constructs a circular list including pointers to a special sentinel node inside the dlist_head object, and those pointers may be invalid in other backends.

One solution could be to provide a non-circular variant of the dlist interface that uses NULL list termination. I've attached a quick sketch of something like that which seems to work correctly. It is only lightly tested so far and probably buggy, but it shows the general idea.

Any thoughts on this approach, or better ways to solve this problem? How valuable is the branch-free property of the circular push and delete operations?

[1] https://www.postgresql.org/message-id/CA+Tgmobjia49CCJ0ZazbWaVv7nKgYt+1Zo5CwxkH9Aahgn2vPg@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Attachment
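A minimal sketch of the idea being proposed (hypothetical names; this is not the attached patch): the list head holds plain NULL-terminated pointers rather than circular links through a sentinel inside itself, so it contains no addresses of its own fields and remains valid even if the head is mapped at a different address in each backend. Note the branches that the circular, sentinel-based dlist avoids:

    /* Hypothetical sketch, not the attached patch. */
    typedef struct reloc_node
    {
        struct reloc_node *prev;
        struct reloc_node *next;
    } reloc_node;

    typedef struct reloc_head
    {
        reloc_node *head;           /* NULL when empty; no sentinel */
        reloc_node *tail;
    } reloc_head;

    static void
    reloc_push_head(reloc_head *list, reloc_node *node)
    {
        node->prev = NULL;
        node->next = list->head;
        if (list->head != NULL)     /* a branch the circular dlist avoids */
            list->head->prev = node;
        else
            list->tail = node;
        list->head = node;
    }

    static void
    reloc_delete(reloc_head *list, reloc_node *node)
    {
        if (node->prev != NULL)     /* more branches */
            node->prev->next = node->next;
        else
            list->head = node->next;
        if (node->next != NULL)
            node->next->prev = node->prev;
        else
            list->tail = node->prev;
    }

Because the waiter nodes live in PGPROC, which has a stable address in every backend, only the head needs to lose its self-referencing sentinel for the list to work across address spaces.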
On Sun, Jul 24, 2016 at 1:10 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> One solution could be to provide a non-circular variant of the dlist
> interface that uses NULL list termination.  I've attached a quick
> sketch of something like that which seems to work correctly.  It is
> only lightly tested so far and probably buggy, but shows the general
> idea.

On reflection, it wouldn't make much sense to put "noncircular" in the names of interfaces, as that is an internal detail. Maybe "relocatable" or "position independent" or something like that, since that's a user-visible property of the dlist_head. Here is a version of the patch that uses _relocatable.

> Any thoughts on this approach, or better ways to solve this problem?
> How valuable is the branch-free property of the circular push and
> delete operations?

I investigated this a bit. If I do this on my laptop (clang, no asserts, -O2), it takes 3895 milliseconds, or 4.8ns per push/delete operation:

    dlist_init(&head);
    for (i = 0; i < 100000000; ++i)
    {
        dlist_push_head(&head, &nodes[0]);
        dlist_push_tail(&head, &nodes[1]);
        dlist_push_head(&head, &nodes[2]);
        dlist_push_tail(&head, &nodes[3]);
        dlist_delete(&nodes[2]);
        dlist_delete(&nodes[3]);
        dlist_delete(&nodes[0]);
        dlist_delete(&nodes[1]);
    }

The relocatable version takes 5907 milliseconds, or 7.4ns per push/delete operation:

    dlist_init_relocatable(&head);
    for (i = 0; i < 100000000; ++i)
    {
        dlist_push_head_relocatable(&head, &nodes[0]);
        dlist_push_tail_relocatable(&head, &nodes[1]);
        dlist_push_head_relocatable(&head, &nodes[2]);
        dlist_push_tail_relocatable(&head, &nodes[3]);
        dlist_delete_relocatable(&head, &nodes[2]);
        dlist_delete_relocatable(&head, &nodes[3]);
        dlist_delete_relocatable(&head, &nodes[0]);
        dlist_delete_relocatable(&head, &nodes[1]);
    }

Those operations are ~50% slower. So getting rid of dlist's clever branch-free code generally seems like a bad idea.

Next I wondered if it would be a bad idea to use slower relocatable dlist heads for all LWLocks. Obviously LWLocks are used all over the place and it would be bad to slow them down, but I wondered if maybe dlist manipulation might not be a very significant part of their work. So I put an LWLock and a counter in shared memory, and had N background workers run a tight loop that locks, increments and unlocks concurrently until the counter reaches 1 billion. This creates different degrees of contention and wait list sizes. The worker loop looks like this:

    while (!finished)
    {
        LWLockAcquire(&control->lock, LW_EXCLUSIVE);
        ++control->counter;
        if (control->counter >= control->goal)
            finished = true;
        LWLockRelease(&control->lock);
    }

I measured the following times for unpatched master, on my 4 core laptop:

    16 workers = 73.067s, 74.869s, 75.338s
     8 workers = 65.846s, 67.622s, 68.039s
     4 workers = 68.763s, 68.980s, 69.035s  <-- curiously slower here
     3 workers = 59.701s, 59.991s, 60.133s
     2 workers = 53.620s, 55.300s, 55.790s
     1 worker  = 21.578s, 21.535s, 21.598s

With the attached patch I got:

    16 workers = 75.341s, 77.445s, 77.635s  <- +3.4%
     8 workers = 67.462s, 68.622s, 68.851s  <- +1.4%
     4 workers = 64.983s, 65.021s, 65.496s  <- -5.7%
     3 workers = 60.247s, 60.425s, 60.492s  <- +0.7%
     2 workers = 57.544s, 57.626s, 58.133s  <- +2.3%
     1 worker  = 21.403s, 21.486s, 21.661s  <- -0.2%

Somewhat noisy data and different effects at different numbers of workers.

I can post the source for those tests if anyone is interested. If you have any other ideas for access patterns to test, or clever ways to keep push and delete branch-free while also avoiding internal pointers back to dlist_head, I'm all ears. Otherwise, if a change affecting all LWLocks turns out to be unacceptable, maybe we would need a different LWLock interface for relocatable LWLocks to make them suitable for use in DSM segments. Any thoughts?

--
Thomas Munro
http://www.enterprisedb.com
Attachment
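The test source was not posted in this message, but the harness behind the millisecond figures above might plausibly look like this (an assumption, not code from the thread), using the instr_time facilities from src/include/portability/instr_time.h:

    #include "postgres.h"
    #include "lib/ilist.h"
    #include "portability/instr_time.h"

    /* Assumed harness: times 100 million iterations of the loop quoted above. */
    static void
    time_dlist_ops(void)
    {
        dlist_head  head;
        dlist_node  nodes[4];
        instr_time  start, duration;
        int         i;

        dlist_init(&head);
        INSTR_TIME_SET_CURRENT(start);
        for (i = 0; i < 100000000; ++i)
        {
            dlist_push_head(&head, &nodes[0]);
            dlist_push_tail(&head, &nodes[1]);
            dlist_push_head(&head, &nodes[2]);
            dlist_push_tail(&head, &nodes[3]);
            dlist_delete(&nodes[2]);
            dlist_delete(&nodes[3]);
            dlist_delete(&nodes[0]);
            dlist_delete(&nodes[1]);
        }
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);
        elog(LOG, "%.0f ms total, %.1f ns per push/delete",
             INSTR_TIME_GET_MILLISEC(duration),
             INSTR_TIME_GET_DOUBLE(duration) * 1e9 / (100000000.0 * 8.0));
    }

With 8 push/delete operations per iteration, 3895 ms over 800 million operations works out to the quoted ~4.8ns each.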
On Mon, Jul 25, 2016 at 3:02 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I measured the following times for unpatched master, on my 4 core laptop:
>
>     16 workers = 73.067s, 74.869s, 75.338s
>      8 workers = 65.846s, 67.622s, 68.039s
>      4 workers = 68.763s, 68.980s, 69.035s  <-- curiously slower here
>      3 workers = 59.701s, 59.991s, 60.133s
>      2 workers = 53.620s, 55.300s, 55.790s
>      1 worker  = 21.578s, 21.535s, 21.598s
>
> With the attached patch I got:
>
>     16 workers = 75.341s, 77.445s, 77.635s  <- +3.4%
>      8 workers = 67.462s, 68.622s, 68.851s  <- +1.4%
>      4 workers = 64.983s, 65.021s, 65.496s  <- -5.7%
>      3 workers = 60.247s, 60.425s, 60.492s  <- +0.7%
>      2 workers = 57.544s, 57.626s, 58.133s  <- +2.3%
>      1 worker  = 21.403s, 21.486s, 21.661s  <- -0.2%

Correction: that +2.3% for 2 workers should be +4.2%. And to clarify, I ran the test 3 times as shown, and those percentage changes are based on the middle times.

--
Thomas Munro
http://www.enterprisedb.com
On Sun, Jul 24, 2016 at 11:02 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sun, Jul 24, 2016 at 1:10 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> One solution could be to provide a non-circular variant of the dlist
>> interface that uses NULL list termination.  I've attached a quick
>> sketch of something like that which seems to work correctly.  It is
>> only lightly tested so far and probably buggy, but shows the general
>> idea.
>
> On reflection, it wouldn't make much sense to put "noncircular" in the
> names of interfaces, as that is an internal detail.  Maybe
> "relocatable" or "position independent" or something like that, since
> that's a user-visible property of the dlist_head.  Here is a version
> of the patch that uses _relocatable.
>
>> Any thoughts on this approach, or better ways to solve this problem?
>> How valuable is the branch-free property of the circular push and
>> delete operations?
>
> I investigated this a bit.  If I do this on my laptop (clang, no
> asserts, -O2), it takes 3895 milliseconds, or 4.8ns per push/delete
> operation:
>
>     dlist_init(&head);
>     for (i = 0; i < 100000000; ++i)
>     {
>         dlist_push_head(&head, &nodes[0]);
>         dlist_push_tail(&head, &nodes[1]);
>         dlist_push_head(&head, &nodes[2]);
>         dlist_push_tail(&head, &nodes[3]);
>         dlist_delete(&nodes[2]);
>         dlist_delete(&nodes[3]);
>         dlist_delete(&nodes[0]);
>         dlist_delete(&nodes[1]);
>     }
>
> The relocatable version takes 5907 milliseconds, or 7.4ns per
> push/delete operation:
>
>     dlist_init_relocatable(&head);
>     for (i = 0; i < 100000000; ++i)
>     {
>         dlist_push_head_relocatable(&head, &nodes[0]);
>         dlist_push_tail_relocatable(&head, &nodes[1]);
>         dlist_push_head_relocatable(&head, &nodes[2]);
>         dlist_push_tail_relocatable(&head, &nodes[3]);
>         dlist_delete_relocatable(&head, &nodes[2]);
>         dlist_delete_relocatable(&head, &nodes[3]);
>         dlist_delete_relocatable(&head, &nodes[0]);
>         dlist_delete_relocatable(&head, &nodes[1]);
>     }
>
> Those operations are ~50% slower.  So getting rid of dlist's clever
> branch-free code generally seems like a bad idea.
>
> Next I wondered if it would be a bad idea to use slower relocatable
> dlist heads for all LWLocks.  Obviously LWLocks are used all over the
> place and it would be bad to slow them down, but I wondered if maybe
> dlist manipulation might not be a very significant part of their work.
> So I put an LWLock and a counter in shared memory, and had N background
> workers run a tight loop that locks, increments and unlocks
> concurrently until the counter reaches 1 billion.  This creates
> different degrees of contention and wait list sizes.  The worker loop
> looks like this:
>
>     while (!finished)
>     {
>         LWLockAcquire(&control->lock, LW_EXCLUSIVE);
>         ++control->counter;
>         if (control->counter >= control->goal)
>             finished = true;
>         LWLockRelease(&control->lock);
>     }
>
> I measured the following times for unpatched master, on my 4 core laptop:
>
>     16 workers = 73.067s, 74.869s, 75.338s
>      8 workers = 65.846s, 67.622s, 68.039s
>      4 workers = 68.763s, 68.980s, 69.035s  <-- curiously slower here
>      3 workers = 59.701s, 59.991s, 60.133s
>      2 workers = 53.620s, 55.300s, 55.790s
>      1 worker  = 21.578s, 21.535s, 21.598s
>
> With the attached patch I got:
>
>     16 workers = 75.341s, 77.445s, 77.635s  <- +3.4%
>      8 workers = 67.462s, 68.622s, 68.851s  <- +1.4%
>      4 workers = 64.983s, 65.021s, 65.496s  <- -5.7%
>      3 workers = 60.247s, 60.425s, 60.492s  <- +0.7%
>      2 workers = 57.544s, 57.626s, 58.133s  <- +2.3%
>      1 worker  = 21.403s, 21.486s, 21.661s  <- -0.2%
>
> Somewhat noisy data and different effects at different numbers of workers.
>
> I can post the source for those tests if anyone is interested.  If you
> have any other ideas for access patterns to test, or clever ways to
> keep push and delete branch-free while also avoiding internal pointers
> back to dlist_head, I'm all ears.  Otherwise, if a change affecting
> all LWLocks turns out to be unacceptable, maybe we would need a
> different LWLock interface for relocatable LWLocks to make them
> suitable for use in DSM segments.  Any thoughts?

Thanks for looking into this. Any theory on why this is slower in the abstract but not slower, perhaps even faster, for LWLocks? I wonder if it has to do with how likely it is for a list to be completely empty.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2016-07-25 15:02:41 +1200, Thomas Munro wrote:
>> Any thoughts on this approach, or better ways to solve this problem?
>> How valuable is the branch-free property of the circular push and
>> delete operations?

I think it's generally valuable, but I suspect that it doesn't really matter much for lwlocks, given that this is already the slow path, where we enter the kernel and such. I suspect that the performance difference here is more due to code movement than anything.

> Next I wondered if it would be a bad idea to use slower relocatable
> dlist heads for all LWLocks.  Obviously LWLocks are used all over the
> place and it would be bad to slow them down, but I wondered if maybe
> dlist manipulation might not be a very significant part of their work.

I'm pretty sure that's the case.

> So I put an LWLock and a counter in shared memory, and had N background
> workers run a tight loop that locks, increments and unlocks
> concurrently until the counter reaches 1 billion.  This creates
> different degrees of contention and wait list sizes.  The worker loop
> looks like this:
>
>     while (!finished)
>     {
>         LWLockAcquire(&control->lock, LW_EXCLUSIVE);
>         ++control->counter;
>         if (control->counter >= control->goal)
>             finished = true;
>         LWLockRelease(&control->lock);
>     }

You really need shared lwlocks here as well, because exclusive lwlocks will only ever trigger a single list manipulation (basically a pop from the wait queue), further minimizing the list manipulation overhead.

I think the better fix here would actually be to get rid of a pointer based list here, and a) replace the queue with integer offsets b) make the queue lock-free. But I understand that you might not want to tackle that :P

Greetings,

Andres Freund
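A sketch of the mixed-mode variation being suggested here (hypothetical; no such test was posted in the thread): most workers acquire the lock in LW_SHARED mode, so a long wait queue can build up behind an occasional exclusive acquirer and then be woken in bulk, exercising more list manipulation per release. am_writer and control are assumed to be set up as in the earlier exclusive-only test.

    /* Hypothetical variant of the worker loop quoted above. */
    while (!finished)
    {
        if (am_writer)
        {
            LWLockAcquire(&control->lock, LW_EXCLUSIVE);
            ++control->counter;
            if (control->counter >= control->goal)
                finished = true;
            LWLockRelease(&control->lock);
        }
        else
        {
            LWLockAcquire(&control->lock, LW_SHARED);
            if (control->counter >= control->goal)
                finished = true;
            LWLockRelease(&control->lock);
        }
    }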
On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
> I think the better fix here would actually be to get rid of a pointer
> based list here, and a) replace the queue with integer offsets ...

Here is a prototype of that idea. It replaces that dlist with a proclist, a new specialised doubly-linked list for pgprocno values, using INVALID_PGPROCNO for termination. It works with proclist_node objects inside PGPROC at an arbitrary offset, which must be provided when you initialise the proclist.

It could be made more general than just the PGPROC array by taking the base address and stride of the array, perhaps even allowing for a different base address in each backend for arrays that live in DSM, but I happened to see https://xkcd.com/974/ today and didn't pursue that thought.

> ... b) make
> the queue lock-free.  But I understand that you might not want to tackle
> that :P

This may be complicated by the not-strictly-queue-like access pattern. I haven't looked into it yet, but I can see that it requires some serious study (Harris or Zhang techniques, maybe with extra tagging to avoid ABA problems on concurrent removal of the same node, which might in turn require a wider CAS than we have?). I wonder if a proclist with an optional lock-free interface would also be interesting for syncrep queues and others.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
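In outline, the approach might look like this (a simplified sketch; the attached prototype differs in detail): list links are pgprocno indexes into the PGPROC array rather than pointers, so they mean the same thing in every backend no matter where the containing memory is mapped.

    #include "postgres.h"
    #include "storage/proc.h"   /* PGPROC, GetPGProcByNumber, INVALID_PGPROCNO */

    typedef struct proclist_node
    {
        int    next;            /* pgprocno, or INVALID_PGPROCNO at list end */
        int    prev;
    } proclist_node;

    typedef struct proclist_head
    {
        int    head;            /* pgprocno, or INVALID_PGPROCNO if empty */
        int    tail;
        size_t node_offset;     /* where the node lives inside PGPROC */
    } proclist_head;

    static proclist_node *
    proclist_node_get(proclist_head *list, int procno)
    {
        return (proclist_node *)
            ((char *) GetPGProcByNumber(procno) + list->node_offset);
    }

    static void
    proclist_push_tail(proclist_head *list, int procno)
    {
        proclist_node *node = proclist_node_get(list, procno);

        node->next = INVALID_PGPROCNO;
        node->prev = list->tail;
        if (list->tail != INVALID_PGPROCNO)
            proclist_node_get(list, list->tail)->next = procno;
        else
            list->head = procno;
        list->tail = procno;
    }

Because an int pgprocno is valid in every backend regardless of address space, a wait list built this way can live inside a DSM segment.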
On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
>> I think the better fix here would actually be to get rid of a pointer
>> based list here, and a) replace the queue with integer offsets ...
>
> Here is a prototype of that idea.  It replaces that dlist with a
> proclist, a new specialised doubly-linked list for pgprocno values,
> using INVALID_PGPROCNO for termination.  It works with proclist_node
> objects inside PGPROC at an arbitrary offset, which must be provided
> when you initialise the proclist.

Aside from the fact that this allows LWLocks inside DSM segments, which I definitely want to support, this seems to have the nice property of reducing the size of an LWLock by 8 bytes. We need to consider what to do about LWLOCK_MINIMAL_SIZE. We could (a) decide to still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce false sharing, and rename this constant so that it no longer implies minimality; or (b) alter the macro definition to allow for 16 bytes as a possible result.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 29, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
>>> I think the better fix here would actually be to get rid of a pointer
>>> based list here, and a) replace the queue with integer offsets ...
>>
>> Here is a prototype of that idea.  It replaces that dlist with a
>> proclist, a new specialised doubly-linked list for pgprocno values,
>> using INVALID_PGPROCNO for termination.  It works with proclist_node
>> objects inside PGPROC at an arbitrary offset, which must be provided
>> when you initialise the proclist.
>
> Aside from the fact that this allows LWLocks inside DSM segments,
> which I definitely want to support, this seems to have the nice
> property of reducing the size of an LWLock by 8 bytes.  We need to
> consider what to do about LWLOCK_MINIMAL_SIZE.  We could (a) decide to
> still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce
> false sharing, and rename this constant so that it no longer implies
> minimality; or (b) alter the macro definition to allow for 16 bytes as
> a possible result.

That version didn't actually make LWLock any smaller, because of the additional offset stored in proclist_head when initialising the list. Here is a new version that instead requires callers to provide the offset when accessing the list, bringing sizeof(LWLock) down to 16 on LP64/LLP64 systems. In the spirit of the dlist_container macro, I added macros so you can name the link member rather than providing its offset: proclist_push_tail(&list, procno, lwWaitLink).

--
Thomas Munro
http://www.enterprisedb.com
Attachment
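The revised scheme described above might look roughly like this (a simplified sketch consistent with the description, not the patch itself): the offset moves out of proclist_head and into each call site, hidden behind a dlist_container-style macro, so the head shrinks to two ints.

    #include "postgres.h"
    #include "storage/proc.h"   /* PGPROC, GetPGProcByNumber, INVALID_PGPROCNO */

    typedef struct proclist_node
    {
        int    next;            /* pgprocno, or INVALID_PGPROCNO */
        int    prev;
    } proclist_node;

    typedef struct proclist_head
    {
        int    head;            /* no stored offset: 8 bytes total */
        int    tail;
    } proclist_head;

    static inline proclist_node *
    proclist_node_get(int procno, size_t node_offset)
    {
        return (proclist_node *)
            ((char *) GetPGProcByNumber(procno) + node_offset);
    }

    static inline void
    proclist_push_tail_offset(proclist_head *list, int procno,
                              size_t node_offset)
    {
        proclist_node *node = proclist_node_get(procno, node_offset);

        node->next = INVALID_PGPROCNO;
        node->prev = list->tail;
        if (list->tail != INVALID_PGPROCNO)
            proclist_node_get(list->tail, node_offset)->next = procno;
        else
            list->head = procno;
        list->tail = procno;
    }

    /* Callers name the link member; the macro converts it to an offset. */
    #define proclist_push_tail(list, procno, link_member) \
        proclist_push_tail_offset((list), (procno), \
                                  offsetof(PGPROC, link_member))

With that, lwlock.c can write something like proclist_push_tail(&lock->waiters, MyProc->pgprocno, lwWaitLink), and the LWLock itself no longer pays for a stored offset.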
On Thu, Jul 28, 2016 at 6:51 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> That version didn't actually make LWLock any smaller, because of the
> additional offset stored in proclist_head when initialising the list.
> Here is a new version that instead requires callers to provide the
> offset when accessing the list, bringing sizeof(LWLock) down to 16 on
> LP64/LLP64 systems.  In the spirit of the dlist_container macro, I
> added macros so you can name the link member rather than providing its
> offset: proclist_push_tail(&list, procno, lwWaitLink).

I have reviewed this patch. I like the design, and overall I would say that the patch looks quite elegant. The only problem I found is that proclist_types.h doesn't seem to need to #include <stddef.h>. I tried taking that out and, at least on my machine, it still compiles just fine.

Of course, performance is a concern, as with any change that touches deep internals. Andres and Thomas seem to be in agreement that this patch should cause a performance loss but that it should be undetectable in practice, since only the slow path where we're entering the kernel anyway is affected. I agree with that conclusion. I spent a little bit of time trying to think of a case where this would have a measurable impact, and I'm not coming up with anything. We've done a lot of work over the last few releases to try to eliminate LWLock contention, and the only way you could even theoretically see an impact from this is if you can come up with a workload with furious LWLock contention, so that there is lots and lots of linked-list manipulation going on inside lwlock.c. But I don't really know of a workload that behaves that way, and even if I did, I think the number of cycles we're talking about here is too small to be measurable. Having to put processes to sleep is going to be multiple orders of magnitude more costly than any possible details of how we handle the linked list manipulation.

I also agree with the goal of the patch: the reason why I developed the idea of LWLock tranches was with the goal of being able to put LWLocks inside DSM segments, and that worked fine up until Andres changed lwlock.c to use dlists. This change will get us back to being able to use LWLocks inside of DSM segments. Of course, we have no code like that today; if we did, it would be broken. But Thomas will be submitting code that does exactly that in the near future, and I suspect other people will want to do it, too. As a fringe benefit, I have another patch I'm working on that can make use of this proclist infrastructure.

Therefore, I plan to commit this patch, removing the #include <stddef.h> unless someone convinces me we need it, shortly after development for v10 opens, unless there are objections before then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Therefore, I plan to commit this patch, removing the #include
> <stddef.h> unless someone convinces me we need it, shortly after
> development for v10 opens, unless there are objections before then.

Hearing no objections, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Therefore, I plan to commit this patch, removing the #include
>> <stddef.h> unless someone convinces me we need it, shortly after
>> development for v10 opens, unless there are objections before then.
>
> Hearing no objections, done.

I'd have objected, if I hadn't been on vacation. While I intuitively *do* think that the increased wait-list overhead won't be relevant, I also know that my intuition has frequently been wrong around the lwlock code. This needs some benchmarks on a 4+ socket machine first. Something exercising the slow path, obviously. E.g. a pgbench with a small number of writers, and a large number of writers.

Regards,

Andres
On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Therefore, I plan to commit this patch, removing the #include
>>> <stddef.h> unless someone convinces me we need it, shortly after
>>> development for v10 opens, unless there are objections before then.
>>
>> Hearing no objections, done.
>
> I'd have objected, if I hadn't been on vacation.  While I intuitively
> *do* think that the increased wait-list overhead won't be relevant, I
> also know that my intuition has frequently been wrong around the lwlock
> code.  This needs some benchmarks on a 4+ socket machine first.
> Something exercising the slow path, obviously.  E.g. a pgbench with
> a small number of writers, and a large number of writers.

I have to admit that I totally blanked about you being on vacation. Thanks for mentioning the workload you think might be adversely affected, but to be honest, even if there's some workload where this causes a small regression, I'm not really sure what you think we should do instead. Should we have a separate copy of lwlock.c just for parallel query and other stuff that uses DSM? Won't that slow down every error-handling path in the system, if they all have to release two kinds of lwlocks rather than one? And bloat the binary? Or are you going to argue that parallel query doesn't really need LWLocks? I'm sure that's not true. We got by without it for this release, but that's because the only truly parallel operation as yet is Parallel Seq Scan, whose requirements are simple enough to be handled with a spinlock.

Anyway, I guess we should wait for the benchmark results and then see, but if we're not going to do this then we need some reasonable alternative.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Therefore, I plan to commit this patch, removing the #include
>>> <stddef.h> unless someone convinces me we need it, shortly after
>>> development for v10 opens, unless there are objections before then.
>>
>> Hearing no objections, done.
>
> I'd have objected, if I hadn't been on vacation.  While I intuitively
> *do* think that the increased wait-list overhead won't be relevant, I
> also know that my intuition has frequently been wrong around the lwlock
> code.  This needs some benchmarks on a 4+ socket machine first.
> Something exercising the slow path, obviously.  E.g. a pgbench with
> a small number of writers, and a large number of writers.

Amit just pointed out to me that you wrote "a small number of writers, and a large number of writers". I assume one of those is supposed to say "readers"? Probably the second one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-08-17 08:31:36 -0400, Robert Haas wrote:
> On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Therefore, I plan to commit this patch, removing the #include
>>>> <stddef.h> unless someone convinces me we need it, shortly after
>>>> development for v10 opens, unless there are objections before then.
>>>
>>> Hearing no objections, done.
>>
>> I'd have objected, if I hadn't been on vacation.  While I intuitively
>> *do* think that the increased wait-list overhead won't be relevant, I
>> also know that my intuition has frequently been wrong around the lwlock
>> code.  This needs some benchmarks on a 4+ socket machine first.
>> Something exercising the slow path, obviously.  E.g. a pgbench with
>> a small number of writers, and a large number of writers.
>
> Amit just pointed out to me that you wrote "a small number of writers,
> and a large number of writers".  I assume one of those is supposed to
> say "readers"?  Probably the second one?

Yes. I want a long wait list, modified in bulk - which should be the case with the above.
On 2016-08-16 23:09:07 -0400, Robert Haas wrote:
> On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Therefore, I plan to commit this patch, removing the #include
>>>> <stddef.h> unless someone convinces me we need it, shortly after
>>>> development for v10 opens, unless there are objections before then.
>>>
>>> Hearing no objections, done.
>>
>> I'd have objected, if I hadn't been on vacation.  While I intuitively
>> *do* think that the increased wait-list overhead won't be relevant, I
>> also know that my intuition has frequently been wrong around the lwlock
>> code.  This needs some benchmarks on a 4+ socket machine first.
>> Something exercising the slow path, obviously.  E.g. a pgbench with
>> a small number of writers, and a large number of writers.
>
> I have to admit that I totally blanked about you being on vacation.
> Thanks for mentioning the workload you think might be adversely
> affected, but to be honest, even if there's some workload where this
> causes a small regression, I'm not really sure what you think we
> should do instead.

Well, you convincingly argued against that approach in a nearby thread ;). Joking aside: I do think that we should make such a change knowingly. It might also be possible to address it somehow. I really hope there's no slowdown.

> Should we have a separate copy of lwlock.c just
> for parallel query and other stuff that uses DSM?

No, that'd be horrible.

> Or are you going to argue that parallel query doesn't really need
> LWLocks?

Definitely not.

- Andres
On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:
> Yes. I want a long wait list, modified in bulk - which should be the
> case with the above.
I ran some pgbench tests, and I do not see much difference in performance; the small variance can be attributed to variance in the probability of drawing a particular built-in script.
Server configuration:
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
pgbench configuration: initialized with scale_factor 300
./pgbench -c $threads -j $threads -T 1800 -M prepared -b simple-update@1 -b select-only@20 postgres
Simple-update : select-only = 5:95

    clients            |       8       |      64       |      128
    -------------------+---------------+---------------+---------------
    Current Code       |  38279.663784 | 258196.067342 | 283150.495921
    After Patch revert |  37316.09022  | 268285.506338 | 280077.913954
    % diff             | -2.5171944284 |  3.9076656356 | -1.0851409449

Simple-update : select-only = 20:80

    clients            |       8       |      64       |      128
    -------------------+---------------+---------------+---------------
    Current Code       |  23169.021469 | 134791.996882 | 154057.101004
    After Patch revert |  23892.91539  | 135091.645551 | 150402.80543
    % diff             |  3.1244043775 |  0.2223044958 | -2.3720396854
And this was done on our 8 socket Intel machine.
--
On Fri, Aug 19, 2016 at 4:29 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:
>> Yes. I want a long wait list, modified in bulk - which should be the
>> case with the above.
>
> I ran some pgbench tests, and I do not see much difference in performance; the small variance can be attributed to variance in the probability of drawing a particular built-in script.
Can you specify the m/c details as Andres wants tests to be conducted on some high socket m/c?
On Fri, Aug 19, 2016 at 4:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Can you specify the m/c details as Andres wants tests to be conducted on some high socket m/c?

As I specified in the last line of my mail, it is an 8 socket Intel machine.
available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 50308 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 44594 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 61814 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 55258 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 46705 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 40948 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 47468 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 17285 MB