Thread: LWLocks in DSM memory
Hi,

As already noted[1], LWLocks don't currently work in DSM segments, because they use dlist for the list of waiters. Even though all of the waiter nodes are in PGPROC and therefore have stable addresses, the dlist code internally constructs a circular list including pointers to a special sentinel node inside the dlist_head object, and those pointers may be invalid in other backends.

One solution could be to provide a non-circular variant of the dlist interface that uses NULL list termination. I've attached a quick sketch of something like that which seems to work correctly. It is only lightly tested so far and probably buggy, but it shows the general idea.

Any thoughts on this approach, or better ways to solve this problem? How valuable is the branch-free property of the circular push and delete operations?

[1] https://www.postgresql.org/message-id/CA+Tgmobjia49CCJ0ZazbWaVv7nKgYt+1Zo5CwxkH9Aahgn2vPg@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Attachment
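A minimal sketch of the idea being proposed (hypothetical names; this is not the attached patch): the list head holds plain NULL-terminated pointers rather than circular links through a sentinel inside itself, so it contains no addresses of its own fields and remains valid even if the head is mapped at a different address in each backend. Note the branches that the circular, sentinel-based dlist avoids:

    /* Hypothetical sketch, not the attached patch. */
    typedef struct reloc_node
    {
        struct reloc_node *prev;
        struct reloc_node *next;
    } reloc_node;

    typedef struct reloc_head
    {
        reloc_node *head;           /* NULL when empty; no sentinel */
        reloc_node *tail;
    } reloc_head;

    static void
    reloc_push_head(reloc_head *list, reloc_node *node)
    {
        node->prev = NULL;
        node->next = list->head;
        if (list->head != NULL)     /* a branch the circular dlist avoids */
            list->head->prev = node;
        else
            list->tail = node;
        list->head = node;
    }

    static void
    reloc_delete(reloc_head *list, reloc_node *node)
    {
        if (node->prev != NULL)     /* more branches */
            node->prev->next = node->next;
        else
            list->head = node->next;
        if (node->next != NULL)
            node->next->prev = node->prev;
        else
            list->tail = node->prev;
    }

Because the waiter nodes live in PGPROC, which has a stable address in every backend, only the head needs to lose its self-referencing sentinel for the list to work across address spaces.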
On Sun, Jul 24, 2016 at 1:10 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> One solution could be to provide a non-circular variant of the dlist
> interface that uses NULL list termination.  I've attached a quick
> sketch of something like that which seems to work correctly.  It is
> only lightly tested so far and probably buggy, but shows the general
> idea.

On reflection, it wouldn't make much sense to put "noncircular" in the names of interfaces, as that is an internal detail. Maybe "relocatable" or "position independent" or something like that, since that's a user-visible property of the dlist_head. Here is a version of the patch that uses _relocatable.

> Any thoughts on this approach, or better ways to solve this problem?
> How valuable is the branch-free property of the circular push and
> delete operations?

I investigated this a bit. If I do this on my laptop (clang, no asserts, -O2), it takes 3895 milliseconds, or 4.8ns per push/delete operation:

    dlist_init(&head);
    for (i = 0; i < 100000000; ++i)
    {
        dlist_push_head(&head, &nodes[0]);
        dlist_push_tail(&head, &nodes[1]);
        dlist_push_head(&head, &nodes[2]);
        dlist_push_tail(&head, &nodes[3]);
        dlist_delete(&nodes[2]);
        dlist_delete(&nodes[3]);
        dlist_delete(&nodes[0]);
        dlist_delete(&nodes[1]);
    }

The relocatable version takes 5907 milliseconds, or 7.4ns per push/delete operation:

    dlist_init_relocatable(&head);
    for (i = 0; i < 100000000; ++i)
    {
        dlist_push_head_relocatable(&head, &nodes[0]);
        dlist_push_tail_relocatable(&head, &nodes[1]);
        dlist_push_head_relocatable(&head, &nodes[2]);
        dlist_push_tail_relocatable(&head, &nodes[3]);
        dlist_delete_relocatable(&head, &nodes[2]);
        dlist_delete_relocatable(&head, &nodes[3]);
        dlist_delete_relocatable(&head, &nodes[0]);
        dlist_delete_relocatable(&head, &nodes[1]);
    }

Those operations are ~50% slower. So getting rid of dlist's clever branch-free code generally seems like a bad idea.

Next I wondered if it would be a bad idea to use slower relocatable dlist heads for all LWLocks. Obviously LWLocks are used all over the place and it would be bad to slow them down, but I wondered if maybe dlist manipulation might not be a very significant part of their work. So I put an LWLock and a counter in shared memory, and had N background workers run a tight loop that locks, increments and unlocks concurrently until the counter reaches 1 billion. This creates different degrees of contention and wait list sizes. The worker loop looks like this:

    while (!finished)
    {
        LWLockAcquire(&control->lock, LW_EXCLUSIVE);
        ++control->counter;
        if (control->counter >= control->goal)
            finished = true;
        LWLockRelease(&control->lock);
    }

I measured the following times for unpatched master, on my 4 core laptop:

    16 workers = 73.067s, 74.869s, 75.338s
     8 workers = 65.846s, 67.622s, 68.039s
     4 workers = 68.763s, 68.980s, 69.035s  <-- curiously slower here
     3 workers = 59.701s, 59.991s, 60.133s
     2 workers = 53.620s, 55.300s, 55.790s
     1 worker  = 21.578s, 21.535s, 21.598s

With the attached patch I got:

    16 workers = 75.341s, 77.445s, 77.635s  <- +3.4%
     8 workers = 67.462s, 68.622s, 68.851s  <- +1.4%
     4 workers = 64.983s, 65.021s, 65.496s  <- -5.7%
     3 workers = 60.247s, 60.425s, 60.492s  <- +0.7%
     2 workers = 57.544s, 57.626s, 58.133s  <- +2.3%
     1 worker  = 21.403s, 21.486s, 21.661s  <- -0.2%

Somewhat noisy data and different effects at different numbers of workers.

I can post the source for those tests if anyone is interested. If you have any other ideas for access patterns to test, or clever ways to keep push and delete branch-free while also avoiding internal pointers back to dlist_head, I'm all ears. Otherwise, if a change affecting all LWLocks turns out to be unacceptable, maybe we would need a different LWLock interface for relocatable LWLocks to make them suitable for use in DSM segments. Any thoughts?

--
Thomas Munro
http://www.enterprisedb.com
Attachment
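The test source was not posted in this message, but the harness behind the millisecond figures above might plausibly look like this (an assumption, not code from the thread), using the instr_time facilities from src/include/portability/instr_time.h:

    #include "postgres.h"
    #include "lib/ilist.h"
    #include "portability/instr_time.h"

    /* Assumed harness: times 100 million iterations of the loop quoted above. */
    static void
    time_dlist_ops(void)
    {
        dlist_head  head;
        dlist_node  nodes[4];
        instr_time  start, duration;
        int         i;

        dlist_init(&head);
        INSTR_TIME_SET_CURRENT(start);
        for (i = 0; i < 100000000; ++i)
        {
            dlist_push_head(&head, &nodes[0]);
            dlist_push_tail(&head, &nodes[1]);
            dlist_push_head(&head, &nodes[2]);
            dlist_push_tail(&head, &nodes[3]);
            dlist_delete(&nodes[2]);
            dlist_delete(&nodes[3]);
            dlist_delete(&nodes[0]);
            dlist_delete(&nodes[1]);
        }
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);
        elog(LOG, "%.0f ms total, %.1f ns per push/delete",
             INSTR_TIME_GET_MILLISEC(duration),
             INSTR_TIME_GET_DOUBLE(duration) * 1e9 / (100000000.0 * 8.0));
    }

With 8 push/delete operations per iteration, 3895 ms over 800 million operations works out to the quoted ~4.8ns each.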
On Mon, Jul 25, 2016 at 3:02 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I measured the following times for unpatched master, on my 4 core laptop:
>
>     16 workers = 73.067s, 74.869s, 75.338s
>      8 workers = 65.846s, 67.622s, 68.039s
>      4 workers = 68.763s, 68.980s, 69.035s  <-- curiously slower here
>      3 workers = 59.701s, 59.991s, 60.133s
>      2 workers = 53.620s, 55.300s, 55.790s
>      1 worker  = 21.578s, 21.535s, 21.598s
>
> With the attached patch I got:
>
>     16 workers = 75.341s, 77.445s, 77.635s  <- +3.4%
>      8 workers = 67.462s, 68.622s, 68.851s  <- +1.4%
>      4 workers = 64.983s, 65.021s, 65.496s  <- -5.7%
>      3 workers = 60.247s, 60.425s, 60.492s  <- +0.7%
>      2 workers = 57.544s, 57.626s, 58.133s  <- +2.3%
>      1 worker  = 21.403s, 21.486s, 21.661s  <- -0.2%

Correction: that +2.3% for 2 workers should be +4.2%. And to clarify, I ran the test 3 times as shown, and those percentage changes are based on the middle times.

--
Thomas Munro
http://www.enterprisedb.com
On Sun, Jul 24, 2016 at 11:02 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sun, Jul 24, 2016 at 1:10 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> One solution could be to provide a non-circular variant of the dlist
>> interface that uses NULL list termination.  I've attached a quick
>> sketch of something like that which seems to work correctly.  It is
>> only lightly tested so far and probably buggy, but shows the general
>> idea.
>
> On reflection, it wouldn't make much sense to put "noncircular" in the
> names of interfaces, as that is an internal detail.  Maybe
> "relocatable" or "position independent" or something like that, since
> that's a user-visible property of the dlist_head.  Here is a version
> of the patch that uses _relocatable.
>
>> Any thoughts on this approach, or better ways to solve this problem?
>> How valuable is the branch-free property of the circular push and
>> delete operations?
>
> I investigated this a bit.  If I do this on my laptop (clang, no
> asserts, -O2), it takes 3895 milliseconds, or 4.8ns per push/delete
> operation:
>
>     dlist_init(&head);
>     for (i = 0; i < 100000000; ++i)
>     {
>         dlist_push_head(&head, &nodes[0]);
>         dlist_push_tail(&head, &nodes[1]);
>         dlist_push_head(&head, &nodes[2]);
>         dlist_push_tail(&head, &nodes[3]);
>         dlist_delete(&nodes[2]);
>         dlist_delete(&nodes[3]);
>         dlist_delete(&nodes[0]);
>         dlist_delete(&nodes[1]);
>     }
>
> The relocatable version takes 5907 milliseconds, or 7.4ns per
> push/delete operation:
>
>     dlist_init_relocatable(&head);
>     for (i = 0; i < 100000000; ++i)
>     {
>         dlist_push_head_relocatable(&head, &nodes[0]);
>         dlist_push_tail_relocatable(&head, &nodes[1]);
>         dlist_push_head_relocatable(&head, &nodes[2]);
>         dlist_push_tail_relocatable(&head, &nodes[3]);
>         dlist_delete_relocatable(&head, &nodes[2]);
>         dlist_delete_relocatable(&head, &nodes[3]);
>         dlist_delete_relocatable(&head, &nodes[0]);
>         dlist_delete_relocatable(&head, &nodes[1]);
>     }
>
> Those operations are ~50% slower.  So getting rid of dlist's clever
> branch-free code generally seems like a bad idea.
>
> Next I wondered if it would be a bad idea to use slower relocatable
> dlist heads for all LWLocks.  Obviously LWLocks are used all over the
> place and it would be bad to slow them down, but I wondered if maybe
> dlist manipulation might not be a very significant part of their work.
> So I put an LWLock and a counter in shared memory, and had N background
> workers run a tight loop that locks, increments and unlocks
> concurrently until the counter reaches 1 billion.  This creates
> different degrees of contention and wait list sizes.  The worker loop
> looks like this:
>
>     while (!finished)
>     {
>         LWLockAcquire(&control->lock, LW_EXCLUSIVE);
>         ++control->counter;
>         if (control->counter >= control->goal)
>             finished = true;
>         LWLockRelease(&control->lock);
>     }
>
> I measured the following times for unpatched master, on my 4 core laptop:
>
>     16 workers = 73.067s, 74.869s, 75.338s
>      8 workers = 65.846s, 67.622s, 68.039s
>      4 workers = 68.763s, 68.980s, 69.035s  <-- curiously slower here
>      3 workers = 59.701s, 59.991s, 60.133s
>      2 workers = 53.620s, 55.300s, 55.790s
>      1 worker  = 21.578s, 21.535s, 21.598s
>
> With the attached patch I got:
>
>     16 workers = 75.341s, 77.445s, 77.635s  <- +3.4%
>      8 workers = 67.462s, 68.622s, 68.851s  <- +1.4%
>      4 workers = 64.983s, 65.021s, 65.496s  <- -5.7%
>      3 workers = 60.247s, 60.425s, 60.492s  <- +0.7%
>      2 workers = 57.544s, 57.626s, 58.133s  <- +2.3%
>      1 worker  = 21.403s, 21.486s, 21.661s  <- -0.2%
>
> Somewhat noisy data and different effects at different numbers of workers.
>
> I can post the source for those tests if anyone is interested.  If you
> have any other ideas for access patterns to test, or clever ways to
> keep push and delete branch-free while also avoiding internal pointers
> back to dlist_head, I'm all ears.  Otherwise, if a change affecting
> all LWLocks turns out to be unacceptable, maybe we would need a
> different LWLock interface for relocatable LWLocks to make them
> suitable for use in DSM segments.  Any thoughts?

Thanks for looking into this. Any theory on why this is slower in the abstract but not slower, perhaps even faster, for LWLocks? I wonder if it has to do with how likely it is for a list to be completely empty.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2016-07-25 15:02:41 +1200, Thomas Munro wrote:
>> Any thoughts on this approach, or better ways to solve this problem?
>> How valuable is the branch-free property of the circular push and
>> delete operations?

I think it's generally valuable, but I suspect that it doesn't really matter much for lwlocks, given that this is already the slow path, where we enter the kernel and such. I suspect that the performance difference here is more due to code movement than anything.

> Next I wondered if it would be a bad idea to use slower relocatable
> dlist heads for all LWLocks.  Obviously LWLocks are used all over the
> place and it would be bad to slow them down, but I wondered if maybe
> dlist manipulation might not be a very significant part of their work.

I'm pretty sure that's the case.

> So I put an LWLock and a counter in shared memory, and had N background
> workers run a tight loop that locks, increments and unlocks
> concurrently until the counter reaches 1 billion.  This creates
> different degrees of contention and wait list sizes.  The worker loop
> looks like this:
>
>     while (!finished)
>     {
>         LWLockAcquire(&control->lock, LW_EXCLUSIVE);
>         ++control->counter;
>         if (control->counter >= control->goal)
>             finished = true;
>         LWLockRelease(&control->lock);
>     }

You really need shared lwlocks here as well, because exclusive lwlocks will only ever trigger a single list manipulation (basically a pop from the wait queue), further minimizing the list manipulation overhead.

I think the better fix here would actually be to get rid of a pointer based list here, and a) replace the queue with integer offsets b) make the queue lock-free. But I understand that you might not want to tackle that :P

Greetings,

Andres Freund
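A sketch of the mixed-mode variation being suggested here (hypothetical; no such test was posted in the thread): most workers acquire the lock in LW_SHARED mode, so a long wait queue can build up behind an occasional exclusive acquirer and then be woken in bulk, exercising more list manipulation per release. am_writer and control are assumed to be set up as in the earlier exclusive-only test.

    /* Hypothetical variant of the worker loop quoted above. */
    while (!finished)
    {
        if (am_writer)
        {
            LWLockAcquire(&control->lock, LW_EXCLUSIVE);
            ++control->counter;
            if (control->counter >= control->goal)
                finished = true;
            LWLockRelease(&control->lock);
        }
        else
        {
            LWLockAcquire(&control->lock, LW_SHARED);
            if (control->counter >= control->goal)
                finished = true;
            LWLockRelease(&control->lock);
        }
    }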
On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
> I think the better fix here would actually be to get rid of a pointer
> based list here, and a) replace the queue with integer offsets ...

Here is a prototype of that idea. It replaces that dlist with a proclist, a new specialised doubly-linked list for pgprocno values, using INVALID_PGPROCNO for termination. It works with proclist_node objects inside PGPROC at an arbitrary offset, which must be provided when you initialise the proclist.

It could be made more general than just the PGPROC array by taking the base address and stride of the array, perhaps even allowing for a different base address in each backend for arrays that live in DSM, but I happened to see https://xkcd.com/974/ today and didn't pursue that thought.

> ... b) make
> the queue lock-free.  But I understand that you might not want to tackle
> that :P

This may be complicated by the not-strictly-queue-like access pattern. I haven't looked into it yet, but I can see that it requires some serious study (Harris or Zhang techniques, maybe with extra tagging to avoid ABA problems on concurrent removal of the same node, which might in turn require a wider CAS than we have?). I wonder if a proclist with an optional lock-free interface would also be interesting for syncrep queues and others.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
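In outline, the approach might look like this (a simplified sketch; the attached prototype differs in detail): list links are pgprocno indexes into the PGPROC array rather than pointers, so they mean the same thing in every backend no matter where the containing memory is mapped.

    #include "postgres.h"
    #include "storage/proc.h"   /* PGPROC, GetPGProcByNumber, INVALID_PGPROCNO */

    typedef struct proclist_node
    {
        int    next;            /* pgprocno, or INVALID_PGPROCNO at list end */
        int    prev;
    } proclist_node;

    typedef struct proclist_head
    {
        int    head;            /* pgprocno, or INVALID_PGPROCNO if empty */
        int    tail;
        size_t node_offset;     /* where the node lives inside PGPROC */
    } proclist_head;

    static proclist_node *
    proclist_node_get(proclist_head *list, int procno)
    {
        return (proclist_node *)
            ((char *) GetPGProcByNumber(procno) + list->node_offset);
    }

    static void
    proclist_push_tail(proclist_head *list, int procno)
    {
        proclist_node *node = proclist_node_get(list, procno);

        node->next = INVALID_PGPROCNO;
        node->prev = list->tail;
        if (list->tail != INVALID_PGPROCNO)
            proclist_node_get(list, list->tail)->next = procno;
        else
            list->head = procno;
        list->tail = procno;
    }

Because an int pgprocno is valid in every backend regardless of address space, a wait list built this way can live inside a DSM segment.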
On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
>> I think the better fix here would actually be to get rid of a pointer
>> based list here, and a) replace the queue with integer offsets ...
>
> Here is a prototype of that idea.  It replaces that dlist with a
> proclist, a new specialised doubly-linked list for pgprocno values,
> using INVALID_PGPROCNO for termination.  It works with proclist_node
> objects inside PGPROC at an arbitrary offset, which must be provided
> when you initialise the proclist.

Aside from the fact that this allows LWLocks inside DSM segments, which I definitely want to support, this seems to have the nice property of reducing the size of an LWLock by 8 bytes. We need to consider what to do about LWLOCK_MINIMAL_SIZE. We could (a) decide to still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce false sharing, and rename this constant so that it no longer implies minimality; or (b) alter the macro definition to allow for 16 bytes as a possible result.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 29, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
>>> I think the better fix here would actually be to get rid of a pointer
>>> based list here, and a) replace the queue with integer offsets ...
>>
>> Here is a prototype of that idea.  It replaces that dlist with a
>> proclist, a new specialised doubly-linked list for pgprocno values,
>> using INVALID_PGPROCNO for termination.  It works with proclist_node
>> objects inside PGPROC at an arbitrary offset, which must be provided
>> when you initialise the proclist.
>
> Aside from the fact that this allows LWLocks inside DSM segments,
> which I definitely want to support, this seems to have the nice
> property of reducing the size of an LWLock by 8 bytes.  We need to
> consider what to do about LWLOCK_MINIMAL_SIZE.  We could (a) decide to
> still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce
> false sharing, and rename this constant so that it no longer implies
> minimality; or (b) alter the macro definition to allow for 16 bytes as
> a possible result.

That version didn't actually make LWLock any smaller, because of the additional offset stored in proclist_head when initialising the list. Here is a new version that instead requires callers to provide the offset when accessing the list, bringing sizeof(LWLock) down to 16 on LP64/LLP64 systems. In the spirit of the dlist_container macro, I added macros so you can name the link member rather than providing its offset: proclist_push_tail(&list, procno, lwWaitLink).

--
Thomas Munro
http://www.enterprisedb.com
Attachment
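The revised scheme described above might look roughly like this (a simplified sketch consistent with the description, not the patch itself): the offset moves out of proclist_head and into each call site, hidden behind a dlist_container-style macro, so the head shrinks to two ints.

    #include "postgres.h"
    #include "storage/proc.h"   /* PGPROC, GetPGProcByNumber, INVALID_PGPROCNO */

    typedef struct proclist_node
    {
        int    next;            /* pgprocno, or INVALID_PGPROCNO */
        int    prev;
    } proclist_node;

    typedef struct proclist_head
    {
        int    head;            /* no stored offset: 8 bytes total */
        int    tail;
    } proclist_head;

    static inline proclist_node *
    proclist_node_get(int procno, size_t node_offset)
    {
        return (proclist_node *)
            ((char *) GetPGProcByNumber(procno) + node_offset);
    }

    static inline void
    proclist_push_tail_offset(proclist_head *list, int procno,
                              size_t node_offset)
    {
        proclist_node *node = proclist_node_get(procno, node_offset);

        node->next = INVALID_PGPROCNO;
        node->prev = list->tail;
        if (list->tail != INVALID_PGPROCNO)
            proclist_node_get(list->tail, node_offset)->next = procno;
        else
            list->head = procno;
        list->tail = procno;
    }

    /* Callers name the link member; the macro converts it to an offset. */
    #define proclist_push_tail(list, procno, link_member) \
        proclist_push_tail_offset((list), (procno), \
                                  offsetof(PGPROC, link_member))

With that, lwlock.c can write something like proclist_push_tail(&lock->waiters, MyProc->pgprocno, lwWaitLink), and the LWLock itself no longer pays for a stored offset.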
On Thu, Jul 28, 2016 at 6:51 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> That version didn't actually make LWLock any smaller, because of the
> additional offset stored in proclist_head when initialising the list.
> Here is a new version that instead requires callers to provide the
> offset when accessing the list, bringing sizeof(LWLock) down to 16 on
> LP64/LLP64 systems.  In the spirit of the dlist_container macro, I
> added macros so you can name the link member rather than providing its
> offset: proclist_push_tail(&list, procno, lwWaitLink).

I have reviewed this patch. I like the design, and overall I would say that the patch looks quite elegant. The only problem I found is that proclist_types.h doesn't seem to need to #include <stddef.h>. I tried taking that out and, at least on my machine, it still compiles just fine.

Of course, performance is a concern, as with any change that touches deep internals. Andres and Thomas seem to be in agreement that this patch should cause a performance loss but that it should be undetectable in practice, since only the slow path where we're entering the kernel anyway is affected. I agree with that conclusion. I spent a little bit of time trying to think of a case where this would have a measurable impact, and I'm not coming up with anything. We've done a lot of work over the last few releases to try to eliminate LWLock contention, and the only way you could even theoretically see an impact from this is if you can come up with a workload with furious LWLock contention, so that there is lots and lots of linked-list manipulation going on inside lwlock.c. But I don't really know of a workload that behaves that way, and even if I did, I think the number of cycles we're talking about here is too small to be measurable. Having to put processes to sleep is going to be multiple orders of magnitude more costly than any possible details of how we handle the linked list manipulation.

I also agree with the goal of the patch: the reason why I developed the idea of LWLock tranches was with the goal of being able to put LWLocks inside DSM segments, and that worked fine up until Andres changed lwlock.c to use dlists. This change will get us back to being able to use LWLocks inside of DSM segments. Of course, we have no code like that today; if we did, it would be broken. But Thomas will be submitting code that does exactly that in the near future, and I suspect other people will want to do it, too. As a fringe benefit, I have another patch I'm working on that can make use of this proclist infrastructure.

Therefore, I plan to commit this patch, removing the #include <stddef.h> unless someone convinces me we need it, shortly after development for v10 opens, unless there are objections before then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Therefore, I plan to commit this patch, removing the #include
> <stddef.h> unless someone convinces me we need it, shortly after
> development for v10 opens, unless there are objections before then.

Hearing no objections, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Therefore, I plan to commit this patch, removing the #include
>> <stddef.h> unless someone convinces me we need it, shortly after
>> development for v10 opens, unless there are objections before then.
>
> Hearing no objections, done.

I'd have objected, if I hadn't been on vacation. While I intuitively *do* think that the increased wait-list overhead won't be relevant, I also know that my intuition has frequently been wrong around the lwlock code. This needs some benchmarks on a 4+ socket machine first. Something exercising the slow path, obviously. E.g. a pgbench with a small number of writers, and a large number of writers.

Regards,

Andres
On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Therefore, I plan to commit this patch, removing the #include
>>> <stddef.h> unless someone convinces me we need it, shortly after
>>> development for v10 opens, unless there are objections before then.
>>
>> Hearing no objections, done.
>
> I'd have objected, if I hadn't been on vacation.  While I intuitively
> *do* think that the increased wait-list overhead won't be relevant, I
> also know that my intuition has frequently been wrong around the lwlock
> code.  This needs some benchmarks on a 4+ socket machine first.
> Something exercising the slow path, obviously.  E.g. a pgbench with
> a small number of writers, and a large number of writers.

I have to admit that I totally blanked about you being on vacation. Thanks for mentioning the workload you think might be adversely affected, but to be honest, even if there's some workload where this causes a small regression, I'm not really sure what you think we should do instead. Should we have a separate copy of lwlock.c just for parallel query and other stuff that uses DSM? Won't that slow down every error-handling path in the system, if they all have to release two kinds of lwlocks rather than one? And bloat the binary? Or are you going to argue that parallel query doesn't really need LWLocks? I'm sure that's not true. We got by without it for this release, but that's because the only truly parallel operation as yet is Parallel Seq Scan, whose requirements are simple enough to be handled with a spinlock.

Anyway, I guess we should wait for the benchmark results and then see, but if we're not going to do this then we need some reasonable alternative.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Therefore, I plan to commit this patch, removing the #include
>>> <stddef.h> unless someone convinces me we need it, shortly after
>>> development for v10 opens, unless there are objections before then.
>>
>> Hearing no objections, done.
>
> I'd have objected, if I hadn't been on vacation.  While I intuitively
> *do* think that the increased wait-list overhead won't be relevant, I
> also know that my intuition has frequently been wrong around the lwlock
> code.  This needs some benchmarks on a 4+ socket machine first.
> Something exercising the slow path, obviously.  E.g. a pgbench with
> a small number of writers, and a large number of writers.

Amit just pointed out to me that you wrote "a small number of writers, and a large number of writers". I assume one of those is supposed to say "readers"? Probably the second one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-08-17 08:31:36 -0400, Robert Haas wrote:
> On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Therefore, I plan to commit this patch, removing the #include
>>>> <stddef.h> unless someone convinces me we need it, shortly after
>>>> development for v10 opens, unless there are objections before then.
>>>
>>> Hearing no objections, done.
>>
>> I'd have objected, if I hadn't been on vacation.  While I intuitively
>> *do* think that the increased wait-list overhead won't be relevant, I
>> also know that my intuition has frequently been wrong around the lwlock
>> code.  This needs some benchmarks on a 4+ socket machine first.
>> Something exercising the slow path, obviously.  E.g. a pgbench with
>> a small number of writers, and a large number of writers.
>
> Amit just pointed out to me that you wrote "a small number of writers,
> and a large number of writers".  I assume one of those is supposed to
> say "readers"?  Probably the second one?

Yes. I want a long wait list, modified in bulk - which should be the case with the above.
On 2016-08-16 23:09:07 -0400, Robert Haas wrote:
> On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Therefore, I plan to commit this patch, removing the #include
>>>> <stddef.h> unless someone convinces me we need it, shortly after
>>>> development for v10 opens, unless there are objections before then.
>>>
>>> Hearing no objections, done.
>>
>> I'd have objected, if I hadn't been on vacation.  While I intuitively
>> *do* think that the increased wait-list overhead won't be relevant, I
>> also know that my intuition has frequently been wrong around the lwlock
>> code.  This needs some benchmarks on a 4+ socket machine first.
>> Something exercising the slow path, obviously.  E.g. a pgbench with
>> a small number of writers, and a large number of writers.
>
> I have to admit that I totally blanked about you being on vacation.
> Thanks for mentioning the workload you think might be adversely
> affected, but to be honest, even if there's some workload where this
> causes a small regression, I'm not really sure what you think we
> should do instead.

Well, you convincingly argued against that approach in a nearby thread ;). Joking aside: I do think that we should make such a change knowingly. It might also be possible to address it somehow. I really hope there's no slowdown.

> Should we have a separate copy of lwlock.c just
> for parallel query and other stuff that uses DSM?

No, that'd be horrible.

> Or are you going to argue that parallel query doesn't really need
> LWLocks?

Definitely not.

- Andres
On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:
> Yes. I want a long wait list, modified in bulk - which should be the
> case with the above.
I ran some pgbench tests, and I do not see much difference in performance; the small variance can be attributed to variance in the probability of drawing a particular built-in script.
Server configuration:
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
pgbench configuration: initialized with scale_factor 300
./pgbench -c $threads -j $threads -T 1800 -M prepared -b simple-update@1 -b select-only@20 postgres
Simple-update : select-only = 5:95

    clients            |       8       |      64       |      128
    -------------------+---------------+---------------+---------------
    Current Code       |  38279.663784 | 258196.067342 | 283150.495921
    After Patch revert |  37316.09022  | 268285.506338 | 280077.913954
    % diff             | -2.5171944284 |  3.9076656356 | -1.0851409449

Simple-update : select-only = 20:80

    clients            |       8       |      64       |      128
    -------------------+---------------+---------------+---------------
    Current Code       |  23169.021469 | 134791.996882 | 154057.101004
    After Patch revert |  23892.91539  | 135091.645551 | 150402.80543
    % diff             |  3.1244043775 |  0.2223044958 | -2.3720396854
And this was done on our 8 socket Intel machine.
--
On Fri, Aug 19, 2016 at 4:29 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:
>> Yes. I want a long wait list, modified in bulk - which should be the
>> case with the above.
>
> I ran some pgbench tests, and I do not see much difference in performance; the small variance can be attributed to variance in the probability of drawing a particular built-in script.
Can you specify the m/c details as Andres wants tests to be conducted on some high socket m/c?
On Fri, Aug 19, 2016 at 4:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Can you specify the m/c details as Andres wants tests to be conducted on some high socket m/c?

As I specified in the last line of my mail, it is an 8 socket Intel machine.
available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 50308 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 44594 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 61814 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 55258 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 46705 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 40948 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 47468 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 17285 MB