Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4) - Mailing list pgsql-bugs
From | Andres Freund
---|---
Subject | Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)
Msg-id | 20150708125512.GL10242@alap3.anarazel.de
In response to | Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4) (Andres Freund <andres@anarazel.de>)
Responses | Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4); Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)
List | pgsql-bugs
On 2015-07-08 11:12:38 +0200, Andres Freund wrote:
> On 2015-07-07 21:13:04 -0400, Tom Lane wrote:
> > There is some discussion going on about improving the scalability of
> > snapshot acquisition, but nothing will happen in that line before 9.6
> > at the earliest.
>
> 9.5 should be less bad at it than 9.4, at least if it's mostly read-only
> ProcArrayLock acquisitions, which sounds like it should be the case here.

test 3:

master:
1 clients: 3112.7
2 clients: 6806.7
4 clients: 13441.2
8 clients: 15765.4
16 clients: 21102.2

9.4:
1 clients: 2524.2
2 clients: 5903.2
4 clients: 11756.8
8 clients: 14583.3
16 clients: 19309.2

So there's an interesting "dip" between 4 and 8 clients. A perf profile
doesn't show any actual lock contention on master. Not that surprising,
as there shouldn't be any exclusive locks here.

One interesting thing to consider in exactly such cases is Intel's turbo
boost. Disabling it (echo 0 > /sys/devices/system/cpu/cpufreq/boost)
gives us these results:

test 3:

master:
1 clients: 2926.6
2 clients: 6634.3
4 clients: 13905.2
8 clients: 15718.9

so that's not it in this case.

Comparing stats between the 4 and 8 client runs shows (removing boring
data):

4 clients:

      90859.517328      task-clock (msec)        #    3.428 CPUs utilized
   109,655,985,749      stalled-cycles-frontend  #   54.27% frontend cycles idle     (27.79%)
    62,906,918,008      stalled-cycles-backend   #   31.14% backend cycles idle      (27.78%)
   219,063,494,214      instructions             #    1.08  insns per cycle
                                                 #    0.50  stalled cycles per insn  (33.32%)
    41,664,400,828      branches                 #  458.558 M/sec                    (33.32%)
       374,426,805      branch-misses            #    0.90% of all branches          (33.32%)
    62,504,845,665      L1-dcache-loads          #  687.928 M/sec                    (27.78%)
     1,224,842,848      L1-dcache-load-misses    #    1.96% of all L1-dcache hits    (27.81%)
       321,981,924      LLC-loads                #    3.544 M/sec                    (22.33%)
        23,219,438      LLC-load-misses          #    7.21% of all LL-cache hits     (5.52%)

      26.507528305 seconds time elapsed

8 clients:

     165168.247631      task-clock (msec)        #    6.824 CPUs utilized
   247,231,674,170      stalled-cycles-frontend  #   67.04% frontend cycles idle     (27.84%)
   101,354,900,788      stalled-cycles-backend   #   27.48% backend cycles idle      (27.83%)
   285,829,642,649      instructions             #    0.78  insns per cycle
                                                 #    0.86  stalled cycles per insn  (33.39%)
    54,503,992,461      branches                 #  329.991 M/sec                    (33.39%)
       761,911,056      branch-misses            #    1.40% of all branches          (33.38%)
    81,373,091,784      L1-dcache-loads          #  492.668 M/sec                    (27.74%)
     4,419,307,036      L1-dcache-load-misses    #    5.43% of all L1-dcache hits    (27.72%)
       510,940,577      LLC-loads                #    3.093 M/sec                    (21.86%)
        26,963,120      LLC-load-misses          #    5.28% of all LL-cache hits     (5.37%)

      24.205675255 seconds time elapsed

It's quite visible that all caches have considerably worse
characteristics in the 8-client case, and that "instructions per cycle"
has gone down considerably, presumably because more frontend cycles were
idle, which in turn is probably caused by the higher cache miss ratios.
L1 going from 1.96% misses to 5.43% misses is quite a drastic
difference.
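The exact perf invocation isn't shown in the mail; output of this shape is what perf stat prints with its default counters plus the -d cache events, gathered while the benchmark runs. The command below is an assumed reconstruction, with the pgbench options and test3.sql merely standing in for the unspecified "test 3" workload:

    # Assumed reconstruction, not from the thread: system-wide counters
    # (default set plus the -d cache events) while pgbench drives the
    # workload; test3.sql is a placeholder for the unspecified test script.
    perf stat -a -d -- pgbench -n -c 8 -j 8 -T 25 -f test3.sql postgres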
Now, looking at where cache misses happen:

4 clients:
+    7.64%  postgres  postgres      [.] AllocSetAlloc
+    3.90%  postgres  postgres      [.] LWLockAcquire
+    3.40%  postgres  plpgsql.so    [.] plpgsql_exec_function
+    2.64%  postgres  postgres      [.] GetCachedPlan
+    2.20%  postgres  postgres      [.] slot_deform_tuple
+    2.16%  postgres  libc-2.19.so  [.] _int_free
+    2.08%  postgres  libc-2.19.so  [.] __memcpy_sse2_unaligned

8 clients:
+    6.34%  postgres  postgres      [.] AllocSetAlloc
+    4.89%  postgres  plpgsql.so    [.] plpgsql_exec_function
+    2.63%  postgres  libc-2.19.so  [.] _int_free
+    2.60%  postgres  libc-2.19.so  [.] __memcpy_sse2_unaligned
+    2.50%  postgres  postgres      [.] ExecLimit
+    2.47%  postgres  postgres      [.] LWLockAcquire
+    2.18%  postgres  postgres      [.] ExecProject

So the characteristics interestingly change quite a bit between 4 and 8
clients. I reproduced this a number of times to make sure it's not just
a temporary issue.

The rise in memcpy is mainly:
+   80.27% SearchCatCache
+   10.56% appendBinaryStringInfo
+    6.51% socket_putmessage
+    0.78% pgstat_report_activity

So at least on the hardware available to me right now this isn't caused
by actual lock contention. Hm.

I've a patch addressing the SearchCatCache memcpy() cost somewhere...

Andres
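The per-caller breakdown of the memcpy calls is the kind of output perf's call-graph mode gives; the mail doesn't show the commands used, so the event choice and options below are an assumed reconstruction rather than what was actually run:

    # Assumed reconstruction: sample a cache-miss event with call graphs
    # system-wide while the test runs, then expand the callers of
    # __memcpy_sse2_unaligned in the interactive report.
    perf record -e cache-misses -a -g -- sleep 30
    perf report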