Re: Some performance degradation in REL_16 vs REL_15

From:           Andres Freund
Subject:        Re: Some performance degradation in REL_16 vs REL_15
Date:           2023-11-15
Msg-id:         20231115180433.p3eeaczbam5zxdz5@awork3.anarazel.de
In response to: Re: Some performance degradation in REL_16 vs REL_15 ("Anton A. Melnikov" <a.melnikov@postgrespro.ru>)
List:           pgsql-hackers
Hi,

On 2023-11-15 11:33:44 +0300, Anton A. Melnikov wrote:
> The configure options and test scripts on my pc and server were the same:
> export CFLAGS="-O2"
> ./configure --enable-debug --with-perl --with-icu --enable-depend --enable-tap-tests
> #reinstall
> #reinitdb
> #create database bench
> for ((i=0; i<100; i++)); do pgbench -U postgres -i -s8 bench> /dev/null 2>&1;
> psql -U postgres -d bench -c "checkpoint"; RES=$(pgbench -U postgres -c6 -T20 -j6 bench;

Even with scale 8 you're likely significantly impacted by contention, and obviously by WAL write latency. See below for why that matters.

> I can't understand why i get the opposite results on my pc and on the server. It is clear that the absolute
> TPS values will be different for various configurations. This is normal. But differences?
> Is it unlikely that some kind of reference configuration is needed to accurately
> measure the difference in performance. Probably something wrong with my pc, but now
> i can not figure out what's wrong.

One very common reason for symptoms like this is power saving by the CPU. In workloads where the CPU is not meaningfully utilized, it will drop into a power-saving mode - which can badly affect latency-sensitive workloads, both because the CPU will initially run at a lower frequency and because it takes time to shift back to a higher frequency.

Here's an example: I bound the server and psql to the same CPU core (nothing else is allowed to use that core) and ran the following:

\o /dev/null
SELECT 1; SELECT 1; SELECT 1; SELECT pg_sleep(0.1); SELECT 1; SELECT 1; SELECT 1;

Time: 0.181 ms
Time: 0.085 ms
Time: 0.071 ms
Time: 100.474 ms
Time: 0.153 ms
Time: 0.077 ms
Time: 0.069 ms

You can see how the first query was slower, the next two were faster, and then after the pg_sleep() it's slow again.

# tell the CPU to optimize for performance, not power
cpupower frequency-set --governor performance
# disable going to lower power states
cpupower idle-set -D0
# disable turbo mode for consistent performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Now the timings are:

Time: 0.038 ms
Time: 0.028 ms
Time: 0.025 ms
Time: 1000.262 ms (00:01.000)
Time: 0.027 ms
Time: 0.024 ms
Time: 0.023 ms

Look: fast and reasonably consistent timings. Switching back:

Time: 0.155 ms
Time: 0.123 ms
Time: 0.074 ms
Time: 1001.235 ms (00:01.001)
Time: 0.120 ms
Time: 0.077 ms
Time: 0.068 ms

The perverse thing is that this often means that *reducing* the number of instructions executed yields *worse* behaviour under non-sustained load, because from the CPU's point of view there is less need to increase the clock speed.

To show how much of a difference that can make, I ran pgbench with a single client on one core, and the server on another (so the CPU is idle in between):

numactl --physcpubind 11 pgbench -n -M prepared -P1 -S -c 1 -T10

With the power-optimized configuration:

latency average = 0.035 ms
latency stddev = 0.002 ms
initial connection time = 5.255 ms
tps = 28434.334672 (without initial connection time)

With the performance-optimized configuration:

latency average = 0.025 ms
latency stddev = 0.001 ms
initial connection time = 3.544 ms
tps = 40079.995935 (without initial connection time)

That's a whopping 1.4x in throughput!
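An easy way to sanity-check this sort of thing on your own machine is to look at what a core is actually doing while the benchmark runs, rather than trusting the settings. Nothing here is specific to this setup; cpupower and /proc/cpuinfo are the generic tools:

# show the governor and the current frequency of core 11
cpupower -c 11 frequency-info
# show which idle states are currently enabled/disabled
cpupower -c 11 idle-info
# or watch the effective clock of all cores while the benchmark runs
watch -n 1 "grep 'cpu MHz' /proc/cpuinfo"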
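The pipelined workload used for the next runs can be expressed with pgbench's \startpipeline / \endpipeline script meta-commands, which require the extended query protocol (-M extended or -M prepared). The exact script isn't reproduced here; a sketch that batches 100 of the SELECT-only transactions into one script execution would look roughly like:

# generate a pgbench script that pipelines 100 read-only transactions
# (a sketch, not necessarily the exact script behind the numbers below;
# the same :aid is reused for all 100 queries, which is fine for a
# cached, read-only test)
{
  printf '%s\n' '\set aid random(1, 100000 * :scale)'
  printf '%s\n' '\startpipeline'
  for i in $(seq 100); do
    printf '%s\n' 'SELECT abalance FROM pgbench_accounts WHERE aid = :aid;'
  done
  printf '%s\n' '\endpipeline'
} > pipeline.sql

numactl --physcpubind 11 pgbench -n -M prepared -P1 -c 1 -T10 -f pipeline.sql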
Now, the same thing, except using that custom pipelined workload, 100 read-only transactions in one script execution:

With the power-optimized configuration:

latency average = 1.055 ms
latency stddev = 0.125 ms
initial connection time = 6.915 ms
tps = 947.985286 (without initial connection time)

(this means we actually executed 94798.5286 read-only pgbench transactions/s)

With the performance-optimized configuration:

latency average = 1.376 ms
latency stddev = 0.083 ms
initial connection time = 3.759 ms
tps = 726.849018 (without initial connection time)

Suddenly the super-duper performance-optimized settings are worse (but note that stddev is down)! I suspect the problem is that now, because we disabled idle states, the CPU ends up clocking *lower* due to power usage.

If I instead switch just the relevant *cores* to the performance-optimized configuration:

cpupower -c 10,11 idle-set -D0; cpupower -c 10,11 frequency-set --governor performance

latency average = 0.940 ms
latency stddev = 0.061 ms
initial connection time = 3.311 ms
tps = 1063.719116 (without initial connection time)

it wins again.

Now, realistically you'd never use -D0 (i.e. disabling all idle states, not just the deeper ones) - the power differential is quite big, and as shown here it can hurt performance as well. On an idle system, looking at the CPU power usage with:

powerstat -D -R 5 1000

Time      User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts  pkg-0  dram  pkg-1
09:45:03   0.6   0.0   0.2  99.2   0.0    1   2861   2823    0    0    0  46.84  24.82  3.68  18.33
09:45:08   1.0   0.0   0.1  99.0   0.0    2   2565   1602    0    0    0  54.78  28.92  4.74  21.12
09:45:13   0.8   0.0   0.3  98.9   0.0    1   3769   2322    0    0    0  55.65  29.43  4.72  21.50
09:45:18   0.8   0.0   0.1  99.1   0.0    3   2513   1479    0    0    0  51.95  27.47  4.23  20.24
09:45:23   0.6   0.0   0.1  99.3   0.0    1   2282   1448    0    0    0  49.44  26.12  3.91  19.41
09:45:28   0.8   0.0   0.1  99.1   0.0    2   2422   1465    0    0    0  51.79  27.33  4.27  20.19
09:45:33   0.9   0.0   0.1  99.0   0.0    2   2358   1566    0    0    0  55.05  29.03  4.73  21.29
09:45:38   0.6   0.0   0.1  99.4   0.0    1   2976   4207    0    0    0  54.38  29.08  4.02  21.28
09:45:43   1.2   0.0   0.2  98.6   0.0    2   3988  37670    0    0    0 125.51  64.15  7.97  53.39
09:45:48   0.6   0.0   0.2  99.2   0.0    2   3263  40000    0    0    0 126.31  63.84  7.97  54.50
09:45:53   0.4   0.0   0.0  99.6   0.0    1   2333  39716    0    0    0 125.94  63.64  7.90  54.40
09:45:58   0.3   0.0   0.0  99.6   0.0    1   2783  39795    0    0    0 125.93  63.58  7.90  54.44
09:46:03   0.6   0.0   0.2  99.2   0.0    1   3081  24910    0    0    0  93.55  47.69  6.10  39.75
09:46:08   0.5   0.0   0.1  99.4   0.0    1   2623   1356    0    0    0  42.65  22.59  3.18  16.89

(interesting columns are Watts, pkg-0, dram, pkg-1)

Initially this was with cpupower idle-set -E, then I ran cpupower idle-set -D0, and then cpupower idle-set -E again - you can easily see where.

If I use a more sensible -D5 system-wide and run the pgbench from earlier, I get the performance *and* sensible power usage:

latency average = 0.946 ms
latency stddev = 0.063 ms
initial connection time = 3.482 ms
tps = 1057.413946 (without initial connection time)

09:50:36   0.4   0.0   0.1  99.5   0.0    1   2593   1206    0    0    0  42.63  22.47  3.18  16.97

To get back to your benchmark: Because you're measuring a highly contended system, where most of the time postgres will just wait for
a) row-level locks to be released
b) WAL writes to finish (you're clearly not using a very fast disk)
the CPU would have plenty of time to clock down.

Benchmarking sucks. Modern hardware realities suck.

Greetings,

Andres Freund