Thread: How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
From
Sergio Rus
Date:
Hi guys,
I've been configuring a new server and tuning Postgresql 15.3, but I'm struggling with a latency I'm consistently seeing with this new server when running fast short queries, compared to the other server.
We're running two different versions of Postgresql:
- Server A: Postgresql 9.3
- Server B: Postgresql 15.3
Server B is the new server and is way more powerful than server A:
- Server A: 1x Intel Xeon E3-1270 3.5GHz, 2x 8GB DDR3, RAID0
- Server B: 2x Intel Xeon Platinum 8260 2.4GHz, 4x 16GB DDR4, RAID1
We're running Linux Ubuntu 20.04 on server B and I've been tweaking some settings in Linux and Postgresql 15.3. With the current setup, Postgresql 15.3 is able to process more than 1 million transactions per second running pgbench:
# pgbench --username postgres --select-only --client 100 --jobs 10 --time 20 test
pgbench (15.3 (Ubuntu 15.3-1.pgdg20.04+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 100
number of threads: 10
maximum number of tries: 1
duration: 20 s
number of transactions actually processed: 23039950
number of failed transactions: 0 (0.000%)
latency average = 0.087 ms
initial connection time = 62.536 ms
tps = 1155053.135317 (without initial connection time)
As shown in pgbench, the performance is great. Also when testing individual queries, heavy queries (those taking a few ms) run faster on server B than server A. Unfortunately when we run fast short SELECT queries (< 1 ms), server A is consistently running faster than server B, even if the query plans are the same:
Server A:
# EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT 1 AS "a" FROM "foobar" WHERE ("foobar"."id" = 1) LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..8.44 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Output: (1)
Buffers: shared hit=5
-> Index Only Scan using foobar_pkey on public.foobar (cost=0.42..8.44 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Output: 1
Index Cond: (foobar.id = 1)
Heap Fetches: 1
Buffers: shared hit=5
Total runtime: 0.017 ms
(9 rows)
Time: 0.281 ms
Server B:
# EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT 1 AS "a" FROM "foobar" WHERE ("foobar"."id" = 1) LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.11 rows=1 width=4) (actual time=0.019..0.021 rows=1 loops=1)
Output: 1
Buffers: shared hit=4
-> Index Only Scan using foobar_pkey on public.foobar (cost=0.00..1.11 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=1)
Output: 1
Index Cond: (foobar.id = 1)
Heap Fetches: 0
Buffers: shared hit=4
Planning Time: 0.110 ms
Execution Time: 0.045 ms
(10 rows)
Time: 0.635 ms
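A quick way to compare the two servers on just this query, outside of EXPLAIN, is a single-client pgbench run with a custom script (a sketch; foobar.sql simply contains the SELECT above):
# echo 'SELECT 1 AS "a" FROM "foobar" WHERE ("foobar"."id" = 1) LIMIT 1;' > foobar.sql
# pgbench --username postgres --file foobar.sql --client 1 --time 10 test
The reported "latency average" then reflects only that statement, without psql's client-side overhead.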
RAID1 could add some latency on server B if it was reading from disk, but I've confirmed that these queries are hitting the buffer/cache and therefore reading data from memory and not from disk. I've checked the hit rate with the following query:
SELECT 'cache hit rate' AS name, sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio FROM pg_statio_user_tables;
The hit rate was over 95% and it increased as soon as I ran those queries. Same thing with the index hit rate.
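For the index hit rate I used the analogous query against pg_statio_user_indexes, roughly:
SELECT 'index hit rate' AS name, sum(idx_blks_hit) / (sum(idx_blks_hit) + sum(idx_blks_read)) AS ratio FROM pg_statio_user_indexes;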
I've been playing with some parameters in Postgresql, decreasing/increasing the number of workers, shared buffers, work_mem, JIT, cpu_*_cost variables, etc, but nothing helped to reduce that latency.
Here are the settings I'm currently using with Postgresql 15.3 after a lot of work experimenting with different values:
checkpoint_completion_target = 0.9
checkpoint_timeout = 900
cpu_index_tuple_cost = 0.00001
cpu_operator_cost = 0.00001
effective_cache_size = 12GB
effective_io_concurrency = 200
jit = off
listen_addresses = 'localhost'
maintenance_work_mem = 1GB
max_connections = 100
max_parallel_maintenance_workers = 4
max_parallel_workers = 12
max_parallel_workers_per_gather = 4
max_wal_size = 4GB
max_worker_processes = 12
min_wal_size = 1GB
random_page_cost = 1.1
shared_buffers = 4GB
ssl = off
timezone = 'UTC'
wal_buffers = 16MB
work_mem = 64MB
Some notes about those settings:
- We're running other services on this server, which is why I'm not dedicating more resources to Postgresql.
- Tweaking the cpu_*_cost parameters was crucial to improving the query plan. With the default values Postgresql was consistently choosing a slower query plan (a quick per-session check is sketched below).
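The effect is easy to check in a single session before touching postgresql.conf, something like:
SET cpu_operator_cost = 0.00001;
SET cpu_index_tuple_cost = 0.00001;
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT 1 AS "a" FROM "foobar" WHERE ("foobar"."id" = 1) LIMIT 1;
RESET cpu_operator_cost;
RESET cpu_index_tuple_cost;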
I've been looking at some settings in Linux as well:
- Swappiness is set to the lowest safe value: vm.swappiness = 1
- Huge Pages is not being used and Transparent Huge Pages (THP) is set to 'madvise'. Postgresql 15.3 is using the default value for the 'huge_pages' parameter: 'try'.
- The memory overcommit policy is set to 1: vm.overcommit_memory = 1
I've been playing with Huge Pages, to try to force Postgresql to use this feature. I manually allocated the number of Huge Pages reported by this query:
SHOW shared_memory_size_in_huge_pages;
I confirmed Huge Pages were being used by Postgresql, but unfortunately I didn't see any improvement regarding latency and performance. So I set this back to the previous state.
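For the record, forcing it looked roughly like this (2102 is a made-up example count; the real number came from the SHOW above):
# sysctl -w vm.nr_hugepages=2102
# grep Huge /proc/meminfo
plus huge_pages = on in postgresql.conf and a restart of Postgresql.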
Conclusion:
The latency is quite low on both servers, but when you're running dozens or hundreds of fast short queries concurrently, on aggregate you see the difference, with server A being 0.1-1.0 seconds faster than server B.
As you can see, server B has 2 CPUs and is using NUMA on Linux. And the CPU clock is slower on server B than on server A. Maybe one of those is causing that latency?
Any suggestions or ideas where to look? I'd really appreciate your help.
Thank you
Re: How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
From
Kenneth Marshall
Date:
On Wed, May 31, 2023 at 02:40:05PM +0200, Sergio Rus wrote:
> Hi guys,
>
> I've been configuring a new server and tuning Postgresql 15.3, but I'm
> struggling with a latency I'm consistently seeing with this new server when
> running fast short queries, compared to the other server.
>
> We're running two different versions of Postgresql:
>
> - Server A: Postgresql 9.3
> - Server B: Postgresql 15.3
>
> Server B is the new server and is way more powerful than server A:
>
> - Server A: 1x Intel Xeon E3-1270 3.5GHz, 2x 8GB DDR3, RAID0
> - Server B: 2x Intel Xeon Platinum 8260 2.4GHz, 4x 16GB DDR4, RAID1
> ...
> Conclusion:
>
> As you can see, server B has 2 CPUs and is using NUMA on Linux. And the
> CPU clock is slower on server B than on server A. Maybe one of those is
> causing that latency?

Hi Sergio,

This really looks like it is caused by the CPU clock speed difference. The E3 is about 1.5X faster at the base frequency (3.5 GHz vs 2.4 GHz). Many times that is the trade-off when going to many more cores. Simple short queries will run faster on the older CPU even though overall the new CPU has much more total capacity.

Regards,
Ken
Re: How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
From
Sergio Rus
Date:
Thanks for your reply, Ken.
With such a big server I was convinced that we should see a boost everywhere, but after spending so much time tweaking and looking at so many parameters on Linux, Postgresql and our current setup, I started to think that maybe that latency was intrinsic to the hardware and therefore inevitable. So after all, the CPU clock speed still counts these days! I think many of us just look at the number of CPU cores and forget that clock speed is still relevant for many tasks.
I guess those simple short queries are very sensitive to the hardware specs, and they haven't gained as much from recent versions of Postgres as the heavy queries have, as I've seen in my tests.
On Wed, 31 May 2023 at 15:47, Kenneth Marshall <ktm@rice.edu> wrote:
On Wed, May 31, 2023 at 02:40:05PM +0200, Sergio Rus wrote:
> Hi guys,
>
> I've been configuring a new server and tuning Postgresql 15.3, but I'm
> struggling with a latency I'm consistently seeing with this new server when
> running fast short queries, compared to the other server.
>
> We're running two different versions of Postgresql:
>
> - Server A: Postgresql 9.3
> - Server B: Postgresql 15.3
>
> Server B is the new server and is way more powerful than server A:
>
> - Server A: 1x Intel Xeon E3-1270 3.5GHz, 2x 8GB DDR3, RAID0
> - Server B: 2x Intel Xeon Platinum 8260 2.4GHz, 4x 16GB DDR4, RAID1
> ...
> Conclusion:
>
> As you can see, server B has 2 CPUs and is using NUMA on Linux. And the
> CPU clock is slower on server B than on server A. Maybe one of those is
> causing that latency?
>
Hi Sergio,
This really looks like it is caused by the CPU clock speed difference.
The E3 is about 1.5X faster at the base frequency (3.5 GHz vs 2.4 GHz). Many times that is the
trade-off when going to many more cores. Simple short queries will run faster on
the older CPU even though overall the new CPU has much more total
capacity.
Regards,
Ken
Re: How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
From
Imre Samu
Date:
> Server B is the new server and is way more powerful than server A:
> ...
> So after all, the CPU clock speed still counts these days!
Hi Sergio,
Maybe "powerful" + "powersave"?
Maybe "powerful" + "powersave"?
As I see, Server B: Processor Base Frequency: 2.40 GHz and Max Turbo Frequency: 3.90 GHz.
Could you verify this by running the 'cpupower frequency-info' command and checking the governor line?
read more:
"But If we haven’t emphasised it enough, firstly whatever database benchmark you are running
make sure that your CPUs are not configured to run in powersave mode."
https://www.hammerdb.com/blog/uncategorized/how-to-maximize-cpu-performance-for-postgresql-12-0-benchmarks-on-linux/
regards,
Imre
On Wed, 31 May 2023 at 18:03, Sergio Rus <geiros@gmail.com> wrote:
> Thanks for your reply, Ken.
>
> With such a big server I was convinced that we should see a boost everywhere, but after spending so much time tweaking and looking at so many parameters on Linux, Postgresql and our current setup, I started to think that maybe that latency was intrinsic to the hardware and therefore inevitable. So after all, the CPU clock speed still counts these days! I think many of us just look at the number of CPU cores and forget that clock speed is still relevant for many tasks.
>
> I guess those simple short queries are very sensitive to the hardware specs, and they haven't gained as much from recent versions of Postgres as the heavy queries have, as I've seen in my tests.
Re: How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
From
Andres Freund
Date:
Hi,

On 2023-05-31 14:40:05 +0200, Sergio Rus wrote:
> I've been configuring a new server and tuning Postgresql 15.3, but I'm
> struggling with a latency I'm consistently seeing with this new server when
> running fast short queries, compared to the other server.
>
> We're running two different versions of Postgresql:
>
> - Server A: Postgresql 9.3
> - Server B: Postgresql 15.3
>
> Server B is the new server and is way more powerful than server A:
>
> - Server A: 1x Intel Xeon E3-1270 3.5GHz, 2x 8GB DDR3, RAID0
> - Server B: 2x Intel Xeon Platinum 8260 2.4GHz, 4x 16GB DDR4, RAID1
>
> We're running Linux Ubuntu 20.04 on server B and I've been tweaking some
> settings in Linux and Postgresql 15.3. With the current setup, Postgresql
> 15.3 is able to process more than 1 million transactions per second running
> pgbench:
>
> # pgbench --username postgres --select-only --client 100 --jobs 10
> --time 20 test

Could you post the pgbench results for both systems? Which one is this from?

> As shown in pgbench, the performance is great. Also when testing individual
> queries, heavy queries (those taking a few ms) run faster on server B than
> server A. Unfortunately when we run fast short SELECT queries (< 1 ms),
> server A is consistently running faster than server B, even if the query
> plans are the same:

One explanation for this can be the powersaving settings. Newer CPUs can throttle down a lot further than the older ones. Increasing the clock speed has a fair bit of latency - for a long running query that's not really visible, but if you run a short query in isolation, it'll likely complete before the clock speed has finished ramping up.

You can query that with
  cpupower frequency-info

Another thing is that you're comparing a two socket system with a one socket system. Latency between a program running on one node and a client on another, and similarly, a program running on one node and memory attached to the other CPU, is higher.

You could check what happens if you bind both server and client to the same CPU socket.

  numactl --membind 1 --cpunodebind 1 <program> <parameters>

forces programs to allocate memory and run on a specific CPU socket.

Greetings,

Andres Freund
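PS: as a concrete example, pinning both the server and the pgbench client to socket 1 could look roughly like this (the data directory path is a placeholder, and the right node number depends on your topology):
# numactl --membind 1 --cpunodebind 1 pg_ctl -D /path/to/data start
# numactl --membind 1 --cpunodebind 1 pgbench --username postgres --select-only --client 100 --jobs 10 --time 20 test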
Re: How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
From
Ranier Vilela
Date:
On Wed, 31 May 2023 at 09:40, Sergio Rus <geiros@gmail.com> wrote:
> As you can see, server B has 2 CPUs and is using NUMA on Linux. And the CPU clock is slower on server B than on server A. Maybe one of those is causing that latency?
>
> Any suggestions or ideas where to look? I'd really appreciate your help.
If this is cpu bound, linux perf can show the difference.
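For example (a sketch; <backend_pid> is a placeholder for a busy Postgresql backend):
# perf stat -p <backend_pid> -- sleep 10
# perf top -p <backend_pid>
perf stat reports the effective clock (GHz) the core is actually running at, and perf top shows where the cycles go.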
regards,
Ranier Vilela
Re: How to reduce latency with fast short queries in Postgresql 15.3 on a NUMA server
From
Sergio Rus
Date:
Thanks for your replies, you were totally right, it was due to the CPU governor: the governor was set to 'powersave'. I've changed it to 'performance' and the server is flying now. I'm still working on it, but the first quick tests I've made are showing much better numbers. Those simple short queries are running faster now, the latency is basically the same or even lower than on the old server. The server feels more responsive overall.

I've finally installed cpupower to simplify the process, but you can use basic shell commands. Here is the output of some commands:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
performance powersave

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance

# cpupower -c all frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1000 MHz - 3.90 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1000 MHz and 3.90 GHz.
                  The governor "performance" may decide which speed to use within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1.94 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
analyzing CPU 1:
  (same as CPU 0, except) current CPU frequency: 1.91 GHz (asserted by call to kernel)
analyzing CPU 2:
  (same as CPU 0, except) current CPU frequency: 2.14 GHz (asserted by call to kernel)
... (cropped)
analyzing CPU 9:
  (same as CPU 0, except) current CPU frequency: 2.95 GHz (asserted by call to kernel)
... (cropped)
analyzing CPU 26:
  (same as CPU 0, except) current CPU frequency: 1.00 GHz (asserted by call to kernel)
analyzing CPU 27:
  (same as CPU 0, except) current CPU frequency: 1000 MHz (asserted by call to kernel)
... (cropped)

Before this change, with the CPU governor set to 'powersave', basically all the CPU cores were at 1.00 GHz. I haven't listed all the cores, but I'm seeing very different frequencies now. I noticed that some of the cores are still at 1 GHz, which is good if they're idle, otherwise the server would get really hot!

> Could you post the pgbench results for both systems? Which one is this from?

Andres, I ran pgbench on the new server. Unfortunately the old server is in production and quite busy, so I can't run any benchmark over there.

Thanks!
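PS: in case it helps anyone else, the change itself was roughly a one-liner (making it persistent across reboots depends on the distro):
# cpupower frequency-set -g performance
or, without cpupower:
# for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $f; done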