Fast DSM segments - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Fast DSM segments |
Date | |
Msg-id | CA+hUKGLAE2QBv-WgGp+D9P_J-=yne3zof9nfMaqq1h3EGHFXYQ@mail.gmail.com |
Responses | Re: Fast DSM segments |
List | pgsql-hackers |
Hello PostgreSQL 14 hackers,

FreeBSD is much faster than Linux (and probably Windows) at parallel hash joins on the same hardware, primarily because its DSM segments run in huge pages out of the box. There are various ways to convince recent-ish Linux to put our DSMs on huge pages (see below for one), but that's not the only problem I wanted to attack.

The attached highly experimental patch adds a new GUC dynamic_shared_memory_main_size. If you set it > 0, it creates a fixed-size shared memory region that supplies memory for "fast" DSM segments. When there isn't enough free space, dsm_create() falls back to the traditional approach using eg shm_open(). This allows parallel queries to run faster, because:

* no more expensive system calls
* no repeated VM allocation (whether explicit posix_fallocate() or first-touch)
* the region can be in huge pages on Linux and Windows

This makes lots of parallel queries measurably faster, especially parallel hash join. To demonstrate with a very simple query:

  create table t (i int);
  insert into t select generate_series(1, 10000000);
  select pg_prewarm('t');
  set work_mem = '1GB';
  select count(*) from t t1 join t t2 using (i);

Here are some quick and dirty results from a Linux 4.19 laptop. The first column is the new GUC, and the last column is from "perf stat -e dTLB-load-misses -p <backend>".

  size  huge_pages  time    speedup  TLB misses
  0     off         2.595s           9,131,285
  0     on          2.571s  1%       8,951,595
  1GB   off         2.398s  8%       9,082,803
  1GB   on          1.898s  37%      169,867

You can get some of this speedup unpatched on a Linux 4.7+ system by putting "huge=always" in your /etc/fstab options for /dev/shm (= where shm_open() lives). For comparison, that gives me:

  size  huge_pages  time    speedup  TLB misses
  0     on          2.007s  29%      221,910

That still leaves the other 8% on the table, and in fact that 8% explodes to a much larger number as you throw more cores at the problem (here I was using the defaults, 2 workers). Unfortunately, dsa.c -- used by parallel hash join to allocate vast amounts of memory really fast during the build phase -- holds a lock while creating new segments, as you'll soon discover if you test very large hash join builds on a 72-way box. I considered allowing concurrent segment creation, but as far as I could see that would lead to terrible fragmentation problems, especially in combination with our geometric growth policy for segment sizes (needed because slots are limited). I think this is the main factor that causes parallel hash join scalability to fall off around 8 cores. The present patch should really help with that (more digging in that area needed; there are other ways to improve that situation, possibly including something smarter than a stream of dsa_allocate(32kB) calls).

A competing idea would be to keep freelists of lingering DSM segments for reuse. Among other problems, you'd probably have fragmentation problems due to their differing sizes. Perhaps there could be a hybrid of these two ideas, putting a region for "fast" DSM segments inside many OS-supplied segments, though that's obviously much more complicated.

As for what a reasonable setting would be for this patch, well, erm, it depends. Obviously that's RAM the system can't use for other purposes while you're not running parallel queries, and if it's huge pages, it can't be swapped out; if it's not huge pages, it can be swapped out, and that'd be terrible for performance the next time you need it. So you wouldn't want to set it too large. If you set it too small, it falls back to the traditional behaviour.
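For reference, here's a rough sketch of the settings behind the numbers above; the fstab line is only illustrative and the exact mount options may vary by distribution:

  # postgresql.conf -- dynamic_shared_memory_main_size is the new GUC from the attached patch
  dynamic_shared_memory_main_size = 1GB   # 0 (the default) keeps the traditional behaviour
  huge_pages = on                         # the huge_pages column in the tables above

  # /etc/fstab -- the unpatched Linux 4.7+ trick; adjust to taste
  tmpfs  /dev/shm  tmpfs  defaults,huge=always  0  0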
One argument I've heard in favour of creating fresh segments every time is that NUMA systems configured to prefer local memory allocation (as opposed to interleaved allocation) probably avoid cross node traffic. I haven't looked into that topic yet; I suppose one way to deal with it in this scheme would be to have one such region per node, and prefer to allocate from the local one.
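To make that concrete, a purely illustrative sketch (not code from the attached patch) of how a "prefer the local node's region" policy could look; region_try_alloc() is a hypothetical per-node allocator standing in for whatever the real bookkeeping would be:

  #define _GNU_SOURCE
  #include <sched.h>      /* sched_getcpu() */
  #include <stddef.h>
  #include <numa.h>       /* libnuma: numa_node_of_cpu(), numa_max_node() */

  extern void *region_try_alloc(int node, size_t size);   /* hypothetical */

  void *
  fast_dsm_alloc(size_t size)
  {
      int     nnodes = numa_max_node() + 1;
      int     local = numa_node_of_cpu(sched_getcpu());
      void   *p;

      /* Prefer the region on the backend's current NUMA node ... */
      if ((p = region_try_alloc(local, size)) != NULL)
          return p;

      /* ... then try the other nodes' regions ... */
      for (int node = 0; node < nnodes; node++)
          if (node != local && (p = region_try_alloc(node, size)) != NULL)
              return p;

      /* ... and finally give up, so dsm_create() falls back to eg shm_open(). */
      return NULL;
  }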