scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers
From:     Tomas Vondra
Subject:  scalability bottlenecks with (many) partitions (and more)
Date:
Msg-id:   510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
List:     pgsql-hackers
Hi,

I happened to investigate a query involving a partitioned table, which
led me to a couple of bottlenecks severely affecting queries dealing
with multiple partitions (or relations in general). After a while I
came up with three WIP patches that improve the behavior by an order of
magnitude, and not just in some extreme cases.

Consider a partitioned pgbench with 100 partitions, say:

  pgbench -i -s 100 --partitions 100 testdb

but let's modify pgbench_accounts a little bit:

  ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
  UPDATE pgbench_accounts SET aid_parent = aid;
  CREATE INDEX ON pgbench_accounts(aid_parent);
  VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not the partition key.
And now let's run this query:

  SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb
      ON (pa.bid = pb.bid)
   WHERE pa.aid_parent = :aid

so pretty much the regular "pgbench -S", except on a column that does
not allow partition elimination. The plan looks like this:

                              QUERY PLAN
  ----------------------------------------------------------------------
   Hash Join  (cost=1.52..34.41 rows=10 width=465)
     Hash Cond: (pa.bid = pb.bid)
     ->  Append  (cost=0.29..33.15 rows=10 width=101)
           ->  Index Scan using pgbench_accounts_1_aid_parent_idx on
               pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  Index Scan using pgbench_accounts_2_aid_parent_idx on
               pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  Index Scan using pgbench_accounts_3_aid_parent_idx on
               pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  Index Scan using pgbench_accounts_4_aid_parent_idx on
               pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  ...
     ->  Hash  (cost=1.10..1.10 rows=10 width=364)
           ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10 width=364)

So yeah, scanning all 100 partitions. Not great, but no partitioning
scheme is perfect for all queries.

Anyway, let's see how this works on a big AMD EPYC machine with 96/192
cores - with "-M simple" we get:

  parts      1      8     16     32     64     96     160     224
  ------------------------------------------------------------------
      0  13877 105732 210890 410452 709509 844683 1050658 1163026
    100    653   3957   7120  12022  12707  11813   10349    9633
   1000     20    142    270    474    757    808     567     427

These are transactions per second, for different numbers of clients
(the numbers in the header). With "-M prepared" the story doesn't
change - the numbers are higher, but the overall behavior is pretty
much the same.

Firstly, with no partitions (first row), the throughput grows by ~13k
tps per client initially, then it gradually levels off - but it keeps
growing the whole time. With 100 or 1000 partitions, however, it peaks
and then starts dropping again. Moreover, the throughput with 100 or
1000 partitions is just a tiny fraction of the non-partitioned value,
and the difference is roughly equal to the number of partitions - for
example with 96 clients, the difference between 0 and 1000 partitions
is 844683/808 = 1045. I could demonstrate the same behavior with fewer
partitions - e.g. with 10 partitions you get a ~10x difference, and so
on.

Another thing I'd mention is that this is not just about partitioning.
Imagine a star schema with a fact table and dimensions - you'll get the
same behavior depending on the number of dimensions you need to join
with.
With "-M simple" you may get this, for example: dims 1 8 16 32 64 96 160 224 ---------------------------------------------------------------------- 1 11737 92925 183678 361497 636598 768956 958679 1042799 10 462 3558 7086 13889 25367 29503 25353 24030 100 4 31 61 122 231 292 292 288 So, similar story - significant slowdown as we're adding dimensions. Now, what could be causing this? Clearly, there's a bottleneck of some kind, and we're hitting it. Some of this may be simply due to execution doing more stuff (more index scans, more initialization, ...) but maybe not - one of the reasons why I started looking into this was not using all the CPU even for small scales - the CPU was maybe 60% utilized. So I started poking at things. The first thing that I thought about was locking, obviously. That's consistent with the limited CPU utilization (waiting on a lock = not running), and it's somewhat expected when using many partitions - we need to lock all of them, and if we have 100 or 1000 of them, that's potentially lot of locks. From past experiments I've known about two places where such bottleneck could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to give it a try, increase these values and see what happens. For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The LWLock table has 16 partitions by default - it's quite possible that on machine with many cores and/or many partitions, we can easily hit this. So I bumped this 4x to 64 partitions. For fast-path locking the changes are more complicated (see 0002). We allow keeping 16 relation locks right in PGPROC, and only when this gets full we promote them to the actual lock table. But with enough partitions we're guaranteed to fill these 16 slots, of course. But increasing the number of slots is not simple - firstly, the information is split between an array of 16 OIDs and UINT64 serving as a bitmap. Increasing the size of the OID array is simple, but it's harder for the auxiliary bitmap. But there's more problems - with more OIDs a simple linear search won't do. But a simple hash table is not a good idea too, because of poor locality and the need to delete stuff ... What I ended up doing is having a hash table of 16-element arrays. There are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we have now. Each OID is mapped to exactly one of these parts as if in a hash table, and in each of those 16-element parts we do exactly the same thing we do now (linear search, removal, etc.). This works great, the locality is great, etc. The one disadvantage is this makes PGPROC larger, but I did a lot of benchmarks and I haven't seen any regression that I could attribute to this. (More about this later.) Unfortunately, for the pgbench join this does not make much difference. But for the "star join" (with -M prepared) it does this: 1 8 16 32 64 96 160 224 ------------------------------------------------------------------------ master 21610 137450 247541 300902 270932 229692 191454 189233 patched 21664 151695 301451 594615 1036424 1211716 1480953 1656203 speedup 1.0 1.1 1.2 2.0 3.8 5.3 7.7 8.8 That's a pretty nice speedup, I think. However, why doesn't the partitioned join improve (at not very much)? 
However, why doesn't the partitioned join improve (at least not very
much)? Well, the perf profile says stuff like this:

    9.16%  0.77%  postgres  [kernel.kallsyms]  [k] asm_exc_page_fault
           |
            --8.39%--asm_exc_page_fault
                      |
                       --7.52%--exc_page_fault
                                 |
                                  --7.13%--do_user_addr_fault
                                            |
                                             --6.64%--handle_mm_fault
                                                       |
                                                        --6.29%--__handle_mm_fault
                                                                  |
                                                                  |--2.17%--__mem_cgroup_charge
                                                                  |          |
                                                                  |          |--1.25%--charge_memcg
                                                                  |          |          |
                                                                  |          |           --0.57%-- ...
                                                                  |          |
                                                                  |           --0.67%-- ...
                                                                  |
                                                                  |--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion that this
may be some sort of scalability problem in glibc/malloc. I decided to
check whether the "memory pool" patch (which I've mentioned in the
memory limit thread as an alternative way to introduce backend-level
accounting/limits) could serve as a backend-level malloc cache, and how
that would work. So I cleaned up the PoC patch I already had (see
0003), and gave it a try.

With both patches applied, the results for the partitioned join with
100 partitions look like this:

  -M simple        1     8     16     32     64     96    160    224
  ------------------------------------------------------------------
  master         653  3957   7120  12022  12707  11813  10349   9633
  both patches   954  7356  14580  28259  51552  65278  70607  69598
  speedup        1.5   1.9    2.0    2.4    4.1    5.5    6.8    7.2

  -M prepared      1     8     16     32      64      96     160     224
  ----------------------------------------------------------------------
  master        1639  8273  14138  14746   13446   14001   11129   10136
  both patches  4792 30102  62208 122157  220984  267763  315632  323567
  speedup        2.9   3.6    4.4    8.3    16.4    19.1    28.4    31.9

That's pretty nice, I think. And I've seen many such improvements -
it's not a cherry-picked example. For the star join, the improvements
are very similar.

I'm attaching PDF files with tables visualizing the results for these
two benchmarks - there are results for different numbers of
partitions/scales and different builds (master, one or both of the
patches). There's also a comparison to master, with the color scale
"red = slower, green = faster" (but there's no red anywhere, not even
for low client counts).

It's also interesting that with just the 0003 patch applied, the change
is much smaller. It's as if the two bottlenecks (locking and malloc)
are in balance - if you only address one of them, you don't get much.
But if you address both, it flies.

FWIW where does the malloc overhead come from? For one, while we do
have some caching of malloc-ed memory in memory contexts, that doesn't
quite work across queries, because we destroy the contexts at the end
of each query. We attempt to cache the memory contexts too, but in this
case that can't help, because the allocations come from btbeginscan()
where we do this:

  so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt
it's the only such allocation.

I don't want to get into too much detail about the memory pool, but I
think it's something we should consider doing - I'm sure there's stuff
to improve, but caching the malloc-ed memory can clearly be very
beneficial. The basic idea is to have a cache that is "adaptive" (i.e.
adjusts to caching blocks of the sizes needed by the workload) but also
cheap - a rough sketch of the idea is below. The patch is PoC/WIP and
needs more work, but I think it works quite well. If anyone wants to
take a look or have a chat at FOSDEM, for example, I'm available.
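To illustrate the basic idea (this is not the actual 0003 patch, just a
minimal sketch under those assumptions): a per-backend cache of freed
blocks, bucketed by power-of-two size, with a cap on how much each
bucket may retain. The "adaptive" part - tuning the buckets to the
workload - is omitted here:

  #include <stdlib.h>

  #define POOL_MIN_LOG2   10    /* cache blocks from 1kB ... */
  #define POOL_MAX_LOG2   22    /* ... up to 4MB */
  #define POOL_NBUCKETS   (POOL_MAX_LOG2 - POOL_MIN_LOG2 + 1)
  #define POOL_MAX_CACHED 16    /* per-bucket cap, bounds retained memory */

  typedef struct PoolBlock
  {
      struct PoolBlock *next;
  } PoolBlock;

  static PoolBlock *pool_buckets[POOL_NBUCKETS];
  static int        pool_counts[POOL_NBUCKETS];

  /* smallest bucket whose block size is >= size */
  static int
  pool_bucket(size_t size)
  {
      int     lg = POOL_MIN_LOG2;

      while (((size_t) 1 << lg) < size)
          lg++;
      return lg - POOL_MIN_LOG2;
  }

  /* Round the request up to a bucket size, reuse a cached block if
   * there is one, and fall back to malloc() otherwise. */
  static void *
  pool_alloc(size_t size)
  {
      int     b;

      /* requests beyond the largest bucket pass through to malloc() */
      if (size > ((size_t) 1 << POOL_MAX_LOG2))
          return malloc(size);

      b = pool_bucket(size);
      if (pool_buckets[b] != NULL)
      {
          PoolBlock  *block = pool_buckets[b];

          pool_buckets[b] = block->next;
          pool_counts[b]--;
          return block;
      }
      return malloc((size_t) 1 << (b + POOL_MIN_LOG2));
  }

  /* Stash the block for reuse instead of calling free(), unless the
   * bucket is already full. */
  static void
  pool_free(void *ptr, size_t size)
  {
      int     b;

      if (size > ((size_t) 1 << POOL_MAX_LOG2))
      {
          free(ptr);
          return;
      }

      b = pool_bucket(size);
      if (pool_counts[b] < POOL_MAX_CACHED)
      {
          PoolBlock  *block = (PoolBlock *) ptr;

          block->next = pool_buckets[b];
          pool_buckets[b] = block;
          pool_counts[b]++;
      }
      else
          free(ptr);
  }

The memory-context code already knows the size of the block it's
freeing, so a pool_free(ptr, size) style of interface fits there
naturally, and both functions are just a few instructions in the
cache-hit case - which is what makes the cache cheap.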
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so
I tried running the benchmarks with LD_PRELOAD=jemalloc, and that
improves the behavior a lot - it gets us maybe ~80% of the mempool
benefits. Which is nice - it confirms the problem is glibc-specific (I
wonder if there's a way to tweak glibc to address this), and it also
means systems using jemalloc (e.g. FreeBSD, right?) don't have this
problem. But it also says the mempool has a ~20% benefit on top of
jemalloc.

FWIW there's another bottleneck people may not realize, and that's the
number of file descriptors. Once you get to >1000 relations, you can
easily get into a situation like this:

   54.18%  0.48%  postgres  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
           |
            --53.70%--entry_SYSCALL_64_after_hwframe
                      |
                       --53.03%--do_syscall_64
                                 |
                                 |--28.29%--__x64_sys_openat
                                 |          |
                                 |           --28.14%--do_sys_openat2
                                 |                     |
                                 |                     |--23.14%--do_filp_open
                                 |                     |          |
                                 |                     |           --22.72%--path_openat

That's pretty bad - it means we're closing and opening file descriptors
like crazy, because every query needs the files. If I increase the
number of file descriptors (both in ulimit and max_files_per_process)
to prevent this thrashing, I can increase the throughput ~5x. Of
course, this is not a bottleneck that we can "fix" in code - it's
simply a consequence of not having enough file descriptors etc. But I
wonder if we might make it easier to monitor, e.g. by tracking the fd
cache hit ratio, or something like that ...

There's a more complete set of benchmarking scripts and results for
these and other tests, in various formats (PDF, ODS, ...), at

  https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big EPYC
machine, but also smaller Intel machines (4C and 16C), and even two
rpi5s (yes, it helps even on rpi5, quite a bit).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company