tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: tweaking NTUP_PER_BUCKET
Msg-id: 53B4A03F.3070409@fuzzy.cz
Responses: Re: tweaking NTUP_PER_BUCKET
           Re: tweaking NTUP_PER_BUCKET
List: pgsql-hackers
Hi,

while hacking on the 'dynamic nbucket' patch, scheduled for the next CF (https://commitfest.postgresql.org/action/patch_view?id=1494), I was repeatedly stumbling over NTUP_PER_BUCKET. I'd like to propose a change in how we handle it.

TL;DR version
-------------

I propose dynamically increasing nbuckets (up to NTUP_PER_BUCKET=1) once the table is built and there's free space in work_mem. The patch mentioned above makes implementing this rather simple.

Long version
------------

I've seen discussions of this on the list in the past, but I don't recall any numbers comparing the impact of different values. So I ran a simple test, measuring the duration of a simple hash join with different hashtable sizes and NTUP_PER_BUCKET values. In short, something like this:

    CREATE TABLE outer_table (id INT);
    CREATE TABLE inner_table (id INT, val_int INT, val_txt TEXT);

    INSERT INTO outer_table
         SELECT i FROM generate_series(1,50000000) s(i);
    INSERT INTO inner_table
         SELECT i, mod(i,10000), md5(i::text)
           FROM generate_series(1,10000000) s(i);

    ANALYZE outer_table;
    ANALYZE inner_table;

    -- force a hash join with a single batch
    SET work_mem = '1GB';
    SET enable_sort = off;
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;
    SET enable_mergejoin = off;
    SET enable_nestloop = off;

    -- query with various values in the WHERE condition, determining
    -- the size of the hash table (single batch, thanks to work_mem)
    SELECT COUNT(i.val_txt)
      FROM outer_table o
      LEFT JOIN inner_table i ON (i.id = o.id)
     WHERE val_int < 1000;

which leads to plans like this one: http://explain.depesz.com/s/i7W

I've repeated this with NTUP_PER_BUCKET values between 1 and 10 and various values in the WHERE condition, and got the following table, illustrating the impact of NTUP_PER_BUCKET for various hashtable sizes:

        kB    NTUP=1    NTUP=2    NTUP=4    NTUP=5    NTUP=9   NTUP=10
      1407      6753      7218      7890      9324      9488     12574
      7032      9417     11586     17417     15820     25732     25402
     35157     13179     17101     27225     24763     43323     43175
     70313     14203     18475     29354     26690     46840     46676
    175782     14319     17458     25325     34542     37342     60949
    351563     16297     19928     29124     43364     43232     70014
    703125     19027     23125     33610     50283     50165     81398

(I've skipped columns for NTUP values that were almost exactly the same as an adjacent column, because the bucket count grows by doubling.)

If you prefer charts to tables, attached are two showing the same data (complete, and "small hashtables" only). Also attached is a CSV with the results, tracking the number of buckets and the durations for each hashtable size / NTUP_PER_BUCKET combination.

I'm aware that the numbers and the exact behaviour depend on the hardware (e.g. L1/L2 cache sizes, RAM etc.), and that the query used to get these values is rather simple (at least compared to some real-world queries). But I believe the general behaviour won't vary too much.

The first observation is that for large hashtables the difference between NTUP=1 and NTUP=10 is much larger than for small ones. Not really surprising - probably due to better CPU cache hit ratios (see the probe sketch below). For large hashtables the difference is significant, easily beating the cost of rebuilding the hashtable (i.e. increasing nbuckets).

Therefore I'd like to propose dynamically increasing the bucket count. The patch mentioned at the beginning does pretty much exactly this for plans with a significantly underestimated hashtable cardinality (which causes an inappropriately low bucket count and poor performance of the join).
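To illustrate where the NTUP=1 vs. NTUP=10 difference comes from, here is a minimal sketch of the probe loop. The names (HashTuple, probe_bucket) are hypothetical, not the actual nodeHash.c structures: each bucket holds a linked list of tuples, so the number of tuples per bucket directly scales the number of pointer hops, and thus likely cache misses, per probe.

    #include <stddef.h>

    /*
     * Hypothetical sketch of a hash join probe (illustrative names, not
     * the actual nodeHash.c structures).  The inner loop walks the
     * bucket's chain, so its length, i.e. tuples per bucket, directly
     * scales the per-probe cost, and each hop is a likely cache miss.
     */
    typedef struct HashTuple
    {
        struct HashTuple *next;         /* next tuple in the same bucket */
        unsigned int      hashvalue;    /* hash saved at insert time */
        /* ... tuple data would follow ... */
    } HashTuple;

    static HashTuple *
    probe_bucket(HashTuple **buckets, int nbuckets, unsigned int hashvalue)
    {
        /* nbuckets is a power of two, so masking picks the bucket */
        HashTuple  *tuple = buckets[hashvalue & (nbuckets - 1)];

        /* walk the chain; longer chains mean more comparisons/misses */
        for (; tuple != NULL; tuple = tuple->next)
        {
            if (tuple->hashvalue == hashvalue)
                return tuple;   /* caller re-checks the actual join keys */
        }
        return NULL;
    }

With NTUP_PER_BUCKET=1 the loop typically terminates after touching a single node; with 10 it walks up to ~10 nodes scattered through memory, which matches the gap growing with hashtable size in the table above.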
The patch, however, also makes a dynamic approach to NTUP_PER_BUCKET possible, so I propose a change along these lines:

1) Consider lowering NTUP_PER_BUCKET (see the measurements above).

2) Do the initial nbuckets/nbatch estimation as today (with whatever NTUP_PER_BUCKET value we end up using) and build the hashtable.

3) While building the hashtable, include the space for buckets in spaceUsed when deciding whether to increase the number of batches (the fact that it's not included already seems wrong to me). But instead of using the initial bucket count, do this:

    if (hashtable->spaceUsed + BUCKET_SPACE(hashtable) > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

where BUCKET_SPACE(hashtable) is the space needed for as many buckets as are appropriate for the current number of tuples per batch.

4) Once the table is built, consider what the "appropriate" number of buckets is, and rebuild the table if needed. This is where the patch fixes the underestimated nbuckets value, but we can also look at how much free space remains in work_mem and use as many buckets as fit into it (capped by the number of tuples per batch, of course). A sketch of this step follows below.

Even if the initial cardinality estimate was correct (so we're at or below NTUP_PER_BUCKET), this may be a significant speedup. Say we have 10M tuples in the table - with the default NTUP_PER_BUCKET this means ~1M buckets. But there's a significant performance difference between a hashtable with 1 and one with 10 tuples per bucket (see the table above), so if we have 80MB in work_mem (instead of the default 8MB), this may be a huge gain.

Any comments / objections are welcome.

regards
Tomas
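PS: to make step 4 more concrete, here is a minimal sketch of the post-build rebuild, reusing the hypothetical HashTuple from the probe sketch above (again illustrative names and bookkeeping, not the actual HashJoinTable code). It doubles nbuckets while we're above one tuple per bucket and the larger bucket array still fits into the remaining work_mem, then relinks every tuple; since the hash value is stored with each tuple, the rebuild is just a pointer shuffle, with no re-hashing needed.

    #include <stdlib.h>

    /*
     * Sketch of step 4: grow the bucket array once the table is built.
     * HashTuple is the hypothetical struct from the probe sketch above.
     * All sizes are in bytes; space_allowed plays the role of work_mem.
     */
    static HashTuple **
    increase_nbuckets(HashTuple **buckets, int *nbuckets, double ntuples,
                      long space_used, long space_allowed)
    {
        int         new_nbuckets = *nbuckets;
        HashTuple **new_buckets;
        int         i;

        /* double while above 1 tuple/bucket and the new array still fits */
        while (ntuples / new_nbuckets > 1.0 &&
               space_used + 2L * new_nbuckets * (long) sizeof(HashTuple *)
                   <= space_allowed)
            new_nbuckets *= 2;

        if (new_nbuckets == *nbuckets)
            return buckets;     /* no free space, or already ~1 tuple/bucket */

        new_buckets = calloc(new_nbuckets, sizeof(HashTuple *));
        if (new_buckets == NULL)
            return buckets;     /* sketch only: keep the old table on OOM */

        /* relink all tuples, using the hash value saved at insert time */
        for (i = 0; i < *nbuckets; i++)
        {
            HashTuple *tuple = buckets[i];

            while (tuple != NULL)
            {
                HashTuple  *next = tuple->next;
                int         b = tuple->hashvalue & (new_nbuckets - 1);

                tuple->next = new_buckets[b];
                new_buckets[b] = tuple;
                tuple = next;
            }
        }

        free(buckets);
        *nbuckets = new_nbuckets;
        return new_buckets;
    }

The relink pass is a single sequential scan over the old buckets, which is the rebuild cost that the measurements above suggest is easily repaid for large hashtables.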