Experimenting with hash join prefetch - Mailing list pgsql-hackers

From:      Thomas Munro
Subject:   Experimenting with hash join prefetch
Msg-id:    CAEepm=2y9HM9QP+HhRZdQ3pU6FShSMyu=V1uHXhQ5gG-dketHg@mail.gmail.com
Responses: Re: Experimenting with hash join prefetch
List:      pgsql-hackers
Hello hackers,

I have a list of micro-optimisations and things to look into for hash
joins, which I've updated on the wiki[1].  Here's one that I was
inspired to poke at with a stick in a spare hour today.

Cache-oblivious hash joins cause a lot of TLB and cache misses.
Researchers tell us to reduce those using huge/super VM pages[2] and
cache prefetch instructions[3].  (There is another class of
cache-aware hash join algorithms that partition carefully up front to
avoid them; that's not us.)

Here is a totally contrived experiment that shows the effect quite
reproducibly here:

  shared_buffers = '1GB'

  create table t as select generate_series(1, 20000000)::int i;
  set max_parallel_workers_per_gather = 0;
  set work_mem = '2GB';
  select pg_prewarm('t'::regclass);
  select count(*) from t t1 join t t2 using (i);

  -> 00:12.683

First, let's try to prefetch the hash bucket for the next tuple while
computing the hash for the current tuple, since we can see into the
future quite easily: we know the keys are sequential integers in this
contrived experiment.  In ExecHashGetHashValue():

  + /* Prefetch the bucket for the next key */
  + uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
  + uint32 next_bucket = next_hash % hashtable->nbuckets;
  + __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);

  select count(*) from t t1 join t t2 using (i);

  -> 00:09.901

Huzzah!
Next up, let's try a two-stage prefetch pipeline for the probing
phase, seeing two steps ahead:

  + /* Prefetch the bucket for the tuple after next */
  + uint32 next_next_hash = hash_uint32(DatumGetInt32(keyval) + 2);
  + uint32 next_next_bucket = next_next_hash % hashtable->nbuckets;
  + __builtin_prefetch(&hashtable->buckets.unshared[next_next_bucket]);
  + if (outer_tuple)
  + {
  +     /* Prefetch the first tuple in the next bucket */
  +     uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
  +     uint32 next_bucket = next_hash % hashtable->nbuckets;
  +     __builtin_prefetch(hashtable->buckets.unshared[next_bucket]);
  + }

  -> 00:09.138

It's nice to see this effect, but it isn't really surprising: there is
no doubt about the value of prefetching randomly accessed data.  I
think we could probably do this for the build phase with the existing
tuple-at-a-time executor interface by doing the bucket insertions from
a queue that runs N tuples behind the one we're currently loading and
hashing.  Or something like that.  For the probe phase (probably much
more interesting) I think it'd involve extra tuple copying, so that we
could still access the last tuple while pulling the next tuple in to
hash its key.  To avoid that we'd need a facility for peeking at
future tuples, or a proper first-class batch mode.

[1] https://wiki.postgresql.org/wiki/Parallel_Hash
[2] https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf (table VI)
[3] http://www.cs.cmu.edu/~chensm/papers/hashjoin_icde04.pdf

--
Thomas Munro
http://www.enterprisedb.com