Hash joins vs small-integer join values - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Hash joins vs small-integer join values |
| Date | |
| Msg-id | 11487.1180640285@sss.pgh.pa.us |
| Responses | Re: Hash joins vs small-integer join values |
| | Re: Hash joins vs small-integer join values |
| List | pgsql-hackers |
I was idly thinking about Joseph Shraibman's problem here:
http://archives.postgresql.org/pgsql-general/2007-05/msg01011.php
in which a large hash join seemed to be blowing out memory. By chance I tried the following test case:

```sql
js=# create table ml (jid int);
CREATE TABLE
js=# insert into ml select random()*1000 from generate_series(1,185391404);
INSERT 0 185391404
js=# create table tempjr (id int);
CREATE TABLE
js=# insert into tempjr select random()*1000 from generate_series(1,60000);
INSERT 0 60000
js=# analyze ml;
ANALYZE
js=# select count(*) from tempjr join ml on (jid=id) group by jid;
```

Since I hadn't remembered to increase work_mem beyond the default, this set up a hash join with 4111 buckets in each of 8192 batches, which didn't seem too awfully unreasonable, so I let it go. Imagine my horror as I watched it stuff all 185 million ml rows into batch 4365. Naturally, when it got to trying to process that batch, the in-memory hashtable blew out real good.

I'm not certain this is what happened to Joseph, since I don't know the stats of his jid column, but in any case it's got to be fixed. Hash join is a probabilistic algorithm, so there will always be some input distributions for which it sucks, but I don't think we can tolerate "uniformly distributed on the integers 0..N" as being one of them.

The problem comes from the rather simplistic assignment of bucket and batch numbers in ExecHashGetBucketAndBatch():

```c
 * Note: on-the-fly increases of nbatch must not change the bucket number
 * for a given hash code (since we don't move tuples to different hash
 * chains), and must only cause the batch number to remain the same or
 * increase.  Our algorithm is
 *     bucketno = hashvalue MOD nbuckets
 *     batchno = (hashvalue DIV nbuckets) MOD nbatch
 * where nbuckets should preferably be prime so that all bits of the
 * hash value can affect both bucketno and batchno.
 * nbuckets doesn't change over the course of the join.
```

This would be fine if the hashvalues were reasonably randomly distributed over all uint32 values, but take a look at hashint4 --- it's just a one's-complement:

```c
Datum
hashint4(PG_FUNCTION_ARGS)
{
    PG_RETURN_UINT32(~PG_GETARG_UINT32(0));
}
```

Two inputs that differ by 1 will have hash values also differing by 1. Therefore, in my test case with 4111 buckets, consecutive ranges of 4111 input values map to the same batch --- different buckets in the batch, but the same batch. My example with inputs 0..999 would have mapped to either 1 or 2 batches depending on luck. With a more realistic work_mem, nbuckets would have been larger, making this problem worse not better.

8.1 and up are broken this way; in 8.0 and before we were calculating the batch number in a different way that doesn't seem vulnerable to this particular failure mode.

Arguably, the problem here is a chintzy hash function, and we should fix it by making the integer hash functions use hash_any(). I'm inclined to do that for 8.3. The problem is that this is not a back-patchable answer, because changing the hash functions would corrupt existing hash indexes.

The best idea I can come up with for the back branches is to make ExecHashGetBucketAndBatch do hash_any internally, say

```diff
 	if (nbatch > 1)
 	{
 		*bucketno = hashvalue % nbuckets;
 		/* since nbatch is a power of 2, can do MOD by masking */
-		*batchno = (hashvalue / nbuckets) & (nbatch - 1);
+		*batchno = hash_any(&hashvalue, sizeof(int32)) & (nbatch - 1);
 	}
 	else
 	{
 		*bucketno = hashvalue % nbuckets;
 		*batchno = 0;
 	}
```

Comments, better ideas?

			regards, tom lane
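To make the failure mode concrete, here is a standalone C sketch (an illustration added for this writeup, not code from the thread) that replays the batchno computation for keys 0..999 under the test case's constants, and then again after rehashing the hash value. The mix32() function is a hypothetical stand-in with decent avalanche behavior, not PostgreSQL's actual hash_any():

```c
/*
 * Standalone sketch (not PostgreSQL source): replay the batchno
 * computation above for keys 0..999, using the test case's
 * nbuckets = 4111 and nbatch = 8192.  mix32() is an illustrative
 * stand-in finalizer, NOT PostgreSQL's hash_any(); it is only here
 * to show what rehashing hashvalue buys.
 */
#include <stdio.h>
#include <stdint.h>

#define NBUCKETS 4111
#define NBATCH   8192           /* power of 2, as in nodeHash.c */

/* hashint4 as quoted above: just a one's-complement */
static uint32_t hashint4(uint32_t k) { return ~k; }

/* hypothetical stand-in for hash_any(): a generic 32-bit finalizer */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

int main(void)
{
    static int seen_div[NBATCH], seen_mix[NBATCH];
    int ndiv = 0, nmix = 0;

    for (uint32_t key = 0; key < 1000; key++)
    {
        uint32_t hashvalue = hashint4(key);

        /*
         * 8.1-style assignment: ~key for key in 0..999 gives 1000
         * consecutive hashvalues just below 2^32, so their quotients
         * by 4111 take at most two distinct values.
         */
        uint32_t b_div = (hashvalue / NBUCKETS) & (NBATCH - 1);

        /* proposed back-branch style: rehash before taking batchno */
        uint32_t b_mix = mix32(hashvalue) & (NBATCH - 1);

        if (!seen_div[b_div]++) ndiv++;
        if (!seen_mix[b_mix]++) nmix++;
    }

    printf("keys 0..999 -> %d batch(es) via DIV, %d via rehash\n",
           ndiv, nmix);
    return 0;
}
```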
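With those constants the DIV assignment lands all 1000 keys in just two adjacent batches, most of them in batch 4365, matching the symptom above: 4294967295 / 4111 = 1044750 with remainder 45, so keys 46..999 get quotient 1044749, and 1044749 & 8191 = 4365. The rehashed batchno instead spreads the keys across several hundred of the 8192 batches. Note that the sketch, like the proposed patch, rehashes only the batch number: bucketno stays hashvalue % nbuckets, and because the rehash is masked by a power-of-two nbatch, an on-the-fly doubling of nbatch still leaves each tuple's batch number the same or increased, preserving the invariant in the comment block.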