Thread: Speed up Hash Join by teaching ExprState about hashing
In master, if you look at ExecHashGetHashValue() in nodeHash.c, you can see that it calls ExecEvalExpr() and then manually calls the hash function on the returned value. This process is repeated once for each hash key. This is inefficient for a few reasons:

1) ExecEvalExpr() will only deform tuples up to the max varattno that's mentioned in the hash key. That means we might have to deform attributes in multiple steps, once for each hash key.
2) ExecHashGetHashValue() is very branchy and checks hashStrict[] and keep_nulls on every loop. There's also a branch to check which hash functions to use.
3) foreach isn't exactly the pinnacle of efficiency either.

All of the above points can be improved by making ExprState handle hashing. This means we'll deform all attributes that are needed for hashing once, rather than incrementally, once per key. It also allows JIT compilation of hashing ExprStates, which will make things even faster.

The attached patch implements this. Here are some performance numbers.

## Test 1: rows=1000 jit=0

1 hash key
master = 4938.5 tps
patched = 5126.7 tps (+3.81%)

2 hash keys
master = 4326.4 tps
patched = 4520.2 tps (+4.48%)

3 hash keys
master = 4145.5 tps
patched = 4559.7 tps (+9.99%)

## Test 2: rows=1000000 jit=1 (with opt and inline)

1 hash key
master = 3.663 tps
patched = 3.816 tps (+4.16%)

2 hash keys
master = 3.392 tps
patched = 3.550 tps (+4.67%)

3 hash keys
master = 3.086 tps
patched = 3.411 tps (+10.55%)

Benchmark script attached.

Notes:

The ExecBuildHash32Expr() function to build the ExprState isn't called from the same location as the previous ExecInitExprList() code. The reason is that it's not possible to build the ExprState for hashing in ExecInitHash(), because we don't yet know the join type there, and we need to know it since the expression built by ExecBuildHash32Expr() needs to allow NULLs for outer join types. I've put the ExecBuildHash32Expr() call in ExecInitHashJoin(), just after we set the hj_NullOuterTupleSlot and hj_NullInnerTupleSlot fields.

I tried having this code in ExecHashTableCreate(), but that's no good, as we only call that during executor run, which is too late: any SubPlans in the hash keys need to be attributed to the correct parent, and since EXPLAIN shows the subplans, this needs to be done before executor run.

I've not hacked on llvmjit_expr.c much before, so I'd be happy for a detailed review of that code. I manually checked the hash values between JIT and non-JIT; they matched.

If we ever consider JITting more granularly, it might be worth always applying the same jit flags to the hash exprs on either side of the join. I have slight concerns about compiler bugs producing different hash codes, though I'm unsure if there are non-bug reasons for them to differ on the same CPU architecture.

I've not looked at applications of this beyond Hash Join; I'm considering other executor nodes to be follow-on material.

Thanks to Andres Freund for mentioning this idea to me.
Attachment
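To make the single-pass hashing idea concrete, here's a minimal standalone sketch (illustrative only, not code from the patch) of how per-key 32-bit hashes can be folded into one value once all key attributes have been evaluated. The rotate-then-combine step mirrors the pg_rotate_left32() call visible in the patch excerpt quoted later in the thread; the XOR combine and the hash_key() helper are stand-ins assumed for illustration, and the seed parameter anticipates the hash_iv discussion in the next message.

    /* Illustrative sketch only; not PostgreSQL executor code. */
    #include <stdint.h>
    #include <stdio.h>

    /* stand-in for pg_rotate_left32() */
    static inline uint32_t rotl32(uint32_t x, int n)
    {
        return (x << n) | (x >> (32 - n));
    }

    /* hypothetical per-key hash; the real code calls the key type's hash function */
    static uint32_t hash_key(uint32_t datum)
    {
        return datum * 0x9E3779B1u;
    }

    /* fold already-evaluated key values into a single 32-bit hash in one pass */
    static uint32_t combine_hashes(const uint32_t *keys, int nkeys, uint32_t seed)
    {
        uint32_t    hash = seed;

        for (int i = 0; i < nkeys; i++)
        {
            hash = rotl32(hash, 1);     /* rotate the running value */
            hash ^= hash_key(keys[i]);  /* mix in this key's hash */
        }
        return hash;
    }

    int main(void)
    {
        uint32_t    keys[] = {42, 7, 19};

        printf("%08x\n", combine_hashes(keys, 3, 0));
        return 0;
    }

The point of expressing this as steps of a single ExprState is that the tuple only needs to be deformed once for all keys, and the whole per-key loop becomes a candidate for JIT compilation.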
On Mon, 13 May 2024 at 21:23, David Rowley <dgrowleyml@gmail.com> wrote:
> In master, if you look at ExecHashGetHashValue() in nodeHash.c, you
> can see that it calls ExecEvalExpr() and then manually calls the hash
> function on the returned value. This process is repeated once for
> each hash key. This is inefficient for a few reasons:
>
> 1) ExecEvalExpr() will only deform tuples up to the max varattno
> that's mentioned in the hash key. That means we might have to deform
> attributes in multiple steps, once for each hash key.
> 2) ExecHashGetHashValue() is very branchy and checks hashStrict[] and
> keep_nulls on every loop. There's also a branch to check which hash
> functions to use.
> 3) foreach isn't exactly the pinnacle of efficiency either.
>
> All of the above points can be improved by making ExprState handle
> hashing. This means we'll deform all attributes that are needed for
> hashing once, rather than incrementally, once per key. It also allows
> JIT compilation of hashing ExprStates, which will make things even
> faster.
>
> The attached patch implements this. Here are some performance numbers.

I've been doing a bit more work on this to start adding support for faster hashing for hashing needs other than Hash Join.

In the attached, I've added support for giving the hash value an initial value. Support for that is required to allow Hash Aggregate to work. If you look at what's being done now inside BuildTupleHashTableExt(), you'll see that "hash_iv" exists there to allow an initial hash value. This seems to be used to allow some variation in the hash values calculated inside parallel workers, per hashtable->hash_iv = murmurhash32(ParallelWorkerNumber). One of my aims for this patch is to always produce the same hash value before and after the patch, so I've implemented the equivalent functionality, which can be enabled or disabled as required, depending on the use case.

I've not added support for Hash Aggregate quite yet. I did look at doing that, but it seems to need quite a bit of refactoring to do nicely. The problem is that BuildTupleHashTableExt() receives keyColIdx with the attribute numbers to hash, whereas the new ExecBuildHash32Expr() function requires a List of Exprs. The keyColIdx array comes directly from the planner, which is many layers up, and changing that would mean a lot of churn in function signatures. While I could form Vars using the keyColIdx array to populate the required List of Exprs, I so far can't decide where exactly that should happen. I think the planner should probably form the Expr List; it seems a bit strange to be doing makeVar() in the executor.

I currently think it's fine to speed up Hash Join as phase one of this patch. I can work on improving hash value generation in other locations later.

I'd be happy if someone else were to give this patch a review and test.

One part I struggled a bit with was finding a way to cast the Size variable down to uint32 in LLVM. I tried to add a new supported type for uint32 but just couldn't get it to work. Instead, I did:

v_tmp1 = LLVMBuildAnd(b, v_tmp1, l_sizet_const(0xffffffff), "");

which works and, I imagine, compiles to the same code as a cast. It just looks a bit strange.

David
Attachment
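On the Size-to-uint32 narrowing: a possibly cleaner spelling, sketched below against the plain llvm-c API and not tested inside llvmjit_expr.c, is LLVMBuildTrunc(), which expresses the cast directly; the AND with 0xffffffff should generate the same machine code, so this is only a readability question. The helper name and the use of the global-context LLVMInt32Type() here are assumptions for the sketch; real llvmjit code would presumably want the i32 type from its own LLVMContext.

    /* Sketch only: narrow an i64/Size-width LLVM value to 32 bits via a cast. */
    #include <llvm-c/Core.h>

    LLVMValueRef
    narrow_to_uint32(LLVMBuilderRef b, LLVMValueRef v_size)
    {
        /* equivalent to LLVMBuildAnd(b, v_size, l_sizet_const(0xffffffff), "") */
        return LLVMBuildTrunc(b, v_size, LLVMInt32Type(), "v_hash32");
    }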
Hello David,

you wrote:
> v4 patch attached. If nobody else wants to look at this then I'm
> planning on pushing it soon.

Had a very brief look at this, and this bit caught my attention:

+			EEO_CASE(EEOP_HASHDATUM_NEXT32_STRICT)
+			{
+				FunctionCallInfo fcinfo = op->d.hashdatum.fcinfo_data;
+				uint32		existing_hash = DatumGetUInt32(*op->resvalue);
+				uint32		hashvalue;
+
+				/* combine successive hash values by rotating */
+				existing_hash = pg_rotate_left32(existing_hash, 1);
+
+				if (fcinfo->args[0].isnull)
+				{

Is it necessary to rotate existing_hash here before checking for isnull? Because in the case of isnull, isn't the result of the rotate thrown away? Or, in other words, maybe this bit here can be moved to after the isnull check:

+				/* combine successive hash values by rotating */
+				existing_hash = pg_rotate_left32(existing_hash, 1);

--
Best regards,

Tels
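For clarity, the reordering suggested above would look roughly like this (a sketch of the quoted step only, with the NULL-key handling and the hash-function call elided since the excerpt cuts off there), so the rotate is only paid for when its result is actually used:

			EEO_CASE(EEOP_HASHDATUM_NEXT32_STRICT)
			{
				FunctionCallInfo fcinfo = op->d.hashdatum.fcinfo_data;
				uint32		existing_hash = DatumGetUInt32(*op->resvalue);

				if (fcinfo->args[0].isnull)
				{
					/* ... NULL key handling as in the patch (elided) ... */
				}
				else
				{
					/* combine successive hash values by rotating */
					existing_hash = pg_rotate_left32(existing_hash, 1);

					/* ... call the key's hash function and combine (elided) ... */
				}
			}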
On Mon, 19 Aug 2024 at 18:41, David Rowley <dgrowleyml@gmail.com> wrote:
> The attached v5 patch includes this change.

I made a few more tweaks to the comments and pushed the result.

Thank you both for having a look at this.

David