Re: Optimize numeric.c mul_var() using the Karatsuba algorithm - Mailing list pgsql-hackers
From: Dean Rasheed
Subject: Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
Date:
Msg-id: CAEZATCXQnJO-GbV-8QPWsjRuBodoFDbbbinRKANFUazoFW=YBQ@mail.gmail.com
In response to: Re: Optimize numeric.c mul_var() using the Karatsuba algorithm (Michael Paquier <michael@paquier.xyz>)
Responses:
  Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
  Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
  Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
List: pgsql-hackers
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
>

Something to watch out for is that not all callers of mul_var() want an exact result. Several internal callers request an approximate result by passing it an rscale value less than the sum of the input dscales. The schoolbook algorithm handles that by computing up to rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas the new Karatsuba code always computes the full result and then rounds. That impacts the performance of various functions, for example:

select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;

Time: 1790.825 ms (00:01.791)  [HEAD]
Time: 2161.244 ms (00:02.161)  [with patch]

Looking at mul_var_karatsuba_half(), I don't really like the approach it takes. The whole correctness proof using the Karatsuba formula seems to somewhat miss the point that this function isn't actually implementing the Karatsuba algorithm, it is implementing the schoolbook algorithm in two steps, by splitting the longer input into two pieces.

But why just split it into two pieces? That will just lead to a lot of unnecessary recursion for very unbalanced inputs. Instead, why not split the longer input into N roughly equal sized pieces, each around the same length as the shorter input, multiplying and adding them at the appropriate offsets? (There is a standalone sketch of this chunked approach below.)

As an example, given inputs with var1ndigits = 1000 and var2ndigits = 10000, mul_var() will invoke mul_var_karatsuba_half(), which will then recursively invoke mul_var() twice with var1ndigits = 1000 and var2ndigits = 5000, which no longer satisfies KARATSUBA_CONDITION(), so it will just invoke the schoolbook algorithm on each half, which stands no chance of being any faster. On the other hand, if it divided var2 into 10 chunks of length 1000, it would invoke the Karatsuba algorithm on each chunk, which would at least stand a chance of being faster.

Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to make a lot of sense. For inputs with var1ndigits between 128 and 2000, and var2ndigits > 9000, this condition will pass and it will recursively break up the longer input into smaller and smaller pieces until eventually that condition no longer passes, but none of the other conditions in KARATSUBA_CONDITION() will pass either, so it'll just invoke the schoolbook algorithm on each piece, which is bound to be slower once all the overheads are taken into account. For example, given var1ndigits = 200 and var2ndigits = 30000, KARATSUBA_CONDITION() will pass due to KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse with var1ndigits = 200 and var2ndigits = 15000, and then again with var1ndigits = 200 and var2ndigits = 7500, at which point KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half() implemented as it is, that is bound to happen, because each half will end up having var2ndigits between 4500 and 9000, which fails KARATSUBA_CONDITION() if var1ndigits < 2000. If mul_var_karatsuba_half() was replaced by something that recursed with more balanced chunks, then it might make more sense, though allowing values of var1ndigits down to 128 doesn't make sense, since the Karatsuba algorithm will never be invoked for inputs shorter than 384.
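To make the chunked idea concrete, here is a small standalone sketch. It is deliberately not numeric.c code: it uses base-10000 digits stored least significant first (numeric.c stores them the other way round), and the function names are made up. It multiplies the shorter operand against the longer one in chunks no longer than the shorter operand, adding each partial product at its offset; in mul_var(), each of those roughly balanced chunk multiplications would be the natural place to call the Karatsuba code:

#include <stdint.h>
#include <stdio.h>

/*
 * Toy model: digits are base-10000, least significant first.  acc must
 * have na + nb zeroed slots; 64-bit accumulators leave ample headroom
 * before carries need to be propagated.
 */
static void
mul_schoolbook(const uint16_t *a, int na,
               const uint16_t *b, int nb, uint64_t *acc)
{
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            acc[i + j] += (uint64_t) a[i] * b[j];
}

/*
 * Multiply the short operand against the long one in chunks no longer
 * than the short operand, adding each partial product at its offset.
 * Each chunk multiplication is roughly balanced, so this is where a
 * Karatsuba routine could be called instead of mul_schoolbook().
 */
static void
mul_chunked(const uint16_t *a, int na,
            const uint16_t *b, int nb, uint64_t *acc)
{
    for (int off = 0; off < nb; off += na)
    {
        int     chunk = (nb - off < na) ? (nb - off) : na;

        mul_schoolbook(a, na, b + off, chunk, acc + off);
    }
}

static void
propagate_carries(const uint64_t *acc, uint16_t *digits, int ndigits)
{
    uint64_t    carry = 0;

    for (int i = 0; i < ndigits; i++)
    {
        carry += acc[i];
        digits[i] = (uint16_t) (carry % 10000);
        carry /= 10000;
    }
}

int
main(void)
{
    /* 56781234 times 9999000000000001, least-significant digit first */
    uint16_t    a[] = {1234, 5678};
    uint16_t    b[] = {1, 0, 0, 9999};
    uint64_t    acc[6] = {0};
    uint16_t    res[6];

    mul_chunked(a, 2, b, 4, acc);
    propagate_carries(acc, res, 6);

    for (int i = 5; i >= 0; i--)
        printf("%04u", (unsigned) res[i]);
    putchar('\n');              /* prints 567755558766000056781234 */

    return 0;
}

With var1ndigits = 200 and var2ndigits = 30000, this shape of loop would do 150 multiplications of 200 x 200 digits at increasing offsets, rather than recursing into ever more unbalanced halves.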
Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that var2ndigits > 2500 seems to be redundant. If var1ndigits > 2000 and var2ndigits < 2500, then KARATSUBA_LOW_RANGE_CONDITION() is satisfied, so these tests could be simplified, eliminating some of those magic constants.

However, I really don't like having these magic constants at all, because in practice the threshold above which the Karatsuba algorithm is a win can vary depending on a number of factors, such as whether it's running on 32-bit or 64-bit, whether or not SIMD instructions are available, the relative timings of CPU instructions, the compiler options used, and probably a bunch of other things. The last time I looked at the Java source code, for example, they had separate thresholds for 32-bit and 64-bit platforms, and even that's probably too crude. Some numeric libraries tune the thresholds for a large number of different platforms, but that takes a lot of effort.

I think a better approach would be to have a configurable threshold. Ideally, this would be just one number, with all other numbers being derived from it, possibly using some simple heuristic to reduce the effective threshold for more balanced inputs, for which the Karatsuba algorithm is more efficient. (A rough sketch of that idea is at the end of this message.) Having a configurable threshold would allow people to tune for best performance on their own platforms, and also it would make it easier to write tests that hit the new code. As it stands, it's not obvious how much of the new code is being hit by the existing tests.

Doing a quick test on my machine, using random equal-length inputs of various sizes, I got the following performance results:

 digits | rate (HEAD)   | rate (patch)  | change
--------+---------------+---------------+--------
     10 | 6.060014e+06  | 6.0189365e+06 |  -0.7%
    100 | 2.7038752e+06 | 2.7287925e+06 |  +0.9%
   1000 | 88640.37      | 90504.82      |  +2.1%
   1500 | 39885.23      | 41041.504     |  +2.9%
   1600 | 36355.24      | 33368.28      |  -8.2%
   2000 | 23308.582     | 23105.932     |  -0.9%
   3000 | 10765.185     | 11360.11      |  +5.5%
   4000 | 6118.2554     | 6645.4116     |  +8.6%
   5000 | 3928.4985     | 4639.914      | +18.1%
  10000 | 1003.80164    | 1431.9335     | +42.7%
  20000 | 255.46135     | 226.53398     | +78.6%
  30000 | 110.69313     | 226.53398     | +104.7%
  40000 | 62.29333      | 148.12916     | +137.8%
  50000 | 39.867493     | 95.16788      | +138.7%
  60000 | 27.7672       | 74.01282      | +166.5%

The Karatsuba algorithm kicks in at 384*4 = 1536 decimal digits, so presumably the variations below that are just noise, but this does seem to suggest that KARATSUBA_BASE_LIMIT = 384 is too low for me, and I'd probably want it to be something like 500-700.

There's another complication though (if the threshold is made configurable): the various numeric functions that use mul_var() are immutable, which means that the results from the Karatsuba algorithm must match those from the schoolbook algorithm exactly, for all inputs. That's not currently the case when computing approximate results with a reduced rscale. That's fixable, but it's not obvious whether or not the Karatsuba algorithm can actually be made beneficial when computing such approximate results.

There's a wider question as to how many people use such big numeric values -- i.e., how many people are actually going to benefit from this? I don't have a good feel for that.
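As a rough illustration of the single-threshold idea (this is not from the patch; the variable name, the derivation and the scaling factor are made up and would need benchmarking), the decision could look something like the following, assuming var1 is the shorter input as mul_var() arranges:

#include <stdbool.h>

/*
 * Hypothetical single tuning knob (settable at build time, or via a GUC);
 * everything else is derived from it.  384 matches the patch's current
 * KARATSUBA_BASE_LIMIT, but the whole point is that it could be changed
 * per platform.
 */
static int  karatsuba_threshold = 384;

/*
 * Hypothetical replacement for KARATSUBA_CONDITION(): require the shorter
 * input to exceed an effective threshold that grows as the inputs become
 * more unbalanced, since Karatsuba pays off soonest for balanced inputs.
 * Assumes var1ndigits >= 1; the divisor of 4 is an arbitrary placeholder.
 */
static bool
use_karatsuba(int var1ndigits, int var2ndigits)
{
    int     ratio = var2ndigits / var1ndigits;      /* >= 1 */
    int     effective = karatsuba_threshold +
        karatsuba_threshold * (ratio - 1) / 4;

    return var1ndigits >= effective;
}

One side benefit of something like this is that the threshold could be temporarily lowered so that the regression tests exercise the Karatsuba path with small inputs.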
Regards,
Dean