Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers
From | James Coleman |
---|---|
Subject | Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays |
Date | |
Msg-id | CAAaqYe_Kvzjrt9iM=Uem4ej548ELGULnHZGKSKj0QHAoeuJmzA@mail.gmail.com Whole thread Raw |
In response to | Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays (James Coleman <jtc331@gmail.com>) |
Responses |
Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
|
List | pgsql-hackers |
On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote: > > On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote: > > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra > > ><tomas.vondra@2ndquadrant.com> wrote: > > >> > > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: > > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > > >> >> > > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I > > >> >> > think ran into mostly the same challenge of deciding when the bloom > > >> >> > filter can be useful and is worth the extra work. > > >> >> > > >> >> Speaking of that, it would be interesting to see how a test where you > > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would > > >> >> be interesting to know if the planner is able to make a more suitable > > >> >> choice and also to see how all the work over the years to improve Hash > > >> >> Joins compares to the bsearch with and without the bloom filter. > > >> > > > >> >It would be interesting. > > >> > > > >> >It also makes one wonder about optimizing these into to hash > > >> >joins...which I'd thought about over at [1]. I think it'd be a very > > >> >significant effort though. > > >> > > > >> > > >> I modified the script to also do the join version of the query. I can > > >> only run it on my laptop at the moment, so the results may be a bit > > >> different from those I shared before, but it's interesting I think. > > >> > > >> In most cases it's comparable to the binsearch/bloom approach, and in > > >> some cases it actually beats them quite significantly. It seems to > > >> depend on how expensive the comparison is - for "int" the comparison is > > >> very cheap and there's almost no difference. For "text" the comparison > > >> is much more expensive, and there are significant speedups. > > >> > > >> For example the test with 100k lookups in array of 10k elements and 10% > > >> match probability, the timings are these > > >> > > >> master: 62362 ms > > >> binsearch: 201 ms > > >> bloom: 65 ms > > >> hashjoin: 36 ms > > >> > > >> I do think the explanation is fairly simple - the bloom filter > > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus > > >> some overhead to build and check the bits. The hash join probably > > >> eliminates a lot of the remaining comparisons, because the hash table > > >> is sized to have one tuple per bucket. > > >> > > >> Note: I also don't claim the PoC has the most efficient bloom filter > > >> implementation possible. I'm sure it could be made faster. > > >> > > >> Anyway, I'm not sure transforming this to a hash join is worth the > > >> effort - I agree that seems quite complex. But perhaps this suggest we > > >> should not be doing binary search and instead just build a simple hash > > >> table - that seems much simpler, and it'll probably give us about the > > >> same benefits. > > > > > >That's actually what I originally thought about doing, but I chose > > >binary search since it seemed a lot easier to get off the ground. > > > > > > > OK, that makes perfect sense. > > > > >If we instead build a hash is there anything else we need to be > > >concerned about? For example, work mem? I suppose for the binary > > >search we already have to expand the array, so perhaps it's not all > > >that meaningful relative to that... > > > > > > > I don't think we need to be particularly concerned about work_mem. We > > don't care about it now, and it's not clear to me what we could do about > > it - we already have the array in memory anyway, so it's a bit futile. > > Furthermore, if we need to care about it, it probably applies to the > > binary search too. > > > > >I was looking earlier at what our standard hash implementation was, > > >and it seemed less obvious what was needed to set that up (so binary > > >search seemed a faster proof of concept). If you happen to have any > > >pointers to similar usages I should look at, please let me know. > > > > > > > I think the hash join implementation is far too complicated. It has to > > care about work_mem, so it implements batching, etc. That's a lot of > > complexity we don't need here. IMO we could use either the usual > > dynahash, or maybe even the simpler simplehash. > > > > FWIW it'd be good to verify the numbers I shared, i.e. checking that the > > benchmarks makes sense and running it independently. I'm not aware of > > any issues but it was done late at night and only ran on my laptop. > > Some quick calculations (don't have the scripting in a form I can > attach yet; using this as an opportunity to hack on a genericized > performance testing framework of sorts) suggest your results are > correct. I was also testing on my laptop, but I showed 1.) roughly > equivalent results for IN (VALUES ...) and IN (<list>) for integers, > but when I switch to (short; average 3 characters long) text values I > show the hash join on VALUES is twice as fast as the binary search. > > Given that, I'm planning to implement this as a hash lookup and share > a revised patch. I've attached a patch series as before, but with an additional patch that switches to using dynahash instead of binary search. Whereas before the benchmarking ended up with a trimodal distribution (i.e., master with IN <list>, patch with IN <list>, and either with IN VALUES), the hash implementation brings us back to an effectively bimodal distribution -- though the hash scalar array op expression implementation for text is about 5% faster than the hash join. Current outstanding thoughts (besides comment/naming cleanup): - The saop costing needs to be updated to match, as Tomas pointed out. - Should we be concerned about single execution cases? For example, is the regression of speed on a simple SELECT x IN something we should try to defeat by only kicking in the optimization if we execute in a loop at least twice? That might be of particular interest to pl/pgsql. - Should we have a test for an operator with a non-strict function? I'm not aware of any built-in ops that have that characteristic; would you suggest just creating a fake one for the test? James
Attachment
pgsql-hackers by date: