Re: GIN improvements part2: fast scan - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: GIN improvements part2: fast scan |
Date | |
Msg-id | 53208532.8090005@vmware.com Whole thread Raw |
In response to | Re: GIN improvements part2: fast scan (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: GIN improvements part2: fast scan
|
List | pgsql-hackers |
On 02/26/2014 11:25 PM, Alexander Korotkov wrote: > On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov <aekorotkov@gmail.com>wrote: > >> On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas < >> hlinnakangas@vmware.com> wrote: >> >>> On 02/09/2014 12:11 PM, Alexander Korotkov wrote: >>> >>>> I've rebased catalog changes with last master. Patch is attached. I've >>>> rerun my test suite with both last master ('committed') and attached >>>> patch ('ternary-consistent'). >>>> >>> >>> Thanks! >>> >>> >>> method | sum >>>> ------------------------+------------------ >>>> committed | 143491.715000001 >>>> fast-scan-11 | 126916.111999999 >>>> fast-scan-light | 137321.211 >>>> fast-scan-light-heikki | 138168.028000001 >>>> master | 446976.288 >>>> ternary-consistent | 125923.514 >>>> >>>> I explain regression in last master by change of MAX_MAYBE_ENTRIES from 8 >>>> to 4. >>>> >>> >>> Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make >>> sure we get similar behavior in Tomas' tests that used 6 search terms. But >>> I always felt that it was too large for real queries, once we have the >>> catalog changes, that's why I lowered to 4 when committing. If an opclass >>> benefits greatly from fast scan, it should provide the ternary consistent >>> function, and not rely on the shim implementation. >>> >>> >>> I'm not sure about decision to reserve separate procedure number for >>>> ternary consistent. Probably, it would be better to add ginConfig method. >>>> It would be useful for post 9.4 improvements. >>>> >>> >>> Hmm, it might be useful for an opclass to provide both, a boolean and >>> ternary consistent function, if the boolean version is significantly more >>> efficient when all the arguments are TRUE/FALSE. OTOH, you could also do a >>> quick check through the array to see if there are any MAYBE arguments, >>> within the consistent function. But I'm inclined to keep the possibility to >>> provide both versions. As long as we support the boolean version at all, >>> there's not much difference in terms of the amount of code to support >>> having them both for the same opclass. >>> >>> A ginConfig could be useful for many other things, but I don't think it's >>> worth adding it now. >>> >>> >>> What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck? >>> We discussed that earlier, but didn't reach any conclusion. That needs to >>> be clarified in the docs. One possibility is to document that they're >>> equivalent. Another is to forbid one of them. Yet another is to assign a >>> different meaning to each. >>> >>> I've been thinking that it might be useful to define them so that a MAYBE >>> result from the tri-consistent function means that it cannot decide if you >>> have a match or not, because some of the inputs were MAYBE. And >>> TRUE+recheck means that even if all the MAYBE inputs were passed as TRUE or >>> FALSE, the result would be the same, TRUE+recheck. The practical difference >>> would be that if the tri-consistent function returns TRUE+recheck, ginget.c >>> wouldn't need to bother fetching the other entries, it could just return >>> the entry with recheck=true immediately. While with MAYBE result, it would >>> fetch the other entries and call tri-consistent again. ginget.c doesn't >>> currently use the tri-consistent function that way - it always fetches all >>> the entries for a potential match before calling tri-consistent, but it >>> could. I had it do that in some of the patch versions, but Tomas' testing >>> showed that it was a big loss on some queries, because the consistent >>> function was called much more often. Still, something like that might be >>> sensible in the future, so it might be good to distinguish those cases in >>> the API now. Note that ginarrayproc is already using the return values like >>> that: in GinContainedStrategy, it always returns TRUE+recheck regardless of >>> the inputs, but in other cases it uses GIN_MAYBE. >> >> >> Next revision of patch is attached. >> >> In this version opclass should provide at least one consistent function: >> binary or ternary. It's expected to achieve best performance when opclass >> provide both of them. However, tests shows opposite :( I've to recheck it >> carefully. >> > > However, it's not! > This revision of patch completes my test case in 123330 ms. This is > slightly faster than previous revision. Great. Committed, I finally got around to it. Some minor changes: I reworded the docs and also updated the table of support functions in xindex.sgml. I refactored the query in opr_sanity.sql, to make it easier to read, and to check more carefully that the required support functions are present. I also added a runtime check to avoid crashing if neither is present. A few things we ought to still discuss: * I just noticed that the dummy trueTriConsistentFn returns GIN_MAYBE, rather than GIN_TRUE. The equivalent boolean version returns 'true' without recheck. Is that a typo, or was there some reason for the discrepancy? * Come to think of it, I'm not too happy with the name GinLogicValue. It's too vague. It ought to include "ternary" or "tri-value" or something like that. How about renaming it to "GinTernaryValue" or just "GinTernary"? Any better suggestion for the name? * This patch added a triConsistent function for array and tsvector opclasses. Were you planning to submit a patch to do that for the rest of the opclasses, like pg_trgm? (it's getting awfully late for that...) - Heikki
pgsql-hackers by date: