[HACKERS] Rethinking our fulltext phrase-search implementation - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | [HACKERS] Rethinking our fulltext phrase-search implementation |
Date | |
Msg-id | 28215.1481999808@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: [HACKERS] Rethinking our fulltext phrase-search implementation
Re: [HACKERS] Rethinking our fulltext phrase-search implementation Re: [HACKERS] Rethinking our fulltext phrase-search implementation |
List | pgsql-hackers |
I've been thinking about how to fix the problem Andreas Seltenreich reported at https://postgr.es/m/87eg1y2s3x.fsf@credativ.de The core of that problem is that the phrase-search patch attempts to restructure tsquery trees so that there are no operators underneath a PHRASE operator, except possibly other PHRASE operators. The goal is to not have to deal with identifying specific match locations except while processing a PHRASE operator. Well, that's an OK idea if it can be implemented reasonably, but there are several problems with the code as it stands: 1. The transformation is done by normalize_phrase_tree(), which is currently invoked (via cleanup_fakeval_and_phrase()) in a rather ad-hoc set of places including tsqueryin() and the various variants of to_tsquery(). This leaves lots of scope for errors of omission, which is exactly the proximate cause of Andreas' bug: ts_rewrite() is neglecting to re-normalize its result tsquery. I have little faith that there aren't other similar errors of omission today, and even less that we won't introduce more in future. 2. Because the transformation is invoked as early as tsqueryin(), it's user-visible: regression=# select 'a <-> (b | c)'::tsquery; tsquery ---------------------------'a' <-> 'b' | 'a' <-> 'c' (1 row) This is confusing to users, and I do not think it's a good idea to expose what's basically an optimization detail this way. For stored tsqueries, we're frozen into this approach forever even if we later decide that checking for 'a' twice isn't such a hot idea. 3. Worse, early application of the transformation creates problems for operations such as ts_rewrite: a rewrite might fail to match, or produce surprising results, because the query trees actually being operated on aren't what the user probably thinks they are. At a minimum, to get consistent results I think we'd have to re-normalize after *each step* of ts_rewrite, not only at the end. That would be expensive, and it's far from clear that it would eliminate all the surprises. 4. The transformations are wrong anyway. The OR case I showed above is all right, but as I argued in <24331.1480199636@sss.pgh.pa.us>, the AND case is not: regression=# select 'a <-> (b & c)'::tsquery; tsquery ---------------------------'a' <-> 'b' & 'a' <-> 'c' (1 row) This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be matched at different places in that text; but it seems highly unlikely to me that that's what the writer of such a query wanted. (If she did want that, she would write it that way to start with.) NOT is not very nice either: regression=# select '!a <-> !b'::tsquery; tsquery ------------------------------------!'a' & !( !( 'a' <-> 'b' ) & 'b' ) (1 row) If you dig through that, you realize that the <-> test is pointless; the query can't match any vector containing 'a', so certainly the <-> condition can't succeed, making this an expensive way to spell "!a & !b". And that's not the right semantics anyway: if 'x y' matches this query, which it does and should, why doesn't 'x y a' match? 5. The case with only one NOT under <-> looks like this: regression=# select '!a <-> b'::tsquery; tsquery ------------------------!( 'a' <-> 'b' ) & 'b' (1 row) This is more or less in line with the naive view of what its semantics should be, although I notice that it will match a 'b' at position 1, which might be a surprising result. We're not out of the woods though: this will (and should) match, eg, 'c b a'. But that means that '!a & b' is not a safe lossy approximation to '!a <-> b', which is an assumption that is wired into a number of places. Simple testing shows that indeed GIN and GIST index searches get the wrong answers, different from what you get in a non-indexed search, for queries like this. So we have a mess here, which needs to be cleaned up quite aside from the fact that it's capable of provoking Assert failures and/or crashes. I thought for awhile about moving the normalize_phrase_tree() work to occur at the start of tsquery execution, rather than in tsqueryin() et al. That addresses points 1..3, but doesn't by itself do anything for points 4 or 5. Also, it would be expensive because right now execution works directly from the flat tsquery representation; there's no conversion to an explicit tree structure on which normalize_phrase_tree() could conveniently be applied. Nor do we know at the start whether the tsquery contains any PHRASE operators. On the whole, it seems like the best fix would be to get rid of normalize_phrase_tree() altogether, and instead fix the TS_execute() engine so it can cope with regular operators underneath phrase operators. We can still have the optimization of not worrying about lexeme locations when no phrase operator has been seen, but we have to change TS_phrase_execute to cope with plain AND/OR/NOT operators and calculate proper location information for the result of one. Here is a design for doing that. The hard part seems to be dealing with NOT: merging position lists during AND or OR is pretty clear, but how do we represent the positions where a lexeme isn't? I briefly considered explicitly negating the position list, eg [2,5,7] goes to [1,3,4,6,8,...], but the problem is where to stop. If we knew the largest position present in the current document, we could stop there, but AFAICS we generally don't know that --- and even if we did, this approach could be quite expensive for large documents. So my design is based on keeping the original position list and adding a "negate" flag that says these are the positions where the query pattern does NOT occur. Hence, I propose adding a field to ExecPhraseData: typedef struct ExecPhraseData{ int npos; /* number of positions reported */ bool allocated; /* pos points to palloc'd data? */ + bool negate; /* positions are where value is NOT */ WordEntryPos *pos; /* ordered, non-duplicatelexeme positions */} ExecPhraseData; It's already the case that TS_phrase_execute is always responsible for initializing this struct, so adding this field and initializing it to false doesn't break any existing TSExecuteCallback functions (it's even ABI-compatible because the field is going into a padding hole). I've worked out the following specification for the meaning of this field given that a TSExecuteCallback function, or recursive execution of TS_phrase_execute, has returned true or false: if npos > 0: func result = false, negate = false: disallowed func result = true, negate = false: asserts that query is matched atspecified position(s) (and only those positions) func result = false, negate = true: disallowed func result = true, negate = true: asserts that query is matched at allpositions *except* specified position(s) if npos == 0: func result = false, negate = false: asserts that query is not matchedanywhere func result = true, negate = false: query is (possibly) matched, matchingposition(s) are unknown func result = false, negate = true: disallowed func result = true, negate = true: asserts that query is matched at allpositions The negate = false cases agree with the existing semantics, so that TSExecuteCallback functions do not need to know about the field; there is no case where they'd need to set it to true. Given this definition, the tsquery operators can be implemented as follows in TS_phrase_execute: OP_NOT: if npos > 0 (implying its subquery returned true), invert the negateflag and return true, keeping the same list of positions if npos == 0, the possible cases are:subquery result subquery's negate return result negatefalse false true truetrue false true falsetrue true false false The correctness of these rules can be seen from the specification of the flag meanings. Notice that NOT atop NOT is a no-op, as expected. OP_AND, OP_OR: "not" notations here indicate that L or R input has the negate flag set: a & b: emit positions listed in both inputs a & !b: emit positions listed in a but not b !a & b: emit positions listed in b but not a !a & !b: treat as !(a | b), ie emit positions listed in either input, then set negate flag on output a | b: emit positions listed in either input a | !b: treat as !(!a & b), ie emit positions listed in b but not a, then set negate flag on output !a | b: treat as !(a & !b), ie emit positions listed in a but not b, then set negate flag on output !a | !b: treat as !(a & b), ie emit positions listed in both inputs, then set negate flag on output AND/OR function result is always true when output negate flag is set, else it is true if output npos > 0 OP_PHRASE: Works like OP_AND except we allow for a possibly-nonzero offset between L and R positions, ie we compare an R position of X to an L position of X minus offset while deciding whether to emit position X. Note in particular that <0> acts *exactly* like OP_AND. We could accept a match only if X minus offset is greater than zero, so that "!a <-> b" doesn't match b at start of document. But I see no way to do the equivalent for "a <-> !b" at the end of the document, so I'm inclined not to do that, leaving the existing semantics for these cases alone. The above rules for AND/OR/PHRASE work only when we have position information for both inputs (ie, neither input returned true with npos = 0 and negate = false); otherwise we just do the dumb thing and return true/false with npos = 0, negate = false, to indicate either "there might be a match" or "there definitely is not a match". It's worth noting that with these rules, phrase searches will act as though "!x" always matches somewhere; for instance "!a <-> !b" will match any tsvector. I argue that this is not wrong, not even if the tsvector is empty: there could have been adjacent stopwords matching !a and !b in the original text. Since we've adjusted the phrase matching rules to treat stopwords as unknown-but-present words in a phrase, I think this is consistent. It's also pretty hard to assert this is wrong and at the same time accept "!a <-> b" matching b at the start of the document. This sounds like it would be a lot of code, but I think that all of the AND/OR/PHRASE cases can be implemented by a single subroutine that's told whether to emit a position depending on whether it finds that position in both inputs/neither input/left only/right only. That subroutine otherwise won't be much more complicated than the existing position-finding loop for OP_PHRASE. I'm guessing that it'll end up being roughly a wash once you allow for removing normalize_phrase_tree(). I haven't yet looked into point 5 (wrong GIN/GIST search results), but I'm hopeful that removing the assumption that <-> approximates as AND will fix it. In any case we need to make the base tsquery engine right before we try to fix the index approximations to it. Thoughts, objections? Anybody see errors in the logic? regards, tom lane
pgsql-hackers by date: