Thread: Text search prefix matching and stop words
Prefix matching should not omit stop words, as matching lexemes may legitimately begin with stop words.
# select to_tsquery('english', 'over:*') @@ to_tsvector('english', 'overhaul');
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
?column?
----------
f
(1 row)
I noticed this after implementing interactive, incremental search in an application. As the user typed "overhaul," with each successive character executing a search, "ove" and "overh" matched a particular document, but "over" did not.
Reproduced in PostgreSQL 11, 13, and 14.
Many thanks for the report!
I am not sure that it is a bug. I think this is simply how to_tsquery conversion works: stop words are filtered first, then the prefix template is processed.
If you want to process successive characters typing, you can use casting to tsvector type until input is not finished
'over:*'::tsquery;
and when the user finishes input then process the result via to_tsquery with stop words.
If we changed to_tsquery to work the way you describe, I expect it would never apply the stop-word filter to templated input, since a prefix cannot be compared against stop words.
I must correct myself:
"If you want to process successive characters typing, you can use casting to tsquery type until input is not finished"
Pavel Borisov <pashkin.elfe@gmail.com> writes:
>> Prefix matching should not omit stop words, as matching lexemes may
>> legitimately begin with stop words.
> I am not sure that it is a bug. I think this is a way how to_tsquery
> conversion work: stopwords first then template processing.
I concur with the OP that this is a bug, or at least that it'd be nice if it worked better. But I'm not sure we can make it better. The basic design of our text search stuff combined the functions of normalization and stop-word-suppression into a single dictionary stack, so that it's impossible to ask for just one of those to happen. But if we skip applying the dictionaries at all for a prefix item, then word normalization doesn't happen, which would create a different set of unexpected-failure-to-match conditions. (So your proposed workaround of casting directly to tsquery just moves the problem somewhere else.)
I think we could only fix this with a dictionary API change that allows telling the dictionaries not to suppress stopwords. Not sure how practical that is. If we'd had the prefix-match feature from the beginning, maybe it'd have occurred to us that we needed that API option ... but we didn't.
regards, tom lane
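The trade-off described above can be observed from psql. What follows is a minimal sketch, assuming a stock installation in which the built-in english_stem dictionary and english configuration are unmodified. The word "overhauling" is chosen only for illustration and does not come from the original report.

```sql
-- A single dictionary call does both jobs at once: it stems ordinary
-- words, and it returns an empty array for words on its stop list.
SELECT ts_lexize('english_stem', 'overhauling');
SELECT ts_lexize('english_stem', 'over');

-- A direct cast to tsquery bypasses the dictionaries, so the prefix keeps
-- its unstemmed spelling, while to_tsvector stores the stemmed lexeme.
-- The prefix is then longer than the stored lexeme and is not expected
-- to match.
SELECT 'overhauling:*'::tsquery @@ to_tsvector('english', 'overhauling');
```

The first pair of queries shows why stop-word suppression cannot be switched off separately from stemming. The last query shows the kind of failure to match that skipping the dictionaries introduces.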
On Fri, Oct 8, 2021 at 10:31 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> If you want to process successive characters typing, you can use casting to tsvector type until input is not finished
>
> 'over:*'::tsquery;
Also it is possible to use a custom configuration without stop words if you want normalization:
postgres=# select to_tsquery('english_wo_stop', 'over:*') && to_tsquery('english', 'foo');
?column?
------------------
'over':* & 'foo'
--
Artur
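The message above uses a configuration named english_wo_stop without showing its definition. One way such a configuration might be set up is sketched below, assuming the built-in english configuration maps its word token types to english_stem. The dictionary name english_nostop_stem is hypothetical, and the actual definition used in the thread is not known.

```sql
-- A Snowball stemmer declared without a StopWords parameter removes
-- no stop words. The name english_nostop_stem is made up for this sketch.
CREATE TEXT SEARCH DICTIONARY english_nostop_stem (
    TEMPLATE = snowball,
    Language = english
);

-- Start from the built-in english configuration, then swap in the
-- stop-word-free stemmer wherever english_stem is used.
CREATE TEXT SEARCH CONFIGURATION english_wo_stop (COPY = pg_catalog.english);
ALTER TEXT SEARCH CONFIGURATION english_wo_stop
    ALTER MAPPING REPLACE english_stem WITH english_nostop_stem;

-- The query from the original report, with the prefix term parsed by
-- the stop-word-free configuration.
SELECT to_tsquery('english_wo_stop', 'over:*') @@ to_tsvector('english', 'overhaul');
```

With this setup the prefix term is still stemmed but is no longer discarded as a stop word, so the incremental search from the original report should keep matching at "over". Documents can stay indexed with the regular english configuration, since only the query side needs the alternate one.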