Thread: Text search prefix matching and stop words

Text search prefix matching and stop words

From

"Matthew Nelson"

Date:

08 October 2021, 18:17:16

Prefix matching should not omit stop words, as matching lexemes may legitimately begin with stop words.

# select to_tsquery('english', 'over:*') @@ to_tsvector('english', 'overhaul');
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 ?column? 
----------
 f
(1 row)

I noticed this after implementing interactive, incremental search in an application. As the user typed "overhaul," with
each successive character executing a search, "ove" and "overh" matched a particular document, but "over" did not.
 

Reproduced in PostgreSQL 11, 13, and 14.
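
The keystroke-by-keystroke behavior described above can be checked in a single query. This is only a minimal sketch: the VALUES list and the column aliases are illustrative and not part of the original report, and it assumes the stock english configuration, whose stop-word list includes "over".

```sql
-- Test successive prefixes of "overhaul" against the same document.
-- Assumption: the built-in 'english' configuration is unmodified.
SELECT prefix,
       to_tsquery('english', prefix || ':*') AS query,
       to_tsquery('english', prefix || ':*')
           @@ to_tsvector('english', 'overhaul') AS matches
FROM (VALUES ('ove'), ('over'), ('overh')) AS t(prefix);
```

Per the report, the rows for "ove" and "overh" should match, while the row for "over" should produce the NOTICE and fail to match.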

Re: Text search prefix matching and stop words

From

Pavel Borisov

Date:

08 October 2021, 20:30:41

> Prefix matching should not omit stop words, as matching lexemes may legitimately begin with stop words.
>
> # select to_tsquery('english', 'over:*') @@ to_tsvector('english', 'overhaul');
> NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
>  ?column?
> ----------
>  f
> (1 row)
>
> I noticed this after implementing interactive, incremental search in an application. As the user typed "overhaul," with each successive character executing a search, "ove" and "overh" matched a particular document, but "over" did not.

Many thanks for the report!

I am not sure that this is a bug. I think this is simply how the to_tsquery conversion works: stop words are removed first, and prefix (template) processing comes afterwards.

If you want to process the input as each successive character is typed, you can cast to the tsvector type while the input is unfinished:

'over:*'::tsquery;

Then, when the user finishes typing, process the result via to_tsquery, which applies the stop words.

If to_tsquery worked the way you describe, I expect it would never apply the stop-word filter to templated (prefix) input, since a prefix cannot meaningfully be compared to stop words.

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Text search prefix matching and stop words

From

Pavel Borisov

Date:

08 October 2021, 20:32:28

I must correct myself:

"If you want to process the input as each successive character is typed, you can cast to the tsquery type while the input is unfinished"

Re: Text search prefix matching and stop words

From

Tom Lane

Date:

08 October 2021, 21:06:27

Pavel Borisov <pashkin.elfe@gmail.com> writes:
>> Prefix matching should not omit stop words, as matching lexemes may
>> legitimately begin with stop words.

> I am not sure that it is a bug. I think this is a way how to_tsquery
> conversion work: stopwords first then template processing.

I concur with the OP that this is a bug, or at least that it'd be nice
if it worked better.  But I'm not sure we can make it better.  The basic
design of our text search stuff combined the functions of normalization
and stop-word-suppression into a single dictionary stack, so that it's
impossible to ask for just one of those to happen.  But if we skip
applying the dictionaries at all for a prefix item, then word
normalization doesn't happen, which would create a different set of
unexpected-failure-to-match conditions.  (So your proposed workaround
of casting directly to tsquery just moves the problem somewhere else.)

I think we could only fix this with a dictionary API change that
allows telling the dictionaries not to suppress stopwords.  Not
sure how practical that is.  If we'd had the prefix-match feature
from the beginning, maybe it'd have occurred to us that we needed
that API option ... but we didn't.

            regards, tom lane
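
The point above, that casting directly to tsquery skips normalization, can be illustrated with a short sketch. The sample words are chosen only for illustration and are not from the thread; the sketch assumes the stock english configuration, where the Snowball stemmer reduces "overhauling" to the lexeme 'overhaul' and lowercases input.

```sql
-- Document side: the stored lexeme is the stemmed, lowercased form.
SELECT to_tsvector('english', 'Overhauling');

-- Plain cast: the query text is used as written (no stemming, no
-- case folding), so these prefixes are compared byte-for-byte
-- against the stored lexeme and are not expected to match.
SELECT 'overhauling:*'::tsquery @@ to_tsvector('english', 'Overhauling') AS cast_inflected,
       'Overhaul:*'::tsquery    @@ to_tsvector('english', 'Overhauling') AS cast_capitalized;

-- to_tsquery: the same inputs pass through the dictionary stack first,
-- so the normalized prefix can match the stored lexeme.
SELECT to_tsquery('english', 'overhauling:*') @@ to_tsvector('english', 'Overhauling') AS parsed_inflected,
       to_tsquery('english', 'Overhaul:*')    @@ to_tsvector('english', 'Overhauling') AS parsed_capitalized;
```

So the cast avoids the stop-word problem for "over" but trades it for missed matches whenever the typed text differs from its normalized form.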

Re: Text search prefix matching and stop words

From

Artur Zakirov

Date:

11 October 2021, 14:50:51

On Fri, Oct 8, 2021 at 10:31 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> If you want to process successive characters typing, you can use casting to tsvector type until input is not finished
>
> 'over:*'::tsquery;

Also it is possible to use a custom configuration without stop words
if you want normalization:

postgres=# select to_tsquery('english_wo_stop', 'over:*') &&
to_tsquery('english', 'foo');
     ?column?
------------------
 'over':* & 'foo'

-- 
Artur
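
The thread does not show how english_wo_stop was defined. One possible way to build such a configuration is sketched below, under stated assumptions: the dictionary name english_stem_nostop is hypothetical, and the sketch relies on the snowball template discarding no words when its StopWords parameter is omitted.

```sql
-- Snowball stemmer dictionary with no StopWords parameter,
-- so it normalizes words but does not discard any of them.
-- The name english_stem_nostop is made up for this example.
CREATE TEXT SEARCH DICTIONARY english_stem_nostop (
    TEMPLATE = snowball,
    Language = english
);

-- Start from the built-in configuration and swap in the new dictionary
-- everywhere the default english_stem dictionary was mapped.
CREATE TEXT SEARCH CONFIGURATION english_wo_stop (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_wo_stop
    ALTER MAPPING REPLACE english_stem WITH english_stem_nostop;

-- The prefix query from the original report, parsed without stop words.
SELECT to_tsquery('english_wo_stop', 'over:*')
       @@ to_tsvector('english', 'overhaul') AS matches;
```

With this setup the prefix item keeps its stemming but is no longer dropped as a stop word, so the final query would be expected to match. Documents can still be indexed with the regular english configuration, as in the example above.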