Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs
From | Alexander Korotkov |
---|---|
Subject | Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug |
Date | |
Msg-id | CAPpHfdsDtQJ+pYfapH7GMt9n4RLByKMzHEBWcPL99C_g+rkUKQ@mail.gmail.com Whole thread Raw |
In response to | [BUGS] pg_trgm word_similarity inconsistencies or bug (Cristiano Coelho <cristianocca@hotmail.com>) |
Responses |
Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
|
List | pgsql-bugs |
I'd like to forward a feedback from our customer who uses word_similarity() function.
François finds current behavior of word_similarity() to be useful. Thus, I think we should preserve it. But documentation correction is needed and option for alternative behavior would be useful too.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>
Date: Wed, Nov 1, 2017 at 1:04 AM
Subject: RE: [BUGS] pg_trgm word_similarity inconsistencies or bug
To: Alexander Korotkov <a.korotkov@postgrespro.ru>
Cc: Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>, "foli@numen.mg" <foli@numen.mg>
Hello Alexander,
We agree that the current pg_trgm documentation does not correctly reflect the de facto behavior of word_similarity(), and that something has to be changed. But to us, it is more a documentation problem than anything else.
What is computed is still « substring_similarity » as was initially specified between us, but it is influenced by a strong word boundary bias caused by the way trigrams are padded at word boundaries. This bias was noticed by early reviewers and you explained that this motivated the name switch to « word_similarity ». As you will remember, at the time we discovered this, we were suprised because we considerd this as a slight misnomer. Indeed, what is currently described in the 9.6 pg_trgm documentation is inaccurate (although seemingly consistent with this new name) and has to be amended.
Now, word_similarity() has been out for more than a year and, of course, it is preferable to avoid any breaking changes… In our case, we consider the name « unfortunate » and the explanation buggy, not the function itself.
As you may remember from the initial discussion, some other users stressed the importance to be able to matchsub strings. We tend to agree with what Jeff Janes wrote in this discussion :
The reason I like the option of not treating word boundaries as
special in this case is that often in scientific vocabulary, and in
catalog part numbers, people are pretty inconsistent about whether
they included spaces. "HEK 293", "HEK293", and "HEK-293" could be all
the same thing. So I like to strip out spaces and punctuation on both
sides of operator. Of course I can't do that if there are invisible
un-removable spaces on the substring side.
But, It doesn't sound like I am going to win that debate. Given that,
I don't think we need a different name for the function. I'm fine with
explaining the word-boundary subtlety in the documentation, and
keeping the function name itself simple.
Now, considering your proposal :
As far as we are concerned, we use <% and %> everyday for efficient fuzzy matching on large databases. Our typical usage scenario is matching noisy OCRized text strings against reference databases. > 1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,
Ok for us, iff default behavior remains the same as now, for backward compatibility reasons. We could take advantage, in some cases, of the new « word rounded » behavior controlled by the GUC variable, but this would not cover all scenarios currently in use.
> 2) Document both cases of word_similarity() behavior.
This is clearly needed anyway.
Best regards,
François CHAHUNEAU Directeur des technologies |
NUMEN DIGITAL| 24, rue Marc Seguin 75018 Paris France | www.numen.fr Tel +33 1 40 37 95 03 | Mob +33 6 07 85 21 79 | Fax +33 1 40 37 94 94 |
![]() |
De : Alexander Korotkov [mailto:a.korotkov@
Envoyé : mardi 31 octobre 2017 16:18
À : Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>; François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>
Objet : Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Dear, Thierry and François!
PostgreSQL users found inconsistency between documentation and implementation of word_similarity().
Possible solution proposed by the reporter is to alter the implementation.
But it's important for me that your interests are not affected but potential further change of implementation of word_similarity().
Could you please share your opinion on changes proposed by Jan in the pgsql-bugs mailing list?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
pgsql-bugs by date: