Re: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs
From | François CHAHUNEAU |
---|---|
Subject | Re: [BUGS] pg_trgm word_similarity inconsistencies or bug |
Date | |
Msg-id | AM4PR02MB131633B46174D056E6FE6130F6330@AM4PR02MB1316.eurprd02.prod.outlook.com Whole thread Raw |
In response to | Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug (Alexander Korotkov <a.korotkov@postgrespro.ru>) |
List | pgsql-bugs |
Hello Alexander,
This is fine with us. Yes, separate thresholds seem preferable.
Best Regards
Obtenez Outlook pour iOS
From: Alexander Korotkov <a.korotkov@postgrespro.ru>
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-bugs@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-bugs@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
François CHAHUNEAU Directeur des technologies |
NUMEN DIGITAL| 24, rue Marc Seguin 75018 Paris France | www.numen.fr Tel +33 1 40 37 95 03 | Mob +33 6 07 85 21 79 | Fax +33 1 40 37 94 94 |
![]() |
On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Simultaneously, word_similarity() now forces extent bounds to be word bounds. Now word_similarity() behaves similar to my_word_similarity() proposed on stackoverlow.
On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote: my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.
I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.Good point. I've no complaints about that. I'm going to propose corresponding patch to the next commitfest.
I've written a draft patch for fixing this inconsistency. Please, find it in attachment. This patch doesn't contain proper documentation and comments yet.
I've called existing behavior subset_similarity(). I didn't use name substring_similarity(), because it doesn't really looking for substring with appropriate padding, but rather searching for continuous subset of trigrams. For index search over subset similarity, %>>, <<%, <->>>, <<<-> operators are provided. I've added extra arrow sign to denote these operators look deeper into string.
# with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, subset_similarity('sage', t), word_similarity('sage', t)
from data;
t | subset_similarity | word_similarity
------------------+-------------------+-----------------
message | 0.6 | 0.3
message s | 0.8 | 0.363636
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
The difference here is only in 'messsage s' row, because word_similarity() allows matching one word to two or more while my_word_similarity() doesn't allow that. In this case word_similarity() returns similarity between 'sage' and 'message s'.
# select similarity('sage', 'message s');
similarity
------------
0.363636
(1 row)
I think behavior of word_similarity() appears better here, because typo can break word into two.
I also wonder if word_similarity() and subset_similarity() should share same threshold value for indexed search. subset_similarity() typically returns higher values than word_similarity(). Thus, it's probably makes sense to split their threshold values.
The Russian Postgres Company
pgsql-bugs by date: