Thread: BUG #18580: The pg_similarity appears to be wrong
The following bug has been logged on the website:

Bug reference: 18580
Logged by: Karan Bosamia
Email address: bosamia.karan@gmail.com
PostgreSQL version: 14.10
Operating system: Ubuntu
Description:

SELECT *
FROM (
    SELECT *,
           similarity(provision_clean_description, 'Policies The General Partner shall promptly notify the Investor of any proposed changes in the Funds leverage policies including adjustments to leverage ratios') AS sim_tim
    FROM provision_database
) pd
WHERE sim_tim <= 1 AND sim_tim > 0.7 AND firm_id = 18;

Both of these sentences give a similarity score of 1 despite the fact that
sentence 1 has "Policies" as its starting word (do not include the starting
hyphen in the sentences):

- Policies The General Partner shall promptly notify the Investor of any
  proposed changes in the Funds leverage policies including adjustments to
  leverage ratios
- The General Partner shall promptly notify the Investor of any proposed
  changes in the Funds leverage policies including adjustments to leverage
  ratios

On Mon, Aug 12, 2024, at 6:58 AM, PG Bug reporting form wrote:
> SELECT * FROM (SELECT *, similarity(provision_clean_description,
> 'Policies The General Partner shall promptly notify the Investor of any
> proposed changes in the Funds leverage policies including adjustments to
> leverage ratios') AS sim_tim FROM provision_database) pd
> WHERE sim_tim <= 1 and sim_tim > 0.7 and firm_id=18;
>
> Both of these sentences give a similarity score of 1 despite the fact that
> sentence 1 has "Policies" as its starting word (do not include the starting
> hyphen in the sentences):
>
> - Policies The General Partner shall promptly notify the Investor of any
>   proposed changes in the Funds leverage policies including adjustments to
>   leverage ratios
> - The General Partner shall promptly notify the Investor of any proposed
>   changes in the Funds leverage policies including adjustments to leverage
>   ratios

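This is not a bug.

That's how trigram matching works. The documentation [1] explains that the
words don't need to be in the same order because similarity counts the
trigrams the two strings have in common. Trigrams are extracted word by word,
ignoring non-alphanumeric characters, and they are case-insensitive. In your
example the word "policies" already occurs later in both sentences, so the
leading "Policies" adds no new trigrams: the two trigram sets are identical
and similarity() returns 1. You can check the trigrams extracted using the
show_trgm() function.
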
--
-- return the non-common trigrams
-- (zero rows for these two sentences, because their trigram sets are identical)
--
WITH a AS (
    SELECT x FROM unnest(show_trgm('Policies The General Partner shall promptly
notify the Investor of any proposed changes in the Funds leverage policies
including adjustments to leverage ratios')) x),
b AS (
    SELECT x FROM unnest(show_trgm('The General Partner shall promptly notify the
Investor of any proposed changes in the Funds leverage policies including
adjustments to leverage ratios')) x)
SELECT * FROM a FULL JOIN b ON (a.x = b.x) WHERE a.x IS NULL OR b.x IS NULL;
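
To see the same effect without a database, here is a rough Python sketch of the extraction rules described above: lowercase the text, treat each run of letters and digits as a word, pad each word with two leading spaces and one trailing space, and collect the distinct three-character windows. It only approximates pg_trgm's documented behaviour for plain ASCII input and is not the extension's code. The names trigrams() and similarity() are hypothetical Python helpers, not part of PostgreSQL.

```python
import re


def trigrams(text: str) -> set:
    """Approximate pg_trgm extraction: distinct 3-character windows per padded word."""
    grams = set()
    # Lowercase first, then treat every run of ASCII letters/digits as one word.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0  # assumption of this sketch: no trigrams at all means no similarity
    return len(ta & tb) / len(ta | tb)


base = ("The General Partner shall promptly notify the Investor of any "
        "proposed changes in the Funds leverage policies including "
        "adjustments to leverage ratios")
with_prefix = "Policies " + base

# "policies" is already present in base, so the prefix adds no new trigrams.
print(trigrams(with_prefix) - trigrams(base))  # set()
print(similarity(with_prefix, base))           # 1.0
```

Prefixing a word that does not already occur in the sentence (for example "Extra") does introduce new trigrams in this sketch, and the score then drops below 1.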