Re: BUG #18580: The pg_similarity appears to be wrong - Mailing list pgsql-bugs

From Euler Taveira
Subject Re: BUG #18580: The pg_similarity appears to be wrong
Date
Msg-id f90f08b5-ba6d-4b39-9653-ede0f70a1be9@app.fastmail.com
In response to BUG #18580: The pg_similarity appears to be wrong  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
On Mon, Aug 12, 2024, at 6:58 AM, PG Bug reporting form wrote:
> SELECT *
> FROM (
>   SELECT 
>     *, 
>     similarity(provision_clean_description, 'Policies The General Partner
> shall promptly notify the Investor of any proposed changes in the Funds
> leverage policies including adjustments to leverage ratios') AS sim_tim
>   FROM provision_database
> ) pd
>   WHERE sim_tim <= 1 and sim_tim > 0.7 and firm_id=18;
>
> This both sentences giving similarity score as 1 despite the fact that the
> sentence 1. has Policies as the starting word(do not include the starting
> hyphen in the sentences):
> - Policies The General Partner shall promptly notify the Investor of any
> proposed changes in the Funds leverage policies including adjustments to
> leverage ratios
> - The General Partner shall promptly notify the Investor of any proposed
> changes in the Funds leverage policies including adjustments to leverage
> ratios


This is not a bug.

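That's how trigram matching works. The documentation [1] explains that the
words don't need to be in the same order because similarity counts the number
of common trigrams. Trigrams are extracted ignoring non-alphanumeric
characters, and they are case-insensitive. In your example the word "policies"
already appears later in the sentence, so the leading "Policies" contributes
no new trigrams: both strings yield the same set of trigrams and the
similarity is 1. You can check the extracted trigrams using the show_trgm()
function.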

--
-- return the trigrams that are not common to both strings
-- (an empty result means both strings have the same set of trigrams)
--
WITH a AS (
SELECT x FROM unnest(show_trgm('Policies The General Partner shall promptly
notify the Investor of any proposed changes in the Funds leverage policies
including adjustments to leverage ratios')) x),
b AS (
SELECT x FROM unnest(show_trgm('The General Partner shall promptly notify the
Investor of any proposed changes in the Funds leverage policies including
adjustments to leverage ratios')) x)
SELECT * FROM a FULL JOIN b ON (a.x = b.x) WHERE a.x IS NULL OR b.x IS NULL;
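
If you don't have the extension at hand, the same reasoning can be sketched
outside the database. The Python below is only a rough model of what pg_trgm
does, not its actual implementation: the helper names trigrams() and
similarity() are made up for this illustration, and the rules it assumes
(lowercase the text, split on non-alphanumeric characters, pad each word with
two leading spaces and one trailing space, compare the sets of distinct
trigrams) follow the behavior described in the documentation.

```python
import re


def trigrams(text):
    """Approximate pg_trgm extraction (simplified, ASCII-only model)."""
    grams = set()
    # Lowercase, then treat every run of letters/digits as one word.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Assumed padding: two spaces before the word, one after.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared distinct trigrams divided by all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


s1 = ("Policies The General Partner shall promptly notify the Investor of "
      "any proposed changes in the Funds leverage policies including "
      "adjustments to leverage ratios")
s2 = ("The General Partner shall promptly notify the Investor of "
      "any proposed changes in the Funds leverage policies including "
      "adjustments to leverage ratios")

# Trigrams present in only one of the two strings.
print(trigrams(s1) ^ trigrams(s2))  # prints set()
print(similarity(s1, s2))           # prints 1.0
```

Under this model the leading "Policies" is lowercased to "policies", which the
second sentence already contains, so it adds nothing to the set. If the extra
word did not occur anywhere else in the sentence, the score would drop below 1.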



--
Euler Taveira
