trgm and index-usage when using similarity function - Mailing list pgsql-performance

From Andreas Joseph Krogh
Subject trgm and index-usage when using similarity function
Date
Msg-id VisenaEmail.4.25b9c8bc175afa23.19a530e9ec5@origo-test01.app.internal.visena.net
Whole thread Raw
List pgsql-performance

Using PG-18.0, I'm having problems with trgm and index-usage when using similarity-function.

 

Given the following schema:

CREATE TABLE person
(
entity_id BIGSERIAL PRIMARY KEY,
firstname text,
lastname text,
fullname_search text,
birthdate date
);

INSERT INTO person
SELECT pers.entity_id, pers.firstname, pers.lastname, pers.fullname_search, pers.birthdate
FROM onp_crm_person pers;

CREATE INDEX test_idx_onp_crm_person_duplicates
ON person USING gin (fullname_search gin_trgm_ops)
;

 

This query uses index and is fast, but I have to set pg_trgm.similarity_threshold manually first:

SET pg_trgm.similarity_threshold = 0.9;

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
WHERE d.fullname_search % p.fullname_search
AND d.entity_id <> p.entity_id
LIMIT 1
)
;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Semi Join (cost=2.19..319906.05 rows=256 width=45) (actual time=29.527..2626.465 rows=7497.00 loops=1) │
│ Buffers: shared hit=972919 │
│ -> Seq Scan on person p (cost=0.00..488.81 rows=25581 width=45) (actual time=0.024..1.127 rows=25581.00 loops=1) │
│ Buffers: shared hit=233 │
│ -> Bitmap Heap Scan on person d (cost=2.19..10.03 rows=256 width=25) (actual time=0.102..0.102 rows=0.29 loops=25581) │
│ Recheck Cond: (fullname_search % p.fullname_search) │
│ Rows Removed by Index Recheck: 0 │
│ Filter: (entity_id <> p.entity_id) │
│ Rows Removed by Filter: 1 │
│ Heap Blocks: exact=34252 │
│ Buffers: shared hit=972686 │
│ -> Bitmap Index Scan on test_idx_onp_crm_person_duplicates (cost=0.00..2.13 rows=256 width=0) (actual time=0.097..0.097 rows=1.99 loops=25581) │
│ Index Cond: (fullname_search % p.fullname_search) │
│ Index Searches: 25581 │
│ Buffers: shared hit=933853 │
│ Planning: │
│ Buffers: shared hit=2 │
│ Planning Time: 3.620 ms │
│ JIT: │
│ Functions: 7 │
│ Options: Inlining false, Optimization false, Expressions true, Deforming true │
│ Timing: Generation 0.772 ms (Deform 0.230 ms), Inlining 0.000 ms, Optimization 5.522 ms, Emission 23.219 ms, Total 29.513 ms │
│ Execution Time: 2628.643 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(23 rows)

 

But using similarity-function:

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
WHERE similarity(d.fullname_search, p.fullname_search) >= 0.9
AND d.entity_id <> p.entity_id
LIMIT 1
)
;

… this has been running for 30 minutes now and I've cancelled it.

 

Any suggestions on how to fix this query so I can:

  1. specify the similarity-factor as part of the query

  2. Make it fast

Thanks.

 

--
Andreas Joseph Krogh

pgsql-performance by date:

Previous
From: David Rowley
Date:
Subject: Re: GEQO plans much slower than standard join plans
Next
From: "Dirschel, Steve"
Date:
Subject: Problem getting query to use index inside a function