trgm and index-usage when using similarity function - Mailing list pgsql-performance
| From | Andreas Joseph Krogh |
|---|---|
| Subject | trgm and index-usage when using similarity function |
| Date | |
| Msg-id | VisenaEmail.4.25b9c8bc175afa23.19a530e9ec5@origo-test01.app.internal.visena.net Whole thread Raw |
| List | pgsql-performance |
Using PG-18.0, I'm having problems with trgm and index-usage when using similarity-function.
Given the following schema:
CREATE TABLE person
(
entity_id BIGSERIAL PRIMARY KEY,
firstname text,
lastname text,
fullname_search text,
birthdate date
);
INSERT INTO person
SELECT pers.entity_id, pers.firstname, pers.lastname, pers.fullname_search, pers.birthdate
FROM onp_crm_person pers;
CREATE INDEX test_idx_onp_crm_person_duplicates
ON person USING gin (fullname_search gin_trgm_ops)
;
This query uses index and is fast, but I have to set pg_trgm.similarity_threshold manually first:
SET pg_trgm.similarity_threshold = 0.9;
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
WHERE d.fullname_search % p.fullname_search
AND d.entity_id <> p.entity_id
LIMIT 1
)
;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Semi Join (cost=2.19..319906.05 rows=256 width=45) (actual time=29.527..2626.465 rows=7497.00 loops=1) │
│ Buffers: shared hit=972919 │
│ -> Seq Scan on person p (cost=0.00..488.81 rows=25581 width=45) (actual time=0.024..1.127 rows=25581.00 loops=1) │
│ Buffers: shared hit=233 │
│ -> Bitmap Heap Scan on person d (cost=2.19..10.03 rows=256 width=25) (actual time=0.102..0.102 rows=0.29 loops=25581) │
│ Recheck Cond: (fullname_search % p.fullname_search) │
│ Rows Removed by Index Recheck: 0 │
│ Filter: (entity_id <> p.entity_id) │
│ Rows Removed by Filter: 1 │
│ Heap Blocks: exact=34252 │
│ Buffers: shared hit=972686 │
│ -> Bitmap Index Scan on test_idx_onp_crm_person_duplicates (cost=0.00..2.13 rows=256 width=0) (actual time=0.097..0.097 rows=1.99 loops=25581) │
│ Index Cond: (fullname_search % p.fullname_search) │
│ Index Searches: 25581 │
│ Buffers: shared hit=933853 │
│ Planning: │
│ Buffers: shared hit=2 │
│ Planning Time: 3.620 ms │
│ JIT: │
│ Functions: 7 │
│ Options: Inlining false, Optimization false, Expressions true, Deforming true │
│ Timing: Generation 0.772 ms (Deform 0.230 ms), Inlining 0.000 ms, Optimization 5.522 ms, Emission 23.219 ms, Total 29.513 ms │
│ Execution Time: 2628.643 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(23 rows)
But using similarity-function:
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
WHERE similarity(d.fullname_search, p.fullname_search) >= 0.9
AND d.entity_id <> p.entity_id
LIMIT 1
)
;
… this has been running for 30 minutes now and I've cancelled it.
Any suggestions on how to fix this query so I can:
specify the similarity-factor as part of the query
Make it fast
Thanks.
--
Andreas Joseph Krogh
pgsql-performance by date: