Home > mailing lists

trgm and index-usage when using similarity function - Mailing list pgsql-performance

From	Andreas Joseph Krogh
Subject	trgm and index-usage when using similarity function
Date	November 5 11:18:06
Msg-id	VisenaEmail.4.25b9c8bc175afa23.19a530e9ec5@origo-test01.app.internal.visena.net Whole thread Raw
List	pgsql-performance

Tree view

Using PG-18.0, I'm having problems with trgm and index-usage when using similarity-function.

Given the following schema:

CREATE TABLE person
(
    entity_id   BIGSERIAL PRIMARY KEY,
    firstname   text,
    lastname    text,
    fullname_search text,
    birthdate   date
);

INSERT INTO person
SELECT pers.entity_id, pers.firstname, pers.lastname, pers.fullname_search, pers.birthdate
FROM onp_crm_person pers;

CREATE INDEX test_idx_onp_crm_person_duplicates
    ON person USING gin (fullname_search gin_trgm_ops)
;

This query uses index and is fast, but I have to set pg_trgm.similarity_threshold manually first:

SET pg_trgm.similarity_threshold = 0.9;

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
    SELECT *
    FROM person d
    WHERE d.fullname_search % p.fullname_search
      AND d.entity_id <> p.entity_id
    LIMIT 1
)
;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                        QUERY PLAN                                                                        │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Semi Join  (cost=2.19..319906.05 rows=256 width=45) (actual time=29.527..2626.465 rows=7497.00 loops=1)                                      │
│   Buffers: shared hit=972919                                                                                                                             │
│   ->  Seq Scan on person p  (cost=0.00..488.81 rows=25581 width=45) (actual time=0.024..1.127 rows=25581.00 loops=1)                                     │
│         Buffers: shared hit=233                                                                                                                          │
│   ->  Bitmap Heap Scan on person d  (cost=2.19..10.03 rows=256 width=25) (actual time=0.102..0.102 rows=0.29 loops=25581)                                │
│         Recheck Cond: (fullname_search % p.fullname_search)                                                                                              │
│         Rows Removed by Index Recheck: 0                                                                                                                 │
│         Filter: (entity_id <> p.entity_id)                                                                                                               │
│         Rows Removed by Filter: 1                                                                                                                        │
│         Heap Blocks: exact=34252                                                                                                                         │
│         Buffers: shared hit=972686                                                                                                                       │
│         ->  Bitmap Index Scan on test_idx_onp_crm_person_duplicates  (cost=0.00..2.13 rows=256 width=0) (actual time=0.097..0.097 rows=1.99 loops=25581) │
│               Index Cond: (fullname_search % p.fullname_search)                                                                                          │
│               Index Searches: 25581                                                                                                                      │
│               Buffers: shared hit=933853                                                                                                                 │
│ Planning:                                                                                                                                                │
│   Buffers: shared hit=2                                                                                                                                  │
│ Planning Time: 3.620 ms                                                                                                                                  │
│ JIT:                                                                                                                                                     │
│   Functions: 7                                                                                                                                           │
│   Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                          │
│   Timing: Generation 0.772 ms (Deform 0.230 ms), Inlining 0.000 ms, Optimization 5.522 ms, Emission 23.219 ms, Total 29.513 ms                           │
│ Execution Time: 2628.643 ms                                                                                                                              │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(23 rows)

But using similarity-function:

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
    SELECT *
    FROM person d
    WHERE similarity(d.fullname_search, p.fullname_search) >= 0.9
      AND d.entity_id <> p.entity_id
    LIMIT 1
)
;

… this has been running for 30 minutes now and I've cancelled it.

Any suggestions on how to fix this query so I can:

specify the similarity-factor as part of the query
Make it fast

Thanks.

Andreas Joseph Krogh

pgsql-performance by date:

From: David Rowley
Date: 31 October, 04:19:03
Subject: Re: GEQO plans much slower than standard join plans

From: "Dirschel, Steve"
Date: 05 November, 19:02:27
Subject: Problem getting query to use index inside a function

trgm and index-usage when using similarity function - Mailing list pgsql-performance

Previous

Next