Re: First implementation of GIN for pg_trgm - Mailing list pgsql-patches
From | Guillaume Smet |
---|---|
Subject | Re: First implementation of GIN for pg_trgm |
Date | |
Msg-id | 1d4e0c10702220600q715a9329q98ecf0ada562c0a1@mail.gmail.com Whole thread Raw |
In response to | Re: First implementation of GIN for pg_trgm (Teodor Sigaev <teodor@sigaev.ru>) |
Responses |
Re: First implementation of GIN for pg_trgm
|
List | pgsql-patches |
On 2/22/07, Teodor Sigaev <teodor@sigaev.ru> wrote: > > From a previous discussion with Teodor, it would be better to store an > > int in the index instead of a text (it takes less space and is > > faster). I couldn't find any example so if anyone has an advice to fix > > that, it's welcome (mostly how to pack the trigram into an int instead > > of a text). > > Something like that: [snip] I worked on that this morning before I got your email. You'll find attached a second version of the patch using int4 to store the trigrams (it's not exactly what you gave me but it should work I think). I didn't see any improvement in terms of size of the index (14 MB for 642 738 rows in the index in both cases) or speed. Our dictionary table contains 78367 words and its size is 3 MB. Did I miss something? If there's no benefit to use an int4, I propose we keep the text version because it will be better if we add UTF-8 support someday. > our query. So, our consistent function should say FALSE when indexed value is > not similar to query exactly and TRUE in opposite case. Let lquery is a length > of query and ncommon is a number of common trigrams (look at cnt_sml() > function), and consistent function should be: Yes, that's what I did in my patch if I understand you correctly but it's not very selective IMHO. > Now consistent function doesn't guarantee exact result, so we should mark '%' > operator in CREATE OPERATOR CLASS with RECHECK option. The attached patch adds a RECHECK too. It seems to work correctly but the RECHECK COND costs a lot of time :/. At the moment, I have the following results: GIST: ------- test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml FROM lieu_mots_gist WHERE word % 'aub' ORDER BY sml DESC; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- Sort (cost=204.40..204.59 rows=78 width=11) (actual time=98.028..98.132 rows=50 loops=1) Sort Key: similarity('aub'::text, word) -> Bitmap Heap Scan on lieu_mots_gist (cost=4.88..201.95 rows=78 width=11) (actual time=97.575..97.861 rows=50 loops=1) Recheck Cond: (word % 'aub'::text) -> Bitmap Index Scan on idx_word_gist (cost=0.00..4.86 rows=78 width=0) (actual time=97.539..97.539 rows=50 loops=1) Index Cond: (word % 'aub'::text) Total runtime: 98.303 ms (7 rows) test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml FROM lieu_mots_gist WHERE word % 'auberge' ORDER BY sml DESC; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Sort (cost=204.40..204.59 rows=78 width=11) (actual time=135.128..135.196 rows=41 loops=1) Sort Key: similarity('auberge'::text, word) -> Bitmap Heap Scan on lieu_mots_gist (cost=4.88..201.95 rows=78 width=11) (actual time=134.829..135.016 rows=41 loops=1) Recheck Cond: (word % 'auberge'::text) -> Bitmap Index Scan on idx_word_gist (cost=0.00..4.86 rows=78 width=0) (actual time=134.797..134.797 rows=41 loops=1) Index Cond: (word % 'auberge'::text) Total runtime: 135.335 ms (7 rows) With GIN: ------------ test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml FROM lieu_mots_gin WHERE word % 'aub' ORDER BY sml DESC; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Sort (cost=208.45..208.64 rows=78 width=11) (actual time=60.409..60.513 rows=50 loops=1) Sort Key: similarity('aub'::text, word) -> Bitmap Heap Scan on lieu_mots_gin (cost=8.93..205.99 rows=78 width=11) (actual time=10.279..60.228 rows=50 loops=1) Filter: (word % 'aub'::text) -> Bitmap Index Scan on idx_word_gin (cost=0.00..8.91 rows=78 width=0) (actual time=10.069..10.069 rows=6676 loops=1) Index Cond: (word % 'aub'::text) Total runtime: 60.688 ms (7 rows) test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml FROM lieu_mots_gin WHERE word % 'auberge' ORDER BY sml DESC; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- Sort (cost=208.45..208.64 rows=78 width=11) (actual time=50.293..50.381 rows=41 loops=1) Sort Key: similarity('auberge'::text, word) -> Bitmap Heap Scan on lieu_mots_gin (cost=8.93..205.99 rows=78 width=11) (actual time=21.006..50.122 rows=41 loops=1) Filter: (word % 'auberge'::text) -> Bitmap Index Scan on idx_word_gin (cost=0.00..8.91 rows=78 width=0) (actual time=20.839..20.839 rows=928 loops=1) Index Cond: (word % 'auberge'::text) Total runtime: 50.534 ms (7 rows) -- Guillaume
Attachment
pgsql-patches by date: