Re: GIN improvements part2: fast scan - Mailing list pgsql-hackers
From | Rod Taylor |
---|---|
Subject | Re: GIN improvements part2: fast scan |
Date | |
Msg-id | CAKddOFDakagyJYYyVHrx69K89HwSZZLvZU0xJ03hZ_oLkPO7XQ@mail.gmail.com Whole thread Raw |
In response to | Re: GIN improvements part2: fast scan (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: GIN improvements part2: fast scan
|
List | pgsql-hackers |
2%.
It's essentially sentence fragments from 1 to 5 words in length. I wasn't expecting it to be much smaller.10 recent value selections:
white vinegar reduce color running
vinegar cure uti
cane vinegar acidity depends parameter
how remedy fir clogged shower
use vinegar sensitive skin
home remedies removing rust heating
does non raw apple cider
home remedies help maintain healthy
can vinegar mess up your
apple cide vineger ph balance
On Fri, Nov 15, 2013 at 12:51 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Fri, Nov 15, 2013 at 6:57 PM, Rod Taylor <pg@rbt.ca> wrote:I tried again this morning using gin-packed-postinglists-16.patch and gin-fast-scan.6.patch. No crashes.
It is about a 0.1% random sample of production data (10,000,000 records) with the below structure. Pg was compiled with debug enabled in both cases.
Table "public.kp"
Column | Type | Modifiers
--------+---------+-----------
id | bigint | not null
string | text | not null
score1 | integer |
score2 | integer |
score3 | integer |
score4 | integer |
Indexes:
"kp_pkey" PRIMARY KEY, btree (id)
"kp_string_key" UNIQUE CONSTRAINT, btree (string)
"textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string)) WHERE score1 IS NOT NULL
This is a query tested. All data is in Pg buffer cache for these timings. Words like "the" and "and" are very common (~9% of entries, each) and a word like "hotel" is much less common (~0.2% of entries).
SELECT id,string
FROM kp
WHERE score1 IS NOT NULL
AND to_tsvector('simple', string) @@ to_tsquery('simple', ?)
-- ? is substituted with the query strings
ORDER BY score1 DESC, score2 ASC
LIMIT 1000;
Limit (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032 rows=142 loops=1)
-> Sort (cost=56.04..56.04 rows=1 width=37) (actual time=250.008..250.017 rows=142 loops=1)
Sort Key: score1, score2
Sort Method: quicksort Memory: 36kB
-> Bitmap Heap Scan on kp (cost=52.01..56.03 rows=1 width=37) (actual time=249.711..249.945 rows=142 loops=1)
Recheck Cond: ((to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL))
-> Bitmap Index Scan on textsearch_gin_idx (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142 loops=1)
Index Cond: (to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery)
Total runtime: 250.096 ms
Times are from \timing on.
MASTER
=======
the: 888.436 ms 926.609 ms 885.502 ms
and: 944.052 ms 937.732 ms 920.050 ms
hotel: 53.992 ms 57.039 ms 65.581 ms
and & the & hotel: 260.308 ms 248.275 ms 248.098 msThese numbers roughly match what we get with Pg 9.2. The time savings between 'the' and 'and & the & hotel' is mostly heap lookups for the score and the final sort.
The size of the index on disk is about 2% smaller in the patched version.
PATCHED
=======
the: 1055.169 ms 1081.976 ms 1083.021 ms
and: 912.173 ms 949.364 ms 965.261 ms
hotel: 62.591 ms 64.341 ms 62.923 ms
and & the & hotel: 268.577 ms 259.293 ms 257.408 mshotel & and & the: 253.574 ms 258.071 ms 250.280 ms
I was hoping that the 'and & the & hotel' case would improve with this patch to be closer to the 'hotel' search, as I thought that was the kind of thing it targeted. Unfortunately, it did not. I actually applied the patches, compiled, initdb/load data, and ran it again thinking I made a mistake.Reordering the terms 'hotel & and & the' doesn't change the result.Oh, in this path new consistent method isn't implemented for tsvector opclass, for array only. Will be fixed soon.BTW, was index 2% smaller or 2 times smaller? If it's 2% smaller than I need to know more about your dataset :)
------
With best regards,
Alexander Korotkov.
pgsql-hackers by date: