Re: Fixing GIN for empty/null/full-scan cases - Mailing list pgsql-hackers
From | David E. Wheeler |
---|---|
Subject | Re: Fixing GIN for empty/null/full-scan cases |
Date | |
Msg-id | EACA86F3-83D2-431B-8071-5136E28FF90B@kineticode.com Whole thread Raw |
In response to | Re: Fixing GIN for empty/null/full-scan cases (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Fixing GIN for empty/null/full-scan cases
Re: Fixing GIN for empty/null/full-scan cases Re: Fixing GIN for empty/null/full-scan cases |
List | pgsql-hackers |
On Jan 8, 2011, at 9:41 PM, Tom Lane wrote: > "David E. Wheeler" <david@kineticode.com> writes: >> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote: >>> Well, actually, I just committed it. If you want to test, feel free. >>> Note that right now only the anyarray && <@ @> operators are genuinely >>> fixed ... I plan to hack on tsearch and contrib pretty soon though. > >> Hrm, the queries I wrote for this sort of thing use intarray: >> WHERE blah @@ '(12|14)'::query_int >> That's not done yet though, right? > > intarray is done now, feel free to test ... Tom, Well, I regret to say that what I found is…all over the place. So I have a rather wide table with two GIST indexes, one on an integer[] column and one on a tsvector column. I duped thetable and replaced those indexes with GIN indexes. And the results of my testing don't make much sense. Well, first the good news: I got no NULL-related errors at all. There are a lot of rows with an empty array in the integer[]column. And I got the same results for my queries against the table with the GIN indexes as the one with the GiSTindexes. So all that's good. One of the reasons our client wants GIN for the integer[] column so bad is because recreating the GiST integer[] index isquite painful. Before I duped the table, I was just dropping and recreating the index on the original table. It was greatto create the GIN index; for 400K rows, it took 1300 ms. After my initial round of testing, I dropped it and createdthe GiST index. That ran for…well, *hours*. Four at least, maybe six or seven (I forgot \timing and was letting itrun on screen while I did some iOS hacking). I think something might be really wrong with GiST index creation for integerarrays, because the difference was just appalling. As a sanity check, I did the same thing today with the same table(s) on their ts_vector columns. Creating the GiST indextook just under 7 seconds; the GIN index took 23.4 seconds. On a second attempt, GiST took 16371 ms and GIN 30452. Ihad expected GIN to be faster here, but both are within the realms of the acceptable, compared to the time it took to createthe GiST index on the integer[] column. As for the queries, here too I was surprised. As I said, the integer[] column had a lot of empty arrays. And the indexeslook like so: "idx_gist_features" gist (features) WHERE deleted_at IS NULL AND status = 1 "idx_gist_textsearch" gist (ts_index_col) WHEREdeleted_at IS NULL AND status = 1 Just s/gist/gin/ for the gin indexes. For the integer[] column, I ran a bunch of queries like so (again, just s/gist/gin/for the gin versions): explain analyze SELECT count(*) FROM gist_listings WHERE features @@ '(1369043)'::query_int AND deleted_at ISNULL AND mls_status_id = 1; This integer had pretty high selectvity, 86 out of 406K rows. Gist: 117.444 on the first run, around 3.2 ms thereafter GIN: Around 325 ms on all runs explain analyze SELECT count(*) FROM gist_listings WHERE features @@ '(1368798|1369043)'::query_int AND deleted_atIS NULL AND mls_status_id = 1; Rows selected: 91528. Gist: 4030.282 ms on the first run, around 210 ms thereafter. GIN: 4309.259 ms on the first run, around 400 ms thereafter explain analyze SELECT count(*) FROM gist_listings WHERE features @@ '(1368799&1368800&1369043)'::query_int AND deleted_at IS NULL AND mls_status_id = 1; Rows selected: 91528 Gist: 1738.568 ms on the first run, around 24 ms thereafter. GIN: 4427.517 ms on the first run, around 340 ms thereafter These numbers are a bit crazy-making, but the upshot is that Gist is slow out of the gate, but with data cached, it's prettyspeedy. With indexscan and bitmapscan disabled, these queries all took 300-400 ms. So GIN was never better performingthan a table scan. So while GIN is a big win for re-indexing (this database gets its intarray Gist indexes updatedquite frequently, as they get quited bloated in a hurry), it's not a win at all for querying. So, thinking that there might be something funky with intarray GIN support, we wanted to test performance of GIST vs. GINon the tsquery column. Here GIN was a much bigger win. With a query like this: SELECT l.* FROM gist_listings WHERE ts_index_col @@ to_tsquery(regexp_replace( plainto_tsquery('english', '1Infinite Loop')::text, '''(?=[[:space:]]|$)', ''':B', 'g' )) and deleted_at IS NULL AND status = 1; With zero rows returned, GIN consistently executed in 20 ms. Gist took 838.274 for the first run and 25-30 ms on subsequentruns. So GIN is the clear winner here, except for index creation as noted above. And here's one that selects a single row: SELECT l.* FROM gist_listings WHERE ts_index_col @@ to_tsquery(regexp_replace( plainto_tsquery('english', 'volkswagon')::text, '''(?=[[:space:]]|$)', ''':B', 'g' )) and deleted_at IS NULL AND status = 1; GiST: 495.867 first run, 380 thereafter GIN: 83.980 first run, 330 thereafter Again, GIN is the clear winner here, though it's negligible when the data is in the cache. So some questions: * Is something seriously wrong with GiST index creation on integer[] columns? * Why does GIN performance appear to be no better than table scans on integer[] columns? * Why does it take 3-4x longer to create the GIN than the GiST index on tsvector? I thought that GIN was supposed to be fasterto update Hope this is helpful, and please do let me know if there are any other tests you'd like me to run against this data. Best, David
pgsql-hackers by date: