Re: faster way to calculate top "tags" for a "resource" based on a column - Mailing list pgsql-general
From | Jim Nasby |
---|---|
Subject | Re: faster way to calculate top "tags" for a "resource" based on a column |
Date | |
Msg-id | 54331003.1020505@BlueTreble.com Whole thread Raw |
In response to | faster way to calculate top "tags" for a "resource" based on a column (Jonathan Vanasco <postgres@2xlp.com>) |
Responses |
Re: faster way to calculate top "tags" for a "resource" based on a column
|
List | pgsql-general |
On 10/3/14, 11:21 AM, Jonathan Vanasco wrote: > I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any timeoff this one. I'm hoping someone has another strategy. > > I have 2 tables: > resource > resource_2_tag > > I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the "resource"table. > > both tables have around 1.6million records. > > If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk. > If the database doesn't need to warm up, it averages 1.76seconds. > > The 1.76s time is troubling me. > Searching for the discrete elements of this is pretty lightweight. > > here's an explain -- http://explain.depesz.com/s/PndC > > I tried a subquery instead of a join, and the query optimized the plan to the same. > > i'm hoping someone will see something that I just don't see. > > > > Table "public.resource_2_tag" > Column | Type | Modifiers > -----------------------+---------+----------- > resource_id | integer | > tag_id | integer | > Indexes: > "_idx_speed_resource_2_tag__resource_id" btree (resource_id) > "_idx_speed_resource_2_tag__tag_id" btree (tag_id) > > Table "public.resource" > Column | Type | Modifiers > -------------------------------------+-----------------------------+---------------------------------------------------------- > id | integer | not null default nextval('resource_id_seq'::regclass) > resource_attribute1_id | integer | > lots of other columns | | > Indexes: > "resource_attribute1_idx" btree (resource_attribute1_id) > > -------------------------------------------------------------------------------- > > select count(*) from resource; > -- 1669729 > > select count(*) from resource_2_tag; > -- 1676594 > > select count(*) from resource where resource_attribute1_id = 614; > -- 5184 > -- 4.386ms > > select id from resource where resource_attribute1_id = 614; > -- 5184 > -- 87.303ms > > popping the 5k elements into an "in" clause, will run the query in around 100ms. > > > EXPLAIN ANALYZE > SELECT > resource_2_tag.tag_id AS resource_2_tag_tag_id, > count(resource_2_tag.tag_id) AS counted > FROM > resource_2_tag > JOIN resource ON resource.id = resource_2_tag.resource_id > WHERE > resource.resource_attribute1_id = 614 > GROUP BY resource_2_tag.tag_id > ORDER BY counted DESC > LIMIT 25 OFFSET 0; Don't join to the resource table; there's no reason to because you're not pulling anything from it. If for some reason you do need data out of the resource table, do the LIMIT 25 first, in a sub-select: SELECT r.*, counted FROM resource r JOIN ( SELECT tag_id, count(*) FROM resource_2_tag GROUP BY tag_id ORDER BY tag_id LIMIT 25 ) t ON ... ; -- -- Jim Nasby, Data Architect, Blue Treble Data in Trouble? Get it in Treble! http://BlueTreble.com
pgsql-general by date: