Re: Slow query with a lot of data - Mailing list pgsql-performance
From: Moritz Onken
Subject: Re: Slow query with a lot of data
Msg-id: 7DE7116C-4BBC-4776-AEE0-63A2E2DEAF16@houseofdesign.de
In response to: Re: Slow query with a lot of data ("Scott Carey" <scott@richrelevance.com>)
Responses: Re: Slow query with a lot of data
List: pgsql-performance
On 20.08.2008, at 20:06, Scott Carey wrote:

> Ok, so the problem boils down to the sort at the end.
>
> The query up through the merge join on domain is as fast as it's
> going to get. The sort at the end, however, should not happen
> ideally. There are not that many rows returned, and it should
> hash_aggregate if it thinks there is enough space to do so.
>
> The query planner is going to choose the sort > agg over the
> hash-agg if it estimates the total number of resulting rows to be
> large enough so that the hash won't fit in work_mem. However, there
> seems to be another factor here based on this:
>
> "GroupAggregate  (cost=11745105.66..12277396.81 rows=28704 width=12)"
> "  ->  Sort  (cost=11745105.66..11878034.93 rows=53171707 width=12)"
> "        Sort Key: a."user", b.category"
> "        ->  Merge Join  (cost=149241.25..1287278.89 rows=53171707 width=12)"
> "              Merge Cond: (b.domain = a.domain)"
>
> The planner actually thinks there will only be 28704 rows returned
> of width 12. But it chooses to sort 53 million rows before
> aggregating. That's either a bug or there's something else wrong
> here. That is the wrong way to aggregate those results no matter
> how much work_mem you have unless I'm completely missing something...
>
> You can try rearranging the query just to see if you can work around
> this. What happens if you compare the explain on:
>
> select
>   a."user", b.category, sum(1.0/b.cat_count)::float
> from result a, domain_categories b
> where a."domain" = b."domain"
> and b.depth < 4
> and a.results > 100
> and a."user" < 30000
> group by a."user", b.category

"HashAggregate  (cost=1685527.69..1686101.77 rows=28704 width=12)"
"  ->  Merge Join  (cost=148702.25..1286739.89 rows=53171707 width=12)"
"        Merge Cond: (b.domain = a.domain)"
"        ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..421716.32 rows=5112568 width=12)"
"              Filter: (depth < 4)"
"        ->  Sort  (cost=148415.16..148513.60 rows=39376 width=8)"
"              Sort Key: a.domain"
"              ->  Bitmap Heap Scan on result a  (cost=1249.93..145409.79 rows=39376 width=8)"
"                    Recheck Cond: ("user" < 30000)"
"                    Filter: (results > 100)"
"                    ->  Bitmap Index Scan on result_user_idx  (cost=0.00..1240.08 rows=66881 width=0)"
"                          Index Cond: ("user" < 30000)"

> to
>
> select
>   c."user", c.category, sum(1.0/c.cat_count)::float
> from (select a."user", b.category, b.cat_count
>       from result a, domain_categories b
>       where a."domain" = b."domain"
>       and b.depth < 4
>       and a.results > 100
>       and a."user" < 30000 ) c
> group by c."user", c.category

"HashAggregate  (cost=1685527.69..1686101.77 rows=28704 width=12)"
"  ->  Merge Join  (cost=148702.25..1286739.89 rows=53171707 width=12)"
"        Merge Cond: (b.domain = a.domain)"
"        ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..421716.32 rows=5112568 width=12)"
"              Filter: (depth < 4)"
"        ->  Sort  (cost=148415.16..148513.60 rows=39376 width=8)"
"              Sort Key: a.domain"
"              ->  Bitmap Heap Scan on result a  (cost=1249.93..145409.79 rows=39376 width=8)"
"                    Recheck Cond: ("user" < 30000)"
"                    Filter: (results > 100)"
"                    ->  Bitmap Index Scan on result_user_idx  (cost=0.00..1240.08 rows=66881 width=0)"
"                          Index Cond: ("user" < 30000)"

> It shouldn't make a difference, but I've seen things like this help
> before so it's worth a try. Make sure work_mem is reasonably sized
> for this test.

It's exactly the same. work_mem was set to 3000MB.
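For reference, a minimal way to confirm the session-level setting before re-reading the plan; the value mirrors the 3000MB mentioned above and the query is Scott's first variant, so nothing here goes beyond standard PostgreSQL commands:

-- check and set work_mem for this session only, then look at the plan again
SHOW work_mem;
SET work_mem = '3000MB';

EXPLAIN
select a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
  and b.depth < 4
  and a.results > 100
  and a."user" < 30000
group by a."user", b.category;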
> Another thing that won't be that fast, but may avoid the sort, is to
> select the subselection above into a temporary table, analyze it,
> and then do the outer select. Make sure your settings for temporary
> space (temp_buffers in 8.3) are large enough for the intermediate
> results (700MB should do it). That won't be that fast, but it will
> most likely be faster than sorting 50 million + rows. There are
> lots of problems with this approach but it may be worth the
> experiment.

I'll try this. Thanks so far!
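A rough sketch of the temporary-table experiment Scott describes, assuming PostgreSQL 8.3 as in the thread; the table name tmp_user_cat is a placeholder, and the sizes are the ones mentioned above, not measured values:

-- temp_buffers must be set before the first use of temporary tables in the session
SET temp_buffers = '700MB';

-- materialize the inner selection, then gather statistics so the planner
-- knows how big the intermediate result really is
CREATE TEMPORARY TABLE tmp_user_cat AS
  select a."user", b.category, b.cat_count
  from result a, domain_categories b
  where a."domain" = b."domain"
    and b.depth < 4
    and a.results > 100
    and a."user" < 30000;

ANALYZE tmp_user_cat;

-- outer aggregation over the (much smaller) materialized result
select "user", category, sum(1.0/cat_count)::float
from tmp_user_cat
group by "user", category;

Whether this actually avoids the big sort depends on what the planner decides after ANALYZE has collected statistics on the temporary table.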