Re: Speeding up aggregates - Mailing list pgsql-performance
From:           Joe Conway
Subject:        Re: Speeding up aggregates
Msg-id:         3DF39E2A.4010802@joeconway.com
In response to: Re: Speeding up aggregates (Tom Lane <tgl@sss.pgh.pa.us>)
List:           pgsql-performance
Tom Lane wrote:
> FWIW, I've implemented hashed aggregation in CVS tip. I have not had
> the time to try to benchmark it, but I'd be interested if anyone can
> run some tests on 7.4devel. Eliminating the need for a SORT step
> should help aggregations over large datasets.
>
> Note that even though there's no SORT, the sort_mem setting is used
> to determine the allowable hashtable size, so a too-small sort_mem
> might discourage the planner from selecting hashed aggregation.
> Use EXPLAIN to see which query plan gets chosen.

Here are some tests on a reasonably sized (and real-life, as opposed to
contrived) dataset:

parts=# set enable_hashagg to off;
SET
parts=# explain analyze select i.part_id, sum(w.qty_oh) as total_oh
parts-# from inv i, iwhs w where i.part_id = w.part_id group by i.part_id;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=11111.93..11744.90 rows=35528 width=36) (actual time=2799.40..3140.17 rows=34575 loops=1)
   ->  Sort  (cost=11111.93..11293.31 rows=72553 width=36) (actual time=2799.35..2896.43 rows=72548 loops=1)
         Sort Key: i.part_id
         ->  Hash Join  (cost=1319.10..5254.45 rows=72553 width=36) (actual time=157.72..1231.01 rows=72548 loops=1)
               Hash Cond: ("outer".part_id = "inner".part_id)
               ->  Seq Scan on iwhs w  (cost=0.00..2121.53 rows=72553 width=22) (actual time=0.01..286.80 rows=72553 loops=1)
               ->  Hash  (cost=1230.28..1230.28 rows=35528 width=14) (actual time=157.50..157.50 rows=0 loops=1)
                     ->  Seq Scan on inv i  (cost=0.00..1230.28 rows=35528 width=14) (actual time=0.02..88.00 rows=35528 loops=1)
 Total runtime: 3168.73 msec
(9 rows)

parts=# set enable_hashagg to on;
SET
parts=# explain analyze select i.part_id, sum(w.qty_oh) as total_oh
parts-# from inv i, iwhs w where i.part_id = w.part_id group by i.part_id;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=5617.22..5706.04 rows=35528 width=36) (actual time=1507.89..1608.32 rows=34575 loops=1)
   ->  Hash Join  (cost=1319.10..5254.45 rows=72553 width=36) (actual time=153.46..1231.34 rows=72548 loops=1)
         Hash Cond: ("outer".part_id = "inner".part_id)
         ->  Seq Scan on iwhs w  (cost=0.00..2121.53 rows=72553 width=22) (actual time=0.01..274.74 rows=72553 loops=1)
         ->  Hash  (cost=1230.28..1230.28 rows=35528 width=14) (actual time=153.21..153.21 rows=0 loops=1)
               ->  Seq Scan on inv i  (cost=0.00..1230.28 rows=35528 width=14) (actual time=0.03..84.67 rows=35528 loops=1)
 Total runtime: 1661.53 msec
(7 rows)

parts=# explain analyze select i.part_id, sum(w.qty_oh) as total_oh
parts-# from inv i, iwhs w where i.part_id = w.part_id group by i.part_id
parts-# having sum(w.qty_oh) > 0;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=11111.93..12015.10 rows=35528 width=36) (actual time=2823.65..3263.16 rows=4189 loops=1)
   Filter: (sum(qty_oh) > 0::double precision)
   ->  Sort  (cost=11111.93..11293.31 rows=72553 width=36) (actual time=2823.40..2926.07 rows=72548 loops=1)
         Sort Key: i.part_id
         ->  Hash Join  (cost=1319.10..5254.45 rows=72553 width=36) (actual time=156.39..1240.61 rows=72548 loops=1)
               Hash Cond: ("outer".part_id = "inner".part_id)
               ->  Seq Scan on iwhs w  (cost=0.00..2121.53 rows=72553 width=22) (actual time=0.01..290.47 rows=72553 loops=1)
               ->  Hash  (cost=1230.28..1230.28 rows=35528 width=14) (actual time=156.16..156.16 rows=0 loops=1)
                     ->  Seq Scan on inv i  (cost=0.00..1230.28 rows=35528 width=14) (actual time=0.02..86.95 rows=35528 loops=1)
 Total runtime: 3282.27 msec
(10 rows)

Note that, similar to Josh, I saw a nice improvement when using
HashAggregate on the simpler case, but as soon as I added a HAVING
clause the optimizer switched back to GroupAggregate. I'll try to play
around with this a bit more later today.

Joe
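The two plans above differ only in how groups are formed: GroupAggregate sorts the joined rows on part_id and sums each contiguous run, while HashAggregate makes a single pass, accumulating per-key totals in an in-memory hash table (whose size is bounded by sort_mem). A minimal sketch of the two strategies in plain Python — illustrative only, not PostgreSQL's actual executor code; the rows and part_ids are made up:

```python
from collections import defaultdict

# Hypothetical joined rows: (part_id, qty_oh)
rows = [("p1", 5.0), ("p2", 3.0), ("p1", 2.0), ("p3", 0.0)]

def group_aggregate(rows):
    # Sort-based grouping: sort on the group key, then sum each
    # contiguous run of equal keys (this is the extra SORT step).
    out = []
    for part_id, qty in sorted(rows, key=lambda r: r[0]):
        if out and out[-1][0] == part_id:
            out[-1][1] += qty
        else:
            out.append([part_id, qty])
    return [tuple(r) for r in out]

def hash_aggregate(rows):
    # Hash-based grouping: one pass, no sort; totals live in a
    # hash table keyed on part_id.
    totals = defaultdict(float)
    for part_id, qty in rows:
        totals[part_id] += qty
    return sorted(totals.items())  # sorted only for deterministic output

# Both produce identical groups; HashAggregate simply skips the sort,
# which is where the runtime difference above comes from.
```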