Aggregate functions, fast! (long) - Mailing list pgsql-sql

From    | Ang Chin Han
Subject | Aggregate functions, fast! (long)
Msg-id  | 20000809145345.B5894@pintoo.com
List    | pgsql-sql
Apologies in advance for the length of this post, but this has been
bugging me for a week or so.

Consider a table with a primary key pk and a list of attributes a, b
and c:

    Table t
    pk     a  b  c
    --------------
    1      1  1  1
    2      1  2  3
    :     etc    :
    9998   1  1  1
    9999   2  1  2

If the table is large but the range of possible values for the
attributes is small, aggregate queries such as the following will be
very slow:

    1. select min(a) from t;
    2. select count(*) from t where a = 1 and b = 3;
    3. select sum(b) from t where a = 1 and c > 3;
    4. select avg(b) from t where c = 1;

One way of speeding up these types of queries is to have a table
summarizing table t:

    Table t_sum
    a  b  c  cnt
    ------------
    1  1  1  2     (this is from pk 1 and 9998 having the same a, b, c)
    1  2  3  1
    2  1  2  1
    :    etc   :

with a primary key of (a, b, c) and an integer cnt counting the number
of times a particular combination of (a, b, c) occurs in table t.

The queries making use of this might be rewritten as:

    1. select min(a) from t_sum;  -- same as above,
                                  -- but we have fewer rows to scan
    2. select sum(cnt) from t_sum where a = 1 and b = 3;
    3. select sum(b * cnt) from t_sum where a = 1 and c > 3;
    4. select sum(b * cnt) / sum(cnt) as avg from t_sum where c = 1;

(CAVEAT/BUG:
 1. query 2 must return 0 if the combination doesn't exist in t_sum
 2. rows with cnt = 0 will have to be deleted immediately, or
    select min(foo) might return the wrong result)

Now, t_sum can be automatically updated by triggers/rules (I'll get
into this later).

It might seem a bit pointless, but I've made the example very generic
to show that this _is_ a generic class of problem. Specific examples
might include:

    select count(*) from books where category = x;
    select count(*) from articles where category = x and author = y;

My point is that it would be nice if there were an easier mechanism, a
la CREATE VIEW being a shortcut (and more!) for 'select x from y where
z = a;'. The syntax might look like:

    CREATE AGGREGATE INDEX t1_sum ON t1 (a, b, c);

which would create the implicit triggers and table.
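The rewrites above are easy to check outside postgres. Here is a small
Python/sqlite3 mock-up (a sketch; the table and column names just
follow the example above, and t_sum is built with a one-off GROUP BY
rather than triggers) showing that the summary-table queries return
the same answers as the originals:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The base table t and its summary table t_sum from the example.
cur.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, a INT, b INT, c INT)")
cur.execute("CREATE TABLE t_sum (a INT, b INT, c INT, cnt INT,"
            " PRIMARY KEY (a, b, c))")

rows = [(1, 1, 1, 1), (2, 1, 2, 3), (9998, 1, 1, 1), (9999, 2, 1, 2)]
cur.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

# Build t_sum the brute-force way: one row per distinct (a, b, c).
cur.execute("INSERT INTO t_sum"
            " SELECT a, b, c, count(*) FROM t GROUP BY a, b, c")

# count(*) on t becomes sum(cnt) on the (much smaller) summary table;
# coalesce handles the "combination not present" caveat from the post.
cnt_slow = cur.execute(
    "SELECT count(*) FROM t WHERE a = 1 AND b = 1").fetchone()[0]
cnt_fast = cur.execute(
    "SELECT coalesce(sum(cnt), 0) FROM t_sum"
    " WHERE a = 1 AND b = 1").fetchone()[0]
assert cnt_slow == cnt_fast

# avg(b) on t becomes a cnt-weighted average on t_sum.
avg_slow = cur.execute(
    "SELECT avg(b) FROM t WHERE c = 1").fetchone()[0]
avg_fast = cur.execute(
    "SELECT 1.0 * sum(b * cnt) / sum(cnt) FROM t_sum"
    " WHERE c = 1").fetchone()[0]
assert avg_slow == avg_fast
```

Both pairs agree: the count comes to 2 (pk 1 and 9998) and the average
to 1.0, while the summary queries only ever touch three t_sum rows.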
The hard part is for postgres to have a rule rewrite system capable of
converting the queries. (Perhaps we'll get this when the
views-with-aggregate-columns bug is fixed?) Of course, the performance
gain can still be had if you manually rewrite your queries to take
advantage of the summary table (or "aggregate index", similar to a
normal index speeding up ranged lookups).

And, oh, the rules/triggers. (I used these, and they work great for
some of the tests I did, but I haven't fully debugged them for all
cases; they definitely have bugs 1 & 2 described above.)

---------- 8<- Cut here ---------
CREATE TABLE t_sum (
    a   INTEGER,
    b   INTEGER,
    c   INTEGER,
    cnt INTEGER,
    PRIMARY KEY (a, b, c)
);

-- If the (a, b, c) combination already exists, bump its counter
-- instead of inserting a duplicate row.
CREATE RULE t_sum_add_rule AS ON INSERT TO t_sum
    WHERE EXISTS (SELECT cnt FROM t_sum
                  WHERE a = new.a AND b = new.b AND c = new.c)
    DO INSTEAD
        UPDATE t_sum SET cnt = cnt + 1
        WHERE a = new.a AND b = new.b AND c = new.c;

CREATE RULE t_insert_rule AS ON INSERT TO t
    DO INSERT INTO t_sum VALUES (new.a, new.b, new.c, 1);

CREATE FUNCTION "t_sum_upd" () RETURNS opaque AS '
    begin
        update t_sum set cnt = cnt - 1
            where t_sum.a = old.a and t_sum.b = old.b
              and t_sum.c = old.c;
        insert into t_sum values (new.a, new.b, new.c, 1);
        return new;
    end;' LANGUAGE 'plpgsql';

CREATE TRIGGER "t_upd" BEFORE UPDATE ON "t"
    FOR EACH ROW EXECUTE PROCEDURE "t_sum_upd" ();

CREATE FUNCTION "t_sum_del" () RETURNS opaque AS '
    begin
        update t_sum set cnt = cnt - 1
            where t_sum.a = old.a and t_sum.b = old.b
              and t_sum.c = old.c;
        return old;
    end;' LANGUAGE 'plpgsql';

CREATE TRIGGER "t_del" BEFORE DELETE ON "t"
    FOR EACH ROW EXECUTE PROCEDURE "t_sum_del" ();
---------- 8<- Cut here ---------

P.S. This post was inspired by someone mentioning on the list that a
separate counter might be kept by postgres to speed up some aggregate
functions like select count(*) from t;

P.P.S.
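The postgres rules above don't translate directly to other engines,
but the maintenance logic itself is portable. As a sketch, here is an
sqlite3 version of the insert and delete paths (the update path is
omitted), driven from Python so it can be tested; note that, like the
original, it deliberately exhibits bug 2: deletions leave cnt = 0 rows
behind in t_sum.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (pk INTEGER PRIMARY KEY, a INT, b INT, c INT);
CREATE TABLE t_sum (a INT, b INT, c INT, cnt INT, PRIMARY KEY (a, b, c));

-- On insert into t: create the (a, b, c) row if needed, then bump cnt.
-- Same job as t_insert_rule + t_sum_add_rule in the postgres version.
CREATE TRIGGER t_ins AFTER INSERT ON t BEGIN
    INSERT OR IGNORE INTO t_sum VALUES (new.a, new.b, new.c, 0);
    UPDATE t_sum SET cnt = cnt + 1
        WHERE a = new.a AND b = new.b AND c = new.c;
END;

-- On delete from t: decrement. Like the original, this leaves
-- cnt = 0 rows behind (bug 2 in the post).
CREATE TRIGGER t_del AFTER DELETE ON t BEGIN
    UPDATE t_sum SET cnt = cnt - 1
        WHERE a = old.a AND b = old.b AND c = old.c;
END;
""")

con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 1), (2, 1, 2, 3), (9998, 1, 1, 1)])
con.execute("DELETE FROM t WHERE pk = 2")

counts = {(a, b, c): n
          for a, b, c, n in con.execute("SELECT * FROM t_sum")}
assert counts[(1, 1, 1)] == 2
assert counts[(1, 2, 3)] == 0   # bug 2: stale zero-count row survives
```

The INSERT OR IGNORE / UPDATE pair plays the role of the conditional
rule: the dummy cnt = 0 insert is a no-op when the key already exists,
and the unconditional increment then covers both cases.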
Curious: how do the commercial RDBMSes handle this:

    select count(*) from people where gender = 'm';

when people contains one million rows and the gender distribution is
NEARLY 50% male/female?