Re: using extended statistics to improve join estimates - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: using extended statistics to improve join estimates
Date:
Msg-id: ff412e82-5427-3b72-cf32-4700156dc47c@enterprisedb.com
In response to: Re: using extended statistics to improve join estimates (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: using extended statistics to improve join estimates
           Re: using extended statistics to improve join estimates
           Re: using extended statistics to improve join estimates
List: pgsql-hackers
Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple of regression tests, covering various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

  -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
  -- CREATE STATISTICS s2 ON (a+1), b FROM t2;

  SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required an extended statistics MCV on both sides
of the join, but adding extended statistics on both sides may be
impractical (e.g. if one side does not have correlated columns, it's
silly to have to add them just to make this patch happy). For example,
you may have extended stats on the dimension table but not the fact
table, and the patch can still combine the two. Of course, if there's
no MCV on either side, we can't do much.

So this patch works when both sides have an extended statistics MCV, or
when one side has an extended MCV and the other side a regular
(per-column) MCV. The latter might seem silly, but the extended MCV
allows considering additional baserel conditions (if there are any).


examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

  create table t1 (a int, b int, c int);
  create table t2 (a int, b int, c int);

  insert into t1 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);
  insert into t2 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                                QUERY PLAN
  ------------------------------------------------------------------------
   Hash Join  (cost=31.00..106.00 rows=400 width=24) (actual time=5.426..22.678 rows=20000 loops=1)

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                                QUERY PLAN
  ------------------------------------------------------------------------
   Hash Join  (cost=28.50..160.75 rows=10000 width=24) (actual time=5.325..19.760 rows=10000 loops=1)

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25 and t2.c > 25;

                                QUERY PLAN
  ------------------------------------------------------------------------
   Hash Join  (cost=24.50..104.75 rows=4800 width=24) (actual time=5.618..5.639 rows=0 loops=1)

Now, let's create statistics:

  create statistics s1 on a, b, c from t1;
  create statistics s2 on a, b, c from t2;

  analyze t1, t2;

and now the same queries again:

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                                QUERY PLAN
  ------------------------------------------------------------------------
   Hash Join  (cost=31.00..106.00 rows=20000 width=24) (actual time=5.448..22.713 rows=20000 loops=1)

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                                QUERY PLAN
  ------------------------------------------------------------------------
   Hash Join  (cost=28.50..160.75 rows=10000 width=24) (actual time=5.317..16.680 rows=10000 loops=1)

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25 and t2.c > 25;

                                QUERY PLAN
  ------------------------------------------------------------------------
   Hash Join  (cost=24.50..104.75 rows=1 width=24) (actual time=5.647..5.667 rows=0 loops=1)
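As a rough sketch of the one-sided case from point (3) - not one of the
cases shown above, and with the estimates omitted since they depend on
the MCV contents - one could drop the extended statistics on t2 and rely
on its per-column MCV, keeping the extended MCV on t1 so the baserel
condition on t1.c can still be considered:

  drop statistics s2;
  analyze t2;

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
   where t1.c < 25;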
Those examples are a bit simplistic, but the improvements are fairly
clear, I think.


limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make the selection of statistics much
more complicated, because we'd have to pick "matching" statistics on
both sides of the join. So it seems like overkill, and most joins have
very few conditions anyway.

2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported:

  t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

  t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

  t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

  t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates. But
I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions, but
those are currently estimated using a default selectivity in selfuncs.c
anyway.

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given the query

  explain analyze
  select * from t1
    join t2 on (t1.a = t2.a and t1.b = t2.b)
    join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but the second join
actually combines (t1,t2,t3). What the patch currently does is split it
into (t1,t2) and (t2,t3) and estimate those.

I wonder if this should actually combine all three MCVs at once - we're
pretty much combining the MCVs into one large MCV representing the join
result. But I haven't done that yet, as it requires the MCVs to be
combined using the join clauses (i.e. to overlap in a way), and I'm not
sure how likely that is in practice. In this example it could help, but
it's a bit of an artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.
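For reference, join shapes like the following (an outer join, and a
semi-join via EXISTS, written against the t1/t2 tables from the examples
above) are not covered by the patch yet, so they keep the current
estimates:

  select * from t1 left join t2 on (t1.a = t2.a and t1.b = t2.b);

  select * from t1
   where exists (select 1 from t2 where t2.a = t1.a and t2.b = t1.b);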
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment