Re: PATCH: use foreign keys to improve join estimates v1 - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: PATCH: use foreign keys to improve join estimates v1 |
Date | |
Msg-id | 55D5323A.9060503@2ndquadrant.com Whole thread Raw |
In response to | Re: PATCH: use foreign keys to improve join estimates v1 (David Rowley <david.rowley@2ndquadrant.com>) |
Responses |
Re: PATCH: use foreign keys to improve join estimates v1
Re: PATCH: use foreign keys to improve join estimates v1 |
List | pgsql-hackers |
Hi, attached is a significantly reworked patch for using the foreign keys in selectivity estimation. The previous patch only really worked if the clauses matched the foreign key perfectly (i.e. no additional join clauses) - this patch attempts to relax those restrictions a bit. This patch also significantly improves the comments - the best place to start reading is clauselist_join_selectivity(). In general, the patch works by looking for foreign keys between the inner and outer side of the join, but only for keys that: (a) have more than 2 keys (as this only really helps with multi- column foreign keys) (b) are 'fully matched' by the join clauses, i.e. there's a clause exactly matching each part of the foreign key Once we have matched foreign keys (for each join), we can estimate each of them using 1/cardinality of the referenced table and estimate the remaining clauses (not used to match any foreign key) the old way. example with 3 tables --------------------- create table a (a1 int, a2 int, primary key (a1, a2)); create table b (b1 int, b2 int, primary key (b1, b2)); create table f (f1 int, f2 int, f3 int, f4 int, foreign key (f1,f2) references a(a1,a2), foreign key (f3,f4) references b(b1,b2)); insert into a select i, i from generate_series(1,10000) s(i); insert into b select i, i from generate_series(1,10000) s(i); -- 10x insert into f select i, i, i, i from generate_series(1,10000) s(i); analyze; Then on current master, I get these estimates (showing just rows, because that's what matter): while with the patch I get this: select * from f join a on (f1 = a1 and f2 = a2); QUERY PLAN ------------------------------------------------------------------------ Hash Join (rows=100000) (actual rows=100000) Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2)) -> Seq Scan on f (rows=100000) (actual rows=100000) -> Hash (rows=10000) (actual rows=10000) Buckets: 16384 Batches: 1 Memory Usage: 519kB -> Seq Scan on a (rows=10000) (actual rows=10000) select * from f join a on (f1 = a1 and f2 = a2) join b on (f3 = b1 and f4 = b2); QUERY PLAN ------------------------------------------------------------------------ Hash Join (rows=100000) (actual rows=100000) Hash Cond: ((f.f3 = b.b1) AND (f.f4 = b.b2)) -> Hash Join (rows=100000) (actual rows=100000) Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2)) -> Seq Scan on f (rows=100000) (actual rows=100000) -> Hash (rows=10000) (actual rows=10000) Buckets: 16384 Batches: 1 Memory Usage: 519kB -> Seq Scan on a (rows=10000) (actual rows=10000) -> Hash (rows=10000) (actual rows=10000) Buckets: 16384 Batches: 1 Memory Usage: 519kB -> Seq Scan on b (rows=10000) (actual rows=10000) So, that's pretty good. I believe it might be possible to use even foreign keys matched only partially (e.g. foreign key on 3 columns, but only 2 of those matched by clauses), but I think that's a bit too much for now. The examples above are rather trivial, and sadly it's not difficult to break them. For example by adding a single additional join clause to the first query does this: select * from f join a on (f1 = a1 and f2 = a2 and f1 = a2); QUERY PLAN ------------------------------------------------------------------------ Hash Join (rows=2) (actual rows=100000) Hash Cond: (f.f1 = a.a1) -> Seq Scan on f (rows=500) (actual rows=100000) Filter: (f1 = f2) -> Hash (rows=50) (rows=10000) Buckets: 16384 (originally 1024) Batches: 1 (originally 1) ... -> Seq Scan on a (rows=50) (actual rows=10000) Filter: (a1 = a2) This of course happens "thanks to" equality classes because the planner is smart enough to realize that (f1=f2=a1=a2) thanks to the new clause. I'm not sure how to fix this, and maybe this particular case would be easier to fix using the multivariate stats (once it can estimate clauses like a1=a2). Similarly, the equality classes may break other examples by deriving completely new clauses - for example assume the "f" table is defined like this (again, with 100k rows): create table f (f1 int, f2 int, foreign key (f1,f2) references a(a1,a2), foreign key (f1,f2) references b(b1,b2)); then this query select * from f join a on (f1 = a1 and f2 = a2) join b on (f1 = b1 and f2 = b2); may get planned like this: QUERY PLAN ------------------------------------------------------------------------ Hash Join (rows=100000) (actual rows=100000) Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2)) -> Seq Scan on f (rows=100000) (actual rows=100000) -> Hash (rows=1) (actual rows=10000) -> Hash Join (rows=1) (actual rows=10000) Hash Cond: ((a.a1 = b.b1) AND (a.a2 = b.b2)) -> Seq Scan on a (rows=10000) (actual rows=10000) -> Hash (rows=10000) (actual rows=10000) -> Seq Scan on b (rows=10000) (actual rows=10000) because the planner derived that (a1=b1 and a2=b2) by looking at both join clauses, and joined 'a' and 'b' first. And of course, there are no foreign keys between these two tables (effectively dimensions), so the patch can do nothing about the selectivities. I'm not sure how serious problem this really is in practice - those examples are constructed to show the issue. I also can't find a good place to address this - either by tweaking the estimates before the equality classes are processed, or somehow after. It however illustrates that with this patch the user would be able to influence the planner - either intentionally or by accident. For example what should happen if someone creates the same foreign key twice? Should we detect it and only count it once into the selectivity, or what? There's a bunch of similar cases mentioned in the comment before clauselist_join_selectivity. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: