Join Removal/ Vertical Partitioning - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Join Removal/ Vertical Partitioning |
Date | |
Msg-id | 1214477827.3845.87.camel@ebony.site Whole thread Raw |
Responses |
Re: Join Removal/ Vertical Partitioning
|
List | pgsql-hackers |
There are common cases where we want to remove unwanted joins from queries, especially with view and Object Relational Mapping systems such as Hibernate etc.. (I've mentioned this before on -hackers or -perform, but I can't find the links) Typical case is where we have a class that has a subclass present fairly infrequently, so we want to physically separate the subclass for performance. It's easy to separate the data, but then much harder to get the SQL to recognise that is what we have done. The same problem also occurs in Data Warehousing and is typically known as Vertical Partitioning in that context. The attached example shows this for one and two subclasses. [joinrem.sql] Why do we care? Because the extra join hurts performance considerably. pgbench test files attached, executed using dual CPU system with pgbench -n -f <filename> -c 2 - t 30000 Access to 1 table 14,500 tps Access to 2 tables 9,000 tps Access to 3 tables 7,000 tps We might typically expect there to be many subclasses that follow this pattern (as well as some subclasses that are so frequent we would want to merge them with the main class table, but that is not relevant to discussion here), so the number of subclasses could be quite large. explain select c_c1 from wholeclass2tables where pk = 6; QUERY PLAN --------------------------------------------------------------------------------------- Nested Loop Left Join (cost=0.00..16.55 rows=1 width=4) Join Filter: (c.pk = sc.pk) -> Index Scan using class_pkey on class c (cost=0.00..8.27 rows=1 width=8) Index Cond: (pk = 6) -> Index Scan using subclass_pkey on subclass sc (cost=0.00..8.27 rows=1 width=4) Index Cond: (sc.pk = 6) (6 rows) The EXPLAIN shows that we access both tables, even though the access to "subclass" is not required to correctly resolve the query in *all* cases. Similar plan for 3+ table access. The join is removable because * rows are returned by the query whether or not rows exist in subclass * every row in "class" matches at *most* one row in "subclass" because the join columns are unique on both sides of the join. * the join operator function is IMMUTABLE (and is therefore able to be optimised away when circumstances allow) (Note that the FK link between the tables is not required to prove this, it is just shown because that is the way we would typically do this). Why can't we do this manually? We can do the "join removal" manually by creating different sets of views that mention or don't mention the tables. If we have say 5 sub-classes, then to optimise queries we would need to produce 5! = 120 views, all defined with the different combinations of tables that we might want to access. This is a management nightmare, but so is the idea that we might need to run 5 table joins when only direct access to a single table is required. We can check for removal of a rel by 1. inspecting the target list for the query to see if there are rels that do not provide any attributes. (We might also use equivalence classes to recode the targetlist to minimise the numbers of tables touched, but I think that might be overkill). 2. checking the rel is on the excluding side of an outer join (i.e. the right hand table in a "left outer join") 3. checking that the join columns on the rel are all the columns of any unique index on the rel. (If we join on *more* or less columns than are in the index then we must still do the join.) Once we have marked all rels that are removable we then need to check that we can still make one rel using the remaining tables. In some cases that may not be possible just yet, because of the limited inference possibilities across outer join boundaries, a problem discussed elsewhere on -hackers. But it seems possible to allow removal of rels mentioned in exactly one join. i.e. given A <-> B <-> C then B is not removable, A and C are. We do join removal as part of the prep before join planning takes place. First we would annotate which rels have attributes that form part of the targetlist during build_base_rel_tlists(). During or immediately after deconstruct_jointree() we can attempt to remove joins by testing the other conditions in order. It looks to me that there would be little additional planning time for cases where this optimisation would not apply. I'm not sure whether it would be best to attempt to remove joins before we have established equivalence classes or afterwards. In the longer term, afterwards would be best. ISTM it will be easier to remove joins before we attempt to establish a join order, yet many of the tests run during join order selection would need to be run to test join removal. So some thoughts on where to attempt this would be very useful. Are there specific problems foreseen with this? Would working on this be sensible before the outer join equivalence problem has been solved? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
pgsql-hackers by date: