Re: Asymmetric partition-wise JOIN - Mailing list pgsql-hackers
From | Alexander Korotkov |
---|---|
Subject | Re: Asymmetric partition-wise JOIN |
Date | |
Msg-id | CAPpHfdt0NevYsYSy_QuJAJmk34+9qJY3Cc34Qi6LbdTugPBTmw@mail.gmail.com Whole thread Raw |
In response to | Re: Asymmetric partition-wise JOIN (Andrei Lepikhov <a.lepikhov@postgrespro.ru>) |
Responses |
Re: Asymmetric partition-wise JOIN
|
List | pgsql-hackers |
Hi! On Tue, Apr 2, 2024 at 6:07 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote: > On 15/10/2023 13:25, Alexander Korotkov wrote: > > Great! I'm looking forward to the revised patch. > Revising the code and opinions before restarting this work, I found two > different possible strategies mentioned in the thread: > 1. 'Common Resources' shares the materialised result of the inner table > scan (a hash table in the case of HashJoin) to join each partition one > by one. It gives us a profit in the case of parallel append and possibly > other cases, like the one shown in the initial message. > 2. 'Individual strategies' - By limiting the AJ feature to cases when > the JOIN clause contains a partitioning expression, we can push an > additional scan clause into each copy of the inner table scan, reduce > the number of tuples scanned, and even prune something because of proven > zero input. > > I see the pros and cons of both approaches. The first option is more > straightforward, and its outcome is obvious in the case of parallel > append. But how can we guarantee the same join type for each join? Why > should we ignore the positive effect of different strategies for > different partitions? > The second strategy is more expensive for the optimiser, especially in > the multipartition case. But as I can predict, it is easier to implement > and looks more natural for the architecture. What do you think about that? Actually, the idea I tried to express is the combination of #1 and #2: to build individual plan for every partition, but consider the 'Common Resources'. Let me explain this a bit more. Right now, we basically we consider the following properties during selection of paths. 1) Cost. The cheaper path wins. There a two criteria though: startup cost and total cost. So, we can keep both paths with cheaper startup costs and paths with cheaper total cost. 2) Pathkeys. We can keep a path with more expensive path, which has pathkeys potentially useful in future. My idea is to introduce a new property for paths selection. 3) Usage of common resources. The common resource can be: hash representation of relation, memoize over relation scan, etc. We can exclude the cost of common resource generation from the path cost, but keep the reference for the common resource with its generation cost. If one path uses more common resources than another path, it could cost-dominate another one only if its cheaper together with its extra common resources cost. If one path uses less or equal common resources than another, it could normally cost-dominate another one. Using these rules, we can gather the the plurality of paths for each child join taking common resources into account. After that we can apply some global optimization finding generation of which common resources can reduce the global cost. However, I understand this is huge amount of work given we have to introduce new basic optimizer concepts. I get that the main application of this patch is sharding. If we have global tables residing each shard, we can push down any joins with them. Given this patch gives some optimization for non-sharded case, I think we *probably* can accept its concept even that it this optimization is obviously not perfect. ------ Regards, Alexander Korotkov Supabase
pgsql-hackers by date: