Re: Partition-wise join for join between (declaratively) partitioned tables - Mailing list pgsql-hackers

From:           Ashutosh Bapat
Subject:        Re: Partition-wise join for join between (declaratively) partitioned tables
Msg-id:         CAFjFpRe44=PsszMCbP9nro_oaTkF8M_f=GaPp_fgSkbB2XDYgg@mail.gmail.com
In response to: Re: Partition-wise join for join between (declaratively) partitioned tables (Robert Haas <robertmhaas@gmail.com>)
Responses:      Re: Partition-wise join for join between (declaratively) partitioned tables
List:           pgsql-hackers
On Tue, Oct 18, 2016 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Have you tested the effect of this patch on planner memory consumption
>>> with multi-way joins between tables with many partitions? If you
>>> haven't, you probably should. (Testing runtime would be good, too.)
>>> Does it grow linearly? Quadratically? Exponentially? Minor leaks
>>> don't matter, but if we're generating too much garbage we'll have to
>>> make sure it gets cleaned up soon enough to prevent runaway memory
>>> usage.
>>
>> I tried to check memory usage with various combinations of number of
>> partitions and number of relations being joined. For higher number of
>> relations being joined like 10 with 100 partitions, OOM killer kicked
>> in during the planning phase. I am suspecting
>> adjust_partitionrel_attrs() (changed that name to
>> adjust_join_appendrel_attrs() to be in sync with
>> adjust_appendrel_attrs()) to be the culprit. It copies expression
>> trees every time for joining two children. That's an exponentially
>> increasing number as the number of legal joins increases
>> exponentially. I am still investigating this.
>
> I think the root of this problem is that the existing paths share a
> lot more substructure than the ones created by the new code. Without
> a partition-wise join, the incremental memory usage for a joinrel
> isn't any different whether the underlying rel is partitioned or not.
> If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
> be pointing to some kind of Scan. But the join itself creates exactly
> the same amount of new stuff regardless of what's underneath it. With
> partition-wise join, that ceases to be true. Every joinrel - and the
> number of those grows exponentially in the number of baserels, IIUC -
> needs its own list of paths for every member rel. So if a
> non-partition-wise join created X paths, and there are K partitions, a
> partition-wise join creates X * K paths. That's a lot.
>
> Although we might be able to save some memory by tightening things up
> here and there - for example, right now the planner isn't real smart
> about recycling paths that are evicted by add_path(), and there's
> probably other wastage as well - I suspect that what this shows is
> that the basic design of this patch is not going to be viable.
> Intuitively, it's often going to be the case that we want the "same
> plan" for every partition-set. That is, if we have A JOIN B ON A.x =
> B.x JOIN C ON A.y = C.y, and if A, B, and C are all compatibly
> partitioned, then the result should be an Append plan with 100 join
> plans under it, and all 100 of those plans should be basically mirror
> images of each other. Of course, that's not really right in general:
> for example, it could be that A1 is big and A2 is small while B1 is
> small and B2 is big, so that the right plan for (A1 JOIN B1) and for
> (A2 JOIN B2) are totally different from each other. But in many
> practical cases we'll want to end up with a plan of precisely the same
> shape for all children, and the current design ignores this, expending
> both memory and CPU time to compute essentially-equivalent paths
> across all children.
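To put rough numbers on that (a back-of-the-envelope sketch, not a measurement of the patch; the five-paths-per-joinrel constant is an assumption, and the joinrel count assumes a clique join where every subset of relations is a legal join):

#include <stdio.h>

/*
 * Rough illustration of the blow-up described above: without partition-wise
 * join each joinrel keeps about X paths; with it, each joinrel keeps about
 * X paths for each of K child joins.  For an n-way clique join the planner
 * builds a joinrel for every subset of two or more relations, i.e.
 * 2^n - n - 1 of them.
 */
int
main(void)
{
	int		nrels = 10;				/* relations being joined */
	long	parts = 100;			/* partitions per relation */
	long	paths_per_joinrel = 5;	/* assumed paths surviving add_path() */
	long	joinrels = (1L << nrels) - nrels - 1;	/* 1013 for n = 10 */

	printf("paths without partition-wise join: %ld\n",
		   joinrels * paths_per_joinrel);			/* about 5 thousand */
	printf("paths with partition-wise join:    %ld\n",
		   joinrels * paths_per_joinrel * parts);	/* about half a million */
	return 0;
}

Those numbers line up with the 10-relation, 100-partition case above, where the OOM killer kicked in during planning.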
I think there are going to be two kinds of partitioning use-cases. First, partitions carefully hand-crafted by DBAs, so that every partition is different from the others and so is every join between two partitions. There will be a smaller number of partitions, but creating paths for each join between partitions will be crucial from a performance point of view. Consider, for example, systems which use partitions to consolidate results from different sources for analytical purposes, or sharding. If we consider the various points you have listed in [1] as to why a partition is equivalent to a table, each join between partitions is going to have very different characteristics and thus deserves a set of paths of its own. Add to that the possibility of partition pruning, or conditions affecting only particular partitions, and the need for detailed planning is evident.

The other use of partitioning is to distribute the data and/or quickly eliminate data by partition pruning. In such cases, all partitions of a given table will have very similar properties, and there is a large chance that we will end up with the same plan for every partition and for the joins between partitions. In those cases, I think it suffices to create paths for just one or maybe a handful of child joins and repeat that plan for the other child joins. But then it also makes sense to have a light-weight representation for partitions, rather than treating each partition as a full-fledged table. If we had such a light-weight representation, we might not even create RelOptInfos representing joins between partitions, let alone separate paths for each of those joins.

> One way of attacking this problem is to gang together partitions which
> are equivalent for planning purposes, as discussed in the paper "Join
> Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
> and Babu. However, it's not exactly clear how to do this: we could
> gang together partitions that have the same index definitions, but the
> sizes of the heaps, the sizes of their indexes, and the row counts
> will vary from one partition to the next, and any of those things
> could cause the plan choice to be different for one partition vs. the
> next. We could try to come up with heuristics for when those things
> are likely to be true. For example, suppose we compute the set of
> partitions such that all joined relations have matching index
> definitions on all tables; then, we take the biggest table in the set
> and consider all tables more than half that size as part of one gang.
> The biggest table becomes the leader and we compute partition-wise
> paths for just that partition; the other members of the gang will
> eventually get a plan that is of the same shape, but we don't actually
> create that plan until after scan/join planning is concluded.

Section 5 of that paper talks about clustering partitions together for joining only when there is a 1:m or n:1 partition matching for the join; in such a case, it clusters all the partitions from one relation that join a single partition of the other relation. Your idea of ganging up partitions with similar properties may reduce the number of paths we create, but, as you have mentioned, how to gang them up is not very clear; there are just too many factors, like the availability of indexes, the sizes of tables, the sizes of intermediate results etc., which make it difficult to identify the properties to gang up on. And even after we do that, in the worst case we will still end up creating paths for all partitions of all joins, so the increase in the number of paths remains proportional to the number of partitions.
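For concreteness, the "biggest table becomes the leader, everything more than half its size joins the gang" rule could look roughly like the toy sketch below. The struct and function names are made up for illustration (they are not from the patch or the paper), and it assumes matching index definitions have already been verified:

#include <stdio.h>

typedef struct ChildRel
{
	const char *name;
	double		size_mb;		/* heap size of the child table */
	int			gang;			/* output: gang number, -1 = unassigned */
} ChildRel;

/*
 * Repeatedly pick the biggest unassigned child as a gang leader and pull
 * every unassigned child larger than half the leader's size into its gang.
 */
static void
assign_gangs(ChildRel *rels, int n)
{
	int			next_gang = 0;
	int			i;

	for (;;)
	{
		int			leader = -1;

		for (i = 0; i < n; i++)
			if (rels[i].gang < 0 &&
				(leader < 0 || rels[i].size_mb > rels[leader].size_mb))
				leader = i;
		if (leader < 0)
			break;				/* every child assigned */

		for (i = 0; i < n; i++)
			if (rels[i].gang < 0 &&
				rels[i].size_mb > rels[leader].size_mb / 2)
				rels[i].gang = next_gang;
		next_gang++;
	}
}

int
main(void)
{
	ChildRel	rels[] = {
		{"t_p1", 900, -1}, {"t_p2", 850, -1}, {"t_p3", 400, -1},
		{"t_p4", 30, -1}, {"t_p5", 25, -1}
	};
	int			i;

	assign_gangs(rels, 5);
	for (i = 0; i < 5; i++)
		printf("%s -> gang %d\n", rels[i].name, rels[i].gang);
	return 0;
}

Only the leader of each gang would get real paths; as you say, the open question is whether heap size and matching index definitions alone are a good enough predictor of plan shape.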
In section 6.3, the paper mentions that the number of paths retained is linear in the number of child joins per parent join, so it's clear that the paper never considered a linear increase in paths to be a problem, or at least not a problem that that work had to solve. It is then surprising that their memory usage increased by only 7% to 10%. But 1. they might be measuring total memory rather than the memory used by the planner, and 2. they experimented with PostgreSQL 8.3.7, which probably tried far fewer paths than the current optimizer.

> Another idea is to try to reduce peak memory usage by performing
> planning separately for each partition-set. For example, suppose we
> decide to do a partition-wise join of A, B, and C. Initially, this
> gets represented as a PartitionJoinPath tree, like this:
>
> PartitionJoinPath
> -> AppendPath for A
> -> PartitionJoinPath
>    -> AppendPath for B
>    -> AppendPath for C
>
> Because we haven't created individual join paths for the members, this
> doesn't use much memory. Somehow, we come up with a cost for the
> PartitionJoinPath; it probably won't be entirely accurate. Once
> scan/join planning is concluded, if our final path contains a
> PartitionJoinPath, we go back and loop over the partitions.

A typical join tree will be composite: some portion partitioned and some portion unpartitioned, or different portions partitioned by different partitioning schemes. In such cases, inaccurate costs for a PartitionJoinPath can affect the plan heavily, causing a suboptimal path to be picked. Assuming that partitioning will be used for large data sets, choosing a suboptimal plan can be more dangerous than consuming memory to create paths. If we could come up with costs for a PartitionJoinPath by some method of interpolation, say by sampling a few partitions and extrapolating their costs to the entire PartitionJoinPath, this approach could work. But unless the partitions have very similar characteristics, or characteristics from whose differences the costs can be guessed, I do not see how that can happen. For example, while costing a PartitionJoinPath with pathkeys, the costs will change a lot based on whether the underlying relations have indexes and on which join methods are used, which in turn depends on the properties of the partitions. The same is the case for paths with parameterization. All such paths matter when a partitioned join relation joins an unpartitioned relation, or a partitioned relation with a different partitioning scheme. When each partition of a base relation being joined has different properties, the cost of the join between one set of partitions can differ from the join between another set; not only that, the costs of the various properties of the resulting paths, like pathkeys and parameterization, can vary a lot depending upon the available indexes and the row estimates for each join. So we need to come up with these cost estimates separately for each join between partitions to arrive at the cost of each PartitionJoinPath. And if we have to calculate those costs to create a PartitionJoinPath, we had better save them in paths rather than recalculating them in a second round of planning for the joins between partitions.

> For each
> partition, we switch to a new memory context, perform planning, copy
> the best path and its substructure back to the parent context, and
> then reset the context.
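If I understand the proposal correctly, that second pass would look roughly like the sketch below. This is only meant to illustrate the memory-context dance: PartitionJoinPath, its nparts/child_paths fields, plan_one_child_join() and copy_path_tree() are hypothetical placeholders, not code from the attached patch.

#include "postgres.h"

#include "nodes/relation.h"		/* PlannerInfo, Path */
#include "utils/memutils.h"

/* Expand one PartitionJoinPath after scan/join planning has finished. */
static void
expand_partition_join_path(PlannerInfo *root, PartitionJoinPath *pjpath)
{
	int			i;

	for (i = 0; i < pjpath->nparts; i++)
	{
		MemoryContext partcxt;
		MemoryContext oldcxt;
		Path	   *cheapest;

		/* Everything allocated while planning this child join goes here. */
		partcxt = AllocSetContextCreate(CurrentMemoryContext,
										"child join planning",
										ALLOCSET_DEFAULT_SIZES);
		oldcxt = MemoryContextSwitchTo(partcxt);

		/* Build and cost paths for the i-th pair of child relations. */
		cheapest = plan_one_child_join(root, pjpath, i);

		/*
		 * Copy only the winning path tree back into the caller's context.
		 * Note that copyObject() does not handle Path nodes today, so even
		 * this copy step needs new infrastructure.
		 */
		MemoryContextSwitchTo(oldcxt);
		pjpath->child_paths[i] = copy_path_tree(cheapest);

		/* Throw away the also-ran paths and other planning garbage. */
		MemoryContextDelete(partcxt);
	}
}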
This could be rather tricky. It assumes that none of the code that creates paths for a join allocates memory linked to some object in a context that lives longer than the path-creation context. There is some code, like create_join_clause() or make_canonical_pathkey(), which carefully chooses which memory context to allocate in, but can we ensure that everywhere? postgres_fdw, for example, allocates its PgFdwRelationInfo in the current memory context and attaches it to the RelOptInfo, which lives in the planner's original context; if we create a new memory context for each partition, the fpinfos would be invalidated when those contexts are released. It's not that we cannot enforce such a restriction on allocation during planning, but it's hard to enforce, and bugs arising from it may go unnoticed. The GEQO planner might have its own problems with this approach, and third-party FDWs will pose a problem as well.

A possible solution would be to keep track of the paths in use with a reference count: once the paths for a given join tree are created, free the unused ones by traversing the pathlist of each RelOptInfo. The attached patch has a prototype implementation of this. There are some paths which are not linked to RelOptInfos and need slightly different treatment, but they can be handled too.

> In that way, peak memory usage only grows by
> about a factor of 2 rather than a factor equal to the partition count,
> because we don't need to keep every possibly-useful path for every
> partition all at the same time, but rather every possibly-useful path
> for a single partition.
>
> Maybe there are other ideas but I have a feeling any way you slice it
> this is going to be a lot of work.

For the case of carefully hand-crafted partitions, I think users would expect the planner to use really the best plan and thus may be willing to accept increased memory usage. Any approach that does not create paths for the joins between partitions is not guaranteed to give the best plan, and users willing to spend that memory will be unhappy if we do not give them the best path. A user who creates hundreds of partitions will typically be running pretty powerful servers with a lot of memory, and on such servers the linear increase in memory for paths may not be as bad as you portray above, as long as it produces the best plan.

Also, just joining partitioned tables with hundreds of partitions does not by itself increase the number of paths; the number increases when two partitioned tables with a matching partitioning scheme are joined with an equality condition on the partition key. Unless we consider repartitioning, how many of the joining relations will share the same partitioning scheme? Section 8.6 of the paper mentions that "no TPC-H query plan, regardless of the partitioning scheme, contains n-way child joins for n >= 4". The maximum number of partitions the paper mentions is 168 (Table 3); my VM, which has 8GB RAM and 4 cores, handled that case pretty well. We may also add logic to free the space used by useless paths after join planning, to recover memory for the later stages of query processing.

There will still be users for whom the increase in memory usage is unexpected. They will need to be educated, or for them we might take the heuristic PartitionJoinPath-based approach discussed above; but I don't think that heuristic approach should be the default. Maybe we should supply a GUC which can switch between the approaches. Some ideas for GUCs:

1. delay_partition_wise_join - when ON, use the heuristic PartitionJoinPath approach.
2. A GUC similar to join_collapse_limit, limiting the number of partitioned relations that may be joined using the partition-wise join technique. A value of 1 would mean enable_partition_wise_join = false, so we could replace enable_partition_wise_join with this GUC.
3. A GUC max_joinable_partitions (open to suggestions for the name), specifying the maximum number of partitions that two relations may have to be eligible for partition-wise join.
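Wiring any of these up would just be new entries in guc.c. A rough sketch of what the boolean and integer variants might look like (the names and defaults are only the proposals above; the enclosing ConfigureNamesBool[] and ConfigureNamesInt[] arrays and the C variable declarations are elided):

	/* In ConfigureNamesBool[], next to the other planner enable_* GUCs */
	{
		{"delay_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,
			gettext_noop("Plans partition-wise joins lazily, one child join at a time."),
			NULL
		},
		&delay_partition_wise_join,
		false,
		NULL, NULL, NULL
	},

	/* In ConfigureNamesInt[], analogous to join_collapse_limit */
	{
		{"max_joinable_partitions", PGC_USERSET, QUERY_TUNING_OTHER,
			gettext_noop("Sets the maximum number of partitions for which "
						 "partition-wise join is considered."),
			NULL
		},
		&max_joinable_partitions,
		INT_MAX, 1, INT_MAX,
		NULL, NULL, NULL
	},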
I guess using these GUCs would let users handle the trade-off between getting the best plan and memory usage consciously. I think users would rather accept a suboptimal plan consciously than be handed a suboptimal plan without a choice.

[1] http://postgresql.nabble.com/design-for-a-partitioning-feature-was-inheritance-td5921603.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company