Thread: Plans with bad estimates/paths
Hello,

I am struggling with a problem that appears planning related. I'm hoping folk here may be able to advise on how best to tackle the issue.

We have a system that ingests JSON messages containing order data. The data from these messages is inserted into a normalised table structure; "order", "order_item", "order_discount", etc. Orders can have multiple items, items can have multiple discounts and so forth. Referential constraints exist between the tables in the expected way. Fields in the tables are all declared NOT NULL. If the schema of the JSON is such that a field is optional, we reflect that optionality in another "order_<X>" table, and make "order_<X>" be a subset of "order". Moreover, we ingest messages for different clients/customers. Therefore each message-related table carries with it a client identifier which forms part of the primary key on the table. For example, "order" has a key of "(client_id, order-id)".

We have written transformations that calculate various facts about orders. For example, one of the transforms emits order item data where we have calculated the overall discount for the item, and have "filled in" some of the optional fields with defaults, and have categorised the order item on the basis of some aspect of the order (e.g. "this is an e-commerce order, this is retail order"). These transforms are typically per client (e.g. `WHERE client_id = 123`) although in some cases we transform over multiple clients (e.g. `WHERE client_id = ANY (SELECT client_id FROM clients WHERE ...)`).

The issue is that for some clients, or combination of clients, the planner is choosing a path that takes substantially longer to evaluate than the plan it typically chooses for other clients. The number of tables being joined is in the region of 15. There is an extended statistic object in place to help the one aggregation that occurs (defined on the basis of the `GROUP BY` columns) to try and get a better estimate of the likely number of rows emitted. However, what I am often seeing in the explain plan is that the estimated rows is small and the actuals are significantly larger e.g.

Merge Join (cost=1.14..253250.32 rows=1099 width=69) (actual time=1268.587..2400.353 rows=4282355 loops=1)

I am assuming this underestimation is the source of the planner choosing the "wrong" path; in production, we have had to resort to setting the join and from collapse limits to 1 to force a naive plan to be generated. This is giving us execution times in the 10/20 second range vs. >45m in some cases.

(a) Do you have any suggestions on a general approach to tackling the problem? For example, one option might be to pre-compute some of the subqueries that are occurring in the transforms, write the results into their own tables, and substitute those tables in place of the subqueries in the main transform. Is this something people typically do in this situation?

(b) Do I need to provide a schema and explain plans to get any concrete advice on how to proceed?

Any advice/suggestions would be much appreciated.

Thanks,
-Joe
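P.S. In case it helps to see the shape of the thing, here is a minimal sketch of the kind of structure and per-client transform described above. The names and columns are illustrative only, not our actual definitions:

    CREATE TABLE "order" (
        client_id  integer NOT NULL,
        order_id   integer NOT NULL,
        placed_at  timestamptz NOT NULL,
        PRIMARY KEY (client_id, order_id)
    );

    -- An optional JSON field gets its own subset table, keyed the same way.
    CREATE TABLE order_channel (
        client_id  integer NOT NULL,
        order_id   integer NOT NULL,
        channel    text NOT NULL,
        PRIMARY KEY (client_id, order_id),
        FOREIGN KEY (client_id, order_id) REFERENCES "order" (client_id, order_id)
    );

    CREATE TABLE order_item (
        client_id      integer NOT NULL,
        order_item_id  integer NOT NULL,
        order_id       integer NOT NULL,
        quantity       integer NOT NULL,
        PRIMARY KEY (client_id, order_item_id),
        FOREIGN KEY (client_id, order_id) REFERENCES "order" (client_id, order_id)
    );

    -- A transform, restricted to a single client, filling in a default for the
    -- optional field:
    SELECT oi.client_id,
           oi.order_item_id,
           coalesce(oc.channel, 'retail') AS channel
      FROM order_item oi
      JOIN "order" o USING (client_id, order_id)
      LEFT JOIN order_channel oc USING (client_id, order_id)
     WHERE oi.client_id = 123;

The real transforms join roughly 15 such tables.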
On 11/16/21 9:22 PM, Joe Wildish wrote:
> Hello,
>
> I am struggling with a problem that appears planning related. I'm
> hoping folk here may be able to advise on how best to tackle the
> issue.
>
> We have a system that ingests JSON messages containing order data.
> The data from these messages is inserted into a normalised table
> structure; "order", "order_item", "order_discount", etc. Orders can
> have multiple items, items can have multiple discounts and so forth.
> Referential constraints exist between the tables in the expected way.
> Fields in the tables are all declared NOT NULL. If the schema of the
> JSON is such that a field is optional, we reflect that optionality in
> another "order_<X>" table, and make "order_<X>" be a subset of
> "order". Moreover, we ingest messages for different
> clients/customers. Therefore each message-related table carries with
> it a client identifier which forms part of the primary key on the
> table. For example, "order" has a key of "(client_id, order-id)".
>
> We have written transformations that calculate various facts about
> orders. For example, one of the transforms emits order item data
> where we have calculated the overall discount for the item, and have
> "filled in" some of the optional fields with defaults, and have
> categorised the order item on the basis of some aspect of the order
> (e.g. "this is an e-commerce order, this is retail order"). These
> transforms are typically per client (e.g. `WHERE client_id = 123`)
> although in some cases we transform over multiple clients (e.g.
> `WHERE client_id = ANY (SELECT client_id FROM clients WHERE ...)`).
>
> The issue is that for some clients, or combination of clients, the
> planner is choosing a path that takes substantially longer to
> evaluate than the plan it typically chooses for other clients. The
> number of tables being joined is in the region of 15. There is an
> extended statistic object in place to help the one aggregation that
> occurs (defined on the basis of the `GROUP BY` columns) to try and
> get a better estimate of the likely number of rows emitted. However,
> what I am often seeing in the explain plan is that the estimated rows
> is small and the actuals are significantly larger e.g.
>
> Merge Join (cost=1.14..253250.32 rows=1099 width=69) (actual
> time=1268.587..2400.353 rows=4282355 loops=1)
>

It sure smells like a case of correlated data for some clients but not
others, but who knows ... Hard to say without seeing the nodes below the
join. If the lower nodes are estimated accurately, then it's the join
selectivity that is estimated poorly, and there's not much we can do
about it :-(

Do the "good" plans have the same underestimate? Maybe there's just much
less data for those clients, and the "poor" plan ends up being fast anyway?

> I am assuming this underestimation is the source of the planner
> choosing the "wrong" path; in production, we have had to resort to
> setting the join and from collapse limits to 1 to force a naive plan
> to be generated. This is giving us execution times in the 10/20
> second range vs. >45m in some cases.
>

That may be happening, yes. So is it the join order that ends up being
wrong, or the join methods? Have you tried increasing the collapse limit
instead? Although, if it works for some queries but not others, that's
likely not going to help.

> (a) Do you have any suggestions on a general approach to tackling the
> problem? For example, one option might be to pre-compute some of the
> subqueries that are occurring in the transforms, write the results
> into their own tables, and substitute those tables in place of the
> subqueries in the main transform. Is this something people typically
> do in this situation?
>

The first thing I'd do is reduce the query size as much as possible. In
this case I'd try removing as many joins as possible until the issue
disappears. The simpler the query, the easier it is to investigate.

And yes, replacing parts of a query with a temporary table is a common
solution, because it's possible to collect statistics on it, build
indexes etc. That usually solves estimation issues in multi-tenancy.
Sometimes even a CTE with materialization is enough.

> (b) Do I need to provide a schema and explain plans to get any
> concrete advice on how to proceed?
>

That would be helpful, particularly after making the query as small as
possible.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
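PS: To make the temporary-table suggestion concrete, the pattern is roughly the following. The table and column names here (including "amount") are made up for illustration; adjust them to your schema:

    -- Pre-compute the troublesome subquery and let the planner collect
    -- statistics (and use indexes) on the result:
    CREATE TEMP TABLE item_discount AS
    SELECT client_id, order_item_id, sum(amount) AS total_discount
      FROM order_discount
     WHERE client_id = 123
     GROUP BY client_id, order_item_id;

    ANALYZE item_discount;
    CREATE INDEX ON item_discount (client_id, order_item_id);

    -- ...then join item_discount in the main transform in place of the subquery.

    -- Sometimes keeping it in the query but forcing materialization is enough:
    WITH d AS MATERIALIZED (
        SELECT client_id, order_item_id, sum(amount) AS total_discount
          FROM order_discount
         WHERE client_id = 123
         GROUP BY client_id, order_item_id
    )
    SELECT oi.client_id, oi.order_item_id, coalesce(d.total_discount, 0) AS total_discount
      FROM order_item oi
      LEFT JOIN d USING (client_id, order_item_id)
     WHERE oi.client_id = 123;

The point is that the planner then has real statistics for the pre-computed result, rather than having to guess the output of the subquery.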
Hi Tomas,

On Tue, 16 Nov 2021, at 22:03, Tomas Vondra wrote:
> It sure smells like a case of correlated data for some clients but not
> others, but who knows ... Hard to say without seeing the nodes below the
> join. If the lower nodes are estimated accurately, then it's the join
> selectivity that is estimated poorly, and there's not much we can do
> about it :-(

Here is a link to a segment of a plan where the estimations degrade. I've also pasted the plan segment at the end of this message.

https://gist.githubusercontent.com/joewildish/8eb66815d728399687b24647df941746/raw/ef8c62455dd34d76807feff5d164e6d8857014e2/gistfile1.txt

The nodes beneath the Merge Join seem to be estimated accurately (within 2x of the actual)? But, the Merge Join itself is a 10x under-estimate and the ones above that are even further out. You can see in the plan that this particular execution is for multiple clients (144, 145, 146 & 155). In my experiments, I am getting better results with a single client, although I don't know if that is down to the actual data size being smaller, or if the estimation for multiple values is inherently more inaccurate. Anyway, is this an example of a join selectivity problem?

> Do the "good" plans have the same underestimate? Maybe there's just much
> less data for those clients, and the "poor" plan ends up being fast anyway?

I think that may be happening but haven't been able to capture a plan yet that confirms it.

>> I am assuming this underestimation is the source of the planner
>> choosing the "wrong" path; in production, we have had to resort to
>> setting the join and from collapse limits to 1 to force a naive plan
>> to be generated. This is giving us execution times in the 10/20
>> second range vs. >45m in some cases.
>
> That may be happening, yes. So is it the join order that ends up being
> wrong, or the join methods?

I've seen both. For example, in the worst-performing plans, the root node has its first input estimated to produce 1 row and its second input estimated to produce c.40,000 rows. The second input is a SeqScan, presumably because of the single-row estimate of its sibling. Of course, the estimate of 1 turns out to be wildly inaccurate, the join produces c.2BN rows, and most are then filtered out.

In other (better) plans, the troublesome SeqScan doesn't exist: the relation in question gets joined lower down the tree, and it is not traversed by a SeqScan.

> Have you tried increasing the collapse limit
> instead? Although, if it works for some queries but not others, that's
> likely not going to help.

Yes, but it often creates poorer plans rather than better plans. In fact, I increase those limits locally when testing, to get the planner to consistently produce what it thinks is the best plan. (I find without this, sub-paths can be materialised, presumably because one of the collapse limits has been hit. Obviously I'd rather remove this particular variability when trying to debug.)

> The first thing I'd do is reduce the query size as much as possible. In
> this case I'd try removing as many joins as possible until the issue
> disappears. The simpler the query, the easier it is to investigate.
>
> And yes, replacing parts of a query with a temporary table is a common
> solution, because it's possible to collect statistics on it, build
> indexes etc. That usually solves estimation issues in multi-tenancy.
> Sometimes even a CTE with materialization is enough.

Thank you.
It seems, then, that the solution lies in simplifying the queries such that the chances of poor estimation are reduced/removed. (I have had some success with this today. One of the queries was bringing in a view which resulted in needless self-joins). However, such a solution begs the question -- which bits of the query should be pre-computed? And, will such work survive further changes in the underlying data distributions?

Thanks,
-Joe

Plan segment:

 Nested Loop Left Join  (cost=2.29..348095.52 rows=3 width=93) (actual time=828.619..3368.362 rows=517367 loops=1)
   ->  Nested Loop  (cost=1.86..348094.00 rows=3 width=81) (actual time=828.609..2655.136 rows=517367 loops=1)
         Join Filter: (clients.id = order_items.client_id)
         ->  Nested Loop  (cost=1.43..347875.73 rows=400 width=60) (actual time=828.603..1890.900 rows=517367 loops=1)
               Join Filter: (clients.id = order_items_1.client_id)
               ->  Merge Join  (cost=1.00..322712.24 rows=50370 width=48) (actual time=828.584..1224.993 rows=517367 loops=1)
                     Merge Cond: (order_items_2.client_id = clients.id)
                     ->  GroupAggregate  (cost=0.85..290718.67 rows=2518498 width=44) (actual time=0.040..1126.298 rows=1856351 loops=1)
                           Group Key: order_items_2.client_id, order_items_2.order_item_id
                           ->  Merge Left Join  (cost=0.85..240348.71 rows=2518498 width=18) (actual time=0.033..535.466 rows=1858076 loops=1)
                                 Merge Cond: ((order_items_2.client_id = discount_allocations.client_id) AND (order_items_2.order_item_id = discount_allocations.order_item_id))
                                 ->  Index Only Scan using pk_order_items on order_items order_items_2  (cost=0.43..131145.90 rows=2518498 width=12) (actual time=0.022..150.068 rows=1856352 loops=1)
                                       Heap Fetches: 0
                                 ->  Index Scan using pk_discount_allocations on discount_allocations  (cost=0.42..89823.14 rows=931889 width=18) (actual time=0.008..110.223 rows=677531 loops=1)
                     ->  Index Only Scan using pk_clients on clients  (cost=0.14..8.64 rows=4 width=4) (actual time=0.003..0.011 rows=4 loops=1)
                           Index Cond: (id = ANY ('{144,145,146,155}'::integer[]))
                           Heap Fetches: 0
               ->  Index Only Scan using pk_order_items on order_items order_items_1  (cost=0.43..0.49 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=517367)
                     Index Cond: ((client_id = order_items_2.client_id) AND (order_item_id = order_items_2.order_item_id))
                     Heap Fetches: 0
         ->  Index Scan using pk_order_items on order_items  (cost=0.43..0.53 rows=1 width=29) (actual time=0.001..0.001 rows=1 loops=517367)
               Index Cond: ((client_id = order_items_1.client_id) AND (order_item_id = order_items_1.order_item_id))
   ->  Index Scan using pk_order_item_variants on order_item_variants  (cost=0.43..0.51 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=517367)
         Index Cond: ((client_id = order_items_1.client_id) AND (order_item_id = order_items_1.order_item_id))
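P.S. For reference, the collapse-limit settings referred to above are just per-session settings, along these lines (the value of 16 for local testing is illustrative; anything above the number of relations in the query has the same effect):

    -- Production workaround: setting both limits to 1 stops the planner from
    -- reordering explicit joins or flattening subqueries, giving the "naive" plan.
    SET join_collapse_limit = 1;
    SET from_collapse_limit = 1;

    -- Local testing: raising the limits above the number of relations involved
    -- lets the planner consider the whole join tree at once.
    SET join_collapse_limit = 16;
    SET from_collapse_limit = 16;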