Re: pg9.6 segfault using simple query (related to use fk for join estimates) - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: pg9.6 segfault using simple query (related to use fk for join estimates)
Date:
Msg-id: 8344835e-18af-9d40-aed7-bd2261be9162@2ndquadrant.com
In response to: Re: pg9.6 segfault using simple query (related to use fk for join estimates) (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: pg9.6 segfault using simple query (related to use fk for join estimates)
           Re: pg9.6 segfault using simple query (related to use fk for join estimates)
List: pgsql-hackers
Hi,

On 04/30/2016 07:35 PM, Tom Lane wrote:
> Julien Rouhaud <julien.rouhaud@dalibo.com> writes:
>> On 29/04/2016 18:05, Tom Lane wrote:
>>> Julien Rouhaud <julien.rouhaud@dalibo.com> writes:
>>>> The segfault is caused by quals_match_foreign_key() calling get_leftop()
>>>> and get_rightop() on a ScalarArrayOpExpr node.
>>>>
>>>> I'm not sure that assuming this compatibility is the right way to fix
>>>> this though.
>
>>> It certainly isn't.
>
>> Agreed. New attached patch handles explicitly each node tag.
>
> No, this is completely nuts. The problem is that quals_match_foreign_key
> is simply assuming that every clause in the list it's given is an OpExpr,
> which is quite wrong. It should just ignore non-OpExpr quals, since it
> cannot do anything useful with them anyway. There's a comment claiming
> that non-OpExpr quals were already rejected:
>
> * Here since 'usefulquals' only contains bitmap indexes for quals
> * of type "var op var" we can safely skip checking this.
>
> but that doesn't appear to have anything to do with current reality.

Yes, I agree - there should be a check that the node is an OpExpr in
quals_match_foreign_key. This is clearly a bug :-(
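Something along these lines, I think (just a sketch to illustrate the
idea - the loop structure and variable names are my assumptions, not
the actual code in quals_match_foreign_key):

    ListCell   *lc;

    foreach(lc, joinquals)
    {
        Node       *qual = (Node *) lfirst(lc);

        /*
         * Ignore anything that is not a plain OpExpr (e.g. the
         * ScalarArrayOpExpr from the reported query) - we can't do
         * anything useful with such quals here anyway.
         */
        if (!IsA(qual, OpExpr))
            continue;

        /* ... existing matching logic using get_leftop()/get_rightop() ... */
    }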
> While this in itself is about a two-line fix, after reviewing
> 137805f89acb3611 I'm pretty unhappy that it got committed at all.
> I think this obvious bug is a good sign that it wasn't ready.
> Other unfinished aspects like invention of an undocumented GUC
> don't leave a good impression either.

The GUC is meant to make testing during development easier. I hadn't
realized it got committed, but I assume Simon plans to remove it before
the final release. I can't check with Simon right now, though, as he's
out of office this week. In any case, I do agree the GUC should have
been documented better, and we should probably put it on the TODO so
that it gets removed before the actual release.

> Moreover, it looks to me like this will add quite a lot of overhead,
> probably far more than is justified, because
> clauselist_join_selectivity is basically O(N^2) in the
> relation-footprint of the current join --- and not with a real small
> multiplier, either, as the functions it calls contain about four
> levels of nested loops themselves. Maybe that's unmeasurable on
> trivial test cases but it's going to be disastrous in large queries,
> or for relations with large numbers of foreign keys.

That depends on a lot of factors, I guess. Attached are two SQL scripts
that create 5 tables (f1, f2, f3, f4, f5) and then create N foreign
keys on each of them. The foreign keys reference other tables, so the
loops won't terminate early (after finding the first match) when
joining the first 5 tables, which makes this the worst case.

We then join different numbers of tables (2, 3, 4 or 5) and run EXPLAIN
ANALYZE to measure planning time. The first script (fk-test.sql) does
joins on 2 columns, fk-test-6.sql does joins on 6 columns (so that we
also know how the number of columns affects the planning time).

Sum of planning time for 100 queries (in milliseconds) with 2 columns,
where "2/on" means "join of 2 tables with enable_fk_estimates=on":

      N    2/on   2/off   3/on   3/off   4/on   4/off   5/on   5/off
     10       6       4      9       8     22      20     77      68
    100      15      11     26      19     64      36    177      93
   1000      97      84    217     133    516     233   1246     342

and comparison of the timings:

      N       2      3      4      5
     10    155%   115%   114%   113%
    100    142%   138%   177%   190%
   1000    116%   163%   221%   364%

And with the 6 columns:

      N    2/on   2/off   3/on   3/off   4/on   4/off   5/on   5/off
     10      25       7     23      20     96      82    344     297
    100      21      14     65      33    233     104    735     328
   1000     151      99    492     153   1603     320   4627     593

and comparison:

      N       2      3      4      5
     10    371%   120%   117%   116%
    100    149%   196%   224%   224%
   1000    152%   322%   502%   780%

So yes, the overhead is clearly measurable, no doubt about that.

> I think this should be reverted and pushed out to 9.7 development.
> It needs a significant amount of rewriting to fix the performance
> issue, and now's not the time to be doing that.

There are probably a few reasonably simple things we could do - e.g.
ignore foreign keys with just a single column, as the primary goal of
the patch is improving estimates with multi-column foreign keys. I
believe that covers the vast majority of foreign keys in the wild.
A sketch of that check is below.
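Roughly something like this (again only a sketch - ForeignKeyOptInfo,
the nkeys field and root->fkey_list are assumptions about how the FK
metadata is exposed; the actual names in the patch may differ):

    ListCell   *lc;

    foreach(lc, root->fkey_list)
    {
        ForeignKeyOptInfo *fkinfo = (ForeignKeyOptInfo *) lfirst(lc);

        /*
         * Single-column foreign keys gain us nothing over the regular
         * per-column estimates, so don't spend any cycles on them.
         */
        if (fkinfo->nkeys < 2)
            continue;

        /* ... consider this FK for join selectivity estimation ... */
    }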
If that's deemed insufficient, we'll have to resort to a more complex
improvement, perhaps something akin to the cache proposed in the
unijoin patch. But if that's required, that's 9.7 material.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services