Re: [HACKERS] Regression in join selectivity estimations when usingforeign keys - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: [HACKERS] Regression in join selectivity estimations when usingforeign keys |
Date | |
Msg-id | CAKJS1f9QGqqm8BX6WceKB4RA3ADVH1N6rU83GkSRZJcv+9B-yw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Regression in join selectivity estimations when using foreign keys (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [HACKERS] Regression in join selectivity estimations when usingforeign keys
|
List | pgsql-hackers |
On 21 May 2017 at 07:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm entirely unconvinced by this patch --- it seems to simply be throwing > away a lot of logic. Notably it lobotomizes the FK code altogether for > semi/antijoin cases, but you've not shown any example that even involves > such joins, so what's the argument for doing that? Also, the reason > we had the use_smallest_selectivity code in the first place was that we > didn't believe the FK-based selectivity could be applied safely to > outer-join cases, so simply deciding that it's OK to apply it after all > seems insufficiently justified. Originally I looked at just multiplying the selectivities in use_smallest_selectivity, but on further thought, I didn't manage to see why using foreign keys to assist with selectivity estimations when the ref_is_outer is true. We still have a very high probability that the outer rel contains a tuple to match each inner rel's tuples (because inner references outer). The only difference between OUTER and INNER typed joins is that OUTER will produce a bunch of NULL rows in place of non-matched inner rows. That part seems to be handled in calc_joinrel_size_estimate() by code such as: if (nrows < outer_rows) nrows = outer_rows; We could do much better than what we have today by also adding outer_rows - (n_distinct inner rows on referencing Vars), to also account for those unmatched rows. However, I'd say that's not for back-patching, although it may be especially good now that we have multi-variate ndistinct in PG10. > Or in short, exhibiting one counterexample to the existing code is not > a sufficient argument for changing things this much. You need to give > an argument why this is the right thing to do instead. > > Stepping back a bit, it seems like the core thing the FK selectivity code > was meant to do was to prevent us from underestimating selectivity of > multiple-clause join conditions through a false assumption of clause > independence. The problem with the use_smallest_selectivity code is that > it's overestimating selectivity, but that doesn't mean that we want to go > all the way back to the old way of doing things. I wonder if we could get > any traction in these dubious cases by computing the product, instead of > minimum, of the clause selectivities (thus matching the old estimate) and > then taking the greater of that and the FK-based selectivity. My concern with going back to selectivity multiplication was exactly because I didn't want to go back to the original undesired behaviour of underestimation when conditions are co-related. I'm unsure why taking the greater is any better than just using the foreign key selectivity. It seems senseless to use clause based selectivities at all when we've proved the foreign key exists -- there will be no unmatched inner tuples. The reason I dropped the non-singleton join for SEMI and ANTI is that I couldn't see how we could make any improvements over the original join estimation code for this case. Perhaps I'm assuming too much about how get_foreign_key_join_selectivity() is used by doing this? I'm assuming that leaving the clause list intact and referring the whole case to calc_joinrel_size_estimate() is okay to do. The reason I added the nullfrac code in 0002 was because I was concerned with regressing OUTER join cases where the nulfrac works due to using the clause_selectivity(). Although this case regressed between 9.5 and 9.6 for INNER joins with a non-zero nullfrac on the join condition. In short; if we replace the use_smallest_selectivity code with fkselec *= clauselist_selectivity(root, removedlist, 0, jointype, sjinfo); then I'm worried about regressing back to the old underestimations. The extended stats code in PG10's version of clauselist_selectivity() will ignore these join conditions, so nothing can be done to assist the underestmations by adding extended stats yet. I also just noticed that I don't think I've got ANTI join cases correct in the patch I sent. I'll look at that now. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: