Re: More stable query plans via more predictable column statistics - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: More stable query plans via more predictable column statistics
Msg-id: CA+Tgmobh+u_Km5O5P-_5o9fVWrTK-5Gb17iVvDVvcmkibmtGbg@mail.gmail.com
In response to: More stable query plans via more predictable column statistics ("Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de>)
List: pgsql-hackers
On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:
> Hi Hackers!
>
> This post summarizes a few weeks of research into the distribution of
> ANALYZE statistics on one of our bigger production databases with some
> real-world data, and it proposes a patch to rectify some of the oddities
> observed.
>
>
> Introduction
> ============
>
> We have observed that for certain data sets the distribution of samples
> between most_common_vals and histogram_bounds can be unstable: it may
> change dramatically with the next ANALYZE run, leading to radically
> different plans.
>
> I was revisiting the following performance thread and found some
> interesting details about statistics in our environment:
>
> http://www.postgresql.org/message-id/flat/CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com#CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com
>
> My initial interest was in evaluating whether the distribution of
> samples could be made more predictable and less dependent on luck,
> leading to more stable execution plans.
>
>
> Unexpected findings
> ===================
>
> What I have found is that in a significant percentage of instances, when
> a duplicate sample value is *not* put into the MCV list, it does produce
> duplicates in histogram_bounds, so it looks like the MCV cut-off happens
> too early, even though we have enough space for more values in the MCV
> list.
>
> In the extreme cases I have found completely empty MCV lists and
> histograms full of duplicates at the same time, with only about 20%
> distinct values in the histogram (as it turns out, this happens due to a
> high fraction of NULLs in the sample).

Wow, this is very interesting work. Using values_cnt rather than
samplerows to compute avgcount seems like a clear improvement. It doesn't
make any sense to raise the threshold for creating an MCV based on the
presence of additional nulls or too-wide values in the table. I bet
compute_distinct_stats needs a similar fix.

If it weren't for the plan stability considerations, I'd say we should
back-patch this all the way, but those considerations might argue for a
more restrained approach. Still, maybe we should try to sneak at least
this much into 9.5 RSN, because I have to think this is going to help
people with mostly-NULL (or mostly-really-wide) columns.

As for the rest of the fix, your code seems to remove the handling for
ndistinct < 0. That seems unlikely to be correct, although it's possible
that I am missing something. Aside from that, the rest of this seems like
a policy change, and I'm not totally sure off-hand whether it's the right
policy. Having more MCVs can increase planning time noticeably, and the
point of the existing cutoff is to prevent us from choosing MCVs that
aren't actually "C". I think this change significantly undermines those
protections. It seems to me that it might be useful to evaluate the
effects of this part of the patch separately from the samplerows ->
values_cnt change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
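To make the avgcount point concrete, here is a minimal, self-contained
sketch. It assumes, as the thread describes, that the pre-patch cutoff is
derived roughly as samplerows / ndistinct; the numbers are invented, and
the variable names only echo those in compute_scalar_stats() in
src/backend/commands/analyze.c, where the real logic (with further
adjustments not shown here) lives.

#include <stdio.h>

/*
 * Illustrative sketch only -- invented numbers, not PostgreSQL code.
 * It shows why basing the MCV cutoff on samplerows (all sampled rows,
 * NULLs included) rather than values_cnt (the non-null, not-too-wide
 * values actually counted) inflates the threshold on a mostly-NULL
 * column, as discussed above for compute_scalar_stats().
 */
int
main(void)
{
    double samplerows = 30000.0; /* rows sampled by ANALYZE */
    double values_cnt = 6000.0;  /* non-null values among them: 80% NULLs */
    double ndistinct = 1000.0;   /* distinct non-null values in the sample */

    /* pre-patch: average count per distinct value, over all sample rows */
    double avgcount_old = samplerows / ndistinct;

    /* patched: average count per distinct value, over counted values only */
    double avgcount_new = values_cnt / ndistinct;

    printf("old avgcount threshold: %.1f\n", avgcount_old); /* 30.0 */
    printf("new avgcount threshold: %.1f\n", avgcount_new); /* 6.0 */
    return 0;
}

With 80% NULLs the old cutoff comes out five times higher than the new
one, which matches the reported symptom: values that repeat often enough
among the non-null data still fail to qualify as MCVs, so their
duplicates spill into the histogram instead.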