Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: MCV lists for highly skewed distributions
Date:
Msg-id: 8a524773-3299-fe7b-de8b-bc977de97bec@2ndquadrant.com
In response to: Re: MCV lists for highly skewed distributions (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: MCV lists for highly skewed distributions; Re: MCV lists for highly skewed distributions
List: pgsql-hackers
On 03/11/2018 10:23 AM, Dean Rasheed wrote:
> ...
>
> I'm moving this back to a status of "Needs review" partly because the
> code has changed significantly, but also because I want to do more
> testing, particularly with larger datasets.
>

Thanks for working on this. The code seems fine to me, although it might be a good idea to add comments explaining why analyze_mcv_list() starts with the full MCV list and then removes items (essentially an explanation of why Jeff's original idea would not work so well).

Actually, one question - when deciding whether to keep an item in the MCV list, analyze_mcv_list() only compares its frequency with the average frequency of the remaining (non-MCV) items. But as we remove items from the MCV list, that average keeps growing, because we keep removing the items with the highest frequencies. That means the estimates for the least common items will get higher and higher - essentially overestimates. So, couldn't/shouldn't analyze_mcv_list() take this into account too? (There is a small numeric illustration of this at the end of this message.)

I've also done a bunch of testing of join cardinality estimates, because eqjoinsel_inner() is one of the places using MCV lists too. So I've generated tables with different sizes and data distributions, and observed how the patch affects the join estimates. The scripts and results are available here:

    https://github.com/tvondra/join-estimates-tests

The tables were not particularly huge at this point (1000 to 100k rows). I've also varied the number of distinct values (100 - 10k), the statistics target (10 - 10k) and the distribution (chosen for each table independently):

1) uniform

    insert into t1 select random() * $ndistinct1, i
      from generate_series(1,$nrows1) s(i)

2) skewed

    insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)

3) skewed-inverse

    insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)

4) skewed-strong

    insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)

5) skewed-strong-inverse

    insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)

The results are a bit too large to attach here - see for example:

    https://github.com/tvondra/join-estimates-tests/blob/master/bench/summary.ods

Looking at Mean Relative Error, i.e. essentially

    MRE = AVERAGE(MAX(estimate/actual, actual/estimate))

over the 20 runs for each combination of parameters, the "average" sheet shows (MRE for patched) / (MRE for master), and the patch seems to clearly improve accuracy in this case. There are a few cases where the estimate gets slightly worse (say, by 10%) compared to current master, but overall I think that's nice. (A sketch of how this MRE aggregation might look as a query is also at the end of this message.)

I'm open to running additional tests with other distributions, table sizes etc. if needed.
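To make the overestimation concern above a bit more concrete, here is a small numeric illustration - the numbers are made up for the example, not taken from any of the test runs above:

    -- Hypothetical numbers: 1000 distinct values, a 10-item MCV list covering
    -- 50% of the rows, and a least common MCV item with frequency 0.02.
    -- Dropping that item from the list increases the average frequency
    -- assumed for the non-MCV values, i.e. the non-MCV estimates grow.
    WITH params AS (
        SELECT 1000 AS ndistinct,
               10   AS mcv_items,
               0.50 AS mcv_freq,
               0.02 AS dropped_freq
    )
    SELECT (1 - mcv_freq) / (ndistinct - mcv_items)                    AS avg_nonmcv_before,
           (1 - mcv_freq + dropped_freq) / (ndistinct - mcv_items + 1) AS avg_nonmcv_after
      FROM params;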
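Similarly, just to illustrate the MRE aggregation mentioned above, this is roughly how it might look as a query. It's only a sketch against a hypothetical "results" table (one row per run, with the estimated and actual join cardinalities); the actual scripts in the repository may organize this differently:

    -- "results" and its columns are hypothetical, for illustration only.
    SELECT nrows1, nrows2, ndistinct1, ndistinct2, stats_target, distribution,
           avg(greatest(estimated::numeric / actual,
                        actual::numeric / estimated)) AS mre
      FROM results
     GROUP BY 1, 2, 3, 4, 5, 6;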
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services