Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers
From: John Naylor
Subject: Re: MCV lists for highly skewed distributions
Msg-id: CAJVSVGU5s60-pK7MR9fO_ux+KPortpQRCdFBpFw4ExRm2B+pBA@mail.gmail.com
In response to: Re: MCV lists for highly skewed distributions (John Naylor <jcnaylor@gmail.com>)
Responses: Re: MCV lists for highly skewed distributions
List: pgsql-hackers
I wrote:

> Looks good. I'll run some tests of my own this week, trying to find
> corner cases different from yours.

Over the weekend I tried out a test to measure:
-The size of the MCV list.
-The ratio between actual and estimated cardinality of the least common value in the MCV list.
-The ratio between actual and estimated cardinality of the most common value not in the MCV list.

The script, results, and some starter queries are attached. I ran a bunch more queries to drill down in certain areas, but they're all based on the ones here. I ran against master, the RSE patch, and three variations of the V2 patch (with stddev cutoffs of 2.0, 2.5, and 3.0). For each file, I loaded it into each DB, ran ANALYZE 10 times, and took the average of each of the three metrics above. I only tested with the default stats target. For this test, I used the same distributions as Dean's original script, but I whacked the script around so that results could be inserted into a table.
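A minimal sketch of this kind of measurement, assuming a freshly-ANALYZEd test table t with a single column v (both names hypothetical stand-ins for the attached script's tables):

-- Size of the MCV list and the estimated frequency of its least
-- common entry (the MCV arrays are ordered most common first):
SELECT array_length(most_common_freqs, 1) AS num_mcvs,
       most_common_freqs[array_length(most_common_freqs, 1)] AS est_min_mcv_freq
FROM pg_stats
WHERE tablename = 't' AND attname = 'v';

-- Actual per-value frequencies, for computing the actual/estimated
-- ratios for the least common MCV and the most common non-MCV:
SELECT v, count(*)::float8 / sum(count(*)) OVER () AS actual_freq
FROM t
GROUP BY v
ORDER BY actual_freq DESC;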
--
Summary for non-uniform distributions:

Note: There's a lot of variation in the estimation of the most common non-MCV. About 1% of all ANALYZE runs apparently failed to sample one of the good candidates for the MCV list, resulting in huge underestimates. It didn't matter which patch was applied. For future tests, I'll do more runs and measure a truncated mean. Looking at the number of MCVs is much more stable, but unlike the uniform distribution case, we don't have an intuitive feel for what that number should be.

That said:
-The V2 patch is somewhat better than RSE, and much better than master, at estimating the most common value not in the MCV list.
-Bumping the sample stddev threshold to 3.0 behaves similarly to the RSE patch, maybe slightly better.

--
Summary for uniform distributions:

Note 1: Using the average is not really adequate for testing under- and overestimation of values in my test setup, since I set the estimation ratio to zero whenever the MCV list was either empty or completely full. In future runs, I'll drill down a bit more precisely. That said, there didn't seem to be much variation between the patches in either of the ratios I was testing. Looking at the number of MCVs is more useful anyway, since we know we want either none or all of the values in the list.

Note 2: All patches fully populated the list when all the values fit in the MCV list. For the rest of the cases:
-The RSE patch is not much better than master at excluding MCVs.
-The V2 patch is much better than either, but still not ideal (up to 30 MCVs).
-The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3).
-With a cutoff of 3.0, all the lists are empty.

--
Here's one interesting case: a normal distribution, but highly dispersed (N=500000, sd=1000), so it resembles a uniform one. The number of MCVs jumps around a lot between patches, even though the estimates don't vary much:

Master: 100
RSE: 1
V2: 100
V2 with 2.5 cutoff: 100
V2 with 3.0 cutoff: 32

--
Thoughts and future testing:

I think the V2 patch strikes a good balance. I'm partial to the V2 patch with the 2.5 cutoff in sample standard deviation, since it's much more aggressive about excluding values for uniform distributions. It's almost as good as plain V2 at including values in non-uniform distributions, but possibly enough worse that it's not the best choice.

Time is running out, but some things I'd like to look at:
-As mentioned above, better noise reduction in my data analysis.
-The continuity-corrected Wald interval Dean mentioned [1] (sketched below).
-I wonder if we can actually untie the max size of the MCV list from the stats target, and use a hard-coded maximum like 1000. That might allow us to bump the max stats target without hurting planning time. Perhaps next cycle.
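For reference, the textbook continuity-corrected Wald lower bound for a proportion p_hat = x/n is p_hat - (z * sqrt(p_hat * (1 - p_hat) / n) + 1/(2n)); exactly how Dean proposed wiring it into the MCV cutoff is in [1], not here. A minimal sketch with hypothetical numbers, a value seen 50 times in a 30000-row sample and z = 1.96 for a 95% interval:

-- Continuity-corrected Wald lower bound on the value's true
-- frequency, clamped at zero:
SELECT greatest(0.0,
       p - (1.96 * sqrt(p * (1.0 - p) / n) + 1.0 / (2.0 * n))) AS wald_lower
FROM (SELECT 50.0 / 30000.0 AS p, 30000.0 AS n) AS s;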
-John Naylor

[1] https://www.postgresql.org/message-id/CAEZATCVHEEg%2BCrP%2B-0JUsVeNPu5rs_S23oJVeH4VF%3DfgDwhfLQ%40mail.gmail.com