Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Msg-id: 69ce226b-cfbb-2712-bfd7-ec62cf0e0b2a@enterprisedb.com
In response to: Re: Use extended statistics to estimate (Var op Var) clauses (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses:
  Re: Use extended statistics to estimate (Var op Var) clauses
  Re: Use extended statistics to estimate (Var op Var) clauses
List: pgsql-hackers
Hi Mark,

This thread inspired me to do something fairly similar - a generator that produces queries of varying complexity and executes them on a table with and without extended statistics. I'd been thinking about this before, but this finally pushed me to do it, and some of the results are fairly interesting ...

I've pushed everything (generator and results) to this github repo:

https://github.com/tvondra/stats-test

with a summary of all results here:

https://github.com/tvondra/stats-test/blob/master/results.md

Some basic facts about generator.py (see query_generator):

* It uses a fixed seed to make it deterministic.
* A small fraction (5%) of the generated queries is sampled and executed.
* Thanks to the fixed seed we generate/sample the same set of queries for different runs, which allows us to compare runs easily.
* The queries use 2 - 5 clauses, either (Var op Const) or (Var op Var).
* The operators are the usual equality/inequality ones.
* The clauses are combined using AND/OR (also randomly picked).
* A random set of parens is added to vary the operator precedence (otherwise it would be driven entirely by AND/OR).
* There are two datasets - a random one and a correlated one, with a different number of distinct values in each column (10, 100, 1000, 10000).
* The statistics target is set to 10, 100, 1000 or 10000.

It's a bit hacky, with various bits hard-coded at the moment, but I think it could be extended to do other stuff fairly easily.

Anyway, the repository contains results for three cases:

1) master
2) patched: master with the (Var op Var) patch
3) fixed: patched, with a fix for "simple" clauses (a crude patch)

And for each case we have three row counts:

* actual (from EXPLAIN ANALYZE)
* estimate without extended stats
* estimate with extended stats

We can then calculate the "estimation error" as (estimate / actual), both with and without extended statistics. Results for two cases can be plotted as a scatter plot, with the two estimation errors as the (x, y) values.
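The per-query error ratios behind the scatter plots can be sketched roughly like this (this is not code from the repo - the function names and the (estimated, actual) input layout are my assumptions, just to illustrate the calculation):

```python
# Sketch of the estimation-error calculation described above. Assumes each
# run yields a list of (estimated_rows, actual_rows) pairs, indexed by the
# same deterministic sample of queries; the repo's actual scripts may differ.

def estimation_error(estimated_rows, actual_rows):
    """Ratio of estimated to actual row count; > 1 means over-estimate."""
    # Guard against division by zero: EXPLAIN ANALYZE may report 0 rows.
    return estimated_rows / max(actual_rows, 1)

def scatter_points(run_a, run_b):
    """Pair up errors from two runs (e.g. master vs. patched) as (x, y).

    Thanks to the fixed seed, index i refers to the same query in both runs.
    """
    return [
        (estimation_error(ea, aa), estimation_error(eb, ab))
        for (ea, aa), (eb, ab) in zip(run_a, run_b)
    ]

# A query 100x over-estimated on master but only 10x with the patch shows
# up as the point (100.0, 10.0); an unchanged accurate query as (1.0, 1.0).
master  = [(100_000, 1_000), (500, 500)]
patched = [(10_000, 1_000), (500, 500)]
print(scatter_points(master, patched))  # [(100.0, 10.0), (1.0, 1.0)]
```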
The idea is that this shows how a patch affects the estimates - a point (100, 10) means a query was 100x over-estimated before, and with the patch it's just 10x, and similarly for the other points. This is what the charts at

https://github.com/tvondra/stats-test/blob/master/results.md

do, for each combination of parameters (dataset, statistics target and number of distinct values). There's one chart without extended stats and one with extended stats.

An "ideal" chart would look like a single point (1, 1), which means "accurate estimates both without and with the patch", or (?, 1), which means "poor estimates before, accurate estimates now". The diagonal means "no change".

In principle, we expect the charts to look like this:

* random: diagonal charts, because no extended stats should be built, hence no impact on the estimates is expected
* correlated: errors getting closer to 1.0, which looks like a horizontal line in the chart

Consider for example this:

https://github.com/tvondra/stats-test/raw/master/correlated-1000-10.png

which clearly shows that the first patch behaves almost exactly like master, while with the fix the estimates improve significantly (and are almost perfect), at least with extended statistics. Without the stats there's a bunch of queries that suddenly go from "perfect" to much worse (which looks like a vertical line in the left chart).

But there are other "strange" cases with "interesting patterns", for example:

* https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-100-100.png
* https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-1000-100.png
* https://raw.githubusercontent.com/tvondra/stats-test/master/random-10000-10.png

This suggests the patches are a significant improvement for some queries (either getting better than master, or even making the estimates pretty accurate). But it's probably worth looking into the queries that got worse, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company