Re: two dimensional statistics in Postgres - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: two dimensional statistics in Postgres |
Date | |
Msg-id | 07b431d480db44208daa93cbaeb7a577.squirrel@2.emaily.eu Whole thread Raw |
In response to | two dimensional statistics in Postgres (Katharina Büchse<katharina.buechse@uni-jena.de>) |
Responses |
Re: two dimensional statistics in Postgres
|
List | pgsql-hackers |
Hi, Dne 6 Listopad 2014, 11:15, Katharina Büchse napsal(a): > Hi, > > I'm a phd-student at the university of Jena, Thüringen, Germany, in the > field of data bases, more accurate query optimization. > I want to implement a system in PostgreSQL that detects column > correlations and creates statistical data about correlated columns for > the optimizer. Therefore I need to store two dimensional statistics > (especially two dimensional histograms) in PostgreSQL. Cool! > I had a look at the description of "WIP: multivariate statistics / proof > of concept", which looks really promising, I guess these statistics are > based on scans of the data? For my system I need both -- statistical Yes, it's based on a sample of the data. > data based on table scans (actually, samples are enough) and those based > on query feedback. Query feedback (tuple counts and, speaking a little > inaccurately, the where-part of the query itself) needs to be extracted > and there needs to be a decision for the optimizer, when to take > multivariate statistics and when to use the one dimensional ones. Oracle > in this case just disables one dimensional histograms if there is > already a multidimensional histogram, but this is not always useful, > especially in the case of a feedback based histogram (which might not > cover the whole data space). I want to use both kinds of histograms What do you mean by not covering the whole data space? I assume that when building feedback-based histogram, parts of the data will be filtered out because of WHERE clauses etc. Is that what you mean? I don't see how this could happen for regular histograms, though. > because correlations might occur only in parts of the data. In this case > a histogram based on a sample of the whole table might not get the point > and wouldn't help for the part of the data the user seems to be > interested in. Yeah, there may be dependencies that are difficult to spot in the whole dataset, but emerge once you filter to a specific subset. Now, how would that work in practice? Initially the query needs to be planned using regular stats (because there's no feedback yet), and then - when we decide the estimates are way off - may be re-planned using the feedback. The feedback is inherently query-specific, so I'm not sure if it's possible to reuse it for multiple queries (might be possible for "sufficiently similar" ones). Would this be done automatically for all queries / all conditions, or only when specifically enabled (on a table, columns, ...)? > There are special data structures for storing multidimensional > histograms based on feedback and I already tried to implement one of > these in C. In the case of two dimensions they are of course not "for > free" (one dimensional would be much cheaper), but based on the > principle of maximum entropy they deliver really good results. I decided > for only two dimensions because in this case we have the best proportion > of cost and benefit when searching for correlation (here I'm relying on I think hardcoding the two-dimensions limit is wrong. I understand higher number of dimensions means more expensive operation, but if the user can influence it, I believe it's OK. Also, is there any particular reason why not to support other kinds of stats (say, MCV lists)? In the end it's just a different way to approximate the distribution, and it may be way cheaper than histograms. > tests that were made in DB2 within a project called CORDS which detects > correlations even between different tables). Is this somehow related to LEO? I'm not familiar with the details, but from the description it might be related. regards Tomas
pgsql-hackers by date: