From: Teodor Sigaev
Subject: Re: cost_sort() improvements
Msg-id: 72f7e61e-95ad-ee6a-ba62-a500dd089f1a@sigaev.ru
In response to: Re: cost_sort() improvements (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: cost_sort() improvements
List: pgsql-hackers

> One more thought about estimating the worst case - I wonder if simply
> multiplying the per-tuple cost by 1.5 is the right approach. It does not
> seem particularly principled, and it's trivially simple to construct
> counter-examples defeating it (imagine columns with 99% of the rows
> having the same value, and then many values in exactly one row - that
> will defeat the ndistinct-based group size estimates).

Agreed. But right now that is the best we have, and the constant 1.5 could
easily be changed to 1 or 10 - it is an arbitrary, intuitively chosen value.
As I mentioned in the v7 patch description, the new estimation function
gives estimates roughly 10% larger, so this constant should not be very
large, because we could easily end up overestimating.

> The reason why constructing such counter-examples is so simple is that
> this relies entirely on the ndistinct stats, ignoring the other stats we
> already have. I think this might leverage the per-column MCV lists (and
> eventually multi-column MCV lists, if that ever gets committed).
>
> We're estimating the number of tuples in a group (after fixing values in
> the preceding columns), because that's what determines the number of
> comparisons we need to make at a given stage. The patch does this by
> estimating the average group size, by estimating ndistinct of the
> preceding columns G(i-1), and computing ntuples/G(i-1).
>
> But consider that we also have MCV lists for each column - if there is a
> very common value, it should be there. So let's say M(i) is the frequency
> of the most common value in the i-th column, determined either from the
> MCV list or as 1/ndistinct for that column. Then if m(i) is the minimum
> of M(i) over the first i columns, the largest possible group of values
> when comparing values in the i-th column is
>
> N * m(i-1)
>
> This seems like a better way to estimate the worst case. I don't know if
> this should be used instead of N/G(i), or if those two estimates should
> be combined in some way.

Agreed, we definitely need an estimate of the largest group size to use in
cost_sort(). But I don't feel up to doing that myself - could you please
attack the group-size computation problem? Thank you.
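For illustration, a minimal standalone sketch of the two estimates under
discussion - the ndistinct-based average group size and the proposed
MCV-based worst case. All names here (ColumnStats, avg_group_size,
worst_case_group_size) are invented for this sketch and are not the actual
patch code; it just replays the counter-example above, where 99% of the
rows share a single value:

/*
 * Hypothetical sketch, not the patch code: struct and function names
 * are illustrative only.
 */
#include <stdio.h>

/* Per-column statistics assumed to be available from pg_statistic */
typedef struct ColumnStats
{
    double      ndistinct;  /* estimated number of distinct values */
    double      mcv_freq;   /* frequency of the most common value,
                             * or 0 if no MCV list is available */
} ColumnStats;

/*
 * Average group size after fixing the preceding columns, as the patch
 * estimates it: ntuples / G(i-1).  The arbitrary 1.5 multiplier discussed
 * above is then applied to the per-tuple comparison cost on top of this.
 */
static double
avg_group_size(double ntuples, double group_ndistinct)
{
    return ntuples / group_ndistinct;
}

/*
 * Worst-case group size per the proposal above: M(j) is the frequency
 * of the most common value in column j (from the MCV list, or
 * 1/ndistinct as a fallback), m(i-1) is the minimum of M(j) over the
 * first i-1 columns, and the largest possible group when comparing
 * values in column i is N * m(i-1).
 */
static double
worst_case_group_size(double ntuples, const ColumnStats *cols, int i)
{
    double      m = 1.0;    /* m(0): all tuples form one group */

    for (int j = 0; j < i - 1; j++)
    {
        double      M;

        M = (cols[j].mcv_freq > 0.0) ? cols[j].mcv_freq
                                     : 1.0 / cols[j].ndistinct;
        if (M < m)
            m = M;
    }

    return ntuples * m;
}

int
main(void)
{
    /*
     * The counter-example above: 1M rows, first column has 99% of rows
     * in one value plus 10000 values with exactly one row each.
     */
    ColumnStats cols[] = {
        {.ndistinct = 10001, .mcv_freq = 0.99},
        {.ndistinct = 100, .mcv_freq = 0.0},
    };
    double      ntuples = 1000000.0;

    printf("avg group size for column 2:        %.0f\n",
           avg_group_size(ntuples, cols[0].ndistinct));     /* ~100 */
    printf("worst-case group size for column 2: %.0f\n",
           worst_case_group_size(ntuples, cols, 2));        /* 990000 */

    return 0;
}

On this distribution the ndistinct-based average suggests ~100 tuples per
group, while the largest group actually holds 990000 tuples - which is
why combining the two estimates, or using the MCV-based one as an upper
bound, looks attractive.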
> I think estimating the largest group we need to sort should be helpful
> for the incremental sort patch, so I'm adding Alexander to this thread.

Agreed, I think so too. The suggested estimation algorithm should be
modified to support leading preordered keys, and in that form it could be
used directly in the GROUP BY and incremental sort patches.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/