Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Collect frequency statistics for arrays |
Date | |
Msg-id | 17172.1330825115@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Collect frequency statistics for arrays (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: Collect frequency statistics for arrays
Re: Collect frequency statistics for arrays |
List | pgsql-hackers |
Alexander Korotkov <aekorotkov@gmail.com> writes: > [ array statistics patch ] I've committed this after a fair amount of editorialization. There are still some loose ends to deal with, but I felt it was ready to go into the tree for wider testing. The main thing I changed that wasn't in the nature of cleanup/bugfixing was that I revised the effort-limiting logic in mcelem_array_contained_selec. The submitted code basically just punted if the estimated work was too much, but as was already noted in http://archives.postgresql.org/pgsql-hackers/2011-10/msg01349.php that can result in really bad estimates. What I did instead is something closer to Robert's original suggestion: trim the number of element values taken into consideration from the array constant to a value that fits within the desired effort limit. If we consider just the N most common values from the array constant, we still get a pretty good estimate (since the trimmed N will still be close to 100 for the values we're talking about). I redid the tests in the above-mentioned message and see no cases where the estimate is off by more than a factor of 2, and very few where it's off by more than 20%, so this seems to work pretty well now. The remaining loose ends IMO are: 1. I'm still unhappy about the loop that fills the count histogram, as I noted earlier today. It at least needs a decent comment and some overflow protection, and I'm not entirely convinced that it doesn't have more bugs than the overflow issue. 2. The tests in the above-mentioned message show that in most cases where mcelem_array_contained_selec falls through to the "rough estimate", the resulting rowcount estimate is just 1, ie we are coming out with very small selectivities. Although that path will now only be taken when there are no stats, it seems like we'd be better off to return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there must be something wrong with the "rough estimate" logic. Could you recheck that? 3. As I mentioned yesterday, I think it'd be a good idea to make some provisions to reduce the width of pg_statistic rows for array columns by not storing the scalar-style and/or array-style stats, if the DBA knows that they're not going to be useful for a particular column. I have not done anything about that. regards, tom lane
pgsql-hackers by date: