Home > mailing lists

Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Collect frequency statistics for arrays
Date	March 3, 2012 21:38:57
Msg-id	17172.1330825115@sss.pgh.pa.us Whole thread Raw
In response to	Re: Collect frequency statistics for arrays (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: Collect frequency statistics for arrays Re: Collect frequency statistics for arrays
List	pgsql-hackers

Tree view

Alexander Korotkov <aekorotkov@gmail.com> writes:
> [ array statistics patch ]

I've committed this after a fair amount of editorialization.  There are
still some loose ends to deal with, but I felt it was ready to go into
the tree for wider testing.

The main thing I changed that wasn't in the nature of cleanup/bugfixing
was that I revised the effort-limiting logic in
mcelem_array_contained_selec.  The submitted code basically just punted
if the estimated work was too much, but as was already noted in
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01349.php
that can result in really bad estimates.  What I did instead is
something closer to Robert's original suggestion: trim the number of
element values taken into consideration from the array constant to a
value that fits within the desired effort limit.  If we consider just
the N most common values from the array constant, we still get a pretty
good estimate (since the trimmed N will still be close to 100 for the
values we're talking about).

I redid the tests in the above-mentioned message and see no cases where
the estimate is off by more than a factor of 2, and very few where it's
off by more than 20%, so this seems to work pretty well now.

The remaining loose ends IMO are:

1. I'm still unhappy about the loop that fills the count histogram,
as I noted earlier today.  It at least needs a decent comment and some
overflow protection, and I'm not entirely convinced that it doesn't have
more bugs than the overflow issue.

2. The tests in the above-mentioned message show that in most cases
where mcelem_array_contained_selec falls through to the "rough
estimate", the resulting rowcount estimate is just 1, ie we are coming
out with very small selectivities.  Although that path will now only be
taken when there are no stats, it seems like we'd be better off to
return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
must be something wrong with the "rough estimate" logic.  Could you
recheck that?

3. As I mentioned yesterday, I think it'd be a good idea to make some
provisions to reduce the width of pg_statistic rows for array columns
by not storing the scalar-style and/or array-style stats, if the DBA
knows that they're not going to be useful for a particular column.
I have not done anything about that.
        regards, tom lane

pgsql-hackers by date:

From: Jeff Janes
Date: 03 March 2012, 21:20:25
Subject: Re: archive_keepalive_command

From: Tom Lane
Date: 03 March 2012, 22:00:49
Subject: Re: poll: CHECK TRIGGER?

Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

Previous

Next