When reviewing some recent patches, I decided the statistics gathered for arrays had some pre-existing shortcomings.
The main one is that when the arrays contain rare elements there is no histogram to fall back upon when the MCE array is empty, the way there is for scalar stats. So it has to punt completely and resort to saying that it is 0.5% selectivity without recourse to any data at all.
The rationale for applying the threshold before things are eligible for inclusion in the MCE array seems to be that this puts some theoretical bound on the amount of error we are likely to have in that element. But I think it is better to exceed that theoretical bound than it is to have no data at all.
The attached patch forces there to be at least one element in MCE, keeping the one element with the highest predicted frequency if the MCE would otherwise be empty. Then any other element queried for is assumed to be no more common than this most common element.
I'd also briefly considered just having the part of the code that pulls the stats out of pg_stats interpret a MCE array as meaning that nothing is more frequent than the threshold, but that would mean that that part of the code needs to know about how the threshold is chosen, which just seems wrong. And it would need to know the difference between NULL MCE because no stats were gathered, versus because stats were gathered but nothing met the threshold.
I'm only marginally interested in this area, but since I did the investigation into it already I wanted to present it here. I'd be quite happy if someone else wanted to take over the patch (or the concept behind it).
Here is an example, where the estimate drops from 500 rows to 3 rows where the correct number is zero:
create table foobar as
select (
select
array_agg(floor((random()*1000000000+g.y/1000000+f.x/10000000))::int)
from generate_series(1,10) g(y)
) foo
from generate_series(1,100000) f(x);
Unpatched:
analyze;
explain (analyze) select * from foobar where foo @> '{567}';
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..2387.00 rows=500 width=61) (actual time=76.448..76.448 rows=0 loops=1)
Filter: (foo @> '{567}'::integer[])
Rows Removed by Filter: 100000
Planning time: 0.211 ms
Execution time: 76.492 ms
With patch:
analyze;
explain (analyze) select * from foobar where foo @> '{567}';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..2387.00 rows=3 width=61) (actual time=78.604..78.604 rows=0 loops=1)
Filter: (foo @> '{567}'::integer[])
Rows Removed by Filter: 100000
Planning time: 0.199 ms
Execution time: 78.665 ms
Cheers,
Jeff