Thread: query plan using partial index expects a much larger number of rows than is possible
query plan using partial index expects a much larger number of rows than is possible
From
"Olivier Poquet"
Date:
Hello,

I have a large table (about 13 million rows) full of customer order information. Since most of that information is for orders that have already been fulfilled, I have a partial index to help quickly zero in on rows that have not been fulfilled. This works well, but I noticed today when joining with another large table using its primary key that even though the planner was using my partial index, it decided to do a merge join to the second large table instead of the nested loop I would have expected.

Looking at it in more detail, I found that the planner is assuming that I'll get millions of rows back even when I do a simple query that does an index scan on my partial index:

=> \d orderitems_committed_unfulfilled
Index "public.orderitems_committed_unfulfilled"
 Column |  Type  | Key? | Definition
--------+--------+------+------------
 id     | bigint | yes  | id
btree, for table "public.orderitems", predicate (LEAST(committed, quantity) > fulfilled)

=> explain (analyze, buffers) select oi.id from orderitems oi where LEAST(oi.committed, oi.quantity) > oi.fulfilled;
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using orderitems_committed_unfulfilled on orderitems oi  (cost=0.41..31688.23 rows=2861527 width=8) (actual time=0.039..2.092 rows=2274 loops=1)
   Heap Fetches: 2493
   Buffers: shared hit=1883
 Planning Time: 0.110 ms
 Execution Time: 2.255 ms
(5 rows)

So nice and quick, but the planner thought it would get back 2861527 rows instead of the 2274 I actually have. That explains why it thought it would make sense to do a merge join with my other large table instead of the nested loop over the 2k rows. I would have expected the planner to know that there's no way it'll get back 2 million rows though, given that:

=> select relname, relpages, reltuples from pg_class where relname = 'orderitems_committed_unfulfilled';
             relname              | relpages | reltuples
----------------------------------+----------+-----------
 orderitems_committed_unfulfilled |     3051 |      2112
(1 row)

It knows there's only 2k-ish of them from the index. The 2 mil number is the same as what the planner expects if I disable using indexes and it does a seq scan, so I'm assuming it's just the guess from the column statistics and the planner is not using the size of the partial index.

I'm running:

=> select version();
                                          version
-------------------------------------------------------------------------------------------
 PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc, a 97d579287 p 0be2109a97, 64-bit
(1 row)

I'm wondering if the behavior that I'm seeing is expected in 12.4, and if so, whether it changes in a later version or whether I should file an enhancement request. Or, if it's not expected, is there something I'm doing wrong, or should I file a bug?

Thanks for your time.

--
  Olivier Poquet
  opoquet@plumdev.com
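(For reference, the thread never quotes the original CREATE INDEX statement; a reconstruction from the \d output above would look roughly like this:)

    -- Reconstruction of the partial index from the \d output; the original DDL isn't shown in the thread.
    CREATE INDEX orderitems_committed_unfulfilled
        ON public.orderitems (id)
        WHERE LEAST(committed, quantity) > fulfilled;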
Re: query plan using partial index expects a much larger number of rows than is possible
From
Tom Lane
Date:
"Olivier Poquet" <opoquet@plumdev.com> writes: > Looking at it in more detail, I found that the planner is assuming that I'll get millions of rows back even when I do asimple query that does an index scan on my partial index: We don't look at partial-index predicates when trying to estimate the selectivity of a WHERE clause. It's not clear to me whether that'd be a useful thing to do, or whether it could be shoehorned into the system easily. (One big problem is that while the index size could provide an upper bound, it's not apparent how to combine that knowledge with selectivities of unrelated conditions. Also, it's riskier to extrapolate a current rowcount estimate from stale relpages/reltuples data for an index than it is for a table, because the index is less likely to scale up linearly.) If this particular query is performance-critical, you might consider materializing the condition, that is something like create table orderitems ( ... , committed_unfulfilled bool GENERATED ALWAYS AS (LEAST(committed, quantity) > fulfilled) STORED ); and then your queries and your partial-index predicate must look like "WHERE committed_unfulfilled". Having done this, ANALYZE would gather stats on the values of that column and the WHERE clauses would be estimated accurately. regards, tom lane
Re: query plan using partial index expects a much larger number of rows than is possible
From
"Olivier Poquet"
Date:
Thanks Tom,

That makes perfect sense. I'd already gone the route of materializing the condition, but I didn't even realize that generated columns were an option (I'd done the same with triggers instead). So thanks a lot for that too!

--
  Olivier Poquet
  opoquet@plumdev.com

On Wed, Oct 28, 2020, at 7:30 PM, Tom Lane wrote:
> "Olivier Poquet" <opoquet@plumdev.com> writes:
> > Looking at it in more detail, I found that the planner is assuming that I'll get millions of rows back even when I do a simple query that does an index scan on my partial index:
>
> We don't look at partial-index predicates when trying to estimate the
> selectivity of a WHERE clause. It's not clear to me whether that'd be
> a useful thing to do, or whether it could be shoehorned into the system
> easily. (One big problem is that while the index size could provide
> an upper bound, it's not apparent how to combine that knowledge with
> selectivities of unrelated conditions. Also, it's riskier to extrapolate
> a current rowcount estimate from stale relpages/reltuples data for an
> index than it is for a table, because the index is less likely to scale
> up linearly.)
>
> If this particular query is performance-critical, you might consider
> materializing the condition, that is something like
>
> create table orderitems (
>     ... ,
>     committed_unfulfilled bool GENERATED ALWAYS AS
>         (LEAST(committed, quantity) > fulfilled) STORED
> );
>
> and then your queries and your partial-index predicate must look
> like "WHERE committed_unfulfilled". Having done this, ANALYZE
> would gather stats on the values of that column and the WHERE
> clauses would be estimated accurately.
>
> 			regards, tom lane
Re: query plan using partial index expects a much larger number of rows than is possible
From
Michael Lewis
Date:
On Wed, Oct 28, 2020 at 5:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Olivier Poquet" <opoquet@plumdev.com> writes:
> Looking at it in more detail, I found that the planner is assuming that I'll get millions of rows back even when I do a simple query that does an index scan on my partial index:
We don't look at partial-index predicates when trying to estimate the
selectivity of a WHERE clause. It's not clear to me whether that'd be
a useful thing to do, or whether it could be shoehorned into the system
easily. (One big problem is that while the index size could provide
an upper bound, it's not apparent how to combine that knowledge with
selectivities of unrelated conditions. Also, it's riskier to extrapolate
a current rowcount estimate from stale relpages/reltuples data for an
index than it is for a table, because the index is less likely to scale
up linearly.)
regards, tom lane
Aren't there custom stats created for functional indexes? Would it be feasible to create those for partial indexes as well, maybe only optionally? I assume there may be giant gaps with that notion.
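(For context on that question: ANALYZE does gather per-column statistics for expression/"functional" index columns, and they show up in pg_stats under the index's name. A minimal sketch with an illustrative index name; note that having these stats doesn't by itself change the estimate for the original query unless the planner can match the query's clause to the indexed expression:)

    -- Illustrative example, not from the thread: expression-index columns get their own statistics.
    CREATE INDEX orderitems_unfulfilled_expr
        ON orderitems ((LEAST(committed, quantity) > fulfilled));

    ANALYZE orderitems;

    -- The gathered stats are visible under the index's name rather than the table's.
    SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'orderitems_unfulfilled_expr';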