Re: Too few rows expected by Planner on partitioned tables - Mailing list pgsql-performance
| From | Julian Wolf | 
|---|---|
| Subject | Re: Too few rows expected by Planner on partitioned tables | 
| Date | |
| Msg-id | AM5PR10MB1617D1CEBE727976FC66F050F0790@AM5PR10MB1617.EURPRD10.PROD.OUTLOOK.COM Whole thread Raw | 
| In response to | Re: Too few rows expected by Planner on partitioned tables (Justin Pryzby <pryzby@telsasoft.com>) | 
| List | pgsql-performance | 
 Hello Justin,
 thank you very much for your fast response. 
 > Is there a correlation between daterange and spacial_feature_id ?
 I am not entirely sure, what you mean by that. Basically, no, they are not correlated - spatial features are places on a map, date ranges are time periods. But, as they are both part of a primary key in this particular table, they are correlated in some way as to be a part of uniquely identifying a row. 
> Are the estimates good if you query on *only* daterange? spacial_feature_id ?
Unfortunately no, they are not:
------------------------------------------------------------------------------------------------------------------------------------------------
EXPLAIN (ANALYZE , BUFFERS)
SELECT sum(visitors * n)
FROM location_statistics st
WHERE st.daterange = '[2019-03-04,2019-03-11)'::DATERANGE
QUERY PLAN
Aggregate  (cost=2.79..2.80 rows=1 width=8) (actual time=1143.393..1143.393 rows=1 loops=1)
  Buffers: shared hit=304958
  ->  Index Scan using location_statistics_y2019m03w_pkey on location_statistics_y2019m03w st  (cost=0.56..2.78 rows=1 width=8) (actual time=0.024..931.645 rows=4296639 loops=1)
        Index Cond: (daterange = '[2019-03-04,2019-03-11)'::daterange)
        Buffers: shared hit=304958
Planning Time: 0.080 ms
Execution Time: 1143.421 ms
------------------------------------------------------------------------------------------------------------------------------------------------
EXPLAIN (ANALYZE , BUFFERS)
SELECT sum(visitors * n)
FROM location_statistics_y2019m03w st
WHERE st.daterange = '[2019-03-04,2019-03-11)'::DATERANGE
QUERY PLAN
Aggregate  (cost=2.79..2.80 rows=1 width=8) (actual time=1126.819..1126.820 rows=1 loops=1)
  Buffers: shared hit=304958
  ->  Index Scan using location_statistics_y2019m03w_pkey on location_statistics_y2019m03w st  (cost=0.56..2.78 rows=1 width=8) (actual time=0.023..763.852 rows=4296639 loops=1)
        Index Cond: (daterange = '[2019-03-04,2019-03-11)'::daterange)
        Buffers: shared hit=304958
Planning Time: 0.046 ms
Execution Time: 1126.845 ms
------------------------------------------------------------------------------------------------------------------------------------------------
 Checking only on the spatial_feature is not the same query, as the table contains 4 different date ranges. Furthermore, there is no index for this operation. Because of that, I can only invoke this query on one partition, otherwise the query would take days.
EXPLAIN (ANALYZE , BUFFERS)
SELECT sum(visitors * n)
FROM location_statistics_y2019m03w st
WHERE spatial_feature_id = 12675
QUERY PLAN
Finalize Aggregate  (cost=288490.25..288490.26 rows=1 width=8) (actual time=1131.593..1131.593 rows=1 loops=1)
  Buffers: shared hit=40156 read=139887
  ->  Gather  (cost=288490.03..288490.24 rows=2 width=8) (actual time=1131.499..1148.872 rows=2 loops=1)
        Workers Planned: 2
        Workers Launched: 1
        Buffers: shared hit=40156 read=139887
        ->  Partial Aggregate  (cost=287490.03..287490.04 rows=1 width=8) (actual time=1118.578..1118.579 rows=1 loops=2)
              Buffers: shared hit=40156 read=139887
              ->  Parallel Seq Scan on location_statistics_y2019m03w st  (cost=0.00..280378.27 rows=948235 width=8) (actual time=3.544..1032.899 rows=1134146 loops=2)
                    Filter: (spatial_feature_id = 12675)
                    Rows Removed by Filter: 8498136
                    Buffers: shared hit=40156 read=139887
Planning Time: 0.218 ms
JIT:
  Functions: 12
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 0.929 ms, Inlining 0.000 ms, Optimization 0.426 ms, Emission 6.300 ms, Total 7.655 ms
Execution Time: 1191.741 ms
 The estimates seem to be good though.
 Thanks in Advance
 Julian
|  | Julian P. Wolf | Invenium Data Insights GmbH julian.wolf@invenium.io | +43 664 88 199 013 Herrengasse 28  | 8010 Graz | www.invenium.io | 
From: Justin Pryzby <pryzby@telsasoft.com>
Sent: Tuesday, July 21, 2020 7:27 PM
To: Julian Wolf <julian.wolf@invenium.io>
Cc: pgsql-performance Postgres Mailing List <pgsql-performance@lists.postgresql.org>
Subject: Re: Too few rows expected by Planner on partitioned tables 
Sent: Tuesday, July 21, 2020 7:27 PM
To: Julian Wolf <julian.wolf@invenium.io>
Cc: pgsql-performance Postgres Mailing List <pgsql-performance@lists.postgresql.org>
Subject: Re: Too few rows expected by Planner on partitioned tables
On Tue, Jul 21, 2020 at 01:09:22PM +0000, Julian Wolf wrote:
> Our problem is, that the planner always predicts one row to be returned, although only a part of the primary key is queried. This problem exceeds feasibility of performance rapidly - a query only involving a few days already takes dozens of seconds. All tables are analyzed and pg_stats looks reasonable IMHO.
> daterange daterange NOT NULL,
> spatial_feature_id INTEGER,
> Aggregate (cost=2.79..2.80 rows=1 width=8) (actual time=143.073..143.073 rows=1 loops=1)
> Buffers: shared hit=67334
> -> Index Scan using location_statistics_y2019m03w_pkey on location_statistics_y2019m03w st (cost=0.56..2.78 rows=1 width=8) (actual time=0.026..117.284 rows=516277 loops=1)
> Index Cond: ((daterange = '[2019-03-04,2019-03-11)'::daterange) AND (spatial_feature_id = 12675))
> Buffers: shared hit=67334
>
> As can be seen, the planner predicts one row to be returned, although it should be around 3% (11% of the entries are of the given ID, which are distributed over 4 weeks = date ranges) of the table. Using the partition table directly, does not change this fact.
Is there a correlation between daterange and spacial_feature_id ?
Are the estimates good if you query on *only* daterange? spacial_feature_id ?
Maybe what you need is:
https://www.postgresql.org/docs/devel/sql-createstatistics.html
CREATE STATISTICS stats (dependencies) ON daterange, spacial_feature_id FROM location_statistics;
ANALYZE location_statistics;
--
Justin
> Our problem is, that the planner always predicts one row to be returned, although only a part of the primary key is queried. This problem exceeds feasibility of performance rapidly - a query only involving a few days already takes dozens of seconds. All tables are analyzed and pg_stats looks reasonable IMHO.
> daterange daterange NOT NULL,
> spatial_feature_id INTEGER,
> Aggregate (cost=2.79..2.80 rows=1 width=8) (actual time=143.073..143.073 rows=1 loops=1)
> Buffers: shared hit=67334
> -> Index Scan using location_statistics_y2019m03w_pkey on location_statistics_y2019m03w st (cost=0.56..2.78 rows=1 width=8) (actual time=0.026..117.284 rows=516277 loops=1)
> Index Cond: ((daterange = '[2019-03-04,2019-03-11)'::daterange) AND (spatial_feature_id = 12675))
> Buffers: shared hit=67334
>
> As can be seen, the planner predicts one row to be returned, although it should be around 3% (11% of the entries are of the given ID, which are distributed over 4 weeks = date ranges) of the table. Using the partition table directly, does not change this fact.
Is there a correlation between daterange and spacial_feature_id ?
Are the estimates good if you query on *only* daterange? spacial_feature_id ?
Maybe what you need is:
https://www.postgresql.org/docs/devel/sql-createstatistics.html
CREATE STATISTICS stats (dependencies) ON daterange, spacial_feature_id FROM location_statistics;
ANALYZE location_statistics;
--
Justin
pgsql-performance by date: