Cardinality estimate of the inner relation - Mailing list pgsql-performance

From Frédéric Yhuel
Subject Cardinality estimate of the inner relation
Date
Msg-id b860c71a-7cab-4d88-ad87-8c1f2eea9ae8@dalibo.com
Whole thread Raw
Responses Re: Cardinality estimate of the inner relation
List pgsql-performance
My colleague Christophe Courtois and I have been trying to fix a bad 
plan for one of Dalibo's clients. It is a (probably well-known) problem 
with skewed data and a parameterized Nested Loop with an underestimation 
of the cardinality of the inner relation.

Here is a test case (the script to create and populate the two tables is 
at the end):

EXPLAIN (ANALYZE, SETTINGS)
   SELECT name, sum(quantity)
   FROM products p
   JOIN orders o ON (p.id = o.product_id)
   WHERE p.name = 'babar'        /*    28% of orders    */
   GROUP BY name;
  
QUERY PLAN 


--------------------------------------------------------------------------------------------------------------------------------------------------
  GroupAggregate  (cost=5.50..1022.27 rows=1 width=40) (actual 
time=66.213..66.215 rows=1 loops=1)
    ->  Nested Loop  (cost=5.50..1019.29 rows=1189 width=36) (actual 
time=0.083..56.991 rows=320000 loops=1)
          ->  Bitmap Heap Scan on products p  (cost=5.08..250.17 rows=85 
width=40) (actual time=0.058..0.153 rows=80 loops=1)
                Recheck Cond: (name = 'babar'::text)
                Heap Blocks: exact=80
                ->  Bitmap Index Scan on products_name_idx 
(cost=0.00..5.06 rows=85 width=0) (actual time=0.030..0.031 rows=80 loops=1)
                      Index Cond: (name = 'babar'::text)
          ->  Index Scan using orders_product_id_idx on orders o 
(cost=0.43..8.79 rows=26 width=12) (actual time=0.006..0.440 rows=4000 
loops=80)
                Index Cond: (product_id = p.id)
  Planning Time: 0.560 ms
  Execution Time: 66.268 ms

The cardinaly estimate on 'orders' is wrong (26 rows against 4000), 
hence the Nested Loop.
Any other value for product.name gives less rows and a good plan (i.e. 
NL makes more sense).

If I'm not wrong, the row count estimate for the inner table is computed 
as follow:

SELECT n_distinct AS product_id_n_distinct FROM pg_stats
   WHERE tablename = 'orders' AND attname = 'product_id' \gset

SELECT reltuples AS orders_tuples FROM pg_class
   WHERE relname = 'orders' \gset

SELECT :orders_tuples / :product_id_n_distinct AS inner_estimation;
   inner_estimation
---------------------
  26.4049450290190157

I have two questions:

1) Is there a known workaround for this problem (besides unatural 
partitioning of the inner table)?

2) Would it be possible to add cross-table statistics to fix this? Has 
this been discussed?


Here is the script to create and populate the two tables:

  SELECT 80000 AS nbp \gset

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;

CREATE UNLOGGED TABLE products(
         id bigint PRIMARY KEY,
         name TEXT,
         price INT
);

CREATE UNLOGGED TABLE orders(
         id bigint generated always as identity,
         product_id BIGINT, -- REFERENCES products(id),
         quantity INT
);

INSERT INTO products (id, name, price)
   SELECT
     i,
     CASE WHEN i % 1000 = 0 THEN 'babar' ELSE md5(i::text) END,
     random() * 1000
   FROM generate_series(1, :nbp) AS T(i);

INSERT INTO orders (product_id, quantity)
   SELECT product_id, quantity FROM (
     SELECT
       id AS product_id,
       random() * 10 AS quantity,
       name
     FROM products,
     LATERAL (SELECT * FROM generate_series (1, CASE WHEN name = 'babar' 
THEN 4000 ELSE 10 END)) T) U;

CREATE INDEX ON orders (product_id);
VACUUM ANALYZE products, orders;

ALTER TABLE orders ADD FOREIGN KEY (product_id) REFERENCES products(id);

CREATE INDEX ON products (name);



pgsql-performance by date:

Previous
From: Andrei Lepikhov
Date:
Subject: Re: Performance of Query 60 on TPC-DS Benchmark
Next
From: Tomas Vondra
Date:
Subject: Re: could not send data to client: Connection reset by peer