Cardinality estimate of the inner relation - Mailing list pgsql-performance
From | Frédéric Yhuel |
---|---|
Subject | Cardinality estimate of the inner relation |
Date | |
Msg-id | b860c71a-7cab-4d88-ad87-8c1f2eea9ae8@dalibo.com Whole thread Raw |
Responses |
Re: Cardinality estimate of the inner relation
|
List | pgsql-performance |
My colleague Christophe Courtois and I have been trying to fix a bad plan for one of Dalibo's clients. It is a (probably well-known) problem with skewed data and a parameterized Nested Loop with an underestimation of the cardinality of the inner relation. Here is a test case (the script to create and populate the two tables is at the end): EXPLAIN (ANALYZE, SETTINGS) SELECT name, sum(quantity) FROM products p JOIN orders o ON (p.id = o.product_id) WHERE p.name = 'babar' /* 28% of orders */ GROUP BY name; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- GroupAggregate (cost=5.50..1022.27 rows=1 width=40) (actual time=66.213..66.215 rows=1 loops=1) -> Nested Loop (cost=5.50..1019.29 rows=1189 width=36) (actual time=0.083..56.991 rows=320000 loops=1) -> Bitmap Heap Scan on products p (cost=5.08..250.17 rows=85 width=40) (actual time=0.058..0.153 rows=80 loops=1) Recheck Cond: (name = 'babar'::text) Heap Blocks: exact=80 -> Bitmap Index Scan on products_name_idx (cost=0.00..5.06 rows=85 width=0) (actual time=0.030..0.031 rows=80 loops=1) Index Cond: (name = 'babar'::text) -> Index Scan using orders_product_id_idx on orders o (cost=0.43..8.79 rows=26 width=12) (actual time=0.006..0.440 rows=4000 loops=80) Index Cond: (product_id = p.id) Planning Time: 0.560 ms Execution Time: 66.268 ms The cardinaly estimate on 'orders' is wrong (26 rows against 4000), hence the Nested Loop. Any other value for product.name gives less rows and a good plan (i.e. NL makes more sense). If I'm not wrong, the row count estimate for the inner table is computed as follow: SELECT n_distinct AS product_id_n_distinct FROM pg_stats WHERE tablename = 'orders' AND attname = 'product_id' \gset SELECT reltuples AS orders_tuples FROM pg_class WHERE relname = 'orders' \gset SELECT :orders_tuples / :product_id_n_distinct AS inner_estimation; inner_estimation --------------------- 26.4049450290190157 I have two questions: 1) Is there a known workaround for this problem (besides unatural partitioning of the inner table)? 2) Would it be possible to add cross-table statistics to fix this? Has this been discussed? Here is the script to create and populate the two tables: SELECT 80000 AS nbp \gset DROP TABLE IF EXISTS orders; DROP TABLE IF EXISTS products; CREATE UNLOGGED TABLE products( id bigint PRIMARY KEY, name TEXT, price INT ); CREATE UNLOGGED TABLE orders( id bigint generated always as identity, product_id BIGINT, -- REFERENCES products(id), quantity INT ); INSERT INTO products (id, name, price) SELECT i, CASE WHEN i % 1000 = 0 THEN 'babar' ELSE md5(i::text) END, random() * 1000 FROM generate_series(1, :nbp) AS T(i); INSERT INTO orders (product_id, quantity) SELECT product_id, quantity FROM ( SELECT id AS product_id, random() * 10 AS quantity, name FROM products, LATERAL (SELECT * FROM generate_series (1, CASE WHEN name = 'babar' THEN 4000 ELSE 10 END)) T) U; CREATE INDEX ON orders (product_id); VACUUM ANALYZE products, orders; ALTER TABLE orders ADD FOREIGN KEY (product_id) REFERENCES products(id); CREATE INDEX ON products (name);
pgsql-performance by date: