New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-bugs
From | Bradley Baetz |
---|---|
Subject | New hashed IN code ignores distinctiveness of subquery |
Date | |
Msg-id | 20030126122612.GA3820@mango.home Whole thread Raw |
Responses |
Re: New hashed IN code ignores distinctiveness of subquery
|
List | pgsql-bugs |
I've been trying out the new hased subselect code from CVS. It appears that the planner isn't taking the distinctiveness of the values from the subselect into account: bbaetz=# explain analyze select count(*) FROM bugs where product_id IN (SELECT product_id FROM bugs); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=3485675.30..3485675.30 rows=1 width=8) (actual time=1430065.54..1430065.55 rows=1 loops=1) -> Merge Join (cost=0.00..3485661.38 rows=5570 width=8) (actual time=0.15..1429696.69 rows=50000 loops=1) Merge Cond: ("outer".product_id = "inner".product_id) -> Index Scan using bugs_product_id_idx on bugs (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 rows=50000 loops=1) -> Index Scan using bugs_product_id_idx on bugs (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44 rows=277884160 loops=1) Total runtime: 1430102.08 msec (6 rows) bbaetz=# explain analyze select count(*) FROM bugs where product_id IN (SELECT distinct product_id FROM bugs); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=3033.30..3033.30 rows=1 width=8) (actual time=505.17..505.17 rows=1 loops=1) -> Hash Join (cost=1959.54..3032.67 rows=251 width=8) (actual time=282.19..456.66 rows=50000 loops=1) Hash Cond: ("outer".product_id = "inner".product_id) -> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4) (actual time=0.01..68.94 rows=50000 loops=1) -> Hash (cost=1959.52..1959.52 rows=9 width=4) (actual time=282.14..282.14 rows=0 loops=1) -> Subquery Scan "IN_subquery" (cost=0.00..1959.52 rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1) -> Unique (cost=0.00..1959.52 rows=9 width=4) (actual time=0.13..282.03 rows=9 loops=1) -> Index Scan using bugs_product_id_idx on bugs (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46 rows=50000 loops=1) Total runtime: 505.30 msec (9 rows) bbaetz=# set enable_mergejoin=false; SET bbaetz=# explain analyze select count(*) FROM bugs where product_id IN (SELECT product_id FROM bugs); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=4281600.45..4281600.45 rows=1 width=8) (actual time=486.80..486.80 rows=1 loops=1) -> Hash Join (cost=1091.00..4281586.50 rows=5580 width=8) (actual time=146.58..425.92 rows=50000 loops=1) Hash Cond: ("outer".product_id = "inner".product_id) -> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4) (actual time=0.04..75.73 rows=50000 loops=1) -> Hash (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1) -> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4) (actual time=0.06..64.98 rows=50000 loops=1) Total runtime: 486.91 msec (7 rows) bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of the products are actually used in bugs.product_id, due to an off-by-one error in the script I used to generate the table) I still haven't tuned the various optimiser settings, which may explain part of the enable_mergejoin=false result, although the DISTINCT probably takes some time too (Side note - is it possible to notice that DISTINCT on a column with a unique index doesn't need a Unique pass?). However, 23 minutes vs 0.5 seconds isn't due to that. This is a fairly useless and silly query though - I was just playing arround. The tables have been analyzed, and there are separate unique indexes on products.id, bugs.bug_id and bugs.product_id. FWIW: bbaetz=# select n_distinct, correlation FROM pg_stats WHERE tablename='products' AND attname='product_id'; n_distinct | correlation ------------+------------- 9 | 0.0919474 (1 row) Thanks, Bradley
pgsql-bugs by date: