Home > mailing lists

New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-bugs

From	Bradley Baetz
Subject	New hashed IN code ignores distinctiveness of subquery
Date	January 26, 2003 07:26:16
Msg-id	20030126122612.GA3820@mango.home Whole thread Raw
Responses	Re: New hashed IN code ignores distinctiveness of subquery
List	pgsql-bugs

Tree view

I've been trying out the new hased subselect code from CVS. It appears
that the planner isn't taking the distinctiveness of the values from the
subselect into account:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

QUERY PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3485675.30..3485675.30 rows=1 width=8) (actual
time=1430065.54..1430065.55 rows=1 loops=1)
   ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual
time=0.15..1429696.69 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
rows=50000 loops=1)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
rows=277884160 loops=1)
 Total runtime: 1430102.08 msec
(6 rows)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT distinct product_id FROM bugs);

QUERY PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3033.30..3033.30 rows=1 width=8) (actual
time=505.17..505.17 rows=1 loops=1)
   ->  Hash Join  (cost=1959.54..3032.67 rows=251 width=8) (actual
time=282.19..456.66 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.01..68.94 rows=50000 loops=1)
         ->  Hash  (cost=1959.52..1959.52 rows=9 width=4) (actual
time=282.14..282.14 rows=0 loops=1)
               ->  Subquery Scan "IN_subquery"  (cost=0.00..1959.52
rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1)
                     ->  Unique  (cost=0.00..1959.52 rows=9 width=4)
(actual time=0.13..282.03 rows=9 loops=1)
                           ->  Index Scan using bugs_product_id_idx on
bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46
rows=50000 loops=1)
 Total runtime: 505.30 msec
(9 rows)

bbaetz=# set enable_mergejoin=false;
SET

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281600.45..4281600.45 rows=1 width=8) (actual
time=486.80..486.80 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=5580 width=8) (actual
time=146.58..425.92 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..75.73 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual
time=146.34..146.34 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.06..64.98 rows=50000 loops=1)
 Total runtime: 486.91 msec
(7 rows)

bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of
the products are actually used in bugs.product_id, due to an off-by-one
error in the script I used to generate the table)

I still haven't tuned the various optimiser settings, which may explain
part of the enable_mergejoin=false result, although the DISTINCT
probably takes some time too (Side note - is it possible to notice that
DISTINCT on a column with a unique index doesn't need a Unique pass?).
However, 23 minutes vs 0.5 seconds isn't due to that. This is a fairly
useless and silly query though - I was just playing arround.

The tables have been analyzed, and there are separate unique indexes on
products.id, bugs.bug_id and bugs.product_id.

FWIW:

bbaetz=# select n_distinct, correlation FROM pg_stats WHERE
tablename='products' AND attname='product_id';
 n_distinct | correlation
------------+-------------
          9 |   0.0919474
(1 row)

Thanks,

Bradley

pgsql-bugs by date:

From: pgsql-bugs@postgresql.org
Date: 26 January 2003, 05:04:12
Subject: Bug #887: pg_restore blocks

From: Tom Lane
Date: 26 January 2003, 14:10:22
Subject: Re: New hashed IN code ignores distinctiveness of subquery

New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-bugs

Previous

Next