BUG #6668: hashjoin cost problem - Mailing list pgsql-bugs
From:       postgresuser@yahoo.com
Subject:    BUG #6668: hashjoin cost problem
Date:
Msg-id:     E1SZtC5-00028E-DI@wrigleys.postgresql.org
Responses:  Re: BUG #6668: hashjoin cost problem
List:       pgsql-bugs
The following bug has been logged on the website:

Bug reference:      6668
Logged by:          Postgres User
Email address:      postgresuser@yahoo.com
PostgreSQL version: 9.1.3
Operating system:   Ubuntu

Description:

With work_mem set to 1MB:

create table small(i) as select (g/1000) * 1000 from generate_series(1,10000) g;
create table large(i) as select generate_series(1,100000000);
vacuum;
vacuum;
vacuum analyze;

explain analyze select * from small inner join large using (i);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3083103.00..3475328.00 rows=10000 width=4) (actual time=84079.899..84989.419 rows=9001 loops=1)
   Hash Cond: (small.i = large.i)
   ->  Seq Scan on small  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.008..0.588 rows=10000 loops=1)
   ->  Hash  (cost=1442478.00..1442478.00 rows=100000000 width=4) (actual time=84079.741..84079.741 rows=100000000 loops=1)
         Buckets: 4096  Batches: 4096  Memory Usage: 853kB
         ->  Seq Scan on large  (cost=0.00..1442478.00 rows=100000000 width=4) (actual time=0.005..59011.443 rows=100000000 loops=1)
 Total runtime: 84990.270 ms
(7 rows)

It doesn't matter how big the big table is: for this data distribution, the large table is always the one hashed. Forcing the cost in one of the cost_hashjoin calls to 0 in gdb, the planner chooses to hash the smaller table instead, with much better results:

explain analyze select * from small inner join large using (i);
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=270.00..0.00 rows=10000 width=4) (actual time=14028.034..16510.598 rows=9001 loops=1)
   Hash Cond: (large.i = small.i)
   ->  Seq Scan on large  (cost=0.00..1442478.00 rows=100000000 width=4) (actual time=0.010..5893.344 rows=100000000 loops=1)
   ->  Hash  (cost=145.00..145.00 rows=10000 width=4) (actual time=3.854..3.854 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 352kB
         ->  Seq Scan on small  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.005..1.585 rows=10000 loops=1)
 Total runtime: 16510.962 ms
(7 rows)

Digging further in gdb, essentially all of the cost comes from this term in cost_hashjoin:

    run_cost += hash_qual_cost.per_tuple * outer_path_rows *
        clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

Here (outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5) is 50 billion, leading to a wild cost. The parent join's row count is rightly estimated at 10000, so 50 billion comparisons is obviously a bad estimate.
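As a back-of-envelope check of that term (a sketch, assuming the rejected plan is the one that hashes "small", so outer_path_rows = 100000000 for large and inner_path_rows = 10000 for small; the innerbucketsize of 0.1 is inferred from the 50 billion total, since small holds only ~10 distinct values and each hash bucket is therefore expected to hold roughly 10% of the table):

    select 100000000::float8                 -- outer_path_rows (large, probe side)
         * greatest(round(10000 * 0.1), 1)   -- clamp_row_est(inner_path_rows * innerbucketsize)
         * 0.5 as estimated_comparisons;     -- => 50 billion

After multiplying by hash_qual_cost.per_tuple, that term alone dwarfs the ~3.5M total cost of the plan that hashes large, so the planner picks the worse plan.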
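The skew that inflates innerbucketsize is visible in the planner's statistics. A generic way to inspect it (this query is illustrative, not part of the original report):

    select tablename, attname, n_distinct, most_common_freqs
    from pg_stats
    where tablename in ('small', 'large') and attname = 'i';

For small.i, n_distinct should come out around 11, since (g/1000) * 1000 collapses the 10000 rows onto ~11 distinct values, with most of the common-value frequencies near 0.1.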