BUG #6668: hashjoin cost problem - Mailing list pgsql-bugs
| From | postgresuser@yahoo.com |
|---|---|
| Subject | BUG #6668: hashjoin cost problem |
| Date | |
| Msg-id | E1SZtC5-00028E-DI@wrigleys.postgresql.org |
| Responses | Re: BUG #6668: hashjoin cost problem |
| List | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 6668
Logged by: Postgres User
Email address: postgresuser@yahoo.com
PostgreSQL version: 9.1.3
Operating system: Ubuntu
Description:
work_mem 1MB
create table small(i) as select (g/1000) * 1000 from
generate_series(1,10000) g;
create table large(i) as select generate_series(1,100000000);
vacuum;
vacuum;
vacuum analyze;
explain analyze select * from small inner join large using (i);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3083103.00..3475328.00 rows=10000 width=4) (actual time=84079.899..84989.419 rows=9001 loops=1)
   Hash Cond: (small.i = large.i)
   ->  Seq Scan on small  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.008..0.588 rows=10000 loops=1)
   ->  Hash  (cost=1442478.00..1442478.00 rows=100000000 width=4) (actual time=84079.741..84079.741 rows=100000000 loops=1)
         Buckets: 4096  Batches: 4096  Memory Usage: 853kB
         ->  Seq Scan on large  (cost=0.00..1442478.00 rows=100000000 width=4) (actual time=0.005..59011.443 rows=100000000 loops=1)
 Total runtime: 84990.270 ms
(7 rows)
It doesn't matter how big the big table is; for this data distribution the large table is always the one that gets hashed.
Forcing the cost in one of the cost_hashjoin calls to 0 with gdb (a rough sketch of the gdb approach follows the plan below), the planner chooses to hash the smaller table instead, with better results:
explain analyze select * from small inner join large using (i);
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=270.00..0.00 rows=10000 width=4) (actual time=14028.034..16510.598 rows=9001 loops=1)
   Hash Cond: (large.i = small.i)
   ->  Seq Scan on large  (cost=0.00..1442478.00 rows=100000000 width=4) (actual time=0.010..5893.344 rows=100000000 loops=1)
   ->  Hash  (cost=145.00..145.00 rows=10000 width=4) (actual time=3.854..3.854 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 352kB
         ->  Seq Scan on small  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.005..1.585 rows=10000 loops=1)
 Total runtime: 16510.962 ms
(7 rows)
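For completeness, a rough sketch of the gdb approach (hypothetical commands, not the exact session used):

    $ gdb -p <backend pid>            # attach to the backend that will run the EXPLAIN
    (gdb) break cost_hashjoin         # costing function in src/backend/optimizer/path/costsize.c
    (gdb) continue
    # ... issue the EXPLAIN; when the breakpoint fires for the path that hashes
    # the small table, step past the run_cost computation with "next", then:
    (gdb) set variable run_cost = 0   # zero the run cost before it is folded into total_cost
    (gdb) continue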
Digging further in gdb, essentially all of the cost comes from this term:
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
    clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
Here (outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5) comes out to 50 billion, leading to a wildly inflated cost. The parent join's row count is correctly estimated at 10000, so 50 billion comparisons is obviously a bad estimate.
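Roughly, and assuming the planner's innerbucketsize estimate for the small table is around 0.1 (it has only about 11 distinct values, so the largest hash bucket holds roughly a tenth of its rows), the term works out as:

    outer_path_rows  = 100,000,000   (seq scan on large)
    inner_path_rows  = 10,000        (seq scan on small)
    innerbucketsize  ~ 0.1           (assumed)

    100,000,000 * clamp_row_est(10,000 * 0.1) * 0.5
        = 100,000,000 * 1,000 * 0.5
        = 50,000,000,000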