Thread: possible bug in 8.4
create table a(a int not null); insert into a(a) select generate_series(1,6000000); create table b as select * from a where a%10 <> 0; create index fooa on a(a); alter table b alter column a set not null; create index foob on b(a); vacuum analyze verbose; gj=# explain select a.a from a where a not in (select a from b); QUERY PLAN ------------------------------------------------------------------------- Seq Scan on a (cost=99035.00..257874197565.00rows=3000000 width=4) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=99035.00..171493.00rows=5400000 width=4) -> Seq Scan on b (cost=0.00..75177.00 rows=5400000 width=4) (5 rows) that's absolutely humongous cost, and it really does take ages before this thing finishes (had to kill it after an hour). For change, same type of query (should return same thing) gj=# explain analyze select a.a from a left join b on a.a=b.a where b.a is null; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- MergeAnti Join (cost=0.00..350302.50 rows=600000 width=4) (actual time=0.534..10851.995 rows=600000 loops=1) Merge Cond: (a.a = b.a) -> Index Scan using fooa on a (cost=0.00..166209.78rows=6000000 width=4) (actual time=0.232..3128.438 rows=6000000 loops=1) -> Index Scan using foob on b (cost=0.00..149592.72 rows=5400000 width=4) (actual time=0.161..2778.569 rows=5400000 loops=1) Total runtime: 10938.592 ms (5 rows) Time: 10939,414 ms
Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes: > gj=# explain select a.a from a where a not in (select a from b); > QUERY PLAN > ------------------------------------------------------------------------- > Seq Scan on a (cost=99035.00..257874197565.00 rows=3000000 width=4) > Filter: (NOT (subplan)) > SubPlan > -> Materialize (cost=99035.00..171493.00 rows=5400000 width=4) > -> Seq Scan on b (cost=0.00..75177.00 rows=5400000 width=4) > (5 rows) > > > that's absolutely humongous cost, and it really does take ages before this > thing finishes (had to kill it after an hour). I think Postgres can't do better because there could be a NULL in the subquery. If there's a NULL in the subquery then no record would match. Now your column is NOT NULL so Postgres could do better but AFAIK we don't look at column constraints like NOT NULL when planning. Historically we couldn't because we didn't have plan invalidation -- and the plan you posted below with the Anti-Join is brand new in 8.4 -- so there is room for improvement but it's not exactly a bug. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Gregory Stark <stark@enterprisedb.com> writes: > I think Postgres can't do better because there could be a NULL in the > subquery. If there's a NULL in the subquery then no record would match. Yeah. NOT IN does not have the right semantics to become an antijoin. FWIW, you can use a NOT EXISTS: select a.a from a where not exists (select 1 from b where a.a = b.a); In HEAD this should give the identical plan to the leftjoin/is null hack. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> I think Postgres can't do better because there could be a NULL in the >> subquery. If there's a NULL in the subquery then no record would match. > > Yeah. NOT IN does not have the right semantics to become an antijoin. If we noticed that the columns in the subquery are all guaranteed to be not null could we do it then? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark <stark@enterprisedb.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Yeah. NOT IN does not have the right semantics to become an antijoin. > If we noticed that the columns in the subquery are all guaranteed to be not > null could we do it then? I think you'd also have to know that the outer-query value isn't null, plus assume that the comparison operator can't return null for two non-nulls (but we already assume that for btree/hash equality I think). As you said, this would never have been safe before plan invalidation, but it might be doable now. regards, tom lane
all I know, is that the same query will work on 8.3 in reasonably acceptable time frame.
Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes: > all I know, is that the same query will work on 8.3 in reasonably acceptable > time frame. What plan do you get in 8.3? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes: > all I know, is that the same query will work on 8.3 in reasonably acceptable > time frame. > Because I see the exact same plan -- in fact with the exact same cost: Welcome to psql 8.3.3, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help with psql commands \g orterminate with semicolon to execute query \q to quit postgres=# explain select a.a from a where a not in (select a from b); QUERY PLAN -------------------------------------------------------------------------Seq Scan on a (cost=99035.00..257874197565.00 rows=3000000width=4) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=99035.00..171493.00 rows=5400000 width=4) -> Seq Scan on b (cost=0.00..75177.00 rows=5400000 width=4) (5 rows) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
>>> Gregory Stark <stark@enterprisedb.com> wrote: > Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes: > >> all I know, is that the same query will work on 8.3 in reasonably > acceptable >> time frame. >> > > Because I see the exact same plan -- in fact with the exact same cost: > > > Welcome to psql 8.3.3, the PostgreSQL interactive terminal. I'm seeing the exact same plan on 8.3.5, too. -Kevin
true, but as soon as I drop indices on both tables: root=# explain analyze select a.a from a where a not in (select a from b); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Seq Scan on a (cost=88677.00..187207.00 rows=3000000 width=4) (actual time=22803.470..26473.039 rows=600000 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on b (cost=0.00..75177.00rows=5400000 width=4) (actual time=9.232..15828.904 rows=5400000 loops=1) Total runtime: 29357.267 ms (5 rows) root=# select version(); version -------------------------------------------------------------------------------------------------------- PostgreSQL 8.3.5 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10) (1 row) I thought planner should be able to choose plan that costs less.
and the plan on 8.4: gj=# explain select a.a from a where a not in (select a from b); QUERY PLAN ------------------------------------------------------------------------- Seq Scan on a (cost=99035.00..257874197565.00rows=3000000 width=4) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=99035.00..171493.00rows=5400000 width=4) -> Seq Scan on b (cost=0.00..75177.00 rows=5400000 width=4) (5 rows) This looks like a regression to me, or am I wrong ?
Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes: > Filter: (NOT (hashed subplan)) ^^^^^^^^^^^^^^ If 8.3 does that, and 8.4 doesn't, it's most likely because you are using different work_mem settings. regards, tom lane
On 2008-12-19, at 13:07, Tom Lane wrote: > Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes: >> Filter: (NOT (hashed subplan)) > ^^^^^^^^^^^^^^ > > If 8.3 does that, and 8.4 doesn't, it's most likely because you are > using different work_mem settings. you're right, as always :) My point is, why is planner choosing plan on 8.4 that's obviously more expensive ? even without sort_mem (work_mem) set to higher value ? Obviously the plan is quite expensive, so probably sorting it on disc - would be still cheaper. The example is quite unrealistic, but I am trying different simple things to test planner differences between these two versions, and see if there's any improvement/regression. This time I will set both work_mem to same value :)