Thread: Possible planner bug/regression introduced in 8.2.5
Hi, we switched from 8.2.4 to 8.2.5 and were forced to switch back. Our main goal was to improve planner perfomance with outer joins per this change from release notes: Fix some planner problems with outer joins, notably poor size estimation for t1 LEFT JOIN t2 WHERE t2.col IS NULL (Tom) But it seems that this change introduced (or exhibited) new bug/regression. We can't get even query plan (explain) for some queries. The server process starts to chew up memory then server starts to swap and then we have to kill the process. On 8.2.4 the query works fine. I don't have self-contained test case but I'll try to provide as much details as possible. The simplified query but still causing the problem looks like this: SELECT 1 -- select list doesn't matter FROM a JOIN b ON (pk = fk) b JOIN c ON (...) [ 13 joins like this in total, not exactly star query ] LEFT OUTER JOIN x ON (pk = fk) LEFT OUTER JOIN y ON (pk = fk) -- WHERE clause doesn't matter ; There are cca 15 tables involved, if I remove one of the outer joins, query is planned ok. Some of the tables are big (millions), some are small. There are many fks between the tables and many indexes (most if not all join conditions are indexed). The behaviour is exhibited when "normal" planner is used, not geqo. Statistics target is set to 500. I tried to simplify the query even more and now I have query which on 8.2.4 is planned instantly and on 8.2.5 takes cca 8 seconds. The query and query plan is attached. All entities are tables, views were eliminated. What more info should I provide? What can I do to debug this problem? Can someone see the cause offhand from the planner changes in 8.2.5? On (possibly) unrelated note: what is recommanded maximum for geqo_threshold? We don't care if planning takes few seconds and produces good plan - our experience with non-deterministic geqo for queries with many big tables is not the best one (the plans vary too much between subsequent runs). Thanks, Kuba -- all entities are tables, views were eliminated explain SELECT 1 FROM obchodni_pripad_verze ov JOIN obchodni_pripad op ON op.obchodni_pripad_pk = ov.obchodni_pripad_pk JOIN stav_pripadu sp ON sp.stav_pripadu_pk = ov.stav_pripadu_pk JOIN stav_pripadu_ciselnik sc ON sc.stav_pripadu_ciselnik_id = sp.stav_pripadu_ciselnik_id JOIN fronta fr ON fr.fronta_id = sp.fronta_id JOIN kontakt ko ON ko.kontakt_pk = op.kontakt_pk JOIN kontakt_verze kov ON kov.kontakt_pk = ko.kontakt_pk JOIN kampan ka ON ka.kampan_pk = op.kampan_pk JOIN produkt pr ON pr.produkt_id = ka.produkt_id JOIN uzivatel uz_ko ON uz_ko.uzivatel_pk = kov.uzivatel_pk_ulozil JOIN abc_obchodni_pripad abc_pripad ON abc_pripad.obchodni_pripad_verze_pk = ov.obchodni_pripad_verze_pk LEFT OUTER JOIN adresa adresa_trvala ON adresa_trvala.adresa_pk = abc_pripad.adresa_pk_trvala LEFT OUTER JOIN abc_ciselnik abc_ciselnik_abc_pripad_produkt ON abc_ciselnik_abc_pripad_produkt.abc_ciselnik_id= abc_pripad.produkt ; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Left Join (cost=14433.26..18497.99 rows=1811 width=0) Hash Cond: (abc_pripad.produkt = abc_ciselnik_abc_pripad_produkt.abc_ciselnik_id) -> Hash Join (cost=14432.17..18490.02 rows=1811 width=4) Hash Cond: (kov.uzivatel_pk_ulozil = uz_ko.uzivatel_pk) -> Nested Loop (cost=14424.58..18457.53 rows=1811 width=8) -> Nested Loop (cost=14424.58..18115.40 rows=332 width=12) -> Nested Loop Left Join (cost=14424.58..17998.25 rows=332 width=8) -> Hash Join (cost=14424.58..15756.76 rows=332 width=12) Hash Cond: (op.kampan_pk = ka.kampan_pk) -> Nested Loop (cost=14420.25..15742.63 rows=1726 width=16) -> Hash Join (cost=14420.25..15188.26 rows=1726 width=12) Hash Cond: (abc_pripad.obchodni_pripad_verze_pk = ov.obchodni_pripad_verze_pk) -> Seq Scan on abc_obchodni_pripad abc_pripad (cost=0.00..610.91 rows=37291width=12) -> Hash (cost=12707.95..12707.95 rows=136984 width=8) -> Nested Loop (cost=287.49..12707.95 rows=136984 width=8) -> Nested Loop (cost=1.34..11.62 rows=9 width=4) -> Hash Join (cost=1.34..6.21 rows=15 width=8) Hash Cond: (sp.fronta_id = fr.fronta_id) -> Seq Scan on stav_pripadu sp (cost=0.00..3.98 rows=198width=12) -> Hash (cost=1.15..1.15 rows=15 width=4) -> Seq Scan on fronta fr (cost=0.00..1.15 rows=15width=4) -> Index Scan using stav_pripadu_ciselnik_stav_pripadu_ciselnik_id_keyon stav_pripadu_ciselnik sc (cost=0.00..0.35 rows=1 width=4) Index Cond: (sc.stav_pripadu_ciselnik_id = sp.stav_pripadu_ciselnik_id) -> Bitmap Heap Scan on obchodni_pripad_verze ov (cost=286.15..1162.48rows=19858 width=12) Recheck Cond: (sp.stav_pripadu_pk = ov.stav_pripadu_pk) -> Bitmap Index Scan on obchodni_pripad_verze_stav_pripadu_pk_idx (cost=0.00..281.19 rows=19858 width=0) Index Cond: (sp.stav_pripadu_pk = ov.stav_pripadu_pk) -> Index Scan using obchodni_pripad_pkey on obchodni_pripad op (cost=0.00..0.31rows=1 width=12) Index Cond: (op.obchodni_pripad_pk = ov.obchodni_pripad_pk) -> Hash (cost=4.08..4.08 rows=20 width=4) -> Hash Join (cost=1.45..4.08 rows=20 width=4) Hash Cond: (ka.produkt_id = pr.produkt_id) -> Seq Scan on kampan ka (cost=0.00..2.04 rows=104 width=8) -> Hash (cost=1.20..1.20 rows=20 width=4) -> Seq Scan on produkt pr (cost=0.00..1.20 rows=20 width=4) -> Index Scan using adresa_pkey on adresa adresa_trvala (cost=0.00..6.74 rows=1 width=4) Index Cond: (adresa_trvala.adresa_pk = abc_pripad.adresa_pk_trvala) -> Index Scan using kontakt_pkey on kontakt ko (cost=0.00..0.34 rows=1 width=4) Index Cond: (ko.kontakt_pk = op.kontakt_pk) -> Index Scan using kontakt_verze_kontakt_pk_idx on kontakt_verze kov (cost=0.00..0.92 rows=9 width=8) Index Cond: (kov.kontakt_pk = ko.kontakt_pk) -> Hash (cost=5.04..5.04 rows=204 width=4) -> Seq Scan on uzivatel uz_ko (cost=0.00..5.04 rows=204 width=4) -> Hash (cost=1.04..1.04 rows=4 width=4) -> Seq Scan on abc_ciselnik abc_ciselnik_abc_pripad_produkt (cost=0.00..1.04 rows=4 width=4) (45 rows)
Jakub Ouhrabka <kuba@comgate.cz> writes: > I tried to simplify the query even more and now I have query which on > 8.2.4 is planned instantly and on 8.2.5 takes cca 8 seconds. Are you sure you were using the same planner parameters (particularly join_collapse_limit and the geqo threshold) in both cases? regards, tom lane
Hi Tom, >> I tried to simplify the query even more and now I have query which on >> 8.2.4 is planned instantly and on 8.2.5 takes cca 8 seconds. > > Are you sure you were using the same planner parameters (particularly > join_collapse_limit and the geqo threshold) in both cases? thanks for the reply. Yes, I hope I'm using same parameters. To exhibit the behaviour I set geqo to off and join_collapse_limit (and from_collapse_limit) to 1000. On the test server, I can start postgresql 8.2.4 and 8.2.5 on the same datafiles and there is still this big difference in planning time. I hope I'm not missing something obvious and not wasting your time... What can I do to help to debug it? Thanks, Kuba
Jakub Ouhrabka <kuba@comgate.cz> writes: > What can I do to help to debug it? Either poke into the code yourself, or submit a self-contained test case (the query alone does not a test case make). I can't offhand think of a reason for 8.2.5 to be slower than 8.2.4 ... regards, tom lane
Hi Tom, > Either poke into the code yourself, or submit a self-contained test > case (the query alone does not a test case make). I can't offhand > think of a reason for 8.2.5 to be slower than 8.2.4 ... preparing the test case was easier than I expected. It's attached. Fast planning on 8.2.4, very slow on 8.2.5. Thanks, Kuba begin; set geqo to off; set join_collapse_limit to 1000; set from_collapse_limit to 1000; Create table a1 ( a1_pk Serial NOT NULL, primary key (a1_pk) ) Without Oids; Create table a2 ( a2_pk Serial NOT NULL, a3_id Integer NOT NULL, primary key (a2_pk) ) Without Oids; Create table a4 ( a4_pk Serial NOT NULL, a4_verze_pk_prvni Integer, a4_verze_pk_aktualni Integer, primary key (a4_pk) ) Without Oids; Create table a6 ( a6_pk Serial NOT NULL, primary key (a6_pk) ) Without Oids; Create table a7_verze ( a7_verze_pk Serial NOT NULL, a7_pk Integer NOT NULL, a7_verze_pk_predchozi Integer, a7_verze_pk_dalsi Integer, a9_pk Integer NOT NULL, a1_pk_ulozil Integer NOT NULL, primary key (a7_verze_pk) ) Without Oids; Create table a7 ( a7_pk Serial NOT NULL, a2_pk Integer NOT NULL, a4_pk Integer NOT NULL, a7_verze_pk_prvni Integer, a7_verze_pk_aktualni Integer, primary key (a7_pk) ) Without Oids; Create table a8 ( a8_pk Serial NOT NULL, a8_id Integer NOT NULL UNIQUE, primary key (a8_pk) ) Without Oids; Create table a9 ( a9_pk Serial NOT NULL, a9_a10_id Integer NOT NULL, a3_id Integer, a8_id Integer NOT NULL, primary key (a9_pk) ) Without Oids; Create table a4_verze ( a4_verze_pk Serial NOT NULL, a4_pk Integer NOT NULL, a4_verze_pk_predchozi Integer, a4_verze_pk_dalsi Integer, a1_pk_ulozil Integer NOT NULL, a6_pk Integer, primary key (a4_verze_pk) ) Without Oids; Create table a9_a10 ( a9_a10_pk Serial NOT NULL, a9_a10_id Integer NOT NULL UNIQUE, primary key (a9_a10_pk) ) Without Oids; CREATE TABLE abc_a7 ( abc_a7_pk integer NOT NULL, a7_verze_pk integer NOT NULL, a6_pk_trvala integer NOT NULL, a3 int ); CREATE TABLE abc_prirazeni_a10u ( abc_prirazeni_a10u_pk integer NOT NULL, abc_a7_pk integer NOT NULL, abc_a10_id integer NOT NULL ); CREATE TABLE a3 ( a3_pk integer NOT NULL, a3_id integer NOT NULL ); CREATE TABLE abc_a10 ( abc_a10_pk integer NOT NULL, abc_a10_id integer NOT NULL ); select now(); explain SELECT 1 FROM a7_verze ov JOIN a7 op ON op.a7_pk = ov.a7_pk JOIN a9 sp ON sp.a9_pk = ov.a9_pk JOIN a9_a10 sc ON sc.a9_a10_id = sp.a9_a10_id JOIN a8 fr ON fr.a8_id = sp.a8_id JOIN a4 ko ON ko.a4_pk = op.a4_pk JOIN a4_verze kov ON kov.a4_pk = ko.a4_pk JOIN a2 ka ON ka.a2_pk = op.a2_pk JOIN a3 pr ON pr.a3_id = ka.a3_id JOIN a1 uz_ko ON uz_ko.a1_pk = kov.a1_pk_ulozil JOIN abc_a7 abc_pripad ON abc_pripad.a7_verze_pk = ov.a7_verze_pk LEFT OUTER JOIN a6 a6_trvala ON a6_trvala.a6_pk = abc_pripad.a6_pk_trvala LEFT OUTER JOIN abc_a10 abc_a10_abc_pripad_a3 ON abc_a10_abc_pripad_a3.abc_a10_id = abc_pripad.a3 ; select now(); rollback;
Jakub Ouhrabka <kuba@comgate.cz> writes: > preparing the test case was easier than I expected. It's attached. Fast > planning on 8.2.4, very slow on 8.2.5. Hmm. I think there are two different bugs involved here. One is fixed by the attached patch, and it masks the performance problem on your test case, but I wonder whether it will be enough for your real application. Can you try this and see if it fixes 8.2.5 for you? regards, tom lane Index: initsplan.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/optimizer/plan/initsplan.c,v retrieving revision 1.123.2.7 diff -c -r1.123.2.7 initsplan.c *** initsplan.c 4 Oct 2007 20:44:55 -0000 1.123.2.7 --- initsplan.c 24 Oct 2007 20:34:08 -0000 *************** *** 625,630 **** --- 625,631 ---- * rel in the lower OJ's min_righthand, not its whole syn_righthand. */ if (bms_overlap(left_rels, otherinfo->syn_righthand) && + bms_overlap(clause_relids, otherinfo->syn_righthand) && !bms_overlap(strict_relids, otherinfo->min_righthand)) { min_lefthand = bms_add_members(min_lefthand,
Hi Tom, > Hmm. I think there are two different bugs involved here. One is fixed > by the attached patch, and it masks the performance problem on your test > case, but I wonder whether it will be enough for your real application. > Can you try this and see if it fixes 8.2.5 for you? many thanks for the quick patch! PostreSQL support is truly unbeatable... We did some test today with patched 8.2.5. For some cases it is ok but for large ones it is still taking quite a big time (e.g. for 23 joins is the planning time 107s). Maybe there's no regression from 8.2.4 - we'll do the tests against 8.2.4 on Monday - it's not usual for us to run with geqo switched off and we don't monitor the planning time - the total runtime for these queries is in minutes so we don't know were the time is spent. I'll keep you updated. Kuba
Jakub Ouhrabka <kuba@comgate.cz> writes: > We did some test today with patched 8.2.5. For some cases it is ok but > for large ones it is still taking quite a big time (e.g. for 23 joins is > the planning time 107s). Maybe there's no regression from 8.2.4 - we'll > do the tests against 8.2.4 on Monday - it's not usual for us to run with > geqo switched off and we don't monitor the planning time - the total > runtime for these queries is in minutes so we don't know were the time > is spent. Yeah, I was afraid that might happen. In the test case you sent, if the first LEFT JOIN is changed to RIGHT JOIN then the runtime goes right back up, because then it actually is the case make_outerjoininfo is looking for where the two outer joins can't be reordered. So you probably have some cases like that in your real application. But I'd expect 8.2.4 to be equally slow because it also contains the code that is slow in that scenario. I'm hoping to get some time today to think about how that could be fixed. regards, tom lane
I wrote: > ...probably have some cases like that in your real application. But I'd > expect 8.2.4 to be equally slow because it also contains the code that > is slow in that scenario. I'm hoping to get some time today to think > about how that could be fixed. Please try the attached patch (in addition to the one I sent earlier). regards, tom lane
Attachment
Hi Tom, >> ...probably have some cases like that in your real application. But I'd >> expect 8.2.4 to be equally slow because it also contains the code that >> is slow in that scenario. I'm hoping to get some time today to think >> about how that could be fixed. > > Please try the attached patch (in addition to the one I sent earlier). I can confirm that now there is no regression between 8.2.4 and 8.2.5+one-line-patch. I've also tried your last patch but still for some queries there is long planning time, e.g. more than 30s on a fast machine for query with 14 regularly JOINed tables plus 8 tables are LEFT OUTER JOINed (all normal joins are followed by all outer joins). All joins are constrained on pk/fk. Should this case run faster or is it simply too much for geqo=off? Shall I prepare a test case? Thanks, Kuba
-----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Tom Lane wrote: > Please try the attached patch (in addition to the one I sent earlier). This is biting us too, quite badly. Any chance this can get pushed into a 8.2.6? - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200710291212 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iD8DBQFHJgZyvJuQZxSWSsgRA61KAJ9ZIz220HZD8u2fr0T+NEg+rFh/AwCgmZIA iXhO5S+j3jGyqxamsQxqYlY= =pR+g -----END PGP SIGNATURE-----
Jakub Ouhrabka <kuba@comgate.cz> writes: > I've also tried your last patch but still for some queries there is long > planning time, e.g. more than 30s on a fast machine for query with 14 > regularly JOINed tables plus 8 tables are LEFT OUTER JOINed (all normal > joins are followed by all outer joins). All joins are constrained on > pk/fk. Should this case run faster or is it simply too much for > geqo=off? Shall I prepare a test case? By "last patch" you mean http://archives.postgresql.org/pgsql-committers/2007-10/msg00409.php ? If so that's about as fast as it's likely to get. 22 tables is well beyond what I'd consider reasonable to run through the exhaustive planner. You should try geqo again. regards, tom lane
> By "last patch" you mean > http://archives.postgresql.org/pgsql-committers/2007-10/msg00409.php > ? Sorry for confusion, I meant this one: http://archives.postgresql.org/pgsql-bugs/2007-10/msg00217.php Is it the same as the commited one? > If so that's about as fast as it's likely to get. 22 tables is well > beyond what I'd consider reasonable to run through the exhaustive > planner. You should try geqo again. OK, no problem... Anyway, many thanks for the quick patches! Kuba
Jakub Ouhrabka <kuba@comgate.cz> writes: >> By "last patch" you mean >> http://archives.postgresql.org/pgsql-committers/2007-10/msg00409.php > Sorry for confusion, I meant this one: > http://archives.postgresql.org/pgsql-bugs/2007-10/msg00217.php > Is it the same as the commited one? Yeah, should be the same. regards, tom lane
"Greg Sabino Mullane" <greg@turnstep.com> writes: > Tom Lane wrote: >> Please try the attached patch (in addition to the one I sent earlier). > This is biting us too, quite badly. Any chance this can get pushed into a > 8.2.6? Those patches are certainly already in the 8.2 CVS branch, so your question seems to mean "are we going to push 8.2.6 immediately to fix this". My vote would be no --- 8.2.5 is less than six weeks old and we don't have that many bugs against it. Given the overhead involved in a release, both from our point of view as packagers and users' point of view in having to install it, a single bug has to be pretty darn catastrophic to force an update by itself. This doesn't seem to me to reach that level... regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 > Tom Lane wrote: >> Please try the attached patch (in addition to the one I sent earlier). > This is biting us too, quite badly. Any chance this can get pushed into a > 8.2.6? > Those patches are certainly already in the 8.2 CVS branch, so your > question seems to mean "are we going to push 8.2.6 immediately to fix > this". My vote would be no --- 8.2.5 is less than six weeks old and > we don't have that many bugs against it. Given the overhead involved > in a release, both from our point of view as packagers and users' point > of view in having to install it, a single bug has to be pretty darn > catastrophic to force an update by itself. This doesn't seem to me > to reach that level... I suppose catastophic is in the eye of the beholder, but this is very, very severe to one of our clients. So much so that they may end up going back to 8.1. The patches to 8.2.5 fix some of the queries, but not all; some of the problems seem to exist on 8.2.4 as well. We'll try to develop a self-contained test case that shows the problem, but until then wanted to give a heads up that a problem may still exist. - -- Greg Sabino Mullane greg@turnstep.com End Point Corporation PGP Key: 0x14964AC8 200711051303 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iD8DBQFHL1rpvJuQZxSWSsgRAzNuAJ9xkww4QwjoavHO9SkTf9Zm7Jl9PgCfQ77k UXPer7AeI0xXe/f3XkMkUps= =imfW -----END PGP SIGNATURE-----
"Greg Sabino Mullane" <greg@turnstep.com> writes: >> Tom Lane wrote: >> Those patches are certainly already in the 8.2 CVS branch, so your >> question seems to mean "are we going to push 8.2.6 immediately to fix >> this". My vote would be no --- 8.2.5 is less than six weeks old and > I suppose catastophic is in the eye of the beholder, but this is very, > very severe to one of our clients. So much so that they may end up going > back to 8.1. The patches to 8.2.5 fix some of the queries, but not all; > some of the problems seem to exist on 8.2.4 as well. We'll try to > develop a self-contained test case that shows the problem, but until then > wanted to give a heads up that a problem may still exist. All the more reason not to push 8.2.6 immediately ... regards, tom lane