Thread: Crash in 9.4 Beta when partially collapsing left outer joins
I think this is the first time I've ever reported a PG crash, which is notable since I've been using PG for over 10 years. ;) Using the 9.4 Beta RPMs on CentOS 6.X/64, we're experiencing a reproducible crash when running a query with a left outer join, partially collapsed. TRAP: FailedAssertion("!(!restriction_is_or_clause((RestrictInfo *) orarg))", File: "indxpath.c", Line: 1213) < 2014-09-08 14:21:33.179 PDT >LOG: server process (PID 19957) was terminated by signal 6: Aborted < 2014-09-08 14:21:33.179 PDT >DETAIL: Failed process was running: SELECT students.id FROM students LEFT OUTER JOIN enrollments ON ( enrollments.students_id = students.id ) WHERE ( students.id = 5008 OR ( ( students.birthcity = 'Chico' OR students.birthcity IS NULL ) AND enrollments.start < 20141219 ) ); < 2014-09-08 14:21:33.179 PDT >LOG: terminating any other active server processes < 2014-09-08 14:21:33.179 PDT >WARNING: terminating connection because of crash of another server process < 2014-09-08 14:21:33.179 PDT >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. Here's the smallest query I could find that it would crash on. Run on a blank database, the following will reliably crash postgres: CREATE TABLE students (id SERIAL PRIMARY KEY, birthcity VARCHAR DEFAULT NULL); CREATE TABLE enrollments (students_id INTEGER NOT NULL REFERENCES students(id), start INTEGER); SELECT students.id FROM students LEFT OUTER JOIN enrollments ON ( enrollments.students_id = students.id ) WHERE ( students.id = 5008 OR ( ( students.birthcity = 'Chico' OR students.birthcity IS NULL ) AND enrollments.start < 20141219 ) ); ----------------------- Other environment stuff: [root@db1 pgsql]# rpm -qa | grep postg postgresql94-libs-9.4beta2-1PGDG.rhel6.x86_64 postgresql94-server-9.4beta2-1PGDG.rhel6.x86_64 postgresql94-devel-9.4beta2-1PGDG.rhel6.x86_64 postgresql92-libs-9.2.9-1PGDG.rhel6.x86_64 postgresql94-9.4beta2-1PGDG.rhel6.x86_64 postgresql94-contrib-9.4beta2-1PGDG.rhel6.x86_64 [root@db1 pgsql]# uname -a Linux db1.schoolpathways.com 2.6.32-431.23.3.el6.x86_64 #1 SMP Thu Jul 31 17:20:51 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux What other information should I provide? We have the machine available if necessary. Benjamin Smith
lists@benjamindsmith.com writes: > Using the 9.4 Beta RPMs on CentOS 6.X/64, we're experiencing a reproducible > crash when running a query with a left outer join, partially collapsed. The test case crashes as described for me. Will take a look tomorrow. Thanks for the report! regards, tom lane
On Tue, Sep 9, 2014 at 6:36 AM, <lists@benjamindsmith.com> wrote: > What other information should I provide? We have the machine available if > necessary. This can be reproduced without especially LEFT OUTER JOIN, and system crashes as long as index path is taken in planner, and that WHERE clause uses a given combination of OR and AND like the one in the query given. Here is a more simple example: create table aa (a int); create index aai on aa(a); select a1.a from aa a1, aa a2 where a1.a = 0 or (a1.a = 0 or a1.a = 1) and a2.a = 0; Some bisecting is showing as well that the commit at the origin of the regression is f343a88. Regards, -- Michael
On Tue, Sep 9, 2014 at 2:43 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Some bisecting is showing as well that the commit at the origin of the > regression is f343a88. The failure is caused by an assertion not happy since this commit: frame #4: 0x0000000101d20670 postgres`generate_bitmap_or_paths(root=0x00007fd61d004d48, rel=0x00007fd61c033a58, clauses=0x00007fd61d010200, other_clauses=0x0000000000000000) + 480 at indxpath.c:1213 frame #5: 0x0000000101d1fc37 postgres`create_index_paths(root=0x00007fd61d004d48, rel=0x00007fd61c033a58) + 1255 at indxpath.c:314 frame #6: 0x0000000101d1146b postgres`set_plain_rel_pathlist(root=0x00007fd61d004d48, rel=0x00007fd61c033a58, rte=0x00007fd61c033c88) + 75 at allpaths.c:397 While reading the code of this commit, I noticed that extract_or_clause has added some logic for nested OR clauses: it extracts their content and adds them directly to the list of subclauses that are then used by generate_bitmap_or_paths, triggering the assertion failure reported by the trace above. The logic for nested OR is correct by reading it, hence why not simply removing the assertion failing? The attached patch 1 does so. Another approach would consist in removing the nested OR part and keep the old assertion logic, like in the patch 2 attached, but this seems like a no-go as f343a88 has actually improved nested OR tracking. Thoughts? Note: I added as well a regression tests in patch 1 as this is IMO the correct approach, if that's considered as correct of course :) -- Michael
Attachment
Michael Paquier <michael.paquier@gmail.com> writes: > The logic for nested OR is correct by reading it, hence why not simply > removing the assertion failing? The attached patch 1 does so. The reason for the assert is that there should never be an OR directly underneath an OR in the planner after eval_const_expressions has flattened such cases. Evidently commit f343a88 failed to preserve AND/OR flatness in some cases :-(. That code should be taught to do so, rather than lobotomizing this assertion. Lack of flatness causes optimization inefficiencies, which is why we don't want to just allow it. regards, tom lane
On Tue, Sep 9, 2014 at 10:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> The logic for nested OR is correct by reading it, hence why not simply >> removing the assertion failing? The attached patch 1 does so. > > The reason for the assert is that there should never be an OR directly > underneath an OR in the planner after eval_const_expressions has flattened > such cases. Evidently commit f343a88 failed to preserve AND/OR flatness > in some cases :-(. That code should be taught to do so, rather than > lobotomizing this assertion. Lack of flatness causes optimization > inefficiencies, which is why we don't want to just allow it. Ah, OK, I just saw your commit. so the trick is to add the arguments of subclause in case of an OR clause found to have a correct flattening here... Thanks! -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Tue, Sep 9, 2014 at 10:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The reason for the assert is that there should never be an OR directly >> underneath an OR in the planner after eval_const_expressions has flattened >> such cases. Evidently commit f343a88 failed to preserve AND/OR flatness >> in some cases :-(. That code should be taught to do so, rather than >> lobotomizing this assertion. Lack of flatness causes optimization >> inefficiencies, which is why we don't want to just allow it. > Ah, OK, I just saw your commit. so the trick is to add the arguments > of subclause in case of an OR clause found to have a correct > flattening here... Thanks! Right. If you look again at that code in orclauses.c, you'll notice that it is itself assuming AND/OR flatness in its input. We could discard that assumption, but it would just mean moving complexity from the places that currently have to preserve flatness to other places. For instance, right now we suppose that all "top level" WHERE clauses are in the top-level AND list ... if we had to check for sub-AND clauses and recurse into those, it would make life complicated in numerous places. I do wonder, having seen this bug, if there's someplace we could add assertions to check for AND/OR flatness that'd be more certainly hit by a violation. regards, tom lane