Thread: Recent failures on buildfarm member hornet
hornet has failed its last five runs with 2020-10-05 22:45:42.784 UTC [34734498:40] pg_regress/create_aggregate LOG: statement: create aggregate my_percentile_disc(float8ORDER BY anyelement) ( stype = internal, sfunc = ordered_set_transition, finalfunc = percentile_disc_final, finalfunc_extra = true, finalfunc_modify = read_write ); TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498) After looking at the commits immediately preceding the first failure, and digging around in the aggregate-related code, it seems like commit cc99baa43 (Improve pg_list.h's linitial(), lsecond() and co macros) must've broke it somehow. The nearest thing that I can see to a theory is that where DefineAggregate does numDirectArgs = intVal(lsecond(args)); it's coming out with the wrong result, leading to a failure of the numDirectArgs-vs-numArgs sanity check in AggregateCreate. But how could that be? I hesitate to blame the compiler twice in one week. OTOH, it's a not-very-mainstream compiler on a not-very-mainstream architecture. Noah, can you poke into this in a little more detail and try to verify what is happening? regards, tom lane
On Mon, Oct 05, 2020 at 08:42:15PM -0400, Tom Lane wrote: > hornet has failed its last five runs with > > 2020-10-05 22:45:42.784 UTC [34734498:40] pg_regress/create_aggregate LOG: statement: create aggregate my_percentile_disc(float8ORDER BY anyelement) ( > stype = internal, > sfunc = ordered_set_transition, > finalfunc = percentile_disc_final, > finalfunc_extra = true, > finalfunc_modify = read_write > ); > TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498) > > After looking at the commits immediately preceding the first failure, and > digging around in the aggregate-related code, it seems like commit > cc99baa43 (Improve pg_list.h's linitial(), lsecond() and co macros) > must've broke it somehow. The nearest thing that I can see to a theory > is that where DefineAggregate does > numDirectArgs = intVal(lsecond(args)); > it's coming out with the wrong result, leading to a failure of the > numDirectArgs-vs-numArgs sanity check in AggregateCreate. Building the tree with -O0 suppresses the problem. (xlc does not have -O1.) Building just aggregatecmds.c and pg_aggregate.c that way does not, so I suppose the damage arose elsewhere. > But how could > that be? I hesitate to blame the compiler twice in one week. OTOH, it's > a not-very-mainstream compiler on a not-very-mainstream architecture. A compiler bug is plausible. The compiler is eight years old, and we're not seeing the problem on 32-bit (mandrill) or on 2019-vintage xlc (hoverfly). Shall I make this animal use -O0 on v14+?
Noah Misch <noah@leadboat.com> writes: > On Mon, Oct 05, 2020 at 08:42:15PM -0400, Tom Lane wrote: >> hornet has failed its last five runs with >> TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498) > Building the tree with -O0 suppresses the problem. (xlc does not have -O1.) OK, that's pretty unsurprising. > Building just aggregatecmds.c and pg_aggregate.c that way does not, so I > suppose the damage arose elsewhere. Now that *is* surprising. Could you poke a little further to determine which module is getting miscompiled? (gram.o seems like the next most likely bet.) regards, tom lane
On Tue, Oct 06, 2020 at 09:56:49PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Mon, Oct 05, 2020 at 08:42:15PM -0400, Tom Lane wrote: > >> hornet has failed its last five runs with > >> TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498) > > > Building the tree with -O0 suppresses the problem. (xlc does not have -O1.) > > OK, that's pretty unsurprising. > > > Building just aggregatecmds.c and pg_aggregate.c that way does not, so I > > suppose the damage arose elsewhere. > > Now that *is* surprising. Could you poke a little further to determine > which module is getting miscompiled? (gram.o seems like the next most > likely bet.) gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o, but those seem far less likely.)
Noah Misch <noah@leadboat.com> writes: > On Tue, Oct 06, 2020 at 09:56:49PM -0400, Tom Lane wrote: >> Now that *is* surprising. Could you poke a little further to determine >> which module is getting miscompiled? (gram.o seems like the next most >> likely bet.) > gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o, > but those seem far less likely.) This suggests that the problem is misoptimization of gram.y's makeOrderedSetArgs: /* don't merge into the next line, as list_concat changes directargs */ ndirectargs = list_length(directargs); return list_make2(list_concat(directargs, orderedargs), makeInteger(ndirectargs)); I think that if the compiler did what the comment says not to, it'd match this symptom. However, I'm baffled as to why our recent pg_list.h changes would've affected that. regards, tom lane
Thanks for investigating this, Noah. On Thu, 8 Oct 2020 at 06:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Noah Misch <noah@leadboat.com> writes: > > gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o, > > but those seem far less likely.) > > This suggests that the problem is misoptimization of gram.y's > makeOrderedSetArgs: It would be interesting to see gram.s from both cc99baa4~1 and cc99baa4. David
On Thu, Oct 08, 2020 at 09:15:12AM +1300, David Rowley wrote: > On Thu, 8 Oct 2020 at 06:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Noah Misch <noah@leadboat.com> writes: > > > gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o, > > > but those seem far less likely.) > > > > This suggests that the problem is misoptimization of gram.y's > > makeOrderedSetArgs: > > It would be interesting to see gram.s from both cc99baa4~1 and cc99baa4. Attached. Both generated like this: (cd src/backend/parser && xlc_r -D_LARGE_FILES=1 -qnoansialias -g -O2 -S -qmaxmem=33554432 -I. -I. -I../../../src/include-c gram.c; ) (Again, the compiler is eight years old, and we're not seeing the problem on 32-bit (mandrill) or on 2019-vintage xlc (hoverfly). As soon as the situation no longer piques your curiosity, I'm happy to make hoverfly use -O0 on v14+.)
Attachment
Noah Misch <noah@leadboat.com> writes: > On Thu, Oct 08, 2020 at 09:15:12AM +1300, David Rowley wrote: >> It would be interesting to see gram.s from both cc99baa4~1 and cc99baa4. > Attached. Both generated like this: Hm. I'm too lazy to go bone up on PPC64 ABI conventions, but this does look suspiciously like the compiler is doing what I feared: GOOD: lwa r31,4(r27) # fetching list_length(directargs) ? .line 16295 ori r3,r27,0x0000 bl .list_concat{PR} ori r0,r0,0x0000 std r3,112(SP) .line 16295 extsw r3,r31 # ... and passing it to makeInteger bl .makeInteger{PR} ori r0,r0,0x0000 BAD: ori r3,r31,0x0000 bl .list_concat{PR} ori r0,r0,0x0000 std r3,112(SP) .line 16288 lwa r3,4(r31) # fetching list_length(directargs) ? bl .makeInteger{PR} ori r0,r0,0x0000 (I'm confused about why the line numbers don't match up, since cc99baa4 did not touch gram.y. But whatever.) I'm tempted to propose the attached small code rearrangement, which might dissuade the compiler from thinking it can get away with this. While I concur with your point that an old xlc version might not be that exciting, there could be other compilers doing the same thing in the future. regards, tom lane diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 0d101d8171..480d168346 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -16291,7 +16291,7 @@ makeOrderedSetArgs(List *directargs, List *orderedargs, core_yyscan_t yyscanner) { FunctionParameter *lastd = (FunctionParameter *) llast(directargs); - int ndirectargs; + Value *ndirectargs; /* No restriction unless last direct arg is VARIADIC */ if (lastd->mode == FUNC_PARAM_VARIADIC) @@ -16315,10 +16315,10 @@ makeOrderedSetArgs(List *directargs, List *orderedargs, } /* don't merge into the next line, as list_concat changes directargs */ - ndirectargs = list_length(directargs); + ndirectargs = makeInteger(list_length(directargs)); return list_make2(list_concat(directargs, orderedargs), - makeInteger(ndirectargs)); + ndirectargs); } /* insertSelectOptions()
I wrote: > I'm tempted to propose the attached small code rearrangement, which > might dissuade the compiler from thinking it can get away with this. > While I concur with your point that an old xlc version might not be > that exciting, there could be other compilers doing the same thing > in the future. After thinking about it a bit more, I'm not even convinced that what xlc seems to be doing is illegal per C spec. There are no sequence points within return list_make2(list_concat(directargs, orderedargs), makeInteger(ndirectargs)); and therefore there's an argument to be made that the compiler doesn't have to care whether any side-effects of list_concat() occur before or after the evaluation of makeInteger(ndirectargs). If the potential side-effects of list_concat() can be disregarded until the end of this statement, then the code change is perfectly legal. Maybe some very careful language-lawyering could prove differently, but it's not as open-and-shut as one could wish. So now I'm thinking that we need this patch anyway, xlc or not. regards, tom lane
On Wed, Oct 07, 2020 at 06:03:16PM -0400, Tom Lane wrote: > I wrote: > > I'm tempted to propose the attached small code rearrangement, which > > might dissuade the compiler from thinking it can get away with this. That patch does get hornet's -O2 build to again pass "make check". It doesn't harm the code, so we may as well use it. > > While I concur with your point that an old xlc version might not be > > that exciting, there could be other compilers doing the same thing > > in the future. > > After thinking about it a bit more, I'm not even convinced that what > xlc seems to be doing is illegal per C spec. There are no sequence > points within > > return list_make2(list_concat(directargs, orderedargs), > makeInteger(ndirectargs)); There is, however, a sequence point between list_length(directargs) and list_concat(), and the problem arises because xlc reorders those two. It's true that makeInteger() could run before or after list_concat(), but that alone would not have been a problem.
Noah Misch <noah@leadboat.com> writes: > On Wed, Oct 07, 2020 at 06:03:16PM -0400, Tom Lane wrote: >> After thinking about it a bit more, I'm not even convinced that what >> xlc seems to be doing is illegal per C spec. There are no sequence >> points within >> >> return list_make2(list_concat(directargs, orderedargs), >> makeInteger(ndirectargs)); > There is, however, a sequence point between list_length(directargs) and > list_concat(), and the problem arises because xlc reorders those two. It's > true that makeInteger() could run before or after list_concat(), but that > alone would not have been a problem. Yeah, that is the theory on which the existing code is built, specifically that the list_length fetch must occur before list_concat runs. What I am wondering about is a more aggressive interpretation of "sequence point", namely that the compiler is free to disregard exactly when list_concat's side-effects occur between this statement's sequence points. I'm not sure that the C spec allows that interpretation, but I'm not sure it doesn't, either. regards, tom lane
On Wed, Oct 07, 2020 at 06:22:04PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Wed, Oct 07, 2020 at 06:03:16PM -0400, Tom Lane wrote: > >> After thinking about it a bit more, I'm not even convinced that what > >> xlc seems to be doing is illegal per C spec. There are no sequence > >> points within > >> > >> return list_make2(list_concat(directargs, orderedargs), > >> makeInteger(ndirectargs)); > > > There is, however, a sequence point between list_length(directargs) and > > list_concat(), and the problem arises because xlc reorders those two. It's > > true that makeInteger() could run before or after list_concat(), but that > > alone would not have been a problem. > > Yeah, that is the theory on which the existing code is built, > specifically that the list_length fetch must occur before list_concat > runs. What I am wondering about is a more aggressive interpretation of > "sequence point", namely that the compiler is free to disregard exactly > when list_concat's side-effects occur between this statement's sequence > points. I'm not sure that the C spec allows that interpretation, but > I'm not sure it doesn't, either. C does not allow that, because the list_length() happens in a different "full expression" (different statement): ndirectargs = list_length(directargs);/* noah adds: a sequence point is here */ return list_make2(list_concat(directargs, orderedargs), makeInteger(ndirectargs)); A compiler may implement like this: ndirectargs = list_length(directargs); a := list_concat(directargs, orderedargs); b := makeInteger(ndirectargs); return list_make2(a, b); Or this: ndirectargs = list_length(directargs); b := makeInteger(ndirectargs); a := list_concat(directargs, orderedargs); return list_make2(a, b); But not like this: a := list_concat(directargs, orderedargs); ndirectargs = list_length(directargs); b := makeInteger(ndirectargs); return list_make2(a, b);