Thread: PostgreSQL 12 crash bug report
1. Run the docker version of PostgreSQL 12:
   $ docker run --rm -p 5432:5432 -v /tmp:/data -e POSTGRES_USER=exchange -e POSTGRES_DB=exchange -e POSTGRES_PASSWORD=123456 postgres:12-alpine
2. Download the attached files to /tmp
3. Import the schema and data:
   $ docker exec -it $CONTAINER psql -U exchange exchange -f /data/t_order_schema.sql
   $ docker exec -it $CONTAINER psql -U exchange exchange -f /data/t_order_data.sql
4. Run pgbench with the provided test script:
   $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql
Attachment
yi huang <yi.codeplayer@gmail.com> writes:
> [ crashes with ]
> $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql

Thanks for the report!

Note that this doesn't fail if you just run test.sql by itself, but it
does soon fall over as described when running multiple copies in
parallel.  Unsurprisingly given that, the crash is inside EvalPlanQual:

(gdb) bt
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:526
#1  0x00000000004863b1 in fill_val (tupleDesc=0x18c7060, values=<value optimized out>, isnull=0x195e198, data=<value optimized out>, data_size=148670068, infomask=<value optimized out>, bit=0x7f07aef4207f "\017") at heaptuple.c:247
#2  heap_fill_tuple (tupleDesc=0x18c7060, values=<value optimized out>, isnull=0x195e198, data=<value optimized out>, data_size=148670068, infomask=<value optimized out>, bit=0x7f07aef4207f "\017") at heaptuple.c:336
#3  0x0000000000486921 in heap_form_tuple (tupleDescriptor=0x18c7060, values=0x7ffc691d58b0, isnull=0x195e198) at heaptuple.c:1090
#4  0x00000000004ddf78 in toast_build_flattened_tuple (tupleDesc=0x18c7060, values=<value optimized out>, isnull=0x195e198) at tuptoaster.c:1336
#5  0x00000000006441f0 in ExecEvalWholeRowVar (state=<value optimized out>, op=0x1958030, econtext=<value optimized out>) at execExprInterp.c:4019
#6  0x0000000000647602 in ExecInterpExpr (state=0x1957ea0, econtext=0x19576a0, isnull=0x7ffc691d8dff) at execExprInterp.c:519
#7  0x00000000006543ef in ExecEvalExprSwitchContext (node=<value optimized out>, accessMtd=0x67a140 <ValuesNext>, recheckMtd=0x679f00 <ValuesRecheck>) at ../../../src/include/executor/executor.h:307
#8  ExecProject (node=<value optimized out>, accessMtd=0x67a140 <ValuesNext>, recheckMtd=0x679f00 <ValuesRecheck>) at ../../../src/include/executor/executor.h:341
#9  ExecScan (node=<value optimized out>, accessMtd=0x67a140 <ValuesNext>, recheckMtd=0x679f00 <ValuesRecheck>) at execScan.c:199
#10 0x00000000006698dc in ExecProcNode (node=0x19573d8) at ../../../src/include/executor/executor.h:239
#11 MultiExecPrivateHash (node=0x19573d8) at nodeHash.c:164
#12 MultiExecHash (node=0x19573d8) at nodeHash.c:114
#13 0x000000000066b685 in ExecHashJoinImpl (pstate=0x1939d08) at nodeHashjoin.c:291
#14 ExecHashJoin (pstate=0x1939d08) at nodeHashjoin.c:572
#15 0x000000000064af7f in ExecProcNode (epqstate=<value optimized out>) at ../../../src/include/executor/executor.h:239
#16 EvalPlanQualNext (epqstate=<value optimized out>) at execMain.c:2692
#17 0x000000000064b502 in EvalPlanQual (estate=<value optimized out>, epqstate=0x18b5420, relation=<value optimized out>, rti=4, inputslot=0x195dfc0) at execMain.c:2475
#18 0x0000000000675f09 in ExecUpdate (mtstate=0x18b5328, tupleid=0x7ffc691d9130, oldtuple=0x0, slot=<value optimized out>, planSlot=0x18ca138, epqstate=0x18b5420, estate=0x18b4558, canSetTag=true) at nodeModifyTable.c:1389
#19 0x000000000067668d in ExecModifyTable (pstate=0x18b5328) at nodeModifyTable.c:2226
#20 0x000000000064d547 in ExecProcNode (queryDesc=0x18b4148, direction=<value optimized out>, count=0, execute_once=40) at ../../../src/include/executor/executor.h:239
#21 ExecutePlan (queryDesc=0x18b4148, direction=<value optimized out>, count=0, execute_once=40) at execMain.c:1648
#22 standard_ExecutorRun (queryDesc=0x18b4148, direction=<value optimized out>, count=0, execute_once=40) at execMain.c:365
#23 0x00000000007c32ab in ProcessQuery (plan=<value optimized out>, sourceText=0x17b7f28 "with tmp(id, avail_amount, deal_value, update_id) as (values (76107497731002368,0.00,16546.90746,76107519298113537),(76107514491441155,0.606,5808.03795,76107519298113537))\nupdate t_order set avail_amo"..., params=0x0, queryEnv=0x0, dest=<value optimized out>, completionTag=0x7ffc691d9430 "") at pquery.c:161
#24 0x00000000007c3613 in PortalRunMulti (portal=0x181dc48, isTopLevel=true, setHoldSnapshot=false, dest=0x7f07b7f9fce0, altdest=0x7f07b7f9fce0, completionTag=0x7ffc691d9430 "") at pquery.c:1283
#25 0x00000000007c3d80 in PortalRun (portal=0x181dc48, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x7f07b7f9fce0, altdest=0x7f07b7f9fce0, completionTag=0x7ffc691d9430 "") at pquery.c:796
#26 0x00000000007c0001 in exec_simple_query ( ...

As per the stack trace, we're trying to build a new tuple for the output
of a ValuesScan node, but the targetlist for that seems completely
insane: heap_form_tuple is being given a 15-column tupdesc that includes

(gdb) p tupleDescriptor->attrs[1]
$8 = {attrelid = 0, attname = {data = "side", '\000' <repeats 59 times>},
  atttypid = 35644, attstattarget = -1, attlen = 4, attnum = 2,
  attndims = 0, attcacheoff = 8, atttypmod = -1, attbyval = true,
  attstorage = 112 'p', attalign = 105 'i', attnotnull = false,
  atthasdef = false, atthasmissing = false, attidentity = 0 '\000',
  attgenerated = 0 '\000', attisdropped = false, attislocal = true,
  attinhcount = 0, attcollation = 0}

which seems to be the column that we have a bogus Datum for.  But what's
the code doing expecting that to be available from the ValuesScan?  It's
a column of t_order.  And the other data is all wrong too: that
composite type should surely not have attlen = 4 nor attbyval = true.

"explain verbose" claims that the Values node should be returning

  ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=184)
        Output: "*VALUES*".column2, "*VALUES*".column3, "*VALUES*".column4, "*VALUES*".*, "*VALUES*".column1

so it seems like somehow we've got wrong info in the EPQ context.

I wonder if this is related to the "Hash join explain is broken" thread.

			regards, tom lane
I wrote:
> yi huang <yi.codeplayer@gmail.com> writes:
>> [ crashes with ]
>> $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql

> As per the stack trace, we're trying to build a new tuple for the output
> of a ValuesScan node, but the targetlist for that seems completely insane:

After digging through this, it seems clear that it's the fault of
ad0bda5d2, specifically this code in EvalPlanQualSlot:

    *slot = ExecAllocTableSlot(&epqstate->estate->es_tupleTable,
                               epqstate->origslot->tts_tupleDescriptor,
                               &TTSOpsVirtual);

It's setting up an es_epqTupleSlot[] entry on the assumption that it
should have the same tupdesc as the output tuple that's to be rechecked.
This might accidentally fail to malfunction in single-table cases, but
it's quite broken for any join case --- in particular, for the given
test case, the scan tuple for the VALUES node surely doesn't have the
same tupdesc as the join result.

I think we need to change the API of EvalPlanQualSlot, because it cannot
determine the correct tupdesc when it's not handed a Relation.  I'm not
entirely sure that its immediate callers can either :-( so this might
propagate out a ways.  Or perhaps we should move the slot-making
someplace else.

I'll go make an open item for this.

			regards, tom lane
Hi,

On 2019-07-31 19:07:29 -0400, Tom Lane wrote:
> I wrote:
> > yi huang <yi.codeplayer@gmail.com> writes:
> >> [ crashes with ]
> >> $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql
>
> > As per the stack trace, we're trying to build a new tuple for the output
> > of a ValuesScan node, but the targetlist for that seems completely insane:
>
> After digging through this, it seems clear that it's the fault of
> ad0bda5d2, specifically this code in EvalPlanQualSlot:
>
>     *slot = ExecAllocTableSlot(&epqstate->estate->es_tupleTable,
>                                epqstate->origslot->tts_tupleDescriptor,
>                                &TTSOpsVirtual);
>
> It's setting up an es_epqTupleSlot[] entry on the assumption that it
> should have the same tupdesc as the output tuple that's to be rechecked.
> This might accidentally fail to malfunction in single-table cases,
> but it's quite broken for any join case --- in particular, for the
> given test case, the scan tuple for the VALUES node surely doesn't have
> the same tupdesc as the join result.

:( We really need to improve the test coverage for EPQ :(.  I tried to
add a few cases, but it's still one of the least tested complicated
areas in PG.

To make sure I understand - the problem isn't the slot that we've
created in nodeModifyTable.c via EvalPlanQualSlot(), right?  It's the
one we create in EvalPlanQualFetchRowMarks(), because we don't have a
proper tuple type handy to create the slot with?  And then
ExecScanFetch() (on behalf of ValuesScan in this case) expects something
with the correct format in that slot, which causes the backtrace you
show.  And this problem occurs "only" for ROW_MARK_COPY, because for
everything else we have an associated relation.

Previously we simply didn't need to know the type during EPQ setup,
because we only stored a HeapTuple anyway.  And we'd know the
appropriate tupledesc at the places that access the tuple.  Right?

> I think we need to change the API of EvalPlanQualSlot, because it
> cannot determine the correct tupdesc when it's not handed a Relation.
> I'm not entirely sure that its immediate callers can either :-( so
> this might propagate out a ways.  Or perhaps we should move the
> slot-making someplace else.

I think we'll probably have to split EvalPlanQualSlot() into two
functions.  One for nodeModifyTable.c and nodeLockRows.c, and the
!ROW_MARK_COPY cases in EvalPlanQualFetchRowMarks().

One bigger change - but possibly one worth it - would be to basically
move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
at least for ROW_MARK_COPY.  We don't need to do

    /* fetch the whole-row Var for the relation */
    datum = ExecGetJunkAttribute(epqstate->origslot,
                                 aerm->wholeAttNo,
                                 &isNull);
    /* non-locked rels could be on the inside of outer joins */
    if (isNull)
        continue;

that early in the query.  We could call EvalPlanQualFetchRowMark() (note
the singular) from ExecScanFetch(), that ought to currently be the only
place that accesses ROW_MARK_COPY type rowmarks?  We could do the same
for the other types of rowmarks, but I think that'd be semantically
somewhat murkier.

That would have the advantage that we'd not unnecessarily put row marks
we might never need into slots.  There's probably not that many cases
where that matters, though.

Or we could keep enough information for ROW_MARK_COPY RowMarks around to
create the necessary tupledesc (there's a number of variants as to how -
don't zap the information in add_rte_to_flat_rtable, have setrefs.c
somehow maintain rti indexed pointers to Plan nodes, ...).

Not sure yet what's best here.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-31 19:07:29 -0400, Tom Lane wrote:
>> It's setting up an es_epqTupleSlot[] entry on the assumption that it
>> should have the same tupdesc as the output tuple that's to be rechecked.
>> This might accidentally fail to malfunction in single-table cases,
>> but it's quite broken for any join case --- in particular, for the
>> given test case, the scan tuple for the VALUES node surely doesn't have
>> the same tupdesc as the join result.

> To make sure I understand - the problem isn't the slot that we've
> created in nodeModifyTable.c via EvalPlanQualSlot(), right?  It's the one
> we create in EvalPlanQualFetchRowMarks(), because we don't have a proper
> tuple type handy to create the slot with?

Yeah, I think nodeModifyTable.c is fine, because it always passes the
target relation.  EvalPlanQualFetchRowMarks is not fine, and I'm unsure
about the call in nodeLockRows.c.

> Previously we simply didn't need to know the type during EPQ setup,
> because we only stored a HeapTuple anyway.  And we'd know the appropriate
> tupledesc at the places that access the tuple.

Right.  So we gotta refactor that somehow.

> One bigger change - but possibly one worth it - would be to basically
> move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
> at least for ROW_MARK_COPY.

Hm.  Too tired to think that through right now.

			regards, tom lane
Hi,

On 2019-07-31 22:37:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > To make sure I understand - the problem isn't the slot that we've
> > created in nodeModifyTable.c via EvalPlanQualSlot(), right?  It's the one
> > we create in EvalPlanQualFetchRowMarks(), because we don't have a proper
> > tuple type handy to create the slot with?
>
> Yeah, I think nodeModifyTable.c is fine, because it always passes the
> target relation.  EvalPlanQualFetchRowMarks is not fine, and I'm unsure
> about the call in nodeLockRows.c.

I think nodeLockRows.c itself should be fine - the slots it gets with
EvalPlanQualSlot() are either used for the FDWs to store tuples in, or
nodeLockRows.c directly fetches tuples into them with
table_tuple_lock().  It itself is only dealing with locking rowmarks,
for which we obviously have a relation.

It does trigger EvalPlanQualFetchRowMarks() for the non-locking marks -
that part is obviously broken, but not in nodeLockRows.c itself.

> > Previously we simply didn't need to know the type during EPQ setup,
> > because we only stored a HeapTuple anyway.  And we'd know the appropriate
> > tupledesc at the places that access the tuple.
>
> Right.  So we gotta refactor that somehow.

One way to avoid doing so would be to just not deal with slots in the
case of ROW_MARK_COPY.  The reason why EPQ was made to use slots instead
of HeapTuples is that not doing so would require expensive conversions
for AMs that don't use HeapTuple.  But for COPY rowmarks that's not an
issue, because the source already is a HeapTuple (in the form of a
Datum).

That's imo not a particularly pretty approach, but might be the most
viable approach for v12.  Attached is a quick WIP implementation, which
indeed fixes the crash reported here, and passes all our tests.
Obviously it'd need some polish if we went for that.

I think it's pretty darn ugly that we don't know what kind of tuples
we're dealing with at the top-level of EPQ.  But changing that seems far
from trivial.  The way things are set up we don't have a good way to map
from rtis to the associated PlanStates, and currently only the
PlanStates have enough information to infer the slot type.  In fact, if
we knew the ScanState nodes for specific rtis, we could perhaps even
just use each node's ScanState->ss_ScanTupleSlot - seems fragile, but
we've effectively done so before v12.

And we can't easily add a way to map from rti to PlanState, without
adding work to ExecInit* for every Scan node.  If ExecInitNode() knew
whether the relevant nodes were Scan nodes, we could maintain the
mapping at that stage, but we afaict don't know that.  I briefly looked
at hacking something like ExecInitScanTupleSlot() to maintain an EState
rti->ScanState mapping, but we have ScanState executor nodes for planner
nodes that do *not* inherit from Scan (e.g. Material).  While that's
ugly, it's not something we seem likely to change all that soon - and
using ExecInitScanTupleSlot() as a proxy is just a hack anyway.  It sure
would be nice if IsA() actually worked when inheriting from a node...

We could easily maintain a mapping from Plan->plan_node_id to the
corresponding PlanState from within ExecInitNode().  But I don't think
that'd help that much, as we'd need an rti -> plan_node_id mapping,
which'd again not be trivial to maintain.  We could do so in
set_plan_refs(), as that knows both the relevant plan_node_id, and the
adjusted scanrelid.

I'm not exactly sure what the intended uses of plan_node_id are - it was
introduced in d1b7c1ffe72e86932b5395f29e006c3f503bc53d and is,
exclusively?, used as a unique key into dynamic shared memory for
parallel query (so each node can cheaply build a key for its own data).
So perhaps it'd be a gross misuse of that system.  But it doesn't look
like that to me.

Not sure if there's other things that could benefit from the rti ->
ScanState or plan_node_id -> PlanState mapping.  It'd be nice for
debugging, I guess, but that doesn't quite seem enough.

> > One bigger change - but possibly one worth it - would be to basically
> > move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
> > at least for ROW_MARK_COPY.
>
> Hm.  Too tired to think that through right now.

I briefly tried to implement that.  The problem is that, as things are
currently set up, we don't have access to the corresponding epqstate
from within executor nodes.  That prevents us from accessing
EPQState->{origslot, arowMarks}, which we need to actually be able to
fetch the rows to return.

We could solve that by referencing the EPQState from its EState (but not
from the "main" estate).  I've wondered before whether that'd be better
than having various es_epq* fields in EState.  I don't think it'd make
the modularity any worse, given that we're already referencing EPQState
fields from within EState.  If we moved es_epqTupleSlot into EPQState
and allocated it from within EvalPlanQualInit(), instead of
EvalPlanQualBegin(), we'd address your complaint that we now call that
earlier too.

Doesn't sound that hard.  Seems like a big change for v12 though.

As far as I can tell, the only behavioural change of fetching rows later
instead of in EvalPlanQualFetchRowMarks would be that we'd fetch
(locally or via FDW) rows that are referenced, but not locked, at the
same time we lock rows, and that we would not fetch them if any of the
locked rows couldn't be locked.  Which seems like a mild improvement.

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-31 22:37:40 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> Previously we simply didn't need to know the type during EPQ setup,
>>> because we only stored a HeapTuple anyway.  And we'd know the appropriate
>>> tupledesc at the places that access the tuple.

>> Right.  So we gotta refactor that somehow.

> One way to avoid doing so would be to just not deal with slots in the
> case of ROW_MARK_COPY.  The reason why EPQ was made to use slots
> instead of HeapTuples is that not doing so would require expensive
> conversions for AMs that don't use HeapTuple.  But for COPY rowmarks
> that's not an issue, because the source already is a HeapTuple (in the
> form of a Datum).

> That's imo not a particularly pretty approach, but might be the most
> viable approach for v12.  Attached is a quick WIP implementation, which
> indeed fixes the crash reported here, and passes all our
> tests.  Obviously it'd need some polish if we went for that.

I agree that that's mighty ugly.  But it does have the advantage of
being not very invasive, so at this late date maybe it's what we should
do for v12.

>>> One bigger change - but possibly one worth it - would be to basically
>>> move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
>>> at least for ROW_MARK_COPY.

>> Hm.  Too tired to think that through right now.

> I briefly tried to implement that.  The problem is that, as things are
> currently set up, we don't have access to the corresponding epqstate
> from within executor nodes.  That prevents us from accessing
> EPQState->{origslot, arowMarks}, which we need to actually be able to
> fetch the rows to return.

> We could solve that by referencing the EPQState from its EState (but not
> from the "main" estate).  I've wondered before whether that'd be better
> than having various es_epq* fields in EState.  I don't think it'd make
> the modularity any worse, given that we're already referencing EPQState
> fields from within EState.  If we moved es_epqTupleSlot into EPQState
> and allocated it from within EvalPlanQualInit(), instead of
> EvalPlanQualBegin(), we'd address your complaint that we now call that
> earlier too.

> Doesn't sound that hard.  Seems like a big change for v12 though.

I feel that this seems like a promising solution for the longer term.
I agree that keeping a pointer to the whole EPQState in EState is no
worse than having pointers to some of its fields there.

Perhaps the next step is to draft a patch for this and see just how big
it is compared to the minimal fix.  I'd rather not have v12 be different
from later branches in this area if the change doesn't seem entirely
unreasonable for late beta.

			regards, tom lane
Hi,

On 2019-08-24 12:13:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-07-31 22:37:40 -0400, Tom Lane wrote:
> > I briefly tried to implement that.  The problem is that, as things are
> > currently set up, we don't have access to the corresponding epqstate
> > from within executor nodes.  That prevents us from accessing
> > EPQState->{origslot, arowMarks}, which we need to actually be able to
> > fetch the rows to return.
> > We could solve that by referencing the EPQState from its EState (but not
> > from the "main" estate).  I've wondered before whether that'd be better
> > than having various es_epq* fields in EState.  I don't think it'd make
> > the modularity any worse, given that we're already referencing EPQState
> > fields from within EState.  If we moved es_epqTupleSlot into EPQState
> > and allocated it from within EvalPlanQualInit(), instead of
> > EvalPlanQualBegin(), we'd address your complaint that we now call that
> > earlier too.
> > Doesn't sound that hard.  Seems like a big change for v12 though.
>
> I feel that this seems like a promising solution for the longer term.
> I agree that keeping a pointer to the whole EPQState in EState is no
> worse than having pointers to some of its fields there.
>
> Perhaps the next step is to draft a patch for this and see just how
> big it is compared to the minimal fix.

Working on that.

I just had a slightly crazy idea: What if we just blessed wholerow
rowmark types?  Then we'd pretty trivially have access to the relevant
types?  Or alternatively even just add type information to the
PlanRowMark for ROW_MARK_COPY instead of reconstructing it at runtime?

- Andres
Andres Freund <andres@anarazel.de> writes:
> I just had a slightly crazy idea: What if we just blessed wholerow
> rowmark types?  Then we'd pretty trivially have access to the relevant
> types?

What would be trivial about that?  You'd still need access to
RTE-specific information.  I'm a little worried about the extra load on
the typcache, too.

> Or alternatively even just add type information to the
> PlanRowMark for ROW_MARK_COPY instead of reconstructing it at runtime?

Maybe, but I'm not sure that leads to a less invasive solution.

			regards, tom lane
Hi,

On 2019-08-24 15:54:01 -0700, Andres Freund wrote:
> Working on that.

Think it's going to look not too invasive.  Needs a good bit more
cleanup, but I'm getting there.

One thing I'm struggling with is understanding and testing nested EPQ
states:

    /*
     * Each EState must have its own es_epqScanDone state, but if we have
     * nested EPQ checks they should share es_epqTupleSlot arrays.  This
     * allows sub-rechecks to inherit the values being examined by an outer
     * recheck.
     */

At the moment I don't understand what the point here is.  When/why do we
want to inherit values from the parent epqstate?  I know how to
construct queries with nested EPQ, but I don't see why we'd ever want to
inherit the values from the outer check?  Nor how we're doing so -
EvalPlanQualFetchRowMarks() unconditionally clears out the old tuple,
and nodeLockRows.c unconditionally fetches the source tuples too?

Greetings,

Andres Freund
Hi,

On 2019-08-25 22:36:52 -0700, Andres Freund wrote:
> On 2019-08-24 15:54:01 -0700, Andres Freund wrote:
> > Working on that.
>
> Think it's going to look not too invasive.  Needs a good bit more
> cleanup, but I'm getting there.
>
> One thing I'm struggling with is understanding and testing nested EPQ
> states:
>
>     /*
>      * Each EState must have its own es_epqScanDone state, but if we have
>      * nested EPQ checks they should share es_epqTupleSlot arrays.  This
>      * allows sub-rechecks to inherit the values being examined by an outer
>      * recheck.
>      */
>
> At the moment I don't understand what the point here is.  When/why do we
> want to inherit values from the parent epqstate?  I know how to construct
> queries with nested EPQ, but I don't see why we'd ever want to inherit
> the values from the outer check?  Nor how we're doing so -
> EvalPlanQualFetchRowMarks() unconditionally clears out the old tuple,
> and nodeLockRows.c unconditionally fetches the source tuples too?

As far as I can tell it's not worth preserving this bit - it doesn't
seem to be working in a way one might infer from the comment, and the
current behaviour is pretty weird: In a nested EPQState, we'll overwrite
the tuples accessed by an outer layer of EPQ.  I'm not sure, but I could
even see that causing crash type problems, e.g. if a column from the
slot fetched by the outer EPQ is passed to the inner EPQ, which then
stores a new tuple in that slot, potentially leaving the argument a
dangling pointer into that tuple.

Second, and somewhat related, question: Why are we passing the
surrounding EState to both EvalPlanQualInit() and to EvalPlanQual()?
Seems to not make much sense anymore?  As I needed to have access to the
parent EState from the EPQState, to be able to allocate slots for
EvalPlanQualSlot() before EvalPlanQualBegin() has been called, that
seems even less necessary.

I've verified that the attached fixes the crash reported upthread.  All
tests pass.  I think it needs a bit of cleanup, and a lot more test
coverage.  On the latter point, it seems that we are missing coverage
around:
1) most of ROW_MARK_COPY
2) nested EPQ
3) index only scans as an EPQ source
4) FDWs
I'll not be able to add complete coverage, but clearly we need more than
now.

Tom, everyone, what do you think of the attached approach?  It
implements what I described upthread.

The problem with fixing this bug is that we currently don't know the row
types for ExecAuxRowMarks at the site that is calling EvalPlanQual, and
that that is not easy to fix.  But we don't really need the type that
early, as there is no real need to fetch row marks that early.  The
difficulty with moving it to ExecScanFetch is that at that stage we
currently don't have access to the information about row marks from the
EPQState.

To fix that, this patch moves most of the EPQ related information out of
the EState into the EPQState, and sets es_active_epq to the EPQState for
EStates that are created by EvalPlanQualStart().  That then allows us to
have ExecScanFetch() fetch the rowmark on demand with the new
EvalPlanQualFetchRowMark (which basically is just the old
EvalPlanQualFetchRowMarks without a loop).  For EvalPlanQualFetchRowMark
to work efficiently, I added an rti - 1 indexed
EPQState->substitute_rowmark.

This also fixes Tom's complaint elsewhere that nodeLockRows.c does
EvalPlanQualBegin() earlier, by allowing EvalPlanQualSlot() to be called
before that.  That seems considerably nicer to me.

Questions:

- Is there any potential downside to not fetching non-locking rowmarks
  eagerly?  It's observable for FDWs, but it shouldn't matter?  As far
  as I can tell the new behaviour should always be at least as efficient
  as before?

- Other names for EPQState->substitute_*?

- Should the RTI indexed version of arowmarks be computed as an array
  directly in nodeModifyTable.c etc?  Right now we build it first as a
  list, and then convert it to an array.  We could also just use an rti
  indexed List, and store NULLs where there are no RTIs?  But there's no
  good API for that.

- We currently don't - and haven't in recent history - free EPQ test
  tuples until EPQ is needed the next time.  Perhaps we should?  But
  that'd probably not be a candidate for backpatching.

Todo:
- tests
- remove isolationtester rescheduling

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
>> One thing I'm struggling with is understanding and testing nested EPQ
>> states:

>> * Each EState must have its own es_epqScanDone state, but if we have
>> * nested EPQ checks they should share es_epqTupleSlot arrays.  This
>> * allows sub-rechecks to inherit the values being examined by an outer
>> * recheck.

>> At the moment I don't understand what the point here is.  When/why do we
>> want to inherit values from the parent epqstate?  I know how to construct
>> queries with nested EPQ, but I don't see why we'd ever want to inherit
>> the values from the outer check?  Nor how we're doing so -
>> EvalPlanQualFetchRowMarks() unconditionally clears out the old tuple,
>> and nodeLockRows.c unconditionally fetches the source tuples too?

It's been a really long time, but I think that the idea might have been
to avoid useless repeat tuple fetches in nested-EPQ cases.  However,
those are such extreme corner cases that I agree it's not worth
contorting the logic to make them a shade faster (if indeed it works at
all at present, which it might not).

> This also fixes Tom's complaint elsewhere that nodeLockRows.c does
> EvalPlanQualBegin() earlier, by allowing EvalPlanQualSlot() to be called
> before that.  That seems considerably nicer to me.

Yay.

> Tom, everyone, what do you think of the attached approach?  It
> implements what I described upthread.

Seems reasonable to store a pointer in the EPQState rather than pass
that as a separate argument, but I think you need a bit more
documentation work to keep the two EState pointers from being impossibly
confusing.  It might help to write something like "estate is an executor
state node instantiated for the same rangetable as parentestate (though
it might be running only a sub-tree of the main plan).  While
parentestate represents the "real" plan execution, estate runs EPQ
rechecks in which the table scan nodes are changed to return only the
current tuple of interest for each table.  Those current tuples are
found using the rowmark mechanisms."

Maybe we should change the "estate" field name to something else, too
--- "childestate" is the first thing that comes to mind but it's rather
generic.  "recheckestate", perhaps?

Also I dislike the field name "es_active_epq", as what that suggests to
me is exactly backwards from the way you apparently are using it.  Maybe
"es_parent_epq" instead?  The comment for it is pretty opaque as well,
not to mention that it misspells EState.

I don't agree with the FIXME comment in execnodes.h.  The rowmarks state
is used for things that are outside of EPQ, particularly the actual row
locking ;-), so I wouldn't want to move that even if it didn't create
notational issues.

I'm confused by the removal of the EvalPlanQualBegin call at
nodeModifyTable.c:1373 ... why is that OK?

The original intent of EvalPlanQualInit was to do an absolutely minimal
amount of work, since we don't know yet whether EPQ will ever be needed
(and, generally, the way to bet is that it won't be).  You've added some
pallocs and list-to-array conversion, which while not terribly expensive
still seem like they don't belong there.  Can't we move most of that to
EvalPlanQualBegin?  I see that we want the substitute_slot array to
exist so that EvalPlanQualSlot can work without EvalPlanQualBegin, but
if we're postponing non-locking-rowmark work then we don't need the
rest.

You seem to have removed EvalPlanQualFetchRowMarks, but the extern for
it is still there?

The bool result of EvalPlanQualFetchRowMark needs to be documented.

In ExecIndexOnlyMarkPos and ExecIndexOnlyRestrPos, you seem to have
incorrectly changed the function names given in comments and elog
messages (copy&pasto?)

> Questions:
> - Is there any potential downside to not fetching non-locking rowmarks
>   eagerly?  It's observable for FDWs, but it shouldn't matter?  As far as
>   I can tell the new behaviour should always be at least as efficient as
>   before?

It's possible that in some cases we'd not re-fetch a tuple that the
existing code would re-fetch, but I have a hard time seeing how that's
bad.

> - Should the RTI indexed version of arowmarks be computed as an array
>   directly in nodeModifyTable.c etc?  Right now we build it first as a
>   list, and then convert it to an array.  We could also just use an rti
>   indexed List, and store NULLs where there are no RTIs?  But there's no
>   good API for that.

Do that as a follow-on patch if at all.  I'm not sure it's worth
changing.

> - We currently don't - and haven't in recent history - free EPQ test
>   tuples until EPQ is needed the next time.  Perhaps we should?

Can't get excited about it.

			regards, tom lane
On 2019-08-27 13:06:37 -0400, Tom Lane wrote:
> > Tom, everyone, what do you think of the attached approach? It
> > implements what I described upthread.
>
> Seems reasonable to store a pointer in the EPQState rather than pass
> that as a separate argument, but I think you need a bit more
> documentation work to keep the two EState pointers from being
> impossibly confusing.

Agreed.

> Also I dislike the field name "es_active_epq", as what that suggests
> to me is exactly backwards from the way you apparently are using it.
> Maybe "es_parent_epq" instead?  The comment for it is pretty opaque
> as well, not to mention that it misspells EState.

I think what I was trying to signal was that EPQ is currently active
from the POV of executor nodes.  Ought to have been es_epq_active, for
that, I guess.  For me "if (estate->es_epq_active)" explains the
meaning of the check more than "if (estate->es_parent_epq)".

> I don't agree with the FIXME comment in execnodes.h.  The rowmarks
> state is used for things that are outside of EPQ, particularly the
> actual row locking ;-), so I wouldn't want to move that even if it
> didn't create notational issues.

Yea, I wasn't sure about that.

> I'm confused by the removal of the EvalPlanQualBegin call at
> nodeModifyTable.c:1373 ... why is that OK?

EvalPlanQual() calls EvalPlanQualBegin(); the earlier call was just
needed to initialize enough state to be able to get a slot.

> The original intent of EvalPlanQualInit was to do an absolutely
> minimal amount of work, since we don't know yet whether EPQ will
> ever be needed (and, generally, the way to bet is that it won't be).
> You've added some pallocs and list-to-array conversion, which
> while not terribly expensive still seem like they don't belong
> there.  Can't we move most of that to EvalPlanQualBegin?  I see that
> we want the substitute_slot array to exist so that EvalPlanQualSlot
> can work without EvalPlanQualBegin, but if we're postponing
> non-locking-rowmark work then we don't need the rest.

Yea, we can move the other fields without much trouble.

> > - Should the RTI indexed version of arowmarks be computed as an
> >   array directly in nodeModifyTable.c etc? Right now we build it
> >   first as a list, and then convert it as an array? We could also
> >   just use an rti indexed List, and store NULLs where there are no
> >   RTIs? But there's no good API for that.
>
> Do that as a follow-on patch if at all.  I'm not sure it's worth
> changing.

Well, it'd at least partially address your concern above that we ought
to do less work during EvalPlanQualInit - if we had it in array form
we'd just need to set a variable, rather than do the transformation.

Greetings,

Andres Freund
Hi,

On 2019-08-27 13:06:37 -0400, Tom Lane wrote:
> Seems reasonable to store a pointer in the EPQState rather than pass
> that as a separate argument, but I think you need a bit more
> documentation work to keep the two EState pointers from being
> impossibly confusing.  It might help to write something like "estate
> is an executor state node instantiated for the same rangetable as
> parentestate (though it might be running only a sub-tree of the main
> plan).  While parentestate represents the "real" plan execution,
> estate runs EPQ rechecks in which the table scan nodes are changed to
> return only the current tuple of interest for each table.  Those
> current tuples are found using the rowmark mechanisms."

I tried that for a while, but was unsatisfied with putting it in the
EState(s) - felt like too general a comment.  Moved it to the top of
the struct instead.  This basically ended up reordering all of struct
EState, so it'd be good to have a second look.

> Maybe we should change the "estate" field name to something else, too
> --- "childestate" is the first thing that comes to mind but it's
> rather generic.  "recheckestate", perhaps?

I went with recheckestate, and then also renamed the PlanState to
follow the recheck* pattern, to make clearer that it's not the main
plan's PlanState (sub-)tree.

> Also I dislike the field name "es_active_epq", as what that suggests
> to me is exactly backwards from the way you apparently are using it.
> Maybe "es_parent_epq" instead?

I went with es_epq_active, as I suggested in my earlier email, which
still seems accurate to me.

I really want to move consideration of es_epq_active to ExecInitNode()
time, rather than execution time.  If we add an execScan helper, we
can have it set the corresponding executor node's ExecProcNode to
a) a function that performs qual checks and projection
b) a function that performs projection
c) the fetch method from the scan node
d) basically the current ExecScan, when es_epq_active

> The original intent of EvalPlanQualInit was to do an absolutely
> minimal amount of work, since we don't know yet whether EPQ will
> ever be needed (and, generally, the way to bet is that it won't be).
> You've added some pallocs and list-to-array conversion, which
> while not terribly expensive still seem like they don't belong
> there.  Can't we move most of that to EvalPlanQualBegin?  I see that
> we want the substitute_slot array to exist so that EvalPlanQualSlot
> can work without EvalPlanQualBegin, but if we're postponing
> non-locking-rowmark work then we don't need the rest.

Did that.

> In ExecIndexOnlyMarkPos and ExecIndexOnlyRestrPos, you seem to have
> incorrectly changed the function names given in comments and elog
> messages (copy&pasto?)

Fixed.

Did a good bit more comment polishing, renamed a few more variables.

I also added tests for things that I thought were clearly missing
(including a test that errors out before the code changes in the
patch).

For one of the new tests I added a noisy_oper(comment, a, oper, b)
function, to see which EPQ tests are performed with which values.
Without that I found it to be really hard to be confident that we
actually are using the right column values.  I think it might make
sense to expand its use a bit more - but that seems like something for
later.

I tried for a while to develop one for mark/restore of IndexOnlyScans,
but I concluded that that code is basically dead right now.  Every
scan node of a normal relation that gets modified or needs a rowmark
implies having ctid as part of the targetlist.  And we neither allow
ctid to be part of index definitions, nor understand that we actually
kinda know the ctid from within the index scan (HOT would make using
the tid hard).  So the relevant code in nodeIndexOnly.c seems dead?

Greetings,

Andres Freund
Hi,

On 2019-09-05 12:59:11 -0700, Andres Freund wrote:
> @@ -562,6 +831,7 @@ LockRows
>            Merge Cond: (a.id = b.id)
>            ->  Index Scan using jointest_id_idx on jointest a
>            ->  Index Scan using jointest_id_idx on jointest b
> +s1: WARNING:  it's me, brother ExecIndexMarkPos()
>  id             data           id             data
>
>  1              0              1              0

Err, I forgot to amend the commit with the stashed removal of
debugging cruft in the output file...

Greetings,

Andres Freund
Hi,

On 2019-09-05 12:59:11 -0700, Andres Freund wrote:
> Did a good bit more comment polishing, renamed a few more variables.

Pushed now, after some more polishing.

> I also added tests for things that I thought were clearly missing
> (including a test that errors out before the code changes in the
> patch).

For 12, I had to replace the NOTICE with WARNING (including SET
client_min_messages).  I wonder if we ought to backpatch

commit ebd49928215e3854d91167e798949a75b34958d0
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2019-07-27 15:59:57 -0400

    Don't drop NOTICE messages in isolation tests.

to avoid backpatching pain?

> I tried for a while to develop one for mark/restore of
> IndexOnlyScans, but I concluded that that code is basically dead
> right now.  Every scan node of a normal relation that gets modified
> or needs a rowmark implies having ctid as part of the targetlist.
> And we neither allow ctid to be part of index definitions, nor
> understand that we actually kinda know the ctid from within the index
> scan (HOT would make using the tid hard).  So the relevant code in
> nodeIndexOnly.c seems dead?

I wonder if, on master, we should make ExecIndexOnlyMarkPos() and
ExecIndexOnlyRestrPos() ERROR out in case they're hit for an EPQ
relation, given that they ought to be unreachable.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-09-05 12:59:11 -0700, Andres Freund wrote:
>> I tried for a while to develop one for mark/restore of
>> IndexOnlyScans, but I concluded that that code is basically dead
>> right now.  Every scan node of a normal relation that gets modified
>> or needs a rowmark implies having ctid as part of the targetlist.
>> And we neither allow ctid to be part of index definitions, nor
>> understand that we actually kinda know the ctid from within the
>> index scan (HOT would make using the tid hard).  So the relevant
>> code in nodeIndexOnly.c seems dead?

> I wonder if, on master, we should make ExecIndexOnlyMarkPos(),
> ExecIndexOnlyRestrPos() ERROR out in case they're hit for an EPQ
> relation, given that they ought to be unreachable.

I'd vote against.  The chain of reasoning that says they're unreachable
is long and involves mostly code that's nowhere near there, so when/if
somebody made a change that broke that reasoning, they'd not
necessarily notice that the ERROR has to be undone.

			regards, tom lane