Thread: PostgreSQL 12 crash bug report
1. Run the docker version of PostgreSQL 12:
   $ docker run --rm -p 5432:5432 -v /tmp:/data -e POSTGRES_USER=exchange -e POSTGRES_DB=exchange -e POSTGRES_PASSWORD=123456 postgres:12-alpine
2. Download the attached files to /tmp
3. Import the schema and data:
   $ docker exec -it $CONTAINER psql -U exchange exchange -f /data/t_order_schema.sql
   $ docker exec -it $CONTAINER psql -U exchange exchange -f /data/t_order_data.sql
4. Run pgbench with the provided test script:
   $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql
Attachment
yi huang <yi.codeplayer@gmail.com> writes:
> [ crashes with ]
> $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql

Thanks for the report!

Note that this doesn't fail if you just run test.sql by itself, but it
does soon fall over as described when running multiple copies in
parallel.  Unsurprisingly given that, the crash is inside EvalPlanQual:

(gdb) bt
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:526
#1  0x00000000004863b1 in fill_val (tupleDesc=0x18c7060, values=<value optimized out>, isnull=0x195e198, data=<value optimized out>, data_size=148670068, infomask=<value optimized out>, bit=0x7f07aef4207f "\017") at heaptuple.c:247
#2  heap_fill_tuple (tupleDesc=0x18c7060, values=<value optimized out>, isnull=0x195e198, data=<value optimized out>, data_size=148670068, infomask=<value optimized out>, bit=0x7f07aef4207f "\017") at heaptuple.c:336
#3  0x0000000000486921 in heap_form_tuple (tupleDescriptor=0x18c7060, values=0x7ffc691d58b0, isnull=0x195e198) at heaptuple.c:1090
#4  0x00000000004ddf78 in toast_build_flattened_tuple (tupleDesc=0x18c7060, values=<value optimized out>, isnull=0x195e198) at tuptoaster.c:1336
#5  0x00000000006441f0 in ExecEvalWholeRowVar (state=<value optimized out>, op=0x1958030, econtext=<value optimized out>) at execExprInterp.c:4019
#6  0x0000000000647602 in ExecInterpExpr (state=0x1957ea0, econtext=0x19576a0, isnull=0x7ffc691d8dff) at execExprInterp.c:519
#7  0x00000000006543ef in ExecEvalExprSwitchContext (node=<value optimized out>, accessMtd=0x67a140 <ValuesNext>, recheckMtd=0x679f00 <ValuesRecheck>) at ../../../src/include/executor/executor.h:307
#8  ExecProject (node=<value optimized out>, accessMtd=0x67a140 <ValuesNext>, recheckMtd=0x679f00 <ValuesRecheck>) at ../../../src/include/executor/executor.h:341
#9  ExecScan (node=<value optimized out>, accessMtd=0x67a140 <ValuesNext>, recheckMtd=0x679f00 <ValuesRecheck>) at execScan.c:199
#10 0x00000000006698dc in ExecProcNode (node=0x19573d8) at ../../../src/include/executor/executor.h:239
#11 MultiExecPrivateHash (node=0x19573d8) at nodeHash.c:164
#12 MultiExecHash (node=0x19573d8) at nodeHash.c:114
#13 0x000000000066b685 in ExecHashJoinImpl (pstate=0x1939d08) at nodeHashjoin.c:291
#14 ExecHashJoin (pstate=0x1939d08) at nodeHashjoin.c:572
#15 0x000000000064af7f in ExecProcNode (epqstate=<value optimized out>) at ../../../src/include/executor/executor.h:239
#16 EvalPlanQualNext (epqstate=<value optimized out>) at execMain.c:2692
#17 0x000000000064b502 in EvalPlanQual (estate=<value optimized out>, epqstate=0x18b5420, relation=<value optimized out>, rti=4, inputslot=0x195dfc0) at execMain.c:2475
#18 0x0000000000675f09 in ExecUpdate (mtstate=0x18b5328, tupleid=0x7ffc691d9130, oldtuple=0x0, slot=<value optimized out>, planSlot=0x18ca138, epqstate=0x18b5420, estate=0x18b4558, canSetTag=true) at nodeModifyTable.c:1389
#19 0x000000000067668d in ExecModifyTable (pstate=0x18b5328) at nodeModifyTable.c:2226
#20 0x000000000064d547 in ExecProcNode (queryDesc=0x18b4148, direction=<value optimized out>, count=0, execute_once=40) at ../../../src/include/executor/executor.h:239
#21 ExecutePlan (queryDesc=0x18b4148, direction=<value optimized out>, count=0, execute_once=40) at execMain.c:1648
#22 standard_ExecutorRun (queryDesc=0x18b4148, direction=<value optimized out>, count=0, execute_once=40) at execMain.c:365
#23 0x00000000007c32ab in ProcessQuery (plan=<value optimized out>, sourceText=0x17b7f28 "with tmp(id, avail_amount, deal_value, update_id) as (values (76107497731002368,0.00,16546.90746,76107519298113537),(76107514491441155,0.606,5808.03795,76107519298113537))\nupdate t_order set avail_amo"..., params=0x0, queryEnv=0x0, dest=<value optimized out>, completionTag=0x7ffc691d9430 "") at pquery.c:161
#24 0x00000000007c3613 in PortalRunMulti (portal=0x181dc48, isTopLevel=true, setHoldSnapshot=false, dest=0x7f07b7f9fce0, altdest=0x7f07b7f9fce0, completionTag=0x7ffc691d9430 "") at pquery.c:1283
#25 0x00000000007c3d80 in PortalRun (portal=0x181dc48, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x7f07b7f9fce0, altdest=0x7f07b7f9fce0, completionTag=0x7ffc691d9430 "") at pquery.c:796
#26 0x00000000007c0001 in exec_simple_query ( ...

As per the stack trace, we're trying to build a new tuple for the output
of a ValuesScan node, but the targetlist for that seems completely
insane: heap_form_tuple is being given a 15-column tupdesc that includes

(gdb) p tupleDescriptor->attrs[1]
$8 = {attrelid = 0, attname = {data = "side", '\000' <repeats 59 times>},
  atttypid = 35644, attstattarget = -1, attlen = 4, attnum = 2,
  attndims = 0, attcacheoff = 8, atttypmod = -1, attbyval = true,
  attstorage = 112 'p', attalign = 105 'i', attnotnull = false,
  atthasdef = false, atthasmissing = false, attidentity = 0 '\000',
  attgenerated = 0 '\000', attisdropped = false, attislocal = true,
  attinhcount = 0, attcollation = 0}

which seems to be the column that we have a bogus Datum for.  But what's
the code doing expecting that to be available from the ValuesScan?  It's
a column of t_order.  And the other data is all wrong too: that
composite type should surely not have attlen = 4 nor attbyval = true.

"explain verbose" claims that the Values node should be returning

  ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=184)
        Output: "*VALUES*".column2, "*VALUES*".column3, "*VALUES*".column4, "*VALUES*".*, "*VALUES*".column1

so it seems like somehow we've got wrong info in the EPQ context.

I wonder if this is related to the "Hash join explain is broken" thread.

			regards, tom lane
I wrote:
> yi huang <yi.codeplayer@gmail.com> writes:
>> [ crashes with ]
>> $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql

> As per the stack trace, we're trying to build a new tuple for the output
> of a ValuesScan node, but the targetlist for that seems completely insane:

After digging through this, it seems clear that it's the fault of
ad0bda5d2, specifically this code in EvalPlanQualSlot:

    *slot = ExecAllocTableSlot(&epqstate->estate->es_tupleTable,
                               epqstate->origslot->tts_tupleDescriptor,
                               &TTSOpsVirtual);

It's setting up an es_epqTupleSlot[] entry on the assumption that it
should have the same tupdesc as the output tuple that's to be rechecked.
This might accidentally fail to malfunction in single-table cases, but
it's quite broken for any join case --- in particular, for the given
test case, the scan tuple for the VALUES node surely doesn't have the
same tupdesc as the join result.

I think we need to change the API of EvalPlanQualSlot, because it cannot
determine the correct tupdesc when it's not handed a Relation.  I'm not
entirely sure that its immediate callers can either :-( so this might
propagate out a ways.  Or perhaps we should move the slot-making
someplace else.

I'll go make an open item for this.

			regards, tom lane
Hi,

On 2019-07-31 19:07:29 -0400, Tom Lane wrote:
> I wrote:
> > yi huang <yi.codeplayer@gmail.com> writes:
> >> [ crashes with ]
> >> $ pgbench -h 127.0.0.1 -p 5432 -U exchange exchange -T 2 -c 4 -f /tmp/test.sql
>
> > As per the stack trace, we're trying to build a new tuple for the output
> > of a ValuesScan node, but the targetlist for that seems completely insane:
>
> After digging through this, it seems clear that it's the fault of
> ad0bda5d2, specifically this code in EvalPlanQualSlot:
>
>     *slot = ExecAllocTableSlot(&epqstate->estate->es_tupleTable,
>                                epqstate->origslot->tts_tupleDescriptor,
>                                &TTSOpsVirtual);
>
> It's setting up an es_epqTupleSlot[] entry on the assumption that it
> should have the same tupdesc as the output tuple that's to be rechecked.
> This might accidentally fail to malfunction in single-table cases,
> but it's quite broken for any join case --- in particular, for the
> given test case, the scan tuple for the VALUES node surely doesn't have
> the same tupdesc as the join result.

:( We really need to improve the test coverage for EPQ :(.  I tried to
add a few cases, but it's still one of the least tested complicated
areas in PG.

To make sure I understand - the problem isn't the slot that we've
created in nodeModifyTable.c via EvalPlanQualSlot(), right?  It's the
one we create in EvalPlanQualFetchRowMarks(), because we don't have a
proper tuple type handy to create the slot with?  And then
ExecScanFetch() (on behalf of ValuesScan in this case) expects something
with the correct format in that slot, which causes the backtrace you
show.  And this problem occurs "only" for ROW_MARK_COPY, because for
everything else we have an associated relation.

Previously we simply didn't need to know the type during EPQ setup,
because we only stored a HeapTuple anyway.  And we'd know the
appropriate tupledesc at the places that access the tuple.  Right?

> I think we need to change the API of EvalPlanQualSlot, because it
> cannot determine the correct tupdesc when it's not handed a Relation.
> I'm not entirely sure that its immediate callers can either :-( so
> this might propagate out a ways.  Or perhaps we should move the
> slot-making someplace else.

I think we'll probably have to split EvalPlanQualSlot() into two
functions.  One for nodeModifyTable.c and nodeLockRows.c, and the
!ROW_MARK_COPY cases in EvalPlanQualFetchRowMarks().

One bigger change - but possibly one worth it - would be to basically
move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
at least for ROW_MARK_COPY.  We don't need to do

    /* fetch the whole-row Var for the relation */
    datum = ExecGetJunkAttribute(epqstate->origslot,
                                 aerm->wholeAttNo,
                                 &isNull);
    /* non-locked rels could be on the inside of outer joins */
    if (isNull)
        continue;

that early in the query.  We could call EvalPlanQualFetchRowMark() (note
the singular) from ExecScanFetch(), that ought to currently be the only
place that accesses ROW_MARK_COPY type rowmarks?  We could do the same
for the other types of rowmarks, but I think that'd be semantically
somewhat murkier.

That would have the advantage that we'd not unnecessarily put row marks
we might never need into slots.  There's probably not that many cases
where that matters, though.

Or we could keep enough information for ROW_MARK_COPY RowMarks around to
create the necessary tupledesc (there's a number of variants as to how -
don't zap the information in add_rte_to_flat_rtable, have setrefs.c
somehow maintain rti indexed pointers to Plan nodes, ...).

Not sure yet what's best here.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-31 19:07:29 -0400, Tom Lane wrote:
>> It's setting up an es_epqTupleSlot[] entry on the assumption that it
>> should have the same tupdesc as the output tuple that's to be rechecked.
>> This might accidentally fail to malfunction in single-table cases,
>> but it's quite broken for any join case --- in particular, for the
>> given test case, the scan tuple for the VALUES node surely doesn't have
>> the same tupdesc as the join result.

> To make sure I understand - the problem isn't the slot that we've
> created in nodeModifyTable.c via EvalPlanQualSlot(), right?  It's the one
> we create in EvalPlanQualFetchRowMarks(), because we don't have a proper
> tuple type handy to create the slot with?

Yeah, I think nodeModifyTable.c is fine, because it always passes the
target relation.  EvalPlanQualFetchRowMarks is not fine, and I'm unsure
about the call in nodeLockRows.c.

> Previously we simply didn't need to know the type during EPQ setup,
> because we only stored a HeapTuple anyway.  And we'd know the appropriate
> tupledesc at the places that access the tuple.

Right.  So we gotta refactor that somehow.

> One bigger change - but possibly one worth it - would be to basically
> move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
> at least for ROW_MARK_COPY.

Hm.  Too tired to think that through right now.

			regards, tom lane
Hi,

On 2019-07-31 22:37:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > To make sure I understand - the problem isn't the slot that we've
> > created in nodeModifyTable.c via EvalPlanQualSlot(), right?  It's the one
> > we create in EvalPlanQualFetchRowMarks(), because we don't have a proper
> > tuple type handy to create the slot with?
>
> Yeah, I think nodeModifyTable.c is fine, because it always passes the
> target relation.  EvalPlanQualFetchRowMarks is not fine, and I'm unsure
> about the call in nodeLockRows.c.

I think nodeLockRows.c itself should be fine - the slots it gets with
EvalPlanQualSlot() are either used for the FDWs to store tuples in, or
nodeLockRows.c directly fetches tuples into them with
table_tuple_lock().  It itself is only dealing with locking rowmarks,
for which we obviously have a relation.

It does trigger EvalPlanQualFetchRowMarks() for the non-locking marks -
that part is obviously broken, but not in nodeLockRows.c itself.

> > Previously we simply didn't need to know the type during EPQ setup,
> > because we only stored a HeapTuple anyway.  And we'd know the appropriate
> > tupledesc at the places that access the tuple.
>
> Right.  So we gotta refactor that somehow.

One way to avoid doing so would be to just not deal with slots in the
case of ROW_MARK_COPY.  The reason why EPQ was made to use slots instead
of HeapTuples is that not doing so would require expensive conversions
for AMs that don't use HeapTuple.  But for COPY rowmarks that's not an
issue, because the source already is a HeapTuple (in the form of a
Datum).

That's imo not a particularly pretty approach, but might be the most
viable approach for v12.  Attached is a quick WIP implementation, which
indeed fixes the crash reported here, and passes all our tests.
Obviously it'd need some polish if we went for that.

I think it's pretty darn ugly that we don't know what kind of tuples
we're dealing with at the top-level of EPQ.  But changing that seems far
from trivial.  The way things are set up we don't have a good way to map
from rtis to the associated PlanStates, and currently only the
PlanStates have enough information to infer the slot type.  In fact, if
we knew the ScanState nodes for specific rtis, we could perhaps even
just use each node's ScanState->ss_ScanTupleSlot - seems fragile, but
we've effectively done so before v12.

And we can't easily add a way to map from rti to PlanState, without
adding work to ExecInit* for every Scan node.  If ExecInitNode() knew
whether the relevant nodes were Scan nodes, we could maintain the
mapping at that stage, but we afaict don't know that.  I briefly looked
at hacking something like ExecInitScanTupleSlot() to maintain an EState
rti->ScanState mapping, but we have ScanState executor nodes for planner
nodes that do *not* inherit from Scan (e.g. Material).  While that's
ugly, it's not something we seem likely to change all that soon - and
using ExecInitScanTupleSlot() as a proxy is just a hack anyway.  It sure
would be nice if IsA() actually worked when inheriting from a node...

We could easily maintain a mapping from Plan->plan_node_id to the
corresponding PlanState from within ExecInitNode().  But I don't think
that'd help that much, as we'd need an rti -> plan_node_id mapping,
which'd again not be trivial to maintain.  We could do so in
set_plan_refs(), as that knows both the relevant plan_node_id, and the
adjusted scanrelid.

I'm not exactly sure what the intended uses of plan_node_id are - it was
introduced in d1b7c1ffe72e86932b5395f29e006c3f503bc53d and is,
exclusively?, used as a unique key into dynamic shared memory for
parallel query (so each node can cheaply build a key for its own data).
So perhaps it'd be a gross misuse of that system.  But it doesn't look
like that to me.

Not sure if there's other things that could benefit from the rti ->
ScanState or plan_node_id -> PlanState mapping.  It'd be nice for
debugging, I guess, but that doesn't quite seem enough.

> > One bigger change - but possibly one worth it - would be to basically
> > move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
> > at least for ROW_MARK_COPY.
>
> Hm.  Too tired to think that through right now.

I briefly tried to implement that.  The problem is that, as things are
currently set up, we don't have access to the corresponding epqstate
from within executor nodes.  That prevents us from accessing
EPQState->{origslot, arowMarks}, which we need to actually be able to
fetch the rows to return.

We could solve that by referencing the EPQState from its EState (but not
from the "main" estate).  I've wondered before whether that'd be better
than having various es_epq* fields in EState.  I don't think it'd make
the modularity any worse, given that we're already referencing EPQState
fields from within EState.  If we moved es_epqTupleSlot into EPQState
and allocated it from within EvalPlanQualInit(), instead of
EvalPlanQualBegin(), we'd address your complaint that we now call that
earlier too.

Doesn't sound that hard.  Seems like a big change for v12 though.

As far as I can tell, the only behavioural change of fetching rows later
instead of in EvalPlanQualFetchRowMarks would be that we'd fetch
(locally or via FDW) rows that are referenced, but not locked, at the
same time we lock rows, and that we would not fetch them if any of the
locked rows couldn't be locked.  Which seems like a mild improvement.

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-31 22:37:40 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> Previously we simply didn't need to know the type during EPQ setup,
>>> because we only stored a HeapTuple anyway.  And we'd know the appropriate
>>> tupledesc at the places that access the tuple.

>> Right.  So we gotta refactor that somehow.

> One way to avoid doing so would be to just not deal with slots in the
> case of ROW_MARK_COPY.  The reason why EPQ was made to use slots
> instead of HeapTuples is that not doing so would require expensive
> conversions for AMs that don't use HeapTuple.  But for COPY rowmarks
> that's not an issue, because the source already is a HeapTuple (in the
> form of a Datum).

> That's imo not a particularly pretty approach, but might be the most
> viable approach for v12.  Attached is a quick WIP implementation, which
> indeed fixes the crash reported here, and passes all our
> tests.  Obviously it'd need some polish if we went for that.

I agree that that's mighty ugly.  But it does have the advantage of
being not very invasive, so at this late date maybe it's what we should
do for v12.

>>> One bigger change - but possibly one worth it - would be to basically
>>> move the work done in EvalPlanQualFetchRowMarks() to be done on-demand,
>>> at least for ROW_MARK_COPY.

>> Hm.  Too tired to think that through right now.

> I briefly tried to implement that.  The problem is that, as things are
> currently set up, we don't have access to the corresponding epqstate
> from within executor nodes.  That prevents us from accessing
> EPQState->{origslot, arowMarks}, which we need to actually be able to
> fetch the rows to return.

> We could solve that by referencing the EPQState from its EState (but not
> from the "main" estate).  I've wondered before whether that'd be better
> than having various es_epq* fields in EState.  I don't think it'd make
> the modularity any worse, given that we're already referencing EPQState
> fields from within EState.  If we moved es_epqTupleSlot into EPQState
> and allocated it from within EvalPlanQualInit(), instead of
> EvalPlanQualBegin(), we'd address your complaint that we now call that
> earlier too.

> Doesn't sound that hard.  Seems like a big change for v12 though.

I feel that this seems like a promising solution for the longer term.
I agree that keeping a pointer to the whole EPQState in EState is no
worse than having pointers to some of its fields there.

Perhaps the next step is to draft a patch for this and see just how big
it is compared to the minimal fix.  I'd rather not have v12 be different
from later branches in this area if the change doesn't seem entirely
unreasonable for late beta.

			regards, tom lane
Hi,

On 2019-08-24 12:13:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-07-31 22:37:40 -0400, Tom Lane wrote:
> > I briefly tried to implement that.  The problem is that, as things are
> > currently set up, we don't have access to the corresponding epqstate
> > from within executor nodes.  That prevents us from accessing
> > EPQState->{origslot, arowMarks}, which we need to actually be able to
> > fetch the rows to return.
> > We could solve that by referencing the EPQState from its EState (but not
> > from the "main" estate).  I've wondered before whether that'd be better
> > than having various es_epq* fields in EState.  I don't think it'd make
> > the modularity any worse, given that we're already referencing EPQState
> > fields from within EState.  If we moved es_epqTupleSlot into EPQState
> > and allocated it from within EvalPlanQualInit(), instead of
> > EvalPlanQualBegin(), we'd address your complaint that we now call that
> > earlier too.
> > Doesn't sound that hard.  Seems like a big change for v12 though.
>
> I feel that this seems like a promising solution for the longer term.
> I agree that keeping a pointer to the whole EPQState in EState is no
> worse than having pointers to some of its fields there.
>
> Perhaps the next step is to draft a patch for this and see just how
> big it is compared to the minimal fix.

Working on that.

I just had a slightly crazy idea: What if we just blessed wholerow
rowmark types?  Then we'd pretty trivially have access to the relevant
types?  Or alternatively even just add type information to the
PlanRowMark for ROW_MARK_COPY instead of reconstructing it at runtime?

- Andres
Andres Freund <andres@anarazel.de> writes:
> I just had a slightly crazy idea: What if we just blessed wholerow
> rowmark types?  Then we'd pretty trivially have access to the relevant
> types?

What would be trivial about that?  You'd still need access to
RTE-specific information.  I'm a little worried about the extra load on
the typcache, too.

> Or alternatively even just add type information to the
> PlanRowMark for ROW_MARK_COPY instead of reconstructing it at runtime?

Maybe, but I'm not sure that leads to a less invasive solution.

			regards, tom lane
Hi,

On 2019-08-24 15:54:01 -0700, Andres Freund wrote:
> Working on that.

Think it's going to look not too invasive.  Needs a good bit more
cleanup, but I'm getting there.

One thing I'm struggling with is understanding and testing nested EPQ
states:

    /*
     * Each EState must have its own es_epqScanDone state, but if we have
     * nested EPQ checks they should share es_epqTupleSlot arrays.  This
     * allows sub-rechecks to inherit the values being examined by an outer
     * recheck.
     */

At the moment I don't understand what the point here is.  When/why do we
want to inherit values from the parent epqstate?  I know how to
construct queries with nested EPQ, but I don't see why we'd ever want to
inherit the values from the outer check?  Nor how we're doing so -
EvalPlanQualFetchRowMarks() unconditionally clears out the old tuple,
and nodeLockRows.c unconditionally fetches the source tuples too?

Greetings,

Andres Freund
Hi,

On 2019-08-25 22:36:52 -0700, Andres Freund wrote:
> On 2019-08-24 15:54:01 -0700, Andres Freund wrote:
> > Working on that.
>
> Think it's going to look not too invasive.  Needs a good bit more
> cleanup, but I'm getting there.
>
> One thing I'm struggling with is understanding and testing nested EPQ
> states:
>
>     /*
>      * Each EState must have its own es_epqScanDone state, but if we have
>      * nested EPQ checks they should share es_epqTupleSlot arrays.  This
>      * allows sub-rechecks to inherit the values being examined by an outer
>      * recheck.
>      */
>
> At the moment I don't understand what the point here is.  When/why do we
> want to inherit values from the parent epqstate?  I know how to construct
> queries with nested EPQ, but I don't see why we'd ever want to inherit
> the values from the outer check?  Nor how we're doing so -
> EvalPlanQualFetchRowMarks() unconditionally clears out the old tuple,
> and nodeLockRows.c unconditionally fetches the source tuples too?

As far as I can tell it's not worth preserving this bit - it doesn't
seem to be working in a way one might infer from the comment, and the
current behaviour is pretty weird: In a nested EPQState, we'll overwrite
the tuples accessed by an outer layer of EPQ.  I'm not sure, but I could
even see that causing crash type problems, e.g. if a column from the
slot fetched by the outer EPQ is passed to the inner EPQ, which then
stores a new tuple in that slot, potentially leaving the argument a
dangling pointer into that tuple.

Second, and somewhat related, question: Why are we passing the
surrounding EState to both EvalPlanQualInit() and to EvalPlanQual()?
Seems to not make much sense anymore?  As I needed to have access to the
parent EState from the EPQState, to be able to allocate slots for
EvalPlanQualSlot() before EvalPlanQualBegin() has been called, that
seems even less necessary.

I've verified that the attached fixes the crash reported upthread.  All
tests pass.  I think it needs a bit of cleanup, and a lot more test
coverage.  On the latter point, it seems that we are missing coverage
around:
1) most of ROW_MARK_COPY
2) nested EPQ
3) index only scans as an EPQ source
4) FDWs
I'll not be able to add complete coverage, but clearly we need more than
now.

Tom, everyone, what do you think of the attached approach?  It
implements what I described upthread.

The problem with fixing this bug is that we currently don't know the row
types for ExecAuxRowMarks at the site that is calling EvalPlanQual, and
that that is not easy to fix.  But we don't really need the type that
early, as there is no real need to fetch row marks that early.  The
difficulty with moving it to ExecScanFetch is that at that stage we
currently don't have access to the information about row marks from the
EPQState.

To fix that, this patch moves most of the EPQ related information out of
the EState into the EPQState, and sets es_active_epq to the EPQState for
EStates that are created by EvalPlanQualStart().  That then allows us to
have ExecScanFetch() fetch the rowmark on demand with the new
EvalPlanQualFetchRowMark (which basically is just the old
EvalPlanQualFetchRowMarks without a loop).  For EvalPlanQualFetchRowMark
to work efficiently, I added an rti - 1 indexed
EPQState->substitute_rowmark.

This also fixes Tom's complaint elsewhere that nodeLockRows.c does
EvalPlanQualBegin() earlier, by allowing EvalPlanQualSlot() to be called
before that.  That seems considerably nicer to me.

Questions:

- Is there any potential downside to not fetching non-locking rowmarks
  eagerly?  It's observable for FDWs, but it shouldn't matter?  As far
  as I can tell the new behaviour should always be at least as efficient
  as before?

- Other names for EPQState->substitute_*?

- Should the RTI indexed version of arowmarks be computed as an array
  directly in nodeModifyTable.c etc?  Right now we build it first as a
  list, and then convert it to an array.  We could also just use an rti
  indexed List, and store NULLs where there are no RTIs?  But there's no
  good API for that.

- We currently don't - and haven't in recent history - free EPQ test
  tuples until EPQ is needed the next time.  Perhaps we should?  But
  that'd probably not be a candidate for backpatching.

Todo:
- tests
- remove isolationtester rescheduling

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
>> One thing I'm struggling with is understanding and testing nested EPQ
>> states:

>> * Each EState must have its own es_epqScanDone state, but if we have
>> * nested EPQ checks they should share es_epqTupleSlot arrays.  This
>> * allows sub-rechecks to inherit the values being examined by an outer
>> * recheck.

>> At the moment I don't understand what the point here is.  When/why do we
>> want to inherit values from the parent epqstate?  I know how to construct
>> queries with nested EPQ, but I don't see why we'd ever want to inherit
>> the values from the outer check?  Nor how we're doing so -
>> EvalPlanQualFetchRowMarks() unconditionally clears out the old tuple,
>> and nodeLockRows.c unconditionally fetches the source tuples too?

It's been a really long time, but I think that the idea might have been
to avoid useless repeat tuple fetches in nested-EPQ cases.  However,
those are such extreme corner cases that I agree it's not worth
contorting the logic to make them a shade faster (if indeed it works at
all at present, which it might not).

> This also fixes Tom's complaint elsewhere that nodeLockRows.c does
> EvalPlanQualBegin() earlier, by allowing EvalPlanQualSlot() to be called
> before that.  That seems considerably nicer to me.

Yay.

> Tom, everyone, what do you think of the attached approach?  It
> implements what I described upthread.

Seems reasonable to store a pointer in the EPQState rather than pass
that as a separate argument, but I think you need a bit more
documentation work to keep the two EState pointers from being impossibly
confusing.  It might help to write something like "estate is an executor
state node instantiated for the same rangetable as parentestate (though
it might be running only a sub-tree of the main plan).  While
parentestate represents the "real" plan execution, estate runs EPQ
rechecks in which the table scan nodes are changed to return only the
current tuple of interest for each table.  Those current tuples are
found using the rowmark mechanisms."

Maybe we should change the "estate" field name to something else, too
--- "childestate" is the first thing that comes to mind but it's rather
generic.  "recheckestate", perhaps?

Also I dislike the field name "es_active_epq", as what that suggests to
me is exactly backwards from the way you apparently are using it.  Maybe
"es_parent_epq" instead?  The comment for it is pretty opaque as well,
not to mention that it misspells EState.

I don't agree with the FIXME comment in execnodes.h.  The rowmarks state
is used for things that are outside of EPQ, particularly the actual row
locking ;-), so I wouldn't want to move that even if it didn't create
notational issues.

I'm confused by the removal of the EvalPlanQualBegin call at
nodeModifyTable.c:1373 ... why is that OK?

The original intent of EvalPlanQualInit was to do an absolutely minimal
amount of work, since we don't know yet whether EPQ will ever be needed
(and, generally, the way to bet is that it won't be).  You've added some
pallocs and list-to-array conversion, which while not terribly expensive
still seem like they don't belong there.  Can't we move most of that to
EvalPlanQualBegin?  I see that we want the substitute_slot array to
exist so that EvalPlanQualSlot can work without EvalPlanQualBegin, but
if we're postponing non-locking-rowmark work then we don't need the
rest.

You seem to have removed EvalPlanQualFetchRowMarks, but the extern for
it is still there?

The bool result of EvalPlanQualFetchRowMark needs to be documented.

In ExecIndexOnlyMarkPos and ExecIndexOnlyRestrPos, you seem to have
incorrectly changed the function names given in comments and elog
messages (copy&pasto?)

> Questions:
> - Is there any potential downside to not fetching non-locking rowmarks
>   eagerly?  It's observable for FDWs, but it shouldn't matter?  As far as
>   I can tell the new behaviour should always be at least as efficient as
>   before?

It's possible that in some cases we'd not re-fetch a tuple that the
existing code would re-fetch, but I have a hard time seeing how that's
bad.

> - Should the RTI indexed version of arowmarks be computed as an array
>   directly in nodeModifyTable.c etc?  Right now we build it first as a
>   list, and then convert it to an array.  We could also just use an rti
>   indexed List, and store NULLs where there are no RTIs?  But there's no
>   good API for that.

Do that as a follow-on patch if at all.  I'm not sure it's worth
changing.

> - We currently don't - and haven't in recent history - free EPQ test
>   tuples until EPQ is needed the next time.  Perhaps we should?

Can't get excited about it.

			regards, tom lane
On 2019-08-27 13:06:37 -0400, Tom Lane wrote:
> > Tom, everyone, what do you think of the attached approach? It
> > implements what I described upthread.
>
> Seems reasonable to store a pointer in the EPQState rather than pass
> that as a separate argument, but I think you need a bit more
> documentation work to keep the two EState pointers from being
> impossibly confusing.

Agreed.

> Also I dislike the field name "es_active_epq", as what that suggests
> to me is exactly backwards from the way you apparently are using it.
> Maybe "es_parent_epq" instead?  The comment for it is pretty opaque
> as well, not to mention that it misspells EState.

I think what I was trying to signal was that EPQ is currently active
from the POV of executor nodes.  Ought to have been es_epq_active, for
that, I guess.  For me "if (estate->es_epq_active)" explains the
meaning of the check more than "if (estate->es_parent_epq)".

> I don't agree with the FIXME comment in execnodes.h.  The rowmarks
> state is used for things that are outside of EPQ, particularly the
> actual row locking ;-), so I wouldn't want to move that even if it
> didn't create notational issues.

Yea, I wasn't sure about that.

> I'm confused by the removal of the EvalPlanQualBegin call at
> nodeModifyTable.c:1373 ... why is that OK?

EvalPlanQual() calls EvalPlanQualBegin(); the earlier call was just
needed to initialize enough state to be able to get a slot.

> The original intent of EvalPlanQualInit was to do an absolutely
> minimal amount of work, since we don't know yet whether EPQ will
> ever be needed (and, generally, the way to bet is that it won't be).
> You've added some pallocs and list-to-array conversion, which
> while not terribly expensive still seem like they don't belong
> there.  Can't we move most of that to EvalPlanQualBegin?  I see that
> we want the substitute_slot array to exist so that EvalPlanQualSlot
> can work without EvalPlanQualBegin, but if we're postponing
> non-locking-rowmark work then we don't need the rest.

Yea, we can move the other fields without much trouble.

> > - Should the RTI indexed version of arowmarks be computed as an
> >   array directly in nodeModifyTable.c etc? Right now we build it
> >   first as a list, and then convert it as an array? We could also
> >   just use an rti indexed List, and store NULLs where there are no
> >   RTIs? But there's no good API for that.
>
> Do that as a follow-on patch if at all.  I'm not sure it's worth
> changing.

Well, it'd at least partially address your concern above that we ought
to do less work during EvalPlanQualInit - if we had it in array form
we'd just need to set a variable, rather than do the transformation.

Greetings,

Andres Freund
Hi,

On 2019-08-27 13:06:37 -0400, Tom Lane wrote:
> Seems reasonable to store a pointer in the EPQState rather than pass
> that as a separate argument, but I think you need a bit more
> documentation work to keep the two EState pointers from being
> impossibly confusing.  It might help to write something like "estate
> is an executor state node instantiated for the same rangetable as
> parentestate (though it might be running only a sub-tree of the main
> plan).  While parentestate represents the "real" plan execution,
> estate runs EPQ rechecks in which the table scan nodes are changed to
> return only the current tuple of interest for each table.  Those
> current tuples are found using the rowmark mechanisms."

I tried that for a while, but was unsatisfied with putting it in the
EState(s) - felt like too general a comment.  Moved it to the top of
the struct instead.  This basically ended up reordering all of struct
EState, so it'd be good to have a second look.

> Maybe we should change the "estate" field name to something else, too
> --- "childestate" is the first thing that comes to mind but it's
> rather generic.  "recheckestate", perhaps?

I went with recheckestate, and then also renamed the PlanState to
follow the recheck* pattern, to make clearer that it's not the main
plan's PlanState (sub-)tree.

> Also I dislike the field name "es_active_epq", as what that suggests
> to me is exactly backwards from the way you apparently are using it.
> Maybe "es_parent_epq" instead?

I went with es_epq_active, as I suggested in my earlier email, which
still seems accurate to me.

I really want to move consideration of es_epq_active to ExecInitNode()
time, rather than execution time.  If we add an execScan helper, we
can have it set the corresponding executor node's ExecProcNode to
a) a function that performs qual checks and projection
b) a function that performs projection
c) the fetch method from the scan node
d) basically the current ExecScan, when es_epq_active

> The original intent of EvalPlanQualInit was to do an absolutely
> minimal amount of work, since we don't know yet whether EPQ will
> ever be needed (and, generally, the way to bet is that it won't be).
> You've added some pallocs and list-to-array conversion, which
> while not terribly expensive still seem like they don't belong
> there.  Can't we move most of that to EvalPlanQualBegin?  I see that
> we want the substitute_slot array to exist so that EvalPlanQualSlot
> can work without EvalPlanQualBegin, but if we're postponing
> non-locking-rowmark work then we don't need the rest.

Did that.

> In ExecIndexOnlyMarkPos and ExecIndexOnlyRestrPos, you seem to have
> incorrectly changed the function names given in comments and elog
> messages (copy&pasto?)

Fixed.

Did a good bit more comment polishing, renamed a few more variables.

I also added tests for things that I thought were clearly missing
(including a test that errors out before the code changes in the
patch).

For one of the new tests I added a noisy_oper(comment, a, oper, b)
function, to see which EPQ tests are performed with which values.
Without that I found it to be really hard to be confident that we
actually are using the right column values.  I think it might make
sense to expand its use a bit more - but that seems like something for
later.

I tried for a while to develop one for mark/restore of IndexOnlyScans,
but I concluded that that code is basically dead right now.  Every
scan node of a normal relation that gets modified or needs a rowmark
implies having ctid as part of the targetlist.  And we neither allow
ctid to be part of index definitions, nor understand that we actually
kinda know the ctid from within the index scan (HOT would make using
the tid hard).  So the relevant code in nodeIndexOnly.c seems dead?

Greetings,

Andres Freund
Hi,

On 2019-09-05 12:59:11 -0700, Andres Freund wrote:
> @@ -562,6 +831,7 @@ LockRows
>            Merge Cond: (a.id = b.id)
>            ->  Index Scan using jointest_id_idx on jointest a
>            ->  Index Scan using jointest_id_idx on jointest b
> +s1: WARNING:  it's me, brother ExecIndexMarkPos()
>  id             data           id             data
>
>  1              0              1              0

Err, I forgot to amend the commit with the stashed removal of
debugging cruft in the output file...

Greetings,

Andres Freund
Hi,

On 2019-09-05 12:59:11 -0700, Andres Freund wrote:
> Did a good bit more comment polishing, renamed a few more variables.

Pushed now, after some more polishing.

> I also added tests for things that I thought were clearly missing
> (including a test that errors out before the code changes in the
> patch).

For 12, I had to replace the NOTICE with WARNING (including SET
client_min_messages).  I wonder if we ought to backpatch

commit ebd49928215e3854d91167e798949a75b34958d0
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2019-07-27 15:59:57 -0400

    Don't drop NOTICE messages in isolation tests.

to avoid backpatching pain?

> I tried for a while to develop one for mark/restore of
> IndexOnlyScans, but I concluded that that code is basically dead
> right now.  Every scan node of a normal relation that gets modified
> or needs a rowmark implies having ctid as part of the targetlist.
> And we neither allow ctid to be part of index definitions, nor
> understand that we actually kinda know the ctid from within the index
> scan (HOT would make using the tid hard).  So the relevant code in
> nodeIndexOnly.c seems dead?

I wonder if, on master, we should make ExecIndexOnlyMarkPos() and
ExecIndexOnlyRestrPos() ERROR out in case they're hit for an EPQ
relation, given that they ought to be unreachable.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-09-05 12:59:11 -0700, Andres Freund wrote:
>> I tried for a while to develop one for mark/restore of
>> IndexOnlyScans, but I concluded that that code is basically dead
>> right now.  Every scan node of a normal relation that gets modified
>> or needs a rowmark implies having ctid as part of the targetlist.
>> And we neither allow ctid to be part of index definitions, nor
>> understand that we actually kinda know the ctid from within the
>> index scan (HOT would make using the tid hard).  So the relevant
>> code in nodeIndexOnly.c seems dead?

> I wonder if, on master, we should make ExecIndexOnlyMarkPos(),
> ExecIndexOnlyRestrPos() ERROR out in case they're hit for an EPQ
> relation, given that they ought to be unreachable.

I'd vote against.  The chain of reasoning that says they're unreachable
is long and involves mostly code that's nowhere near there, so when/if
somebody made a change that broke that reasoning, they'd not
necessarily notice that the ERROR has to be undone.

			regards, tom lane