Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943 - Mailing list pgsql-bugs
From | Tender Wang |
---|---|
Subject | Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943 |
Date | |
Msg-id | CAHewXNnpxy6rMNvBGZpTdgLosNTpEmZOzth6_m57kcU3kE4kTA@mail.gmail.com |
In response to | Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943 (Tender Wang <tndrwang@gmail.com>) |
Responses |
Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943
|
List | pgsql-bugs |
Tender Wang <tndrwang@gmail.com> wrote on Thu, Apr 18, 2024 at 20:13:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, Apr 9, 2024 at 01:57:
On 2024-Mar-05, PG Bug reporting form wrote:
> #2 0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> lineNumber=1943) at assert.c:66
> #3 0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> pruneinfo=0x1884188) at execPartition.c:1943
> #4 0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> n_total_subplans=2, pruneinfo=0x1884188,
> initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803
I had been digging into this crash in late March and seeing if I could
find a reliable fix, but it seems devilish and had to put it aside. The
problem is that DETACH CONCURRENTLY does a wait for snapshots to
disappear before doing the next detach phase; but since pgbench is using
prepared mode, the wait is already long done by the time EXECUTE wants
to run the plan. Now, we have relcache invalidations at the point where
the wait ends, and those relcache invalidations should in turn cause the
prepared plan to be invalidated, so we would get a new plan that
excludes the partition being detached. But this doesn't happen for some
reason that I haven't yet been able to understand.
Still trying to find a proper fix. In the meantime, not using prepared
plans should serve to work around the problem.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net

I have been analyzing this crash over the past few days, and I added a lot of debug output to the code. Finally, I found a code execution sequence that triggers this assert, and I can use gdb instead of pgbench to reproduce the crash.

For example:

./psql postgres    # session1, which does the detach; start it first

In another terminal, start gdb (call it gdb1):

gdb -p session1_pid
b ATExecDetachPartition

In session1, input:

alter table p detach partition p1 concurrently;

Now session1 is stalled by gdb. In the gdb1 terminal, we step with next (e.g. n) until the first transaction calls CommitTransactionCommand(). We stop at CommitTransactionCommand().

We start another session, session2, to do the select. Input:

prepare p1 as select * from p where a = $1;

We start a new terminal and start gdb (call it gdb2):

gdb -p session2_pid
b exec_simple_query

In session2, input:

execute p1(1);

Now session2 is stalled by gdb. In the gdb2 terminal, we step into PortalRunUtility() and stop right after it gets a snapshot. For session2, the transaction updating pg_inherits is not yet committed.

We switch to the gdb1 terminal and continue to step with next until it reaches the call to DetachPartitionFinalize(). Because session2 has not yet taken a lock on relation p, gdb1 can get past WaitForLockersMultiple().

Now we switch to gdb2 and continue. If we set a breakpoint at find_inheritance_children_extended(), we see a tuple whose inhdetachpending is true but whose xmin is still in progress according to session2's snapshot. So, following the logic there, this tuple is added to the output, and we end up with two partitions.

After returning from add_base_rels_to_query() in query_planner(), we switch to gdb1. In gdb1, we enter DetachPartitionFinalize() and call RemoveInheritance() to remove the tuple. We input the command "continue" to let the rest of the detach run.

Now we switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry(). We continue gdb2 and stop at RelationCacheInvalidateEntry(), where we see that the relcache entry for p is cleared. The backtrace is attached at the end of this email.

On entering ExecInitAppend(), part_prune_info is not null, so we enter CreatePartitionPruneState(). We enter find_inheritance_children_extended() again to get the partdesc, but by now gdb1 has finished DetachPartitionFinalize() and the detach has committed. So we get only one tuple and nparts is 1.

Finally, we trigger the Assert: (partdesc->nparts >= pinfo->nparts).

--
Tender Wang
OpenPie: https://en.openpie.com/
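The sequence above can be condensed into a small standalone model. This is only a sketch and not PostgreSQL code: InheritsRow, count_partitions(), RaceResult and simulate_detach_race() are hypothetical names invented for illustration, and the visibility rule is a simplified reading of the behavior described for find_inheritance_children_extended(). It also assumes that the planner-side lookup omits detached partitions while the executor-side lookup does not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one pg_inherits row. */
typedef struct InheritsRow
{
    bool present;           /* false once the row has been removed */
    bool detach_pending;    /* models inhdetachpending */
    bool xmin_in_progress;  /* row's xmin is in progress for the reader's snapshot */
} InheritsRow;

/*
 * Count the partitions a reader would see.  A detach-pending row is
 * skipped only when the caller asks to omit detached partitions and the
 * row's xmin is no longer in progress for the caller's snapshot.
 */
static int
count_partitions(const InheritsRow *rows, size_t nrows, bool omit_detached)
{
    int nparts = 0;

    for (size_t i = 0; i < nrows; i++)
    {
        if (!rows[i].present)
            continue;
        if (omit_detached && rows[i].detach_pending && !rows[i].xmin_in_progress)
            continue;
        nparts++;
    }
    return nparts;
}

typedef struct RaceResult
{
    int  planned_nparts;    /* plays the role of pinfo->nparts */
    int  executor_nparts;   /* plays the role of partdesc->nparts */
    bool assert_holds;      /* executor_nparts >= planned_nparts */
} RaceResult;

/* Replay the interleaving: plan, finalize the detach, then initialize. */
static RaceResult
simulate_detach_race(void)
{
    InheritsRow rows[2] = {
        {true, true, true},     /* p1: detach pending, xmin still in progress */
        {true, false, false},   /* a second, untouched partition */
    };
    RaceResult  r;

    /* Planning in session2: the pending row is still counted. */
    r.planned_nparts = count_partitions(rows, 2, true);

    /* session1 finalizes the detach and removes the row for p1. */
    rows[0].present = false;

    /* Executor startup in session2 rebuilds the partition descriptor. */
    r.executor_nparts = count_partitions(rows, 2, false);

    r.assert_holds = (r.executor_nparts >= r.planned_nparts);
    return r;
}
```

In this model, simulate_detach_race() plans with two partitions and initializes with one, so the modeled condition is false, which matches the assert failure in the report.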
Sorry, in my last email I forgot to include the backtrace that shows the call to RelationCacheInvalidateEntry() during the planner phase.
I found a self-contradictory comment in CreatePartitionPruneState():
/* For data reading, executor always omits detached partitions */
if (estate->es_partition_directory == NULL)
    estate->es_partition_directory =
        CreatePartitionDirectory(estate->es_query_cxt, false);
If I haven't misunderstood, shouldn't it say "does not omit"? We pass false to the function.
I think we could rewrite the logic of CreatePartitionPruneState() as below:
if (partdesc->nparts == pinfo->nparts)
{
    /* no new partition and no detached partition */
}
else if (partdesc->nparts > pinfo->nparts)
{
    /* new partition */
}
else
{
    /* detached partition */
}
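To make the three cases concrete, here is a minimal standalone sketch of the comparison. It is not a patch: PartitionChange and classify_partition_change() are hypothetical names, and the two integer arguments stand in for partdesc->nparts and pinfo->nparts. What each branch would actually have to do inside CreatePartitionPruneState() is left open.

```c
/* Hypothetical labels for the three branches proposed above. */
typedef enum PartitionChange
{
    PARTS_UNCHANGED,    /* no new partition and no detached partition */
    PARTS_ADDED,        /* executor sees more partitions than the plan */
    PARTS_DETACHED      /* executor sees fewer partitions than the plan */
} PartitionChange;

/*
 * Compare the partition count seen at executor startup with the count
 * recorded in the plan and report which branch applies.
 */
static PartitionChange
classify_partition_change(int partdesc_nparts, int pinfo_nparts)
{
    if (partdesc_nparts == pinfo_nparts)
        return PARTS_UNCHANGED;
    else if (partdesc_nparts > pinfo_nparts)
        return PARTS_ADDED;
    else
        return PARTS_DETACHED;
}
```

With the counts from the reproduced scenario (one partition at executor startup, two in the plan), classify_partition_change(1, 2) returns PARTS_DETACHED. That is the case the current assert does not allow for.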
I haven't figured out a fix for the scenario I described in my last email.
--
Tender Wang
OpenPie: https://en.openpie.com/