RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder() - Mailing list pgsql-bugs
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder() |
Date | |
Msg-id | OS0PR01MB57163F3F1BF6B818ADE34862941AA@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
|
List | pgsql-bugs |
On Thursday, August 17, 2023 6:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form > <noreply@postgresql.org> wrote: > > > > > > > > The following bug has been logged on the website: > > > > > > > > Bug reference: 18055 > > > > Logged by: ocean li > > > > Email address: ocean_li_996@163.com > > > > PostgreSQL version: 11.9 > > > > Operating system: centos7 5.10.84 x86_64 > > > > Description: > > > > > > > > For testing logical decoding module, *pg_logical_slot_get_changes* > > > > function is used. Sometimes i got an core whose stack was like that: > > > > ... > > > > And in level #3 of stack above, NInitialRunningXacts is 2 and > > > > InitialRunningXacts is not NULL observed in one of cores. > > > > > > > > Using of NInitialRunningXacts and InitialRunningXacts are clear. > > > > Currently, the core, as far as i know, maybe caused by this way: > > > > an ERROR raised when calling > > > > *pg_logical_slot_get_changes_guts* function. The code part of > > > > PG_CATCH() doses not reset NInitialRunningXacts and > InitialRunningXacts. > > > > Then, calling pg_logical_slot_get_changes_guts again, the core may > occur. > > > > Unfortunately, I couldn't find a minimal reproduction case. > > > > However, I observed an *ERROR: canceling statement due to > > > > statement timeout* logged before each core occurred. (For some > > > > reason, I can't provide the information of > > > > log) > > > > > > Thanks for the report and the fix! > > > > > > I can also reproduce by the steps[1] in PG15~11(Note that we need to > > > change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra > > > running xacts wal records). > > > > > > About the patch: > > > > > > --- > > > /* free context, call shutdown callback */ > > > FreeDecodingContext(ctx); > > > > > > ReplicationSlotRelease(); > > > InvalidateSystemCaches(); > > > } > > > PG_CATCH(); > > > { > > > + > > > + /* free context, call shutdown callback */ > > > + FreeDecodingContext(ctx); > > > + > > > + ReplicationSlotRelease(); > > > > > > I think we could not directly call the cleanup functions here. There > > > could be two risks: > > > 1) it's possible either 'ctx' hasn't been initialized before the > > > error happen, we don't need to clean it up in this case > > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up > > > normally before entering the catch() block which means we will double > cleanup(free) it. > > > > > > To improve this, I think we can use PG_FINALLY() here and move all > > > these cleanup functions in it and do the null check for 'ctx' before cleaning > up. > > > > > > Just share one small patch based on your fix for reference, feel free to > update it. > > > > > > > - LogicalDecodingContext *ctx; > > + LogicalDecodingContext *ctx = NULL; > > > > Don't we need to use volatile for ctx? See the following comment in > > elog.h: "Note: if a local variable of the function containing PG_TRY > > is modified in the PG_TRY section and used in the PG_CATCH section, > > that variable must be declared "volatile" for POSIX compliance."? > > Thanks for pointing it out, I think we need to use volatile for it. > > BTW, don't we need a similar change in pg_logical_replication_slot_advance()? You are right, I can reproduce the same following the steps in [1] and replace the last two pg_logical_slot_get_changes function calls with two "SELECT pg_replication_slot_advance('isolation_slot', (select pg_current_wal_lsn()));" Here is a new version patch based on PG15 STABLE which addressed above comments. [1] https://www.postgresql.org/message-id/OS0PR01MB57160A7DD10B517AFDC73378941AA%40OS0PR01MB5716.jpnprd01.prod.outlook.com Best Regards, Hou zj
Attachment
pgsql-bugs by date: