Home > mailing lists

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers

From	Zhijie Hou (Fujitsu)
Subject	RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
Date	January 30, 2024 01:54:42
Msg-id	OS0PR01MB5716B4C05F90F59C0ED227C2947D2@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw
In response to	Re: Race condition in FetchTableStates() breaks synchronization of subscription tables (vignesh C <vignesh21@gmail.com>)
Responses	Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
List	pgsql-hackers

Tree view

On Monday, January 29, 2024 9:22 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > Hello hackers,
> >
> > After determining a possible cause for intermittent failures of the
> > test subscription/031_column_list [1], I was wondering what makes
> > another subscription test (014_binary) fail on the buildfarm:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=20
> > 24-01-22%2001%3A19%3A03
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=202
> > 4-01-14%2018%3A19%3A20
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=202
> > 3-12-21%2001%3A11%3A52
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=202
> > 3-11-27%2001%3A42%3A39
> >
> > All those failures caused by a timeout when waiting for a message
> > expected in _subscriber.log. For example, in the snakefly's case:
> > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
> > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
> > incorrect binary data format) at
> /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl
> line 269.
> >
> > _subscriber.log contains:
> > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement:
> > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > 2023-12-21 01:17:46.756 UTC [409999] ERROR:  could not receive data
> > from WAL stream: server closed the connection unexpectedly
> >          This probably means the server terminated abnormally
> >          before or while processing the request.
> > 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical
> > replication apply worker" (PID 409999) exited with exit code 1
> > 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply
> > worker for subscription "tsub" has started ...
> >
> > While _subscriber.log from a successful test run contains:
> > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement:
> > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
> > session time: 0:00:00.014 user=postgres database=postgres host=[local]
> > 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply
> > worker for subscription "tsub" has started
> > 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table
> > synchronization worker for subscription "tsub", table
> > "test_mismatching_types" has started
> > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data
> > format
> > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY
> > test_mismatching_types, line 1, column a
> >
> > In this case, "logical replication apply worker for subscription
> > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 minutes
> later.
> >
> > I've managed to reproduce this failure locally by running multiple
> > tests in parallel, and my analysis shows that it is caused by a race
> > condition when accessing variable table_states_valid inside tablesync.c.
> >
> > tablesync.c does the following with table_states_valid:
> > /*
> >   * Callback from syscache invalidation.
> >   */
> > void
> > invalidate_syncing_table_states(Datum arg, int cacheid, uint32
> > hashvalue) {
> >      table_states_valid = false;
> > }
> > ...
> > static bool
> > FetchTableStates(bool *started_tx)
> > {
> > ...
> >      if (!table_states_valid)
> >      {
> > ...
> >          /* Fetch all non-ready tables. */
> >          rstates = GetSubscriptionRelations(MySubscription->oid,
> > true); ...
> >          table_states_valid = true;
> >      }
> >
> > So, when syscache invalidation occurs inside the code block "if
> > (!table_states_valid)", that invalidation is effectively ignored.
> >
> > In a failed case I observe the following events:
> > 1. logical replication apply worker performs
> >   LogicalRepApplyLoop() -> process_syncing_tables() ->
> >   process_syncing_tables_for_apply() -> FetchTableStates() periodically.
> >
> > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
> >   for SUBSCRIPTIONRELMAP, and that leads to calling
> >   invalidate_syncing_table_states().
> >
> > 3. If the apply worker manages to fetch no non-ready tables in
> >   FetchTableStates() and ignore "table_states_valid = false" from
> >   invalidate_syncing_table_states(), then it just misses the invalidation
> >   event, so it keeps working without noticing non-ready tables appeared as
> >   the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply()
> skips
> >   a loop "foreach(lc, table_states_not_ready) ..." until some other event
> >   occurs).
> >
> > pg_usleep(100000) added just below GetSubscriptionRelations(...)
> > proves my analysis -- without it, I need tens of iterations (with 50
> > tests running in
> > parallel) to catch the failure, but with it, I get the failure on the
> > first iteration.
> 
> Thanks for the analysis, I was able to reproduce the issue with the steps you had
> shared. I agree with your analysis. I added some logs to verify that the
> invalidation was getting missed.
> 
> I felt that this invalidation is getting ignored because we have used a boolean
> variable here, how about changing it slightly so that table_states_invalid gets
> incremented for every invalidation and then decrementing table_states_invalid
> after getting the non-ready tables like in the attached patch.  I was able to
> verify that the test passes with the attached patch.

Thanks for the patch.

I am not sure if counting the invalidation number is needed, as even if there
are hundreds of invalidations outside of FetchTableStates, one FetchTableStates
call should reset the count to 0 as it is checking the latest catalog. I think
a better approach is to make the table_states_valid tristate
{TABLE_STATE_NEEDS_REBUILD, TABLE_STATE_REBUILD_START, TABLE_STATE_VALID}. The
cache invalidate callback will set state to TABLE_STATE_NEEDS_REBUILD.
FetchTableStates will set the state to TABLE_STATE_VALID only if the state was
TABLE_STATE_REBUILD_START. See the existing EventTriggerCacheStateType for
reference.

Best Regards,
Hou zj

pgsql-hackers by date:

From: Melanie Plageman
Date: 30 January 2024, 01:18:45
Subject: Re: Confine vacuum skip logic to lazy_scan_skip

From: Peter Smith
Date: 30 January 2024, 01:59:04
Subject: Re: Synchronizing slots from primary to standby

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers

Previous

Next