Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAA4eK1JLBi3HzenB6do3_hd78kN0UDD1mz-vumWE52XHHEq5Bw@mail.gmail.com |
In response to | RE: Synchronizing slots from primary to standby ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
Responses | Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> Here is V87 patch that adds test for the suggested cases.
>

I have pushed this patch and it leads to a BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37

The test failures are:

    # Failed test 'logical decoding is not allowed on synced slot'
    # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272.
    # Failed test 'synced slot on standby cannot be altered'
    # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281.
    # Failed test 'synced slot on standby cannot be dropped'
    # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287.

The reason is that in LOGs, we see a different ERROR message than what is expected:

    2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: replication slot "lsub1_slot" is active for PID 1760871

Now, we see the slot still active because a test before these tests (# Test that if the synchronized slot is invalidated while the remote slot is still valid, ....) is not able to successfully persist the slot and the synced temporary slot remains active.

The reason is clear by referring to below standby LOGS:

    LOG: connection authorized: user=bf database=postgres application_name=040_standby_failover_slots_sync.pl
    LOG: statement: SELECT pg_sync_replication_slots();
    LOG: dropped replication slot "lsub1_slot" of dbid 5
    STATEMENT: SELECT pg_sync_replication_slots();
    ...
    SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';

In the above LOGs, we should ideally see: "newly created slot "lsub1_slot" is sync-ready now" after the "LOG: dropped replication slot "lsub1_slot" of dbid 5" but lack of that means the test didn't accomplish what it was supposed to.
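The query at the end of the standby log checks only that the slot has no conflict and is marked as synced, and a synced slot that is still temporary can satisfy both conditions. A minimal sketch of a stricter check follows. It assumes the `temporary` column of `pg_replication_slots` is a suitable way to tell a persisted slot from a temporary one; it is an illustration and not necessarily how the test should be fixed.

```sql
-- Sketch only: the same check as the query in the log, plus the existing
-- "temporary" column, so that a synced slot that was never persisted
-- makes the check return false.
SELECT conflict_reason IS NULL AND synced AND NOT temporary
FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';
```

With such a condition the earlier test would presumably have failed at the point where the slot could not be persisted, rather than letting the three later tests fail with the "is active for PID" error.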
Ideally, the same test should have failed but the pass criteria for the test failed to check whether the slot is persisted or not. The probable reason for failure is that remote_slot's restart_lsn lags behind the oldest WAL segment on standby. Now, in the test, we do ensure that the publisher and subscriber are caught up by following steps:

    # Enable the subscription to let it catch up to the latest wal position
    $subscriber1->safe_psql('postgres',
        "ALTER SUBSCRIPTION regress_mysub1 ENABLE");

    $primary->wait_for_catchup('regress_mysub1');

However, this doesn't guarantee that restart_lsn is moved to a position new enough that standby has a WAL corresponding to it. One easy fix is to re-create the subscription with the same slot_name after we have ensured that the slot has been invalidated on standby so that a new restart_lsn is assigned to the slot but it is better to analyze some more why the slot's restart_lsn hasn't moved enough only sometimes.

--
With Regards,
Amit Kapila.