Re: issue with synchronized_standby_slots - Mailing list pgsql-hackers

From Alexander Kukushkin
Subject Re: issue with synchronized_standby_slots
Date
Msg-id CAFh8B=mCN1hf1Rb-78kEG5ahD_dXT6WNb7-0KqJS9sNQwieh_g@mail.gmail.com
In response to Re: issue with synchronized_standby_slots  (Fabrice Chapuis <fabrice636861@gmail.com>)
Responses Re: issue with synchronized_standby_slots
Re: issue with synchronized_standby_slots
List pgsql-hackers
Hi,


On Sun, 7 Sept 2025 at 10:15, Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Thanks for your reply Zhijie,

I understand that the error "invalid value for parameter" will be displayed in case of a bad value for the GUC synchronized_standby_slots, or if a configured standby node is not up and running.
But the problem I noticed is that statements could not execute normally and an error code was returned to the application.
This happened after an upgrade from PG 14 to PG 17.
I could try to reproduce the issue.
 
> STATEMENT:  select service_period,sp1_0.address_line_1 from tbl1  where sp1_0.vn=$1 order by sp1_0.start_of_period
> 2025-08-24 13:14:29.417 CEST [848477]: [1-1] user=,db=,client=,application= ERROR:  invalid value for parameter "synchronized_standby_slots": "node1,node2"
> 2025-08-24 13:14:29.417 CEST [848477]: [2-1] user=,db=,client=,application= DETAIL:  replication slot "s029054a" does not exist
> 2025-08-24 13:14:29.417 CEST [848477]: [3-1] user=,db=,client=,application= CONTEXT:  while setting parameter "synchronized_standby_slots" to "node1,node2"
> 2025-08-24 13:14:29.418 CEST [777453]: [48-1] user=,db=,client=,application= LOG:  background worker "parallel worker" (PID 848476) exited with exit code 1
> 2025-08-24 13:14:29.418 CEST [777453]: [49-1] user=,db=,client=,application= LOG:  background worker "parallel worker" (PID 848477) exited with exit code 1
>
> Has this issue already been observed?

Recently we also hit this problem.

I think that in their current state the check_synchronized_standby_slots() and validate_sync_standby_slots() functions are not very useful:
- When the hook is executed from the postmaster, it only checks that synchronized_standby_slots contains a valid list, but doesn't check that the replication slots exist, because MyProc is NULL. This happens both on start and on reload.
- When executed from other backends, set_config_with_handle() is called with elevel = 0, and therefore elevel becomes DEBUG3, which results in no useful error/warning messages.
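
To make the two points above concrete, here is a rough standalone model in C. It is not PostgreSQL source: the enum names, their numeric values, and both function names are made up for illustration, and the branch structure only mirrors my reading of set_config_with_handle() and the check hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for log levels; the numeric values are arbitrary. */
enum { LVL_UNSET = 0, LVL_DEBUG3 = 1, LVL_LOG = 2, LVL_WARNING = 3, LVL_ERROR = 4 };

/* Hypothetical, reduced set of GUC sources. */
typedef enum { SRC_DEFAULT, SRC_FILE, SRC_DATABASE, SRC_SESSION } GucSourceModel;

/*
 * Model of how the reporting level is picked when the caller passes
 * elevel = 0. An explicit level (e.g. ERROR while restoring GUC state in
 * a parallel worker) is used as given.
 */
static int
resolve_elevel(int elevel, GucSourceModel source, bool is_under_postmaster)
{
    if (elevel != LVL_UNSET)
        return elevel;
    if (source == SRC_DEFAULT || source == SRC_FILE)
        return is_under_postmaster ? LVL_DEBUG3 : LVL_LOG;
    if (source == SRC_DATABASE)
        return LVL_WARNING;
    return LVL_ERROR;
}

/*
 * Model of the check hook. The list is assumed to be already split and
 * syntactically valid. Slot existence is verified only when the process
 * has a PGPROC (has_pgproc), which is not the case in the postmaster.
 */
static bool
check_slots_model(const char *const *wanted, size_t n_wanted,
                  const char *const *existing, size_t n_existing,
                  bool has_pgproc)
{
    assert(wanted != NULL || n_wanted == 0);

    if (!has_pgproc)
        return true;            /* existence check is skipped entirely */

    for (size_t i = 0; i < n_wanted; i++)
    {
        bool found = false;

        for (size_t j = 0; j < n_existing; j++)
        {
            if (strcmp(wanted[i], existing[j]) == 0)
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;       /* "replication slot ... does not exist" */
    }
    return true;
}
```

In this model the postmaster accepts "a1,b1" without looking at the slots, a regular backend re-reading the file reports the failure only at the lowest level, and a caller that passes an explicit error level turns the same failed check into a hard error.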

There are a couple of places where check_synchronized_standby_slots() failure is not ignored:
1. alter system set synchronized_standby_slots='invalid value';
2. Parallel workers, because RestoreGUCState() calls set_config_option_ext()->set_config_with_handle() with elevel=ERROR. As a result, parallel workers fail to start with the error.

With parallel workers it is actually even worse: we get the error even on a standby:
1. Start a standby with synchronized_standby_slots referring to non-existing slots
2. SET parallel_setup_cost, parallel_tuple_cost, and min_parallel_table_scan_size to 0
3. Run  select * from pg_namespace; and observe the following error:
ERROR:  invalid value for parameter "synchronized_standby_slots": "a1,b1"
DETAIL:  replication slot "a1" does not exist
CONTEXT:  while setting parameter "synchronized_standby_slots" to "a1,b1"
parallel worker

One can argue at length that an invalid configuration must not be used, but the main problem solved by synchronized_standby_slots is delaying logical decoding at a certain LSN until it has been sent to enough physical standbys.
This feature must not affect normal queries, only logical replication.

Please find attached a patch fixing the problem for parallel workers.

Regards,
--
Alexander Kukushkin
