Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Bharath Rupireddy |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CALj2ACV+VX9McnogGNyFCjZW+qnPvdmjnBjttotygs8+7D5JuA@mail.gmail.com |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> Thanks Bharat for letting us know. It is okay to split the patch, it
> may definitely help to understand the modules better but shall we take
> a step back and try to reevaluate the design first before moving to
> other tasks?

Agree that design comes first. FWIW, I'm attaching the v9 patch set that I have with me. It can't be a perfect patch set until the design is finalized.

> I analyzed more on the issues stated in [1] for replacing LIST_SLOTS
> with SELECT query. On rethinking, it might not be a good idea to
> replace this cmd with SELECT in Launcher code-path

I think there are fundamental design aspects still open that we should settle before optimizing LIST_SLOTS, see below. I'm sure we can come back to this later.

> Secondly, I was thinking if the design proposed in the patch is the
> best one. No doubt, it is the most simplistic design and thus may
> .......... Any feedback is appreciated.

Here are my thoughts about this feature:

Current design:

1. On the primary, never allow walsenders associated with logical replication slots to go ahead of physical standbys that are candidates for future primary after failover. This enables subscribers to connect to the new primary after failover.

2. On all candidate standbys, periodically sync logical slots from the primary (creating the slots if necessary), with one slot sync worker per logical slot.

Important considerations:

1. Does this design guarantee that the row versions required by subscribers aren't removed on candidate standbys, as raised here - https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de? It seems safe with the logical decoding on standbys feature. Also, a test case from upthread is already in the patch sets (in v9 too) - https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com. However, we need to verify the use cases extensively.

2. All candidate standbys will start one slot sync worker per logical slot, which might not be scalable. Is having one worker (or a few more, not necessarily one for each logical slot) for all logical slots enough? It seems safe to have one worker for all logical slots - it's not a problem even if the worker takes a bit of time to get to syncing a logical slot on a candidate standby, because the standby is guaranteed to retain all the WAL and row versions required to decode and send to the logical slots.

3. Indefinite waiting of logical walsenders for candidate standbys may not be a good idea. Is having a timeout for logical walsenders a good idea? A problem with a timeout is that it can make logical slots unusable after failover.

4. All candidate standbys retain the WAL required by logical slots. The amount of WAL retained may be huge if the logical replication subscribers are lagging. This is a typical problem with replication, so there's not much this feature can do to prevent WAL file accumulation except ask users to monitor replication lag and WAL file growth.

5. Logical subscribers' replication lag will depend on the replication lag of all candidate standbys. If the candidate standbys are far from the primary and the logical subscribers are close to it, the logical subscribers will still see replication lag. There's not much this feature can do to prevent this except call it out in the documentation.

6. This feature might need to prevent the GUCs from deviating between the primary and the candidate standbys - there's no point in syncing a logical slot on candidate standbys if the logical walsender related to it on the primary isn't keeping itself behind all the candidate standbys. If preventing this proves to be tough, calling out in the documentation that the GUCs must be kept the same is a good start.

7. There are some important review comments provided upthread as far as this design and the patches are concerned - https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de and https://www.postgresql.org/message-id/20220207203222.22aktwxrt3fcllru%40alap3.anarazel.de. I'm sure we can come back to these once the design is clear.

Please feel free to add to the list if I'm missing anything.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
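
For illustration only, the constraint in point 1 of the current design (logical walsenders never go ahead of the candidate standbys) can be sketched as taking the minimum flush position across the candidate standbys. This is a minimal Python sketch, not code from the patch set: the function names and the mapping of standby names to flush LSNs are hypothetical, and the only thing assumed about PostgreSQL is the textual LSN format of two hexadecimal numbers separated by a slash.

```python
from typing import Mapping, Optional


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/16B3748' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit integer back into the 'X/X' textual LSN form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def logical_send_limit(standby_flush_lsns: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: the highest LSN a logical walsender may send.

    The limit is the minimum flush LSN over all candidate standbys, so no
    subscriber ever receives changes that a candidate standby lacks.
    Returns None when there are no candidate standbys (no restriction).
    """
    if not standby_flush_lsns:
        return None
    # Compare numerically; comparing the LSN strings directly would be wrong.
    return format_lsn(min(parse_lsn(lsn) for lsn in standby_flush_lsns.values()))


if __name__ == "__main__":
    flushed = {"standby_a": "1/0", "standby_b": "0/FFFFFFFF"}
    print(logical_send_limit(flushed))
```

The sketch also makes consideration 5 visible: the limit is set by the slowest candidate standby, so one lagging standby holds back every logical subscriber.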