Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Drouvot, Bertrand |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | 40e1d73f-a09d-4608-8efa-750c998c3287@gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
|
List | pgsql-hackers |
Hi, On 12/5/23 6:08 AM, shveta malik wrote: > On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> Maybe another option could be to have the walreceiver a way to let the slot sync >> worker knows that it (the walreceiver) was not able to start due to non existing >> replication slot on the primary? (that way we'd avoid the slot sync worker having >> to talk to the primary). > > Few points: > 1) I think if we do it, we should do it in generic way i.e. slotsync > worker should go to no-op if walreceiver is not able to start due to > any reason and not only due to invalid primary_slot_name. Agree. > 2) Secondly, slotsync worker needs to make sure it has synced the > slots so far i.e. worker should not go to no-op immediately on seeing > missing WalRcv process if there are pending slots to be synced. Agree. > So the generic way I see to have this optimization is: > 1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver > is running or not. Not sure that would work because the walreceiver keeps try re-starting and so get a pid before reaching the "could not start WAL streaming: ERROR: replication slot "XXXX" does not exist" error. We may want to add an extra check on walrcv->walRcvState (or should/could be enough by its own). But walrcv->walRcvState is set to WALRCV_STREAMING way before walrcv_startstreaming(). Wouldn't that make sense to move it once we are sure that walrcv_startstreaming() returns true and first_stream is true, here? " if (first_stream) + { ereport(LOG, (errmsg("started streaming WAL from primary at %X/%X on timeline %u", LSN_FORMAT_ARGS(startpoint), startpointTLI))); + SpinLockAcquire(&walrcv->mutex); + walrcv->walRcvState = WALRCV_STREAMING; + SpinLockRelease(&walrcv->mutex); + } " > 2) Slotsync worker should check null 'WalRcv->pid' only when > no-activity is observed for threshold time i.e. it can do it during > existing logic of increasing naptime. > 3) On finding null 'WalRcv->pid', worker can mark a flag to go to > no-op unless WalRcv->pid becomes valid again. Marking this flag during > increasing naptime will guarantee that the worker has taken all the > changes so far i.e. standby is not lagging in terms of slots. > 2) and 3) looks good to me but with a check on walrcv->walRcvState looking for WALRCV_STREAMING state instead of looking for a non null WalRcv->pid. And only if it makes sense to move the walrcv->walRcvState = WALRCV_STREAMING as mentioned above (I think it does). Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: