Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAD21AoB2ipSzQb5-o5pEYKie4oTPJTsYR1ip9_wRVrF6HbBWDQ@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Tue, Feb 20, 2024 at 12:33 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've reviewed the v91 patch. Here are random comments: > > Thanks for the comments. > > > --- > > /* > > * Checks the remote server info. > > * > > - * We ensure that the 'primary_slot_name' exists on the remote server and the > > - * remote server is not a standby node. > > + * Check whether we are a cascading standby. For non-cascading standbys, it > > + * also ensures that the 'primary_slot_name' exists on the remote server. > > */ > > > > IIUC what the validate_remote_info() does doesn't not change by this > > patch, so the previous comment seems to be clearer to me. > > > > --- > > if (remote_in_recovery) > > + { > > + /* > > + * If we are a cascading standby, no need to check further for > > + * 'primary_slot_name'. > > + */ > > ereport(ERROR, > > errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > errmsg("cannot synchronize replication slots from a > > standby server")); > > + } > > + else > > + { > > + bool primary_slot_valid; > > > > - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > > - Assert(!isnull); > > + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > > + Assert(!isnull); > > > > - if (!primary_slot_valid) > > - ereport(ERROR, > > - errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > - errmsg("bad configuration for slot synchronization"), > > - /* translator: second %s is a GUC variable name */ > > - errdetail("The replication slot \"%s\" specified by > > \"%s\" does not exist on the primary server.", > > - PrimarySlotName, "primary_slot_name")); > > + if (!primary_slot_valid) > > + ereport(ERROR, > > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("bad configuration for slot synchronization"), > > + /* translator: second %s is a GUC variable name */ > > + errdetail("The replication slot \"%s\" specified > > by \"%s\" does not exist on the primary server.", > > + PrimarySlotName, "primary_slot_name")); > > + } > > > > I think it's a refactoring rather than changes required by the > > slotsync worker. We can do that in a separate patch but why do we need > > this change in the first place? > > In v90, this refactoring was made due to the fact that > validate_remote_info() was supposed to behave differently for SQL > function and slot-sync worker. SQL-function was supposed to ERROR out > while the worker was supposed to LOG and become no-op. And thus the > change was needed. In v91, we made this functionality same i.e. both > sql function and worker will error out but missed to remove this > refactoring. Thanks for catching this, I will revert it in the next > version. To match the refactoring, I made the comment change too, will > revert that as well. > > > --- > > + ValidateSlotSyncParams(ERROR); > > + > > /* Load the libpq-specific functions */ > > load_file("libpqwalreceiver", false); > > > > - ValidateSlotSyncParams(); > > + (void) CheckDbnameInConninfo(); > > > > Is there any reason why we move where to check the parameters? > > Earlier DBname verification was done inside ValidateSlotSyncParams() > and thus it was needed to load 'libpqwalreceiver' before we call this > function. Now we have moved dbname verification in a separate call and > thus the above order got changed. ValidateSlotSyncParams() is a common > function used by SQL function and worker. Earlier slot sync worker was > checking all the GUCs after starting up and was exiting each time any > GUC was invalid. It was suggested that it would be better to check the > GUCs before starting the slot sync worker in the postmaster itself, > making the ValidateSlotSyncParams() move to postmaster (see > SlotSyncWorkerAllowed). But it was not a good idea to load libpq in > postmaster and thus we moved libpq related verification out of > ValidateSlotSyncParams(). This resulted in the above change. I hope > it answers your query. Thank you for the explanation. It makes sense to me to move the check. As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I have two comments: 1. The error messages are not very descriptive and seem not to match other messages the postmaster says. When starting the standby server with misconfiguration about the slotsync, I got the following messages from the postmaster: 2024-02-20 17:01:16.356 JST [456741] LOG: database system is ready to accept read-only connections 2024-02-20 17:01:16.358 JST [456741] LOG: bad configuration for slot synchronization 2024-02-20 17:01:16.358 JST [456741] HINT: "hot_standby_feedback" must be enabled. It says "bad configuration" but is still working, and does not say further information such as whether it skipped to start the slotsync worker etc. I think these messages could work for the slotsync worker but we might want to have more descriptive messages for the postmaster. For example, "skipped starting slot sync worker because hot_standby_feedback is disabled". 2. If the wal_level is not logical, the server will need to restart anyway to change the wal_level and have the slotsync worker work. Does it make sense to have the postmaster exit if the wal_level is not logical and sync_replication_slots is enabled? For instance, we have similar checks in PostmsaterMain(): if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL) ereport(ERROR, (errmsg("WAL cannot be summarized when wal_level is \"minimal\""))); Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: