Home > mailing lists

Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From	Masahiko Sawada
Subject	Re: Synchronizing slots from primary to standby
Date	February 20, 2024 12:49:08
Msg-id	CAD21AoB2ipSzQb5-o5pEYKie4oTPJTsYR1ip9_wRVrF6HbBWDQ@mail.gmail.com Whole thread Raw
In response to	Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>)
Responses	Re: Synchronizing slots from primary to standby Re: Synchronizing slots from primary to standby
List	pgsql-hackers

Tree view

On Tue, Feb 20, 2024 at 12:33 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > I've reviewed the v91 patch. Here are random comments:
>
> Thanks for the comments.
>
> > ---
> >  /*
> >   * Checks the remote server info.
> >   *
> > - * We ensure that the 'primary_slot_name' exists on the remote server and the
> > - * remote server is not a standby node.
> > + * Check whether we are a cascading standby. For non-cascading standbys, it
> > + * also ensures that the 'primary_slot_name' exists on the remote server.
> >   */
> >
> > IIUC what the validate_remote_info() does doesn't not change by this
> > patch, so the previous comment seems to be clearer to me.
> >
> > ---
> >     if (remote_in_recovery)
> > +   {
> > +       /*
> > +        * If we are a cascading standby, no need to check further for
> > +        * 'primary_slot_name'.
> > +        */
> >         ereport(ERROR,
> >                 errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >                 errmsg("cannot synchronize replication slots from a
> > standby server"));
> > +   }
> > +   else
> > +   {
> > +       bool        primary_slot_valid;
> >
> > -   primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
> > -   Assert(!isnull);
> > +       primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
> > +       Assert(!isnull);
> >
> > -   if (!primary_slot_valid)
> > -       ereport(ERROR,
> > -               errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > -               errmsg("bad configuration for slot synchronization"),
> > -       /* translator: second %s is a GUC variable name */
> > -               errdetail("The replication slot \"%s\" specified by
> > \"%s\" does not exist on the primary server.",
> > -                         PrimarySlotName, "primary_slot_name"));
> > +       if (!primary_slot_valid)
> > +           ereport(ERROR,
> > +                   errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                   errmsg("bad configuration for slot synchronization"),
> > +           /* translator: second %s is a GUC variable name */
> > +                   errdetail("The replication slot \"%s\" specified
> > by \"%s\" does not exist on the primary server.",
> > +                             PrimarySlotName, "primary_slot_name"));
> > +   }
> >
> > I think it's a refactoring rather than changes required by the
> > slotsync worker. We can do that in a separate patch but why do we need
> > this change in the first place?
>
> In v90, this refactoring was made due to the fact that
> validate_remote_info() was supposed to behave differently for SQL
> function and slot-sync worker. SQL-function was supposed to ERROR out
> while the worker was supposed to LOG and become no-op. And thus the
> change was needed. In v91, we made this functionality same i.e. both
> sql function and worker will error out but missed to remove this
> refactoring. Thanks for catching this, I will revert it in the next
> version. To match the refactoring, I made the comment change too, will
> revert that as well.
>
> > ---
> > +        ValidateSlotSyncParams(ERROR);
> > +
> >          /* Load the libpq-specific functions */
> >          load_file("libpqwalreceiver", false);
> >
> > -        ValidateSlotSyncParams();
> > +        (void) CheckDbnameInConninfo();
> >
> > Is there any reason why we move where to check the parameters?
>
> Earlier DBname verification was done inside ValidateSlotSyncParams()
> and thus it was needed to load 'libpqwalreceiver' before we call this
> function. Now we have moved dbname verification in a separate call and
> thus the above order got changed. ValidateSlotSyncParams() is a common
> function used by SQL function and worker. Earlier slot sync worker was
> checking all the GUCs after starting up and was exiting each time any
> GUC was invalid. It was suggested that it would be better to check the
> GUCs before starting the slot sync worker in the postmaster itself,
> making the ValidateSlotSyncParams() move to postmaster (see
> SlotSyncWorkerAllowed).  But it was not a good idea to load libpq in
> postmaster and thus we moved libpq related verification out of
> ValidateSlotSyncParams(). This resulted in the above change.  I hope
> it answers your query.

Thank you for the explanation. It makes sense to me to move the check.

As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I
have two comments:

1. The error messages are not very descriptive and seem not to match
other messages the postmaster says. When starting the standby server
with misconfiguration about the slotsync, I got the following messages
from the postmaster:

2024-02-20 17:01:16.356 JST [456741] LOG:  database system is ready to
accept read-only connections
2024-02-20 17:01:16.358 JST [456741] LOG:  bad configuration for slot
synchronization
2024-02-20 17:01:16.358 JST [456741] HINT:  "hot_standby_feedback"
must be enabled.

It says "bad configuration" but is still working, and does not say
further information such as whether it skipped to start the slotsync
worker etc. I think these messages could work for the slotsync worker
but we might want to have more descriptive messages for the
postmaster. For example, "skipped starting slot sync worker because
hot_standby_feedback is disabled".

2. If the wal_level is not logical, the server will need to restart
anyway to change the wal_level and have the slotsync worker work. Does
it make sense to have the postmaster exit if the wal_level is not
logical and sync_replication_slots is enabled? For instance, we have
similar checks in PostmsaterMain():

    if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
        ereport(ERROR,
                (errmsg("WAL cannot be summarized when wal_level is
\"minimal\"")));

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

pgsql-hackers by date:

From: Peter Eisentraut
Date: 20 February 2024, 12:40:27
Subject: Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

From: Masahiko Sawada
Date: 20 February 2024, 13:10:44
Subject: Re: Synchronizing slots from primary to standby

Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

Previous

Next