Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | CAD21AoDH5x-Wt=x=xrrrD99+BYp30fwH0FHaxx6HvEfcPrxNxA@mail.gmail.com Whole thread Raw |
In response to | RE: Conflict detection for update_deleted in logical replication ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Thu, Aug 14, 2025 at 10:46 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, August 14, 2025 11:46 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Wed, Aug 13, 2025 at 4:15 PM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > On Wed, Aug 13, 2025 at 10:41 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > Here is the V61 patch set which addressed above comments and the > > comment by Nisha[2]. > > > > > > > > > > Thank You for the patch. I tested the patch, please find a few comments: > > > > > > 1) > > > Now when it stops-retention and later resumes it due to the fact that > > > max_duration is meanwhile altered to 0, I get log: > > > > > > LOG: logical replication worker for subscription "sub1" resumes > > > retaining the information for detecting conflicts > > > DETAIL: The time spent applying changes up to LSN 0/17DD728 is now > > > within the maximum limit of 0 ms. > > > > > > I did not get which lsn it is pointing to? Is it some dangling lsn > > > from when it was retaining info? Also the msg looks odd, when it says > > > 'is now within the maximum limit of 0 ms.' > > > > > > 2) > > > While stopping the message is: > > > LOG: logical replication worker for subscription "sub1" will stop > > > retaining conflict information > > > DETAIL: The time spent advancing the non-removable transaction ID has > > > exceeded the maximum limit of 1000 ms. > > > > > > And while resuming: > > > logical replication worker for subscription "sub1" resumes retaining > > > the information for detecting conflicts > > > ---------- > > > > > > We can make both similar. Both can have 'retaining the information for > > > detecting conflicts' instead of 'conflict information' in first one. > > > > > > 3) > > > I believe the tenses should also be updated. When stopping, we can say: > > > > > > Logical replication worker for subscription "sub1" has stopped... > > > > > > This is appropriate because it has already stopped by pre-setting > > > oldest_nonremovable_xid to Invalid. > > > > > > When resuming, we can say: > > > Logical replication worker for subscription "sub1" will resume... > > > > > > This is because it will begin resuming from the next cycle onward, > > > specifically after the launcher sets its oldest_xid. > > > > > > 4) > > > For the DETAIL part of resume and stop messages, how about these: > > > > > > The retention duration for information used in conflict detection has > > > exceeded the limit of xx. > > > The retention duration for information used in conflict detection is > > > now within the acceptable limit of xx. > > > The retention duration for information used in conflict detection is > > > now indefinite. > > > > > Thanks for the comments, I have adjusted the log messages > according to the suggestions. > > > > > > 5) > > Say there 2-3 subs, all have stopped-retention and the slot is set to have invalid > > xmin; now if I create a new sub, it will start with stopped-flag set to true due to > > the fact that slot has invalid xmin to begin with. But then immediately, it will > > dump a resume message. It looks odd, as at first, it has not even stopped, as it > > is a new sub. > > Is there anything we can do to improve this situation? > > I changed the logic to recovery the slot immediately on starting a new worker > that has retain_dead_tuples enabled. > > Here is the V62 patch set which addressed above comments and [1]. > Here are review comments on v62 patch: Regarding the subscription-level option vs. GUC, I don't disagree with the current approach. For the record, while I agree that the subscription-level option is more consistent with the existing retain_dead_tuples option and can work for different requirements, my biggest concern is that if users set different values to different subscriptions, they might think it doesn't work as expected. This is because even if a subscription disables retaining dead tuples due to max_conflict_retention_duration, the database cluster doesn't stop retaining dead tuples unless all other subscriptions disable it, meaning that the performance on that database might not get recovered. I proposed the GUC parameter as I thought it's less confusing from a user perspective. I'm not sure it's sufficient to mention that in the documentation but I don't have a better idea. --- + else if (IsSet(supported_opts, SUBOPT_MAX_CONFLICT_RETENTION_DURATION) && + strcmp(defel->defname, "max_conflict_retention_duration") == 0) + { + if (IsSet(opts->specified_opts, SUBOPT_MAX_CONFLICT_RETENTION_DURATION)) + errorConflictingDefElem(defel, pstate); + + opts->specified_opts |= SUBOPT_MAX_CONFLICT_RETENTION_DURATION; + opts->maxconflretention = defGetInt32(defel); + } The new subscription parameter accepts only integers and takes it as milliseconds, but I think it would be relatively rare that users specify this parameter to less than 1 second. I guess it would be a good idea to accept string representation of a duration too such as '10 min' like we do for parsing GUC parameter values. --- +static void +notify_ineffective_max_conflict_retention(bool update_maxconflretention) +{ + ereport(NOTICE, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + update_maxconflretention + ? errmsg("max_conflict_retention_duration has no effect when retain_dead_tuples is disabled") + : errmsg("disabling retain_dead_tuples will render max_conflict_retention_duration ineffective")); +} Given that max_conflict_retention_duration works only when retain_dead_tuples is enabled, why not merge these two parameters? For example, setting max_conflict_retention_duration=-1 means to disable retaining dead tuples behavior and =0 means that dead tuples are retained until they are no longer needed for detection purposes. --- + /* + * Only the leader apply worker manages conflict retention (see + * maybe_advance_nonremovable_xid() for details). + */ + if (!isParallelApplyWorker(&worker) && !isTablesyncWorker(&worker)) + values[10] = TransactionIdIsValid(worker.oldest_nonremovable_xid); + else + nulls[10] = true; I think that adding a new column to the pg_stat_subscription view should be implemented in a separate patch since it needs to bump the catalog version while introducing max_conflict_retention_duration subscription option doesn't require that. --- Even if an apply worker disables retaining dead tuples due to max_conflict_retention_duration, it enables again after the server restarts. I guess restarting a server doesn't necessarily mean that the subscription caught up to the publisher and can resume retaining dead tuples but is this behavior intentional? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: