RE: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: How can end users know the cause of LR slot sync delays?
Date
Msg-id OSCPR01MB14966A618A8C61EC3DEE486A4F517A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: How can end users know the cause of LR slot sync delays?  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
Responses Re: How can end users know the cause of LR slot sync delays?
List pgsql-hackers
Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

Below are my comments for your patch.

01.
```
+        Number of times the slot sync is skipped.
```

"slot sync" should be "slot synchronization". Same thing can be saied for other
attributes.

02.
```
+        Reason of the last slot sync skip.
```

Possible values must be clarified.

03.
```
+            s.slot_sync_skip_count,
+            s.last_slot_sync_skip,
+                       s.slot_sync_skip_reason,
```

Last line has tab-blank but others have space-blank.

04.
```
+typedef enum SlotSyncSkipReason
+{
+       SLOT_SYNC_SKIP_NONE,            /* No skip */
+       SLOT_SYNC_SKIP_STANDBY_BEHIND,  /* Standby is behind the remote slot */
+       SLOT_SYNC_SKIP_REMOTE_BEHIND,   /* Remote slot is behind the local slot */
+       SLOT_SYNC_SKIP_NO_CONSISTENT_SNAPSHOT   /* Standby could not reach a
+                                                                                        * consistent snapshot */
+}                      SlotSyncSkipReason
```

a.
Can we add comment atop the enum?

b.
SLOT_SYNC_SKIP_STANDBY_BEHIND is misleading; it indicates the standby server has
not received WALs required by slots, but enum does not clarify it.
How about SLOT_SYNC_SKIP_MISSING_WAL_RECORDS or something? Better naming are very
welcome.

c
s/reach/build/.

05.
```
@@ -646,11 +652,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
                                           remote_slot->name,
                                           LSN_FORMAT_ARGS(latestFlushPtr)));
 
-               return false;
+               /* If the slot is not present on the local */
+               if (!(slot = SearchNamedReplicationSlot(remote_slot->name, true)))
+                       return false;
        }
```

Looks like if users try to sync slots via SQL interfaces, the statistics cannot
be updated.

06. update_slot_sync_skip_stats
```
+       /* Update the slot sync reason */
+       SpinLockAcquire(&slot->mutex);
+       slot->slot_sync_skip_reason = skip_reason;
+       SpinLockRelease(&slot->mutex);
```

I feel no need to update the reason if the slot->slot_sync_skip_reason
and skip_reason are the same.

07. synchronize_one_slot
```
+               /*
+                * If standby is behind remote slot and the synced slot is present on
+                * local.
+                */
+               if (remote_slot->confirmed_lsn > latestFlushPtr)
+               {
+                       if (synced)
+                               update_slot_sync_skip_stats(slot, skip_reason);
+                       return false;
+               }
```

This condition exist in the same function; can we combine?

08. GetSlotSyncSkipReason()

Do we have to do pstrdup() here? I found a similar function get_snapbuild_state_desc(),
and it does not use.

09.

Can you consider a test for added codes?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: someone else to do the list of acknowledgments
Next
From: Bruce Momjian
Date:
Subject: Re: PG 18 release notes draft committed