Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication - Mailing list pgsql-hackers

From Chao Li
Subject Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication
Date
Msg-id 2995625E-BF51-431B-BF17-BAF064005E74@gmail.com
Whole thread Raw
In response to Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers

> On Jan 7, 2026, at 11:36, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Mon, Dec 29, 2025 at 9:45 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
>>
>> Hi,
>>
>> Thanks for the patch updates.
>>
>> On 26/12/2025 10:28, Fujii Masao wrote:
>>
>> Maybe it's better to use slurp_file(). We already have wait_for_log() to
>> wait for a message in the cluster's log file, but there's no helper function
>> to wait for specific content to appear in an arbitrary file.
>>
>> To support waiting for output in pg_recvlogical's output file,
>> I added a new helper that uses slurp_file() (see the attached 0002 patch).
>> I also updated the 0003 patch (the pg_recvlogical reconnection test) to
>> use this helper instead of pg_read_file(). Thoughts?
>>
>> Agreed, nice addition.
>>
>> I applied the v3-000* patch set and it builds successfully and passes the tests on my laptop.
>>
>> However the CI seems not completely happy yet, with previous 2 runs not green for Windows. Could it be there's an
issuewith executing the test on Windows? 
>
> Thanks for the report!
>
> The TAP test failed on Windows because it attempted to terminate
> pg_recvlogical using a TERM signal, which isn't available there.
> As a result, the test waited indefinitely for pg_recvlogical to exit
> and finally timed out.
>
> To address this, I updated the 0003 patch so that the test passes
> --endpos to pg_recvlogical on Windows only. This allows pg_recvlogical
> to terminate without signals, by generating WAL until the current
> position reaches the specified end position. OTOH, on non-Windows
> platforms, the test continues to use signals to terminate pg_recvlogical.
>
> This approach may be somewhat unstable. If there's a more robust
> way to terminate pg_recvlogical on Windows, I'd be happy to switch
> to it, but I couldn't come up with a better option.
>
> Updated patches are attached.
>
> Regards,
>
> --
> Fujii Masao
>
<v4-0004-pg_recvlogical-remove-unnecessary-OutputFsync-ret.patch><v4-0003-Add-test-for-pg_recvlogical-reconnection-behavior.patch><v4-0001-pg_recvlogical-Prevent-flushed-data-from-being-re.patch><v4-0002-Add-a-new-helper-function-wait_for_file-to-Utils..patch>

Hi Fujii-san,

Thanks for the patch. Here are my comments on v4.

1 - 0001
```
+    /*
+     * Save the last flushed position as the replication start point. On
+     * reconnect, replication resumes from there to avoid re-sending flushed
+     * data.
+     */
+    startpos = output_fsync_lsn;
```

Looking at function OutputFsync(), fsync() may fail and there a few branches to return early without fsync(), so should
weonly update startpos after fsync()? 

2 - 0001
```
+        if (outfd != -1 && strcmp(outfile, "-") != 0)
+            OutputFsync(feGetCurrentTimestamp());
```

Do we need to check return code of OutputFsync()? I checked this file, the only caller that doesn’t check return code
ofOutputFsync() has a comment for why: 
```
        /* no need to jump to error on failure here, we're finishing anyway */
        OutputFsync(t);
```

I saw 0004 has changed OutputFsync to return nothing. So, it’s ok to not adding a comment here. But I just feel that,
ifwe want to make the commit self-contained, it would be better to add a comment here, but that’s not a strong opinion. 


3 - 0001

No a direct issue of this patch. I noticed that, everywhere that calls OutputFsync(), it checks outfd != -1 &&
strcmp(outfile,"-") != 0. However, two places miss the check: 

```
        if (outfd != -1 &&
            feTimestampDifferenceExceeds(output_last_fsync, now,
                                         fsync_interval))
        {
            if (!OutputFsync(now))
                goto error;
        }
```
This place doesn’t check strcmp(outfile, "-") != 0.

```
static bool
flushAndSendFeedback(PGconn *conn, TimestampTz *now)
{
    /* flush data to disk, so that we send a recent flush pointer */
    if (!OutputFsync(*now))
        return false;
    *now = feGetCurrentTimestamp();
    if (!sendFeedback(conn, *now, true, false))
        return false;

    return true;
}
```
flushAndSendFeedback() doesn’t check the both conditions. I don’t why the checks can be skipped.

I want to hear your opinion. If you consider this is a problem, then you may address it in this patch; or you want me
toaddress it in a separate patch? 

4 - 0002
```
+    croak "timed out waiting for match: $regexp”;
```

Is it more helpful to include filename in the error message?

5 - 0003
```
+my ($stdout, $stderr);
+my $recv = IPC::Run::start(
+    [@pg_recvlogical_cmd],
+    '>' => \$stdout,
+    '2>' => \$stderr);
```

$stdout and $stderr are never used.

6 - 0004 LGTM. As OutputFsync() only returned true, making it as void makes the code neater.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: global temporary table (GTT) - are there some ideas how to implement it?
Next
From: Peter Smith
Date:
Subject: Re: Proposal: Conflict log history table for Logical Replication