Thread: Syncrep and improving latency due to WAL throttling
Hi,

Attached is a proposal/idea by Tomas (in CC) for protecting and prioritizing OLTP latency on syncrep over other heavy WAL-writing sessions. This is the result of internal testing and research related to syncrep behavior done by Tomas, Alvaro and me.

The main objective of this work-in-progress/crude patch idea (with the GUC disabled by default) is to prevent build-up of lag against the synchronous standby. It allows the DBA to maintain stable latency for many backends while some WAL-heavy activity (full table rewrite, VACUUM, MV build, archiving, etc.) is happening in parallel. In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).

Demo. Given:
- two DBs in a syncrep configuration with an artificial 10ms RTT between them (introduced via tc qdisc netem)
- insertBIG.sql = "insert into bandwidthhog select repeat('c', 1000) from generate_series(1, 500000);" (50MB of WAL data)
- pgbench (8c) and 1x INSERT session

There are clearly visible stalls in the pgbench (OLTP) traffic -- throughput drops and latency spikes -- when the WAL socket is saturated.

With 16devel/master and synchronous_commit_flush_wal_after=0 (disabled, default/baseline):

postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 59.0 tps, lat 18.840 ms stddev 11.251, 0 failed, lag 0.059 ms
progress: 2.0 s, 48.0 tps, lat 14.332 ms stddev 4.272, 0 failed, lag 0.063 ms
progress: 3.0 s, 56.0 tps, lat 15.383 ms stddev 6.270, 0 failed, lag 0.061 ms
progress: 4.0 s, 51.0 tps, lat 15.104 ms stddev 5.850, 0 failed, lag 0.061 ms
progress: 5.0 s, 47.0 tps, lat 15.184 ms stddev 5.472, 0 failed, lag 0.063 ms
progress: 6.0 s, 23.0 tps, lat 88.495 ms stddev 141.845, 0 failed, lag 0.064 ms
progress: 7.0 s, 1.0 tps, lat 999.053 ms stddev 0.000, 0 failed, lag 0.077 ms
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed, lag 0.000 ms
progress: 9.0 s, 1.0 tps, lat 2748.142 ms stddev NaN, 0 failed, lag 0.072 ms
progress: 10.0 s, 68.1 tps, lat 3368.267 ms stddev 282.842, 0 failed, lag 2911.857 ms
progress: 11.0 s, 97.0 tps, lat 2560.750 ms stddev 216.844, 0 failed, lag 2478.261 ms
progress: 12.0 s, 96.0 tps, lat 1463.754 ms stddev 376.276, 0 failed, lag 1383.873 ms
progress: 13.0 s, 94.0 tps, lat 616.243 ms stddev 230.673, 0 failed, lag 527.241 ms
progress: 14.0 s, 59.0 tps, lat 48.265 ms stddev 72.533, 0 failed, lag 15.181 ms
progress: 15.0 s, 39.0 tps, lat 14.237 ms stddev 6.073, 0 failed, lag 0.063 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 931.383 ms
latency stddev = 1188.530 ms
rate limit schedule lag: avg 840.170 (max 3605.569) ms

session2 output:

postgres=# \i insertBIG.sql
Timing is on.
INSERT 0 500000
Time: 4119.485 ms (00:04.119)

This new GUC makes it possible for the OLTP traffic to be less affected (latency-wise) when the heavy bulk traffic hits.
With synchronous_commit_flush_wal_after=1024 (kB) it's way better, but latency still rises up to ~45ms:

postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 52.0 tps, lat 17.300 ms stddev 10.178, 0 failed, lag 0.061 ms
progress: 2.0 s, 51.0 tps, lat 19.490 ms stddev 12.626, 0 failed, lag 0.061 ms
progress: 3.0 s, 48.0 tps, lat 14.839 ms stddev 5.429, 0 failed, lag 0.061 ms
progress: 4.0 s, 53.0 tps, lat 24.635 ms stddev 13.449, 0 failed, lag 0.062 ms
progress: 5.0 s, 48.0 tps, lat 17.999 ms stddev 9.291, 0 failed, lag 0.062 ms
progress: 6.0 s, 57.0 tps, lat 21.513 ms stddev 17.011, 0 failed, lag 0.058 ms
progress: 7.0 s, 50.0 tps, lat 28.071 ms stddev 21.622, 0 failed, lag 0.061 ms
progress: 8.0 s, 45.0 tps, lat 27.244 ms stddev 11.975, 0 failed, lag 0.064 ms
progress: 9.0 s, 57.0 tps, lat 35.988 ms stddev 25.752, 0 failed, lag 0.057 ms
progress: 10.0 s, 56.0 tps, lat 45.478 ms stddev 39.831, 0 failed, lag 0.534 ms
progress: 11.0 s, 62.0 tps, lat 45.146 ms stddev 32.881, 0 failed, lag 0.058 ms
progress: 12.0 s, 51.0 tps, lat 24.250 ms stddev 12.405, 0 failed, lag 0.063 ms
progress: 13.0 s, 57.0 tps, lat 18.554 ms stddev 8.677, 0 failed, lag 0.060 ms
progress: 14.0 s, 44.0 tps, lat 15.923 ms stddev 6.958, 0 failed, lag 0.065 ms
progress: 15.0 s, 54.0 tps, lat 19.773 ms stddev 10.024, 0 failed, lag 0.063 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 25.575 ms
latency stddev = 21.540 ms

session2 output:

postgres=# set synchronous_commit_flush_wal_after = 1024;
SET
postgres=# \i insertBIG.sql
INSERT 0 500000
Time: 8889.318 ms (00:08.889)

With 16devel/master and synchronous_commit_flush_wal_after=256 (kB) all is smooth:

postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 49.0 tps, lat 14.345 ms stddev 4.700, 0 failed, lag 0.062 ms
progress: 2.0 s, 45.0 tps, lat 14.812 ms stddev 5.816, 0 failed, lag 0.064 ms
progress: 3.0 s, 49.0 tps, lat 13.145 ms stddev 4.320, 0 failed, lag 0.063 ms
progress: 4.0 s, 44.0 tps, lat 14.429 ms stddev 4.715, 0 failed, lag 0.063 ms
progress: 5.0 s, 49.0 tps, lat 18.111 ms stddev 8.536, 0 failed, lag 0.062 ms
progress: 6.0 s, 58.0 tps, lat 17.929 ms stddev 8.198, 0 failed, lag 0.060 ms
progress: 7.0 s, 65.0 tps, lat 20.186 ms stddev 12.973, 0 failed, lag 0.059 ms
progress: 8.0 s, 47.0 tps, lat 16.174 ms stddev 6.508, 0 failed, lag 0.061 ms
progress: 9.0 s, 45.0 tps, lat 14.485 ms stddev 4.736, 0 failed, lag 0.061 ms
progress: 10.0 s, 53.0 tps, lat 16.879 ms stddev 8.783, 0 failed, lag 0.061 ms
progress: 11.0 s, 42.0 tps, lat 13.711 ms stddev 4.464, 0 failed, lag 0.062 ms
progress: 12.0 s, 49.0 tps, lat 13.252 ms stddev 4.082, 0 failed, lag 0.062 ms
progress: 13.0 s, 48.0 tps, lat 14.179 ms stddev 6.238, 0 failed, lag 0.058 ms
progress: 14.0 s, 43.0 tps, lat 12.210 ms stddev 2.993, 0 failed, lag 0.060 ms
progress: 15.0 s, 34.0 tps, lat 14.811 ms stddev 6.544, 0 failed, lag 0.062 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 15.454 ms
latency stddev = 7.354 ms

session2 output (notice the INSERT took much longer but did NOT affect pgbench's latency stddev at all):

postgres=# set synchronous_commit_flush_wal_after = 256;
SET
postgres=# \i insertBIG.sql
Timing is on.
[..]
INSERT 0 500000
Time: 22737.729 ms (00:22.738)

Without this feature (or with synchronous_commit_flush_wal_after=0) the TCP SendQ on the walsender-->walreceiver socket keeps growing, and as a result any subsequent sendto() on behalf of the OLTP backends/walwriter ends up queued for too long, causing stalls of activity.

-Jakub Wartak
Hi,

On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).

Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:

> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>     pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>     pgWalUsage.wal_records++;
>     pgWalUsage.wal_fpi += num_fpi;
> +
> +   backendWalInserted += rechdr->xl_tot_len;
> +
> +   if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE)&&
> +       synchronous_commit_flush_wal_after > 0 &&
> +       backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
> +   {
> +       elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
> +       XLogFlush(EndPos);
> +       /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
> +       /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
> +       SyncRepWaitForLSN(EndPos, false);
> +       elog(DEBUG3, "throttling WAL down on this session - end");
> +       backendWalInserted = 0;
> +   }
>  }

You're blocking in the middle of an XLOG insertion. We will commonly hold important buffer lwlocks, it'll often be in a critical section (no cancelling / terminating the session!). This'd entail loads of undetectable deadlocks (i.e. hard hangs). And even leaving that aside, doing an unnecessary XLogFlush() with important locks held will seriously increase contention.

My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.

I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.

Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?

A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the last XLogFlush() will increase quickly, triggering a flush at the next opportunity. But short OLTP transactions will do XLogFlush()es at least at every commit.

I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.

Greetings,

Andres Freund
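For illustration, the flag-and-interrupt shape described above might look roughly like the following sketch. All function and variable names here are invented (throttle_threshold_bytes stands in for whatever limit a GUC would provide); this is not code from the posted patch, only a sketch of the structure being discussed:

/* Sketch only -- illustrative names, not from the posted patch. */
static uint64   backendWalSinceFlush = 0;   /* WAL bytes inserted since this backend last flushed */
static bool     walThrottlePending = false;

/* At the end of XLogInsertRecord(): never wait here, only record the intent. */
static void
MaybeMarkWalThrottle(uint32 inserted_bytes)
{
    backendWalSinceFlush += inserted_bytes;

    if (backendWalSinceFlush > throttle_threshold_bytes)    /* hypothetical limit */
    {
        walThrottlePending = true;
        InterruptPending = true;    /* make CHECK_FOR_INTERRUPTS() call ProcessInterrupts() soon */
    }
}

/* Called from ProcessInterrupts(), i.e. only outside critical / interrupts-off sections. */
void
PerformWalThrottleIfPending(void)
{
    XLogRecPtr  target;

    if (!walThrottlePending)
        return;
    walThrottlePending = false;

    /* flush/wait only up to the last complete WAL page, never a partial one */
    target = XactLastRecEnd - (XactLastRecEnd % XLOG_BLCKSZ);

    XLogFlush(target);
    SyncRepWaitForLSN(target, false);
    backendWalSinceFlush = 0;
}

The key property is that the insert path only sets flags, so no sleeping or flushing happens while buffer locks are held or a critical section is open; the actual wait runs at the next interrupt-processing point, which is already guaranteed to be a safe place to block.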
On Thu, Jan 26, 2023 at 12:35 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> > In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).
>
> Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:

+1 for the feature as it keeps replication lag in check and provides a way for defining an RPO (recovery point objective).

> You're blocking in the middle of an XLOG insertion. We will commonly hold important buffer lwlocks, it'll often be in a critical section (no cancelling / terminating the session!). This'd entail loads of undetectable deadlocks (i.e. hard hangs). And even leaving that aside, doing an unnecessary XLogFlush() with important locks held will seriously increase contention.
>
> My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.

We've discussed this feature quite extensively in a recent thread - https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com. Folks were agreeing with Andres' suggestion there - https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de. However, that thread dropped off my radar. It's good that someone else has started working on it and I'm happy to help with this feature.

> I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.
>
> Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address.

Right.

> How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?

Seems reasonable.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the last XLogFlush() will increase quickly, triggering a flush at the next opportunity. But short OLTP transactions will do XLogFlush()es at least at every commit.

Right.

> I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.

When the WAL records are large, the backend inserting them would throttle frequently, generating more flushes, more waits for the sync replication ack, and more contention on WALWriteLock. Not sure if this is good unless the impact is measured.

A few more thoughts:

1. This feature keeps replication lag in check at the cost of throttling write traffic. I'd be curious to know the improvement in replication lag vs the drop in transaction throughput, say with pgbench running a custom insert script and one or more async/sync standbys.

2. Heavy WAL throttling can turn into a lot of writes and fsyncs. Eventually, each backend that generates WAL gets throttled, irrespective of whether there's any replication lag. How about we let backends throttle themselves not just based on wal_distance_from_last_flush but also depending on the replication lag at the moment, say, if replication lag crosses wal_throttle_replication_lag_threshold bytes?

3. Should the backends wait indefinitely for the sync rep ack when they have crossed wal_throttle_threshold? Well, I don't think so; a backend must be given a chance to do its stuff instead of favouring other backends by throttling itself.

4. I'd prefer adding a TAP test for this feature to check that the WAL throttle is picked up by a backend.

5. Can we also extend this WAL throttling feature to logical replication to keep replication lag in check there as well?

6. Backends can ignore throttling for WAL records marked as unimportant, no?

7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 1/25/23 20:05, Andres Freund wrote:
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
>> In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).
>
> Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:
>
>> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>>     pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>>     pgWalUsage.wal_records++;
>>     pgWalUsage.wal_fpi += num_fpi;
>> +
>> +   backendWalInserted += rechdr->xl_tot_len;
>> +
>> +   if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE)&&
>> +       synchronous_commit_flush_wal_after > 0 &&
>> +       backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
>> +   {
>> +       elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
>> +       XLogFlush(EndPos);
>> +       /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
>> +       /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
>> +       SyncRepWaitForLSN(EndPos, false);
>> +       elog(DEBUG3, "throttling WAL down on this session - end");
>> +       backendWalInserted = 0;
>> +   }
>>  }
>
> You're blocking in the middle of an XLOG insertion. We will commonly hold important buffer lwlocks, it'll often be in a critical section (no cancelling / terminating the session!). This'd entail loads of undetectable deadlocks (i.e. hard hangs). And even leaving that aside, doing an unnecessary XLogFlush() with important locks held will seriously increase contention.

Yeah, I agree the sleep would have to happen elsewhere. It's not clear to me how could it cause deadlocks, as we're not waiting for a lock/resource locked by someone else, but it's certainly an issue for uninterruptible hangs.

> My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.

The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.

> I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.

I don't see why would that significantly increase the number of flushes. The goal is to do this only every ~1MB of a WAL or so, and chances are there were many (perhaps hundreds or more) commits in between. At least in a workload with a fair number of OLTP transactions (which is kinda the point of all this).

And the "small" OLTP transactions don't really do any extra fsyncs, because the accounting resets at commit (or places that flush WAL).

Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse. But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).

> Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?

No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the last XLogFlush() will increase quickly, triggering a flush at the next opportunity. But short OLTP transactions will do XLogFlush()es at least at every commit.

Right, that's kinda the approach the patch is trying to do.

> I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.

True, that's kinda what I suggested above as a solution for partially filled WAL pages.

I agree this is crude and we'd probably need some sort of "balancing" logic that distributes WAL bandwidth between backends, and throttles backends producing a lot of WAL.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 1/25/23 20:05, Andres Freund wrote:
> > Hi,
> >
> > Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:

[..]

> > You're blocking in the middle of an XLOG insertion. [..]
> Yeah, I agree the sleep would have to happen elsewhere.

Fixed.

> > My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.
> >
> The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.

By the time you replied I had already tried what Andres recommended.

[..]

> > At the very least this'd have to flush only up to the last fully filled page.
> >
> Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse. But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).

I've introduced the rounding to the last written page (hopefully).

> > Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?
> >
> No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.

Fixed.

> > I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.
> >
> True, that's kinda what I suggested above as a solution for partially filled WAL pages.
>
> I agree this is crude and we'd probably need some sort of "balancing" logic that distributes WAL bandwidth between backends, and throttles backends producing a lot of WAL.

OK - that's not included (yet?), as it would make this much more complex.

In summary: Attached is a slightly reworked version of this patch.
1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
3. Removed GUC for now (always on->256kB); potentially to be restored
4. Resets backendWal counter on commits

It's still crude, but first tests indicate that it behaves similarly (to the initial one with GUC = 256kB).

Also from Bharath's email, I've found another patch proposal by Simon [1]. I would like to avoid opening that Pandora's box again, but to stress this: this feature is specifically aimed at solving OLTP latency on *sync*rep. (Some code could be added/generalized later on and this feature could be extended to the async case, but that opens the question of managing the possible WAL throughput budget/back throttling - this was also raised in the first thread here [2] by Konstantin.) IMHO it could implement various strategies under a user-settable GUC such as "wal_throttle_larger_transactions=(sync,256kB)" or "wal_throttle_larger_transactions=off", and someday later someone could teach the code the async case (let's say under "wal_throttle_larger_transactions=(asyncMaxRPO, maxAllowedLag=8MB, 256kB)").

Thoughts?

[1] - https://www.postgresql.org/message-id/flat/CA%2BU5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/71f3e6fb-2fca-a798-856a-f23c8ede2333%40garret.ru
Hi,

On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
> It's not clear to me how could it cause deadlocks, as we're not waiting for a lock/resource locked by someone else, but it's certainly an issue for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of lock-ordering rules.

> > My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.
> >
> The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively adding a separate set of "wal_delay()" points all over the executor, commands/, ... seems like a bad plan.

> > I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.
> >
> I don't see why would that significantly increase the number of flushes. The goal is to do this only every ~1MB of a WAL or so, and chances are there were many (perhaps hundreds or more) commits in between. At least in a workload with a fair number of OLTP transactions (which is kinda the point of all this).

Because the accounting is done separately in each process. Even if you just add a few additional flushes for each connection, in aggregate that'll be a lot.

> And the "small" OLTP transactions don't really do any extra fsyncs, because the accounting resets at commit (or places that flush WAL).
> [...]
> > Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?
> >
> No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of the patch resets the accounting at commit?

> Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1 could *not* make it worse. Suddenly you get a bunch of additional XLogFlush() calls to partial boundaries by every autovacuum, by every process doing a bit more bulky work. Because you're flushing the LSN after a record you just inserted, you're commonly not going to be "joining" the work of an already in-process XLogFlush().

> But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).

Yes, as I proposed...

> > I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.
> >
> True, that's kinda what I suggested above as a solution for partially filled WAL pages.

I think flushing only up to a point further in the past than the last page boundary is somewhat important to prevent unnecessary flushes.

Greetings,

Andres Freund
Hi,

On 2023-01-26 14:40:56 +0100, Jakub Wartak wrote:
> In summary: Attached is a slightly reworked version of this patch.
> 1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
> 2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
> 3. Removed GUC for now (always on->256kB); potentially to be restored

Huh? Why did you remove the GUC? Clearly this isn't something we can default to on.

> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
> index d85e313908..05d56d65f9 100644
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -2395,6 +2395,7 @@ CommitTransaction(void)
>
>     XactTopFullTransactionId = InvalidFullTransactionId;
>     nParallelCurrentXids = 0;
> +   backendWalInserted = 0;
>
>     /*
>      * done with commit processing, set current transaction state back to

I don't like the resets around like this. Why not just move it into XLogFlush()?

> +/*
> + * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section
> + */
> +void HandleXLogDelayPending()
> +{
> +   /* flush only up to the last fully filled page */
> +   XLogRecPtr LastFullyWrittenXLogPage = XactLastRecEnd - (XactLastRecEnd % XLOG_BLCKSZ);
> +   XLogDelayPending = false;

Hm - wonder if it'd be better to determine the LSN to wait for in XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but bounded number of) WAL records before processing interrupts. No need to flush more aggressively than necessary...

> +   //HOLD_INTERRUPTS();
> +
> +   /* XXX Debug for now */
> +   elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
> +       backendWalInserted,
> +       LSN_FORMAT_ARGS(XactLastRecEnd),
> +       LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
> +
> +   /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
> +   /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
> +   XLogFlush(LastFullyWrittenXLogPage);
> +   SyncRepWaitForLSN(LastFullyWrittenXLogPage, false);

SyncRepWaitForLSN() has this comment:

 * 'lsn' represents the LSN to wait for.  'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.

> +   elog(WARNING, "throttling WAL down on this session - end");
> +   backendWalInserted = 0;
> +
> +   //RESUME_INTERRUPTS();
> +}

I think we'd want a distinct wait event for this.

> diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> index 1b1d814254..8ed66b2eae 100644
> --- a/src/backend/utils/init/globals.c
> +++ b/src/backend/utils/init/globals.c
> @@ -37,6 +37,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
>  volatile sig_atomic_t ProcSignalBarrierPending = false;
>  volatile sig_atomic_t LogMemoryContextPending = false;
>  volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
> +volatile sig_atomic_t XLogDelayPending = false;
>  volatile uint32 InterruptHoldoffCount = 0;
>  volatile uint32 QueryCancelHoldoffCount = 0;
>  volatile uint32 CritSectionCount = 0;

I don't think the new variable needs to be volatile, or even sig_atomic_t. We'll just manipulate it from outside signal handlers.

Greetings,

Andres Freund
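To illustrate that review point about determining the LSN in XLogInsertRecord() rather than deriving it later from XactLastRecEnd, a rough sketch of the insert-side code could look like this. XLogDelayTargetLSN and throttle_threshold_bytes are invented names for illustration (backendWalInserted and XLogDelayPending come from the quoted patch); this is not the posted patch itself:

/* Sketch only: in XLogInsertRecord(), after the record has been inserted. */
backendWalInserted += rechdr->xl_tot_len;

if (throttle_threshold_bytes > 0 &&
    backendWalInserted > throttle_threshold_bytes &&
    !XLogDelayPending)
{
    /*
     * Capture the wait target now, rounded down to the last complete page.
     * Records inserted between here and the next safe CHECK_FOR_INTERRUPTS()
     * then won't push the flush target forward, so we never flush more
     * aggressively than necessary.
     */
    XLogDelayTargetLSN = EndPos - (EndPos % XLOG_BLCKSZ);  /* hypothetical variable */
    XLogDelayPending = true;
    InterruptPending = true;
}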
Hi,

On 2023-01-26 13:33:27 +0530, Bharath Rupireddy wrote:
> 6. Backends can ignore throttling for WAL records marked as unimportant, no?

Why would that be a good idea? Not that it matters today, but those records still need to be flushed in case of a commit by another transaction.

> 7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.

I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?

Greetings,

Andres Freund
On 1/26/23 16:40, Andres Freund wrote:
> Hi,
>
> On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
>> It's not clear to me how could it cause deadlocks, as we're not waiting for a lock/resource locked by someone else, but it's certainly an issue for uninterruptible hangs.
>
> Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of lock-ordering rules.

Not sure what lock ordering issues you have in mind, but I agree it's not the right place for the sleep, no argument here.

>>> My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.
>>
>> The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.
>
> Where would such a point be, except for ProcessInterrupts()? Iteratively adding a separate set of "wal_delay()" points all over the executor, commands/, ... seems like a bad plan.

I haven't thought about where to do the work, TBH. ProcessInterrupts() may very well be a good place. I should have been clearer, but the main benefit of autovacuum-like throttling is IMHO that it involves all processes and a global limit, while the current approach is per-backend.

>>> I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.
>>
>> I don't see why would that significantly increase the number of flushes. The goal is to do this only every ~1MB of a WAL or so, and chances are there were many (perhaps hundreds or more) commits in between. At least in a workload with a fair number of OLTP transactions (which is kinda the point of all this).
>
> Because the accounting is done separately in each process. Even if you just add a few additional flushes for each connection, in aggregate that'll be a lot.

How come? Imagine the backend does flush only after generating e.g. 1MB of WAL. Small transactions won't do any additional flushes at all (because commit resets the accounting). Large transactions will do an extra flush every 1MB, so 16x per WAL segment. But in between there will be many commits from the small transactions. If we backoff to the last complete page, that should eliminate even most of these flushes.

So where would the additional flushes come from?

>> And the "small" OLTP transactions don't really do any extra fsyncs, because the accounting resets at commit (or places that flush WAL).
>> [...]
>>> Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?
>>
>> No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.
>
> I might have missed something, but I don't think the first version of the patch resets the accounting at commit?

Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test the behavior / demonstrate the effect.

>> Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse.
>
> Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1 could *not* make it worse. Suddenly you get a bunch of additional XLogFlush() calls to partial boundaries by every autovacuum, by every process doing a bit more bulky work. Because you're flushing the LSN after a record you just inserted, you're commonly not going to be "joining" the work of an already in-process XLogFlush().

Right. We do ~16 additional flushes per 16MB segment (or something like that, depending on the GUC value). Considering we do thousands of commits per segment, each of which does a flush, I don't see how this would make it measurably worse.

>> But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).
>
> Yes, as I proposed...

Right.

>>> I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.
>>
>> True, that's kinda what I suggested above as a solution for partially filled WAL pages.
>
> I think flushing only up to a point further in the past than the last page boundary is somewhat important to prevent unnecessary flushes.

Not sure I agree with that. Yes, we should not be doing flushes unless we need to, but OTOH we should not delay sending WAL to the standby too much - because that's what affects syncrep latency for small transactions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 26, 2023 at 9:21 PM Andres Freund <andres@anarazel.de> wrote:
>
> > 7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.
>
> I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?

I'm sorry, I couldn't get your point, can you please explain it a bit more?

Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-Jan-27, Bharath Rupireddy wrote:
> Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
>
> Keeping replication lag under check enables one to provide a better RPO guarantee

Hmm, but you can already do that by tuning walwriter, no?

--
Álvaro Herrera
On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Jan-27, Bharath Rupireddy wrote:
>
> > Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
> >
> > Keeping replication lag under check enables one to provide a better RPO guarantee
>
> Hmm, but you can already do that by tuning walwriter, no?

IIUC, one can increase wal_writer_delay or wal_writer_flush_after to control how much WAL the walwriter writes and flushes, so that the walsenders get a bit slower, thus improving replication lag. If my understanding is correct, what about the other backends doing WAL writes and flushes? How do we control those? I think tuning the walwriter alone may not help to keep replication lag under check. Even if it does, it requires manual intervention.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi, v2 is attached.

On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:

> Huh? Why did you remove the GUC?

After reading previous threads, my optimism level of getting it ever in shape of being widely accepted degraded significantly (mainly due to the discussion of the wider category of 'WAL I/O throttling', especially in the async case, RPO targets in the async case, and potentially calculating global bandwidth). I've assumed that it is a working sketch, and as such not having the GUC name settled right now (just for the sync case) would still allow covering various other async cases in the future without breaking compatibility with potential GUC name changes (see my previous "wal_throttle_larger_transactions=<strategies>" proposal). Although I've restored the synchronous_commit_flush_wal_after GUC in the v2 patch, sticking to such a name removes the possibility of using the code to achieve async WAL throttling in the future.

> Clearly this isn't something we can default to on.

Yes, I agree. It's never meant to be on by default.

> I don't like the resets around like this. Why not just move it into XLogFlush()?

Fixed.

> Hm - wonder if it'd be better to determine the LSN to wait for in XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but bounded number of) WAL records before processing interrupts. No need to flush more aggressively than necessary...

Fixed.

> SyncRepWaitForLSN() has this comment:
> * 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
> * represents a commit record. If it doesn't, then we wait only for the WAL
> * to be flushed if synchronous_commit is set to the higher level of
> * remote_apply, because only commit records provide apply feedback.

Hm, not sure if I understand: are you saying that we should (in the throttled scenario) have some special feedback msgs or not -- irrespective of the setting? To be honest the throttling shouldn't need to wait for the standby's full synchronous_commit level, it's just about the slowdown itself (so IMHO it would be fine even in the remote_write/remote_apply scenario if the remote walreceiver just received the data, without necessarily writing it to a file or waiting for it to be applied). Just this waiting for a round-trip ack about LSN progress would be enough to slow down the writer (?). I've added some timing log into the draft and it shows a more or less constant, solid RTT even as it stands:

psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6CB70B8 flushingTo=2/A6CB6000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session - end (10.500052 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6CF7C08 flushingTo=2/A6CF6000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session - end (10.655370 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6D385E0 flushingTo=2/A6D38000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session - end (10.627334 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6D78FA0 flushingTo=2/A6D78000)
[..]

> I think we'd want a distinct wait event for this.

Added and tested that it shows up.

> > +volatile sig_atomic_t XLogDelayPending = false;
>
> I don't think the new variable needs to be volatile, or even sig_atomic_t. We'll just manipulate it from outside signal handlers.

Changed to bool; previously I wanted it to "fit" with the above ones.

-J.
Hi Bharath,

On Fri, Jan 27, 2023 at 12:04 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2023-Jan-27, Bharath Rupireddy wrote:
> >
> > > Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
> > >
> > > Keeping replication lag under check enables one to provide a better RPO guarantee

Sorry for not answering earlier. Although the title of the thread refers to syncrep only, I think everyone would be open to also covering the more advanced async scenarios that Satyanarayana proposed in those earlier threads (as Simon did much earlier). I was proposing wal_throttle_larger_transactions=<..> (for lack of a better name); however, today's v2 of the patch again refers to syncrep (that could be changed, of course). It's just the code that is missing, and it could be added on top of v2, so we could combine our efforts. It's the competency and time that I lack for implementing such async-scenario code paths (maybe Tomas V. has something in mind with his own words [1]), so any feedback from senior hackers is more than welcome...

-J.

[1] - "The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place."
On 1/27/23 08:18, Bharath Rupireddy wrote:
> On Thu, Jan 26, 2023 at 9:21 PM Andres Freund <andres@anarazel.de> wrote:
>>
>>> 7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.
>>
>> I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?
>
> I'm sorry, I couldn't get your point, can you please explain it a bit more?

The idea is that we would not flush the exact current LSN, because that's likely somewhere in the page, and we always write the whole page which leads to write amplification.

But if we backed off a bit, and wrote e.g. to the last page boundary, that wouldn't have this issue (either the page was already flushed - noop, or we'd have to flush it anyway). We could even back off a bit more, to increase the probability it was actually flushed / sent to standby. That would still work, because the whole point is not to allow one process to generate too much unflushed WAL, forcing the other (small) xacts to wait at commit.

Imagine we have the limit set to 8MB, i.e. the backend flushes WAL after generating 8MB of WAL. If we flush to the exact current LSN, the other backends will wait for ~4MB on average. If we back off to 1MB, the wait average increases to ~4.5MB.

(This is simplified, as it ignores WAL from the small xacts. But those flush regularly, which limits the amount. It also ignores that there might be multiple large xacts.)

> Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?

This focuses on sync rep, because that's where the commit latency comes from. Async doesn't have that issue, because it doesn't wait for the standby. In particular, the trick is in penalizing the backends generating a lot of WAL, while leaving the small xacts alone.

> Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

Isn't that a bit over-complicated? RPO generally only cares about xacts that committed (because that's what you want to not lose), so why not to simply introduce a "sync mode" that simply uses a bit older LSN when waiting for the replica? Seems much simpler and similar to what we already do.

Yeah, if someone generates a lot of WAL in an uncommitted transaction, all of that may be lost. But who cares (from the RPO point of view)?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
>
> > Huh? Why did you remove the GUC?
>
> After reading previous threads, my optimism level of getting it ever in shape of being widely accepted degraded significantly (mainly due to the discussion of the wider category of 'WAL I/O throttling', especially in the async case, RPO targets in the async case, and potentially calculating global bandwidth).

I think it's quite reasonable to limit this to a smaller scope. Particularly because those other goals are pretty vague but ambitious. IMO the problem with a lot of the threads is precisely that they aimed at a level of generality that isn't achievable in one step.

> I've assumed that it is a working sketch, and as such not having the GUC name settled right now (just for the sync case) would still allow covering various other async cases in the future without breaking compatibility with potential GUC name changes (see my previous "wal_throttle_larger_transactions=<strategies>" proposal).

It's harder to benchmark something like this without a GUC, so I think it's worth having, even if it's not the final name.

> > SyncRepWaitForLSN() has this comment:
> > * 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
> > * represents a commit record. If it doesn't, then we wait only for the WAL
> > * to be flushed if synchronous_commit is set to the higher level of
> > * remote_apply, because only commit records provide apply feedback.
>
> Hm, not sure if I understand: are you saying that we should (in the throttled scenario) have some special feedback msgs or not -- irrespective of the setting? To be honest the throttling shouldn't need to wait for the standby's full synchronous_commit level, it's just about the slowdown itself (so IMHO it would be fine even in the remote_write/remote_apply scenario if the remote walreceiver just received the data, without necessarily writing it to a file or waiting for it to be applied). Just this waiting for a round-trip ack about LSN progress would be enough to slow down the writer (?). I've added some timing log into the draft and it shows a more or less constant, solid RTT even as it stands:

My problem was that the function header for SyncRepWaitForLSN() seems to say that we don't wait at all if commit=false and synchronous_commit < remote_apply. But I think that might just be bad formulation.

[...] 'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.

because the code does something entirely different afaics:

    /* Cap the level for anything other than commit to remote flush only. */
    if (commit)
        mode = SyncRepWaitMode;
    else
        mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);

Greetings,

Andres Freund
Hi,

On 2023-01-27 12:48:43 +0530, Bharath Rupireddy wrote:
> Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
>
> Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

I think something narrower and easier to achieve is a good thing. We've already had loads of discussion for the more general problem, without a lot of actual progress.

Greetings,

Andres Freund
Hi,

On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
> On 1/27/23 08:18, Bharath Rupireddy wrote:
> >> I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?
> >
> > I'm sorry, I couldn't get your point, can you please explain it a bit more?
>
> The idea is that we would not flush the exact current LSN, because that's likely somewhere in the page, and we always write the whole page which leads to write amplification.
>
> But if we backed off a bit, and wrote e.g. to the last page boundary, that wouldn't have this issue (either the page was already flushed - noop, or we'd have to flush it anyway).

Yep.

> We could even back off a bit more, to increase the probability it was actually flushed / sent to standby.

That's not the sole goal, from my end: I'd like to avoid writing out + flushing the WAL in too small chunks. Imagine a few concurrent vacuums or COPYs or such - if we're unlucky they'd each end up exceeding their "private" limit close to each other, leading to a number of small writes of the WAL. Which could end up increasing local commit latency / iops.

If we instead decide to only ever flush up to something like

  last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ

we'd make sure that the throttling mechanism won't cause a lot of small writes.

> > Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
>
> Isn't that a bit over-complicated? RPO generally only cares about xacts that committed (because that's what you want to not lose), so why not to simply introduce a "sync mode" that simply uses a bit older LSN when waiting for the replica? Seems much simpler and similar to what we already do.

I don't think that really helps you that much. If there's e.g. a huge VACUUM / COPY emitting loads of WAL you'll suddenly see commit latency of concurrently committing transactions spike into oblivion. Whereas a general WAL throttling mechanism would throttle the VACUUM, without impacting the commit latency of normal transactions.

Greetings,

Andres Freund
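Read literally, the back-off formula quoted above might be computed along these lines. This is a sketch only: throttle_pages stands in for whatever page count a GUC would translate to, and the function name and variables are invented, not taken from any posted patch:

/* Sketch: pick a flush/wait target some distance behind the last page
 * boundary, so concurrent bulk writers don't each force small WAL writes. */
static XLogRecPtr
WalThrottleFlushTarget(XLogRecPtr insert_lsn, int throttle_pages)
{
    XLogRecPtr  last_page_boundary = insert_lsn - (insert_lsn % XLOG_BLCKSZ);
    XLogRecPtr  backoff = (XLogRecPtr) (throttle_pages / 8) * XLOG_BLCKSZ;

    if (last_page_boundary <= backoff)
        return InvalidXLogRecPtr;   /* too close to the start of the WAL stream; skip throttling */

    return last_page_boundary - backoff;
}

A backend would then only flush/wait when this target actually advances past what it last flushed, which is what keeps the number of extra flushes proportional to the throttling threshold rather than to the number of WAL pages written.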
On 1/27/23 22:33, Andres Freund wrote: > Hi, > > On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote: >> On 1/27/23 08:18, Bharath Rupireddy wrote: >>>> I think my idea of only forcing to flush/wait an LSN some distance in the past >>>> would automatically achieve that? >>> >>> I'm sorry, I couldn't get your point, can you please explain it a bit more? >>> >> >> The idea is that we would not flush the exact current LSN, because >> that's likely somewhere in the page, and we always write the whole page >> which leads to write amplification. >> >> But if we backed off a bit, and wrote e.g. to the last page boundary, >> that wouldn't have this issue (either the page was already flushed - >> noop, or we'd have to flush it anyway). > > Yep. > > >> We could even back off a bit more, to increase the probability it was >> actually flushed / sent to standby. > > That's not the sole goal, from my end: I'd like to avoid writing out + > flushing the WAL in too small chunks. Imagine a few concurrent vacuums or > COPYs or such - if we're unlucky they'd each end up exceeding their "private" > limit close to each other, leading to a number of small writes of the > WAL. Which could end up increasing local commit latency / iops. > > If we instead decide to only ever flush up to something like > last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ > > we'd make sure that the throttling mechanism won't cause a lot of small > writes. > I'm not saying we shouldn't do this, but I still don't see how this could make a measurable difference. At least assuming a sensible value of the throttling limit (say, more than 256kB per backend), and OLTP workload running concurrently. That means ~64 extra flushes/writes per 16MB segment (at most). Yeah, a particular page might get unlucky and be flushed by multiple backends, but the average still holds. Meanwhile, the OLTP transactions will generate (at least) an order of magnitude more flushes. > >>> Keeping replication lag under check enables one to provide a better >>> RPO guarantee as discussed in the other thread >>> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com. >>> >> >> Isn't that a bit over-complicated? RPO generally only cares about xacts >> that committed (because that's what you want to not lose), so why not to >> simply introduce a "sync mode" that simply uses a bit older LSN when >> waiting for the replica? Seems much simpler and similar to what we >> already do. > > I don't think that really helps you that much. If there's e.g. a huge VACUUM / > COPY emitting loads of WAL you'll suddenly see commit latency of a > concurrently committing transactions spike into oblivion. Whereas a general > WAL throttling mechanism would throttle the VACUUM, without impacting the > commit latency of normal transactions. > True, but it solves the RPO goal which is what the other thread was about. IMHO it's useful to look at this as a resource scheduling problem: limited WAL bandwidth consumed by backends, with the bandwidth distributed using some sort of policy. The patch discussed in this thread uses fundamentally unfair policy, with throttling applied only on backends that produce a lot of WAL. And trying to leave the OLTP as unaffected as possible. The RPO thread seems to be aiming for a "fair" policy, providing the same fraction of bandwidth to all processes. This will affect all xacts the same way (sleeps proportional to amount of WAL generated by the xact). 
Perhaps we want such alternative scheduling policies, but it'll probably require something like the autovacuum throttling. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/27/23 22:19, Andres Freund wrote: > Hi, > > On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote: >> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote: >> >>> Huh? Why did you remove the GUC? >> >> After reading previous threads, my optimism level of getting it ever >> in shape of being widely accepted degraded significantly (mainly due >> to the discussion of wider category of 'WAL I/O throttling' especially >> in async case, RPO targets in async case and potentially calculating >> global bandwidth). > > I think it's quite reasonable to limit this to a smaller scope. Particularly > because those other goals are pretty vague but ambitious goals. IMO the > problem with a lot of the threads is precisely that that they aimed at a level > of generallity that isn't achievable in one step. > +1 to that -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jan 28, 2023 at 6:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> > That's not the sole goal, from my end: I'd like to avoid writing out +
> > flushing the WAL in too small chunks. Imagine a few concurrent vacuums or
> > COPYs or such - if we're unlucky they'd each end up exceeding their "private"
> > limit close to each other, leading to a number of small writes of the
> > WAL. Which could end up increasing local commit latency / iops.
> >
> > If we instead decide to only ever flush up to something like
> > last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ
> >
> > we'd make sure that the throttling mechanism won't cause a lot of small
> > writes.
> >
> I'm not saying we shouldn't do this, but I still don't see how this
> could make a measurable difference. At least assuming a sensible value
> of the throttling limit (say, more than 256kB per backend), and OLTP
> workload running concurrently. That means ~64 extra flushes/writes per
> 16MB segment (at most). Yeah, a particular page might get unlucky and be
> flushed by multiple backends, but the average still holds. Meanwhile,
> the OLTP transactions will generate (at least) an order of magnitude
> more flushes.

I think measuring the number of WAL flushes that postgres generates with and without this feature is a great way to know this feature's effect on IOPS. Probably it's even better with variations in synchronous_commit_flush_wal_after or the throttling limits.

On Sat, Jan 28, 2023 at 6:08 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/27/23 22:19, Andres Freund wrote:
> > Hi,
> >
> > On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> >> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
> >>
> >>> Huh? Why did you remove the GUC?
> >>
> >> After reading previous threads, my optimism level of getting it ever
> >> in shape of being widely accepted degraded significantly (mainly due
> >> to the discussion of wider category of 'WAL I/O throttling' especially
> >> in async case, RPO targets in async case and potentially calculating
> >> global bandwidth).
> >
> > I think it's quite reasonable to limit this to a smaller scope. Particularly
> > because those other goals are pretty vague but ambitious goals. IMO the
> > problem with a lot of the threads is precisely that that they aimed at a level
> > of generallity that isn't achievable in one step.
> >
>
> +1 to that

Okay, I agree to limit the scope first to synchronous replication.

Although v2 is a WIP patch, I have some comments:

1. Coding style doesn't conform to the Postgres standard:

+/*
+ * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section
+ */

80-line char limit

+void HandleXLogDelayPending()

A new line missing between function return type and function name

+ elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
+ backendWalInserted,
+ LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
+ LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));

Indentation issue - space needed in the next lines after elog(WARNING,..

+ elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff);

80-line char limit, timediff); be on the new line.

+ //RESUME_INTERRUPTS();
+ //HOLD_INTERRUPTS();

Multi-line comments are used elsewhere in the code. Better to run pgindent on the patch.

2. It'd be better to add a TAP test hitting the WAL throttling.

3. We generally don't need timings to be calculated explicitly when we emit before and after log messages.
If needed one can calculate the wait time from timestamps of the log messages. If it still needs an explicit time difference which I don't think is required, because we have a special event and before/after log messages, I think it needs to be under track_wal_io_timing to keep things simple. 4. XLogInsertRecord() can be a hot path for high-write workload, so effects of adding anything in it needs to be measured. So, it's better to run benchmarks with this feature enabled and disabled. 5. Missing documentation of this feature and the GUC. I think we can describe extensively why we'd need a feature of this kind in the documentation for better adoption or usability. 6. Shouldn't the max limit be MAX_KILOBYTES? + &synchronous_commit_flush_wal_after, + 0, 0, 1024L*1024L, 7. Can this GUC name be something like synchronous_commit_wal_throttle_threshold to better reflect what it does for a backend? + {"synchronous_commit_flush_wal_after", PGC_USERSET, REPLICATION_SENDING, 8. A typo - s/confiration/confirmation + gettext_noop("Sets the maximum logged WAL in kbytes, after which wait for sync commit confiration even without commit "), 9. This "Sets the maximum logged WAL in kbytes, after which wait for sync commit confiration even without commit " better be something like below? "Sets the maximum amount of WAL in kilobytes a backend generates after which it waits for synchronous commit confiration even without commit " 10. I think there's nothing wrong in adding some assertions in HandleXLogDelayPending(): Assert(synchronous_commit_flush_wal_after > 0); Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L); Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr); 11. Specify the reason in the comments as to why we had to do the following things: Here: + /* flush only up to the last fully filled page */ + XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd - (XactLastThrottledRecEnd % XLOG_BLCKSZ); Talk about how this avoids multiple-flushes for the same page. Here: + * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section + */ +void HandleXLogDelayPending() Talk about how waiting in a critical section, that is in XLogInsertRecord() causes locks to be held longer durations and other effects. Here: + /* WAL throttling */ + backendWalInserted += rechdr->xl_tot_len; Be a bit more verbose about why we try to throttle WAL and why only for sync replication, the advantages, effects etc. 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not clutter server logs. + /* XXX Debug for now */ + elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)", + backendWalInserted, + LSN_FORMAT_ARGS(XactLastThrottledRecEnd), + LSN_FORMAT_ARGS(LastFullyWrittenXLogPage)); + elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff); 13. BTW, we don't need to hold interrupts while waiting for sync replication ack as it may block query cancels or proc die pendings. You can remove these, unless there's a strong reason. + //HOLD_INTERRUPTS(); + //RESUME_INTERRUPTS(); 14. Add this wait event in the documentation. + case WAIT_EVENT_SYNC_REP_THROTTLED: + event_name = "SyncRepThrottled"; + break; 15. Why can XLogDelayPending not be a volatile atomic variable? Is it because it's not something being set within a signal handler? extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending; +extern PGDLLIMPORT bool XLogDelayPending; 16. 
+ instr_time wal_throttle_time_start, wal_throttle_time_end;
+ double timediff;
+ XLogDelayPending = false;

An extra line is needed after the variable declarations and assignment.

17. I think adding how many times a backend throttled WAL to pg_stat_activity is a good metric.

18. Can you avoid the need for the new functions SyncRepWaitForLSNInternal and SyncRepWaitForLSNThrottled by relying on the global throttling state to determine the correct wait event in SyncRepWaitForLSN?

19. I think measuring the number of WAL flushes that postgres generates with and without this feature is a great way to know this feature's effect on IOPS. Probably it's even better with variations in synchronous_commit_flush_wal_after or the throttling limits.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
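For orientation, here is a rough sketch of what a HandleXLogDelayPending() along the lines discussed above could look like. This is a hypothetical reconstruction from the descriptions in this thread, not the actual v2 code; backendWalInserted, XactLastThrottledRecEnd, XLogDelayPending and the GUC are the patch's names, everything else is existing PostgreSQL API.

    void
    HandleXLogDelayPending(void)
    {
        XLogRecPtr  flush_to;

        Assert(synchronous_commit_wal_throttle_threshold > 0);
        Assert(!XLogRecPtrIsInvalid(XactLastThrottledRecEnd));

        /*
         * Flush only up to the last fully written page, so we don't write
         * the same partially filled page over and over.
         */
        flush_to = XactLastThrottledRecEnd -
            (XactLastThrottledRecEnd % XLOG_BLCKSZ);

        XLogDelayPending = false;

        /* Flush locally first, then wait for the sync standby as a commit would. */
        XLogFlush(flush_to);
        SyncRepWaitForLSN(flush_to, false); /* false = not a commit record */

        /* Start accumulating towards the next throttling point from zero. */
        backendWalInserted = 0;
    }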
On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hi Bharath, thanks for reviewing.

> I think measuring the number of WAL flushes with and without this
> feature that the postgres generates is great to know this feature
> effects on IOPS. Probably it's even better with variations in
> synchronous_commit_flush_wal_after or the throttling limits.

It's the same as point 19, so I covered it there.

> Although v2 is a WIP patch, I have some comments:
> 1. Coding style doesn't confirm to the Postgres standard:

Fixed all of those that you mentioned, and I've removed the elog() and the timing code. BTW: is there a way to pgindent only my changes (on git diff?)

> 2. It'd be better to add a TAP test hitting the WAL throttling.

TODO, any hints on what that test should look like are welcome (checking pg_stat_wal?). I could spot only one place for it -- src/test/recovery/007_sync_rep.pl

> 3. We generally don't need timings to be calculated explicitly when we
> emit before and after log messages. If needed one can calculate the
> wait time from timestamps of the log messages. If it still needs an
> explicit time difference which I don't think is required, because we
> have a special event and before/after log messages, I think it needs
> to be under track_wal_io_timing to keep things simple.

Removed, it was just a debugging aid to demonstrate the effect in the waiting session.

> 4. XLogInsertRecord() can be a hot path for high-write workload, so
> effects of adding anything in it needs to be measured. So, it's better
> to run benchmarks with this feature enabled and disabled.

When the GUC is off (0 / default), in my tests the impact is none (it's just a set of simple IFs). However, if the feature is enabled then the INSERT is slowed down: I've repeated the initial test from the first post, and the single-statement INSERT producing 50MB of WAL still takes 4-5s with the feature disabled, and ~23s +/- 1s when the feature is activated with an RTT of 10.1ms, but that's intentional.

> 5. Missing documentation of this feature and the GUC. I think we can
> describe extensively why we'd need a feature of this kind in the
> documentation for better adoption or usability.

TODO, I'm planning on adding documentation when we'll be a little bit closer to adding this to the CF.

> 6. Shouldn't the max limit be MAX_KILOBYTES?
> + &synchronous_commit_flush_wal_after,
> + 0, 0, 1024L*1024L,

Fixed.

> 7. Can this GUC name be something like
> synchronous_commit_wal_throttle_threshold to better reflect what it
> does for a backend?
> + {"synchronous_commit_flush_wal_after", PGC_USERSET,
> REPLICATION_SENDING,

Done.

> 8. A typo - s/confiration/confirmation
[..]
> 9. This
> "Sets the maximum logged WAL in kbytes, after which wait for sync
> commit confiration even without commit "
> better be something like below?
> "Sets the maximum amount of WAL in kilobytes a backend generates after
> which it waits for synchronous commit confiration even without commit
> "

Fixed as you have suggested.

> 10. I think there's nothing wrong in adding some assertions in
> HandleXLogDelayPending():
> Assert(synchronous_commit_flush_wal_after > 0);
> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);

Sure, added.

> 11.
Specify the reason in the comments as to why we had to do the > following things: > Here: > + /* flush only up to the last fully filled page */ > + XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd > - (XactLastThrottledRecEnd % XLOG_BLCKSZ); > Talk about how this avoids multiple-flushes for the same page. > > Here: > + * Called from ProcessMessageInterrupts() to avoid waiting while > being in critical section > + */ > +void HandleXLogDelayPending() > Talk about how waiting in a critical section, that is in > XLogInsertRecord() causes locks to be held longer durations and other > effects. Added. > Here: > + /* WAL throttling */ > + backendWalInserted += rechdr->xl_tot_len; > Be a bit more verbose about why we try to throttle WAL and why only > for sync replication, the advantages, effects etc. Added. > 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not > clutter server logs. > + /* XXX Debug for now */ > + elog(WARNING, "throttling WAL down on this session > (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)", > + backendWalInserted, > + LSN_FORMAT_ARGS(XactLastThrottledRecEnd), > + LSN_FORMAT_ARGS(LastFullyWrittenXLogPage)); > > + elog(WARNING, "throttling WAL down on this session - end (%f > ms)", timediff); OK, that's entirely removed. > 13. BTW, we don't need to hold interrupts while waiting for sync > replication ack as it may block query cancels or proc die pendings. > You can remove these, unless there's a strong reason. > + //HOLD_INTERRUPTS(); > + //RESUME_INTERRUPTS(); Sure, removed. However, one problem I've discovered is that we were hitting Assert(InterruptHoldoffCount > 0) in SyncRepWaitForLSN, so I've fixed that too. > 14. Add this wait event in the documentation. > + case WAIT_EVENT_SYNC_REP_THROTTLED: > + event_name = "SyncRepThrottled"; > + break; > > 15. Why can XLogDelayPending not be a volatile atomic variable? Is it > because it's not something being set within a signal handler? > extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending; > +extern PGDLLIMPORT bool XLogDelayPending; Added a comment explaining that. > 16. > + instr_time wal_throttle_time_start, wal_throttle_time_end; > + double timediff; > + XLogDelayPending = false; > An extra line needed after variable declaration and assignment. Fixed. > 17. I think adding how many times a backend throttled WAL to > pg_stat_activity is a good metric. Good idea, added; catversion and pgstat format id were bumped. I've also added it to the per-query EXPLAIN (WAL) so it logs something like "WAL: records=500000 bytes=529500000 throttled=2016" , however I would appreciate a better name proposal on how to name that. > 18. Can you avoid the need of new functions SyncRepWaitForLSNInternal > and SyncRepWaitForLSNThrottled by relying on the global throttling > state to determine the correct waitevent in SyncRepWaitForLSN? Done. > 19. I think measuring the number of WAL flushes with and without this > feature that the postgres generates is great to know this feature > effects on IOPS. Probably it's even better with variations in > synchronous_commit_flush_wal_after or the throttling limits. 
default =>
INSERT 0 500000
Time: 6996.340 ms (00:06.996)
-[ RECORD 1 ]----+------------------------------
wal_records      | 500001
wal_fpi          | 0
wal_bytes        | 529500034
wal_buffers_full | 44036
wal_write        | 44056
wal_sync         | 40
wal_write_time   | 317.991
wal_sync_time    | 2690.425
wal_throttled    | 0
stats_reset      | 2023-02-01 10:31:44.475651+01
and 1 call to probe_postgres:XLogFlush

set synchronous_commit_wal_throttle_threshold to '256kB'; =>
INSERT 0 500000
Time: 25476.155 ms (00:25.476)
-[ RECORD 1 ]----+------------------------------
wal_records      | 500001
wal_fpi          | 0
wal_bytes        | 529500034
wal_buffers_full | 0
wal_write        | 2062
wal_sync         | 2062
wal_write_time   | 180.177
wal_sync_time    | 1409.522
wal_throttled    | 2016
stats_reset      | 2023-02-01 10:32:01.955513+01
and 2017 calls to probe_postgres:XLogFlush

set synchronous_commit_wal_throttle_threshold to '1MB'; =>
INSERT 0 500000
Time: 10113.278 ms (00:10.113)
-[ RECORD 1 ]----+------------------------------
wal_records      | 642864
wal_fpi          | 0
wal_bytes        | 537929315
wal_buffers_full | 0
wal_write        | 560
wal_sync         | 559
wal_write_time   | 182.586
wal_sync_time    | 988.72
wal_throttled    | 504
stats_reset      | 2023-02-01 10:32:36.250678+01

Maybe we should avoid calling fsyncs for WAL throttling? (by teaching HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT sync when we are flushing just because of WAL throttling?) Would that still be safe?

-J.
On 2/1/23 11:04, Jakub Wartak wrote: > On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi Bharath, thanks for reviewing. > >> I think measuring the number of WAL flushes with and without this >> feature that the postgres generates is great to know this feature >> effects on IOPS. Probably it's even better with variations in >> synchronous_commit_flush_wal_after or the throttling limits. > > It's the same as point 19, so I covered it there. > >> Although v2 is a WIP patch, I have some comments: >> 1. Coding style doesn't confirm to the Postgres standard: > > Fixed all thoses that you mentioned and I've removed elog() and code > for timing. BTW: is there a way to pgindent only my changes (on git > diff?) > >> 2. It'd be better to add a TAP test hitting the WAL throttling. > > TODO, any hints on how that test should look like are welcome > (checking pg_stat_wal?) I've could spot only 1 place for it -- > src/test/recovery/007_sync_rep.pl > >> 3. We generally don't need timings to be calculated explicitly when we >> emit before and after log messages. If needed one can calculate the >> wait time from timestamps of the log messages. If it still needs an >> explicit time difference which I don't think is required, because we >> have a special event and before/after log messages, I think it needs >> to be under track_wal_io_timing to keep things simple. > > Removed, it was just debugging aid to demonstrate the effect in the > session waiting. > >> 4. XLogInsertRecord() can be a hot path for high-write workload, so >> effects of adding anything in it needs to be measured. So, it's better >> to run benchmarks with this feature enabled and disabled. > > When the GUC is off(0 / default), in my tests the impact is none (it's > just set of simple IFs), however if the feature is enabled then the > INSERT is slowed down (I've repeated the initial test from 1st post > and single-statement INSERT for 50MB WAL result is the same 4-5s and > ~23s +/- 1s when feature is activated when the RTT is 10.1ms, but > that's intentional). > >> 5. Missing documentation of this feature and the GUC. I think we can >> describe extensively why we'd need a feature of this kind in the >> documentation for better adoption or usability. > > TODO, I'm planning on adding documentation when we'll be a little bit > closer to adding to CF. > >> 6. Shouldn't the max limit be MAX_KILOBYTES? >> + &synchronous_commit_flush_wal_after, >> + 0, 0, 1024L*1024L, > > Fixed. > >> 7. Can this GUC name be something like >> synchronous_commit_wal_throttle_threshold to better reflect what it >> does for a backend? >> + {"synchronous_commit_flush_wal_after", PGC_USERSET, >> REPLICATION_SENDING, > > Done. > >> 8. A typo - s/confiration/confirmation > [..] >> 9. This >> "Sets the maximum logged WAL in kbytes, after which wait for sync >> commit confiration even without commit " >> better be something like below? >> "Sets the maximum amount of WAL in kilobytes a backend generates after >> which it waits for synchronous commit confiration even without commit >> " > > Fixed as you have suggested. > >> 10. I think there's nothing wrong in adding some assertions in >> HandleXLogDelayPending(): >> Assert(synchronous_commit_flush_wal_after > 0); >> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L); >> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr); > > Sure, added. > >> 11. 
Specify the reason in the comments as to why we had to do the >> following things: >> Here: >> + /* flush only up to the last fully filled page */ >> + XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd >> - (XactLastThrottledRecEnd % XLOG_BLCKSZ); >> Talk about how this avoids multiple-flushes for the same page. >> >> Here: >> + * Called from ProcessMessageInterrupts() to avoid waiting while >> being in critical section >> + */ >> +void HandleXLogDelayPending() >> Talk about how waiting in a critical section, that is in >> XLogInsertRecord() causes locks to be held longer durations and other >> effects. > > Added. > >> Here: >> + /* WAL throttling */ >> + backendWalInserted += rechdr->xl_tot_len; >> Be a bit more verbose about why we try to throttle WAL and why only >> for sync replication, the advantages, effects etc. > > Added. > >> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not >> clutter server logs. >> + /* XXX Debug for now */ >> + elog(WARNING, "throttling WAL down on this session >> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)", >> + backendWalInserted, >> + LSN_FORMAT_ARGS(XactLastThrottledRecEnd), >> + LSN_FORMAT_ARGS(LastFullyWrittenXLogPage)); >> >> + elog(WARNING, "throttling WAL down on this session - end (%f >> ms)", timediff); > > OK, that's entirely removed. > >> 13. BTW, we don't need to hold interrupts while waiting for sync >> replication ack as it may block query cancels or proc die pendings. >> You can remove these, unless there's a strong reason. >> + //HOLD_INTERRUPTS(); >> + //RESUME_INTERRUPTS(); > > Sure, removed. However, one problem I've discovered is that we were > hitting Assert(InterruptHoldoffCount > 0) in SyncRepWaitForLSN, so > I've fixed that too. > >> 14. Add this wait event in the documentation. >> + case WAIT_EVENT_SYNC_REP_THROTTLED: >> + event_name = "SyncRepThrottled"; >> + break; >> >> 15. Why can XLogDelayPending not be a volatile atomic variable? Is it >> because it's not something being set within a signal handler? >> extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending; >> +extern PGDLLIMPORT bool XLogDelayPending; > > Added a comment explaining that. > >> 16. >> + instr_time wal_throttle_time_start, wal_throttle_time_end; >> + double timediff; >> + XLogDelayPending = false; >> An extra line needed after variable declaration and assignment. > > Fixed. > >> 17. I think adding how many times a backend throttled WAL to >> pg_stat_activity is a good metric. > > Good idea, added; catversion and pgstat format id were bumped. I've > also added it to the per-query EXPLAIN (WAL) so it logs something like > "WAL: records=500000 bytes=529500000 throttled=2016" , however I would > appreciate a better name proposal on how to name that. > >> 18. Can you avoid the need of new functions SyncRepWaitForLSNInternal >> and SyncRepWaitForLSNThrottled by relying on the global throttling >> state to determine the correct waitevent in SyncRepWaitForLSN? > > Done. > >> 19. I think measuring the number of WAL flushes with and without this >> feature that the postgres generates is great to know this feature >> effects on IOPS. Probably it's even better with variations in >> synchronous_commit_flush_wal_after or the throttling limits. 
> > default => > INSERT 0 500000 > Time: 6996.340 ms (00:06.996) > -[ RECORD 1 ]----+------------------------------ > wal_records | 500001 > wal_fpi | 0 > wal_bytes | 529500034 > wal_buffers_full | 44036 > wal_write | 44056 > wal_sync | 40 > wal_write_time | 317.991 > wal_sync_time | 2690.425 > wal_throttled | 0 > stats_reset | 2023-02-01 10:31:44.475651+01 > and 1 call to probe_postgres:XLogFlush > > set synchronous_commit_wal_throttle_threshold to '256kB'; => > INSERT 0 500000 > Time: 25476.155 ms (00:25.476) > -[ RECORD 1 ]----+------------------------------ > wal_records | 500001 > wal_fpi | 0 > wal_bytes | 529500034 > wal_buffers_full | 0 > wal_write | 2062 > wal_sync | 2062 > wal_write_time | 180.177 > wal_sync_time | 1409.522 > wal_throttled | 2016 > stats_reset | 2023-02-01 10:32:01.955513+01 > and 2017 calls to probe_postgres:XLogFlush > > set synchronous_commit_wal_throttle_threshold to '1MB'; => > INSERT 0 500000 > Time: 10113.278 ms (00:10.113) > -[ RECORD 1 ]----+------------------------------ > wal_records | 642864 > wal_fpi | 0 > wal_bytes | 537929315 > wal_buffers_full | 0 > wal_write | 560 > wal_sync | 559 > wal_write_time | 182.586 > wal_sync_time | 988.72 > wal_throttled | 504 > stats_reset | 2023-02-01 10:32:36.250678+01 > I'm not quite sure how to interpret these numbers and what conclusions to draw. ~520MB of WAL is about 2000 x 256kB chunks, so it's not surprising we do ~2017 XLogFlush calls. But does that mean we did more I/O operations (like fsyncs)? Not necessarily, because something else might have done a flush too. For example, if there are concurrent sessions doing commit, that'll trigger fsync. And if we back off a bit (e.g. to the LSN of the last complete page), we'll simply piggyback on that and won't do any extra fsync or other IO. Of course, if there is no other concurrent activity triggering flushes, this will do extra (network) I/O and fsync. But that's expected and flushing data more often naturally reduces throughput. The impact is mostly proportional to network latency. The way I see this is "latency guarantee" - not allowing more unflushed (not confirmed by sync replica) WAL than can be sent/confirmed within some time limit. Of course, we don't have a way to specify this in time directly, so we have to specify amount of WAL instead. And stricter guarantees generally lead to lower throughput. I don't think there's a way around this, except for disabling the throttling by default - if someone enables that, he/she intentionally did that because latency is more important than throughput. > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching > HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when > we are flushing just because of WAL thortting ?) Would that still be > safe? It's not clear to me how could this work and still be safe. I mean, we *must* flush the local WAL first, otherwise the replica could get ahead (if we send unflushed WAL to replica and then crash). Which would be really bad, obviously. And we *have* to send the data to the sync replica, because that's the whole point of this thread. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
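Reading the threshold as a latency budget also suggests a back-of-the-envelope way to choose a value: the threshold is roughly the amount of WAL the link can transfer within the extra commit latency one is willing to tolerate. A purely illustrative helper follows; the name, signature and the decimal-unit arithmetic are assumptions, and nothing like this exists in the patch.

    /*
     * Rough sizing aid: how many kB of unreplicated WAL fit into a given
     * latency budget, for a link with the given bandwidth and round trip.
     * Uses decimal units, so treat the result as an estimate only.
     */
    static int
    throttle_threshold_kb(double budget_ms, double rtt_ms, double mb_per_s)
    {
        double  transfer_ms = budget_ms - rtt_ms;   /* time left to move data */

        if (transfer_ms <= 0)
            return 0;   /* budget below the RTT: throttling alone can't reach it */

        return (int) (mb_per_s * transfer_ms);      /* MB/s * ms == kB */
    }

For example, with a 100 MB/s link, a 10ms round trip and a 20ms latency budget this would suggest keeping no more than roughly 1MB of WAL unconfirmed per backend.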
On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
> > HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
> > we are flushing just because of WAL thortting ?) Would that still be
> > safe?
>
> It's not clear to me how could this work and still be safe. I mean, we
> *must* flush the local WAL first, otherwise the replica could get ahead
> (if we send unflushed WAL to replica and then crash). Which would be
> really bad, obviously.

Well, it was just a thought: in this particular test - with no other concurrent activity happening - we are fsync()ing uncommitted Heap/INSERT data much earlier than the final Transaction/COMMIT WAL record comes into play. I agree that some other concurrent backend's COMMIT could fsync it, but I was wondering if that's a sensible optimization to perform (so that issue_fsync() would be called only for commit/rollback records). I can imagine a scenario with 10 such concurrent backends running - all of them with this $thread-GUC set - but that would cause 20k unnecessary fsyncs (?) -- assuming a single HDD with IOlat=20ms and a standby capable of sync-ack < 0.1ms, close to 400s would be wasted just on local fsyncs. I don't have a strong opinion or in-depth knowledge on this, but that smells like IO waste.

-J.
On 2/1/23 14:40, Jakub Wartak wrote:
> On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>>> Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
>>> HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
>>> we are flushing just because of WAL thortting ?) Would that still be
>>> safe?
>>
>> It's not clear to me how could this work and still be safe. I mean, we
>> *must* flush the local WAL first, otherwise the replica could get ahead
>> (if we send unflushed WAL to replica and then crash). Which would be
>> really bad, obviously.
>
> Well it was just a thought: in this particular test - with no other
> concurrent activity happening - we are fsyncing() uncommitted
> Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
> record comes into play.

Right. I see it as testing (more or less) a worst-case scenario, measuring the impact on commands generating a lot of WAL. I'm not sure the slowdown comes from the extra fsyncs, though - I'd bet it's more about the extra waits for confirmations from the replica.

> I agree that some other concurrent backend's
> COMMIT could fsync it, but I was wondering if that's sensible
> optimization to perform (so that issue_fsync() would be called for
> only commit/rollback records). I can imagine a scenario with 10 such
> concurrent backends running - all of them with this $thread-GUC set -
> but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> would be wasted close to 400s just due to local fsyncs?). I don't have
> a strong opinion or in-depth on this, but that smells like IO waste.
>

Not sure what optimization you mean, but triggering the WAL flushes from a separate process would be beneficial. But we already do that, more or less - that's what the WAL writer is about, right? Maybe it's not aggressive enough or something, not sure.

But I think the backends still have to sleep at some point, so that they don't queue too much unflushed WAL - that's kinda the whole point, no?

The issue is more about triggering the throttling too early, before we hit the bandwidth limit. Which happens simply because we don't have a very good way to decide whether the latency is growing, so the patch just throttles everything.

Consider a replica on a network link with a 10ms round trip. Then commit latency can't really be better than 10ms, and throttling at that point can't really improve anything, it just makes it slower. Instead, we should measure the latency somehow, and only throttle when it increases. And probably make it proportional to the delta (so the further it is from the "minimal" latency, the more we'd throttle).

I'd imagine we'd measure the latency (or the wait for the sync replica) over reasonably short time windows (1/10 of a second?), and use that to drive the throttling. If the latency is below some acceptable value, don't throttle at all. If it increases, start throttling.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
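A minimal sketch of the latency-driven throttling Tomas outlines above, assuming we track a baseline (minimal) and a recent syncrep wait time per window somewhere; all names here are hypothetical and none of this exists in the posted patches.

    /*
     * Hypothetical: decide how hard to throttle based on how far the recent
     * syncrep wait time has drifted above the observed baseline.
     */
    static double
    throttle_factor(double recent_wait_ms, double baseline_wait_ms)
    {
        double  delta = recent_wait_ms - baseline_wait_ms;

        /* Latency is (nearly) at the baseline: the link keeps up, don't throttle. */
        if (delta <= 0.1 * baseline_wait_ms)
            return 0.0;

        /* Throttle proportionally to how much the latency has grown. */
        return delta / baseline_wait_ms;
    }

A caller could then, say once per ~100ms window, delay WAL-heavy backends by an amount proportional to the returned factor, instead of unconditionally flushing and waiting after every N kilobytes of WAL.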
On Thu, Feb 2, 2023 at 11:03 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> > I agree that some other concurrent backend's
> > COMMIT could fsync it, but I was wondering if that's sensible
> > optimization to perform (so that issue_fsync() would be called for
> > only commit/rollback records). I can imagine a scenario with 10 such
> > concurrent backends running - all of them with this $thread-GUC set -
> > but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> > HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> > would be wasted close to 400s just due to local fsyncs?). I don't have
> > a strong opinion or in-depth on this, but that smells like IO waste.
> >
>
> Not sure what optimization you mean,

Let me clarify, let's say something like below (on top of the v3), just to save IOPS:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2340,6 +2340,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
         if (sync_method != SYNC_METHOD_OPEN &&
             sync_method != SYNC_METHOD_OPEN_DSYNC)
         {
+            bool openedLogFile = false;
             if (openLogFile >= 0 &&
                 !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
                                  wal_segment_size))
@@ -2351,9 +2352,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
                 openLogTLI = tli;
                 openLogFile = XLogFileOpen(openLogSegNo, tli);
                 ReserveExternalFD();
+                openedLogFile = true;
             }
-            issue_xlog_fsync(openLogFile, openLogSegNo, tli);
+            /* can we bypass fsyncing() XLOG from the backend if
+             * we have been called without commit request?
+             * usually the feature will be off here (XLogDelayPending=false)
+             */
+            if(openedLogFile == true || XLogDelayPending == false)
+                issue_xlog_fsync(openLogFile, openLogSegNo, tli);
         }

plus maybe some additional logic to ensure that this micro-optimization for saving IOPS would not be enabled if the backend is calling XLogFlush()/XLogWrite() for an actual COMMIT record.

> But I think the backends still have to sleep at some point, so that they
> don't queue too much unflushed WAL - that's kinda the whole point, no?

Yes, but it could be sent to the standby and written locally without being fsynced locally (?) - provided that it was not a COMMIT - I'm just wondering whether that makes sense (Question 1).

> The issue is more about triggering the throttling too early, before we
> hit the bandwidth limit. Which happens simply because we don't have a
> very good way to decide whether the latency is growing, so the patch
> just throttles everything.

The maximum TCP bandwidth seems to fluctuate in the real world, I suppose, so it couldn't be a hard limit. On the other hand I can imagine operators setting "throttle-those-backends-if-global-WALlatencyORrate>XXX" (an administrative decision). That would be cool to have, but yes, it would require WAL latency and rate measurement first (which on its own would make a very nice addition to pg_stat_replication). One thing to note is that there could be many potential latencies (and WAL throughput rates) to consider (e.g. a quorum of 3 sync standbys having different latencies) - which one to choose? (Question 2)

I think we have simply reached a decision point: is the WIP/PoC good enough as it is (like Andres wanted and you +1'd), should it work as you propose, or should we maybe keep it as an idea for the future?

-J.
Hi, I keep getting occasional complaints about the impact of large/bulk transactions on latency of small OLTP transactions, so I'd like to revive this thread a bit and move it forward. Attached is a rebased v3, followed by 0002 patch with some review comments, missing comments and minor tweaks. More about that later ... It's been a couple months, and there's been a fair amount of discussion and changes earlier, so I guess it makes sense to post a summary, stating the purpose (and scope), and then go through the various open questions etc. goals ----- The goal is to limit the impact of large transactions (producing a lot of WAL) on small OLTP transactions, in a cluster with a sync replica. Imagine a backend executing single-row inserts, or something like that. The commit will wait for the replica to confirm the WAL, which may be expensive, but it's well determined by the network roundtrip. But then a large transaction comes, and inserts a lot of WAL (imagine a COPY which inserts 100MB of data, VACUUM, CREATE INDEX and so on). A small transaction may insert a COMMIT record right after this WAL chunk, and locally that's (mostly) fine. But with the sync replica it's much worse - we don't send WAL until it's flushed locally, and then we need to wait for the WAL to be sent, applied and confirmed by the replica. This takes time (depending on the bandwidth), and it may not happen until the small transaction does COMMIT (because we may not flush WAL from in-progress transaction very often). Jakub Wartak presented some examples of the impact when he started this thread, and it can be rather bad. Particularly for latency-sensitive applications. I plan to do more experiments with the current patch, but I don't have the results yet. scope ----- Now, let's talk about scope - what the patch does not aim to do. The patch is explicitly intended for syncrep clusters, not async. There have been proposals to also support throttling for async replicas, logical replication etc. I suppose all of that could be implemented, and I do see the benefit of defining some sort of maximum lag even for async replicas. But the agreement was to focus on the syncrep case, where it's particularly painful, and perhaps extend it in the future. I believe adding throttling for physical async replication should not be difficult - in principle we need to determine how far the replica got, and compare it to the local LSN. But there's likely complexity with defining which async replicas to look at, inventing a sensible way to configure this, etc. It'd be helpful if people interested in that feature took a look at this patch and tried extending etc. It's not clear to me what to do about disconnected replicas, though. We may not even know about them, if there's no slot (and how would we know what the slot is for). So maybe this would need a new GUC listing the interesting replicas, and all would need to be connected. But that's an availability issue, because then all replicas need to be connected. I'm not sure about logical replication, but I guess we could treat it similarly to async. But what I think would need to be different is handling of small transactions. For syncrep we automatically wait for those at commit, which means automatic throttling. But for async (and logical), it's trivial to cause ever-increasing lag with only tiny transactions, thanks to the single-process replay, so maybe we'd need to throttle those too. (The recovery prefetching improved this for async quite a bit, ofc.) 
implementation -------------- The implementation is fairly straightforward, and happens in two places. XLogInsertRecord() decides if a throttling might be needed for this backend, and then HandleXLogDelayPending() does the wait. XLogInsertRecord() checks if the backend produced certain amount of WAL (might be 1MB, for example). We do this because we don't want to do the expensive stuff in HandleXLogDelayPending() too often (e.g. after every XLOG record). HandleXLogDelayPending() picks a suitable LSN, flushes it and then also waits for the sync replica, as if it was a commit. This limits the lag, i.e. the amount of WAL that the small transaction will need to wait for to be replicated and confirmed by the replica. There was a fair amount of discussion about how to pick the LSN. I think the agreement is we certainly can't pick the current LSN (because that would lead to write amplification for the partially filled page), and we probably even want to backoff a bit more, to make it more likely the LSN is already flushed. So for example with the threshold set to 8MB we might go back 1MB, or something like that. That'd still limit the lag. problems -------- Now let's talk about some problems - both conceptual and technical (essentially review comments for the patch). 1) The goal of the patch is to limit the impact on latency, but the relationship between WAL amounts and latency may not be linear. But we don't have a good way to predict latency, and WAL lag is the only thing we have, so there's that. Ultimately, it's a best effort. 2) The throttling is per backend. That makes it simple, but it means that it's hard to enforce a global lag limit. Imagine the limit is 8MB, and with a single backend that works fine - the lag should not exceed the 8MB value. But if there are N backends, the lag could be up to N-times 8MB, I believe. That's a bit annoying, but I guess the only solution would be to have some autovacuum-like cost balancing, with all backends (or at least those running large stuff) doing the checks more often. I'm not sure we want to do that. 3) The actual throttling (flush and wait for syncrep) happens in ProcessInterrupts(), which mostly works but it has two drawbacks: * It may not happen "early enough" if the backends inserts a lot of XLOG records without processing interrupts in between. * It may happen "too early" if the backend inserts enough WAL to need throttling (i.e. sets XLogDelayPending), but then after processing interrupts it would be busy with other stuff, not inserting more WAL. I think ideally we'd do the throttling right before inserting the next XLOG record, but there's no convenient place, I think. We'd need to annotate a lot of places, etc. So maybe ProcessInterrupts() is a reasonable approximation. We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but that seems reasonable. 4) I'm not sure I understand why we need XactLastThrottledRecEnd. Why can't we just use XLogRecEnd? 5) I think the way XLogFlush() resets backendWalInserted is a bit wrong. Imagine a backend generates a fair amount of WAL, and then calls XLogFlush(lsn). Why is it OK to set backendWalInserted=0 when we don't know if the generated WAL was before the "lsn"? I suppose we don't use very old lsn values for flushing, but I don't know if this drift could accumulate over time, or cause some other issues. 6) Why XLogInsertRecord() did skip SYNCHRONOUS_COMMIT_REMOTE_FLUSH? 
7) I find the "synchronous_commit_wal_throttle_threshold" name annoyingly long, so I renamed it to just "wal_throttle_threshold". I also wonder whether it should instead be called "wal_throttle_after" and be in GUC_UNIT_BLOCKS, just like the other _after options? But those changes are more a matter of taste, feel free to ignore this.

missing pieces
--------------

The thing that's missing is that some processes (like aggressive anti-wraparound autovacuum) should not be throttled. If people set the GUC in postgresql.conf, I guess that'll affect those processes too, so I guess we should explicitly reset the GUC for those processes. I wonder if there are other cases that should not be throttled.

tangents
--------

While discussing this with Andres a while ago, he mentioned a somewhat orthogonal idea - sending unflushed data to the replica.

We currently never send unflushed data to the replica, which makes sense because this data is not durable and if the primary crashes/restarts, this data will disappear. But it also means there may be a fairly large chunk of WAL data that we may need to send at COMMIT and wait for the confirmation.

He suggested we might actually send the data to the replica, but the replica would know this data is not flushed yet and so would not do the recovery etc. And at commit we could just send a request to flush, without having to transfer the data at that moment.

I don't have a very good intuition about how large the effect would be, i.e. how much unflushed WAL data could accumulate on the primary (kilobytes/megabytes?), and how big the difference is between sending a couple kilobytes or just a request to flush.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
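To make the two-step structure described in the implementation section above easier to follow, here is a rough sketch of the XLogInsertRecord() side. It is reconstructed from the descriptions in this thread rather than copied from the patch; backendWalInserted, XactLastThrottledRecEnd, XLogDelayPending and wal_throttle_threshold are the patch's names, and the exact conditions are assumptions.

    /* In XLogInsertRecord(), after the record has been inserted ... */
    backendWalInserted += rechdr->xl_tot_len;

    if (wal_throttle_threshold > 0 &&
        SyncRepRequested() &&
        backendWalInserted >= wal_throttle_threshold * 1024L)
    {
        /*
         * Remember up to where to flush/wait, and request the actual wait
         * at the next ProcessInterrupts(), i.e. outside the critical section.
         */
        XactLastThrottledRecEnd = EndPos;
        XLogDelayPending = true;
        InterruptPending = true;
    }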
Hi, On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote: > scope > ----- > Now, let's talk about scope - what the patch does not aim to do. The > patch is explicitly intended for syncrep clusters, not async. There have > been proposals to also support throttling for async replicas, logical > replication etc. I suppose all of that could be implemented, and I do > see the benefit of defining some sort of maximum lag even for async > replicas. But the agreement was to focus on the syncrep case, where it's > particularly painful, and perhaps extend it in the future. Perhaps we should take care to make the configuration extensible in that direction in the future? Hm - is this feature really tied to replication, at all? Pretty much the same situation exists without. On an ok-ish local nvme I ran pgbench with 1 client and -P1. Guess where I started a VACUUM (on a fully cached table, so no continuous WAL flushes): progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed At 32 clients we go from ~10k to 2.5k, with a full 2s of 0. Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the "good times" show a delay of ~8kB (note that this includes WAL records that are still being inserted). Once the VACUUM runs, it's ~2-3MB. The picture with more clients is similar. If I instead severely limit the amount of outstanding (but not the amount of unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32). Of course that's ridiculous and will completely trash performance in many other cases, but it shows that limiting the amount of outstanding WAL could help without replication as well. With remote storage, that'd likely be a bigger difference. > problems > -------- > Now let's talk about some problems - both conceptual and technical > (essentially review comments for the patch). > > 1) The goal of the patch is to limit the impact on latency, but the > relationship between WAL amounts and latency may not be linear. But we > don't have a good way to predict latency, and WAL lag is the only thing > we have, so there's that. Ultimately, it's a best effort. It's indeed probably not linear. Realistically, to do better, we probably need statistics for the specific system in question - the latency impact will differ hugely between different storage/network. > 2) The throttling is per backend. That makes it simple, but it means > that it's hard to enforce a global lag limit. Imagine the limit is 8MB, > and with a single backend that works fine - the lag should not exceed > the 8MB value. But if there are N backends, the lag could be up to > N-times 8MB, I believe. That's a bit annoying, but I guess the only > solution would be to have some autovacuum-like cost balancing, with all > backends (or at least those running large stuff) doing the checks more > often. I'm not sure we want to do that. Hm. 
The average case is likely fine - the throttling of the different backends will intersperse and flush more frequently - but the worst case is presumably part of the issue here. I wonder if we could deal with this by somehow offsetting the points at which backends flush at somehow. I doubt we want to go for something autovacuum balancing like - that doesn't seem to work well - but I think we could take the amount of actually unflushed WAL into account when deciding whether to throttle. We have the necessary state in local memory IIRC. We'd have to be careful to not throttle every backend at the same time, or we'll introduce latency penalties that way. But what if we scaled synchronous_commit_wal_throttle_threshold depending on the amount of unflushed WAL? By still taking backendWalInserted into account, we'd avoid throttling everyone at the same time, but still would make throttling more aggressive depending on the amount of unflushed/unreplicated WAL. > 3) The actual throttling (flush and wait for syncrep) happens in > ProcessInterrupts(), which mostly works but it has two drawbacks: > > * It may not happen "early enough" if the backends inserts a lot of > XLOG records without processing interrupts in between. Does such code exist? And if so, is there a reason not to fix said code? > * It may happen "too early" if the backend inserts enough WAL to need > throttling (i.e. sets XLogDelayPending), but then after processing > interrupts it would be busy with other stuff, not inserting more WAL. > I think ideally we'd do the throttling right before inserting the next > XLOG record, but there's no convenient place, I think. We'd need to > annotate a lot of places, etc. So maybe ProcessInterrupts() is a > reasonable approximation. Yea, I think there's no way to do that with reasonable effort. Starting to wait with a bunch of lwlocks held would obviously be bad. > We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but > that seems reasonable. And independently beneficial. > missing pieces > -------------- > The thing that's missing is that some processes (like aggressive > anti-wraparound autovacuum) should not be throttled. If people set the > GUC in the postgresql.conf, I guess that'll affect those processes too, > so I guess we should explicitly reset the GUC for those processes. I > wonder if there are other cases that should not be throttled. Hm, that's a bit hairy. If we just exempt it we'll actually slow down everyone else even further, even though the goal of the feature might be the opposite. I don't think that's warranted for anti-wraparound vacuums - they're normal. I think failsafe vacuums are a different story - there we really just don't care about impacting other backends, the goal is to prevent moving the cluster to read only pretty soon. > tangents > -------- > While discussing this with Andres a while ago, he mentioned a somewhat > orthogonal idea - sending unflushed data to the replica. > > We currently never send unflushed data to the replica, which makes sense > because this data is not durable and if the primary crashes/restarts, > this data will disappear. But it also means there may be a fairly large > chunk of WAL data that we may need to send at COMMIT and wait for the > confirmation. > > He suggested we might actually send the data to the replica, but the > replica would know this data is not flushed yet and so would not do the > recovery etc. And at commit we could just send a request to flush, > without having to transfer the data at that moment. 
>
> I don't have a very good intuition about how large the effect would be,
> i.e. how much unflushed WAL data could accumulate on the primary
> (kilobytes/megabytes?),

Obviously that heavily depends on the workload. If you have anything with bulk writes it can be many megabytes.

> and how big is the difference between sending a couple kilobytes or just a
> request to flush.

Obviously that heavily depends on the network...

I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit network (albeit with a crappy external card for my laptop), to put some numbers to this. I used -r $s,100 to test sending variable sized data to the other side, with the other side always responding with 100 bytes (assuming that'd more than fit a feedback response).

Command:
fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields; for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done

10gbe:
request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           43          64.30        390         96          15526.084
100          100           57          75.12        428         122         13286.602
1000         100           47          74.41        270         108         13412.125
10000        100           89          114.63       712         152         8700.643
100000       100           167         255.90       584         312         3903.516
1000000      100           891         1015.99      2470        1143        983.708

Same hosts, but with my workstation forced to use a 1gbit connection:
request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           78          131.18       2425        257         7613.416
100          100           81          129.25       425         255         7727.473
1000         100           100         162.12       1444        266         6161.388
10000        100           310         686.19       1797        927         1456.204
100000       100           1006        1114.20      1472        1199        896.770
1000000      100           8338        8420.96      8827        8498        118.410

I haven't checked, but I'd assume that 100 bytes back and forth should easily fit a new message to update LSNs and the existing feedback response. Even just the difference between sending 100 bytes and sending 10k (a bit more than a single WAL page) is pretty significant on a 1gbit network.

Of course, the relatively low latency between these systems makes this more pronounced than if this were a cross-regional or even cross-continental link, where the roundtrip latency is more likely to be dominated by distance rather than throughput. Testing between Europe and the western US:

request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           157934      167627.12    317705      160000      5.652
100          100           161294      171323.59    324017      170000      5.530
1000         100           161392      171521.82    324629      170000      5.524
10000        100           163651      173651.06    328488      170000      5.456
100000       100           166344      198070.20    638205      170000      4.781
1000000      100           225555      361166.12    1302368     240000      2.568

No meaningful difference before getting to 100k. But it's pretty easy to lag by 100k on a longer distance link...

Greetings,

Andres Freund
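One possible shape for the idea of scaling the per-backend threshold by the amount of unflushed WAL, mentioned a bit earlier in this mail: purely illustrative, the scaling rule is an assumption; wal_throttle_threshold, backendWalInserted and XLogDelayPending are the patch's names, while GetXLogInsertRecPtr(), GetFlushRecPtr() and XLOG_BLCKSZ are existing APIs.

    /*
     * Hypothetical: shrink the effective per-backend threshold as the total
     * amount of locally unflushed WAL grows, so throttling gets more
     * aggressive under pressure without kicking in for everyone at once.
     */
    uint64      base = (uint64) wal_throttle_threshold * 1024;
    uint64      unflushed = GetXLogInsertRecPtr() - GetFlushRecPtr(NULL);
    uint64      threshold = base;

    if (unflushed > base)
        threshold = Max(base / (unflushed / base + 1), (uint64) XLOG_BLCKSZ);

    if (backendWalInserted >= threshold)
        XLogDelayPending = true;    /* the wait happens at ProcessInterrupts() */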
On 11/8/23 07:40, Andres Freund wrote: > Hi, > > On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote: >> scope >> ----- >> Now, let's talk about scope - what the patch does not aim to do. The >> patch is explicitly intended for syncrep clusters, not async. There have >> been proposals to also support throttling for async replicas, logical >> replication etc. I suppose all of that could be implemented, and I do >> see the benefit of defining some sort of maximum lag even for async >> replicas. But the agreement was to focus on the syncrep case, where it's >> particularly painful, and perhaps extend it in the future. > > Perhaps we should take care to make the configuration extensible in that > direction in the future? > Yes, if we can come up with a suitable configuration, that would work for the other use cases. I don't have a very good idea what to do about replicas that may not be connected, of have connected but need to catch up. IMHO it would be silly to turn this into "almost a sync rep". > > Hm - is this feature really tied to replication, at all? Pretty much the same > situation exists without. On an ok-ish local nvme I ran pgbench with 1 client > and -P1. Guess where I started a VACUUM (on a fully cached table, so no > continuous WAL flushes): > > progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed > progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed > progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed > progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed > progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed > progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed > progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed > progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed > progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed > progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed > > At 32 clients we go from ~10k to 2.5k, with a full 2s of 0. > > Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the > "good times" show a delay of ~8kB (note that this includes WAL records that > are still being inserted). Once the VACUUM runs, it's ~2-3MB. > > The picture with more clients is similar. > > If I instead severely limit the amount of outstanding (but not the amount of > unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less > (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32). Of course > that's ridiculous and will completely trash performance in many other cases, > but it shows that limiting the amount of outstanding WAL could help without > replication as well. With remote storage, that'd likely be a bigger > difference. > Yeah, that's an interesting idea. I think the idea of enforcing "maximum lag" is somewhat general, the difference is against what LSN the lag is measured. For the syncrep case it was about LSN confirmed by the replica, what you described would measure it for either flush LSN or write LSN (which would be the "outstanding" case I think). I guess the remote storage is somewhat similar to the syncrep case, in that the lag includes some network communication. > >> problems >> -------- >> Now let's talk about some problems - both conceptual and technical >> (essentially review comments for the patch). >> >> 1) The goal of the patch is to limit the impact on latency, but the >> relationship between WAL amounts and latency may not be linear. 
But we >> don't have a good way to predict latency, and WAL lag is the only thing >> we have, so there's that. Ultimately, it's a best effort. > > It's indeed probably not linear. Realistically, to do better, we probably need > statistics for the specific system in question - the latency impact will > differ hugely between different storage/network. > True. I can imagine two ways to measure that. We could have a standalone tool similar to pg_test_fsync that would mimic how we write/flush WAL, and measure the latency for different amounts of flushed data. The DBA would then be responsible for somehow using this to configure the database (perhaps the tool could calculate some "optimal" value to flush). Alternatively, we could collect timing in XLogFlush, so that we'd track amount of data to flush + timing, and then use that to calculate expected latency (e.g. by binning by data size and using average latency for each bin). And then use that, somehow. So you could say - maximum commit latency is 10ms, and the system would be able to estimate that the maximum amount of unflushed WAL is 256kB, and it'd enforce this distance. Still only best effort, no guarantees, of course. > >> 2) The throttling is per backend. That makes it simple, but it means >> that it's hard to enforce a global lag limit. Imagine the limit is 8MB, >> and with a single backend that works fine - the lag should not exceed >> the 8MB value. But if there are N backends, the lag could be up to >> N-times 8MB, I believe. That's a bit annoying, but I guess the only >> solution would be to have some autovacuum-like cost balancing, with all >> backends (or at least those running large stuff) doing the checks more >> often. I'm not sure we want to do that. > > Hm. The average case is likely fine - the throttling of the different backends > will intersperse and flush more frequently - but the worst case is presumably > part of the issue here. I wonder if we could deal with this by somehow > offsetting the points at which backends flush at somehow. > If I understand correctly, you want to ensure the backends start measuring the WAL from different LSNs, in order to "distribute" them uniformly within the WAL (and not "coordinate" them, which leads to higher lag). I guess we could do that, say by flushing only up to hash(pid) % maximum_allowed_lag. I'm not sure that'll really work, especially if the backends are somewhat naturally "correlated". Maybe it could work if the backends explicitly coordinated the flushes to distribute them, but then how is that different from just doing what autovacuum-like costing does in principle. However, perhaps there's an "adaptive" way to do this - each backend knows how much WAL it produced since the last flush LSN, and it can easily measure the actual lag (unflushed WAL). It could compare those, and estimate what fraction of the lag it's likely responsible for. And then adjust the "flush distance" based on that. Imagine we aim for 1MB unflushed WAL, and you have two backends that happen to execute at the same time. They both generate 1MB of WAL and hit the throttling code at about the same time. They discover the actual lag is not the desired 1MB but 2MB, so (requested_lag/actual_lag) = 0.5, and they'd adjust the flush distance to 1/2MB. And from that point we know the lag is 1MB even with two backends. Then one of the backends terminates, and the other backend eventually hits the 1/2MB limit again, but the desired lag is 1MB, and it doubles the distance again.
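A rough sketch of that adjustment, just to illustrate the idea - the names (flush_distance, adjust_flush_distance) are invented here and nothing like this is in the patch; it assumes the usual postgres.h environment (uint64, Max/Min, XLOG_BLCKSZ):

static uint64 flush_distance;	/* per backend, starts at the configured lag */

static void
adjust_flush_distance(uint64 requested_lag, uint64 actual_lag)
{
	double		share;

	if (actual_lag == 0)
		return;

	/*
	 * Scale the per-backend distance by requested/actual lag.  With two
	 * equally busy backends each ends up at roughly half the distance, so
	 * the total lag stays close to the requested value.  Clamp the step so
	 * one sample cannot swing the distance wildly, and keep the result
	 * between one WAL page and the requested lag.
	 */
	share = (double) requested_lag / (double) actual_lag;
	share = Max(0.25, Min(share, 4.0));

	flush_distance = (uint64) (flush_distance * share);
	flush_distance = Max(flush_distance, (uint64) XLOG_BLCKSZ);
	flush_distance = Min(flush_distance, requested_lag);
}

Each backend would call this right before checking whether it exceeded flush_distance, with actual_lag computed from the current insert and flush positions.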
Of course, in practice the behavior would be more complicated, thanks to backends that generate WAL but don't really hit the threshold. There'd probably also be some sort of ramp-up, i.e. the backed would not start with the "full" 1MB limit, but perhaps something lower. Would need to be careful to be high enough to ignore the OLTP transactions, though. > I doubt we want to go for something autovacuum balancing like - that doesn't > seem to work well - but I think we could take the amount of actually unflushed > WAL into account when deciding whether to throttle. We have the necessary > state in local memory IIRC. We'd have to be careful to not throttle every > backend at the same time, or we'll introduce latency penalties that way. But > what if we scaled synchronous_commit_wal_throttle_threshold depending on the > amount of unflushed WAL? By still taking backendWalInserted into account, we'd > avoid throttling everyone at the same time, but still would make throttling > more aggressive depending on the amount of unflushed/unreplicated WAL. > Oh! Perhaps similar to the adaptive behavior I explained above? > >> 3) The actual throttling (flush and wait for syncrep) happens in >> ProcessInterrupts(), which mostly works but it has two drawbacks: >> >> * It may not happen "early enough" if the backends inserts a lot of >> XLOG records without processing interrupts in between. > > Does such code exist? And if so, is there a reason not to fix said code? > Not sure. I thought maybe index builds might do something like that, but it doesn't seem to be the case (at least for the built-in indexes). But if adding CHECK_FOR_INTERRUPTS to more places is an acceptable fix, I'm OK with that. > >> * It may happen "too early" if the backend inserts enough WAL to need >> throttling (i.e. sets XLogDelayPending), but then after processing >> interrupts it would be busy with other stuff, not inserting more WAL. > >> I think ideally we'd do the throttling right before inserting the next >> XLOG record, but there's no convenient place, I think. We'd need to >> annotate a lot of places, etc. So maybe ProcessInterrupts() is a >> reasonable approximation. > > Yea, I think there's no way to do that with reasonable effort. Starting to > wait with a bunch of lwlocks held would obviously be bad. > OK. > >> We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but >> that seems reasonable. > > And independently beneficial. > OK. > >> missing pieces >> -------------- >> The thing that's missing is that some processes (like aggressive >> anti-wraparound autovacuum) should not be throttled. If people set the >> GUC in the postgresql.conf, I guess that'll affect those processes too, >> so I guess we should explicitly reset the GUC for those processes. I >> wonder if there are other cases that should not be throttled. > > Hm, that's a bit hairy. If we just exempt it we'll actually slow down everyone > else even further, even though the goal of the feature might be the opposite. > I don't think that's warranted for anti-wraparound vacuums - they're normal. I > think failsafe vacuums are a different story - there we really just don't care > about impacting other backends, the goal is to prevent moving the cluster to > read only pretty soon. > Right, I confused those two autovacuum modes. > >> tangents >> -------- >> While discussing this with Andres a while ago, he mentioned a somewhat >> orthogonal idea - sending unflushed data to the replica. 
>> >> We currently never send unflushed data to the replica, which makes sense >> because this data is not durable and if the primary crashes/restarts, >> this data will disappear. But it also means there may be a fairly large >> chunk of WAL data that we may need to send at COMMIT and wait for the >> confirmation. >> >> He suggested we might actually send the data to the replica, but the >> replica would know this data is not flushed yet and so would not do the >> recovery etc. And at commit we could just send a request to flush, >> without having to transfer the data at that moment. >> >> I don't have a very good intuition about how large the effect would be, >> i.e. how much unflushed WAL data could accumulate on the primary >> (kilobytes/megabytes?), > > Obviously heavily depends on the workloads. If you have anything with bulk > writes it can be many megabytes. > > >> and how big is the difference between sending a couple kilobytes or just a >> request to flush. > > Obviously heavily depends on the network... > I know it depends on workload/network. I'm merely saying I don't have a very good intuition what value would be suitable for a particular workload / network. > > I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit > network (albeit with a crappy external card for my laptop), to put some > numbers to this. I used -r $s,100 to test sending a variable sized data to the > other size, with the other side always responding with 100 bytes (assuming > that'd more than fit a feedback response). > > Command: > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields; fors in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done > > 10gbe: > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 43 64.30 390 96 15526.084 > 100 100 57 75.12 428 122 13286.602 > 1000 100 47 74.41 270 108 13412.125 > 10000 100 89 114.63 712 152 8700.643 > 100000 100 167 255.90 584 312 3903.516 > 1000000 100 891 1015.99 2470 1143 983.708 > > > Same hosts, but with my workstation forced to use a 1gbit connection: > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 78 131.18 2425 257 7613.416 > 100 100 81 129.25 425 255 7727.473 > 1000 100 100 162.12 1444 266 6161.388 > 10000 100 310 686.19 1797 927 1456.204 > 100000 100 1006 1114.20 1472 1199 896.770 > 1000000 100 8338 8420.96 8827 8498 118.410 > > I haven't checked, but I'd assume that 100bytes back and forth should easily > fit a new message to update LSNs and the existing feedback response. Even just > the difference between sending 100 bytes and sending 10k (a bit more than a > single WAL page) is pretty significant on a 1gbit network. > I'm on decaf so I may be a bit slow, but it's not very clear to me what conclusion to draw from these numbers. What is the takeaway? My understanding is that in both cases the latency is initially fairly stable, independent of the request size. This applies to request up to ~1000B. And then the latency starts increasing fairly quickly, even though it shouldn't hit the bandwidth (except maybe the 1MB requests). I don't think it says we should be replicating WAL in tiny chunks, because if you need to send a chunk of data it's always more efficient to send it at once (compared to sending multiple smaller pieces). 
But if we manage to send most of this "in the background", only leaving the last small bit to be sent at the very end, that'd help. > Of course, the relatively low latency between these systems makes this more > pronounced than if this were a cross regional or even cross continental link, > were the roundtrip latency is more likely to be dominated by distance rather > than throughput. > > Testing between europe and western US: > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 157934 167627.12 317705 160000 5.652 > 100 100 161294 171323.59 324017 170000 5.530 > 1000 100 161392 171521.82 324629 170000 5.524 > 10000 100 163651 173651.06 328488 170000 5.456 > 100000 100 166344 198070.20 638205 170000 4.781 > 1000000 100 225555 361166.12 1302368 240000 2.568 > > > No meaningful difference before getting to 100k. But it's pretty easy to lag > by 100k on a longer distance link... > Right. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote: > > I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit > > network (albeit with a crappy external card for my laptop), to put some > > numbers to this. I used -r $s,100 to test sending a variable sized data to the > > other size, with the other side always responding with 100 bytes (assuming > > that'd more than fit a feedback response). > > > > Command: > > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;for s in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done > > > > 10gbe: > > > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > > 10 100 43 64.30 390 96 15526.084 > > 100 100 57 75.12 428 122 13286.602 > > 1000 100 47 74.41 270 108 13412.125 > > 10000 100 89 114.63 712 152 8700.643 > > 100000 100 167 255.90 584 312 3903.516 > > 1000000 100 891 1015.99 2470 1143 983.708 > > > > > > Same hosts, but with my workstation forced to use a 1gbit connection: > > > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > > 10 100 78 131.18 2425 257 7613.416 > > 100 100 81 129.25 425 255 7727.473 > > 1000 100 100 162.12 1444 266 6161.388 > > 10000 100 310 686.19 1797 927 1456.204 > > 100000 100 1006 1114.20 1472 1199 896.770 > > 1000000 100 8338 8420.96 8827 8498 118.410 Looks like the 1gbit numbers were somewhat bogus-ified due having configured jumbo frames and some network component doing something odd with that (handling them in software maybe?). 10gbe: request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate 10 100 56 68.56 483 87 14562.476 100 100 57 75.68 353 123 13185.485 1000 100 60 71.97 391 94 13870.659 10000 100 58 92.42 489 140 10798.444 100000 100 184 260.48 1141 338 3834.504 1000000 100 926 1071.46 2012 1466 933.009 1gbe request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate 10 100 77 132.19 1097 257 7555.420 100 100 79 127.85 534 249 7810.862 1000 100 98 155.91 966 265 6406.818 10000 100 176 235.37 1451 314 4245.304 100000 100 944 1022.00 1380 1148 977.930 1000000 100 8649 8768.42 9018 8895 113.703 > > I haven't checked, but I'd assume that 100bytes back and forth should easily > > fit a new message to update LSNs and the existing feedback response. Even just > > the difference between sending 100 bytes and sending 10k (a bit more than a > > single WAL page) is pretty significant on a 1gbit network. > > > > I'm on decaf so I may be a bit slow, but it's not very clear to me what > conclusion to draw from these numbers. What is the takeaway? > > My understanding is that in both cases the latency is initially fairly > stable, independent of the request size. This applies to request up to > ~1000B. And then the latency starts increasing fairly quickly, even > though it shouldn't hit the bandwidth (except maybe the 1MB requests). Except for the smallest end, these are bandwidth related, I think. Converting 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on 1gbit. If you subtract the minmum latency of about 130us, that's nearly all of the latency. The reason this matters is that the numbers show that the latency of having to send a small message with updated positions is far smaller than having to send all the outstanding data. 
Even having to send a single WAL page over the network ~doubles the latency of the response on 1gbit! Of course the impact is smaller on 10gbit, but even there latency substantially increases around 100kB of outstanding data. In a local pgbench with 32 clients I see WAL write sizes between 8kB and ~220kB. Being able to stream those out before the local flush completed therefore seems likely to reduce synchronous_commit overhead substantially. > I don't think it says we should be replicating WAL in tiny chunks, > because if you need to send a chunk of data it's always more efficient > to send it at once (compared to sending multiple smaller pieces). I don't think that's a very large factor for network data, once your minimal data size is ~8kB (or ~4kB if we lower wal_block_size). TCP messages will get chunked into something smaller anyway and small messages don't need to get acknowledged individually. Sending more data at once is good for CPU efficiency (reducing syscall and network device overhead), but doesn't do much for throughput. Sending 4kB of data in each send() in a bandwidth oriented test already gets to ~9.3gbit/s in my network. That's close to the maximum attainable with normal framing. If I change the mtu back to 9000 I get 9.89 gbit/s, again very close to the theoretical max. Greetings, Andres Freund
On 11/8/23 18:11, Andres Freund wrote: > Hi, > > On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote: >>> I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit >>> network (albeit with a crappy external card for my laptop), to put some >>> numbers to this. I used -r $s,100 to test sending a variable sized data to the >>> other size, with the other side always responding with 100 bytes (assuming >>> that'd more than fit a feedback response). >>> >>> Command: >>> fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;for s in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done >>> >>> 10gbe: >>> >>> request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate >>> 10 100 43 64.30 390 96 15526.084 >>> 100 100 57 75.12 428 122 13286.602 >>> 1000 100 47 74.41 270 108 13412.125 >>> 10000 100 89 114.63 712 152 8700.643 >>> 100000 100 167 255.90 584 312 3903.516 >>> 1000000 100 891 1015.99 2470 1143 983.708 >>> >>> >>> Same hosts, but with my workstation forced to use a 1gbit connection: >>> >>> request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate >>> 10 100 78 131.18 2425 257 7613.416 >>> 100 100 81 129.25 425 255 7727.473 >>> 1000 100 100 162.12 1444 266 6161.388 >>> 10000 100 310 686.19 1797 927 1456.204 >>> 100000 100 1006 1114.20 1472 1199 896.770 >>> 1000000 100 8338 8420.96 8827 8498 118.410 > > Looks like the 1gbit numbers were somewhat bogus-ified due having configured > jumbo frames and some network component doing something odd with that > (handling them in software maybe?). > > 10gbe: > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 56 68.56 483 87 14562.476 > 100 100 57 75.68 353 123 13185.485 > 1000 100 60 71.97 391 94 13870.659 > 10000 100 58 92.42 489 140 10798.444 > 100000 100 184 260.48 1141 338 3834.504 > 1000000 100 926 1071.46 2012 1466 933.009 > > 1gbe > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 77 132.19 1097 257 7555.420 > 100 100 79 127.85 534 249 7810.862 > 1000 100 98 155.91 966 265 6406.818 > 10000 100 176 235.37 1451 314 4245.304 > 100000 100 944 1022.00 1380 1148 977.930 > 1000000 100 8649 8768.42 9018 8895 113.703 > > >>> I haven't checked, but I'd assume that 100bytes back and forth should easily >>> fit a new message to update LSNs and the existing feedback response. Even just >>> the difference between sending 100 bytes and sending 10k (a bit more than a >>> single WAL page) is pretty significant on a 1gbit network. >>> >> >> I'm on decaf so I may be a bit slow, but it's not very clear to me what >> conclusion to draw from these numbers. What is the takeaway? >> >> My understanding is that in both cases the latency is initially fairly >> stable, independent of the request size. This applies to request up to >> ~1000B. And then the latency starts increasing fairly quickly, even >> though it shouldn't hit the bandwidth (except maybe the 1MB requests). > > Except for the smallest end, these are bandwidth related, I think. Converting > 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving > the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on > 1gbit. If you subtract the minmum latency of about 130us, that's nearly all of > the latency. 
> Maybe I don't understand what you mean "bandwidth related" but surely the smaller requests are not limited by bandwidth. I mean, 100B and 1kB (and even 10kB) requests have almost the same transaction rate, yet there's an order of magnitude difference in bandwidth (sure, there's overhead, but this much magnitude?). On the higher end, sure, that seems bandwidth related. But for 100kB, it's still just ~50% of the 1Gbps. > The reason this matters is that the numbers show that the latency of having to > send a small message with updated positions is far smaller than having to send > all the outstanding data. Even having to send a single WAL page over the > network ~doubles the latency of the response on 1gbit! Of course the impact > is smaller on 10gbit, but even there latency substantially increases around > 100kB of outstanding data. Understood. I wonder if this is one of the things we'd need to measure to adjust the write size (i.e. how eagerly to write WAL to disk / over network). Essentially, we'd get an the size where the latency starts increasing much faster, and try to write WAL faster than that. I wonder if storage (not network) has a similar pattern. > In a local pgbench with 32 clients I see WAL write sizes between 8kB and > ~220kB. Being able to stream those out before the local flush completed > therefore seems likely to reduce synchronous_commit overhead substantially. > Yeah, those writes are certainly too large. If we can write them earlier, and then only do smaller messages to write the remaining bit and the positions, that'd help a lot. > >> I don't think it says we should be replicating WAL in tiny chunks, >> because if you need to send a chunk of data it's always more efficient >> to send it at once (compared to sending multiple smaller pieces). > > I don't think that's a very large factor for network data, once your minimal > data sizes is ~8kB (or ~4kB if we lower wal_block_size). TCP messsages will > get chunked into something smaller anyway and small messages don't need to get > acknowledged individually. Sending more data at once is good for CPU > efficiency (reducing syscall and network device overhead), but doesn't do much > for throughput. > > Sending 4kB of data in each send() in a bandwidth oriented test already gets > to ~9.3gbit/s in my network. That's close to the maximum atainable with normal > framing. If I change the mtu back to 9000 I get 9.89 gbit/s, again very close > to the theoretical max. > Got it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
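As an illustration of the measurement idea above - bin observed write sizes against latency and derive the largest size that still fits a latency target - a rough sketch; all names are invented and nothing like this exists in the patches:

#define WRITE_LAT_NBINS 8				/* 4kB, 8kB, ..., 512kB */

typedef struct WalWriteLatStats
{
	uint64		nsamples[WRITE_LAT_NBINS];
	uint64		total_usec[WRITE_LAT_NBINS];
} WalWriteLatStats;

static WalWriteLatStats write_stats;

/* record one observation, e.g. from an instrumented XLogFlush */
static void
wal_write_stats_record(uint64 nbytes, uint64 elapsed_usec)
{
	int			bin = 0;
	uint64		sz = 4096;

	while (bin < WRITE_LAT_NBINS - 1 && nbytes > sz)
	{
		sz <<= 1;
		bin++;
	}
	write_stats.nsamples[bin]++;
	write_stats.total_usec[bin] += elapsed_usec;
}

/* largest write size whose average latency still meets the target */
static uint64
wal_write_stats_target_size(uint64 target_usec)
{
	uint64		best = 4096;
	uint64		sz = 4096;
	int			bin;

	for (bin = 0; bin < WRITE_LAT_NBINS; bin++, sz <<= 1)
	{
		if (write_stats.nsamples[bin] == 0)
			continue;
		if (write_stats.total_usec[bin] / write_stats.nsamples[bin] <= target_usec)
			best = sz;
	}
	return best;
}

With a target of, say, 10ms this would answer "how much unflushed WAL can we afford" for the storage/network actually in use, which is what both the XLogFlush-timing idea and the write-size question above are after.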
Hi, On 2023-11-08 19:29:38 +0100, Tomas Vondra wrote: > >>> I haven't checked, but I'd assume that 100bytes back and forth should easily > >>> fit a new message to update LSNs and the existing feedback response. Even just > >>> the difference between sending 100 bytes and sending 10k (a bit more than a > >>> single WAL page) is pretty significant on a 1gbit network. > >>> > >> > >> I'm on decaf so I may be a bit slow, but it's not very clear to me what > >> conclusion to draw from these numbers. What is the takeaway? > >> > >> My understanding is that in both cases the latency is initially fairly > >> stable, independent of the request size. This applies to request up to > >> ~1000B. And then the latency starts increasing fairly quickly, even > >> though it shouldn't hit the bandwidth (except maybe the 1MB requests). > > > > Except for the smallest end, these are bandwidth related, I think. Converting > > 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving > > the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on > > 1gbit. If you subtract the minmum latency of about 130us, that's nearly all of > > the latency. > > > > Maybe I don't understand what you mean "bandwidth related" but surely > the smaller requests are not limited by bandwidth. I mean, 100B and 1kB > (and even 10kB) requests have almost the same transaction rate, yet > there's an order of magnitude difference in bandwidth (sure, there's > overhead, but this much magnitude?). What I mean is that bandwidth is the biggest factor determining latency in the numbers I showed (due to decent sized packet and it being a local network). At line rate it takes ~80us to send 10kB over 1gbit ethernet. So a roundtrip cannot be faster than 80us, even if everything else added zero latency. That's why my numbers show such a lower latency for the 10gbit network - it's simply faster to put even small-ish amounts of data onto the wire. That does not mean that the link is fully utilized over time - because we wait for the other side to receive the data, wake up a user space process, send back 100 bytes, wait for the data be transmitted, and then wake up a process, there are periods where the link in one direction is largely idle. But in case of a 10kB packet on the 1gbit network, yes, we are bandwidth limited for ~80us (or perhaps more interestingly, we are bandwidth limited for 0.8ms when sending 100kB). Greetings, Andres Freund
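As a sanity check on that arithmetic, a tiny standalone program (illustrative only) that prints the raw wire time, ignoring framing and TCP overhead:

#include <stdio.h>

/* time on the wire for a payload at a given line rate, no overhead */
static double
wire_time_usec(double payload_bytes, double gbit_per_sec)
{
	double		bytes_per_usec = gbit_per_sec * 1e9 / 8.0 / 1e6;	/* 1 Gbit ~ 125 B/us */

	return payload_bytes / bytes_per_usec;
}

int
main(void)
{
	printf("10 kB @ 1 Gbit: %.0f us\n", wire_time_usec(10 * 1024, 1.0));	/* ~82 us */
	printf("100 kB @ 1 Gbit: %.0f us\n", wire_time_usec(100 * 1024, 1.0));	/* ~819 us */
	printf("10 kB @ 10 Gbit: %.0f us\n", wire_time_usec(10 * 1024, 10.0));	/* ~8 us */
	return 0;
}

Which lines up with the measured numbers: the 10kB and 100kB rows on 1gbit are dominated by transmission time rather than by the round trip itself.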
Hi, Since the last patch version I've done a number of experiments with this throttling idea, so let me share some of the ideas and results, and see where that gets us. The patch versions so far tied everything to syncrep - commit latency with sync replica was the original motivation, so this makes sense. But while thinking about this and discussing this with a couple people, I've been wondering why to limit this to just that particular option. There's a couple other places in the WAL write path where we might do a similar thing (i.e. wait) or be a bit more aggressive (and do a write/flush), depending on circumstances. If I simplify this a bit, there are about 3 WAL positions that I could think of: - write LSN (how far we wrote WAL to disk) - flush LSN (how far we flushed WAL to local disk) - syncrep LSN (how far the sync replica confirmed WAL) So, why couldn't there be a similar "throttling threshold" for these events too? Imagine we have three GUCs, with values satisfying this: wal_write_after < wal_flush_after_local < wal_flush_after_remote and this meaning: wal_write_after - if a backend generates this amount of WAL, it will write the completed WAL (but only whole pages) wal_flush_after_local - if a backend generates this amount of WAL, it will not only write the WAL, but also issue a flush (if still needed) wal_flush_after_remote - if this amount of WAL is generated, it will wait for syncrep to confirm the flushed LSN The attached PoC patch does this, mostly the same way as earlier patches. XLogInsertRecord is where the decision whether throttling may be needed is done, HandleXLogDelayPending then does the actual work (writing WAL, flushing it, waiting for syncrep). The one new thing HandleXLogDelayPending also does is auto-tuning the values a bit. The idea is that with per-backend threshold, it's hard to enforce some sort of global limit, because if depends on the number of active backends. If you set 1MB of WAL per backend, the total might be 1MB or 1000MB, if there are 1000 backends. Who knows. So this tries to reduce the threshold (if the backend generated only a tiny fraction of the WAL), or increase the threshold (if it generated most of it). I'm not entirely sure this behaves sanely under all circumstances, but for a PoC patch it seems OK. The first two GUCs remind me what walwriter is doing, and I've been asking myself if maybe making it more aggressive would have the same effect. But I don't think so, because a big part of this throttling patch is ... well, throttling. Making the backends sleep for a bit (or wait for something), to slow it down. And walwriter doesn't really do that I think. In a recent off-list discussion, someone asked if maybe this might be useful to prevent emergencies due to archiver not keeping up and WAL filling disk. A bit like enforcing a more "strict" limit on WAL than the current max_wal_size GUC. I'm not sure about that, it's certainly a very different use case than minimizing impact on OLTP latency. But it seems like "archived LSN" might be another "event" the backends would wait for, just like they wait for syncrep to confirm a LSN. Ideally it'd never happen, ofc, and it seems a bit like a great footgun (outage on archiver may kill PROD), but if you're at risk of ENOSPACE on pg_wal, not doing anything may be risky too ... FWIW I wonder if maybe we should frame this a as a QoS feature, where instead of "minimize impact of bulk loads" we'd try to "guarantee" or "reserve" some part of the capacity to certain backends/... 
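For illustration, the check after inserting a record could look roughly like this - the GUC names match the ones above, everything else (the enum, the helper, passing the byte counter as an argument) is made up here and the actual PoC differs in the details; the expensive work is still deferred to HandleXLogDelayPending() via ProcessInterrupts():

typedef enum
{
	WAL_THROTTLE_NONE,
	WAL_THROTTLE_WRITE,			/* write out completed WAL pages */
	WAL_THROTTLE_FLUSH_LOCAL,	/* also flush WAL locally */
	WAL_THROTTLE_FLUSH_REMOTE	/* also wait for syncrep confirmation */
} XLogThrottleMode;

static XLogThrottleMode pending_throttle = WAL_THROTTLE_NONE;

/* thresholds are the GUCs described above, taken here as bytes */
static void
MaybeScheduleWalThrottle(uint64 backendWalInserted)
{
	XLogThrottleMode mode = WAL_THROTTLE_NONE;

	if (wal_flush_after_remote > 0 && backendWalInserted >= wal_flush_after_remote)
		mode = WAL_THROTTLE_FLUSH_REMOTE;
	else if (wal_flush_after_local > 0 && backendWalInserted >= wal_flush_after_local)
		mode = WAL_THROTTLE_FLUSH_LOCAL;
	else if (wal_write_after > 0 && backendWalInserted >= wal_write_after)
		mode = WAL_THROTTLE_WRITE;

	if (mode != WAL_THROTTLE_NONE)
	{
		pending_throttle = mode;
		XLogDelayPending = true;
		InterruptPending = true;	/* handled at the next CHECK_FOR_INTERRUPTS() */
	}
}

The auto-tuning of the per-backend thresholds mentioned above would then adjust the effective values between calls; that part is omitted here.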
Now, let's look at results from some of the experiments. I wanted to see how effective this approach could be in minimizing the impact of large bulk loads on small OLTP transactions in different setups. Thanks to the two new GUCs this is not strictly about syncrep, so I decided to try three cases: 1) local, i.e. single-node instance 2) syncrep on the same switch, with 0.1ms latency (1Gbit) 3) syncrep with 10ms latency (also 1Gbit) And for each configuration I ran a pgbench (30 minutes), either on its own, or concurrently with bulk COPY of 1GB data. The load was done either by a single backend (so one backend loading 1GB of data), or the file was split into 10 files of 100MB each, and this was loaded by 10 concurrent backends. And I did this test with three configurations: (a) master - unpatched, current behavior (b) throttle-1: patched with limits set like this: # Add settings for extensions here wal_write_after = '8kB' wal_flush_after_local = '16kB' wal_flush_after_remote = '32kB' (c) throttle-2: patched with throttling limits set to 4x of (b), i.e. # Add settings for extensions here wal_write_after = '32kB' wal_flush_after_local = '64kB' wal_flush_after_remote = '128kB' And I did this for the traditional three scales (small, medium, large), to hit different bottlenecks. And of course, I measured both throughput and latencies. The full results are available here: [1] https://github.com/tvondra/wal-throttle-results/tree/master I'm not going to attach the files visualizing the results here, because it's like 1MB per file, which is not great for e-mail. https://github.com/tvondra/wal-throttle-results/blob/master/wal-throttling.pdf ---------------------------------------------------------------------- The first file summarizes the throughput results for the three configurations, different scales etc. On the left is throughput, on the right is the number of load cycles completed. I think this behaves mostly as expected - with the bulk loads, the throughput drops. How much depends on the configuration (for syncrep it's far more pronounced). The throttling recovers a lot of it, at the expense of doing fewer loads - and it's quite a significant drop. But that's expected, and it was kinda what this patch was about - prioritise the small OLTP transactions by doing fewer loads. This is not a patch that would magically inflate capacity of the system to do more things. I however agree this does not really represent a typical production OLTP system. Those systems don't run at 100% saturation, except for short periods, certainly not if they're doing something latency sensitive. So a somewhat realistic test would be pgbench throttled at 75% capacity, leaving some spare capacity for the bulk loads. I actually tried that, and there are results in [1], but the behavior is pretty similar to what I'm describing here (except that the system does actually manage to do more bulk loads, ofc). https://raw.githubusercontent.com/tvondra/wal-throttle-results/master/syncrep/latencies-1000-full.eps ----------------------------------------------------------------------- Now let's look at the second file, which shows latency percentiles for the medium dataset on syncrep. The difference between master (on the left) and the two throttling builds is pretty obvious. It's not exactly the same as "no concurrent bulk loads" in the top row, but not far from it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hey Tomas, Shirisha had posted a recent re-attempt here [1] and then we were graciously redirected here by Jakub. We took a look at v5-0001-v4.patch and also a brief look at v5-0002-rework.patch. We feel that it might be worth considering throttling based on the remote standby to begin with for simplicity (i.e. wal_flush_after_remote), and then we can add the other knobs incrementally? Our comments on v5-0001-v4.patch are below: > + > + /* > + * Decide if we need to throttle this backend, so that it does not write > + * WAL too fast, causing lag against the sync standby (which in turn > + * increases latency for standby confirmations). We may be holding locks > + * and blocking interrupts here, so we only make the decision, but the > + * wait (for sync standby confirmation) happens elsewhere. Slightly reworded: * wait (for sync standby confirmation) happens as part of the next * CHECK_FOR_INTERRUPTS(). See HandleXLogDelayPending() for details on * why the delay is deferred. > + * > + * The throttling is applied only to large transactions (producing more > + * than wal_throttle_threshold kilobytes of WAL). Throttled backends > + * can be identified by a new wait event SYNC_REP_THROTTLED. > + * > + * Small transactions (by amount of produced WAL) are still subject to > + * the sync replication, so the same wait happens at commit time. > + * > + * XXX Not sure this is the right place for a comment explaining how the > + * throttling works. This place is way too low level, and rather far from > + * the place where the wait actually happens. Perhaps the best course of action is to move the code and the comments above into an inline function: SetAndCheckWALThrottle(). > + > +/* > + * HandleXLogDelayPending > + * Throttle backends generating large amounts of WAL. > + * > + * The throttling is implemented by waiting for a sync replica confirmation for > + * a convenient LSN position. In particular, we do not wait for the current LSN, > + * which may be in a partially filled WAL page (and we don't want to write this > + * one out - we'd have to write it out again, causing write amplification). > + * Instead, we move back to the last fully WAL page. > + * > + * Called from ProcessMessageInterrupts() to avoid syncrep waits in XLogInsert(), > + * which happens in critical section and with blocked interrupts (so it would be > + * impossible to cancel the wait if it gets stuck). Also, there may be locks held > + * and we don't want to hold them longer just because of the wait. > + * > + * XXX Andres suggested we actually go back a couple pages, to increase the > + * probability the LSN was already flushed (obviously, this depends on how much > + * lag we allow). > + * > + * XXX Not sure why we use XactLastThrottledRecEnd and not simply XLogRecEnd? > + */ > +void > +HandleXLogDelayPending() > +{ > + XLogRecPtr lsn; > + > + /* calculate last fully filled page */ > + lsn = XactLastThrottledRecEnd - (XactLastThrottledRecEnd % XLOG_BLCKSZ); > + > + Assert(wal_throttle_threshold > 0); > + Assert(backendWalInserted >= wal_throttle_threshold * 1024L); > + Assert(XactLastThrottledRecEnd != InvalidXLogRecPtr); > + > + XLogFlush(lsn); > + SyncRepWaitForLSN(lsn, false); > + > + XLogDelayPending = false; > +} (1) Can't we simply call SyncRepWaitForLSN(LogwrtResult.Flush, false); here? LogwrtResult.Flush will guarantee that we are waiting on something that has already been flushed or will be flushed soon. 
Then we wouldn't need to maintain XactLastThrottledRecEnd, nor call XLogFlush() before calling SyncRepWaitForLSN(). LogwrtResult can be slightly stale, but does that really matter here? (2) Also, to make things more understandable, instead of maintaining a counter to track the number of WAL bytes written, maybe we should maintain a LSN pointer called XLogDelayWindowStart. And then in here, we can assert: Assert(LogwrtResult.Flush - XLogDelayWindowStart > wal_throttle_threshold * 1024); Similarly, we can check the same conditional in SetAndCheckWALThrottle(). After the wait is done and we reset XLogDelayPending = false, we can perhaps reset XLogDelayWindowStart = LogwrtResult.Flush. The only downside probably is that if our standby is caught up enough, we may be repeatedly and unnecessarily invoking a HandleXLogDelayPending(), where our SyncRepWaitForLSN() would be called with an older LSN and it would be a no-op early exit. > + * XXX Should this be done even if XLogDelayPending is already set? Maybe > + * that should only update XactLastThrottledRecEnd, withoug incrementing > + * the pgWalUsage.wal_throttled counter? > + */ > + backendWalInserted += rechdr->xl_tot_len; > + > + if ((synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_WRITE) && > + (wal_throttle_threshold > 0) && > + (backendWalInserted >= wal_throttle_threshold * 1024L)) > + { > + XactLastThrottledRecEnd = XactLastRecEnd; > + InterruptPending = true; > + XLogDelayPending = true; > + pgWalUsage.wal_throttled++; XLogDelayWindowStart = LogwrtResult.Flush; > + } Yeah we shouldn't do all of this if XLogDelayPending is already set. We can add a quick !XLogDelayPending leading conditional. Also, [1] has a TAP test that may be useful. Regards, Soumyadeep & Shirisha Broadcom [1] https://www.postgresql.org/message-id/CAP3-t08umaBEUEppzBVY6%3D%3D3tbdLwG7b4wfrba73zfOAUrRsoQ%40mail.gmail.com
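As an illustration of that suggestion, a rough sketch of the LSN-window variant - SetAndCheckWALThrottle() and XLogDelayWindowStart follow the names proposed above, the rest (including using GetFlushRecPtr() instead of a cached LogwrtResult.Flush) is guessed and not meant as correct patch code:

static XLogRecPtr XLogDelayWindowStart = InvalidXLogRecPtr;

/* cheap check in the insert path; flush_lsn would be LogwrtResult.Flush */
static inline void
SetAndCheckWALThrottle(XLogRecPtr flush_lsn)
{
	if (wal_throttle_threshold <= 0 ||
		synchronous_commit < SYNCHRONOUS_COMMIT_REMOTE_WRITE)
		return;

	if (XLogDelayWindowStart == InvalidXLogRecPtr)
		XLogDelayWindowStart = flush_lsn;

	if (!XLogDelayPending &&
		flush_lsn - XLogDelayWindowStart >= wal_throttle_threshold * 1024L)
	{
		XLogDelayPending = true;
		InterruptPending = true;
	}
}

/* the deferred wait, reached from ProcessInterrupts() */
void
HandleXLogDelayPending(void)
{
	XLogRecPtr	flush_lsn = GetFlushRecPtr(NULL);

	/* no XLogFlush() needed: this LSN is already flushed */
	SyncRepWaitForLSN(flush_lsn, false);

	XLogDelayWindowStart = flush_lsn;
	XLogDelayPending = false;
}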
Hey Tomas, Would you be interested in registering the patch for CF 2024-11? Please let us know if there is anything we can do to help with additional review or testing. Regards, Soumyadeep & Shirisha Broadcom
Hi Soumyadeep & Shirisha,
On Wed, Oct 16, 2024 at 8:04 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> Hey Tomas,
> Would you be interested in registering the patch for CF 2024-11?
> Please let us know if there is anything we can do to help with additional review
> or testing.
Tomas has apparently never received your earlier email with the review [1] - I've included his current email in this message and I've registered the patch in the CF under https://commitfest.postgresql.org/50/5309/ . Even if this $topic doesn't get traction and is rejected/returned with feedback due to lack of feedback or work, having the record in the CF should help others in the future use those patches and design discussions as a good starting point. Right now I do not have plans to work on this, but maybe I'll be able to help in the future.
Regards,
-Jakub Wartak.