Thread: Syncrep and improving latency due to WAL throttling
Hi,

Attached is a proposal/idea by Tomas (in CC) for protecting and prioritizing OLTP latency on syncrep over other heavy WAL-writing sessions. This is the result of internal testing and research related to syncrep behavior done by Tomas, Alvaro and me.

The main objective of this work-in-progress/crude patch idea (with the GUC disabled by default) is to prevent build-up of lag against the synchronous standby. It allows the DBA to maintain stable latency for many backends while some WAL-heavy activity (full table rewrite, VACUUM, MV build, archiving, etc.) is happening in parallel. In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).

Demo. Given:
- two DBs in a syncrep configuration with an artificial 10ms RTT between them (introduced via tc qdisc netem)
- insertBIG.sql = "insert into bandwidthhog select repeat('c', 1000) from generate_series(1, 500000);" (50MB of WAL data)
- pgbench (8c) and 1x INSERT session

There are clearly visible stalls in the pgbench (OLTP) traffic -- throughput drops and latency spikes -- when the WAL socket is saturated.

With 16devel/master and synchronous_commit_flush_wal_after=0 (disabled, default/baseline):

postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 59.0 tps, lat 18.840 ms stddev 11.251, 0 failed, lag 0.059 ms
progress: 2.0 s, 48.0 tps, lat 14.332 ms stddev 4.272, 0 failed, lag 0.063 ms
progress: 3.0 s, 56.0 tps, lat 15.383 ms stddev 6.270, 0 failed, lag 0.061 ms
progress: 4.0 s, 51.0 tps, lat 15.104 ms stddev 5.850, 0 failed, lag 0.061 ms
progress: 5.0 s, 47.0 tps, lat 15.184 ms stddev 5.472, 0 failed, lag 0.063 ms
progress: 6.0 s, 23.0 tps, lat 88.495 ms stddev 141.845, 0 failed, lag 0.064 ms
progress: 7.0 s, 1.0 tps, lat 999.053 ms stddev 0.000, 0 failed, lag 0.077 ms
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed, lag 0.000 ms
progress: 9.0 s, 1.0 tps, lat 2748.142 ms stddev NaN, 0 failed, lag 0.072 ms
progress: 10.0 s, 68.1 tps, lat 3368.267 ms stddev 282.842, 0 failed, lag 2911.857 ms
progress: 11.0 s, 97.0 tps, lat 2560.750 ms stddev 216.844, 0 failed, lag 2478.261 ms
progress: 12.0 s, 96.0 tps, lat 1463.754 ms stddev 376.276, 0 failed, lag 1383.873 ms
progress: 13.0 s, 94.0 tps, lat 616.243 ms stddev 230.673, 0 failed, lag 527.241 ms
progress: 14.0 s, 59.0 tps, lat 48.265 ms stddev 72.533, 0 failed, lag 15.181 ms
progress: 15.0 s, 39.0 tps, lat 14.237 ms stddev 6.073, 0 failed, lag 0.063 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 931.383 ms
latency stddev = 1188.530 ms
rate limit schedule lag: avg 840.170 (max 3605.569) ms

session2 output:

postgres=# \i insertBIG.sql
Timing is on.
INSERT 0 500000
Time: 4119.485 ms (00:04.119)

This new GUC makes it possible for the OLTP traffic to be less affected (latency-wise) when the heavy bulk traffic hits.
With synchronous_commit_flush_wal_after=1024 (kB) it's way better, but latency still rises up to ~45ms:

postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 52.0 tps, lat 17.300 ms stddev 10.178, 0 failed, lag 0.061 ms
progress: 2.0 s, 51.0 tps, lat 19.490 ms stddev 12.626, 0 failed, lag 0.061 ms
progress: 3.0 s, 48.0 tps, lat 14.839 ms stddev 5.429, 0 failed, lag 0.061 ms
progress: 4.0 s, 53.0 tps, lat 24.635 ms stddev 13.449, 0 failed, lag 0.062 ms
progress: 5.0 s, 48.0 tps, lat 17.999 ms stddev 9.291, 0 failed, lag 0.062 ms
progress: 6.0 s, 57.0 tps, lat 21.513 ms stddev 17.011, 0 failed, lag 0.058 ms
progress: 7.0 s, 50.0 tps, lat 28.071 ms stddev 21.622, 0 failed, lag 0.061 ms
progress: 8.0 s, 45.0 tps, lat 27.244 ms stddev 11.975, 0 failed, lag 0.064 ms
progress: 9.0 s, 57.0 tps, lat 35.988 ms stddev 25.752, 0 failed, lag 0.057 ms
progress: 10.0 s, 56.0 tps, lat 45.478 ms stddev 39.831, 0 failed, lag 0.534 ms
progress: 11.0 s, 62.0 tps, lat 45.146 ms stddev 32.881, 0 failed, lag 0.058 ms
progress: 12.0 s, 51.0 tps, lat 24.250 ms stddev 12.405, 0 failed, lag 0.063 ms
progress: 13.0 s, 57.0 tps, lat 18.554 ms stddev 8.677, 0 failed, lag 0.060 ms
progress: 14.0 s, 44.0 tps, lat 15.923 ms stddev 6.958, 0 failed, lag 0.065 ms
progress: 15.0 s, 54.0 tps, lat 19.773 ms stddev 10.024, 0 failed, lag 0.063 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 25.575 ms
latency stddev = 21.540 ms

session2 output:

postgres=# set synchronous_commit_flush_wal_after = 1024;
SET
postgres=# \i insertBIG.sql
INSERT 0 500000
Time: 8889.318 ms (00:08.889)

With 16devel/master and synchronous_commit_flush_wal_after=256 (kB) all is smooth:

postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 49.0 tps, lat 14.345 ms stddev 4.700, 0 failed, lag 0.062 ms
progress: 2.0 s, 45.0 tps, lat 14.812 ms stddev 5.816, 0 failed, lag 0.064 ms
progress: 3.0 s, 49.0 tps, lat 13.145 ms stddev 4.320, 0 failed, lag 0.063 ms
progress: 4.0 s, 44.0 tps, lat 14.429 ms stddev 4.715, 0 failed, lag 0.063 ms
progress: 5.0 s, 49.0 tps, lat 18.111 ms stddev 8.536, 0 failed, lag 0.062 ms
progress: 6.0 s, 58.0 tps, lat 17.929 ms stddev 8.198, 0 failed, lag 0.060 ms
progress: 7.0 s, 65.0 tps, lat 20.186 ms stddev 12.973, 0 failed, lag 0.059 ms
progress: 8.0 s, 47.0 tps, lat 16.174 ms stddev 6.508, 0 failed, lag 0.061 ms
progress: 9.0 s, 45.0 tps, lat 14.485 ms stddev 4.736, 0 failed, lag 0.061 ms
progress: 10.0 s, 53.0 tps, lat 16.879 ms stddev 8.783, 0 failed, lag 0.061 ms
progress: 11.0 s, 42.0 tps, lat 13.711 ms stddev 4.464, 0 failed, lag 0.062 ms
progress: 12.0 s, 49.0 tps, lat 13.252 ms stddev 4.082, 0 failed, lag 0.062 ms
progress: 13.0 s, 48.0 tps, lat 14.179 ms stddev 6.238, 0 failed, lag 0.058 ms
progress: 14.0 s, 43.0 tps, lat 12.210 ms stddev 2.993, 0 failed, lag 0.060 ms
progress: 15.0 s, 34.0 tps, lat 14.811 ms stddev 6.544, 0 failed, lag 0.062 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 15.454 ms
latency stddev = 7.354 ms

session2 output (notice the INSERT took much longer but did NOT affect pgbench's latency stddev at all):

postgres=# set synchronous_commit_flush_wal_after = 256;
SET
postgres=# \i insertBIG.sql
Timing is on.
[..]
INSERT 0 500000
Time: 22737.729 ms (00:22.738)

Without this feature (or with synchronous_commit_flush_wal_after=0) the TCP SendQ on the walsender-->walreceiver socket keeps growing, and as a result any subsequent sendto() on behalf of the OLTP backends/walwriter ends up queued for too long, causing stalls of activity.

-Jakub Wartak
Hi,

On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).

Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:

> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>     pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>     pgWalUsage.wal_records++;
>     pgWalUsage.wal_fpi += num_fpi;
> +
> +   backendWalInserted += rechdr->xl_tot_len;
> +
> +   if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE)&&
> +       synchronous_commit_flush_wal_after > 0 &&
> +       backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
> +   {
> +       elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
> +       XLogFlush(EndPos);
> +       /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
> +       /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
> +       SyncRepWaitForLSN(EndPos, false);
> +       elog(DEBUG3, "throttling WAL down on this session - end");
> +       backendWalInserted = 0;
> +   }
>  }

You're blocking in the middle of an XLOG insertion. We will commonly hold important buffer lwlocks, it'll often be in a critical section (no cancelling / terminating the session!). This'd entail loads of undetectable deadlocks (i.e. hard hangs). And even leaving that aside, doing an unnecessary XLogFlush() with important locks held will seriously increase contention.

My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.

I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.

Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?

A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the last XLogFlush() will increase quickly, triggering a flush at the next opportunity. But short OLTP transactions will do XLogFlush()es at least at every commit.

I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.

Greetings,

Andres Freund
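For illustration, the flag-and-interrupt shape described above might look roughly like the following sketch. All function and variable names here are invented (throttle_threshold_bytes stands in for whatever limit a GUC would provide); this is not code from the posted patch, only a sketch of the structure being discussed:

/* Sketch only -- illustrative names, not from the posted patch. */
static uint64   backendWalSinceFlush = 0;   /* WAL bytes inserted since this backend last flushed */
static bool     walThrottlePending = false;

/* At the end of XLogInsertRecord(): never wait here, only record the intent. */
static void
MaybeMarkWalThrottle(uint32 inserted_bytes)
{
    backendWalSinceFlush += inserted_bytes;

    if (backendWalSinceFlush > throttle_threshold_bytes)    /* hypothetical limit */
    {
        walThrottlePending = true;
        InterruptPending = true;    /* make CHECK_FOR_INTERRUPTS() call ProcessInterrupts() soon */
    }
}

/* Called from ProcessInterrupts(), i.e. only outside critical / interrupts-off sections. */
void
PerformWalThrottleIfPending(void)
{
    XLogRecPtr  target;

    if (!walThrottlePending)
        return;
    walThrottlePending = false;

    /* flush/wait only up to the last complete WAL page, never a partial one */
    target = XactLastRecEnd - (XactLastRecEnd % XLOG_BLCKSZ);

    XLogFlush(target);
    SyncRepWaitForLSN(target, false);
    backendWalSinceFlush = 0;
}

The key property is that the insert path only sets flags, so no sleeping or flushing happens while buffer locks are held or a critical section is open; the actual wait runs at the next interrupt-processing point, which is already guaranteed to be a safe place to block.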
On Thu, Jan 26, 2023 at 12:35 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> > In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).
>
> Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:

+1 for the feature as it keeps replication lag in check and provides a way for defining an RPO (recovery point objective).

> You're blocking in the middle of an XLOG insertion. We will commonly hold important buffer lwlocks, it'll often be in a critical section (no cancelling / terminating the session!). This'd entail loads of undetectable deadlocks (i.e. hard hangs). And even leaving that aside, doing an unnecessary XLogFlush() with important locks held will seriously increase contention.
>
> My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.

We've discussed this feature quite extensively in a recent thread - https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com. Folks were agreeing with Andres' suggestion there - https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de. However, that thread dropped off my radar. It's good that someone else has started working on it and I'm happy to help with this feature.

> I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.
>
> Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address.

Right.

> How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?

Seems reasonable.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the last XLogFlush() will increase quickly, triggering a flush at the next opportunity. But short OLTP transactions will do XLogFlush()es at least at every commit.

Right.

> I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.

When the WAL records are large, the backend inserting them would throttle frequently, generating more flushes, more waits for the sync replication ack, and more contention on WALWriteLock. Not sure if this is good unless the impact is measured.

A few more thoughts:

1. This feature keeps replication lag in check at the cost of throttling write traffic. I'd be curious to know the improvement in replication lag vs the drop in transaction throughput, say with pgbench running a custom insert script and one or more async/sync standbys.

2. Heavy WAL throttling can turn into a lot of writes and fsyncs. Eventually, each backend that generates WAL gets throttled, irrespective of whether there's any replication lag. How about we let backends throttle themselves not just based on wal_distance_from_last_flush but also depending on the replication lag at the moment, say, if replication lag crosses wal_throttle_replication_lag_threshold bytes?

3. Should the backends wait indefinitely for the sync rep ack when they have crossed wal_throttle_threshold? Well, I don't think so; a backend must be given a chance to do its stuff instead of favouring other backends by throttling itself.

4. I'd prefer adding a TAP test for this feature to check that the WAL throttle is picked up by a backend.

5. Can we also extend this WAL throttling feature to logical replication to keep replication lag in check there as well?

6. Backends can ignore throttling for WAL records marked as unimportant, no?

7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 1/25/23 20:05, Andres Freund wrote:
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
>> In other words it allows slow down of any backend activity. Any feedback on such a feature is welcome, including better GUC name proposals ;) and conditions in which such feature should be disabled even if it would be enabled globally (right now only anti-wraparound VACUUM comes to mind, it's not in the patch).
>
> Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:
>
>> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>>     pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>>     pgWalUsage.wal_records++;
>>     pgWalUsage.wal_fpi += num_fpi;
>> +
>> +   backendWalInserted += rechdr->xl_tot_len;
>> +
>> +   if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE)&&
>> +       synchronous_commit_flush_wal_after > 0 &&
>> +       backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
>> +   {
>> +       elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
>> +       XLogFlush(EndPos);
>> +       /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
>> +       /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
>> +       SyncRepWaitForLSN(EndPos, false);
>> +       elog(DEBUG3, "throttling WAL down on this session - end");
>> +       backendWalInserted = 0;
>> +   }
>>  }
>
> You're blocking in the middle of an XLOG insertion. We will commonly hold important buffer lwlocks, it'll often be in a critical section (no cancelling / terminating the session!). This'd entail loads of undetectable deadlocks (i.e. hard hangs). And even leaving that aside, doing an unnecessary XLogFlush() with important locks held will seriously increase contention.

Yeah, I agree the sleep would have to happen elsewhere. It's not clear to me how could it cause deadlocks, as we're not waiting for a lock/resource locked by someone else, but it's certainly an issue for uninterruptible hangs.

> My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.

The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.

> I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.

I don't see why would that significantly increase the number of flushes. The goal is to do this only every ~1MB of a WAL or so, and chances are there were many (perhaps hundreds or more) commits in between. At least in a workload with a fair number of OLTP transactions (which is kinda the point of all this).

And the "small" OLTP transactions don't really do any extra fsyncs, because the accounting resets at commit (or places that flush WAL).

Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse. But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).

> Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?

No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the last XLogFlush() will increase quickly, triggering a flush at the next opportunity. But short OLTP transactions will do XLogFlush()es at least at every commit.

Right, that's kinda the approach the patch is trying to do.

> I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.

True, that's kinda what I suggested above as a solution for partially filled WAL pages.

I agree this is crude and we'd probably need some sort of "balancing" logic that distributes WAL bandwidth between backends, and throttles backends producing a lot of WAL.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 1/25/23 20:05, Andres Freund wrote:
> > Hi,
> >
> > Such a feature could be useful - but I don't think the current place of throttling has any hope of working reliably:

[..]

> > You're blocking in the middle of an XLOG insertion. [..]
> Yeah, I agree the sleep would have to happen elsewhere.

Fixed.

> > My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.
> >
> The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.

By the time you replied I had already tried what Andres recommended.

[..]

> > At the very least this'd have to flush only up to the last fully filled page.
> >
> Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse. But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).

I've introduced the rounding to the last written page (hopefully).

> > Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?
> >
> No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.

Fixed.

> > I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.
> >
> True, that's kinda what I suggested above as a solution for partially filled WAL pages.
>
> I agree this is crude and we'd probably need some sort of "balancing" logic that distributes WAL bandwidth between backends, and throttles backends producing a lot of WAL.

OK - that's not included (yet?), as it would make this much more complex.

In summary: Attached is a slightly reworked version of this patch.
1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
3. Removed GUC for now (always on->256kB); potentially to be restored
4. Resets backendWal counter on commits

It's still crude, but first tests indicate that it behaves similarly (to the initial one with GUC = 256kB).

Also from Bharath's email, I've found another patch proposal by Simon [1]. I would like to avoid opening that Pandora's box again, but to stress this: this feature is specifically aimed at solving OLTP latency on *sync*rep. (Some code could be added/generalized later on and this feature could be extended to the async case, but that opens the question of managing the possible WAL throughput budget/back throttling - this was also raised in the first thread here [2] by Konstantin.) IMHO it could implement various strategies under a user-settable GUC such as "wal_throttle_larger_transactions=(sync,256kB)" or "wal_throttle_larger_transactions=off", and someday later someone could teach the code the async case (let's say under "wal_throttle_larger_transactions=(asyncMaxRPO, maxAllowedLag=8MB, 256kB)").

Thoughts?

[1] - https://www.postgresql.org/message-id/flat/CA%2BU5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/71f3e6fb-2fca-a798-856a-f23c8ede2333%40garret.ru
Hi,

On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
> It's not clear to me how could it cause deadlocks, as we're not waiting for a lock/resource locked by someone else, but it's certainly an issue for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of lock-ordering rules.

> > My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.
> >
> The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively adding a separate set of "wal_delay()" points all over the executor, commands/, ... seems like a bad plan.

> > I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.
> >
> I don't see why would that significantly increase the number of flushes. The goal is to do this only every ~1MB of a WAL or so, and chances are there were many (perhaps hundreds or more) commits in between. At least in a workload with a fair number of OLTP transactions (which is kinda the point of all this).

Because the accounting is done separately in each process. Even if you just add a few additional flushes for each connection, in aggregate that'll be a lot.

> And the "small" OLTP transactions don't really do any extra fsyncs, because the accounting resets at commit (or places that flush WAL).
> [...]
> > Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?
> >
> No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of the patch resets the accounting at commit?

> Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1 could *not* make it worse. Suddenly you get a bunch of additional XLogFlush() calls to partial boundaries by every autovacuum, by every process doing a bit more bulky work. Because you're flushing the LSN after a record you just inserted, you're commonly not going to be "joining" the work of an already in-process XLogFlush().

> But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).

Yes, as I proposed...

> > I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.
> >
> True, that's kinda what I suggested above as a solution for partially filled WAL pages.

I think flushing only up to a point further in the past than the last page boundary is somewhat important to prevent unnecessary flushes.

Greetings,

Andres Freund
Hi,

On 2023-01-26 14:40:56 +0100, Jakub Wartak wrote:
> In summary: Attached is a slightly reworked version of this patch.
> 1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
> 2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
> 3. Removed GUC for now (always on->256kB); potentially to be restored

Huh? Why did you remove the GUC? Clearly this isn't something we can default to on.

> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
> index d85e313908..05d56d65f9 100644
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -2395,6 +2395,7 @@ CommitTransaction(void)
>
>     XactTopFullTransactionId = InvalidFullTransactionId;
>     nParallelCurrentXids = 0;
> +   backendWalInserted = 0;
>
>     /*
>      * done with commit processing, set current transaction state back to

I don't like the resets around like this. Why not just move it into XLogFlush()?

> +/*
> + * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section
> + */
> +void HandleXLogDelayPending()
> +{
> +   /* flush only up to the last fully filled page */
> +   XLogRecPtr LastFullyWrittenXLogPage = XactLastRecEnd - (XactLastRecEnd % XLOG_BLCKSZ);
> +   XLogDelayPending = false;

Hm - wonder if it'd be better to determine the LSN to wait for in XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but bounded number of) WAL records before processing interrupts. No need to flush more aggressively than necessary...

> +   //HOLD_INTERRUPTS();
> +
> +   /* XXX Debug for now */
> +   elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
> +       backendWalInserted,
> +       LSN_FORMAT_ARGS(XactLastRecEnd),
> +       LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
> +
> +   /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
> +   /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
> +   XLogFlush(LastFullyWrittenXLogPage);
> +   SyncRepWaitForLSN(LastFullyWrittenXLogPage, false);

SyncRepWaitForLSN() has this comment:

 * 'lsn' represents the LSN to wait for.  'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.

> +   elog(WARNING, "throttling WAL down on this session - end");
> +   backendWalInserted = 0;
> +
> +   //RESUME_INTERRUPTS();
> +}

I think we'd want a distinct wait event for this.

> diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> index 1b1d814254..8ed66b2eae 100644
> --- a/src/backend/utils/init/globals.c
> +++ b/src/backend/utils/init/globals.c
> @@ -37,6 +37,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
>  volatile sig_atomic_t ProcSignalBarrierPending = false;
>  volatile sig_atomic_t LogMemoryContextPending = false;
>  volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
> +volatile sig_atomic_t XLogDelayPending = false;
>  volatile uint32 InterruptHoldoffCount = 0;
>  volatile uint32 QueryCancelHoldoffCount = 0;
>  volatile uint32 CritSectionCount = 0;

I don't think the new variable needs to be volatile, or even sig_atomic_t. We'll just manipulate it from outside signal handlers.

Greetings,

Andres Freund
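To illustrate that review point about determining the LSN in XLogInsertRecord() rather than deriving it later from XactLastRecEnd, a rough sketch of the insert-side code could look like this. XLogDelayTargetLSN and throttle_threshold_bytes are invented names for illustration (backendWalInserted and XLogDelayPending come from the quoted patch); this is not the posted patch itself:

/* Sketch only: in XLogInsertRecord(), after the record has been inserted. */
backendWalInserted += rechdr->xl_tot_len;

if (throttle_threshold_bytes > 0 &&
    backendWalInserted > throttle_threshold_bytes &&
    !XLogDelayPending)
{
    /*
     * Capture the wait target now, rounded down to the last complete page.
     * Records inserted between here and the next safe CHECK_FOR_INTERRUPTS()
     * then won't push the flush target forward, so we never flush more
     * aggressively than necessary.
     */
    XLogDelayTargetLSN = EndPos - (EndPos % XLOG_BLCKSZ);  /* hypothetical variable */
    XLogDelayPending = true;
    InterruptPending = true;
}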
Hi,

On 2023-01-26 13:33:27 +0530, Bharath Rupireddy wrote:
> 6. Backends can ignore throttling for WAL records marked as unimportant, no?

Why would that be a good idea? Not that it matters today, but those records still need to be flushed in case of a commit by another transaction.

> 7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.

I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?

Greetings,

Andres Freund
On 1/26/23 16:40, Andres Freund wrote:
> Hi,
>
> On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
>> It's not clear to me how could it cause deadlocks, as we're not waiting for a lock/resource locked by someone else, but it's certainly an issue for uninterruptible hangs.
>
> Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of lock-ordering rules.

Not sure what lock ordering issues you have in mind, but I agree it's not the right place for the sleep, no argument here.

>>> My best idea for how to implement this in a somewhat safe way would be for XLogInsertRecord() to set a flag indicating that we should throttle, and set InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a critical / interrupts off section) can actually perform the delay. That should fix the hard deadlock danger and remove most of the increase in lock contention.
>>
>> The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place.
>
> Where would such a point be, except for ProcessInterrupts()? Iteratively adding a separate set of "wal_delay()" points all over the executor, commands/, ... seems like a bad plan.

I haven't thought about where to do the work, TBH. ProcessInterrupts() may very well be a good place. I should have been clearer, but the main benefit of autovacuum-like throttling is IMHO that it involves all processes and a global limit, while the current approach is per-backend.

>>> I don't think doing an XLogFlush() of a record that we just wrote is a good idea - that'll sometimes significantly increase the number of fdatasyncs/sec we're doing. To make matters worse, this will often cause partially filled WAL pages to be flushed out - rewriting a WAL page multiple times can significantly increase overhead on the storage level. At the very least this'd have to flush only up to the last fully filled page.
>>
>> I don't see why would that significantly increase the number of flushes. The goal is to do this only every ~1MB of a WAL or so, and chances are there were many (perhaps hundreds or more) commits in between. At least in a workload with a fair number of OLTP transactions (which is kinda the point of all this).
>
> Because the accounting is done separately in each process. Even if you just add a few additional flushes for each connection, in aggregate that'll be a lot.

How come? Imagine the backend does flush only after generating e.g. 1MB of WAL. Small transactions won't do any additional flushes at all (because commit resets the accounting). Large transactions will do an extra flush every 1MB, so 16x per WAL segment. But in between there will be many commits from the small transactions. If we backoff to the last complete page, that should eliminate even most of these flushes.

So where would the additional flushes come from?

>> And the "small" OLTP transactions don't really do any extra fsyncs, because the accounting resets at commit (or places that flush WAL).
>> [...]
>>> Just counting the number of bytes inserted by a backend will make the overhead even worse, as the flush will be triggered even for OLTP sessions doing tiny transactions, even though they don't contribute to the problem you're trying to address. How about counting how many bytes of WAL a backend has inserted since the last time that backend did an XLogFlush()?
>>
>> No, we should reset the counter at commit, so small OLTP transactions should not really trigger this. That's kinda the point, to only delay "large" transactions producing a lot of WAL.
>
> I might have missed something, but I don't think the first version of the patch resets the accounting at commit?

Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test the behavior / demonstrate the effect.

>> Same for the flushes of partially flushed pages - if there's enough of small OLTP transactions, we're already having this issue. I don't see why would this make it measurably worse.
>
> Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1 could *not* make it worse. Suddenly you get a bunch of additional XLogFlush() calls to partial boundaries by every autovacuum, by every process doing a bit more bulky work. Because you're flushing the LSN after a record you just inserted, you're commonly not going to be "joining" the work of an already in-process XLogFlush().

Right. We do ~16 additional flushes per 16MB segment (or something like that, depending on the GUC value). Considering we do thousands of commits per segment, each of which does a flush, I don't see how this would make it measurably worse.

>> But if needed, we can simply backoff to the last page boundary, so that we only flush the complete page. That would work too - the goal is not to flush everything, but to reduce how much of the lag is due to the current process (i.e. to wait for sync replica to catch up).
>
> Yes, as I proposed...

Right.

>>> I also suspect the overhead will be more manageable if you were to force a flush only up to a point further back than the last fully filled page. We don't want to end up flushing WAL for every page, but if you just have a backend-local accounting mechanism, I think that's inevitably what you'd end up with when you have a large number of sessions. But if you'd limit the flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a prior boundary, the amount of unnecessary WAL flushes would be proportional to synchronous_commit_flush_wal_after.
>>
>> True, that's kinda what I suggested above as a solution for partially filled WAL pages.
>
> I think flushing only up to a point further in the past than the last page boundary is somewhat important to prevent unnecessary flushes.

Not sure I agree with that. Yes, we should not be doing flushes unless we need to, but OTOH we should not delay sending WAL to the standby too much - because that's what affects syncrep latency for small transactions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 26, 2023 at 9:21 PM Andres Freund <andres@anarazel.de> wrote:
>
> > 7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.
>
> I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?

I'm sorry, I couldn't get your point, can you please explain it a bit more?

Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-Jan-27, Bharath Rupireddy wrote:
> Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
>
> Keeping replication lag under check enables one to provide a better RPO guarantee

Hmm, but you can already do that by tuning walwriter, no?

--
Álvaro Herrera
On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Jan-27, Bharath Rupireddy wrote:
>
> > Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
> >
> > Keeping replication lag under check enables one to provide a better RPO guarantee
>
> Hmm, but you can already do that by tuning walwriter, no?

IIUC, one can increase wal_writer_delay or wal_writer_flush_after to control how much WAL the walwriter writes and flushes, so that the walsenders get a bit slower, thus improving replication lag. If my understanding is correct, what about the other backends doing WAL writes and flushes? How do we control those? I think tuning the walwriter alone may not help to keep replication lag under check. Even if it does, it requires manual intervention.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi, v2 is attached.

On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:

> Huh? Why did you remove the GUC?

After reading previous threads, my optimism level of getting it ever in shape of being widely accepted degraded significantly (mainly due to the discussion of the wider category of 'WAL I/O throttling', especially in the async case, RPO targets in the async case, and potentially calculating global bandwidth). I've assumed that it is a working sketch, and as such not having the GUC name settled right now (just for the sync case) would still allow covering various other async cases in the future without breaking compatibility with potential GUC name changes (see my previous "wal_throttle_larger_transactions=<strategies>" proposal). Although I've restored the synchronous_commit_flush_wal_after GUC in the v2 patch, sticking to such a name removes the possibility of using the code to achieve async WAL throttling in the future.

> Clearly this isn't something we can default to on.

Yes, I agree. It's never meant to be on by default.

> I don't like the resets around like this. Why not just move it into XLogFlush()?

Fixed.

> Hm - wonder if it'd be better to determine the LSN to wait for in XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but bounded number of) WAL records before processing interrupts. No need to flush more aggressively than necessary...

Fixed.

> SyncRepWaitForLSN() has this comment:
> * 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
> * represents a commit record. If it doesn't, then we wait only for the WAL
> * to be flushed if synchronous_commit is set to the higher level of
> * remote_apply, because only commit records provide apply feedback.

Hm, not sure if I understand: are you saying that we should (in the throttled scenario) have some special feedback msgs or not -- irrespective of the setting? To be honest the throttling shouldn't need to wait for the standby's full synchronous_commit level, it's just about the slowdown itself (so IMHO it would be fine even in the remote_write/remote_apply scenario if the remote walreceiver just received the data, without necessarily writing it to a file or waiting for it to be applied). Just this waiting for a round-trip ack about LSN progress would be enough to slow down the writer (?). I've added some timing log into the draft and it shows a more or less constant, solid RTT even as it stands:

psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6CB70B8 flushingTo=2/A6CB6000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session - end (10.500052 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6CF7C08 flushingTo=2/A6CF6000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session - end (10.655370 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6D385E0 flushingTo=2/A6D38000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session - end (10.627334 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session (backendWalInserted=262632, LSN=2/A6D78FA0 flushingTo=2/A6D78000)
[..]

> I think we'd want a distinct wait event for this.

Added and tested that it shows up.

> > +volatile sig_atomic_t XLogDelayPending = false;
>
> I don't think the new variable needs to be volatile, or even sig_atomic_t. We'll just manipulate it from outside signal handlers.

Changed to bool; previously I wanted it to "fit" with the above ones.

-J.
Hi Bharath,

On Fri, Jan 27, 2023 at 12:04 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2023-Jan-27, Bharath Rupireddy wrote:
> >
> > > Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
> > >
> > > Keeping replication lag under check enables one to provide a better RPO guarantee

Sorry for not answering earlier. Although the title of the thread refers to syncrep only, I think everyone would be open to also covering the more advanced async scenarios that Satyanarayana proposed in those earlier threads (as Simon did much earlier). I was proposing wal_throttle_larger_transactions=<..> (for lack of a better name); however, today's v2 of the patch again refers to syncrep (that could be changed, of course). It's just the code that is missing, and it could be added on top of v2, so we could combine our efforts. It's the competency and time that I lack for implementing such async-scenario code paths (maybe Tomas V. has something in mind with his own words [1]), so any feedback from senior hackers is more than welcome...

-J.

[1] - "The solution I've imagined is something like autovacuum throttling - do some accounting of how much "WAL bandwidth" each process consumed, and then do the delay/sleep in a suitable place."
On 1/27/23 08:18, Bharath Rupireddy wrote:
> On Thu, Jan 26, 2023 at 9:21 PM Andres Freund <andres@anarazel.de> wrote:
>>
>>> 7. I think we need to not let backends throttle too frequently even though they have crossed wal_throttle_threshold bytes. The best way is to rely on replication lag, after all the goal of this feature is to keep replication lag under check - say, throttle only when wal_distance > wal_throttle_threshold && replication_lag > wal_throttle_replication_lag_threshold.
>>
>> I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?
>
> I'm sorry, I couldn't get your point, can you please explain it a bit more?

The idea is that we would not flush the exact current LSN, because that's likely somewhere in the page, and we always write the whole page which leads to write amplification.

But if we backed off a bit, and wrote e.g. to the last page boundary, that wouldn't have this issue (either the page was already flushed - noop, or we'd have to flush it anyway). We could even back off a bit more, to increase the probability it was actually flushed / sent to standby. That would still work, because the whole point is not to allow one process to generate too much unflushed WAL, forcing the other (small) xacts to wait at commit.

Imagine we have the limit set to 8MB, i.e. the backend flushes WAL after generating 8MB of WAL. If we flush to the exact current LSN, the other backends will wait for ~4MB on average. If we back off to 1MB, the wait average increases to ~4.5MB.

(This is simplified, as it ignores WAL from the small xacts. But those flush regularly, which limits the amount. It also ignores that there might be multiple large xacts.)

> Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?

This focuses on sync rep, because that's where the commit latency comes from. Async doesn't have that issue, because it doesn't wait for the standby. In particular, the trick is in penalizing the backends generating a lot of WAL, while leaving the small xacts alone.

> Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

Isn't that a bit over-complicated? RPO generally only cares about xacts that committed (because that's what you want to not lose), so why not to simply introduce a "sync mode" that simply uses a bit older LSN when waiting for the replica? Seems much simpler and similar to what we already do.

Yeah, if someone generates a lot of WAL in an uncommitted transaction, all of that may be lost. But who cares (from the RPO point of view)?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
>
> > Huh? Why did you remove the GUC?
>
> After reading previous threads, my optimism level of getting it ever in shape of being widely accepted degraded significantly (mainly due to the discussion of the wider category of 'WAL I/O throttling', especially in the async case, RPO targets in the async case, and potentially calculating global bandwidth).

I think it's quite reasonable to limit this to a smaller scope. Particularly because those other goals are pretty vague but ambitious. IMO the problem with a lot of the threads is precisely that they aimed at a level of generality that isn't achievable in one step.

> I've assumed that it is a working sketch, and as such not having the GUC name settled right now (just for the sync case) would still allow covering various other async cases in the future without breaking compatibility with potential GUC name changes (see my previous "wal_throttle_larger_transactions=<strategies>" proposal).

It's harder to benchmark something like this without a GUC, so I think it's worth having, even if it's not the final name.

> > SyncRepWaitForLSN() has this comment:
> > * 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
> > * represents a commit record. If it doesn't, then we wait only for the WAL
> > * to be flushed if synchronous_commit is set to the higher level of
> > * remote_apply, because only commit records provide apply feedback.
>
> Hm, not sure if I understand: are you saying that we should (in the throttled scenario) have some special feedback msgs or not -- irrespective of the setting? To be honest the throttling shouldn't need to wait for the standby's full synchronous_commit level, it's just about the slowdown itself (so IMHO it would be fine even in the remote_write/remote_apply scenario if the remote walreceiver just received the data, without necessarily writing it to a file or waiting for it to be applied). Just this waiting for a round-trip ack about LSN progress would be enough to slow down the writer (?). I've added some timing log into the draft and it shows a more or less constant, solid RTT even as it stands:

My problem was that the function header for SyncRepWaitForLSN() seems to say that we don't wait at all if commit=false and synchronous_commit < remote_apply. But I think that might just be bad formulation.

[...] 'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.

because the code does something entirely different afaics:

    /* Cap the level for anything other than commit to remote flush only. */
    if (commit)
        mode = SyncRepWaitMode;
    else
        mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);

Greetings,

Andres Freund
Hi,

On 2023-01-27 12:48:43 +0530, Bharath Rupireddy wrote:
> Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)?
>
> Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

I think something narrower and easier to achieve is a good thing. We've already had loads of discussion for the more general problem, without a lot of actual progress.

Greetings,

Andres Freund
Hi,

On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
> On 1/27/23 08:18, Bharath Rupireddy wrote:
> >> I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that?
> >
> > I'm sorry, I couldn't get your point, can you please explain it a bit more?
>
> The idea is that we would not flush the exact current LSN, because that's likely somewhere in the page, and we always write the whole page which leads to write amplification.
>
> But if we backed off a bit, and wrote e.g. to the last page boundary, that wouldn't have this issue (either the page was already flushed - noop, or we'd have to flush it anyway).

Yep.

> We could even back off a bit more, to increase the probability it was actually flushed / sent to standby.

That's not the sole goal, from my end: I'd like to avoid writing out + flushing the WAL in too small chunks. Imagine a few concurrent vacuums or COPYs or such - if we're unlucky they'd each end up exceeding their "private" limit close to each other, leading to a number of small writes of the WAL. Which could end up increasing local commit latency / iops.

If we instead decide to only ever flush up to something like

  last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ

we'd make sure that the throttling mechanism won't cause a lot of small writes.

> > Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
>
> Isn't that a bit over-complicated? RPO generally only cares about xacts that committed (because that's what you want to not lose), so why not to simply introduce a "sync mode" that simply uses a bit older LSN when waiting for the replica? Seems much simpler and similar to what we already do.

I don't think that really helps you that much. If there's e.g. a huge VACUUM / COPY emitting loads of WAL you'll suddenly see commit latency of concurrently committing transactions spike into oblivion. Whereas a general WAL throttling mechanism would throttle the VACUUM, without impacting the commit latency of normal transactions.

Greetings,

Andres Freund
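Read literally, the back-off formula quoted above might be computed along these lines. This is a sketch only: throttle_pages stands in for whatever page count a GUC would translate to, and the function name and variables are invented, not taken from any posted patch:

/* Sketch: pick a flush/wait target some distance behind the last page
 * boundary, so concurrent bulk writers don't each force small WAL writes. */
static XLogRecPtr
WalThrottleFlushTarget(XLogRecPtr insert_lsn, int throttle_pages)
{
    XLogRecPtr  last_page_boundary = insert_lsn - (insert_lsn % XLOG_BLCKSZ);
    XLogRecPtr  backoff = (XLogRecPtr) (throttle_pages / 8) * XLOG_BLCKSZ;

    if (last_page_boundary <= backoff)
        return InvalidXLogRecPtr;   /* too close to the start of the WAL stream; skip throttling */

    return last_page_boundary - backoff;
}

A backend would then only flush/wait when this target actually advances past what it last flushed, which is what keeps the number of extra flushes proportional to the throttling threshold rather than to the number of WAL pages written.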
On 1/27/23 22:33, Andres Freund wrote: > Hi, > > On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote: >> On 1/27/23 08:18, Bharath Rupireddy wrote: >>>> I think my idea of only forcing to flush/wait an LSN some distance in the past >>>> would automatically achieve that? >>> >>> I'm sorry, I couldn't get your point, can you please explain it a bit more? >>> >> >> The idea is that we would not flush the exact current LSN, because >> that's likely somewhere in the page, and we always write the whole page >> which leads to write amplification. >> >> But if we backed off a bit, and wrote e.g. to the last page boundary, >> that wouldn't have this issue (either the page was already flushed - >> noop, or we'd have to flush it anyway). > > Yep. > > >> We could even back off a bit more, to increase the probability it was >> actually flushed / sent to standby. > > That's not the sole goal, from my end: I'd like to avoid writing out + > flushing the WAL in too small chunks. Imagine a few concurrent vacuums or > COPYs or such - if we're unlucky they'd each end up exceeding their "private" > limit close to each other, leading to a number of small writes of the > WAL. Which could end up increasing local commit latency / iops. > > If we instead decide to only ever flush up to something like > last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ > > we'd make sure that the throttling mechanism won't cause a lot of small > writes. > I'm not saying we shouldn't do this, but I still don't see how this could make a measurable difference. At least assuming a sensible value of the throttling limit (say, more than 256kB per backend), and OLTP workload running concurrently. That means ~64 extra flushes/writes per 16MB segment (at most). Yeah, a particular page might get unlucky and be flushed by multiple backends, but the average still holds. Meanwhile, the OLTP transactions will generate (at least) an order of magnitude more flushes. > >>> Keeping replication lag under check enables one to provide a better >>> RPO guarantee as discussed in the other thread >>> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com. >>> >> >> Isn't that a bit over-complicated? RPO generally only cares about xacts >> that committed (because that's what you want to not lose), so why not to >> simply introduce a "sync mode" that simply uses a bit older LSN when >> waiting for the replica? Seems much simpler and similar to what we >> already do. > > I don't think that really helps you that much. If there's e.g. a huge VACUUM / > COPY emitting loads of WAL you'll suddenly see commit latency of a > concurrently committing transactions spike into oblivion. Whereas a general > WAL throttling mechanism would throttle the VACUUM, without impacting the > commit latency of normal transactions. > True, but it solves the RPO goal which is what the other thread was about. IMHO it's useful to look at this as a resource scheduling problem: limited WAL bandwidth consumed by backends, with the bandwidth distributed using some sort of policy. The patch discussed in this thread uses fundamentally unfair policy, with throttling applied only on backends that produce a lot of WAL. And trying to leave the OLTP as unaffected as possible. The RPO thread seems to be aiming for a "fair" policy, providing the same fraction of bandwidth to all processes. This will affect all xacts the same way (sleeps proportional to amount of WAL generated by the xact). 
Perhaps we want such alternative scheduling policies, but it'll probably require something like the autovacuum throttling. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/27/23 22:19, Andres Freund wrote: > Hi, > > On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote: >> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote: >> >>> Huh? Why did you remove the GUC? >> >> After reading previous threads, my optimism level of getting it ever >> in shape of being widely accepted degraded significantly (mainly due >> to the discussion of wider category of 'WAL I/O throttling' especially >> in async case, RPO targets in async case and potentially calculating >> global bandwidth). > > I think it's quite reasonable to limit this to a smaller scope. Particularly > because those other goals are pretty vague but ambitious goals. IMO the > problem with a lot of the threads is precisely that that they aimed at a level > of generallity that isn't achievable in one step. > +1 to that -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jan 28, 2023 at 6:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> > That's not the sole goal, from my end: I'd like to avoid writing out +
> > flushing the WAL in too small chunks. Imagine a few concurrent vacuums or
> > COPYs or such - if we're unlucky they'd each end up exceeding their "private"
> > limit close to each other, leading to a number of small writes of the
> > WAL. Which could end up increasing local commit latency / iops.
> >
> > If we instead decide to only ever flush up to something like
> > last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ
> >
> > we'd make sure that the throttling mechanism won't cause a lot of small
> > writes.
> >
> I'm not saying we shouldn't do this, but I still don't see how this
> could make a measurable difference. At least assuming a sensible value
> of the throttling limit (say, more than 256kB per backend), and OLTP
> workload running concurrently. That means ~64 extra flushes/writes per
> 16MB segment (at most). Yeah, a particular page might get unlucky and be
> flushed by multiple backends, but the average still holds. Meanwhile,
> the OLTP transactions will generate (at least) an order of magnitude
> more flushes.

I think measuring the number of WAL flushes that postgres generates with and without this feature is a great way to know this feature's effect on IOPS. Probably it's even better with variations in synchronous_commit_flush_wal_after or the throttling limits.

On Sat, Jan 28, 2023 at 6:08 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/27/23 22:19, Andres Freund wrote:
> > Hi,
> >
> > On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> >> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
> >>
> >>> Huh? Why did you remove the GUC?
> >>
> >> After reading previous threads, my optimism level of getting it ever
> >> in shape of being widely accepted degraded significantly (mainly due
> >> to the discussion of wider category of 'WAL I/O throttling' especially
> >> in async case, RPO targets in async case and potentially calculating
> >> global bandwidth).
> >
> > I think it's quite reasonable to limit this to a smaller scope. Particularly
> > because those other goals are pretty vague but ambitious goals. IMO the
> > problem with a lot of the threads is precisely that that they aimed at a level
> > of generallity that isn't achievable in one step.
> >
>
> +1 to that

Okay, I agree to limit the scope first to synchronous replication.

Although v2 is a WIP patch, I have some comments:

1. Coding style doesn't conform to the Postgres standard:

+/*
+ * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section
+ */

80-line char limit

+void HandleXLogDelayPending()

A new line missing between function return type and function name

+ elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
+ backendWalInserted,
+ LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
+ LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));

Indentation issue - space needed in the next lines after elog(WARNING,..

+ elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff);

80-line char limit, timediff); be on the new line.

+ //RESUME_INTERRUPTS();
+ //HOLD_INTERRUPTS();

Multi-line comments are used elsewhere in the code. Better to run pgindent on the patch.

2. It'd be better to add a TAP test hitting the WAL throttling.

3. We generally don't need timings to be calculated explicitly when we emit before and after log messages.
If needed one can calculate the wait time from timestamps of the log messages. If it still needs an explicit time difference which I don't think is required, because we have a special event and before/after log messages, I think it needs to be under track_wal_io_timing to keep things simple. 4. XLogInsertRecord() can be a hot path for high-write workload, so effects of adding anything in it needs to be measured. So, it's better to run benchmarks with this feature enabled and disabled. 5. Missing documentation of this feature and the GUC. I think we can describe extensively why we'd need a feature of this kind in the documentation for better adoption or usability. 6. Shouldn't the max limit be MAX_KILOBYTES? + &synchronous_commit_flush_wal_after, + 0, 0, 1024L*1024L, 7. Can this GUC name be something like synchronous_commit_wal_throttle_threshold to better reflect what it does for a backend? + {"synchronous_commit_flush_wal_after", PGC_USERSET, REPLICATION_SENDING, 8. A typo - s/confiration/confirmation + gettext_noop("Sets the maximum logged WAL in kbytes, after which wait for sync commit confiration even without commit "), 9. This "Sets the maximum logged WAL in kbytes, after which wait for sync commit confiration even without commit " better be something like below? "Sets the maximum amount of WAL in kilobytes a backend generates after which it waits for synchronous commit confiration even without commit " 10. I think there's nothing wrong in adding some assertions in HandleXLogDelayPending(): Assert(synchronous_commit_flush_wal_after > 0); Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L); Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr); 11. Specify the reason in the comments as to why we had to do the following things: Here: + /* flush only up to the last fully filled page */ + XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd - (XactLastThrottledRecEnd % XLOG_BLCKSZ); Talk about how this avoids multiple-flushes for the same page. Here: + * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section + */ +void HandleXLogDelayPending() Talk about how waiting in a critical section, that is in XLogInsertRecord() causes locks to be held longer durations and other effects. Here: + /* WAL throttling */ + backendWalInserted += rechdr->xl_tot_len; Be a bit more verbose about why we try to throttle WAL and why only for sync replication, the advantages, effects etc. 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not clutter server logs. + /* XXX Debug for now */ + elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)", + backendWalInserted, + LSN_FORMAT_ARGS(XactLastThrottledRecEnd), + LSN_FORMAT_ARGS(LastFullyWrittenXLogPage)); + elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff); 13. BTW, we don't need to hold interrupts while waiting for sync replication ack as it may block query cancels or proc die pendings. You can remove these, unless there's a strong reason. + //HOLD_INTERRUPTS(); + //RESUME_INTERRUPTS(); 14. Add this wait event in the documentation. + case WAIT_EVENT_SYNC_REP_THROTTLED: + event_name = "SyncRepThrottled"; + break; 15. Why can XLogDelayPending not be a volatile atomic variable? Is it because it's not something being set within a signal handler? extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending; +extern PGDLLIMPORT bool XLogDelayPending; 16. 
+ instr_time wal_throttle_time_start, wal_throttle_time_end;
+ double timediff;
+ XLogDelayPending = false;

An extra line is needed after the variable declarations and assignment.

17. I think adding how many times a backend throttled WAL to pg_stat_activity is a good metric.

18. Can you avoid the need for the new functions SyncRepWaitForLSNInternal and SyncRepWaitForLSNThrottled by relying on the global throttling state to determine the correct wait event in SyncRepWaitForLSN?

19. I think measuring the number of WAL flushes that postgres generates with and without this feature is a great way to know this feature's effect on IOPS. Probably it's even better with variations in synchronous_commit_flush_wal_after or the throttling limits.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
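For orientation, here is a rough sketch of what a HandleXLogDelayPending() along the lines discussed above could look like. This is a hypothetical reconstruction from the descriptions in this thread, not the actual v2 code; backendWalInserted, XactLastThrottledRecEnd, XLogDelayPending and the GUC are the patch's names, everything else is existing PostgreSQL API.

    void
    HandleXLogDelayPending(void)
    {
        XLogRecPtr  flush_to;

        Assert(synchronous_commit_wal_throttle_threshold > 0);
        Assert(!XLogRecPtrIsInvalid(XactLastThrottledRecEnd));

        /*
         * Flush only up to the last fully written page, so we don't write
         * the same partially filled page over and over.
         */
        flush_to = XactLastThrottledRecEnd -
            (XactLastThrottledRecEnd % XLOG_BLCKSZ);

        XLogDelayPending = false;

        /* Flush locally first, then wait for the sync standby as a commit would. */
        XLogFlush(flush_to);
        SyncRepWaitForLSN(flush_to, false); /* false = not a commit record */

        /* Start accumulating towards the next throttling point from zero. */
        backendWalInserted = 0;
    }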
On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hi Bharath, thanks for reviewing.

> I think measuring the number of WAL flushes with and without this
> feature that the postgres generates is great to know this feature
> effects on IOPS. Probably it's even better with variations in
> synchronous_commit_flush_wal_after or the throttling limits.

It's the same as point 19, so I covered it there.

> Although v2 is a WIP patch, I have some comments:
> 1. Coding style doesn't confirm to the Postgres standard:

Fixed all of those that you mentioned, and I've removed the elog() and the timing code. BTW: is there a way to pgindent only my changes (on git diff?)

> 2. It'd be better to add a TAP test hitting the WAL throttling.

TODO, any hints on what that test should look like are welcome (checking pg_stat_wal?). I could spot only one place for it -- src/test/recovery/007_sync_rep.pl

> 3. We generally don't need timings to be calculated explicitly when we
> emit before and after log messages. If needed one can calculate the
> wait time from timestamps of the log messages. If it still needs an
> explicit time difference which I don't think is required, because we
> have a special event and before/after log messages, I think it needs
> to be under track_wal_io_timing to keep things simple.

Removed, it was just a debugging aid to demonstrate the effect in the waiting session.

> 4. XLogInsertRecord() can be a hot path for high-write workload, so
> effects of adding anything in it needs to be measured. So, it's better
> to run benchmarks with this feature enabled and disabled.

When the GUC is off (0 / default), in my tests the impact is none (it's just a set of simple IFs). However, if the feature is enabled then the INSERT is slowed down: I've repeated the initial test from the first post, and the single-statement INSERT producing 50MB of WAL still takes 4-5s with the feature disabled, and ~23s +/- 1s when the feature is activated with an RTT of 10.1ms, but that's intentional.

> 5. Missing documentation of this feature and the GUC. I think we can
> describe extensively why we'd need a feature of this kind in the
> documentation for better adoption or usability.

TODO, I'm planning on adding documentation when we'll be a little bit closer to adding this to the CF.

> 6. Shouldn't the max limit be MAX_KILOBYTES?
> + &synchronous_commit_flush_wal_after,
> + 0, 0, 1024L*1024L,

Fixed.

> 7. Can this GUC name be something like
> synchronous_commit_wal_throttle_threshold to better reflect what it
> does for a backend?
> + {"synchronous_commit_flush_wal_after", PGC_USERSET,
> REPLICATION_SENDING,

Done.

> 8. A typo - s/confiration/confirmation
[..]
> 9. This
> "Sets the maximum logged WAL in kbytes, after which wait for sync
> commit confiration even without commit "
> better be something like below?
> "Sets the maximum amount of WAL in kilobytes a backend generates after
> which it waits for synchronous commit confiration even without commit
> "

Fixed as you have suggested.

> 10. I think there's nothing wrong in adding some assertions in
> HandleXLogDelayPending():
> Assert(synchronous_commit_flush_wal_after > 0);
> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);

Sure, added.

> 11.
Specify the reason in the comments as to why we had to do the > following things: > Here: > + /* flush only up to the last fully filled page */ > + XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd > - (XactLastThrottledRecEnd % XLOG_BLCKSZ); > Talk about how this avoids multiple-flushes for the same page. > > Here: > + * Called from ProcessMessageInterrupts() to avoid waiting while > being in critical section > + */ > +void HandleXLogDelayPending() > Talk about how waiting in a critical section, that is in > XLogInsertRecord() causes locks to be held longer durations and other > effects. Added. > Here: > + /* WAL throttling */ > + backendWalInserted += rechdr->xl_tot_len; > Be a bit more verbose about why we try to throttle WAL and why only > for sync replication, the advantages, effects etc. Added. > 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not > clutter server logs. > + /* XXX Debug for now */ > + elog(WARNING, "throttling WAL down on this session > (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)", > + backendWalInserted, > + LSN_FORMAT_ARGS(XactLastThrottledRecEnd), > + LSN_FORMAT_ARGS(LastFullyWrittenXLogPage)); > > + elog(WARNING, "throttling WAL down on this session - end (%f > ms)", timediff); OK, that's entirely removed. > 13. BTW, we don't need to hold interrupts while waiting for sync > replication ack as it may block query cancels or proc die pendings. > You can remove these, unless there's a strong reason. > + //HOLD_INTERRUPTS(); > + //RESUME_INTERRUPTS(); Sure, removed. However, one problem I've discovered is that we were hitting Assert(InterruptHoldoffCount > 0) in SyncRepWaitForLSN, so I've fixed that too. > 14. Add this wait event in the documentation. > + case WAIT_EVENT_SYNC_REP_THROTTLED: > + event_name = "SyncRepThrottled"; > + break; > > 15. Why can XLogDelayPending not be a volatile atomic variable? Is it > because it's not something being set within a signal handler? > extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending; > +extern PGDLLIMPORT bool XLogDelayPending; Added a comment explaining that. > 16. > + instr_time wal_throttle_time_start, wal_throttle_time_end; > + double timediff; > + XLogDelayPending = false; > An extra line needed after variable declaration and assignment. Fixed. > 17. I think adding how many times a backend throttled WAL to > pg_stat_activity is a good metric. Good idea, added; catversion and pgstat format id were bumped. I've also added it to the per-query EXPLAIN (WAL) so it logs something like "WAL: records=500000 bytes=529500000 throttled=2016" , however I would appreciate a better name proposal on how to name that. > 18. Can you avoid the need of new functions SyncRepWaitForLSNInternal > and SyncRepWaitForLSNThrottled by relying on the global throttling > state to determine the correct waitevent in SyncRepWaitForLSN? Done. > 19. I think measuring the number of WAL flushes with and without this > feature that the postgres generates is great to know this feature > effects on IOPS. Probably it's even better with variations in > synchronous_commit_flush_wal_after or the throttling limits. 
default =>
INSERT 0 500000
Time: 6996.340 ms (00:06.996)
-[ RECORD 1 ]----+------------------------------
wal_records      | 500001
wal_fpi          | 0
wal_bytes        | 529500034
wal_buffers_full | 44036
wal_write        | 44056
wal_sync         | 40
wal_write_time   | 317.991
wal_sync_time    | 2690.425
wal_throttled    | 0
stats_reset      | 2023-02-01 10:31:44.475651+01
and 1 call to probe_postgres:XLogFlush

set synchronous_commit_wal_throttle_threshold to '256kB'; =>
INSERT 0 500000
Time: 25476.155 ms (00:25.476)
-[ RECORD 1 ]----+------------------------------
wal_records      | 500001
wal_fpi          | 0
wal_bytes        | 529500034
wal_buffers_full | 0
wal_write        | 2062
wal_sync         | 2062
wal_write_time   | 180.177
wal_sync_time    | 1409.522
wal_throttled    | 2016
stats_reset      | 2023-02-01 10:32:01.955513+01
and 2017 calls to probe_postgres:XLogFlush

set synchronous_commit_wal_throttle_threshold to '1MB'; =>
INSERT 0 500000
Time: 10113.278 ms (00:10.113)
-[ RECORD 1 ]----+------------------------------
wal_records      | 642864
wal_fpi          | 0
wal_bytes        | 537929315
wal_buffers_full | 0
wal_write        | 560
wal_sync         | 559
wal_write_time   | 182.586
wal_sync_time    | 988.72
wal_throttled    | 504
stats_reset      | 2023-02-01 10:32:36.250678+01

Maybe we should avoid calling fsyncs for WAL throttling? (by teaching HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT sync when we are flushing just because of WAL throttling?) Would that still be safe?

-J.
On 2/1/23 11:04, Jakub Wartak wrote: > On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi Bharath, thanks for reviewing. > >> I think measuring the number of WAL flushes with and without this >> feature that the postgres generates is great to know this feature >> effects on IOPS. Probably it's even better with variations in >> synchronous_commit_flush_wal_after or the throttling limits. > > It's the same as point 19, so I covered it there. > >> Although v2 is a WIP patch, I have some comments: >> 1. Coding style doesn't confirm to the Postgres standard: > > Fixed all thoses that you mentioned and I've removed elog() and code > for timing. BTW: is there a way to pgindent only my changes (on git > diff?) > >> 2. It'd be better to add a TAP test hitting the WAL throttling. > > TODO, any hints on how that test should look like are welcome > (checking pg_stat_wal?) I've could spot only 1 place for it -- > src/test/recovery/007_sync_rep.pl > >> 3. We generally don't need timings to be calculated explicitly when we >> emit before and after log messages. If needed one can calculate the >> wait time from timestamps of the log messages. If it still needs an >> explicit time difference which I don't think is required, because we >> have a special event and before/after log messages, I think it needs >> to be under track_wal_io_timing to keep things simple. > > Removed, it was just debugging aid to demonstrate the effect in the > session waiting. > >> 4. XLogInsertRecord() can be a hot path for high-write workload, so >> effects of adding anything in it needs to be measured. So, it's better >> to run benchmarks with this feature enabled and disabled. > > When the GUC is off(0 / default), in my tests the impact is none (it's > just set of simple IFs), however if the feature is enabled then the > INSERT is slowed down (I've repeated the initial test from 1st post > and single-statement INSERT for 50MB WAL result is the same 4-5s and > ~23s +/- 1s when feature is activated when the RTT is 10.1ms, but > that's intentional). > >> 5. Missing documentation of this feature and the GUC. I think we can >> describe extensively why we'd need a feature of this kind in the >> documentation for better adoption or usability. > > TODO, I'm planning on adding documentation when we'll be a little bit > closer to adding to CF. > >> 6. Shouldn't the max limit be MAX_KILOBYTES? >> + &synchronous_commit_flush_wal_after, >> + 0, 0, 1024L*1024L, > > Fixed. > >> 7. Can this GUC name be something like >> synchronous_commit_wal_throttle_threshold to better reflect what it >> does for a backend? >> + {"synchronous_commit_flush_wal_after", PGC_USERSET, >> REPLICATION_SENDING, > > Done. > >> 8. A typo - s/confiration/confirmation > [..] >> 9. This >> "Sets the maximum logged WAL in kbytes, after which wait for sync >> commit confiration even without commit " >> better be something like below? >> "Sets the maximum amount of WAL in kilobytes a backend generates after >> which it waits for synchronous commit confiration even without commit >> " > > Fixed as you have suggested. > >> 10. I think there's nothing wrong in adding some assertions in >> HandleXLogDelayPending(): >> Assert(synchronous_commit_flush_wal_after > 0); >> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L); >> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr); > > Sure, added. > >> 11. 
Specify the reason in the comments as to why we had to do the >> following things: >> Here: >> + /* flush only up to the last fully filled page */ >> + XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd >> - (XactLastThrottledRecEnd % XLOG_BLCKSZ); >> Talk about how this avoids multiple-flushes for the same page. >> >> Here: >> + * Called from ProcessMessageInterrupts() to avoid waiting while >> being in critical section >> + */ >> +void HandleXLogDelayPending() >> Talk about how waiting in a critical section, that is in >> XLogInsertRecord() causes locks to be held longer durations and other >> effects. > > Added. > >> Here: >> + /* WAL throttling */ >> + backendWalInserted += rechdr->xl_tot_len; >> Be a bit more verbose about why we try to throttle WAL and why only >> for sync replication, the advantages, effects etc. > > Added. > >> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not >> clutter server logs. >> + /* XXX Debug for now */ >> + elog(WARNING, "throttling WAL down on this session >> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)", >> + backendWalInserted, >> + LSN_FORMAT_ARGS(XactLastThrottledRecEnd), >> + LSN_FORMAT_ARGS(LastFullyWrittenXLogPage)); >> >> + elog(WARNING, "throttling WAL down on this session - end (%f >> ms)", timediff); > > OK, that's entirely removed. > >> 13. BTW, we don't need to hold interrupts while waiting for sync >> replication ack as it may block query cancels or proc die pendings. >> You can remove these, unless there's a strong reason. >> + //HOLD_INTERRUPTS(); >> + //RESUME_INTERRUPTS(); > > Sure, removed. However, one problem I've discovered is that we were > hitting Assert(InterruptHoldoffCount > 0) in SyncRepWaitForLSN, so > I've fixed that too. > >> 14. Add this wait event in the documentation. >> + case WAIT_EVENT_SYNC_REP_THROTTLED: >> + event_name = "SyncRepThrottled"; >> + break; >> >> 15. Why can XLogDelayPending not be a volatile atomic variable? Is it >> because it's not something being set within a signal handler? >> extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending; >> +extern PGDLLIMPORT bool XLogDelayPending; > > Added a comment explaining that. > >> 16. >> + instr_time wal_throttle_time_start, wal_throttle_time_end; >> + double timediff; >> + XLogDelayPending = false; >> An extra line needed after variable declaration and assignment. > > Fixed. > >> 17. I think adding how many times a backend throttled WAL to >> pg_stat_activity is a good metric. > > Good idea, added; catversion and pgstat format id were bumped. I've > also added it to the per-query EXPLAIN (WAL) so it logs something like > "WAL: records=500000 bytes=529500000 throttled=2016" , however I would > appreciate a better name proposal on how to name that. > >> 18. Can you avoid the need of new functions SyncRepWaitForLSNInternal >> and SyncRepWaitForLSNThrottled by relying on the global throttling >> state to determine the correct waitevent in SyncRepWaitForLSN? > > Done. > >> 19. I think measuring the number of WAL flushes with and without this >> feature that the postgres generates is great to know this feature >> effects on IOPS. Probably it's even better with variations in >> synchronous_commit_flush_wal_after or the throttling limits. 
> > default => > INSERT 0 500000 > Time: 6996.340 ms (00:06.996) > -[ RECORD 1 ]----+------------------------------ > wal_records | 500001 > wal_fpi | 0 > wal_bytes | 529500034 > wal_buffers_full | 44036 > wal_write | 44056 > wal_sync | 40 > wal_write_time | 317.991 > wal_sync_time | 2690.425 > wal_throttled | 0 > stats_reset | 2023-02-01 10:31:44.475651+01 > and 1 call to probe_postgres:XLogFlush > > set synchronous_commit_wal_throttle_threshold to '256kB'; => > INSERT 0 500000 > Time: 25476.155 ms (00:25.476) > -[ RECORD 1 ]----+------------------------------ > wal_records | 500001 > wal_fpi | 0 > wal_bytes | 529500034 > wal_buffers_full | 0 > wal_write | 2062 > wal_sync | 2062 > wal_write_time | 180.177 > wal_sync_time | 1409.522 > wal_throttled | 2016 > stats_reset | 2023-02-01 10:32:01.955513+01 > and 2017 calls to probe_postgres:XLogFlush > > set synchronous_commit_wal_throttle_threshold to '1MB'; => > INSERT 0 500000 > Time: 10113.278 ms (00:10.113) > -[ RECORD 1 ]----+------------------------------ > wal_records | 642864 > wal_fpi | 0 > wal_bytes | 537929315 > wal_buffers_full | 0 > wal_write | 560 > wal_sync | 559 > wal_write_time | 182.586 > wal_sync_time | 988.72 > wal_throttled | 504 > stats_reset | 2023-02-01 10:32:36.250678+01 > I'm not quite sure how to interpret these numbers and what conclusions to draw. ~520MB of WAL is about 2000 x 256kB chunks, so it's not surprising we do ~2017 XLogFlush calls. But does that mean we did more I/O operations (like fsyncs)? Not necessarily, because something else might have done a flush too. For example, if there are concurrent sessions doing commit, that'll trigger fsync. And if we back off a bit (e.g. to the LSN of the last complete page), we'll simply piggyback on that and won't do any extra fsync or other IO. Of course, if there is no other concurrent activity triggering flushes, this will do extra (network) I/O and fsync. But that's expected and flushing data more often naturally reduces throughput. The impact is mostly proportional to network latency. The way I see this is "latency guarantee" - not allowing more unflushed (not confirmed by sync replica) WAL than can be sent/confirmed within some time limit. Of course, we don't have a way to specify this in time directly, so we have to specify amount of WAL instead. And stricter guarantees generally lead to lower throughput. I don't think there's a way around this, except for disabling the throttling by default - if someone enables that, he/she intentionally did that because latency is more important than throughput. > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching > HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when > we are flushing just because of WAL thortting ?) Would that still be > safe? It's not clear to me how could this work and still be safe. I mean, we *must* flush the local WAL first, otherwise the replica could get ahead (if we send unflushed WAL to replica and then crash). Which would be really bad, obviously. And we *have* to send the data to the sync replica, because that's the whole point of this thread. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
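Reading the threshold as a latency budget also suggests a back-of-the-envelope way to choose a value: the threshold is roughly the amount of WAL the link can transfer within the extra commit latency one is willing to tolerate. A purely illustrative helper follows; the name, signature and the decimal-unit arithmetic are assumptions, and nothing like this exists in the patch.

    /*
     * Rough sizing aid: how many kB of unreplicated WAL fit into a given
     * latency budget, for a link with the given bandwidth and round trip.
     * Uses decimal units, so treat the result as an estimate only.
     */
    static int
    throttle_threshold_kb(double budget_ms, double rtt_ms, double mb_per_s)
    {
        double  transfer_ms = budget_ms - rtt_ms;   /* time left to move data */

        if (transfer_ms <= 0)
            return 0;   /* budget below the RTT: throttling alone can't reach it */

        return (int) (mb_per_s * transfer_ms);      /* MB/s * ms == kB */
    }

For example, with a 100 MB/s link, a 10ms round trip and a 20ms latency budget this would suggest keeping no more than roughly 1MB of WAL unconfirmed per backend.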
On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
> > HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
> > we are flushing just because of WAL thortting ?) Would that still be
> > safe?
>
> It's not clear to me how could this work and still be safe. I mean, we
> *must* flush the local WAL first, otherwise the replica could get ahead
> (if we send unflushed WAL to replica and then crash). Which would be
> really bad, obviously.

Well, it was just a thought: in this particular test - with no other concurrent activity happening - we are fsync()ing uncommitted Heap/INSERT data much earlier than the final Transaction/COMMIT WAL record comes into play. I agree that some other concurrent backend's COMMIT could fsync it, but I was wondering if that's a sensible optimization to perform (so that issue_fsync() would be called only for commit/rollback records). I can imagine a scenario with 10 such concurrent backends running - all of them with this $thread-GUC set - but that would cause 20k unnecessary fsyncs (?) -- assuming a single HDD with IOlat=20ms and a standby capable of sync-ack < 0.1ms, close to 400s would be wasted just on local fsyncs. I don't have a strong opinion or in-depth knowledge on this, but that smells like IO waste.

-J.
On 2/1/23 14:40, Jakub Wartak wrote:
> On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>>> Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
>>> HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
>>> we are flushing just because of WAL thortting ?) Would that still be
>>> safe?
>>
>> It's not clear to me how could this work and still be safe. I mean, we
>> *must* flush the local WAL first, otherwise the replica could get ahead
>> (if we send unflushed WAL to replica and then crash). Which would be
>> really bad, obviously.
>
> Well it was just a thought: in this particular test - with no other
> concurrent activity happening - we are fsyncing() uncommitted
> Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
> record comes into play.

Right. I see it as testing (more or less) a worst-case scenario, measuring the impact on commands generating a lot of WAL. I'm not sure the slowdown comes from the extra fsyncs, though - I'd bet it's more about the extra waits for confirmations from the replica.

> I agree that some other concurrent backend's
> COMMIT could fsync it, but I was wondering if that's sensible
> optimization to perform (so that issue_fsync() would be called for
> only commit/rollback records). I can imagine a scenario with 10 such
> concurrent backends running - all of them with this $thread-GUC set -
> but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> would be wasted close to 400s just due to local fsyncs?). I don't have
> a strong opinion or in-depth on this, but that smells like IO waste.
>

Not sure what optimization you mean, but triggering the WAL flushes from a separate process would be beneficial. But we already do that, more or less - that's what the WAL writer is about, right? Maybe it's not aggressive enough or something, not sure.

But I think the backends still have to sleep at some point, so that they don't queue too much unflushed WAL - that's kinda the whole point, no?

The issue is more about triggering the throttling too early, before we hit the bandwidth limit. Which happens simply because we don't have a very good way to decide whether the latency is growing, so the patch just throttles everything.

Consider a replica on a network link with a 10ms round trip. Then commit latency can't really be better than 10ms, and throttling at that point can't really improve anything, it just makes it slower. Instead, we should measure the latency somehow, and only throttle when it increases. And probably make it proportional to the delta (so the further it is from the "minimal" latency, the more we'd throttle).

I'd imagine we'd measure the latency (or the wait for the sync replica) over reasonably short time windows (1/10 of a second?), and use that to drive the throttling. If the latency is below some acceptable value, don't throttle at all. If it increases, start throttling.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
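A minimal sketch of the latency-driven throttling Tomas outlines above, assuming we track a baseline (minimal) and a recent syncrep wait time per window somewhere; all names here are hypothetical and none of this exists in the posted patches.

    /*
     * Hypothetical: decide how hard to throttle based on how far the recent
     * syncrep wait time has drifted above the observed baseline.
     */
    static double
    throttle_factor(double recent_wait_ms, double baseline_wait_ms)
    {
        double  delta = recent_wait_ms - baseline_wait_ms;

        /* Latency is (nearly) at the baseline: the link keeps up, don't throttle. */
        if (delta <= 0.1 * baseline_wait_ms)
            return 0.0;

        /* Throttle proportionally to how much the latency has grown. */
        return delta / baseline_wait_ms;
    }

A caller could then, say once per ~100ms window, delay WAL-heavy backends by an amount proportional to the returned factor, instead of unconditionally flushing and waiting after every N kilobytes of WAL.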
On Thu, Feb 2, 2023 at 11:03 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> > I agree that some other concurrent backend's
> > COMMIT could fsync it, but I was wondering if that's sensible
> > optimization to perform (so that issue_fsync() would be called for
> > only commit/rollback records). I can imagine a scenario with 10 such
> > concurrent backends running - all of them with this $thread-GUC set -
> > but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> > HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> > would be wasted close to 400s just due to local fsyncs?). I don't have
> > a strong opinion or in-depth on this, but that smells like IO waste.
> >
>
> Not sure what optimization you mean,

Let me clarify, let's say something like below (on top of the v3), just to save IOPS:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2340,6 +2340,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
         if (sync_method != SYNC_METHOD_OPEN &&
             sync_method != SYNC_METHOD_OPEN_DSYNC)
         {
+            bool openedLogFile = false;
             if (openLogFile >= 0 &&
                 !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
                                  wal_segment_size))
@@ -2351,9 +2352,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
                 openLogTLI = tli;
                 openLogFile = XLogFileOpen(openLogSegNo, tli);
                 ReserveExternalFD();
+                openedLogFile = true;
             }
-            issue_xlog_fsync(openLogFile, openLogSegNo, tli);
+            /* can we bypass fsyncing() XLOG from the backend if
+             * we have been called without commit request?
+             * usually the feature will be off here (XLogDelayPending=false)
+             */
+            if(openedLogFile == true || XLogDelayPending == false)
+                issue_xlog_fsync(openLogFile, openLogSegNo, tli);
         }

plus maybe some additional logic to ensure that this micro-optimization for saving IOPS would not be enabled if the backend is calling XLogFlush()/XLogWrite() for an actual COMMIT record.

> But I think the backends still have to sleep at some point, so that they
> don't queue too much unflushed WAL - that's kinda the whole point, no?

Yes, but it could be sent to the standby and written locally without being fsynced locally (?) - provided that it was not a COMMIT - I'm just wondering whether that makes sense (Question 1).

> The issue is more about triggering the throttling too early, before we
> hit the bandwidth limit. Which happens simply because we don't have a
> very good way to decide whether the latency is growing, so the patch
> just throttles everything.

The maximum TCP bandwidth seems to fluctuate in the real world, I suppose, so it couldn't be a hard limit. On the other hand I can imagine operators setting "throttle-those-backends-if-global-WALlatencyORrate>XXX" (an administrative decision). That would be cool to have, but yes, it would require WAL latency and rate measurement first (which on its own would make a very nice addition to pg_stat_replication). One thing to note is that there could be many potential latencies (and WAL throughput rates) to consider (e.g. a quorum of 3 sync standbys having different latencies) - which one to choose? (Question 2)

I think we have simply reached a decision point: is the WIP/PoC good enough as it is (like Andres wanted and you +1'd), should it work as you propose, or should we maybe keep it as an idea for the future?

-J.
Hi, I keep getting occasional complaints about the impact of large/bulk transactions on latency of small OLTP transactions, so I'd like to revive this thread a bit and move it forward. Attached is a rebased v3, followed by 0002 patch with some review comments, missing comments and minor tweaks. More about that later ... It's been a couple months, and there's been a fair amount of discussion and changes earlier, so I guess it makes sense to post a summary, stating the purpose (and scope), and then go through the various open questions etc. goals ----- The goal is to limit the impact of large transactions (producing a lot of WAL) on small OLTP transactions, in a cluster with a sync replica. Imagine a backend executing single-row inserts, or something like that. The commit will wait for the replica to confirm the WAL, which may be expensive, but it's well determined by the network roundtrip. But then a large transaction comes, and inserts a lot of WAL (imagine a COPY which inserts 100MB of data, VACUUM, CREATE INDEX and so on). A small transaction may insert a COMMIT record right after this WAL chunk, and locally that's (mostly) fine. But with the sync replica it's much worse - we don't send WAL until it's flushed locally, and then we need to wait for the WAL to be sent, applied and confirmed by the replica. This takes time (depending on the bandwidth), and it may not happen until the small transaction does COMMIT (because we may not flush WAL from in-progress transaction very often). Jakub Wartak presented some examples of the impact when he started this thread, and it can be rather bad. Particularly for latency-sensitive applications. I plan to do more experiments with the current patch, but I don't have the results yet. scope ----- Now, let's talk about scope - what the patch does not aim to do. The patch is explicitly intended for syncrep clusters, not async. There have been proposals to also support throttling for async replicas, logical replication etc. I suppose all of that could be implemented, and I do see the benefit of defining some sort of maximum lag even for async replicas. But the agreement was to focus on the syncrep case, where it's particularly painful, and perhaps extend it in the future. I believe adding throttling for physical async replication should not be difficult - in principle we need to determine how far the replica got, and compare it to the local LSN. But there's likely complexity with defining which async replicas to look at, inventing a sensible way to configure this, etc. It'd be helpful if people interested in that feature took a look at this patch and tried extending etc. It's not clear to me what to do about disconnected replicas, though. We may not even know about them, if there's no slot (and how would we know what the slot is for). So maybe this would need a new GUC listing the interesting replicas, and all would need to be connected. But that's an availability issue, because then all replicas need to be connected. I'm not sure about logical replication, but I guess we could treat it similarly to async. But what I think would need to be different is handling of small transactions. For syncrep we automatically wait for those at commit, which means automatic throttling. But for async (and logical), it's trivial to cause ever-increasing lag with only tiny transactions, thanks to the single-process replay, so maybe we'd need to throttle those too. (The recovery prefetching improved this for async quite a bit, ofc.) 
implementation -------------- The implementation is fairly straightforward, and happens in two places. XLogInsertRecord() decides if a throttling might be needed for this backend, and then HandleXLogDelayPending() does the wait. XLogInsertRecord() checks if the backend produced certain amount of WAL (might be 1MB, for example). We do this because we don't want to do the expensive stuff in HandleXLogDelayPending() too often (e.g. after every XLOG record). HandleXLogDelayPending() picks a suitable LSN, flushes it and then also waits for the sync replica, as if it was a commit. This limits the lag, i.e. the amount of WAL that the small transaction will need to wait for to be replicated and confirmed by the replica. There was a fair amount of discussion about how to pick the LSN. I think the agreement is we certainly can't pick the current LSN (because that would lead to write amplification for the partially filled page), and we probably even want to backoff a bit more, to make it more likely the LSN is already flushed. So for example with the threshold set to 8MB we might go back 1MB, or something like that. That'd still limit the lag. problems -------- Now let's talk about some problems - both conceptual and technical (essentially review comments for the patch). 1) The goal of the patch is to limit the impact on latency, but the relationship between WAL amounts and latency may not be linear. But we don't have a good way to predict latency, and WAL lag is the only thing we have, so there's that. Ultimately, it's a best effort. 2) The throttling is per backend. That makes it simple, but it means that it's hard to enforce a global lag limit. Imagine the limit is 8MB, and with a single backend that works fine - the lag should not exceed the 8MB value. But if there are N backends, the lag could be up to N-times 8MB, I believe. That's a bit annoying, but I guess the only solution would be to have some autovacuum-like cost balancing, with all backends (or at least those running large stuff) doing the checks more often. I'm not sure we want to do that. 3) The actual throttling (flush and wait for syncrep) happens in ProcessInterrupts(), which mostly works but it has two drawbacks: * It may not happen "early enough" if the backends inserts a lot of XLOG records without processing interrupts in between. * It may happen "too early" if the backend inserts enough WAL to need throttling (i.e. sets XLogDelayPending), but then after processing interrupts it would be busy with other stuff, not inserting more WAL. I think ideally we'd do the throttling right before inserting the next XLOG record, but there's no convenient place, I think. We'd need to annotate a lot of places, etc. So maybe ProcessInterrupts() is a reasonable approximation. We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but that seems reasonable. 4) I'm not sure I understand why we need XactLastThrottledRecEnd. Why can't we just use XLogRecEnd? 5) I think the way XLogFlush() resets backendWalInserted is a bit wrong. Imagine a backend generates a fair amount of WAL, and then calls XLogFlush(lsn). Why is it OK to set backendWalInserted=0 when we don't know if the generated WAL was before the "lsn"? I suppose we don't use very old lsn values for flushing, but I don't know if this drift could accumulate over time, or cause some other issues. 6) Why XLogInsertRecord() did skip SYNCHRONOUS_COMMIT_REMOTE_FLUSH? 
7) I find the "synchronous_commit_wal_throttle_threshold" name annoyingly long, so I renamed it to just "wal_throttle_threshold". I also wonder whether it should instead be called "wal_throttle_after" and be in GUC_UNIT_BLOCKS, just like the other _after options? But those changes are more a matter of taste, feel free to ignore this.

missing pieces
--------------

The thing that's missing is that some processes (like aggressive anti-wraparound autovacuum) should not be throttled. If people set the GUC in postgresql.conf, I guess that'll affect those processes too, so I guess we should explicitly reset the GUC for those processes. I wonder if there are other cases that should not be throttled.

tangents
--------

While discussing this with Andres a while ago, he mentioned a somewhat orthogonal idea - sending unflushed data to the replica.

We currently never send unflushed data to the replica, which makes sense because this data is not durable and if the primary crashes/restarts, this data will disappear. But it also means there may be a fairly large chunk of WAL data that we may need to send at COMMIT and wait for the confirmation.

He suggested we might actually send the data to the replica, but the replica would know this data is not flushed yet and so would not do the recovery etc. And at commit we could just send a request to flush, without having to transfer the data at that moment.

I don't have a very good intuition about how large the effect would be, i.e. how much unflushed WAL data could accumulate on the primary (kilobytes/megabytes?), and how big the difference is between sending a couple kilobytes or just a request to flush.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
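To make the two-step structure described in the implementation section above easier to follow, here is a rough sketch of the XLogInsertRecord() side. It is reconstructed from the descriptions in this thread rather than copied from the patch; backendWalInserted, XactLastThrottledRecEnd, XLogDelayPending and wal_throttle_threshold are the patch's names, and the exact conditions are assumptions.

    /* In XLogInsertRecord(), after the record has been inserted ... */
    backendWalInserted += rechdr->xl_tot_len;

    if (wal_throttle_threshold > 0 &&
        SyncRepRequested() &&
        backendWalInserted >= wal_throttle_threshold * 1024L)
    {
        /*
         * Remember up to where to flush/wait, and request the actual wait
         * at the next ProcessInterrupts(), i.e. outside the critical section.
         */
        XactLastThrottledRecEnd = EndPos;
        XLogDelayPending = true;
        InterruptPending = true;
    }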
Hi, On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote: > scope > ----- > Now, let's talk about scope - what the patch does not aim to do. The > patch is explicitly intended for syncrep clusters, not async. There have > been proposals to also support throttling for async replicas, logical > replication etc. I suppose all of that could be implemented, and I do > see the benefit of defining some sort of maximum lag even for async > replicas. But the agreement was to focus on the syncrep case, where it's > particularly painful, and perhaps extend it in the future. Perhaps we should take care to make the configuration extensible in that direction in the future? Hm - is this feature really tied to replication, at all? Pretty much the same situation exists without. On an ok-ish local nvme I ran pgbench with 1 client and -P1. Guess where I started a VACUUM (on a fully cached table, so no continuous WAL flushes): progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed At 32 clients we go from ~10k to 2.5k, with a full 2s of 0. Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the "good times" show a delay of ~8kB (note that this includes WAL records that are still being inserted). Once the VACUUM runs, it's ~2-3MB. The picture with more clients is similar. If I instead severely limit the amount of outstanding (but not the amount of unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32). Of course that's ridiculous and will completely trash performance in many other cases, but it shows that limiting the amount of outstanding WAL could help without replication as well. With remote storage, that'd likely be a bigger difference. > problems > -------- > Now let's talk about some problems - both conceptual and technical > (essentially review comments for the patch). > > 1) The goal of the patch is to limit the impact on latency, but the > relationship between WAL amounts and latency may not be linear. But we > don't have a good way to predict latency, and WAL lag is the only thing > we have, so there's that. Ultimately, it's a best effort. It's indeed probably not linear. Realistically, to do better, we probably need statistics for the specific system in question - the latency impact will differ hugely between different storage/network. > 2) The throttling is per backend. That makes it simple, but it means > that it's hard to enforce a global lag limit. Imagine the limit is 8MB, > and with a single backend that works fine - the lag should not exceed > the 8MB value. But if there are N backends, the lag could be up to > N-times 8MB, I believe. That's a bit annoying, but I guess the only > solution would be to have some autovacuum-like cost balancing, with all > backends (or at least those running large stuff) doing the checks more > often. I'm not sure we want to do that. Hm. 
The average case is likely fine - the throttling of the different backends will intersperse and flush more frequently - but the worst case is presumably part of the issue here. I wonder if we could deal with this by somehow offsetting the points at which backends flush at somehow. I doubt we want to go for something autovacuum balancing like - that doesn't seem to work well - but I think we could take the amount of actually unflushed WAL into account when deciding whether to throttle. We have the necessary state in local memory IIRC. We'd have to be careful to not throttle every backend at the same time, or we'll introduce latency penalties that way. But what if we scaled synchronous_commit_wal_throttle_threshold depending on the amount of unflushed WAL? By still taking backendWalInserted into account, we'd avoid throttling everyone at the same time, but still would make throttling more aggressive depending on the amount of unflushed/unreplicated WAL. > 3) The actual throttling (flush and wait for syncrep) happens in > ProcessInterrupts(), which mostly works but it has two drawbacks: > > * It may not happen "early enough" if the backends inserts a lot of > XLOG records without processing interrupts in between. Does such code exist? And if so, is there a reason not to fix said code? > * It may happen "too early" if the backend inserts enough WAL to need > throttling (i.e. sets XLogDelayPending), but then after processing > interrupts it would be busy with other stuff, not inserting more WAL. > I think ideally we'd do the throttling right before inserting the next > XLOG record, but there's no convenient place, I think. We'd need to > annotate a lot of places, etc. So maybe ProcessInterrupts() is a > reasonable approximation. Yea, I think there's no way to do that with reasonable effort. Starting to wait with a bunch of lwlocks held would obviously be bad. > We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but > that seems reasonable. And independently beneficial. > missing pieces > -------------- > The thing that's missing is that some processes (like aggressive > anti-wraparound autovacuum) should not be throttled. If people set the > GUC in the postgresql.conf, I guess that'll affect those processes too, > so I guess we should explicitly reset the GUC for those processes. I > wonder if there are other cases that should not be throttled. Hm, that's a bit hairy. If we just exempt it we'll actually slow down everyone else even further, even though the goal of the feature might be the opposite. I don't think that's warranted for anti-wraparound vacuums - they're normal. I think failsafe vacuums are a different story - there we really just don't care about impacting other backends, the goal is to prevent moving the cluster to read only pretty soon. > tangents > -------- > While discussing this with Andres a while ago, he mentioned a somewhat > orthogonal idea - sending unflushed data to the replica. > > We currently never send unflushed data to the replica, which makes sense > because this data is not durable and if the primary crashes/restarts, > this data will disappear. But it also means there may be a fairly large > chunk of WAL data that we may need to send at COMMIT and wait for the > confirmation. > > He suggested we might actually send the data to the replica, but the > replica would know this data is not flushed yet and so would not do the > recovery etc. And at commit we could just send a request to flush, > without having to transfer the data at that moment. 
>
> I don't have a very good intuition about how large the effect would be,
> i.e. how much unflushed WAL data could accumulate on the primary
> (kilobytes/megabytes?),

Obviously that heavily depends on the workload. If you have anything with bulk writes it can be many megabytes.

> and how big is the difference between sending a couple kilobytes or just a
> request to flush.

Obviously that heavily depends on the network...

I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit network (albeit with a crappy external card for my laptop), to put some numbers to this. I used -r $s,100 to test sending variable sized data to the other side, with the other side always responding with 100 bytes (assuming that'd more than fit a feedback response).

Command:
fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields; for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done

10gbe:
request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           43          64.30        390         96          15526.084
100          100           57          75.12        428         122         13286.602
1000         100           47          74.41        270         108         13412.125
10000        100           89          114.63       712         152         8700.643
100000       100           167         255.90       584         312         3903.516
1000000      100           891         1015.99      2470        1143        983.708

Same hosts, but with my workstation forced to use a 1gbit connection:
request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           78          131.18       2425        257         7613.416
100          100           81          129.25       425         255         7727.473
1000         100           100         162.12       1444        266         6161.388
10000        100           310         686.19       1797        927         1456.204
100000       100           1006        1114.20      1472        1199        896.770
1000000      100           8338        8420.96      8827        8498        118.410

I haven't checked, but I'd assume that 100 bytes back and forth should easily fit a new message to update LSNs and the existing feedback response. Even just the difference between sending 100 bytes and sending 10k (a bit more than a single WAL page) is pretty significant on a 1gbit network.

Of course, the relatively low latency between these systems makes this more pronounced than if this were a cross-regional or even cross-continental link, where the roundtrip latency is more likely to be dominated by distance rather than throughput. Testing between Europe and the western US:

request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           157934      167627.12    317705      160000      5.652
100          100           161294      171323.59    324017      170000      5.530
1000         100           161392      171521.82    324629      170000      5.524
10000        100           163651      173651.06    328488      170000      5.456
100000       100           166344      198070.20    638205      170000      4.781
1000000      100           225555      361166.12    1302368     240000      2.568

No meaningful difference before getting to 100k. But it's pretty easy to lag by 100k on a longer distance link...

Greetings,

Andres Freund
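One possible shape for the idea of scaling the per-backend threshold by the amount of unflushed WAL, mentioned a bit earlier in this mail: purely illustrative, the scaling rule is an assumption; wal_throttle_threshold, backendWalInserted and XLogDelayPending are the patch's names, while GetXLogInsertRecPtr(), GetFlushRecPtr() and XLOG_BLCKSZ are existing APIs.

    /*
     * Hypothetical: shrink the effective per-backend threshold as the total
     * amount of locally unflushed WAL grows, so throttling gets more
     * aggressive under pressure without kicking in for everyone at once.
     */
    uint64      base = (uint64) wal_throttle_threshold * 1024;
    uint64      unflushed = GetXLogInsertRecPtr() - GetFlushRecPtr(NULL);
    uint64      threshold = base;

    if (unflushed > base)
        threshold = Max(base / (unflushed / base + 1), (uint64) XLOG_BLCKSZ);

    if (backendWalInserted >= threshold)
        XLogDelayPending = true;    /* the wait happens at ProcessInterrupts() */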
On 11/8/23 07:40, Andres Freund wrote: > Hi, > > On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote: >> scope >> ----- >> Now, let's talk about scope - what the patch does not aim to do. The >> patch is explicitly intended for syncrep clusters, not async. There have >> been proposals to also support throttling for async replicas, logical >> replication etc. I suppose all of that could be implemented, and I do >> see the benefit of defining some sort of maximum lag even for async >> replicas. But the agreement was to focus on the syncrep case, where it's >> particularly painful, and perhaps extend it in the future. > > Perhaps we should take care to make the configuration extensible in that > direction in the future? > Yes, if we can come up with a suitable configuration, that would work for the other use cases. I don't have a very good idea what to do about replicas that may not be connected, of have connected but need to catch up. IMHO it would be silly to turn this into "almost a sync rep". > > Hm - is this feature really tied to replication, at all? Pretty much the same > situation exists without. On an ok-ish local nvme I ran pgbench with 1 client > and -P1. Guess where I started a VACUUM (on a fully cached table, so no > continuous WAL flushes): > > progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed > progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed > progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed > progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed > progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed > progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed > progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed > progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed > progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed > progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed > > At 32 clients we go from ~10k to 2.5k, with a full 2s of 0. > > Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the > "good times" show a delay of ~8kB (note that this includes WAL records that > are still being inserted). Once the VACUUM runs, it's ~2-3MB. > > The picture with more clients is similar. > > If I instead severely limit the amount of outstanding (but not the amount of > unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less > (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32). Of course > that's ridiculous and will completely trash performance in many other cases, > but it shows that limiting the amount of outstanding WAL could help without > replication as well. With remote storage, that'd likely be a bigger > difference. > Yeah, that's an interesting idea. I think the idea of enforcing "maximum lag" is somewhat general, the difference is against what LSN the lag is measured. For the syncrep case it was about LSN confirmed by the replica, what you described would measure it for either flush LSN or write LSN (which would be the "outstanding" case I think). I guess the remote storage is somewhat similar to the syncrep case, in that the lag includes some network communication. > >> problems >> -------- >> Now let's talk about some problems - both conceptual and technical >> (essentially review comments for the patch). >> >> 1) The goal of the patch is to limit the impact on latency, but the >> relationship between WAL amounts and latency may not be linear. 
But we >> don't have a good way to predict latency, and WAL lag is the only thing >> we have, so there's that. Ultimately, it's a best effort. > > It's indeed probably not linear. Realistically, to do better, we probably need > statistics for the specific system in question - the latency impact will > differ hugely between different storage/network. > True. I can imagine two ways to measure that. We could have a standalone tool similar to pg_test_fsync that would mimic how we write/flush WAL, and measure the latency for different amounts of flushed data. The DBA would then be responsible for somehow using this to configure the database (perhaps the tool could calculate some "optimal" value to flush). Alternatively, we could collect timing in XLogFlush, so that we'd track amount of data to flush + timing, and then use that to calculate expected latency (e.g. by binning by data size and using average latency for each bin). And then use that, somehow. So you could say - maximum commit latency is 10ms, and the system would be able to estimate that the maximum amount of unflushed WAL is 256kB, and it'd enforce this distance. Still only best effort, no guarantees, of course. > >> 2) The throttling is per backend. That makes it simple, but it means >> that it's hard to enforce a global lag limit. Imagine the limit is 8MB, >> and with a single backend that works fine - the lag should not exceed >> the 8MB value. But if there are N backends, the lag could be up to >> N-times 8MB, I believe. That's a bit annoying, but I guess the only >> solution would be to have some autovacuum-like cost balancing, with all >> backends (or at least those running large stuff) doing the checks more >> often. I'm not sure we want to do that. > > Hm. The average case is likely fine - the throttling of the different backends > will intersperse and flush more frequently - but the worst case is presumably > part of the issue here. I wonder if we could deal with this by somehow > offsetting the points at which backends flush at somehow. > If I understand correctly, you want to ensure the backends start measuring the WAL from different LSNs, in order to "distribute" them uniformly within the WAL (and not "coordinate" them, which leads to higher lag). I guess we could do that, say by flushing only up to hash(pid) % maximum_allowed_lag. I'm not sure that'll really work, especially if the backends are somewhat naturally "correlated". Maybe it could work if the backends explicitly coordinated the flushes to distribute them, but then how is that different from just doing what autovacuum-like costing does in principle. However, perhaps there's an "adaptive" way to do this - each backend knows how much WAL it produced since the last flush LSN, and it can easily measure the actual lag (unflushed WAL). It could compare those, and estimate what fraction of the lag it's likely responsible for. And then adjust the "flush distance" based on that. Imagine we aim for 1MB unflushed WAL, and you have two backends that happen to execute at the same time. They both generate 1MB of WAL and hit the throttling code at about the same time. They discover the actual lag is not the desired 1MB but 2MB, so (requested_lag/actual_lag) = 0.5, and they'd adjust the flush distance to 1/2MB. And from that point we know the lag is 1MB even with two backends. Then one of the backends terminates, and the other backend eventually hits the 1/2MB limit again, but the desired lag is 1MB, and it doubles the distance again.
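A rough sketch of that adjustment, just to illustrate the idea - the names (flush_distance, adjust_flush_distance) are invented here and nothing like this is in the patch; it assumes the usual postgres.h environment (uint64, Max/Min, XLOG_BLCKSZ):

static uint64 flush_distance;	/* per backend, starts at the configured lag */

static void
adjust_flush_distance(uint64 requested_lag, uint64 actual_lag)
{
	double		share;

	if (actual_lag == 0)
		return;

	/*
	 * Scale the per-backend distance by requested/actual lag.  With two
	 * equally busy backends each ends up at roughly half the distance, so
	 * the total lag stays close to the requested value.  Clamp the step so
	 * one sample cannot swing the distance wildly, and keep the result
	 * between one WAL page and the requested lag.
	 */
	share = (double) requested_lag / (double) actual_lag;
	share = Max(0.25, Min(share, 4.0));

	flush_distance = (uint64) (flush_distance * share);
	flush_distance = Max(flush_distance, (uint64) XLOG_BLCKSZ);
	flush_distance = Min(flush_distance, requested_lag);
}

Each backend would call this right before checking whether it exceeded flush_distance, with actual_lag computed from the current insert and flush positions.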
Of course, in practice the behavior would be more complicated, thanks to backends that generate WAL but don't really hit the threshold. There'd probably also be some sort of ramp-up, i.e. the backed would not start with the "full" 1MB limit, but perhaps something lower. Would need to be careful to be high enough to ignore the OLTP transactions, though. > I doubt we want to go for something autovacuum balancing like - that doesn't > seem to work well - but I think we could take the amount of actually unflushed > WAL into account when deciding whether to throttle. We have the necessary > state in local memory IIRC. We'd have to be careful to not throttle every > backend at the same time, or we'll introduce latency penalties that way. But > what if we scaled synchronous_commit_wal_throttle_threshold depending on the > amount of unflushed WAL? By still taking backendWalInserted into account, we'd > avoid throttling everyone at the same time, but still would make throttling > more aggressive depending on the amount of unflushed/unreplicated WAL. > Oh! Perhaps similar to the adaptive behavior I explained above? > >> 3) The actual throttling (flush and wait for syncrep) happens in >> ProcessInterrupts(), which mostly works but it has two drawbacks: >> >> * It may not happen "early enough" if the backends inserts a lot of >> XLOG records without processing interrupts in between. > > Does such code exist? And if so, is there a reason not to fix said code? > Not sure. I thought maybe index builds might do something like that, but it doesn't seem to be the case (at least for the built-in indexes). But if adding CHECK_FOR_INTERRUPTS to more places is an acceptable fix, I'm OK with that. > >> * It may happen "too early" if the backend inserts enough WAL to need >> throttling (i.e. sets XLogDelayPending), but then after processing >> interrupts it would be busy with other stuff, not inserting more WAL. > >> I think ideally we'd do the throttling right before inserting the next >> XLOG record, but there's no convenient place, I think. We'd need to >> annotate a lot of places, etc. So maybe ProcessInterrupts() is a >> reasonable approximation. > > Yea, I think there's no way to do that with reasonable effort. Starting to > wait with a bunch of lwlocks held would obviously be bad. > OK. > >> We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but >> that seems reasonable. > > And independently beneficial. > OK. > >> missing pieces >> -------------- >> The thing that's missing is that some processes (like aggressive >> anti-wraparound autovacuum) should not be throttled. If people set the >> GUC in the postgresql.conf, I guess that'll affect those processes too, >> so I guess we should explicitly reset the GUC for those processes. I >> wonder if there are other cases that should not be throttled. > > Hm, that's a bit hairy. If we just exempt it we'll actually slow down everyone > else even further, even though the goal of the feature might be the opposite. > I don't think that's warranted for anti-wraparound vacuums - they're normal. I > think failsafe vacuums are a different story - there we really just don't care > about impacting other backends, the goal is to prevent moving the cluster to > read only pretty soon. > Right, I confused those two autovacuum modes. > >> tangents >> -------- >> While discussing this with Andres a while ago, he mentioned a somewhat >> orthogonal idea - sending unflushed data to the replica. 
>> >> We currently never send unflushed data to the replica, which makes sense >> because this data is not durable and if the primary crashes/restarts, >> this data will disappear. But it also means there may be a fairly large >> chunk of WAL data that we may need to send at COMMIT and wait for the >> confirmation. >> >> He suggested we might actually send the data to the replica, but the >> replica would know this data is not flushed yet and so would not do the >> recovery etc. And at commit we could just send a request to flush, >> without having to transfer the data at that moment. >> >> I don't have a very good intuition about how large the effect would be, >> i.e. how much unflushed WAL data could accumulate on the primary >> (kilobytes/megabytes?), > > Obviously heavily depends on the workloads. If you have anything with bulk > writes it can be many megabytes. > > >> and how big is the difference between sending a couple kilobytes or just a >> request to flush. > > Obviously heavily depends on the network... > I know it depends on workload/network. I'm merely saying I don't have a very good intuition what value would be suitable for a particular workload / network. > > I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit > network (albeit with a crappy external card for my laptop), to put some > numbers to this. I used -r $s,100 to test sending a variable sized data to the > other size, with the other side always responding with 100 bytes (assuming > that'd more than fit a feedback response). > > Command: > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields; fors in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done > > 10gbe: > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 43 64.30 390 96 15526.084 > 100 100 57 75.12 428 122 13286.602 > 1000 100 47 74.41 270 108 13412.125 > 10000 100 89 114.63 712 152 8700.643 > 100000 100 167 255.90 584 312 3903.516 > 1000000 100 891 1015.99 2470 1143 983.708 > > > Same hosts, but with my workstation forced to use a 1gbit connection: > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 78 131.18 2425 257 7613.416 > 100 100 81 129.25 425 255 7727.473 > 1000 100 100 162.12 1444 266 6161.388 > 10000 100 310 686.19 1797 927 1456.204 > 100000 100 1006 1114.20 1472 1199 896.770 > 1000000 100 8338 8420.96 8827 8498 118.410 > > I haven't checked, but I'd assume that 100bytes back and forth should easily > fit a new message to update LSNs and the existing feedback response. Even just > the difference between sending 100 bytes and sending 10k (a bit more than a > single WAL page) is pretty significant on a 1gbit network. > I'm on decaf so I may be a bit slow, but it's not very clear to me what conclusion to draw from these numbers. What is the takeaway? My understanding is that in both cases the latency is initially fairly stable, independent of the request size. This applies to request up to ~1000B. And then the latency starts increasing fairly quickly, even though it shouldn't hit the bandwidth (except maybe the 1MB requests). I don't think it says we should be replicating WAL in tiny chunks, because if you need to send a chunk of data it's always more efficient to send it at once (compared to sending multiple smaller pieces). 
But if we manage to send most of this "in the background", only leaving the last small bit to be sent at the very end, that'd help. > Of course, the relatively low latency between these systems makes this more > pronounced than if this were a cross regional or even cross continental link, > were the roundtrip latency is more likely to be dominated by distance rather > than throughput. > > Testing between europe and western US: > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 157934 167627.12 317705 160000 5.652 > 100 100 161294 171323.59 324017 170000 5.530 > 1000 100 161392 171521.82 324629 170000 5.524 > 10000 100 163651 173651.06 328488 170000 5.456 > 100000 100 166344 198070.20 638205 170000 4.781 > 1000000 100 225555 361166.12 1302368 240000 2.568 > > > No meaningful difference before getting to 100k. But it's pretty easy to lag > by 100k on a longer distance link... > Right. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote: > > I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit > > network (albeit with a crappy external card for my laptop), to put some > > numbers to this. I used -r $s,100 to test sending a variable sized data to the > > other size, with the other side always responding with 100 bytes (assuming > > that'd more than fit a feedback response). > > > > Command: > > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;for s in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done > > > > 10gbe: > > > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > > 10 100 43 64.30 390 96 15526.084 > > 100 100 57 75.12 428 122 13286.602 > > 1000 100 47 74.41 270 108 13412.125 > > 10000 100 89 114.63 712 152 8700.643 > > 100000 100 167 255.90 584 312 3903.516 > > 1000000 100 891 1015.99 2470 1143 983.708 > > > > > > Same hosts, but with my workstation forced to use a 1gbit connection: > > > > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > > 10 100 78 131.18 2425 257 7613.416 > > 100 100 81 129.25 425 255 7727.473 > > 1000 100 100 162.12 1444 266 6161.388 > > 10000 100 310 686.19 1797 927 1456.204 > > 100000 100 1006 1114.20 1472 1199 896.770 > > 1000000 100 8338 8420.96 8827 8498 118.410 Looks like the 1gbit numbers were somewhat bogus-ified due having configured jumbo frames and some network component doing something odd with that (handling them in software maybe?). 10gbe: request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate 10 100 56 68.56 483 87 14562.476 100 100 57 75.68 353 123 13185.485 1000 100 60 71.97 391 94 13870.659 10000 100 58 92.42 489 140 10798.444 100000 100 184 260.48 1141 338 3834.504 1000000 100 926 1071.46 2012 1466 933.009 1gbe request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate 10 100 77 132.19 1097 257 7555.420 100 100 79 127.85 534 249 7810.862 1000 100 98 155.91 966 265 6406.818 10000 100 176 235.37 1451 314 4245.304 100000 100 944 1022.00 1380 1148 977.930 1000000 100 8649 8768.42 9018 8895 113.703 > > I haven't checked, but I'd assume that 100bytes back and forth should easily > > fit a new message to update LSNs and the existing feedback response. Even just > > the difference between sending 100 bytes and sending 10k (a bit more than a > > single WAL page) is pretty significant on a 1gbit network. > > > > I'm on decaf so I may be a bit slow, but it's not very clear to me what > conclusion to draw from these numbers. What is the takeaway? > > My understanding is that in both cases the latency is initially fairly > stable, independent of the request size. This applies to request up to > ~1000B. And then the latency starts increasing fairly quickly, even > though it shouldn't hit the bandwidth (except maybe the 1MB requests). Except for the smallest end, these are bandwidth related, I think. Converting 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on 1gbit. If you subtract the minmum latency of about 130us, that's nearly all of the latency. The reason this matters is that the numbers show that the latency of having to send a small message with updated positions is far smaller than having to send all the outstanding data. 
Even having to send a single WAL page over the network ~doubles the latency of the response on 1gbit! Of course the impact is smaller on 10gbit, but even there latency substantially increases around 100kB of outstanding data. In a local pgbench with 32 clients I see WAL write sizes between 8kB and ~220kB. Being able to stream those out before the local flush completed therefore seems likely to reduce synchronous_commit overhead substantially. > I don't think it says we should be replicating WAL in tiny chunks, > because if you need to send a chunk of data it's always more efficient > to send it at once (compared to sending multiple smaller pieces). I don't think that's a very large factor for network data, once your minimal data size is ~8kB (or ~4kB if we lower wal_block_size). TCP messages will get chunked into something smaller anyway and small messages don't need to get acknowledged individually. Sending more data at once is good for CPU efficiency (reducing syscall and network device overhead), but doesn't do much for throughput. Sending 4kB of data in each send() in a bandwidth oriented test already gets to ~9.3gbit/s in my network. That's close to the maximum attainable with normal framing. If I change the mtu back to 9000 I get 9.89 gbit/s, again very close to the theoretical max. Greetings, Andres Freund
On 11/8/23 18:11, Andres Freund wrote: > Hi, > > On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote: >>> I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit >>> network (albeit with a crappy external card for my laptop), to put some >>> numbers to this. I used -r $s,100 to test sending a variable sized data to the >>> other size, with the other side always responding with 100 bytes (assuming >>> that'd more than fit a feedback response). >>> >>> Command: >>> fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;for s in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done >>> >>> 10gbe: >>> >>> request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate >>> 10 100 43 64.30 390 96 15526.084 >>> 100 100 57 75.12 428 122 13286.602 >>> 1000 100 47 74.41 270 108 13412.125 >>> 10000 100 89 114.63 712 152 8700.643 >>> 100000 100 167 255.90 584 312 3903.516 >>> 1000000 100 891 1015.99 2470 1143 983.708 >>> >>> >>> Same hosts, but with my workstation forced to use a 1gbit connection: >>> >>> request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate >>> 10 100 78 131.18 2425 257 7613.416 >>> 100 100 81 129.25 425 255 7727.473 >>> 1000 100 100 162.12 1444 266 6161.388 >>> 10000 100 310 686.19 1797 927 1456.204 >>> 100000 100 1006 1114.20 1472 1199 896.770 >>> 1000000 100 8338 8420.96 8827 8498 118.410 > > Looks like the 1gbit numbers were somewhat bogus-ified due having configured > jumbo frames and some network component doing something odd with that > (handling them in software maybe?). > > 10gbe: > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 56 68.56 483 87 14562.476 > 100 100 57 75.68 353 123 13185.485 > 1000 100 60 71.97 391 94 13870.659 > 10000 100 58 92.42 489 140 10798.444 > 100000 100 184 260.48 1141 338 3834.504 > 1000000 100 926 1071.46 2012 1466 933.009 > > 1gbe > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate > 10 100 77 132.19 1097 257 7555.420 > 100 100 79 127.85 534 249 7810.862 > 1000 100 98 155.91 966 265 6406.818 > 10000 100 176 235.37 1451 314 4245.304 > 100000 100 944 1022.00 1380 1148 977.930 > 1000000 100 8649 8768.42 9018 8895 113.703 > > >>> I haven't checked, but I'd assume that 100bytes back and forth should easily >>> fit a new message to update LSNs and the existing feedback response. Even just >>> the difference between sending 100 bytes and sending 10k (a bit more than a >>> single WAL page) is pretty significant on a 1gbit network. >>> >> >> I'm on decaf so I may be a bit slow, but it's not very clear to me what >> conclusion to draw from these numbers. What is the takeaway? >> >> My understanding is that in both cases the latency is initially fairly >> stable, independent of the request size. This applies to request up to >> ~1000B. And then the latency starts increasing fairly quickly, even >> though it shouldn't hit the bandwidth (except maybe the 1MB requests). > > Except for the smallest end, these are bandwidth related, I think. Converting > 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving > the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on > 1gbit. If you subtract the minmum latency of about 130us, that's nearly all of > the latency. 
> Maybe I don't understand what you mean "bandwidth related" but surely the smaller requests are not limited by bandwidth. I mean, 100B and 1kB (and even 10kB) requests have almost the same transaction rate, yet there's an order of magnitude difference in bandwidth (sure, there's overhead, but this much magnitude?). On the higher end, sure, that seems bandwidth related. But for 100kB, it's still just ~50% of the 1Gbps. > The reason this matters is that the numbers show that the latency of having to > send a small message with updated positions is far smaller than having to send > all the outstanding data. Even having to send a single WAL page over the > network ~doubles the latency of the response on 1gbit! Of course the impact > is smaller on 10gbit, but even there latency substantially increases around > 100kB of outstanding data. Understood. I wonder if this is one of the things we'd need to measure to adjust the write size (i.e. how eagerly to write WAL to disk / over network). Essentially, we'd get an the size where the latency starts increasing much faster, and try to write WAL faster than that. I wonder if storage (not network) has a similar pattern. > In a local pgbench with 32 clients I see WAL write sizes between 8kB and > ~220kB. Being able to stream those out before the local flush completed > therefore seems likely to reduce synchronous_commit overhead substantially. > Yeah, those writes are certainly too large. If we can write them earlier, and then only do smaller messages to write the remaining bit and the positions, that'd help a lot. > >> I don't think it says we should be replicating WAL in tiny chunks, >> because if you need to send a chunk of data it's always more efficient >> to send it at once (compared to sending multiple smaller pieces). > > I don't think that's a very large factor for network data, once your minimal > data sizes is ~8kB (or ~4kB if we lower wal_block_size). TCP messsages will > get chunked into something smaller anyway and small messages don't need to get > acknowledged individually. Sending more data at once is good for CPU > efficiency (reducing syscall and network device overhead), but doesn't do much > for throughput. > > Sending 4kB of data in each send() in a bandwidth oriented test already gets > to ~9.3gbit/s in my network. That's close to the maximum atainable with normal > framing. If I change the mtu back to 9000 I get 9.89 gbit/s, again very close > to the theoretical max. > Got it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
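As an illustration of the measurement idea above - bin observed write sizes against latency and derive the largest size that still fits a latency target - a rough sketch; all names are invented and nothing like this exists in the patches:

#define WRITE_LAT_NBINS 8				/* 4kB, 8kB, ..., 512kB */

typedef struct WalWriteLatStats
{
	uint64		nsamples[WRITE_LAT_NBINS];
	uint64		total_usec[WRITE_LAT_NBINS];
} WalWriteLatStats;

static WalWriteLatStats write_stats;

/* record one observation, e.g. from an instrumented XLogFlush */
static void
wal_write_stats_record(uint64 nbytes, uint64 elapsed_usec)
{
	int			bin = 0;
	uint64		sz = 4096;

	while (bin < WRITE_LAT_NBINS - 1 && nbytes > sz)
	{
		sz <<= 1;
		bin++;
	}
	write_stats.nsamples[bin]++;
	write_stats.total_usec[bin] += elapsed_usec;
}

/* largest write size whose average latency still meets the target */
static uint64
wal_write_stats_target_size(uint64 target_usec)
{
	uint64		best = 4096;
	uint64		sz = 4096;
	int			bin;

	for (bin = 0; bin < WRITE_LAT_NBINS; bin++, sz <<= 1)
	{
		if (write_stats.nsamples[bin] == 0)
			continue;
		if (write_stats.total_usec[bin] / write_stats.nsamples[bin] <= target_usec)
			best = sz;
	}
	return best;
}

With a target of, say, 10ms this would answer "how much unflushed WAL can we afford" for the storage/network actually in use, which is what both the XLogFlush-timing idea and the write-size question above are after.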
Hi, On 2023-11-08 19:29:38 +0100, Tomas Vondra wrote: > >>> I haven't checked, but I'd assume that 100bytes back and forth should easily > >>> fit a new message to update LSNs and the existing feedback response. Even just > >>> the difference between sending 100 bytes and sending 10k (a bit more than a > >>> single WAL page) is pretty significant on a 1gbit network. > >>> > >> > >> I'm on decaf so I may be a bit slow, but it's not very clear to me what > >> conclusion to draw from these numbers. What is the takeaway? > >> > >> My understanding is that in both cases the latency is initially fairly > >> stable, independent of the request size. This applies to request up to > >> ~1000B. And then the latency starts increasing fairly quickly, even > >> though it shouldn't hit the bandwidth (except maybe the 1MB requests). > > > > Except for the smallest end, these are bandwidth related, I think. Converting > > 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving > > the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on > > 1gbit. If you subtract the minmum latency of about 130us, that's nearly all of > > the latency. > > > > Maybe I don't understand what you mean "bandwidth related" but surely > the smaller requests are not limited by bandwidth. I mean, 100B and 1kB > (and even 10kB) requests have almost the same transaction rate, yet > there's an order of magnitude difference in bandwidth (sure, there's > overhead, but this much magnitude?). What I mean is that bandwidth is the biggest factor determining latency in the numbers I showed (due to decent sized packet and it being a local network). At line rate it takes ~80us to send 10kB over 1gbit ethernet. So a roundtrip cannot be faster than 80us, even if everything else added zero latency. That's why my numbers show such a lower latency for the 10gbit network - it's simply faster to put even small-ish amounts of data onto the wire. That does not mean that the link is fully utilized over time - because we wait for the other side to receive the data, wake up a user space process, send back 100 bytes, wait for the data be transmitted, and then wake up a process, there are periods where the link in one direction is largely idle. But in case of a 10kB packet on the 1gbit network, yes, we are bandwidth limited for ~80us (or perhaps more interestingly, we are bandwidth limited for 0.8ms when sending 100kB). Greetings, Andres Freund
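As a sanity check on that arithmetic, a tiny standalone program (illustrative only) that prints the raw wire time, ignoring framing and TCP overhead:

#include <stdio.h>

/* time on the wire for a payload at a given line rate, no overhead */
static double
wire_time_usec(double payload_bytes, double gbit_per_sec)
{
	double		bytes_per_usec = gbit_per_sec * 1e9 / 8.0 / 1e6;	/* 1 Gbit ~ 125 B/us */

	return payload_bytes / bytes_per_usec;
}

int
main(void)
{
	printf("10 kB @ 1 Gbit: %.0f us\n", wire_time_usec(10 * 1024, 1.0));	/* ~82 us */
	printf("100 kB @ 1 Gbit: %.0f us\n", wire_time_usec(100 * 1024, 1.0));	/* ~819 us */
	printf("10 kB @ 10 Gbit: %.0f us\n", wire_time_usec(10 * 1024, 10.0));	/* ~8 us */
	return 0;
}

Which lines up with the measured numbers: the 10kB and 100kB rows on 1gbit are dominated by transmission time rather than by the round trip itself.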
Hi, Since the last patch version I've done a number of experiments with this throttling idea, so let me share some of the ideas and results, and see where that gets us. The patch versions so far tied everything to syncrep - commit latency with sync replica was the original motivation, so this makes sense. But while thinking about this and discussing this with a couple people, I've been wondering why to limit this to just that particular option. There's a couple other places in the WAL write path where we might do a similar thing (i.e. wait) or be a bit more aggressive (and do a write/flush), depending on circumstances. If I simplify this a bit, there are about 3 WAL positions that I could think of: - write LSN (how far we wrote WAL to disk) - flush LSN (how far we flushed WAL to local disk) - syncrep LSN (how far the sync replica confirmed WAL) So, why couldn't there be a similar "throttling threshold" for these events too? Imagine we have three GUCs, with values satisfying this: wal_write_after < wal_flush_after_local < wal_flush_after_remote and this meaning: wal_write_after - if a backend generates this amount of WAL, it will write the completed WAL (but only whole pages) wal_flush_after_local - if a backend generates this amount of WAL, it will not only write the WAL, but also issue a flush (if still needed) wal_flush_after_remote - if this amount of WAL is generated, it will wait for syncrep to confirm the flushed LSN The attached PoC patch does this, mostly the same way as earlier patches. XLogInsertRecord is where the decision whether throttling may be needed is done, HandleXLogDelayPending then does the actual work (writing WAL, flushing it, waiting for syncrep). The one new thing HandleXLogDelayPending also does is auto-tuning the values a bit. The idea is that with per-backend threshold, it's hard to enforce some sort of global limit, because if depends on the number of active backends. If you set 1MB of WAL per backend, the total might be 1MB or 1000MB, if there are 1000 backends. Who knows. So this tries to reduce the threshold (if the backend generated only a tiny fraction of the WAL), or increase the threshold (if it generated most of it). I'm not entirely sure this behaves sanely under all circumstances, but for a PoC patch it seems OK. The first two GUCs remind me what walwriter is doing, and I've been asking myself if maybe making it more aggressive would have the same effect. But I don't think so, because a big part of this throttling patch is ... well, throttling. Making the backends sleep for a bit (or wait for something), to slow it down. And walwriter doesn't really do that I think. In a recent off-list discussion, someone asked if maybe this might be useful to prevent emergencies due to archiver not keeping up and WAL filling disk. A bit like enforcing a more "strict" limit on WAL than the current max_wal_size GUC. I'm not sure about that, it's certainly a very different use case than minimizing impact on OLTP latency. But it seems like "archived LSN" might be another "event" the backends would wait for, just like they wait for syncrep to confirm a LSN. Ideally it'd never happen, ofc, and it seems a bit like a great footgun (outage on archiver may kill PROD), but if you're at risk of ENOSPACE on pg_wal, not doing anything may be risky too ... FWIW I wonder if maybe we should frame this a as a QoS feature, where instead of "minimize impact of bulk loads" we'd try to "guarantee" or "reserve" some part of the capacity to certain backends/... 
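For illustration, the check after inserting a record could look roughly like this - the GUC names match the ones above, everything else (the enum, the helper, passing the byte counter as an argument) is made up here and the actual PoC differs in the details; the expensive work is still deferred to HandleXLogDelayPending() via ProcessInterrupts():

typedef enum
{
	WAL_THROTTLE_NONE,
	WAL_THROTTLE_WRITE,			/* write out completed WAL pages */
	WAL_THROTTLE_FLUSH_LOCAL,	/* also flush WAL locally */
	WAL_THROTTLE_FLUSH_REMOTE	/* also wait for syncrep confirmation */
} XLogThrottleMode;

static XLogThrottleMode pending_throttle = WAL_THROTTLE_NONE;

/* thresholds are the GUCs described above, taken here as bytes */
static void
MaybeScheduleWalThrottle(uint64 backendWalInserted)
{
	XLogThrottleMode mode = WAL_THROTTLE_NONE;

	if (wal_flush_after_remote > 0 && backendWalInserted >= wal_flush_after_remote)
		mode = WAL_THROTTLE_FLUSH_REMOTE;
	else if (wal_flush_after_local > 0 && backendWalInserted >= wal_flush_after_local)
		mode = WAL_THROTTLE_FLUSH_LOCAL;
	else if (wal_write_after > 0 && backendWalInserted >= wal_write_after)
		mode = WAL_THROTTLE_WRITE;

	if (mode != WAL_THROTTLE_NONE)
	{
		pending_throttle = mode;
		XLogDelayPending = true;
		InterruptPending = true;	/* handled at the next CHECK_FOR_INTERRUPTS() */
	}
}

The auto-tuning of the per-backend thresholds mentioned above would then adjust the effective values between calls; that part is omitted here.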
Now, let's look at results from some of the experiments. I wanted to see how effective this approach could be in minimizing the impact of large bulk loads on small OLTP transactions in different setups. Thanks to the two new GUCs this is not strictly about syncrep, so I decided to try three cases: 1) local, i.e. single-node instance 2) syncrep on the same switch, with 0.1ms latency (1Gbit) 3) syncrep with 10ms latency (also 1Gbit) And for each configuration I ran a pgbench (30 minutes), either on its own, or concurrently with bulk COPY of 1GB data. The load was done either by a single backend (so one backend loading 1GB of data), or the file was split into 10 files of 100MB each, and this was loaded by 10 concurrent backends. And I did this test with three configurations: (a) master - unpatched, current behavior (b) throttle-1: patched with limits set like this: # Add settings for extensions here wal_write_after = '8kB' wal_flush_after_local = '16kB' wal_flush_after_remote = '32kB' (c) throttle-2: patched with throttling limits set to 4x of (b), i.e. # Add settings for extensions here wal_write_after = '32kB' wal_flush_after_local = '64kB' wal_flush_after_remote = '128kB' And I did this for the traditional three scales (small, medium, large), to hit different bottlenecks. And of course, I measured both throughput and latencies. The full results are available here: [1] https://github.com/tvondra/wal-throttle-results/tree/master I'm not going to attach the files visualizing the results here, because it's like 1MB per file, which is not great for e-mail. https://github.com/tvondra/wal-throttle-results/blob/master/wal-throttling.pdf ---------------------------------------------------------------------- The first file summarizes the throughput results for the three configurations, different scales etc. On the left is throughput, on the right is the number of load cycles completed. I think this behaves mostly as expected - with the bulk loads, the throughput drops. How much depends on the configuration (for syncrep it's far more pronounced). The throttling recovers a lot of it, at the expense of doing fewer loads - and it's quite a significant drop. But that's expected, and it was kinda what this patch was about - prioritise the small OLTP transactions by doing fewer loads. This is not a patch that would magically inflate capacity of the system to do more things. I however agree this does not really represent a typical production OLTP system. Those systems don't run at 100% saturation, except for short periods, certainly not if they're doing something latency sensitive. So a somewhat realistic test would be pgbench throttled at 75% capacity, leaving some spare capacity for the bulk loads. I actually tried that, and there are results in [1], but the behavior is pretty similar to what I'm describing here (except that the system does actually manage to do more bulk loads, ofc). https://raw.githubusercontent.com/tvondra/wal-throttle-results/master/syncrep/latencies-1000-full.eps ----------------------------------------------------------------------- Now let's look at the second file, which shows latency percentiles for the medium dataset on syncrep. The difference between master (on the left) and the two throttling builds is pretty obvious. It's not exactly the same as "no concurrent bulk loads" in the top row, but not far from it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hey Tomas, Shirisha had posted a recent re-attempt here [1] and then we were graciously redirected here by Jakub. We took a look at v5-0001-v4.patch and also a brief look at v5-0002-rework.patch. We feel that it might be worth considering throttling based on the remote standby to begin with for simplicity (i.e. wal_flush_after_remote), and then we can add the other knobs incrementally? Our comments on v5-0001-v4.patch are below: > + > + /* > + * Decide if we need to throttle this backend, so that it does not write > + * WAL too fast, causing lag against the sync standby (which in turn > + * increases latency for standby confirmations). We may be holding locks > + * and blocking interrupts here, so we only make the decision, but the > + * wait (for sync standby confirmation) happens elsewhere. Slightly reworded: * wait (for sync standby confirmation) happens as part of the next * CHECK_FOR_INTERRUPTS(). See HandleXLogDelayPending() for details on * why the delay is deferred. > + * > + * The throttling is applied only to large transactions (producing more > + * than wal_throttle_threshold kilobytes of WAL). Throttled backends > + * can be identified by a new wait event SYNC_REP_THROTTLED. > + * > + * Small transactions (by amount of produced WAL) are still subject to > + * the sync replication, so the same wait happens at commit time. > + * > + * XXX Not sure this is the right place for a comment explaining how the > + * throttling works. This place is way too low level, and rather far from > + * the place where the wait actually happens. Perhaps the best course of action is to move the code and the comments above into an inline function: SetAndCheckWALThrottle(). > + > +/* > + * HandleXLogDelayPending > + * Throttle backends generating large amounts of WAL. > + * > + * The throttling is implemented by waiting for a sync replica confirmation for > + * a convenient LSN position. In particular, we do not wait for the current LSN, > + * which may be in a partially filled WAL page (and we don't want to write this > + * one out - we'd have to write it out again, causing write amplification). > + * Instead, we move back to the last fully WAL page. > + * > + * Called from ProcessMessageInterrupts() to avoid syncrep waits in XLogInsert(), > + * which happens in critical section and with blocked interrupts (so it would be > + * impossible to cancel the wait if it gets stuck). Also, there may be locks held > + * and we don't want to hold them longer just because of the wait. > + * > + * XXX Andres suggested we actually go back a couple pages, to increase the > + * probability the LSN was already flushed (obviously, this depends on how much > + * lag we allow). > + * > + * XXX Not sure why we use XactLastThrottledRecEnd and not simply XLogRecEnd? > + */ > +void > +HandleXLogDelayPending() > +{ > + XLogRecPtr lsn; > + > + /* calculate last fully filled page */ > + lsn = XactLastThrottledRecEnd - (XactLastThrottledRecEnd % XLOG_BLCKSZ); > + > + Assert(wal_throttle_threshold > 0); > + Assert(backendWalInserted >= wal_throttle_threshold * 1024L); > + Assert(XactLastThrottledRecEnd != InvalidXLogRecPtr); > + > + XLogFlush(lsn); > + SyncRepWaitForLSN(lsn, false); > + > + XLogDelayPending = false; > +} (1) Can't we simply call SyncRepWaitForLSN(LogwrtResult.Flush, false); here? LogwrtResult.Flush will guarantee that we are waiting on something that has already been flushed or will be flushed soon. 
Then we wouldn't need to maintain XactLastThrottledRecEnd, nor call XLogFlush() before calling SyncRepWaitForLSN(). LogwrtResult can be slightly stale, but does that really matter here? (2) Also, to make things more understandable, instead of maintaining a counter to track the number of WAL bytes written, maybe we should maintain a LSN pointer called XLogDelayWindowStart. And then in here, we can assert: Assert(LogwrtResult.Flush - XLogDelayWindowStart > wal_throttle_threshold * 1024); Similarly, we can check the same conditional in SetAndCheckWALThrottle(). After the wait is done and we reset XLogDelayPending = false, we can perhaps reset XLogDelayWindowStart = LogwrtResult.Flush. The only downside probably is that if our standby is caught up enough, we may be repeatedly and unnecessarily invoking a HandleXLogDelayPending(), where our SyncRepWaitForLSN() would be called with an older LSN and it would be a no-op early exit. > + * XXX Should this be done even if XLogDelayPending is already set? Maybe > + * that should only update XactLastThrottledRecEnd, withoug incrementing > + * the pgWalUsage.wal_throttled counter? > + */ > + backendWalInserted += rechdr->xl_tot_len; > + > + if ((synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_WRITE) && > + (wal_throttle_threshold > 0) && > + (backendWalInserted >= wal_throttle_threshold * 1024L)) > + { > + XactLastThrottledRecEnd = XactLastRecEnd; > + InterruptPending = true; > + XLogDelayPending = true; > + pgWalUsage.wal_throttled++; XLogDelayWindowStart = LogwrtResult.Flush; > + } Yeah we shouldn't do all of this if XLogDelayPending is already set. We can add a quick !XLogDelayPending leading conditional. Also, [1] has a TAP test that may be useful. Regards, Soumyadeep & Shirisha Broadcom [1] https://www.postgresql.org/message-id/CAP3-t08umaBEUEppzBVY6%3D%3D3tbdLwG7b4wfrba73zfOAUrRsoQ%40mail.gmail.com
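As an illustration of that suggestion, a rough sketch of the LSN-window variant - SetAndCheckWALThrottle() and XLogDelayWindowStart follow the names proposed above, the rest (including using GetFlushRecPtr() instead of a cached LogwrtResult.Flush) is guessed and not meant as correct patch code:

static XLogRecPtr XLogDelayWindowStart = InvalidXLogRecPtr;

/* cheap check in the insert path; flush_lsn would be LogwrtResult.Flush */
static inline void
SetAndCheckWALThrottle(XLogRecPtr flush_lsn)
{
	if (wal_throttle_threshold <= 0 ||
		synchronous_commit < SYNCHRONOUS_COMMIT_REMOTE_WRITE)
		return;

	if (XLogDelayWindowStart == InvalidXLogRecPtr)
		XLogDelayWindowStart = flush_lsn;

	if (!XLogDelayPending &&
		flush_lsn - XLogDelayWindowStart >= wal_throttle_threshold * 1024L)
	{
		XLogDelayPending = true;
		InterruptPending = true;
	}
}

/* the deferred wait, reached from ProcessInterrupts() */
void
HandleXLogDelayPending(void)
{
	XLogRecPtr	flush_lsn = GetFlushRecPtr(NULL);

	/* no XLogFlush() needed: this LSN is already flushed */
	SyncRepWaitForLSN(flush_lsn, false);

	XLogDelayWindowStart = flush_lsn;
	XLogDelayPending = false;
}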
Hey Tomas, Would you be interested in registering the patch for CF 2024-11? Please let us know if there is anything we can do to help with additional review or testing. Regards, Soumyadeep & Shirisha Broadcom
Hi Soumyadeep & Shirisha,
On Wed, Oct 16, 2024 at 8:04 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> Hey Tomas,
> Would you be interested in registering the patch for CF 2024-11?
> Please let us know if there is anything we can do to help with additional review
> or testing.
Tomas has apparently never received your earlier email with the review [1] - I've included his current email in this message and I've registered the patch in the CF under https://commitfest.postgresql.org/50/5309/ . Even if this $topic doesn't get traction and is rejected/returned with feedback due to lack of feedback or work, having the record in the CF should help others in the future use those patches and design discussions as a good starting point. Right now I do not have plans to work on this, but maybe I'll be able to help in the future.
Regards,
-Jakub Wartak.