Re: Syncrep and improving latency due to WAL throttling - Mailing list pgsql-hackers

From:           Tomas Vondra
Subject:        Re: Syncrep and improving latency due to WAL throttling
Msg-id:         ead51688-958e-2f3b-ae72-baff0031a9c3@enterprisedb.com
In response to: Re: Syncrep and improving latency due to WAL throttling (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses:      Re: Syncrep and improving latency due to WAL throttling
List:           pgsql-hackers
Hi,

Since the last patch version I've done a number of experiments with this throttling idea, so let me share some of the ideas and results, and see where that gets us.

The patch versions so far tied everything to syncrep - commit latency with a sync replica was the original motivation, so this makes sense. But while thinking about this and discussing it with a couple of people, I've been wondering why we should limit this to just that particular option. There are a couple of other places in the WAL write path where we might do a similar thing (i.e. wait) or be a bit more aggressive (and do a write/flush), depending on circumstances.

If I simplify this a bit, there are about three WAL positions that I could think of:

- write LSN (how far we wrote WAL to disk)
- flush LSN (how far we flushed WAL to local disk)
- syncrep LSN (how far the sync replica confirmed WAL)

So, why couldn't there be a similar "throttling threshold" for these events too? Imagine we have three GUCs, with values satisfying this:

    wal_write_after < wal_flush_after_local < wal_flush_after_remote

and this meaning:

    wal_write_after - if a backend generates this amount of WAL, it will
    write the completed WAL (but only whole pages)

    wal_flush_after_local - if a backend generates this amount of WAL, it
    will not only write the WAL, but also issue a flush (if still needed)

    wal_flush_after_remote - if this amount of WAL is generated, it will
    wait for syncrep to confirm the flushed LSN

The attached PoC patch does this, mostly the same way as earlier patches. XLogInsertRecord is where the decision whether throttling may be needed is made, and HandleXLogDelayPending then does the actual work (writing WAL, flushing it, waiting for syncrep).

The one new thing HandleXLogDelayPending also does is auto-tuning the values a bit. The idea is that with a per-backend threshold, it's hard to enforce any sort of global limit, because it depends on the number of active backends. If you set 1MB of WAL per backend, the total might be 1MB or 1000MB, if there are 1000 backends. Who knows. So this tries to reduce the threshold (if the backend generated only a tiny fraction of the WAL) or increase the threshold (if it generated most of it). I'm not entirely sure this behaves sanely under all circumstances, but for a PoC patch it seems OK.

The first two GUCs remind me of what walwriter is doing, and I've been asking myself if maybe making it more aggressive would have the same effect. But I don't think so, because a big part of this throttling patch is ... well, throttling. Making the backends sleep for a bit (or wait for something), to slow them down. And walwriter doesn't really do that, I think.

In a recent off-list discussion, someone asked if maybe this might be useful to prevent emergencies due to the archiver not keeping up and WAL filling the disk. A bit like enforcing a more "strict" limit on WAL than the current max_wal_size GUC. I'm not sure about that; it's certainly a very different use case than minimizing impact on OLTP latency. But it seems like "archived LSN" might be another "event" the backends could wait for, just like they wait for syncrep to confirm an LSN. Ideally it'd never happen, ofc, and it seems a bit like a great footgun (an outage on the archiver may kill PROD), but if you're at risk of ENOSPACE on pg_wal, not doing anything may be risky too ...

FWIW I wonder if maybe we should frame this as a QoS feature, where instead of "minimize impact of bulk loads" we'd try to "guarantee" or "reserve" some part of the capacity for certain backends/...
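[Editor's note: the following is not part of the original mail or the attached PoC patch. It is a small, self-contained C sketch of the decision logic described above, added to make the proposal concrete. The GUC names match the proposal, but the type/function names (ThrottleAction, throttle_action, autotune_threshold), the auto-tuning formula and the clamping constants are assumptions made purely for illustration.]

    /*
     * Standalone model of the proposed per-backend throttling decision.
     * Not the actual patch; names and the auto-tuning rule are assumed.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef enum ThrottleAction
    {
        THROTTLE_NONE,          /* below all thresholds, keep going */
        THROTTLE_WRITE,         /* write out completed WAL pages */
        THROTTLE_FLUSH_LOCAL,   /* write + flush WAL to local disk */
        THROTTLE_WAIT_REMOTE    /* write + flush + wait for syncrep confirm */
    } ThrottleAction;

    /* proposed GUCs (values from the "throttle-1" run below), in bytes */
    static uint64_t wal_write_after = 8 * 1024;
    static uint64_t wal_flush_after_local = 16 * 1024;
    static uint64_t wal_flush_after_remote = 32 * 1024;

    /*
     * Pick the most expensive action whose threshold the backend crossed,
     * based on WAL bytes produced since it last wrote/flushed/waited.
     */
    static ThrottleAction
    throttle_action(uint64_t backend_wal_bytes)
    {
        if (backend_wal_bytes >= wal_flush_after_remote)
            return THROTTLE_WAIT_REMOTE;
        if (backend_wal_bytes >= wal_flush_after_local)
            return THROTTLE_FLUSH_LOCAL;
        if (backend_wal_bytes >= wal_write_after)
            return THROTTLE_WRITE;
        return THROTTLE_NONE;
    }

    /*
     * One possible auto-tuning rule (the mail does not spell out the
     * formula): scale the base threshold by the backend's share of all WAL
     * generated, times the number of active backends, clamped to a sane
     * range.  A backend producing most of the WAL gets a larger threshold;
     * one producing a tiny fraction gets a smaller one.
     */
    static uint64_t
    autotune_threshold(uint64_t base, uint64_t backend_wal,
                       uint64_t total_wal, int active_backends)
    {
        double share = (total_wal > 0) ? (double) backend_wal / total_wal : 0.0;
        double scaled = base * share * active_backends;

        if (scaled < base / 4)
            scaled = base / 4;      /* don't throttle tiny producers too hard */
        if (scaled > base * 4)
            scaled = base * 4;      /* don't let heavy producers escape entirely */
        return (uint64_t) scaled;
    }

    int
    main(void)
    {
        /* 20kB of WAL crosses the local-flush threshold but not the remote one */
        printf("action for 20kB of WAL: %d\n", throttle_action(20 * 1024));

        /* a backend that generated 90% of the WAL gets a (clamped) larger threshold */
        printf("tuned threshold: %lu bytes\n",
               (unsigned long) autotune_threshold(wal_flush_after_remote,
                                                  900 * 1024, 1000 * 1024, 10));
        return 0;
    }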
Now, let's look at results from some of the experiments. I wanted to see how effective this approach could be in minimizing the impact of large bulk loads on small OLTP transactions in different setups. Thanks to the two new GUCs this is not strictly about syncrep, so I decided to try three cases:

1) local, i.e. a single-node instance

2) syncrep on the same switch, with 0.1ms latency (1Gbit)

3) syncrep with 10ms latency (also 1Gbit)

For each configuration I ran pgbench (30 minutes), either on its own or concurrently with a bulk COPY of 1GB of data. The load was done either by a single backend (so one backend loading 1GB of data), or the file was split into 10 files of 100MB each, loaded by 10 concurrent backends.

And I did this test with three configurations:

(a) master - unpatched, current behavior

(b) throttle-1: patched, with limits set like this:

    # Add settings for extensions here
    wal_write_after = '8kB'
    wal_flush_after_local = '16kB'
    wal_flush_after_remote = '32kB'

(c) throttle-2: patched, with throttling limits set to 4x of (b), i.e.

    # Add settings for extensions here
    wal_write_after = '32kB'
    wal_flush_after_local = '64kB'
    wal_flush_after_remote = '128kB'

And I did this for the traditional three scales (small, medium, large), to hit different bottlenecks. And of course, I measured both throughput and latencies.

The full results are available here:

[1] https://github.com/tvondra/wal-throttle-results/tree/master

I'm not going to attach the files visualizing the results here, because it's like 1MB per file, which is not great for e-mail.

https://github.com/tvondra/wal-throttle-results/blob/master/wal-throttling.pdf
----------------------------------------------------------------------

The first file summarizes the throughput results for the three configurations, different scales etc. On the left is throughput, on the right is the number of load cycles completed.

I think this behaves mostly as expected - with the bulk loads, the throughput drops. How much depends on the configuration (for syncrep it's far more pronounced). The throttling recovers a lot of it, at the expense of doing fewer loads - and it's quite a significant drop. But that's expected, and it's kinda what this patch is about - prioritise the small OLTP transactions by doing fewer loads. This is not a patch that would magically inflate the capacity of the system to do more things.

I however agree this does not really represent a typical production OLTP system. Those systems don't run at 100% saturation, except for short periods, and certainly not if they're doing something latency sensitive. So a somewhat more realistic test would be pgbench throttled at 75% capacity, leaving some spare capacity for the bulk loads. I actually tried that, and there are results in [1], but the behavior is pretty similar to what I'm describing here (except that the system actually manages to do more bulk loads, ofc).

https://raw.githubusercontent.com/tvondra/wal-throttle-results/master/syncrep/latencies-1000-full.eps
-----------------------------------------------------------------------

Now let's look at the second file, which shows latency percentiles for the medium dataset on syncrep. The difference between master (on the left) and the two throttling builds is pretty obvious. It's not exactly the same as "no concurrent bulk loads" in the top row, but not far from it.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company