Measuring replay lag - Mailing list pgsql-hackers
From:      Thomas Munro
Subject:   Measuring replay lag
Date:
Msg-id:    CAEepm=37=p6AUYtm7v+qsyerqbDMQsmnP2-CJpT42j-rOqNBbQ@mail.gmail.com
Responses: Re: Measuring replay lag
           Re: [HACKERS] Measuring replay lag
List:      pgsql-hackers
Hi hackers,

Here is a new version of my patch to add a replay_lag column to the
pg_stat_replication view (originally proposed as part of a larger patch set
for 9.6[1]), like this:

postgres=# select application_name, replay_lag from pg_stat_replication;
┌──────────────────┬─────────────────┐
│ application_name │ replay_lag      │
├──────────────────┼─────────────────┤
│ replica1         │ 00:00:00.595382 │
│ replica2         │ 00:00:00.598448 │
│ replica3         │ 00:00:00.541597 │
│ replica4         │ 00:00:00.551227 │
└──────────────────┴─────────────────┘
(4 rows)

It works by taking advantage of the { time, end-of-WAL } samples that sending
servers already include in the message headers they send to standbys.  Those
provide a pretty good proxy for when the WAL was written, as long as you
ignore messages where the LSN hasn't advanced.  The patch introduces a new
GUC, replay_lag_sample_interval, defaulting to 1s, to control how often the
standby should record these timestamped LSNs into a small circular buffer.
When its recovery process eventually replays a timestamped LSN, the timestamp
is sent back to the upstream server in a new reply message field.  The value
visible in pg_stat_replication.replay_lag can then be updated by comparing
that timestamp with the current time.

Compared to the usual techniques people use to estimate replay lag, this
approach has the following advantages:

1.  The lag is measured in time, not LSN difference.
2.  The lag time is computed from two observations of a single server's
    clock, so there is no clock skew.
3.  The lag is updated even between commits (during large data loads etc).

In the previous version, when the standby was fully caught up and nothing was
happening, I was effectively showing the ping time between the servers.  I
decided that was not useful information, and that it's more newsworthy and
interesting to see the estimated replay lag for the most recent real replayed
activity, so I changed that.

In the last thread[1], Robert Haas wrote:

> Well, one problem with this is that you can't put a loop inside of a
> spinlock-protected critical section.

Fixed.

> In general, I think this is a pretty reasonable way of attacking this
> problem, but I'd say it's significantly under-commented.  Where should
> someone go to get a general overview of this mechanism?  The answer is
> not "at place XXX within the patch".  (I think it might merit some
> more extensive documentation, too, although I'm not exactly sure what
> that should look like.)

I have added lots of comments.

> When you overflow the buffer, you could thin it out in a smarter way,
> like by throwing away every other entry instead of the oldest one.  I
> guess you'd need to be careful how you coded that, though, because
> replaying an entry with a timestamp invalidates some of the saved
> entries without formally throwing them out.

Done, by overwriting the newest sample rather than the oldest if the buffer
is full (see the sketch below).  That gives reasonably graceful degradation,
effectively lowering the sampling rate, without any complicated buffer or
rate management code.

> Conceivably, 0002 could be split into two patches, one of which
> computes "stupid replay lag" considering only records that naturally
> carry timestamps, and a second adding the circular buffer to handle
> the case where much time passes without finding such a record.

I contemplated this, but decided that it'd be best to use ONLY samples from
walsender headers, and never use the timestamps from commit records for this.
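To make the bookkeeping described above concrete, here is a minimal
standalone sketch of a circular buffer of { LSN, timestamp } samples with
the overwrite-newest-when-full behaviour.  All names here (LagSample,
LagSampleBuffer, lag_sample_record, lag_sample_consume, the buffer size) are
invented for illustration and are not taken from the attached patch; the
stand-in integer types would be PostgreSQL's XLogRecPtr and TimestampTz in
real server code.

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t SampleLSN;        /* stand-in for XLogRecPtr */
typedef int64_t  SampleTimestamp;  /* stand-in for TimestampTz */

#define LAG_SAMPLE_BUFFER_SIZE 1024

typedef struct LagSample
{
    SampleLSN       lsn;   /* end-of-WAL position reported by the sender */
    SampleTimestamp time;  /* sender's clock when that position was sent */
} LagSample;

typedef struct LagSampleBuffer
{
    LagSample samples[LAG_SAMPLE_BUFFER_SIZE];
    int       read;   /* oldest unconsumed sample */
    int       write;  /* next slot to fill */
    int       count;  /* number of buffered samples */
} LagSampleBuffer;

/*
 * Record a sample taken from a walsender message header.  If the buffer is
 * full, overwrite the newest entry rather than the oldest, which effectively
 * lowers the sampling rate under pressure.
 */
void
lag_sample_record(LagSampleBuffer *buf, SampleLSN lsn, SampleTimestamp time)
{
    if (buf->count == LAG_SAMPLE_BUFFER_SIZE)
    {
        int newest = (buf->write + LAG_SAMPLE_BUFFER_SIZE - 1) %
            LAG_SAMPLE_BUFFER_SIZE;

        buf->samples[newest].lsn = lsn;
        buf->samples[newest].time = time;
        return;
    }
    buf->samples[buf->write].lsn = lsn;
    buf->samples[buf->write].time = time;
    buf->write = (buf->write + 1) % LAG_SAMPLE_BUFFER_SIZE;
    buf->count++;
}

/*
 * Called when recovery has replayed up to 'replayed_lsn'.  Consume every
 * buffered sample at or before that position and report the newest such
 * timestamp; returns false if no buffered sample has been reached yet.
 */
bool
lag_sample_consume(LagSampleBuffer *buf, SampleLSN replayed_lsn,
                   SampleTimestamp *reply_time)
{
    bool found = false;

    while (buf->count > 0 && buf->samples[buf->read].lsn <= replayed_lsn)
    {
        *reply_time = buf->samples[buf->read].time;
        buf->read = (buf->read + 1) % LAG_SAMPLE_BUFFER_SIZE;
        buf->count--;
        found = true;
    }
    return found;
}

In this sketch, recording would be driven by incoming walsender headers at
most once per replay_lag_sample_interval, consuming by the recovery process
as it advances, and the reported timestamp is what would go back to the
upstream server in the reply message.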
If we use times from commit records, then a cascading sending server will not
be able to compute the time difference without introducing clock skew (not to
mention the difficulty of combining timestamps from two sources if we try to
do both).  I figured that it's better to have a value that shows a cascading
sender -> standby -> cascading sender round trip that is free of clock skew,
than a master -> cascading sender -> standby -> cascading sender incomplete
round trip that includes clock skew.  By the same reasoning, I decided against
introducing a new periodic WAL record or field from the master to hold extra
timestamps between commits, in favour of the buffered transient timestamp
approach taken in this patch.

That said, I can see there are arguments for doing it with extra periodic WAL
timestamps, if people don't think it'd be too invasive to mess with the WAL
for this and don't care about cascading standbys giving skewed readings.  One
advantage is that persistent WAL timestamps could still provide lag estimates
while a standby that has been down for a while is catching up; this approach
can't do that until the standby has caught up, because there are no buffered
transient timestamps for the older WAL.  Thoughts?

I plan to post a new "causal reads" patch at some point which will depend on
this, but in any case I think this is a useful feature on its own.  I'd be
grateful for any feedback, flames, better ideas etc.  Thanks for reading.

[1] https://www.postgresql.org/message-id/CAEepm%3D31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
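As an illustration of the single-clock property argued for above: the
timestamp the standby echoes back originated on the sending server, so the
sender can subtract it from its own current clock reading without any
cross-server skew.  A minimal sketch, again with invented names and a plain
integer microsecond timestamp standing in for TimestampTz:

#include <stdint.h>

typedef int64_t SampleTimestamp;   /* stand-in for TimestampTz (microseconds) */

/*
 * Hypothetical sketch only: derive the interval reported as
 * pg_stat_replication.replay_lag when a standby reply arrives.  Both
 * 'now' and 'reply_time' were read from this server's clock (reply_time
 * was sent to the standby earlier and merely echoed back), so the
 * difference involves a single clock and cannot suffer from skew.
 */
SampleTimestamp
compute_replay_lag(SampleTimestamp reply_time, SampleTimestamp now)
{
    SampleTimestamp lag = now - reply_time;

    /* Clamp in case this server's clock was stepped backwards. */
    return (lag < 0) ? 0 : lag;
}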