Re: Proposal: "Causal reads" mode for load balancing reads without stale data - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Proposal: "Causal reads" mode for load balancing reads without stale data |
Date | |
Msg-id | CAEepm=_qNup8xFKNV_mdbX_z99S8pzWLZVXkN2hmzqx_3dkQ@mail.gmail.com |
In response to | Re: Proposal: "Causal reads" mode for load balancing reads without stale data (Michael Paquier <michael.paquier@gmail.com>) |
List | pgsql-hackers |
On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Mar 1, 2016 at 11:53 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>
>>> Hi,
>>>
>>> On 2016/02/29 18:05, Thomas Munro wrote:
>>>> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
>>>>> +        servers.  A transaction that is run with <varname>causal_reads</> set
>>>>> +        to <literal>on</> is guaranteed either to see the effects of all
>>>>> +        completed transactions run on the primary with the setting on, or to
>>>>> +        receive an error "standby is not available for causal reads".
>>>>>
>>>>> "A transaction that is run" means "A transaction that is run on a
>>>>> standby", right?
>>>>
>>>> Well, it could be any server, standby or primary.  Of course standbys
>>>> are the interesting case, since it was already true that if you run
>>>> two sequential transactions on the primary, the second can see the
>>>> effects of the first, but I like the idea of a general rule that
>>>> applies anywhere, allowing you not to care which server it is.
>>>
>>> Actually, I meant in the context of that sentence only.
>>
>> OK, here's a new version that includes that change, fixes a conflict
>> with recent commit 10b48522 and removes an accidental duplicate copy
>> of the README file.
>
> Finally I got my eyes on this patch.
>
> To put it short, this patch introduces two different concepts:
> - Addition of a new value, remote_apply, for synchronous_commit.  This
> is where things overlap a bit with the N-sync patch, because by
> combining remote_apply with the N-sync patch it becomes possible to
> ensure that an LSN is applied to multiple targets instead of just one.
> In this case, as the master waits for the LSN to be applied on the
> sync standby, there is no need for the application to have error
> handling for read transactions on the standby, as the change is
> already visible there by the time the commit returns on the master.
> However, the cost is that if the standby node lags behind, it adds
> some extra wait on the master side as well.
> - Causal reads, which lets the master drop standbys that are lagging
> too far behind, and lets callers on a standby know that it is lagging
> too far behind and cannot satisfy causal reads.  In this case the
> application needs error handling for the case of a standby lagging
> too far behind.
>
> That's one of my concerns about this patch now: it is trying to do too
> much.  I think that there is definitely a need for both things:
> applications may be fine to pay the lagging price when remote_apply is
> used and skip the extra error handling layer, or they cannot accept a
> potentially large lag and are ready to add an extra error handling
> layer to manage read failures on standbys.  So I would recommend a
> split to begin with:
> 1) Addition of remote_apply in synchronous_commit, which would be
> quite a simple patch, and I am convinced that there are benefits in
> having that.  Compared to the previous patch sent, a standby message is
> sent back to the master once the COMMIT has been applied, accelerating
> things a bit.
> 2) Patch for causal reads, with all its extra parametrization logic
> and stuff to select standby candidates.

Thanks.  Yes, this makes a lot of sense.  I have done some work on
splitting this up and will post the result soon, along with my
responses to your other feedback.
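To spell out the difference from the application's point of view, here is a
rough sketch using the GUC names and error text from the patch as posted
(the accounts table is just a placeholder, and the details may of course
change once things are split up):

    -- 1) synchronous_commit = remote_apply: the primary blocks at commit
    --    until the synchronous standby has applied the WAL, so a later
    --    read on that standby needs no extra error handling.
    SET synchronous_commit = remote_apply;     -- on the primary
    INSERT INTO accounts VALUES (1, 100);      -- returns once applied on the sync standby

    -- 2) causal_reads = on: the check happens at read time on the standby;
    --    if the standby cannot currently guarantee causal consistency the
    --    query fails with "standby is not available for causal reads" and
    --    the application retries on another node.
    SET causal_reads = on;                     -- on the standby
    SELECT balance FROM accounts WHERE id = 1;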
> Here is as well a more detailed review...
>
> Regression tests are failing: rules.sql is generating diffs because
> pg_stat_replication is changed.
>
> CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE,
> COMMIT PREPARED and ROLLBACK PREPARED.  WAL replay for 2PC should also
> call XLogRequestWalReceiverReply() when needed.
>
> The new parameters are missing from postgresql.conf.sample.
>
> +static bool
> +SyncRepCheckEarlyExit(void)
> +{
> Doing this refactoring would actually make sense as a separate patch I
> think, and that would simplify the core patch for causal reads.
>
> +For this reason we say that the causal reads guarantee only holds as
> +long as the absolute difference between the system clocks of the
> +machines is no more than max_clock_skew.  The theory is that NTP makes
> +it possible to reason about the maximum possible clock difference
> +between machines and choose a value that allows for a much larger
> +difference.  However, we do make a best effort attempt to detect
> +misconfigured systems as described above, to catch the case of servers
> +not running a correctly configured ntp daemon, or with a clock so far
> +out of whack that ntp refuses to fix it.
> Just wondering how this reacts when standby and master are in
> different timezones.  I know of two ways to measure WAL lag: one is
> what you are doing, using a timestamp and relying on the two servers
> being in sync at clock level.  The second is in bytes, as a WAL
> quantity, say max_wal_lag or similar, expressed as the number of WAL
> segments the standby may lag behind; it is less user-friendly to set
> up, but the concerns regarding clock sync across nodes go away.  To
> put it short, I find the timestamp approach easy to set up and
> understand for the user, but not really solid, as it depends heavily
> on the clock state of different hosts, while a difference between the
> flush and apply LSN positions is a quite independent concept.  So
> basically what would be used as the base comparison is not the
> timestamp of the transaction commit but the flush LSN at the moment
> the commit has been replayed.
>
> +    /*
> +     * If a causal_reads_timeout is configured, it is used instead of
> +     * wal_sender_timeout.  Ideally we'd use causal_reads_timeout / 2 +
> +     * allowance for network latency, but since walreceiver can become quite
> +     * bogged down fsyncing WAL we allow more tolerance.  (This could be
> +     * tightened up once standbys hand writing off to the WAL writer).
> +     */
> +    if (causal_reads_timeout != 0)
> +        allowed_time = causal_reads_timeout;
> +    else
> +        allowed_time = wal_sender_timeout;
> I find that surprising, for two reasons:
> 1) It seems to me that causal_reads_timeout conceptually has no
> relation to WAL sender process control.
> 2) A standby should still be able to receive WAL even if it cannot
> satisfy causal reads, to give it a chance to catch up more quickly on
> the amount it is lagging behind.
>
> -    SYNCHRONOUS_COMMIT_REMOTE_FLUSH     /* wait for local and remote flush */
> +    SYNCHRONOUS_COMMIT_REMOTE_FLUSH,    /* wait for local and remote flush */
> +    SYNCHRONOUS_COMMIT_REMOTE_APPLY,    /* wait for local flush and remote
> +                                         * apply */
> +    SYNCHRONOUS_COMMIT_CONSISTENT_APPLY /* wait for local flusha and remote
> +                                           apply with causal consistency */
> SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a
> typo: s/flusha/flush a/.
>
> I am still digging into the patch; the available/joining/unavailable
> logic is quite interesting.
> --
> Michael
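As an aside on the byte-based approach: the gap between flush and apply
positions is already visible in pg_stat_replication on the primary.  A
rough illustrative query (using the 9.5 column names; this is only an
illustration of the concept, not something the patch adds):

    -- Approximate apply lag in bytes for each standby, seen from the primary.
    SELECT application_name,
           pg_xlog_location_diff(flush_location, replay_location) AS apply_lag_bytes
    FROM pg_stat_replication;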
--
Thomas Munro
http://www.enterprisedb.com