Re: [HACKERS] Causal reads take II - Mailing list pgsql-hackers
From | Thomas Munro
---|---
Subject | Re: [HACKERS] Causal reads take II
Date |
Msg-id | CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com
In response to | Re: [HACKERS] Causal reads take II (Simon Riggs <simon@2ndquadrant.com>)
Responses | Re: [HACKERS] Causal reads take II
List | pgsql-hackers
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this
> specific scheme, though I can see doubts from myself and Heikki at
> least in the URLs. I maintain those doubts as to whether this is the
> right way forwards.

One technical problem was raised in the earlier thread by Ants Aasma that I concede may be fatal to this design (see the note about read-follows-read below), but I'm not sure. All the other discussion seemed to be about trade-offs between writer-waits and reader-waits schemes, both of which I still view as reasonable options for an end user to have in the toolbox.

Your initial reaction was:

> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait.

I agree with you 100%, up to the comma. The difficulty is identifying which transactions are part of a write-read sequence. An application-managed LSN tracking system allows waits to occur strictly in reads that are part of a write-read sequence, because the application links them explicitly, and nobody is arguing that we shouldn't support that for hard-working expert users. But to support applications that don't deal with LSNs (or some kind of "causality tokens") explicitly, I think we'll finish up having to make readers wait for incidental later transactions too, not just the write that your read depends on, as I'll show below. When you can't identify write-read sequences perfectly, it comes down to a choice between taxing writers or taxing readers, and I'm not sure that one approach is universally better.

Let me summarise my understanding of that trade-off. I'm going to use this terminology:

synchronous replay = my proposal: ask the primary server to wait until standbys have applied tx1, a bit like 9.6 synchronous_commit = remote_apply, but with a system of leases for graceful failure.

causality tokens = the approach Heikki mentioned: provide a way for tx1 to report its commit LSN to the client, then provide a way for a client to wait for that LSN to be replayed before taking a snapshot for tx2.

tx1, tx2 = a pair of transactions with a causal dependency; we want tx2 to see tx1 because tx1 caused tx2 in some sense, so it must be seen to precede it.

A poor man's causality token system can be cobbled together today with pg_current_wal_lsn() and a polling loop that checks pg_last_wal_replay_lsn(). It's a fairly obvious thing to want to do. Several people including Heikki Linnakangas, Craig Ringer, Ants Aasma, Ivan Kartyshov and probably many others have discussed better ways to do that[1], and a patch for the wait-for-LSN piece appeared in a recent commitfest[2]. I reviewed Ivan's patch and voted -1 only because it didn't work for higher isolation levels. If he continues to develop that I will be happy to review and help get it into committable shape, and if he doesn't I may try to develop it myself. In short, I think this is a good tool to have in the toolbox and PostgreSQL 11 should have it! But I don't think it necessarily invalidates my synchronous replay proposal: they represent different sets of trade-offs and might appeal to different users.
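To make the poor man's approach above concrete, here is a minimal sketch, assuming Python with psycopg2 and two open connections (one to the primary, one to a standby); the helper names, the example UPDATE and the 10ms poll interval are mine, purely for illustration, not part of any patch:

```python
import time

import psycopg2  # assumed client library; any driver would do


def commit_and_get_token(primary_conn):
    """Run tx1 on the primary and capture a causality token for it."""
    with primary_conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    primary_conn.commit()
    # The current WAL insert position is at or after tx1's commit record,
    # so it serves as a conservative token for tx1.
    with primary_conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")
        return cur.fetchone()[0]


def wait_for_lsn(standby_conn, token, poll_interval=0.01):
    """Poll until the standby has replayed WAL at least as far as token."""
    with standby_conn.cursor() as cur:
        while True:
            cur.execute("SELECT pg_last_wal_replay_lsn() >= %s::pg_lsn",
                        (token,))
            if cur.fetchone()[0]:
                return
            standby_conn.rollback()  # close the transaction between polls
            time.sleep(poll_interval)


# Assuming READ COMMITTED, tx2's snapshot is taken after the wait, so it
# is guaranteed to see tx1's effects:
#   token = commit_and_get_token(primary_conn)
#   wait_for_lsn(standby_conn, token)
#   ... run tx2's queries on standby_conn ...
```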
Here's why they represent different trade-offs. To actually use a causality token system you either need a carefully designed application that keeps track of causal dependencies and tokens, in which case the developer works harder but can benefit from an asynchronous pipelining effect (by the time tx2 runs we hope that tx1 has been applied, so neither transaction has to wait), or you need the tracking done for you transparently. Let's call the first option "application-managed causality tokens". That's great for those users, and we should make that possible, but most users don't want to write code like that. So I see approximately three choices for transparent middleware (or support built into standbys), which I'll name and describe as follows:

1. "Panoptic" middleware: sees all queries, so it can observe commit LSNs and inject wait directives into all following read-only transactions. Since all queries now need to pass through a single process, you have a new bottleneck, an extra network hop, and a single point of failure, so you'll probably want a failover system with split-brain defences.

2. "Hybrid" middleware: the middleware (or the standby itself) listens to the replication stream so that it knows about commit LSNs (rather than spying on commits). The primary server waits until all connected middleware instances acknowledge commit LSNs before it releases commits, and then the middleware inserts wait-for-LSN directives into read-only transactions. Now there is no single-point-of-failure problem, but writers are impacted too. I mention this design because I believe it is conceptually similar to how Galera wsrep_sync_wait (AKA wsrep_causal_reads) works. (I call this "hybrid" because it splits the waiting between tx1 and tx2. Since it's synchronous, dealing with failure gracefully is tricky, probably needing a lease system like SR's. I acknowledge that comparisons between our streaming replication and Galera are slightly bogus, because Galera is a synchronous multi-master system.)

3. "Forward-only" middleware (insert better name): the middleware (or the standby itself) asks the primary server for the latest committed LSN at the start of every transaction, and then tells the standby to wait for that LSN to be applied.

There are probably some other schemes involving communicating middleware instances, but I don't think they'll be better in ways that matter; am I wrong? Here's a trade-off table:

| | SR | AT | PT | HT | FT |
|---|---|---|---|---|---|
| tx1 commit waits? | yes | no | no | yes | no |
| tx2 snapshot waits? | no | yes | yes | yes | yes |
| tx2 waits for incidental transactions? | no | no | yes | yes | yes |
| tx2 has round-trip to primary? | no | no | no | no | yes |
| can tx2's wait be pipelined? | n/a | yes | no* | no* | no* |

SR = synchronous replay
AT = application-managed causality tokens
PT = panoptic middleware-managed causality tokens
HT = hybrid middleware-managed or standby-managed causality tokens
FT = forward-only middleware-managed causality tokens

*Note that only synchronous replay and application-managed causality tokens track the actual causal dependency tx2->tx1. SR does it by making tx1 wait for replay so that tx2 doesn't have to wait at all, and AT does it by making tx2 wait specifically for tx1 to be applied. PT, HT and FT don't actually know anything about tx1, so they make every read query wait until *all known transactions* are applied ("incidental transactions" above), throwing away the pipelining benefits of causality token systems (hence "no*" above).
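To spell out where the "incidental transactions" tax comes from, here is a toy model of the panoptic case; everything in it (the class, the shared counter, the wait directive) is illustrative, since no such proxy exists:

```python
import threading


def lsn_to_int(lsn):
    """Convert a pg_lsn string like '0/16B3A38' into an integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


class PanopticProxy:
    """Toy model of 'panoptic' middleware: it sees every query."""

    def __init__(self):
        self.lock = threading.Lock()
        self.last_commit_lsn = "0/0"  # highest commit LSN seen on ANY session

    def observe_commit(self, commit_lsn):
        # Called whenever any client's COMMIT returns, whoever ran it.
        with self.lock:
            if lsn_to_int(commit_lsn) > lsn_to_int(self.last_commit_lsn):
                self.last_commit_lsn = commit_lsn

    def route_read_only_query(self, standby, query):
        # The proxy cannot know which earlier write (if any) this read
        # causally depends on, so it must conservatively make the standby
        # wait for ALL commits it has ever seen (the incidental
        # transactions of the table above) before running the query.
        with self.lock:
            token = self.last_commit_lsn
        standby.wait_for_lsn(token)  # e.g. the polling loop sketched earlier
        return standby.run(query)
```

The single shared last_commit_lsn is the whole point: every reader pays for every writer's commit, whether or not there is any causal link between them.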
I haven't used Galera myself, but I have heard that this is why read latency is a problem with its causal reads mode enabled: for lack of better information, you have to wait for the replication system to drain its current apply queue before every query is processed, even if tx1 of your causally dependent transaction pair was already visible on the current node.

So far I think that SR and AT are sweet spots: AT for people who are prepared to juggle causality tokens in their applications, and SR for people who want to remain oblivious to all this stuff and can tolerate a reduction in single-client write TPS. I also think AT's pipelining advantage over SR and SR's single-client TPS impact are diminished if you also choose to enable syncrep for durability, which isn't a crazy thing to want to do if you're doing anything important with your data. The other models, where all readers wait for incidental transactions, don't seem terribly attractive to me, especially if the motivating premise of load balancing with read-only replicas is (to steal a line) "ye [readers] are many, they are few".

One significant blow my proposal received in the last thread was a comment from Ants about read-follows-read[3]. What do you think? I suspect the same problem applies to causality token based systems as discussed so far (except perhaps FT, the slowest and probably least acceptable to anyone). On the other hand, I think it's at least possible to fix that problem with causality tokens. You'd have to expose and capture the last-commit LSN for every snapshot used in every single read query, and wait for it at the moment every following read query takes a new snapshot. This would make AT even harder work for developers, make PT even slower, and make HT unworkable (it only knows about commits, not reads). I also suspect that a sizeable class of applications cares a lot less about read-follows-read than read-follows-write, but I could be wrong about that.
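For what it's worth, here is a rough sketch of what that could look like for AT, sticking to functions that exist today: pg_last_wal_replay_lsn() captured on the standby that served a read is a conservative stand-in for the per-snapshot last-commit LSN that would ideally be exposed. The helper below is hypothetical and assumes READ COMMITTED, so each query's snapshot is taken after the wait:

```python
import time


def read_with_token(standby_conn, query, token=None, poll_interval=0.01):
    """Run a read-only query so that it cannot observe an older state than
    the previous read did, even if the two reads hit different standbys.

    'token' is the value returned by the previous call (None for the first
    read).  Returns the rows and a new token to pass to the next read.
    """
    with standby_conn.cursor() as cur:
        if token is not None:
            # Wait until this standby has replayed at least as far as the
            # standby that served the previous read.
            while True:
                cur.execute("SELECT pg_last_wal_replay_lsn() >= %s::pg_lsn",
                            (token,))
                if cur.fetchone()[0]:
                    break
                time.sleep(poll_interval)
        cur.execute(query)
        rows = cur.fetchall()
        # Capture a conservative token: replay has reached at least the
        # last commit visible to the snapshot we just used.
        cur.execute("SELECT pg_last_wal_replay_lsn()")
        return rows, cur.fetchone()[0]


# rows, token = read_with_token(standby_a_conn, "SELECT ...")
# rows, token = read_with_token(standby_b_conn, "SELECT ...", token)
```

Note the cost: every read pays a token capture and a potential wait, which is exactly the extra burden on AT developers mentioned above.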
> This patch presumes we will load balance writes to a master and reads
> to a pool of standbys. How will we achieve that?
>
> 1. We decorate the application with additional info to indicate
> routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write routing
>
> The explicit premise of the patch is that neither of the above options
> are practical, so I'm unclear how this makes sense. Is there some use
> case that you have in mind that has not been fully described? If so,
> lets get it on the table.

I don't think that pgpool routing is impractical, just that it's not a great place to put transparent causality token tracking, for the reasons I've explained above: you'll introduce a ton of latency for all readers, because you can't tell which earlier transactions they might be causally dependent on. I think it's also nice to be able to support the in-process connection pooling and routing that many application developers use to avoid extra hops, so it'd be nice to avoid making pooling/routing/proxy servers strictly necessary.

> What I think we need is a joined up plan for load balancing, so that
> we can understand how it will work. i.e. explain the whole use case
> and how the solution works.

Here are some ways you could set a system up:

1. Use middleware like pgpool or pgbouncer-rr to route queries automatically; this is probably limited to single-statement queries, since multi-statement queries can't be judged by their first statement alone. (Those types of systems could be taught to understand a request for a connection with causal reads enabled, and to find the current set of usable standbys by looking at the pg_stat_replication table.)

2. Use the connection pooling inside your application server or application framework/library: for example Hibernate[4], Django[5] and many other libraries offer ways to configure multiple database connection pools and route queries appropriately at a fairly high level. Such systems could probably be improved to handle 'synchronous replay not available' errors by throwing away the connection and retrying automatically on another connection, much as they do for serialization failures and deadlocks.

3. Modify your application to deal with separate connection pools directly wherever it runs database transactions.

Perhaps I'm not thinking big enough: I tried to come up with an incremental improvement to PostgreSQL that would fix a problem I know people have with their current hot standby deployments. I deliberately avoided proposing radical architectural projects such as moving cluster management, discovery, proxying, pooling and routing responsibilities into PostgreSQL. Perhaps those working on GTM-type systems, which effectively present a seamless single system, find this whole discussion to be aiming too low and dealing with the wrong problems.

> I'm especially uncomfortable with any approaches that treat all
> sessions as one pool. For me, a server should support multiple pools.
> Causality seems to be a property of a particular set of pools. e.g.
> PoolS1 supports causal reads against writes to PoolM1 but not PoolM2,
> yet PoolS2 does not provide causal reads against PoolM1 or PoolM2.

Interesting, but I don't immediately see any fundamental difficulty for any of the designs discussed. For example, maybe tx1 should be able to set synchronous_replay = <group name>, rather than just 'on', to refer to a group of standbys defined in some GUC.

Just by the way, while looking for references I found PinningMasterSlaveRouter, which provides a cute example of demand for causal reads (however implemented) in the Django community:

https://github.com/jbalogh/django-multidb-router

It usually sends read-only transactions to standbys, but keeps your web session temporarily pinned to the primary database to give you 15 seconds' worth of read-your-writes after each write transaction.

[1] https://www.postgresql.org/message-id/flat/53E2D346.9030806%402ndquadrant.com
[2] https://www.postgresql.org/message-id/flat/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
[3] https://www.postgresql.org/message-id/CAEepm%3D15WC7A9Zdj2Qbw3CUDXWHe69d%3DnBpf%2BjXui7OYXXq11w%40mail.gmail.com
[4] https://stackoverflow.com/questions/25911359/read-write-splitting-hibernate
[5] https://github.com/yandex/django_replicated (for example; several similar extensions exist)

--
Thomas Munro
http://www.enterprisedb.com