Thread: SERIALIZABLE on standby servers
Hi hackers,

Here is an experimental WIP patch to allow SERIALIZABLE READ ONLY DEFERRABLE transactions on standby servers without serialisation anomalies, based loosely on an old email from Kevin Grittner[1]. I'm not sure how far this is from what he had in mind or whether I've misunderstood something fundamental here, but I hope this can at least serve as a starting point and we can try to get something into Postgres 10.

The patch works by teaching the standby how to do something similar to what SERIALIZABLE READ ONLY DEFERRABLE does on the primary server, with some help from the primary server in the form of extra information in the WAL.

The basic idea is: the standby should wait until a point in the transaction stream where it can take a snapshot and either (1) there were no concurrent read/write SERIALIZABLE transactions running on the primary, or (2) the last concurrent read/write SERIALIZABLE transaction at snapshot time has now ended without creating a dangerous cycle with our transaction.

In case (1), the primary sometimes adds an extra xl_xact_snapshot_safety struct to commit messages which says 'a snapshot taken after this commit and before any future SSI commits is safe, because there are no concurrent read/write SSI transactions at this moment'.

In case (2), the xl_xact_snapshot_safety struct embedded in a commit record instead says 'a snapshot taken after this commit and before any future SSI commits is of unknown safety, because there are concurrent transactions; I'll tell you when it has been determined; please remember this token'. The token (which happens to be a CSN but that is not important) will appear in a future independent snapshot safety message which says whether a snapshot taken at that time is safe or unsafe.

Note that xl_xact_snapshot_safety is embedded in the commit messages (for SSI transactions only), but also appears as its own WAL record to report the final state of a token from an earlier commit. So if you do a lot of non-overlapping writable SSI transactions, you'll get just a few extra bytes in each commit record, but overlapping transactions will generate a stream of extra snapshot safety messages, one for each commit involved.

In order to generate follow-up snapshot safety messages, the patch creates 'hypothetical' transactions on the primary whenever a writable SSI transaction commits, so that it can figure out whether such a transaction would conflict. These phantom transactions are proxies for any transaction that may be created at the same point in the transaction stream (with respect to SSI commits) on any standby, and survive in memory just until they are found to be safe or unsafe.

Example of use:

T1 on primary: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
T1 on primary: INSERT INTO foo VALUES ('x');
T2 on primary: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
T2 on primary: INSERT INTO foo VALUES ('x');
T2 on primary: COMMIT;
T3 on standby: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
T3 on standby: SELECT * FROM foo;
T3 on standby: <...waits...>
T1 on primary: COMMIT;
T3 on standby: <...continues...>

Not tested much and certainly has bugs and many details to sort out, but first... is this sound, or could it be made so? Is there a better way?

[1] https://www.postgresql.org/message-id/4D3735E30200002500039869%40gw.wicourts.gov

--
Thomas Munro
http://www.enterprisedb.com
Attachment
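For illustration, here is a rough C sketch of the WAL payload described in the message above. The real definitions are in the attached patch; apart from xl_xact_snapshot_safety itself and the SNAPSHOT_SAFE / SNAPSHOT_SAFETY_UNKNOWN states referred to later in the thread, the type names, field names and layout are assumptions (SNAPSHOT_UNSAFE is an assumed name for the negative resolution).

    /* Hypothetical sketch -- not the patch's actual definitions. */
    #include <stdint.h>

    typedef uint64_t SnapshotSafetyToken;   /* a CSN in the patch, but opaque here */

    typedef enum SnapshotSafety
    {
        SNAPSHOT_SAFE,              /* no concurrent read/write SSI transactions */
        SNAPSHOT_SAFETY_UNKNOWN,    /* resolution comes later, keyed by token */
        SNAPSHOT_UNSAFE             /* assumed name: later resolution saying "unsafe" */
    } SnapshotSafety;

    /*
     * Embedded in the commit records of SSI transactions (cases 1 and 2 above),
     * and also emitted later as a standalone record to resolve the token of an
     * earlier SNAPSHOT_SAFETY_UNKNOWN commit to safe or unsafe.
     */
    typedef struct xl_xact_snapshot_safety
    {
        SnapshotSafety      safety;
        SnapshotSafetyToken token;
    } xl_xact_snapshot_safety;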
On Tue, Nov 8, 2016 at 5:56 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here is an experimental WIP patch to allow SERIALIZABLE READ ONLY
> DEFERRABLE transactions on standby servers without serialisation
> anomalies, based loosely on an old email from Kevin Grittner[1]. I'm
> not sure how far this is from what he had in mind or whether I've
> misunderstood something fundamental here, but I hope this can at least
> serve as a starting point and we can try to get something into
> Postgres 10.

While out walking I realised what was wrong with that. It's going to take me a while to find the time to get back to this, so I figured I should share this realisation in case anyone else is interested in the topic.

The problem is that it determines snapshot safety in PreCommit_CheckForSerializationFailure, and then races other backends to XactLogCommitRecord. It could determine that a hypothetical snapshot taken after this commit is safe, but then other activity resulting in a hypothetical snapshot of unknown safety could happen and be logged before we record our determination in the log.

One solution could be to serialise XactLogCommitRecord for SSI transactions using SerializableXactHashLock, and determine hypothetical snapshot safety at the same time, so that commit replay order matches safety determination order. But it would suck to add another point of lock contention to SSI commits.

Another solution could be to have recovery on the standby detect tokens (CSNs incremented by PreCommit_CheckForSerializationFailure) arriving out of order, but I don't know what exactly it should do about that when it is detected: you shouldn't respect an out-of-order claim of safety, but then what should you wait for? Perhaps if the last replayed commit record before that was marked SNAPSHOT_SAFE then it's OK to leave it that way, and if it was marked SNAPSHOT_SAFETY_UNKNOWN then you have to wait for that one to be resolved by a follow-up snapshot safety message and then rinse and repeat (take a new snapshot etc). I think that might work, but it seems strange to allow random races on the primary to create extra delays on the standby. Perhaps there is some much simpler way to do all this that I'm missing.

Another detail is that standbys that start up from a checkpoint and don't see any SSI transactions commit don't yet have any snapshot safety information, but defaulting to assuming that this point is safe doesn't seem right, so I suspect it needs to be in checkpoints.

Attached is a tidied up version which doesn't try to address the above problems yet. When time permits I'll come back to this.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
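The tentative replay-side rule from the message above ("don't respect an out-of-order claim of safety; if the state in effect is SNAPSHOT_SAFETY_UNKNOWN, keep waiting for that token's resolution") could be modelled roughly as follows. This is a simplified sketch that reuses the types from the earlier sketch, keeps a single global state, and uses made-up names; it is not code from the patch.

    /* Safety state as seen by recovery on the standby (simplified model). */
    static SnapshotSafety       standby_safety = SNAPSHOT_SAFETY_UNKNOWN;
    static SnapshotSafetyToken  highest_token_seen = 0;

    static void
    ApplySnapshotSafetyRecord(SnapshotSafetyToken token, SnapshotSafety safety)
    {
        /*
         * An out-of-order claim of safety must not be trusted: leave the
         * current state alone and wait for the in-order resolution instead.
         */
        if (token < highest_token_seen && safety == SNAPSHOT_SAFE)
            return;

        if (token > highest_token_seen)
            highest_token_seen = token;

        /* DEFERRABLE backends waiting for a safe point would be woken here. */
        standby_safety = safety;
    }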
On Wed, Nov 16, 2016 at 9:26 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Nov 8, 2016 at 5:56 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> [..] Another solution
> could be to have recovery on the standby detect tokens (CSNs
> incremented by PreCommit_CheckForSerializationFailure) arriving out of
> order, but I don't know what exactly it should do about that when it
> is detected: you shouldn't respect an out-of-order claim of safety,
> but then what should you wait for? Perhaps if the last replayed
> commit record before that was marked SNAPSHOT_SAFE then it's OK to
> leave it that way, and if it was marked SNAPSHOT_SAFETY_UNKNOWN then
> you have to wait for that one to be resolved by a follow-up snapshot
> safety message and then rinse and repeat (take a new snapshot etc). I
> think that might work, but it seems strange to allow random races on
> the primary to create extra delays on the standby. Perhaps there is
> some much simpler way to do all this that I'm missing.
>
> Another detail is that standbys that start up from a checkpoint and
> don't see any SSI transactions commit don't yet have any snapshot
> safety information, but defaulting to assuming that this point is safe
> doesn't seem right, so I suspect it needs to be in checkpoints.
>
> Attached is a tidied up version which doesn't try to address the above
> problems yet. When time permits I'll come back to this.

I haven't looked at this again yet but a nearby thread reminded me of another problem with this which I wanted to restate explicitly here in the context of this patch. Even without replication in the picture, there is a race to reach ProcArrayEndTransaction() after RecordTransactionCommit() runs, which means that the DO history (normal primary server) and REDO history (recovery) don't always agree on the order that transactions become visible. With this patch, this kind of diverging DO and REDO could allow undetectable read only serialization anomalies. I think that ProcArrayEndTransaction() and RecordTransactionCommit() need to be made atomic in the simple case so that DO and REDO agree.

Synchronous replication can make that more likely and it seems like some other approach is probably needed to delay visibility of not-yet-durable transactions while keeping the order that transactions become visible the same on all nodes.

Aside from the problems I mentioned in my earlier message (race between snapshot safety decision and logging order, and lack of checkpointing of snapshot safety information), it seems like the two DO vs REDO problems (race to ProcArrayEndTransaction, and deliberately delayed visibility in syncrep) also need to be addressed before SERIALIZABLE DEFERRABLE on standbys could make a watertight guarantee.

--
Thomas Munro
http://www.enterprisedb.com
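A concrete interleaving helps to see the DO vs REDO divergence described above. The following is a standalone model, not PostgreSQL code; only the two function names mentioned in the message are real.

    /*
     * Nothing forces ProcArrayEndTransaction() calls to happen in the same
     * order as the commit records written by RecordTransactionCommit(), so
     * this interleaving is possible on the primary:
     *
     *   backend A: RecordTransactionCommit()   -- A's commit record is first in WAL
     *   backend B: RecordTransactionCommit()   -- B's commit record is second
     *   backend B: ProcArrayEndTransaction()   -- B becomes visible first (DO order)
     *   backend A: ProcArrayEndTransaction()   -- A becomes visible second
     */
    #include <stdio.h>

    int
    main(void)
    {
        const char *do_order[] = {"B", "A"};    /* visibility order on the primary */
        const char *redo_order[] = {"A", "B"};  /* commit record order in the WAL */

        /* A snapshot taken between the middle two steps above: */
        printf("primary sees committed: {%s}\n", do_order[0]);
        /* A standby snapshot taken after replaying only the first commit record: */
        printf("standby sees committed: {%s}\n", redo_order[0]);
        return 0;
    }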
On 7 November 2016 at 23:56, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> The patch works by teaching the standby how to do something similar
> to what SERIALIZABLE READ ONLY DEFERRABLE does on the primary server,
> with some help from the primary server in the form of extra
> information in the WAL.

This is going in a direction I'm interested in, though I had some independent thoughts.

> The basic idea is: the standby should wait until a point in the
> transaction stream where it can take a snapshot and either (1) there
> were no concurrent read/write SERIALIZABLE transactions running on the
> primary, or (2) the last concurrent read/write SERIALIZABLE
> transaction at snapshot time has now ended without creating a
> dangerous cycle with our transaction.
>
> In case (1), the primary sometimes adds an extra
> xl_xact_snapshot_safety struct to commit messages which says 'a
> snapshot taken after this commit and before any future SSI commits is
> safe, because there are no concurrent read/write SSI transactions at
> this moment'.

That has the problem that we need to decorate every commit. I'd be looking for a way to make the default case the safe path, so we add little new data to WAL.

ISTM possible to insert a wait prior to the commit record when we detect that our commit would cause a pivot failure, so we actively try to avoid unsafe orderings. The purpose of that is to (1) ensure that the WAL ordering is a Serializable ordering in most cases. The wait would have a timeout for the cases where we can't determine safety. But it also has purpose (2): to reduce the number of serialization failures on the primary. Once we know that the WAL sequence is serializable we can use that information for both physical and logical replication.

> In case (2), the xl_xact_snapshot_safety struct embedded in a commit
> record instead says 'a snapshot taken after this commit and before any
> future SSI commits is of unknown safety, because there are concurrent
> transactions; I'll tell you when it has been determined; please
> remember this token'. The token (which happens to be a CSN but that
> is not important) will appear in a future independent snapshot safety
> message which says whether a snapshot taken at that time is safe or
> unsafe.
>
> Note that xl_xact_snapshot_safety is embedded in the commit messages
> (for SSI transactions only), but also appears as its own WAL record to
> report the final state of a token from an earlier commit. So if you
> do a lot of non-overlapping writable SSI transactions, you'll get just
> a few extra bytes in each commit record, but overlapping transactions
> will generate a stream of extra snapshot safety messages, one for each
> commit involved.

I would want to add the Serializable ordering info, not just "it is safe".

I think we'd need to add a timeout to the deferrable snapshot request, because it could be hours.

> In order to generate follow-up snapshot safety messages, the patch
> creates 'hypothetical' transactions on the primary whenever a
> writable SSI transaction commits, so that it can figure out whether
> such a transaction would conflict. These phantom transactions are
> proxies for any transaction that may be created at the same point in
> the transaction stream (with respect to SSI commits) on any standby,
> and survive in memory just until they are found to be safe or unsafe.

That needs a lot more explanation to determine the soundness of that theory.
README++

I think we need some isolationtester tests that generate various orderings, and then standby tests to check that it works.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 19 January 2017 at 16:16, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 16, 2016 at 9:26 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Tue, Nov 8, 2016 at 5:56 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>> [..] Another solution
>> could be to have recovery on the standby detect tokens (CSNs
>> incremented by PreCommit_CheckForSerializationFailure) arriving out of
>> order, but I don't know what exactly it should do about that when it
>> is detected: you shouldn't respect an out-of-order claim of safety,
>> but then what should you wait for? Perhaps if the last replayed
>> commit record before that was marked SNAPSHOT_SAFE then it's OK to
>> leave it that way, and if it was marked SNAPSHOT_SAFETY_UNKNOWN then
>> you have to wait for that one to be resolved by a follow-up snapshot
>> safety message and then rinse and repeat (take a new snapshot etc). I
>> think that might work, but it seems strange to allow random races on
>> the primary to create extra delays on the standby. Perhaps there is
>> some much simpler way to do all this that I'm missing.
>>
>> Another detail is that standbys that start up from a checkpoint and
>> don't see any SSI transactions commit don't yet have any snapshot
>> safety information, but defaulting to assuming that this point is safe
>> doesn't seem right, so I suspect it needs to be in checkpoints.
>>
>> Attached is a tidied up version which doesn't try to address the above
>> problems yet. When time permits I'll come back to this.
>
> I haven't looked at this again yet but a nearby thread reminded me of
> another problem with this which I wanted to restate explicitly here in
> the context of this patch. Even without replication in the picture,
> there is a race to reach ProcArrayEndTransaction() after
> RecordTransactionCommit() runs, which means that the DO history
> (normal primary server) and REDO history (recovery) don't always agree
> on the order that transactions become visible. With this patch, this
> kind of diverging DO and REDO could allow undetectable read only
> serialization anomalies. I think that ProcArrayEndTransaction() and
> RecordTransactionCommit() need to be made atomic in the simple case so
> that DO and REDO agree.

Not atomic, we just need to make ProcArrayEndTransaction() apply changes in the order of commits.

I think that is more easily possible by reusing the SyncRepWaitForLSN() code, since that already orders things by LSN.

So make all committers wait and then get WALwriter to wake people after ProcArrayEndTransaction() has been applied.

> Synchronous replication can make that more
> likely and it seems like some other approach is probably needed to
> delay visibility of not-yet-durable transactions while keeping the
> order that transactions become visible the same on all nodes.
> Aside from the problems I mentioned in my earlier message (race
> between snapshot safety decision and logging order, and lack of
> checkpointing of snapshot safety information), it seems like the two
> DO vs REDO problems (race to ProcArrayEndTransaction, and deliberately
> delayed visibility in syncrep) also need to be addressed before
> SERIALIZABLE DEFERRABLE on standbys could make a watertight
> guarantee.

While the difference in ordering is there, it would be useful to show how that allows serializable anomalies iff the WAL ordering is already known to be serializable.
Mixing robustness modes with access to the same data is avoidable by design, though we could have a parameter (or parameters) to prevent that if desirable, but only to prevent documented problems.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 16, 2017 at 5:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I haven't looked at this again yet but a nearby thread reminded me of
>> another problem with this which I wanted to restate explicitly here in
>> the context of this patch. Even without replication in the picture,
>> there is a race to reach ProcArrayEndTransaction() after
>> RecordTransactionCommit() runs, which means that the DO history
>> (normal primary server) and REDO history (recovery) don't always agree
>> on the order that transactions become visible. With this patch, this
>> kind of diverging DO and REDO could allow undetectable read only
>> serialization anomalies. I think that ProcArrayEndTransaction() and
>> RecordTransactionCommit() need to be made atomic in the simple case so
>> that DO and REDO agree.
>
> Not atomic, we just need to make ProcArrayEndTransaction() apply
> changes in the order of commits.
>
> I think that is more easily possible by reusing the
> SyncRepWaitForLSN() code, since that already orders things by LSN.
>
> So make all committers wait and then get WALwriter to wake people
> after ProcArrayEndTransaction() has been applied.

That doesn't solve any problem we have. Making committers wait after ProcArrayEndTransaction() wouldn't keep the effects of those transactions from being visible to other transactions in the system. And that's precisely the issue: on the primary, transactions become visible to other transactions in the order in which they perform ProcArrayEndTransaction(), but on the standby, they become visible in the order in which they RecordTransactionCommit(). That difference is a source of serialization anomalies.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Kevin, all,

/me pokes ancient thread

I haven't done any more work on the problems mentioned above, but I ran into Kevin at PostgresOpen in San Francisco and he said he might have some time to look at this problem. So, here is a long overdue rebase of the WIP patch. It shows a first order approximation of DEFERRABLE working on a standby (for example, see the sequence from the first message in the thread[1]).

I'll add it to the next Commitfest so I know when to rebase it.

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D2b9TV%2BvJ4UeSBixDrW7VUiTjxPwWq8K3QwFSWx0pTXHQ%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Sat, Sep 22, 2018 at 12:28 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I'll add it to the next
> Commitfest so I know when to rebase it.

And cfbot immediately showed that this assertion in OldSerXidSetActiveSerXmin() could fail in the isolation tests:

    Assert(!TransactionIdIsValid(oldSerXidControl->tailXid) ||
           TransactionIdFollows(xid, oldSerXidControl->tailXid));

Not sure how that ever worked or if I screwed something up while rebasing, but the quick (and possibly wrong?) solution I found was to exclude hypothetical SERIALIZABLEXACTs when scanning for the new oldest xid:

@@ -3181,6 +3322,7 @@ SetNewSxactGlobalXmin(void)
     for (sxact = FirstPredXact(); sxact != NULL; sxact = NextPredXact(sxact))
     {
         if (!SxactIsRolledBack(sxact)
+            && !SxactIsHypothetical(sxact)
             && !SxactIsCommitted(sxact)
             && sxact != OldCommittedSxact)

Here's a version like that in the meantime so that the CI passes. When I have more time for this project, the real solution might be to give them their own xid (right now they "borrow" one; see PreCommit_CheckForSerializationFailure()... now that I think about it, that must be wrong).

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Fri, Sep 21, 2018 at 7:29 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I'll add it to the next Commitfest so I know when to rebase it.

I signed up as reviewer in that CF.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/
> On Fri, Sep 21, 2018 at 2:29 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> Hi Kevin, all,
>
> /me pokes ancient thread

That amazing feeling of being like Indiana Jones, wondering whether it's worth it to touch another ancient artifact.

On Tue, Sep 25, 2018 at 4:51 PM Kevin Grittner <kgrittn@gmail.com> wrote:
>
> On Fri, Sep 21, 2018 at 7:29 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>
> > I'll add it to the next Commitfest so I know when to rebase it.
>
> I signed up as reviewer in that CF.

Great! Could you go ahead and provide a review?
On Fri, Nov 30, 2018 at 3:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Tue, Sep 25, 2018 at 4:51 PM Kevin Grittner <kgrittn@gmail.com> wrote:
> > On Fri, Sep 21, 2018 at 7:29 AM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > I'll add it to the next Commitfest so I know when to rebase it.
> >
> > I signed up as reviewer in that CF.
>
> Great! Could you go ahead and provide a review?

Just to be clear, although this patch is registered in the commitfest and currently applies and tests pass, it is prototype/WIP code with significant problems that remain to be resolved. I sort of wish there were a way to indicate that in the CF but since there isn't, I'm saying that here. What I hope to get from Kevin, Simon or other reviewers is some feedback on the general approach and problems discussed upthread (and other problems and better ideas I might have missed). So it's not seriously proposed for commit in this CF. (Unlike the nearby "SERIALIZABLE with parallel query" thread, which is about ready to land by all accounts, pending a round of perf testing if a decent test can be selected.)

--
Thomas Munro
http://www.enterprisedb.com
On Fri, Dec 28, 2018 at 02:21:44PM +1300, Thomas Munro wrote:
> Just to be clear, although this patch is registered in the commitfest
> and currently applies and tests pass, it is prototype/WIP code with
> significant problems that remain to be resolved. I sort of wish there
> were a way to indicate that in the CF but since there isn't, I'm
> saying that here. What I hope to get from Kevin, Simon or other
> reviewers is some feedback on the general approach and problems
> discussed upthread (and other problems and better ideas I might have
> missed). So it's not seriously proposed for commit in this CF.

No feedback has actually come, so I have moved it to next CF.

--
Michael
Attachment
On Mon, Jun 15, 2020 at 5:00 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Dec 28, 2018 at 02:21:44PM +1300, Thomas Munro wrote:
> > Just to be clear, although this patch is registered in the commitfest
> > and currently applies and tests pass, it is prototype/WIP code with
> > significant problems that remain to be resolved. I sort of wish there
> > were a way to indicate that in the CF but since there isn't, I'm
> > saying that here. What I hope to get from Kevin, Simon or other
> > reviewers is some feedback on the general approach and problems
> > discussed upthread (and other problems and better ideas I might have
> > missed). So it's not seriously proposed for commit in this CF.
>
> No feedback has actually come, so I have moved it to next CF.

Having been nerd-sniped by SSI again, I spent some time this weekend rebasing this old patch, making a few improvements, and reformulating the problems to be solved as I see them. It's very roughly based on Kevin Grittner and Dan Ports' description of how you could give SERIALIZABLE a useful meaning on hot standbys. The short version of the theory is that you can make it work like SERIALIZABLE READ ONLY DEFERRABLE by adding a bit of extra information into the WAL stream (see the sketch after this message).

Problems:

1. As a prerequisite, we'd need to teach primary servers to make transactions visible in the same order that they log commits. Otherwise, we permit nonsense like seeing TX1 but not TX2 on the primary, and TX2 but not TX1 on the replica. You can probably argue that our read replicas don't satisfy the lower isolation levels, let alone serializable.

2. Similarly, it's probably not OK that PreCommit_CheckForSerializationFailure() determines MySerializableXact->snapshotSafetyAfterThisCommit. That may not happen in exactly the same order as commits are logged. Or maybe there is some argument for why that is OK, based on what we're doing with prepareSeqNo, or maybe we can do something with that to detect disorder.

3. The patch doesn't yet attempt to checkpoint the snapshot safety state. That's needed to start up in a sane state, without having to wait for WAL activity.

4. XactLogSnapshotSafetyRecord() flushes the WAL an extra time after a commit is flushed, which I put in for testing; that's silly... somehow it needs to be better integrated so we don't generate two sync I/Os in a row.

5. You probably want a way to turn off the extra WAL records and SERIALIZABLEXACT consumption if you're using SERIALIZABLE on a primary but not on the standby. Or maybe there is some way to make it come on automatically.

I think I have cleared up the matter of xmin tracking for "hypothetical" SERIALIZABLEXACTs mentioned earlier. It's not needed, so it should be set to InvalidTransactionId, and I added a comment to explain why.

I also wrote a TAP test to exercise this thing. It is the same schedule as src/test/isolation/specs/read-only-anomaly-3.spec, except that transaction 3 runs on a streaming replica.

One thing to point out is that this patch only aims to make it so that streaming replicas can't observe a state that would have caused a transaction to abort if it had been observed on the primary. The TAP test still has to insert its own wait-for-LSN loop to make sure step "s1c" is replayed before "s3r" runs. We could use synchronous_commit=remote_apply, and that'd probably work just as well for this particular test, but I'm not sure how to square that with fixing problem #1 above.

The Perl hackery I used to do overlapping transactions in a TAP test is pretty crufty.
I guess we'd ideally have the isolationtester support per-session connection strings, and somehow get some Perl code to orchestrate the cluster setup but then run the real isolationtester. Or something like that.
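To summarize the standby-side behaviour sketched in the message above, here is a rough model of how a backend running a SERIALIZABLE READ ONLY DEFERRABLE transaction on a standby might wait, analogous to what GetSafeSnapshot() does on a primary. Everything here except GetSafeSnapshot() and the safety states named earlier in the thread is a made-up name; this is not code from the patch.

    /* Loop until we hold a snapshot taken at a point whose replayed safety
     * state is known to be safe (compare GetSafeSnapshot() on the primary). */
    for (;;)
    {
        SnapshotSafety safety;

        /* Take a snapshot and note the safety state most recently replayed
         * at that point in the WAL (hypothetical helper). */
        safety = TakeDeferrableSnapshotOnStandby();

        /* If safety is still undetermined, sleep until recovery replays the
         * follow-up record that resolves this point's token. */
        while (safety == SNAPSHOT_SAFETY_UNKNOWN)
            safety = WaitForReplayedSafetyResolution();     /* hypothetical */

        if (safety == SNAPSHOT_SAFE)
            break;              /* safe: run the read-only queries with it */

        /* Resolved unsafe: throw the snapshot away and try a later point. */
        DiscardDeferrableSnapshot();                        /* hypothetical */
    }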