Thread: logical decoding and replication of sequences, take 2
Hi,

Here's a rebased version of the patch adding logical decoding of sequences. The previous attempt [1] ended up getting reverted, due to running into issues with the non-transactional nature of sequences when decoding the existing WAL records. See [2] for details.

This patch uses a different approach, proposed by Hannu Krosing [3], based on tracking sequences actually modified in each transaction, and then WAL-logging the state at the end.

This does work, but I'm not very happy about WAL-logging all sequences at the end. The "problem" is we have to re-read the current state of the sequence from disk, because it might be concurrently updated by another transaction. Imagine two transactions, T1 and T2:

T1: BEGIN
T1: SELECT nextval('s') FROM generate_series(1,1000)
T2: BEGIN
T2: SELECT nextval('s') FROM generate_series(1,1000)
T2: COMMIT
T1: COMMIT

The expected outcome is that the sequence value is ~2000. We must not blindly apply the increments from T1 over the changes from T2. So the patch simply reads the "current" state of the sequence at commit time. Which is annoying, because it involves I/O, increases the commit duration, etc. On the other hand, this is likely cheaper than the other approach based on WAL-logging every sequence increment (that would have to be careful about obsoleted increments too, when applying them transactionally).

I wonder if we might deal with this by simply WAL-logging the LSN of the last change for each sequence (in the given xact), which would allow discarding the "obsolete" changes quite easily, I think. nextval() would simply look at the LSN in the page header.

And maybe we could then use the LSN to read the increment from the WAL during decoding, instead of having to read it and WAL-log it during commit. Essentially, we'd run a local XLogReader. Of course, we'd have to be careful about checkpoints, not sure what to do about that.

Another idea that just occurred to me is that if we end up having to read the sequence state during commit, maybe we could at least optimize it somehow. For example we might track the LSN of the last logged state for each sequence (in shared memory or something), and the other sessions could just skip the WAL-log if their "local" LSN is <= this LSN.

regards

[1] https://www.postgresql.org/message-id/flat/d045f3c2-6cfb-06d3-5540-e63c320df8bc@enterprisedb.com
[2] https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com
[3] https://www.postgresql.org/message-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
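To make the hazard in the message above concrete, here is a tiny standalone simulation (illustrative C with made-up values, not PostgreSQL code): replaying each transaction's own increments in commit order can move the sequence backwards, while replaying the state read at commit time cannot.

    #include <stdio.h>

    int
    main(void)
    {
        long seq;

        /*
         * T1 draws values 1..1000, T2 draws values 1001..2000, and T2
         * commits first. Naively replaying each transaction's own last
         * value in commit order moves the sequence backwards:
         */
        seq = 2000;                 /* apply T2 (commits first) */
        seq = 1000;                 /* apply T1 - sequence goes backwards! */
        printf("increment replay ends at %ld (wrong)\n", seq);

        /*
         * Replaying the on-disk state read at commit time instead: both
         * transactions read ~2000 at commit, so the apply order no
         * longer matters.
         */
        seq = 2000;                 /* T2 commits, logs current state */
        seq = 2000;                 /* T1 commits, logs current state */
        printf("state-at-commit replay ends at %ld (correct)\n", seq);

        return 0;
    }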
I've been thinking about the two optimizations mentioned at the end a bit more, so let me share my thoughts before I forget them:

On 8/18/22 23:10, Tomas Vondra wrote:
>
> ...
>
> And maybe we could then use the LSN to read the increment from the WAL
> during decoding, instead of having to read it and WAL-log it during
> commit. Essentially, we'd run a local XLogReader. Of course, we'd have
> to be careful about checkpoints, not sure what to do about that.
>

I think logging just the LSN is workable. I was worried about dealing with checkpoints, because imagine you do nextval() on a sequence that was last WAL-logged a couple of checkpoints back. Then you wouldn't be able to read the LSN (when decoding), because the WAL might have been recycled. But that can't happen, because we always force WAL-logging the first time nextval() is called after a checkpoint. So we know the LSN is guaranteed to be available.

Of course, this would not reduce the number of WAL messages, because we'd still log all sequences touched by the transaction. We wouldn't need to read the state from disk, though, and we could ignore "old" stuff in decoding (with an LSN lower than the last LSN we decoded). For frequently used sequences that seems like a win.

> Another idea that just occurred to me is that if we end up having to
> read the sequence state during commit, maybe we could at least optimize
> it somehow. For example we might track the LSN of the last logged state
> for each sequence (in shared memory or something), and the other
> sessions could just skip the WAL-log if their "local" LSN is <= this
> LSN.
>

Tracking the last LSN for each sequence (in an SLRU or something) should work too, I guess. In principle this just moves the skipping of "old" increments from decoding to writing, so that we don't even have to write those into WAL. We don't even need persistence, nor to keep all the records, I think. If you don't find a record for a given sequence, assume it wasn't logged yet and just log it.

Of course, it requires a bit of shared memory for each sequence, say ~32B. Not sure about the overhead, but I'd bet if you have many (~thousands of) frequently used sequences, there'll be a lot of other overhead making this irrelevant.

Of course, if we're doing the skipping when writing the WAL, maybe we should just read the sequence state - we'd do the I/O, but only in a fraction of the transactions, and we wouldn't need to read old WAL in logical decoding.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
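A minimal standalone sketch of that last idea (all names here are stand-ins - a real implementation would keep one such entry per sequence in shared memory, not a single global): a session WAL-logs the sequence state only if its local LSN is newer than the last LSN anyone logged.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

    /* LSN of the last logged state; per-sequence shared memory in reality */
    static XLogRecPtr last_logged_lsn = 0;

    /* return true if the caller still needs to WAL-log the sequence state */
    static bool
    need_to_log(XLogRecPtr local_lsn)
    {
        if (local_lsn <= last_logged_lsn)
            return false;           /* a newer state is already logged, skip */
        last_logged_lsn = local_lsn;
        return true;
    }

    int
    main(void)
    {
        printf("log at LSN 100? %d\n", need_to_log(100));   /* 1 - log it */
        printf("log at LSN  90? %d\n", need_to_log(90));    /* 0 - skip */
        printf("log at LSN 120? %d\n", need_to_log(120));   /* 1 - log it */
        return 0;
    }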
Hi,

I noticed on cfbot that the patch no longer applies, so here's a rebased version. Most of the breakage was due to the column filtering reworks, grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

I haven't looked into the optimizations / improvements I discussed in my previous post (logging only the LSN of the last WAL-logged increment), because while fixing "make check-world" I ran into a more serious issue that I think needs to be discussed first. And I suspect it might also affect the feasibility of the LSN optimization.

So, what's the issue - the current solution is based on WAL-logging the state of all sequences incremented by the transaction at COMMIT. To do that, we read the state from disk, and write that into WAL. However, these WAL messages are not necessarily correlated to COMMIT records, so stuff like this might happen:

1. transaction T1 increments sequence S
2. transaction T2 increments sequence S
3. both T1 and T2 start to COMMIT
4. T1 reads state of S from disk, writes it into WAL
5. transaction T3 increments sequence S
6. T2 reads state of S from disk, writes it into WAL
7. T2 writes COMMIT into WAL
8. T1 writes COMMIT into WAL

Because the apply order is determined by the ordering of COMMIT records, this means we'd apply the increments logged by T2, and then by T1. But that undoes the increment by T3, and the sequence would go backwards.

The previous patch version addressed that by acquiring a lock on the sequence, holding it until transaction end. This effectively ensures the order of sequence messages and COMMIT records matches. But that's problematic for a number of reasons:

1) throughput reduction, because the COMMIT records need to serialize

2) deadlock risk, if we happen to lock sequences in different order (in different transactions)

3) problem for prepared transactions - the sequences are locked and logged in PrepareTransaction, because we may not have seqhashtab beyond that point. This is a much worse variant of (1).

Note: I also wonder what happens if someone does DISCARD SEQUENCES. I guess we'll forget the sequences, which is bad - so we'd have to invent a separate cache that does not have this issue.

I realized (3) because one of the test_decoding TAP tests got stuck exactly because of a sequence locked by a prepared transaction. This patch simply releases the lock after writing the WAL message, but that just makes it vulnerable to the reordering. And this would have been true even with the LSN optimization.

However, I was thinking that maybe we could use the LSN of the WAL message (XLOG_LOGICAL_SEQUENCE) to deal with the ordering issue, because *this* is the sensible sequence increment ordering. In the example above, we'd first apply the WAL message from T2 (because that commits first). And then we'd get to apply T1, but the WAL message has an older LSN, so we'd skip it.

But this requires us to remember the LSN of the already applied WAL sequence messages, which could be tricky - we'd need to persist it in some way because of restarts, etc. We can't do this while decoding, only on the apply side, I think, because of streaming, aborts.

The other option might be to make these messages non-transactional, in which case we'd separate the ordering from COMMIT ordering, evading the reordering problem. That'd mean we'd ignore rollbacks (which seems fine), we could probably optimize this by checking if the state actually changed, etc. But we'd also need to deal with sequences created in the (still uncommitted) transaction.
But I'm also worried it might lead to the same issue with non-transactional behaviors that forced the revert in v15.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
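Here's a minimal sketch of the LSN-based skipping described in the message above (assumed names, not the actual apply-worker code; a real version would track this per sequence and persist it across restarts, which is exactly the tricky part mentioned): the apply side remembers the LSN of the last sequence message it applied and drops anything older.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

    /* LSN of the last applied sequence message (per sequence in reality) */
    static XLogRecPtr applied_lsn = 0;

    static bool
    apply_sequence_message(XLogRecPtr msg_lsn, long value)
    {
        if (msg_lsn <= applied_lsn)
            return false;           /* older message decoded later - skip */
        applied_lsn = msg_lsn;
        printf("applied sequence value %ld (LSN %llu)\n",
               value, (unsigned long long) msg_lsn);
        return true;
    }

    int
    main(void)
    {
        /* T2 committed first, so its newer message is applied first */
        apply_sequence_message(200, 2000);  /* applied */
        apply_sequence_message(100, 1000);  /* T1's stale message - skipped */
        return 0;
    }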
On Sat, Nov 12, 2022 at 7:49 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I noticed on cfbot that the patch no longer applies, so here's a rebased
> version. Most of the breakage was due to the column filtering reworks,
> grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

(...)

Hi

Thanks for the updated patch. While reviewing the patch backlog, we have determined that this patch adds one or more TAP tests but has not added the test to the "meson.build" file. To do this, locate the relevant "meson.build" file for each test and add it in the 'tests' dictionary, which will look something like this:

    'tap': {
      'tests': [
        't/001_basic.pl',
      ],
    },

For some additional details please see this Wiki article:

https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

https://wiki.postgresql.org/wiki/Meson

Regards

Ian Barwick
On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The other option might be to make these messages non-transactional, in
> which case we'd separate the ordering from COMMIT ordering, evading the
> reordering problem.
>
> That'd mean we'd ignore rollbacks (which seems fine), we could probably
> optimize this by checking if the state actually changed, etc. But we'd
> also need to deal with sequences created in the (still uncommitted)
> transaction. But I'm also worried it might lead to the same issue with
> non-transactional behaviors that forced the revert in v15.

I think it might be a good idea to step back slightly from implementation details and try to agree on a theoretical model of what's happening here. Let's start by banishing the words transactional and non-transactional from the conversation and talk about what logical replication is trying to do.

We can imagine that the replicated objects on the primary pass through a series of states S1, S2, ..., Sn, where n keeps going up as new state changes occur. The state, for our purposes here, is the contents of the database as they could be observed by a user running SELECT queries at some moment in time chosen by the user. For instance, if the initial state of the database is S1, and then the user executes BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the state that differs from S1 in that both of those rows are now part of the database contents. There is no state where one of those rows is visible and the other is not. That was never observable by the user, except from within the transaction as it was executing, which we can and should discount.

I believe that the goal of logical replication is to bring about a state of affairs where the set of states observable on the standby is a subset of the states observable on the primary. That is, if the primary goes from S1 to S2 to S3, the standby can do the same thing, or it can go straight from S1 to S3 without ever making it possible for the user to observe S2. Either is correct behavior. But the standby cannot invent any new states that didn't occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to S3, or something like that. It can only consolidate changes that occurred separately on the primary, never split them up. Neither can it reorder them.

Now, if you accept this as a reasonable definition of correctness, then the next question is what consequences it has for transactional and non-transactional behavior. If all behavior is transactional, then we've basically got to replay each primary transaction in a single standby transaction, and commit those transactions in the same order that the corresponding primary transactions committed. We could legally choose to merge a group of transactions that committed one after the other on the primary into a single transaction on the standby, and it might even be a good idea if they're all very tiny, but it's not required. But if there are non-transactional things happening, then there are changes that become visible at some time other than at a transaction commit. For example, consider this sequence of events, in which each "thing" that happens is transactional except where the contrary is noted:

T1: BEGIN;
T2: BEGIN;
T1: Do thing 1;
T2: Do thing 2;
T1: Do a non-transactional thing;
T1: Do thing 3;
T2: Do thing 4;
T2: COMMIT;
T1: COMMIT;

From the point of view of the user, there are 4 observable states here:

S1: Initial state.
S2: State after the non-transactional thing happens.
S3: State after T2 commits (reflects the non-transactional thing plus things 2 and 4).
S4: State after T1 commits.

Basically, the non-transactional thing behaves a whole lot like a separate transaction. That non-transactional operation ought to be replicated before T2, which ought to be replicated before T1. Maybe logical replication ought to treat it in exactly that way: as a separate operation that needs to be replicated after any earlier transactions that completed prior to the history shown here, but before T2 or T1. Alternatively, you can merge the non-transactional change into T2, i.e. the first transaction that committed after it happened. But you can't merge it into T1, even though it happened in T1. If you do that, then you're creating states on the standby that never existed on the primary, which is wrong.

You could argue that this is just nitpicking: who cares if the change in the sequence value doesn't get replicated at exactly the right moment? But I don't think it's a technicality at all: I think if we don't make the operation appear to happen at the same point in the sequence as it became visible on the master, then there will be endless artifacts and corner cases to the bottom of which we will never get. Just like if we replicated the actual transactions out of order, chaos would ensue, because there can be logical dependencies between them, so too can there be logical dependencies between non-transactional operations, or between a non-transactional operation and a transactional operation.

To make it more concrete, consider two sessions concurrently running this SQL:

insert into t1 select nextval('s1') from generate_series(1,1000000) g;

There are, in effect, 2000002 transaction-like things here. The sequence gets incremented 2 million times, and then there are 2 commits that each insert a million rows. Perhaps the actual order of events looks something like this:

1. nextval the sequence N times, where N >= 1 million
2. commit the first transaction, adding a million rows to t1
3. nextval the sequence 2 million - N times
4. commit the second transaction, adding another million rows to t1

Unless we replicate all of the nextval operations that occur in step 1 at the same time or prior to replicating the first transaction in step 2, we might end up making visible a state where the next value of the sequence is less than the highest value present in the table, which would be bad.

With that perhaps overly-long set of preliminaries, I'm going to move on to talking about the implementation ideas which you mention. You write that "the current solution is based on WAL-logging the state of all sequences incremented by the transaction at COMMIT" and then, it seems to me, go on to demonstrate that it's simply incorrect. In my opinion, the fundamental problem is that it doesn't look at the order that things happened on the primary and do them in the same order on the standby. Instead, it accepts that the non-transactional operations are going to be replicated at the wrong time, and then tries to patch around the issue by attempting to scrounge up the correct values at some convenient point and use that data to compensate for our failure to do the right thing at an earlier point. That doesn't seem like a satisfying solution, and I think it will be hard to make it fully correct.

Your alternative proposal says "The other option might be to make these messages non-transactional, in which case we'd separate the ordering from COMMIT ordering, evading the reordering problem."
But I don't think that avoids the reordering problem at all. Nor do I think it's correct. I don't think you *can* separate the ordering of these operations from the COMMIT ordering. They are, as I argue here, essentially mini-commits that only bump the sequence value, and they need to be replicated after the transactions that commit prior to the sequence value bump and before those that commit afterward. If they aren't handled that way, I don't think you're going to get fully correct behavior.

I'm going to confess that I have no really specific idea how to implement that. I'm just not sufficiently familiar with this code. However, I suspect that the solution lies in changing things on the decoding side rather than in the WAL format. I feel like the information that we need in order to do the right thing must already be present in the WAL. If it weren't, then how could crash recovery work correctly, or physical replication? At any given moment, you can choose to promote a physical standby, and at that point the state you observe on the new primary had better be some state that existed on the primary at some point in its history. At any moment, you can unplug the primary, restart it, and run crash recovery, and if you do, you had better end up with some state that existed on the primary at some point shortly before the crash. I think that there are actually a few subtle inaccuracies in the last two sentences, because actually the order in which transactions become visible on a physical standby can differ from the order in which it happens on the primary, but I don't think that actually changes the picture much. The point is that the WAL is the definitive source of information about what happened and in what order it happened, and we use it in that way already in the context of physical replication, and of standbys. If logical decoding has a problem with some case that those systems handle correctly, the problem is with logical decoding, not the WAL format.

In particular, I think it's likely that the "non-transactional messages" that you mention earlier don't get applied at the point in the commit sequence where they were found in the WAL. Not sure why exactly, but perhaps the point at which we're reading WAL runs ahead of the decoding per se, or something like that, and thus those non-transactional messages arrive too early relative to the commit ordering. Possibly that could be changed, and they could be buffered until earlier commits are replicated. Or else, when we see a WAL record for a non-transactional sequence operation, we could arrange to bundle that operation into an "adjacent" replicated transaction, i.e. the transaction whose commit record occurs most nearly prior to, or most nearly after, the WAL record for the operation itself. Or else, we could create "virtual" transactions for such operations and make sure those get replayed at the right point in the commit sequence. Or else, I don't know, maybe something else. But I think the overall picture is that we need to approach the problem by replicating changes in WAL order, as a physical standby would do. Saying that a change is "nontransactional" doesn't mean that it's exempt from ordering requirements; rather, it means that that change has its own place in that ordering, distinct from the transaction in which it occurred.

--
Robert Haas
EDB: http://www.enterprisedb.com
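One way to picture the "mini-commit" ordering argued for in the message above is a sketch like the following (assumed names and a fixed-size buffer for brevity, not a proposal for the actual reorder-buffer API): non-transactional sequence records are buffered as they are decoded, and before any commit is replayed, every buffered record with an older LSN is flushed first, preserving WAL order.

    #include <stdio.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

    typedef struct SeqOp
    {
        XLogRecPtr  lsn;
        long        value;
    } SeqOp;

    static SeqOp pending[8];        /* decoded but not yet replayed seq ops */
    static int   npending = 0;

    static void
    queue_seq_op(XLogRecPtr lsn, long value)
    {
        pending[npending].lsn = lsn;
        pending[npending].value = value;
        npending++;
    }

    /* replay, ahead of this commit, all sequence ops preceding it in WAL */
    static void
    replay_commit(XLogRecPtr commit_lsn, const char *xact)
    {
        int     i, keep = 0;

        for (i = 0; i < npending; i++)
        {
            if (pending[i].lsn < commit_lsn)
                printf("  apply sequence value %ld (LSN %llu)\n",
                       pending[i].value,
                       (unsigned long long) pending[i].lsn);
            else
                pending[keep++] = pending[i];   /* belongs to a later commit */
        }
        npending = keep;
        printf("replay commit of %s (LSN %llu)\n",
               xact, (unsigned long long) commit_lsn);
    }

    int
    main(void)
    {
        /* WAL order: T1's commit, two sequence advancements, T4's commit */
        replay_commit(100, "T1");
        queue_seq_op(110, 32);      /* advancement caused by T2's nextval */
        queue_seq_op(120, 64);      /* advancement caused by T3's nextval */
        replay_commit(130, "T4");   /* both advancements are flushed first */
        return 0;
    }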
On 11/16/22 22:05, Robert Haas wrote:
> On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The other option might be to make these messages non-transactional, in
>> which case we'd separate the ordering from COMMIT ordering, evading the
>> reordering problem.
>>
>> That'd mean we'd ignore rollbacks (which seems fine), we could probably
>> optimize this by checking if the state actually changed, etc. But we'd
>> also need to deal with sequences created in the (still uncommitted)
>> transaction. But I'm also worried it might lead to the same issue with
>> non-transactional behaviors that forced the revert in v15.
>
> I think it might be a good idea to step back slightly from
> implementation details and try to agree on a theoretical model of
> what's happening here. Let's start by banishing the words
> transactional and non-transactional from the conversation and talk
> about what logical replication is trying to do.
>

OK, let's try.

> We can imagine that the replicated objects on the primary pass through
> a series of states S1, S2, ..., Sn, where n keeps going up as new
> state changes occur. The state, for our purposes here, is the contents
> of the database as they could be observed by a user running SELECT
> queries at some moment in time chosen by the user. For instance, if
> the initial state of the database is S1, and then the user executes
> BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
> state that differs from S1 in that both of those rows are now part of
> the database contents. There is no state where one of those rows is
> visible and the other is not. That was never observable by the user,
> except from within the transaction as it was executing, which we can
> and should discount. I believe that the goal of logical replication is
> to bring about a state of affairs where the set of states observable
> on the standby is a subset of the states observable on the primary.
> That is, if the primary goes from S1 to S2 to S3, the standby can do
> the same thing, or it can go straight from S1 to S3 without ever
> making it possible for the user to observe S2. Either is correct
> behavior. But the standby cannot invent any new states that didn't
> occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
> S3, or something like that. It can only consolidate changes that
> occurred separately on the primary, never split them up. Neither can
> it reorder them.
>

I mostly agree, and in a way the last patch aims to do roughly this, i.e. make sure that the state after each transaction matches the state a user might observe on the primary (modulo implementation challenges).

There's a couple of caveats, though:

1) Maybe we should focus more on "actually observed" state instead of "observable". Who cares if the sequence moved forward in a transaction that was ultimately rolled back? No committed transaction should have observed those values - in a way, the last "valid" state of the sequence is the last value generated in a transaction that ultimately committed.

2) I think what matters more is that we never generate duplicate values. That is, if you generate a value from a sequence, commit a transaction and replicate it, then the logical standby should not generate the same value from the sequence. This guarantee seems necessary for "failover" to a logical standby.
> Now, if you accept this as a reasonable definition of correctness,
> then the next question is what consequences it has for transactional
> and non-transactional behavior. If all behavior is transactional, then
> we've basically got to replay each primary transaction in a single
> standby transaction, and commit those transactions in the same order
> that the corresponding primary transactions committed. We could
> legally choose to merge a group of transactions that committed one
> after the other on the primary into a single transaction on the
> standby, and it might even be a good idea if they're all very tiny,
> but it's not required. But if there are non-transactional things
> happening, then there are changes that become visible at some time
> other than at a transaction commit. For example, consider this
> sequence of events, in which each "thing" that happens is
> transactional except where the contrary is noted:
>
> T1: BEGIN;
> T2: BEGIN;
> T1: Do thing 1;
> T2: Do thing 2;
> T1: Do a non-transactional thing;
> T1: Do thing 3;
> T2: Do thing 4;
> T2: COMMIT;
> T1: COMMIT;
>
> From the point of view of the user, there are 4 observable states here:
>
> S1: Initial state.
> S2: State after the non-transactional thing happens.
> S3: State after T2 commits (reflects the non-transactional thing plus
> things 2 and 4).
> S4: State after T1 commits.
>
> Basically, the non-transactional thing behaves a whole lot like a
> separate transaction. That non-transactional operation ought to be
> replicated before T2, which ought to be replicated before T1. Maybe
> logical replication ought to treat it in exactly that way: as a
> separate operation that needs to be replicated after any earlier
> transactions that completed prior to the history shown here, but
> before T2 or T1. Alternatively, you can merge the non-transactional
> change into T2, i.e. the first transaction that committed after it
> happened. But you can't merge it into T1, even though it happened in
> T1. If you do that, then you're creating states on the standby that
> never existed on the primary, which is wrong. You could argue that
> this is just nitpicking: who cares if the change in the sequence value
> doesn't get replicated at exactly the right moment? But I don't think
> it's a technicality at all: I think if we don't make the operation
> appear to happen at the same point in the sequence as it became
> visible on the master, then there will be endless artifacts and corner
> cases to the bottom of which we will never get. Just like if we
> replicated the actual transactions out of order, chaos would ensue,
> because there can be logical dependencies between them, so too can
> there be logical dependencies between non-transactional operations, or
> between a non-transactional operation and a transactional operation.
>

Well, yeah - we can either try to perform the stuff independently of the transactions that triggered it, or we can try making it part of some of the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the increments and apply them independently. But:

(a) What would you do with increments of sequences created/reset in a transaction? Can't apply those outside the transaction, because it might be rolled back (and that state is not visible on primary).

(b) What about increments created before we have a proper snapshot? There may be transactions dependent on the increment. This is what ultimately led to the revert of the patch.
This version of the patch tries to do the opposite thing - make sure that the state after each commit matches what the transaction might have seen (for sequences it accessed). It's imperfect, because it might log a state generated "after" the sequence got accessed - it focuses on the guarantee not to generate duplicate values.

> To make it more concrete, consider two sessions concurrently running this SQL:
>
> insert into t1 select nextval('s1') from generate_series(1,1000000) g;
>
> There are, in effect, 2000002 transaction-like things here. The
> sequence gets incremented 2 million times, and then there are 2
> commits that each insert a million rows. Perhaps the actual order of
> events looks something like this:
>
> 1. nextval the sequence N times, where N >= 1 million
> 2. commit the first transaction, adding a million rows to t1
> 3. nextval the sequence 2 million - N times
> 4. commit the second transaction, adding another million rows to t1
>
> Unless we replicate all of the nextval operations that occur in step 1
> at the same time or prior to replicating the first transaction in step
> 2, we might end up making visible a state where the next value of the
> sequence is less than the highest value present in the table, which
> would be bad.
>

Right, that's the "guarantee" I've mentioned above, more or less.

> With that perhaps overly-long set of preliminaries, I'm going to move
> on to talking about the implementation ideas which you mention. You
> write that "the current solution is based on WAL-logging the state of
> all sequences incremented by the transaction at COMMIT" and then, it
> seems to me, go on to demonstrate that it's simply incorrect. In my
> opinion, the fundamental problem is that it doesn't look at the order
> that things happened on the primary and do them in the same order on
> the standby. Instead, it accepts that the non-transactional operations
> are going to be replicated at the wrong time, and then tries to patch
> around the issue by attempting to scrounge up the correct values at
> some convenient point and use that data to compensate for our failure
> to do the right thing at an earlier point. That doesn't seem like a
> satisfying solution, and I think it will be hard to make it fully
> correct.
>

I understand what you're saying, but I'm not sure I agree with you. Yes, this would mean we accept we may end up with something like this:

1: T1 logs sequence state S1
2: someone increments sequence
3: T2 logs sequence state S2
4: T2 commits
5: T1 commits

which "inverts" the apply order of S1 vs. S2, because we first apply S2 and then the "old" S1. But as long as we're smart enough to "discard" applying S1, I think that's acceptable - because it guarantees we'll not generate duplicate values (with values in the committed transaction).

I'd also argue it does not actually generate invalid state, because once we commit either transaction, S2 is what's visible. Yes, if you do "SELECT * FROM sequence" you'll see some intermediate state, but that's not how sequences are accessed. And you can't do currval('s') from a transaction that never accessed the sequence. And if it did, we'd write S2 (or whatever it saw) as part of its commit.

So I think the main issue of this approach is how to decide which sequence states are obsolete and should be skipped.

> Your alternative proposal says "The other option might be to make
> these messages non-transactional, in which case we'd separate the
> ordering from COMMIT ordering, evading the reordering problem."
> But I don't think that avoids the reordering problem at all.

I don't understand why. Why would it not address the reordering issue?

> Nor do I think it's correct.

Nor do I understand this. I mean, isn't it essentially the option you mentioned earlier - treating the non-transactional actions as independent transactions? Yes, we'd be batching them so that we'd not see "intermediate" states, but those are not observed by anyone.

> I don't think you *can* separate the ordering of these
> operations from the COMMIT ordering. They are, as I argue here,
> essentially mini-commits that only bump the sequence value, and they
> need to be replicated after the transactions that commit prior to the
> sequence value bump and before those that commit afterward. If they
> aren't handled that way, I don't think you're going to get fully
> correct behavior.

I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine you have something like this:

1: T1 does something and also increments a sequence
2: T1 logs state of the sequence (right before commit)
3: T1 writes COMMIT

Now when we decode/apply this, we end up doing this:

1: decode all T1 changes, stash them
2: decode the sequence state and apply it separately
3: decode COMMIT, apply all T1 changes

There might be other transactions interleaving with this, but I think it'd behave correctly. What example would not work?

>
> I'm going to confess that I have no really specific idea how to
> implement that. I'm just not sufficiently familiar with this code.
> However, I suspect that the solution lies in changing things on the
> decoding side rather than in the WAL format. I feel like the
> information that we need in order to do the right thing must already
> be present in the WAL. If it weren't, then how could crash recovery
> work correctly, or physical replication? At any given moment, you can
> choose to promote a physical standby, and at that point the state you
> observe on the new primary had better be some state that existed on
> the primary at some point in its history. At any moment, you can
> unplug the primary, restart it, and run crash recovery, and if you do,
> you had better end up with some state that existed on the primary at
> some point shortly before the crash. I think that there are actually a
> few subtle inaccuracies in the last two sentences, because actually
> the order in which transactions become visible on a physical standby
> can differ from the order in which it happens on the primary, but I
> don't think that actually changes the picture much. The point is that
> the WAL is the definitive source of information about what happened
> and in what order it happened, and we use it in that way already in
> the context of physical replication, and of standbys. If logical
> decoding has a problem with some case that those systems handle
> correctly, the problem is with logical decoding, not the WAL format.
>

The problem lies in how we log sequences. If we wrote each individual increment to WAL, it might work the way you propose (except for cases with sequences created in a transaction, etc.). But that's not what we do - we log sequence increments in batches of 32 values, and then only modify the sequence relfilenode.

This works for physical replication, because the WAL describes the "next" state of the sequence (so if you do "SELECT * FROM sequence" you'll not see the same state, and the sequence value may "jump ahead" after a failover).
But for logical replication this does not work, because the transaction might depend on a state created (WAL-logged) by some other transaction. And perhaps that transaction actually happened *before* we even built the first snapshot for decoding :-/

There's also the issue with what snapshot to use when decoding these transactional changes in logical decoding (see

> In particular, I think it's likely that the "non-transactional
> messages" that you mention earlier don't get applied at the point in
> the commit sequence where they were found in the WAL. Not sure why
> exactly, but perhaps the point at which we're reading WAL runs ahead
> of the decoding per se, or something like that, and thus those
> non-transactional messages arrive too early relative to the commit
> ordering. Possibly that could be changed, and they could be buffered

I'm not sure which case of "non-transactional messages" this refers to, so I can't quite respond to these comments. Perhaps you mean the problems that killed the previous patch [1]?

[1] https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com

> until earlier commits are replicated. Or else, when we see a WAL
> record for a non-transactional sequence operation, we could arrange to
> bundle that operation into an "adjacent" replicated transaction i.e.

IIRC moving stuff between transactions during decoding is problematic, because of snapshots.

> the transaction whose commit record occurs most nearly prior to, or
> most nearly after, the WAL record for the operation itself. Or else,
> we could create "virtual" transactions for such operations and make
> sure those get replayed at the right point in the commit sequence. Or
> else, I don't know, maybe something else. But I think the overall
> picture is that we need to approach the problem by replicating changes
> in WAL order, as a physical standby would do. Saying that a change is
> "nontransactional" doesn't mean that it's exempt from ordering
> requirements; rather, it means that that change has its own place in
> that ordering, distinct from the transaction in which it occurred.
>

But doesn't the approach with WAL-logging sequence state before COMMIT, and then applying it independently in WAL-order, do pretty much this?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
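For readers not familiar with the batching discussed above, here is a simplified standalone model of the pre-logging behavior (SEQ_LOG_VALS really is 32 in PostgreSQL; everything else here is a stand-in): one WAL record covers the next 32 values, so most nextval() calls - including calls made by other transactions - emit no WAL at all, which is exactly the cross-transaction dependency being described.

    #include <stdio.h>

    #define SEQ_LOG_VALS 32

    static long seq_value = 0;      /* current in-memory sequence state */
    static long wal_logged_upto = 0;/* highest value covered by WAL so far */

    static long
    nextval(void)
    {
        seq_value++;
        if (seq_value > wal_logged_upto)
        {
            /* log SEQ_LOG_VALS ahead; the record covers future calls,
             * even calls made by other transactions */
            wal_logged_upto = seq_value + SEQ_LOG_VALS;
            printf("WAL record: sequence is at least %ld\n",
                   wal_logged_upto);
        }
        return seq_value;
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < 70; i++)
            nextval();              /* emits only 3 WAL records in total */

        printf("value %ld, but WAL claims %ld\n", seq_value, wal_logged_upto);
        return 0;
    }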
Hi,

On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply them independently. But:
>
> (a) What would you do with increments of sequences created/reset in a
> transaction? Can't apply those outside the transaction, because it
> might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL logging for that case. It'll require a bit of machinery, but could actually result in *less* WAL logging overall, because we don't need to emit a WAL record for each SEQ_LOG_VALS sequence values.

> (b) What about increments created before we have a proper snapshot?
> There may be transactions dependent on the increment. This is what
> ultimately led to the revert of the patch.

I don't understand this - why would we ever need to process those increments from before we have a snapshot? Wouldn't they, by definition, be before the slot was active?

To me this is the rough equivalent of logical decoding not giving the initial state of all tables. You need some process outside of logical decoding to get that (obviously we have some support for that via the exported data snapshot during slot creation).

I assume that part of the initial sync would have to be a new sequence synchronization step that reads all the sequence states on the publisher and ensures that the subscriber sequences are at the same point. There's a bit of trickiness there, but it seems entirely doable. The logical replication replay support for sequences will have to be a bit careful about not decreasing the subscriber's sequence values - the standby initially will be ahead of the increments we'll see in the WAL. But that seems inevitable given the non-transactional nature of sequences.

> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

That approach seems quite wrong to me.

> > I'm going to confess that I have no really specific idea how to
> > implement that. I'm just not sufficiently familiar with this code.
> > However, I suspect that the solution lies in changing things on the
> > decoding side rather than in the WAL format. I feel like the
> > information that we need in order to do the right thing must already
> > be present in the WAL. If it weren't, then how could crash recovery
> > work correctly, or physical replication? At any given moment, you can
> > choose to promote a physical standby, and at that point the state you
> > observe on the new primary had better be some state that existed on
> > the primary at some point in its history. At any moment, you can
> > unplug the primary, restart it, and run crash recovery, and if you do,
> > you had better end up with some state that existed on the primary at
> > some point shortly before the crash.

One minor exception here is that there's no real time bound to see the last few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL flush.
> > I think that there are actually a
> > few subtle inaccuracies in the last two sentences, because actually
> > the order in which transactions become visible on a physical standby
> > can differ from the order in which it happens on the primary, but I
> > don't think that actually changes the picture much. The point is that
> > the WAL is the definitive source of information about what happened
> > and in what order it happened, and we use it in that way already in
> > the context of physical replication, and of standbys. If logical
> > decoding has a problem with some case that those systems handle
> > correctly, the problem is with logical decoding, not the WAL format.
>
> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.
>
> This works for physical replication, because the WAL describes the
> "next" state of the sequence (so if you do "SELECT * FROM sequence"
> you'll not see the same state, and the sequence value may "jump ahead"
> after a failover).
>
> But for logical replication this does not work, because the transaction
> might depend on a state created (WAL-logged) by some other transaction.
> And perhaps that transaction actually happened *before* we even built
> the first snapshot for decoding :-/

I really can't follow the "depend on state ... by some other transaction" aspect.

Even the case of a sequence that is renamed inside a transaction that did *not* create / reset the sequence and then also triggers an increment of the sequence seems to be dealt with reasonably by processing sequence increments outside a transaction - the old name will be used for the increments, replay of the renaming transaction would then implement the rename in a hypothetical DDL-replay future.

> There's also the issue with what snapshot to use when decoding these
> transactional changes in logical decoding (see

Incomplete parenthetical? Or were you referencing the next paragraph?

What are the transactional changes you're referring to here?

I did some skimming of the referenced thread about the reversal of the last approach, but I couldn't really understand what the fundamental issues were with the reverted implementation - it's a very long thread and references other threads.

Greetings,

Andres Freund
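A sketch of what "different WAL logging for that case" might look like at the decision level (assumed names; the real machinery would have to track which relfilenodes each in-progress transaction created): an increment is treated as transactional only when it touches a relfilenode created by the same, still-uncommitted transaction, and is replayed as a standalone change otherwise.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId; /* stand-in for the PostgreSQL typedef */

    /*
     * If the relfilenode the increment touches was created by the same
     * transaction (CREATE SEQUENCE, or an ALTER SEQUENCE that creates a
     * new relfilenode), the increment is only visible if that transaction
     * commits, so replay it inside the transaction. Otherwise it is
     * visible immediately and can be replayed on its own.
     */
    static bool
    increment_is_transactional(TransactionId relfilenode_creator_xid,
                               TransactionId incrementing_xid)
    {
        return relfilenode_creator_xid == incrementing_xid;
    }

    int
    main(void)
    {
        /* xact 700 increments a sequence whose relfilenode it created */
        printf("own sequence: %s\n",
               increment_is_transactional(700, 700)
               ? "replay inside the transaction"
               : "replay immediately");

        /* xact 700 increments a pre-existing sequence */
        printf("pre-existing: %s\n",
               increment_is_transactional(650, 700)
               ? "replay inside the transaction"
               : "replay immediately");

        return 0;
    }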
On 11/17/22 03:43, Andres Freund wrote:
> Hi,
>
>
> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
>> Well, yeah - we can either try to perform the stuff independently of the
>> transactions that triggered it, or we can try making it part of some of
>> the transactions. Each of those options has problems, though :-(
>>
>> The first version of the patch tried the first approach, i.e. decode the
>> increments and apply them independently. But:
>>
>> (a) What would you do with increments of sequences created/reset in a
>> transaction? Can't apply those outside the transaction, because it
>> might be rolled back (and that state is not visible on primary).
>
> I think a reasonable approach could be to actually perform different WAL
> logging for that case. It'll require a bit of machinery, but could actually
> result in *less* WAL logging overall, because we don't need to emit a WAL
> record for each SEQ_LOG_VALS sequence values.
>

Could you elaborate? Hard to comment without knowing more ...

My point was that stuff like this (creating a new sequence or at least a new relfilenode) means we can't apply that independently of the transaction (unlike regular increments). I'm not sure how a change to WAL logging would make that go away.

>> (b) What about increments created before we have a proper snapshot?
>> There may be transactions dependent on the increment. This is what
>> ultimately led to the revert of the patch.
>
> I don't understand this - why would we ever need to process those increments
> from before we have a snapshot? Wouldn't they, by definition, be before the
> slot was active?
>
> To me this is the rough equivalent of logical decoding not giving the initial
> state of all tables. You need some process outside of logical decoding to get
> that (obviously we have some support for that via the exported data snapshot
> during slot creation).
>

Which is what already happens during tablesync, no? We more or less copy sequences as if they were tables.

> I assume that part of the initial sync would have to be a new sequence
> synchronization step that reads all the sequence states on the publisher and
> ensures that the subscriber sequences are at the same point. There's a bit of
> trickiness there, but it seems entirely doable. The logical replication replay
> support for sequences will have to be a bit careful about not decreasing the
> subscriber's sequence values - the standby initially will be ahead of the
> increments we'll see in the WAL. But that seems inevitable given the
> non-transactional nature of sequences.
>

See fetch_sequence_data / copy_sequence in the patch. The bit about ensuring the sequence does not go backwards (say, using the page LSN and/or the LSN of the increment) is not there, however isn't that pretty much what I proposed doing for "reconciling" the sequence state logged at COMMIT?

>> This version of the patch tries to do the opposite thing - make sure
>> that the state after each commit matches what the transaction might have
>> seen (for sequences it accessed). It's imperfect, because it might log a
>> state generated "after" the sequence got accessed - it focuses on the
>> guarantee not to generate duplicate values.
>
> That approach seems quite wrong to me.
>

Why? Because it might log a state for the sequence as of COMMIT, when the transaction accessed the sequence much earlier? That is, this may happen:

T1: nextval('s') -> 1
T2: call nextval('s') 1000000x
T1: commit

and T1 will log sequence state ~1000001, give or take.
I don't think there's a way around that, given the non-transactional nature of sequences. And I'm not convinced this is an issue, as it ensures uniqueness of the values generated on the subscriber. And I think it's reasonable to replicate the sequence state as of the commit (because that's what you'd see on the primary).

>>> I'm going to confess that I have no really specific idea how to
>>> implement that. I'm just not sufficiently familiar with this code.
>>> However, I suspect that the solution lies in changing things on the
>>> decoding side rather than in the WAL format. I feel like the
>>> information that we need in order to do the right thing must already
>>> be present in the WAL. If it weren't, then how could crash recovery
>>> work correctly, or physical replication? At any given moment, you can
>>> choose to promote a physical standby, and at that point the state you
>>> observe on the new primary had better be some state that existed on
>>> the primary at some point in its history. At any moment, you can
>>> unplug the primary, restart it, and run crash recovery, and if you do,
>>> you had better end up with some state that existed on the primary at
>>> some point shortly before the crash.
>
> One minor exception here is that there's no real time bound to see the last
> few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
> flush.
>

Right. Another issue is we ignore stuff that happened in aborted transactions, so then nextval('s') in another transaction may not wait for syncrep to confirm receiving that WAL. Which is a data loss case, see [1]:

[1] https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9%40enterprisedb.com

>>> I think that there are actually a
>>> few subtle inaccuracies in the last two sentences, because actually
>>> the order in which transactions become visible on a physical standby
>>> can differ from the order in which it happens on the primary, but I
>>> don't think that actually changes the picture much. The point is that
>>> the WAL is the definitive source of information about what happened
>>> and in what order it happened, and we use it in that way already in
>>> the context of physical replication, and of standbys. If logical
>>> decoding has a problem with some case that those systems handle
>>> correctly, the problem is with logical decoding, not the WAL format.
>>>
>>
>> The problem lies in how we log sequences. If we wrote each individual
>> increment to WAL, it might work the way you propose (except for cases
>> with sequences created in a transaction, etc.). But that's not what we
>> do - we log sequence increments in batches of 32 values, and then only
>> modify the sequence relfilenode.
>
>> This works for physical replication, because the WAL describes the
>> "next" state of the sequence (so if you do "SELECT * FROM sequence"
>> you'll not see the same state, and the sequence value may "jump ahead"
>> after a failover).
>>
>> But for logical replication this does not work, because the transaction
>> might depend on a state created (WAL-logged) by some other transaction.
>> And perhaps that transaction actually happened *before* we even built
>> the first snapshot for decoding :-/
>
> I really can't follow the "depend on state ... by some other transaction"
> aspect.
>

T1: nextval('s') -> writes WAL, covering the next 32 increments
T2: nextval('s') -> no WAL generated, covered by T1's WAL record

This is what I mean by "dependency" on state logged by another transaction.
It already causes problems with streaming replication (see the reference to syncrep above), and logical replication has the same issue.

> Even the case of a sequence that is renamed inside a transaction that did
> *not* create / reset the sequence and then also triggers an increment of the
> sequence seems to be dealt with reasonably by processing sequence increments
> outside a transaction - the old name will be used for the increments, replay
> of the renaming transaction would then implement the rename in a hypothetical
> DDL-replay future.
>
>
>> There's also the issue with what snapshot to use when decoding these
>> transactional changes in logical decoding (see
>
> Incomplete parenthetical? Or were you referencing the next paragraph?
>
> What are the transactional changes you're referring to here?
>

Sorry, IIRC I merely wanted to mention/reference the snapshot issue in the thread [2] that I ended up referencing in the next paragraph.

[2] https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com

> I did some skimming of the referenced thread about the reversal of the last
> approach, but I couldn't really understand what the fundamental issues were
> with the reverted implementation - it's a very long thread and references
> other threads.
>

Yes, it's long/complex, but I intentionally linked to a specific message which describes the issue ... It's entirely possible there is a simple fix for the issue, and I just got confused / unable to see the solution.

The whole issue was due to having a mix of transactional and non-transactional cases, similarly to logical messages - and logicalmsg_decode() has the same issue, so maybe let's talk about that for a moment.

See [3] and imagine you're dealing with a transactional message, but you're still building a consistent snapshot. So the first branch applies:

    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;

but because we don't have a snapshot, SnapBuildProcessChange does this:

    if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
        return false;

which however means logicalmsg_decode() does

    snapshot = SnapBuildGetOrBuildSnapshot(builder);

which crashes, because it hits this assert:

    Assert(builder->state == SNAPBUILD_CONSISTENT);

The sequence decoding did almost the same thing, with the same issue. Maybe the correct thing to do is to just ignore the change in this case? Presumably it'd be replicated by tablesync. But we've been unable to convince ourselves that's correct, or what snapshot to pass to ReorderBufferQueueMessage/ReorderBufferQueueSequence.

[3] https://github.com/postgres/postgres/blob/master/src/backend/replication/logical/decode.c#L585

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 16, 2022 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> There's a couple of caveats, though:
>
> 1) Maybe we should focus more on "actually observed" state instead of
> "observable". Who cares if the sequence moved forward in a transaction
> that was ultimately rolled back? No committed transaction should have
> observed those values - in a way, the last "valid" state of the sequence
> is the last value generated in a transaction that ultimately committed.

When I say "observable" I mean from a separate transaction, not one that is making changes to things. I said "observable" rather than "actually observed" because we neither know nor care whether someone actually ran a SELECT statement at any given moment in time, just what they would have seen if they did.

> 2) I think what matters more is that we never generate duplicate values.
> That is, if you generate a value from a sequence, commit a transaction
> and replicate it, then the logical standby should not generate the same
> value from the sequence. This guarantee seems necessary for "failover"
> to a logical standby.

I think that matters, but I don't think it's sufficient. We need to preserve the order in which things appear to happen, and which changes are and are not atomic, not just the final result.

> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply them independently. But:
>
> (a) What would you do with increments of sequences created/reset in a
> transaction? Can't apply those outside the transaction, because it
> might be rolled back (and that state is not visible on primary).

If the state isn't going to be visible until the transaction commits, it has to be replicated as part of the transaction. If I create a sequence and then nextval it a bunch of times, I can't replicate that by first creating the sequence, and then later, as a separate operation, replicating the nextvals. If I do that, then there's an intermediate state visible on the replica that was never visible on the origin server. That's broken.

> (b) What about increments created before we have a proper snapshot?
> There may be transactions dependent on the increment. This is what
> ultimately led to the revert of the patch.

Whatever problem exists here is with the implementation, not the concept. If you copy the initial state as it exists at some moment in time to a replica, and then replicate all the changes that happen afterward to that replica without messing up the order, the replica WILL be in sync with the origin server. The things that happen before you copy the initial state do not and cannot matter. But what you're describing sounds like the changes aren't really replicated in visibility order, and then it is easy to see how a problem like this can happen. Because now, an operation that actually became visible just before or just after the initial copy was taken might be thought to belong on the other side of that boundary, and then everything will break. And it sounds like that is what you are describing.

> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed).
> It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

Like Andres, I just can't imagine this being correct. It feels like it's trying to paper over the failure to do the replication properly during the transaction by overwriting state at the end.

> Yes, this would mean we accept we may end up with something like this:
>
> 1: T1 logs sequence state S1
> 2: someone increments sequence
> 3: T2 logs sequence state S2
> 4: T2 commits
> 5: T1 commits
>
> which "inverts" the apply order of S1 vs. S2, because we first apply S2
> and then the "old" S1. But as long as we're smart enough to "discard"
> applying S1, I think that's acceptable - because it guarantees we'll not
> generate duplicate values (with values in the committed transaction).
>
> I'd also argue it does not actually generate invalid state, because once
> we commit either transaction, S2 is what's visible.

I agree that it's OK if the sequence increment gets merged into the commit that immediately follows. However, I disagree with the idea of discarding the second update on the grounds that it would make the sequence go backward and we know that can't be right. That algorithm works in the really specific case where the only operations are increments. As soon as anyone does anything else to the sequence, such an algorithm can no longer work. Nor can it work for objects that are not sequences. The alternative strategy of replicating each change exactly once and in the correct order works for all current and future object types in all cases.

> > Your alternative proposal says "The other option might be to make
> > these messages non-transactional, in which case we'd separate the
> > ordering from COMMIT ordering, evading the reordering problem." But I
> > don't think that avoids the reordering problem at all.
>
> I don't understand why. Why would it not address the reordering issue?
>
> > Nor do I think it's correct.
>
> Nor do I understand this. I mean, isn't it essentially the option you
> mentioned earlier - treating the non-transactional actions as
> independent transactions? Yes, we'd be batching them so that we'd not
> see "intermediate" states, but those are not observed by anyone.

I don't think that batching them is a bad idea, in fact I think it's necessary. But those batches still have to be applied at the right time relative to the sequence of commits.

> I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
> you have something like this:
>
> 1: T1 does something and also increments a sequence
> 2: T1 logs state of the sequence (right before commit)
> 3: T1 writes COMMIT
>
> Now when we decode/apply this, we end up doing this:
>
> 1: decode all T1 changes, stash them
> 2: decode the sequence state and apply it separately
> 3: decode COMMIT, apply all T1 changes
>
> There might be other transactions interleaving with this, but I think
> it'd behave correctly. What example would not work?

What if one of the other transactions renames the sequence, or changes the current value, or does basically anything to it other than nextval?

> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.
> > This works for physical replication, because the WAL describes the > "next" state of the sequence (so if you do "SELECT * FROM sequence" > you'll not see the same state, and the sequence value may "jump ahead" > after a failover). > > But for logical replication this does not work, because the transaction > might depend on a state created (WAL-logged) by some other transaction. > And perhaps that transaction actually happened *before* we even built > the first snapshot for decoding :-/ I agree that there's a problem here but I don't think that it's a huge problem. I think that it's not QUITE right to think about what state is visible on the primary. It's better to think about what state would be visible on the primary if it crashed and restarted after writing any given amount of WAL, or what would be visible on a physical standby after replaying any given amount of WAL. If logical replication mimics that, I think it's as correct as it needs to be. If not, those other systems are broken, too. So I think what should happen is that when we write a WAL record saying that the sequence has been incremented by 32, that should be logically replicated after all commits whose commit record precedes that WAL record and before commits whose commit record follows that WAL record. It is OK to merge the replication of that record into one of either the immediately preceding or the immediately following commit, but you can't do it as part of any other commit because then you're changing the order of operations. For instance, consider: T1: BEGIN; INSERT; COMMIT; T2: BEGIN; nextval('a_seq') causing a logged advancement to the sequence; T3: BEGIN; nextval('b_seq') causing a logged advancement to the sequence; T4: BEGIN; INSERT; COMMIT; T2: COMMIT; T3: COMMIT; The sequence increments can be replicated as part of T1 or part of T4 or in between applying T1 and T4. They cannot be applied as part of T2 or T3. Otherwise, suppose T4 read the current value of one of those sequences and included that value in the inserted row, and the target table happened to be the sequence_value_at_end_of_period table. Then imagine that after receiving the data for T4 and replicating it, the primary server is hit by a meteor and the replica is promoted. Well, it's now possible for some new transaction to get a value from that sequence lower than what has already been written to the sequence_value_at_end_of_period table, which will presumably break the application. > > In particular, I think it's likely that the "non-transactional > > messages" that you mention earlier don't get applied at the point in > > the commit sequence where they were found in the WAL. Not sure why > > exactly, but perhaps the point at which we're reading WAL runs ahead > > of the decoding per se, or something like that, and thus those > > non-transactional messages arrive too early relative to the commit > > ordering. Possibly that could be changed, and they could be buffered > I'm not sure which case of "non-transactional messages" this refers to, > so I can't quite respond to these comments. Perhaps you mean the > problems that killed the previous patch [1]? In http://postgr.es/m/8bf1c518-b886-fe1b-5c42-09f9c663146d@enterprisedb.com you said "The other option might be to make these messages non-transactional". I was referring to that. > > the transaction whose commit record occurs most nearly prior to, or > > most nearly after, the WAL record for the operation itself.
Or else, > > we could create "virtual" transactions for such operations and make > > sure those get replayed at the right point in the commit sequence. Or > > else, I don't know, maybe something else. But I think the overall > > picture is that we need to approach the problem by replicating changes > > in WAL order, as a physical standby would do. Saying that a change is > > "nontransactional" doesn't mean that it's exempt from ordering > > requirements; rather, it means that that change has its own place in > > that ordering, distinct from the transaction in which it occurred. > > But doesn't the approach with WAL-logging sequence state before COMMIT, > and then applying it independently in WAL-order, do pretty much this? I'm sort of repeating myself here, but: only if the only operations that ever get performed on sequences are increments. Which is just not true. -- Robert Haas EDB: http://www.enterprisedb.com
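To make the "replicate in WAL order" rule above concrete, here is a minimal sketch of how the decoding loop could dispatch records under it - every name in it is an assumption for illustration, not actual patch code:

    switch (record_type)
    {
        case RECORD_COMMIT:
            /* commits are replayed immediately, in WAL (= commit) order */
            replay_transaction(txn);
            break;

        case RECORD_SEQ_ADVANCE:
            if (sequence_created_by_running_xact(locator))
                queue_change_in_transaction(txn, change);  /* transactional */
            else
                apply_sequence_state(locator, change);     /* right here, at this LSN */
            break;
    }

Under that rule, the advancements in the T1..T4 example above are applied between T1 and T4, no matter when T2 and T3 eventually commit.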
Hi, On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: > On 11/17/22 03:43, Andres Freund wrote: > > On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote: > >> Well, yeah - we can either try to perform the stuff independently of the > >> transactions that triggered it, or we can try making it part of some of > >> the transactions. Each of those options has problems, though :-( > >> > >> The first version of the patch tried the first approach, i.e. decode the > >> increments and apply that independently. But: > >> > >> (a) What would you do with increments of sequences created/reset in a > >> transaction? Can't apply those outside the transaction, because it > >> might be rolled back (and that state is not visible on primary). > > > > I think a reasonable approach could be to actually perform different WAL > > logging for that case. It'll require a bit of machinery, but could actually > > result in *less* WAL logging overall, because we don't need to emit a WAL > > record for each SEQ_LOG_VALS sequence values. > > > > Could you elaborate? Hard to comment without knowing more ... > > My point was that stuff like this (creating a new sequence or at least a > new relfilenode) means we can't apply that independently of the > transaction (unlike regular increments). I'm not sure how a change to > WAL logging would make that go away. Different WAL logging would make it easy to handle that on the logical decoding level. We don't need to emit WAL records each time a created-in-this-toplevel-xact sequence gets incremented as they're not persisting anyway if the surrounding xact aborts. We already need to remember the filenode so it can be dropped at the end of the transaction, so we could emit a single record for each sequence at that point. > >> (b) What about increments created before we have a proper snapshot? > >> There may be transactions dependent on the increment. This is what > >> ultimately led to revert of the patch. > > > > I don't understand this - why would we ever need to process those increments > > from before we have a snapshot? Wouldn't they, by definition, be before the > > slot was active? > > > > To me this is the rough equivalent of logical decoding not giving the initial > > state of all tables. You need some process outside of logical decoding to get > > that (obviously we have some support for that via the exported data snapshot > > during slot creation). > > > > Which is what already happens during tablesync, no? We more or less copy > sequences as if they were tables. I think you might have to copy sequences after tables, but I'm not sure. But otherwise, yea. > > I assume that part of the initial sync would have to be a new sequence > > synchronization step that reads all the sequence states on the publisher and > > ensures that the subscriber sequences are at the same point. There's a bit of > > trickiness there, but it seems entirely doable. The logical replication replay > > support for sequences will have to be a bit careful about not decreasing the > > subscriber's sequence values - the standby initially will be ahead of the > > increments we'll see in the WAL. But that seems inevitable given the > > non-transactional nature of sequences. > > > > See fetch_sequence_data / copy_sequence in the patch. The bit about > ensuring the sequence does not go away (say, using page LSN and/or LSN > of the increment) is not there, however isn't that pretty much what I > proposed doing for "reconciling" the sequence state logged at COMMIT?
Well, I think the approach of logging all sequence increments at commit is the wrong idea... Creating a new relfilenode whenever a sequence is incremented seems like a complete no-go to me. That increases sequence overhead by several orders of magnitude and will lead to *awful* catalog bloat on the subscriber. > > > >> This version of the patch tries to do the opposite thing - make sure > >> that the state after each commit matches what the transaction might have > >> seen (for sequences it accessed). It's imperfect, because it might log a > >> state generated "after" the sequence got accessed - it focuses on the > >> guarantee not to generate duplicate values. > > > > That approach seems quite wrong to me. > > > > Why? Because it might log a state for a sequence as of COMMIT, when the > transaction accessed the sequence much earlier? Mainly because sequences aren't transactional and trying to make them transactional will require awful contortions. While there are cases where we don't flush the WAL / wait for syncrep for sequences, we do replicate their state correctly on physical replication. If an LSN has been acknowledged as having been replicated, we won't just lose a prior sequence increment after promotion, even if the transaction didn't [yet] commit. It's completely valid for an application to call nextval() in one transaction, potentially even abort it, and then only use that sequence value in another transaction. > > I did some skimming of the referenced thread about the reversal of the last > > approach, but I couldn't really understand what the fundamental issues were > > with the reverted implementation - it's a very long thread and references > > other threads. > > > > Yes, it's long/complex, but I intentionally linked to a specific message > which describes the issue ... > > It's entirely possible there is a simple fix for the issue, and I just > got confused / unable to see the solution. The whole issue was due to > having a mix of transactional and non-transactional cases, similarly to > logical messages - and logicalmsg_decode() has the same issue, so maybe > let's talk about that for a moment. > > See [3] and imagine you're dealing with a transactional message, but > you're still building a consistent snapshot. So the first branch applies: > > if (transactional && > !SnapBuildProcessChange(builder, xid, buf->origptr)) > return; > > but because we don't have a snapshot, SnapBuildProcessChange does this: > > if (builder->state < SNAPBUILD_FULL_SNAPSHOT) > return false; In this case we'd just return without further work in logicalmsg_decode(). The problematic case presumably is when we have a full snapshot but aren't yet consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns true. And we reach: > which however means logicalmsg_decode() does > > snapshot = SnapBuildGetOrBuildSnapshot(builder); > > which crashes, because it hits this assert: > > Assert(builder->state == SNAPBUILD_CONSISTENT); I think the problem here is just that we shouldn't even try to get a snapshot in the transactional case - note that it's not even used in ReorderBufferQueueMessage() for transactional messages. The transactional case needs to behave like a "normal" change - we might never decode the message if the transaction ends up committing before we've reached a consistent point.
No, I don't think that'd be correct, the message | sequence needs to be queued for the transaction. If the transaction ends up committing after we've reached consistency, we'll get the correct snapshot from the base snapshot set in SnapBuildProcessChange(). Greetings, Andres Freund
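Spelled out, the fix implied here would make the snapshot handling in logicalmsg_decode() look roughly like the following - a sketch of the intended shape, not committed code:

    Snapshot    snapshot = NULL;

    if (message->transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!message->transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

    /*
     * Only non-transactional messages, which go to the output plugin
     * immediately, need a snapshot - and the branch above already
     * guarantees we are consistent at this point. Transactional messages
     * behave like normal changes and rely on the base snapshot set up by
     * SnapBuildProcessChange(), if the transaction ever gets decoded.
     */
    if (!message->transactional)
        snapshot = SnapBuildGetOrBuildSnapshot(builder);

    ReorderBufferQueueMessage(ctx->reorder, xid, snapshot, buf->endptr,
                              message->transactional,
                              message->message,     /* prefix */
                              message->message_size,
                              message->message + message->prefix_size);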
On 11/17/22 18:07, Andres Freund wrote: > Hi, > > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: >> On 11/17/22 03:43, Andres Freund wrote: >>> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote: >>>> Well, yeah - we can either try to perform the stuff independently of the >>>> transactions that triggered it, or we can try making it part of some of >>>> the transactions. Each of those options has problems, though :-( >>>> >>>> The first version of the patch tried the first approach, i.e. decode the >>>> increments and apply that independently. But: >>>> >>>> (a) What would you do with increments of sequences created/reset in a >>>> transaction? Can't apply those outside the transaction, because it >>>> might be rolled back (and that state is not visible on primary). >>> >>> I think a reasonable approach could be to actually perform different WAL >>> logging for that case. It'll require a bit of machinery, but could actually >>> result in *less* WAL logging overall, because we don't need to emit a WAL >>> record for each SEQ_LOG_VALS sequence values. >>> >> >> Could you elaborate? Hard to comment without knowing more ... >> >> My point was that stuff like this (creating a new sequence or at least a >> new relfilenode) means we can't apply that independently of the >> transaction (unlike regular increments). I'm not sure how a change to >> WAL logging would make that go away. > > Different WAL logging would make it easy to handle that on the logical > decoding level. We don't need to emit WAL records each time a > created-in-this-toplevel-xact sequences gets incremented as they're not > persisting anyway if the surrounding xact aborts. We already need to remember > the filenode so it can be dropped at the end of the transaction, so we could > emit a single record for each sequence at that point. > > >>>> (b) What about increments created before we have a proper snapshot? >>>> There may be transactions dependent on the increment. This is what >>>> ultimately led to revert of the patch. >>> >>> I don't understand this - why would we ever need to process those increments >>> from before we have a snapshot? Wouldn't they, by definition, be before the >>> slot was active? >>> >>> To me this is the rough equivalent of logical decoding not giving the initial >>> state of all tables. You need some process outside of logical decoding to get >>> that (obviously we have some support for that via the exported data snapshot >>> during slot creation). >>> >> >> Which is what already happens during tablesync, no? We more or less copy >> sequences as if they were tables. > > I think you might have to copy sequences after tables, but I'm not sure. But > otherwise, yea. > > >>> I assume that part of the initial sync would have to be a new sequence >>> synchronization step that reads all the sequence states on the publisher and >>> ensures that the subscriber sequences are at the same point. There's a bit of >>> trickiness there, but it seems entirely doable. The logical replication replay >>> support for sequences will have to be a bit careful about not decreasing the >>> subscriber's sequence values - the standby initially will be ahead of the >>> increments we'll see in the WAL. But that seems inevitable given the >>> non-transactional nature of sequences. >>> >> >> See fetch_sequence_data / copy_sequence in the patch. 
The bit about >> ensuring the sequence does not go away (say, using page LSN and/or LSN >> of the increment) is not there, however isn't that pretty much what I >> proposed doing for "reconciling" the sequence state logged at COMMIT? > > Well, I think the approach of logging all sequence increments at commit is the > wrong idea... > But we're not logging all sequence increments, no? We're logging the state for each sequence touched by the transaction, but only once - if the transaction incremented the sequence 1000000x times, we'll still log it just once (at least for this particular purpose). Yes, if transactions touch each sequence just once, then we're logging individual increments. The only more efficient solution would be to decode the existing WAL (every ~32 increments), and perhaps also track which sequences were accessed by a transaction. We'd then simply stash the increments in a global reorderbuffer hash table, and apply only the last one at commit time. This would require the transactional / non-transactional behavior (I think), but perhaps we can make that work. Or are you thinking about some other scheme? > Creating a new relfilenode whenever a sequence is incremented seems like a > complete no-go to me. That increases sequence overhead by several orders of > magnitude and will lead to *awful* catalog bloat on the subscriber. > You mean on the apply side? Yes, I agree this needs a better approach, I've focused on the decoding side so far. > >>> > >>>> This version of the patch tries to do the opposite thing - make sure > >>>> that the state after each commit matches what the transaction might have > >>>> seen (for sequences it accessed). It's imperfect, because it might log a > >>>> state generated "after" the sequence got accessed - it focuses on the > >>>> guarantee not to generate duplicate values. > >>> > >>> That approach seems quite wrong to me. > >>> > >> > >> Why? Because it might log a state for a sequence as of COMMIT, when the > >> transaction accessed the sequence much earlier? > > Mainly because sequences aren't transactional and trying to make them transactional will > require awful contortions. > > While there are cases where we don't flush the WAL / wait for syncrep for > sequences, we do replicate their state correctly on physical replication. If > an LSN has been acknowledged as having been replicated, we won't just lose a > prior sequence increment after promotion, even if the transaction didn't [yet] > commit. > True, I agree we should aim to achieve that. > It's completely valid for an application to call nextval() in one transaction, > potentially even abort it, and then only use that sequence value in another > transaction. > I don't quite agree with that - we make no promises about what happens to sequence changes in aborted transactions. I don't think I've ever seen an application using such a pattern either. And I'd argue we already fail to uphold such a guarantee, because we don't wait for syncrep if the sequence WAL happened in an aborted transaction. So if you use the value elsewhere (outside PG), you may lose it. Anyway, I think the scheme I outlined above (stashing decoded increments, logged once every ~32 values, and applying the latest increment for each sequence at commit) would work. > > >>> I did some skimming of the referenced thread about the reversal of the last >>> approach, but I couldn't really understand what the fundamental issues were >>> with the reverted implementation - it's a very long thread and references >>> other threads.
>>> >> >> Yes, it's long/complex, but I intentionally linked to a specific message >> which describes the issue ... >> >> It's entirely possible there is a simple fix for the issue, and I just >> got confused / unable to see the solution. The whole issue was due to >> having a mix of transactional and non-transactional cases, similarly to >> logical messages - and logicalmsg_decode() has the same issue, so maybe >> let's talk about that for a moment. >> >> See [3] and imagine you're dealing with a transactional message, but >> you're still building a consistent snapshot. So the first branch applies: >> >> if (transactional && >> !SnapBuildProcessChange(builder, xid, buf->origptr)) >> return; >> >> but because we don't have a snapshot, SnapBuildProcessChange does this: >> >> if (builder->state < SNAPBUILD_FULL_SNAPSHOT) >> return false; > > In this case we'd just return without further work in logicalmsg_decode(). The > problematic case presumably is when we have a full snapshot but aren't yet > consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns > true. And we reach: > >> which however means logicalmsg_decode() does >> >> snapshot = SnapBuildGetOrBuildSnapshot(builder); >> >> which crashes, because it hits this assert: >> >> Assert(builder->state == SNAPBUILD_CONSISTENT); > > I think the problem here is just that we shouldn't even try to get a snapshot > in the transactional case - note that it's not even used in > ReorderBufferQueueMessage() for transactional messages. The transactional case > needs to behave like a "normal" change - we might never decode the message if > the transaction ends up committing before we've reached a consistent point. > > >> The sequence decoding did almost the same thing, with the same issue. >> Maybe the correct thing to do is to just ignore the change in this case? > > No, I don't think that'd be correct, the message | sequence needs to be queued > for the transaction. If the transaction ends up committing after we've reached > consistency, we'll get the correct snapshot from the base snapshot set in > SnapBuildProcessChange(). > Yeah, I think you're right. I looked at this again, with a fresh mind, and I came to the same conclusion. Roughly what the attached patch does. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
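For reference, a minimal sketch of the "stash the latest increment" scheme Tomas outlines in the message above - the struct and function names here are assumptions for illustration, not patch code:

    typedef struct ReorderBufferSequenceEnt
    {
        RelFileLocator locator;             /* hash key: which sequence */
        XLogRecPtr  lsn;                    /* LSN of last decoded increment */
        ReorderBufferTupleBuf *state;       /* last decoded sequence tuple */
    } ReorderBufferSequenceEnt;

    /* On decoding an increment, remember only the newest state per sequence. */
    static void
    stash_sequence_increment(HTAB *seqhash, RelFileLocator *locator,
                             XLogRecPtr lsn, ReorderBufferTupleBuf *state)
    {
        bool        found;
        ReorderBufferSequenceEnt *ent;

        ent = (ReorderBufferSequenceEnt *)
            hash_search(seqhash, locator, HASH_ENTER, &found);

        if (!found || lsn > ent->lsn)
        {
            ent->lsn = lsn;
            ent->state = state;     /* the replaced state could be freed here */
        }
    }

    /* At commit, walk the hash and apply each stashed (latest) state. */

The decoding side then pays one hash probe per logged increment (i.e. one per ~32 nextval() calls), and the apply side only ever sees the final state of each sequence.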
Hi, On 2022-11-17 22:13:23 +0100, Tomas Vondra wrote: > On 11/17/22 18:07, Andres Freund wrote: > > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: > >> On 11/17/22 03:43, Andres Freund wrote: > >>> I assume that part of the initial sync would have to be a new sequence > >>> synchronization step that reads all the sequence states on the publisher and > >>> ensures that the subscriber sequences are at the same point. There's a bit of > >>> trickiness there, but it seems entirely doable. The logical replication replay > >>> support for sequences will have to be a bit careful about not decreasing the > >>> subscriber's sequence values - the standby initially will be ahead of the > >>> increments we'll see in the WAL. But that seems inevitable given the > >>> non-transactional nature of sequences. > >>> > >> > >> See fetch_sequence_data / copy_sequence in the patch. The bit about > >> ensuring the sequence does not go away (say, using page LSN and/or LSN > >> of the increment) is not there, however isn't that pretty much what I > >> proposed doing for "reconciling" the sequence state logged at COMMIT? > > > > Well, I think the approach of logging all sequence increments at commit is the > > wrong idea... > > > > But we're not logging all sequence increments, no? I was imprecise - I meant streaming them out at commit. > Yeah, I think you're right. I looked at this again, with a fresh mind, and > I came to the same conclusion. Roughly what the attached patch does. To me it seems a bit nicer to keep the SnapBuildGetOrBuildSnapshot() call in decode.c instead of moving it to reorderbuffer.c. Perhaps we should add a snapbuild.c helper similar to SnapBuildProcessChange() for non-transactional changes that also gets a snapshot? Could look something like Snapshot snapshot = NULL; if (message->transactional && !SnapBuildProcessChange(builder, xid, buf->origptr)) return; else if (!SnapBuildProcessStateNonTx(builder, &snapshot)) return; ... Or perhaps we should just bite the bullet and add an argument to SnapBuildProcessChange to deal with that? Greetings, Andres Freund
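One possible body for such a helper, just to make the proposed shape concrete - SnapBuildProcessStateNonTx() does not exist; this is an assumption built on the existing snapbuild.c functions:

    static bool
    SnapBuildProcessStateNonTx(SnapBuild *builder, Snapshot *snapshot)
    {
        /*
         * Non-transactional counterpart of SnapBuildProcessChange(): the
         * change is handed to the output plugin immediately, so we can
         * only proceed once the snapshot is consistent. (A real version
         * would presumably also honor SnapBuildXactNeedsSkip().)
         */
        if (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT)
            return false;

        *snapshot = SnapBuildGetOrBuildSnapshot(builder);
        return true;
    }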
Hi, Here's a rebased version of the sequence decoding patch. 0001 is a fix for the pre-existing issue in logicalmsg_decode, attempting to build a snapshot before getting into a consistent state. AFAICS this only affects assert-enabled builds and is otherwise harmless, because we are not actually using the snapshot (apply gets a valid snapshot from the transaction). This is mostly the fix I shared in November, except that I kept the call in decode.c (per comment from Andres). I haven't added any argument to SnapBuildProcessChange because we may need to backpatch this (and it didn't seem much simpler, IMHO). 0002 is a rebased version of the original approach, committed as 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix as 0001 (for the sequence messages), the primary reason for the revert. The rebase was not quite straightforward, due to extensive changes in how publications deal with tables/schemas, and so on. So this adopts them, but other than that it behaves just like the original patch. So this abandons the approach with COMMIT-time logging for sequences accessed/modified by the transaction, proposed in response to the revert. It seemed like a good (and simpler) alternative, but there were far too many issues - higher overhead, ordering of records for concurrent transactions, making it reliable, etc. I think the main remaining question is what's the goal of this patch, or rather what "guarantees" we expect from it - what we expect to see on the replica after incrementing a sequence on the primary. Robert described [1] a model and argued the standby should not "invent" new states. It's a long / detailed explanation; I'm not going to try to shorten it here because that'd inevitably omit various details. So better read it whole ... Anyway, I don't think this approach (essentially treating most sequence increments as non-transactional) breaks any consistency guarantees or introduces any "new" states that would not be observable on the primary. In a way, this treats non-transactional sequence increments as separate transactions, and applies them directly. If you read the sequence in between two commits, you might see any "intermediate" state of the sequence - that's the nature of non-transactional changes. We could "postpone" applying the decoded changes until the next commit, which might improve performance if a transaction is long enough to cover many sequence increments. But that's more a performance optimization than a matter of correctness, IMHO. One caveat is that because of how WAL works for sequences, we're actually decoding changes "ahead" so if you read the sequence on the subscriber it'll actually seem to be slightly ahead (up to ~32 values). This could be eliminated by setting SEQ_LOG_VALS to 0, which however increases the sequence costs, of course. This however brings me to the original question what's the purpose of this patch - and that's essentially keeping sequences up to date to make them usable after a failover. We can't generate values from the sequence on the subscriber, because it'd just get overwritten. And from this point of view, it's also fine that the sequence is slightly ahead, because that's what happens after crash recovery anyway. And we're not guaranteeing the sequences to be gap-less. regards [1] https://www.postgresql.org/message-id/CA%2BTgmoaYG7672OgdwpGm5cOwy8_ftbs%3D3u-YMvR9fiJwQUzgrQ%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
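The "up to ~32 values" caveat mentioned above comes from how nextval() batches its WAL logging; the relevant logic in nextval_internal() in sequence.c looks roughly like this:

    /* We don't log each fetching of a value from a sequence; instead we
     * pre-log enough fetches to cover SEQ_LOG_VALS future calls. */
    #define SEQ_LOG_VALS    32

    ...
    if (log < fetch || !seq->is_called)
    {
        /* forced log to satisfy local demand for values */
        fetch = log = fetch + SEQ_LOG_VALS;
        logit = true;
    }

Since each WAL record describes the state the sequence will reach after SEQ_LOG_VALS more calls, anything reconstructed purely from WAL - a crash-recovered primary, a physical standby, or a logical subscriber - can legitimately appear ahead of the last value actually handed out.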
On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > 0001 is a fix for the pre-existing issue in logicalmsg_decode, > attempting to build a snapshot before getting into a consistent state. > AFAICS this only affects assert-enabled builds and is otherwise > harmless, because we are not actually using the snapshot (apply gets a > valid snapshot from the transaction). > > This is mostly the fix I shared in November, except that I kept the call > in decode.c (per comment from Andres). I haven't added any argument to > SnapBuildProcessChange because we may need to backpatch this (and it > didn't seem much simpler, IMHO). I tend to associate transactional behavior with snapshots, so it looks odd to see code that builds a snapshot only when the message is non-transactional. I think that a more detailed comment spelling out the reasoning would be useful here. > This however brings me to the original question what's the purpose of > this patch - and that's essentially keeping sequences up to date to make > them usable after a failover. We can't generate values from the sequence > on the subscriber, because it'd just get overwritten. And from this > point of view, it's also fine that the sequence is slightly ahead, > because that's what happens after crash recovery anyway. And we're not > guaranteeing the sequences to be gap-less. I agree that it's fine for the sequence to be slightly ahead, but I think that it can't be too far ahead without causing problems. Suppose for example that transaction #1 creates a sequence. Transaction #2 does nextval on the sequence a bunch of times and inserts rows into a table using the sequence values as the PK. It's fine if the nextval operations are replicated ahead of the commit of transaction #2 -- in fact I'd say it's necessary for correctness -- but they can't precede the commit of transaction #1, since then the sequence won't exist yet. Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode, I think that needs to act as a barrier: non-transactional changes that happened before that transaction must also be replicated before that transaction is replicated, and those that happened after that transaction is replicated must be replayed after that transaction is replicated. Otherwise, at the very least, there will be states visible on the standby that were never visible on the origin server, and maybe we'll just straight up get the wrong answer. For instance: 1. nextval, setting last_value to 3 2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19 3. nextval, setting last_value to 20 If 3 happens before 2, the sequence ends up in the wrong state. Maybe you've already got this and similar cases totally correctly handled, I'm not sure, just throwing it out there. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/10/23 20:52, Robert Haas wrote: > On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> 0001 is a fix for the pre-existing issue in logicalmsg_decode, >> attempting to build a snapshot before getting into a consistent state. >> AFAICS this only affects assert-enabled builds and is otherwise >> harmless, because we are not actually using the snapshot (apply gets a >> valid snapshot from the transaction). >> >> This is mostly the fix I shared in November, except that I kept the call >> in decode.c (per comment from Andres). I haven't added any argument to >> SnapBuildProcessChange because we may need to backpatch this (and it >> didn't seem much simpler, IMHO). > > I tend to associate transactional behavior with snapshots, so it looks > odd to see code that builds a snapshot only when the message is > non-transactional. I think that a more detailed comment spelling out > the reasoning would be useful here. > I'll try adding a comment explaining this, but the reasoning is fairly simple AFAICS: 1) We don't actually need to build the snapshot for transactional changes, because if we end up applying the change, we'll use the snapshot provided/maintained by the reorderbuffer. 2) But we don't know if we end up applying the change - it may happen that this is one of the transactions we're waiting on to finish or have skipped, in which case the snapshot is kinda bogus anyway. What "saved" us is that we'll not actually use the snapshot in the end. It's just the assert that causes issues. 3) For non-transactional changes, we need a snapshot because we're going to execute the callback right away. But in this case the code actually protects against building inconsistent snapshots. >> This however brings me to the original question what's the purpose of >> this patch - and that's essentially keeping sequences up to date to make >> them usable after a failover. We can't generate values from the sequence >> on the subscriber, because it'd just get overwritten. And from this >> point of view, it's also fine that the sequence is slightly ahead, >> because that's what happens after crash recovery anyway. And we're not >> guaranteeing the sequences to be gap-less. > > I agree that it's fine for the sequence to be slightly ahead, but I > think that it can't be too far ahead without causing problems. Suppose > for example that transaction #1 creates a sequence. Transaction #2 > does nextval on the sequence a bunch of times and inserts rows into a > table using the sequence values as the PK. It's fine if the nextval > operations are replicated ahead of the commit of transaction #2 -- in > fact I'd say it's necessary for correctness -- but they can't precede > the commit of transaction #1, since then the sequence won't exist yet. It's not clear to me how that could even happen. If transaction #1 creates a sequence, it's invisible to transaction #2. So how could it do nextval() on it? #2 has to wait for #1 to commit before it can do anything on the sequence, which enforces the correct ordering, no? > Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode, > I think that needs to act as a barrier: non-transactional changes that > happened before that transaction must also be replicated before that > transaction is replicated, and those that happened after that > transaction is replicated must be replayed after that transaction is > replicated.
Otherwise, at the very least, there will be states visible > on the standby that were never visible on the origin server, and maybe > we'll just straight up get the wrong answer. For instance: > > 1. nextval, setting last_value to 3 > 2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19 > 3. nextval, setting last_value to 20 > > If 3 happens before 2, the sequence ends up in the wrong state. > > Maybe you've already got this and similar cases totally correctly > handled, I'm not sure, just throwing it out there. > I believe this should behave correctly too, thanks to locking. If a transaction does ALTER SEQUENCE, that locks the sequence, so only that transaction can do stuff with that sequence (and changes from that point are treated as transactional). And everyone else is waiting for #1 to commit. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
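For context, the serialization appealed to here comes from the lock levels in sequence.c: nextval() goes through lock_and_open_sequence(), which takes RowExclusiveLock and keeps it until end of transaction, while ALTER SEQUENCE holds the conflicting AccessExclusiveLock - so step 3 in the example cannot run until the ALTER in step 2 has committed or aborted. Slightly elided, with comments paraphrased:

    static Relation
    lock_and_open_sequence(SeqTable seq)
    {
        LocalTransactionId thislxid = MyProc->lxid;

        /* take the lock once per transaction; it is held to commit/abort */
        if (seq->lxid != thislxid)
        {
            ...                     /* resource-owner bookkeeping elided */
            LockRelationOid(seq->relid, RowExclusiveLock);
            seq->lxid = thislxid;
        }

        /* the lock is already held, so the rel can be opened without one */
        return relation_open(seq->relid, NoLock);
    }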
Hi, Heikki, CCed you due to the point about 2c03216d8311 below. On 2023-01-10 19:32:12 +0100, Tomas Vondra wrote: > 0001 is a fix for the pre-existing issue in logicalmsg_decode, > attempting to build a snapshot before getting into a consistent state. > AFAICS this only affects assert-enabled builds and is otherwise > harmless, because we are not actually using the snapshot (apply gets a > valid snapshot from the transaction). LGTM. > 0002 is a rebased version of the original approach, committed as > 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix > as 0001 (for the sequence messages), the primary reason for the revert. > > The rebase was not quite straightforward, due to extensive changes in > how publications deal with tables/schemas, and so on. So this adopts > them, but other than that it behaves just like the original patch. This is a huge diff: > 72 files changed, 4715 insertions(+), 612 deletions(-) It'd be nice to split it to make review easier. Perhaps the sequence decoding support could be split from the whole publication rigamarole? > This does not include any changes to test_decoding and/or the built-in > replication - those will be committed in separate patches. Looks like that's not the case anymore? > +/* > + * Update the sequence state by modifying the existing sequence data row. > + * > + * This keeps the same relfilenode, so the behavior is non-transactional. > + */ > +static void > +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) > +{ > + SeqTable elm; > + Relation seqrel; > + Buffer buf; > + HeapTupleData seqdatatuple; > + Form_pg_sequence_data seq; > + > + /* open and lock sequence */ > + init_sequence(seqrelid, &elm, &seqrel); > + > + /* lock page' buffer and read tuple */ > + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); > + > + /* check the comment above nextval_internal()'s equivalent call. */ > + if (RelationNeedsWAL(seqrel)) > + { > + GetTopTransactionId(); > + > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > + } > + > + /* ready to change the on-disk (or really, in-buffer) tuple */ > + START_CRIT_SECTION(); > + > + seq->last_value = last_value; > + seq->is_called = is_called; > + seq->log_cnt = log_cnt; > + > + MarkBufferDirty(buf); > + > + /* XLOG stuff */ > + if (RelationNeedsWAL(seqrel)) > + { > + xl_seq_rec xlrec; > + XLogRecPtr recptr; > + Page page = BufferGetPage(buf); > + > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); > + > + xlrec.locator = seqrel->rd_locator; > + xlrec.created = false; > + > + XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec)); > + XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len); > + > + recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); > + > + PageSetLSN(page, recptr); > + } > + > + END_CRIT_SECTION(); > + > + UnlockReleaseBuffer(buf); > + > + /* Clear local cache so that we don't think we have cached numbers */ > + /* Note that we do not change the currval() state */ > + elm->cached = elm->last; > + > + relation_close(seqrel, NoLock); > +} > + > +/* > + * Update the sequence state by creating a new relfilenode. > + * > + * This creates a new relfilenode, to allow transactional behavior. 
> + */ > +static void > +SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called) > +{ > + SeqTable elm; > + Relation seqrel; > + Buffer buf; > + HeapTupleData seqdatatuple; > + Form_pg_sequence_data seq; > + HeapTuple tuple; > + > + /* open and lock sequence */ > + init_sequence(seq_relid, &elm, &seqrel); > + > + /* lock page' buffer and read tuple */ > + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); > + > + /* Copy the existing sequence tuple. */ > + tuple = heap_copytuple(&seqdatatuple); > + > + /* Now we're done with the old page */ > + UnlockReleaseBuffer(buf); > + > + /* > + * Modify the copied tuple to update the sequence state (similar to what > + * ResetSequence does). > + */ > + seq = (Form_pg_sequence_data) GETSTRUCT(tuple); > + seq->last_value = last_value; > + seq->is_called = is_called; > + seq->log_cnt = log_cnt; > + > + /* > + * Create a new storage file for the sequence - this is needed for the > + * transactional behavior. > + */ > + RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence); > + > + /* > + * Ensure sequence's relfrozenxid is at 0, since it won't contain any > + * unfrozen XIDs. Same with relminmxid, since a sequence will never > + * contain multixacts. > + */ > + Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId); > + Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId); > + > + /* > + * Insert the modified tuple into the new storage file. This does all the > + * necessary WAL-logging etc. > + */ > + fill_seq_with_data(seqrel, tuple); > + > + /* Clear local cache so that we don't think we have cached numbers */ > + /* Note that we do not change the currval() state */ > + elm->cached = elm->last; > + > + relation_close(seqrel, NoLock); > +} > + > +/* > + * Set a sequence to a specified internal state. > + * > + * The change is made transactionally, so that on failure of the current > + * transaction, the sequence will be restored to its previous state. > + * We do that by creating a whole new relfilenode for the sequence; so this > + * works much like the rewriting forms of ALTER TABLE. > + * > + * Caller is assumed to have acquired AccessExclusiveLock on the sequence, > + * which must not be released until end of transaction. Caller is also > + * responsible for permissions checking. > + */ > +void > +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) > +{ > + if (transactional) > + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); > + else > + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); > +} That's a lot of duplication with existing code. There's no explanation why SetSequence() as well as do_setval() exists. > /* > * Initialize a sequence's relation with the specified tuple as content > * > @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) > > /* check the comment above nextval_internal()'s equivalent call. */ > if (RelationNeedsWAL(rel)) > + { > GetTopTransactionId(); > > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > + } Is it actually possible to reach this without an xid already having been assigned for the current xact? > @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) > * It's sufficient to ensure the toplevel transaction has an xid, no need > * to assign xids subxacts, that'll already trigger an appropriate wait. 
> * (Have to do that here, so we're outside the critical section) > + * > + * We have to ensure we have a proper XID, which will be included in > + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() > + * in a subxact (without any preceding changes) would get XID 0, and it > + * would then be impossible to decide which top xact it belongs to. > + * It'd also trigger assert in DecodeSequence. We only do that with > + * wal_level=logical, though. > + * > + * XXX This might seem unnecessary, because if there's no XID the xact > + * couldn't have done anything important yet, e.g. it could not have > + * created a sequence. But that's incorrect, because of subxacts. The > + * current subtransaction might not have done anything yet (thus no XID), > + * but an earlier one might have created the sequence. > */ What about restricting this to the case you're mentioning, i.e. subtransactions? > @@ -845,6 +1023,7 @@ nextval_internal(Oid relid, bool check_permissions) > seq->log_cnt = 0; > > xlrec.locator = seqrel->rd_locator; I realize this isn't from this patch, but: Why do we include the locator in the record? We already have it via XLogRegisterBuffer(), no? And afaict we don't even use it, as we read the page via XLogInitBufferForRedo() during recovery. Kinda looks like an oversight in 2c03216d8311 > +/* > + * Handle sequence decode > + * > + * Decoding sequences is a bit tricky, because while most sequence actions > + * are non-transactional (not subject to rollback), some need to be handled > + * as transactional. > + * > + * By default, a sequence increment is non-transactional - we must not queue > + * it in a transaction as other changes, because the transaction might get > + * rolled back and we'd discard the increment. The downstream would not be > + * notified about the increment, which is wrong. > + * > + * On the other hand, the sequence may be created in a transaction. In this > + * case we *should* queue the change as other changes in the transaction, > + * because we don't want to send the increments for unknown sequence to the > + * plugin - it might get confused about which sequence it's related to etc. > + */ > +void > +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > +{ > + /* extract the WAL record, with "created" flag */ > + xlrec = (xl_seq_rec *) XLogRecGetData(r); > + > + /* XXX how could we have sequence change without data? */ > + if(!datalen || !tupledata) > + return; Yea, I think we should error out here instead, something has gone quite wrong if this happens. > + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); > + DecodeSeqTuple(tupledata, datalen, tuplebuf); > + > + /* > + * Should we handle the sequence increment as transactional or not? > + * > + * If the sequence was created in a still-running transaction, treat > + * it as transactional and queue the increments. Otherwise it needs > + * to be treated as non-transactional, in which case we send it to > + * the plugin right away. > + */ > + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, > + target_locator, > + xlrec->created); Why re-create this information during decoding, when we basically already have it available on the primary? I think we already pay the price for that tracking, which we e.g. use for doing a non-transactional truncate: /* * Normally, we need a transaction-safe truncation here. 
However, if * the table was either created in the current (sub)transaction or has * a new relfilenumber in the current (sub)transaction, then we can * just truncate it in-place, because a rollback would cause the whole * table or the current physical file to be thrown away anyway. */ if (rel->rd_createSubid == mySubid || rel->rd_newRelfilelocatorSubid == mySubid) { /* Immediate, non-rollbackable truncation is OK */ heap_truncate_one_rel(rel); } Afaict we could do something similar for sequences, except that I think we would just check if the sequence was created in the current transaction (i.e. any of the fields are set). > +/* > + * A transactional sequence increment is queued to be processed upon commit > + * and a non-transactional increment gets processed immediately. > + * > + * A sequence update may be both transactional and non-transactional. When > + * created in a running transaction, treat it as transactional and queue > + * the change in it. Otherwise treat it as non-transactional, so that we > + * don't forget the increment in case of a rollback. > + */ > +void > +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, > + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, > + RelFileLocator rlocator, bool transactional, bool created, > + ReorderBufferTupleBuf *tuplebuf) > + /* > + * Decoding needs access to syscaches et al., which in turn use > + * heavyweight locks and such. Thus we need to have enough state around to > + * keep track of those. The easiest way is to simply use a transaction > + * internally. That also allows us to easily enforce that nothing writes > + * to the database by checking for xid assignments. > + * > + * When we're called via the SQL SRF there's already a transaction > + * started, so start an explicit subtransaction there. > + */ > + using_subtxn = IsTransactionOrTransactionBlock(); This duplicates a lot of the code from ReorderBufferProcessTXN(). But only does so partially. It's hard to tell whether some of the differences are intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? Maybe something like void ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); void ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); Greetings, Andres Freund
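Following that suggestion through, the primary-side check could be as simple as the sketch below - the function is an assumption, not existing code; the rd_* fields are the same ones the quoted truncate logic relies on:

    /*
     * Decide at WAL-logging time whether a sequence change has to be
     * decoded transactionally: if the sequence was created, or got a new
     * relfilenumber, in the current (sub)transaction, a rollback throws
     * it away entirely, so its changes belong inside the transaction.
     */
    static inline bool
    sequence_change_is_transactional(Relation seqrel)
    {
        return seqrel->rd_createSubid != InvalidSubTransactionId ||
               seqrel->rd_newRelfilelocatorSubid != InvalidSubTransactionId ||
               seqrel->rd_firstRelfilelocatorSubid != InvalidSubTransactionId;
    }

The result could then travel in the WAL record itself (say, next to the existing "created" flag in xl_seq_rec), sparing the decoder the relfilenode tracking that ReorderBufferSequenceIsTransactional() has to do during decoding.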
On Wed, Jan 11, 2023 at 1:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I agree that it's fine for the sequence to be slightly ahead, but I > > think that it can't be too far ahead without causing problems. Suppose > > for example that transaction #1 creates a sequence. Transaction #2 > > does nextval on the sequence a bunch of times and inserts rows into a > > table using the sequence values as the PK. It's fine if the nextval > > operations are replicated ahead of the commit of transaction #2 -- in > > fact I'd say it's necessary for correctness -- but they can't precede > > the commit of transaction #1, since then the sequence won't exist yet. > > It's not clear to me how could that even happen. If transaction #1 > creates a sequence, it's invisible for transaction #2. So how could it > do nextval() on it? #2 has to wait for #1 to commit before it can do > anything on the sequence, which enforces the correct ordering, no? Yeah, I meant if #1 had committed and then #2 started to do its thing. I was worried that decoding might reach the nextval operations in transaction #2 before it replayed #1. This worry may be entirely based on me not understanding how this actually works. Do we always apply a transaction as soon as we see the commit record for it, before decoding any further? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-01-11 15:23:18 -0500, Robert Haas wrote: > Yeah, I meant if #1 had committed and then #2 started to do its thing. > I was worried that decoding might reach the nextval operations in > transaction #2 before it replayed #1. > > This worry may be entirely based on me not understanding how this > actually works. Do we always apply a transaction as soon as we see the > commit record for it, before decoding any further? Yes. Otherwise we'd have a really hard time figuring out the correct historical snapshot to use for subsequent transactions - they'd have been able to see the catalog modifications made by the committing transaction. Greetings, Andres Freund
On Wed, Jan 11, 2023 at 3:28 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-11 15:23:18 -0500, Robert Haas wrote: > > Yeah, I meant if #1 had committed and then #2 started to do its thing. > > I was worried that decoding might reach the nextval operations in > > transaction #2 before it replayed #1. > > > > This worry may be entirely based on me not understanding how this > > actually works. Do we always apply a transaction as soon as we see the > > commit record for it, before decoding any further? > > Yes. > > Otherwise we'd have a really hard time figuring out the correct historical > snapshot to use for subsequent transactions - they'd have been able to see the > catalog modifications made by the committing transaction. I wonder, then, what happens if somebody wants to do parallel apply. That would seem to require some relaxation of this rule, but then doesn't that break what this patch wants to do? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-01-11 15:41:45 -0500, Robert Haas wrote: > I wonder, then, what happens if somebody wants to do parallel apply. That > would seem to require some relaxation of this rule, but then doesn't that > break what this patch wants to do? I don't think it'd pose a direct problem - presumably you'd only parallelize applying changes, not committing the transactions containing them. You'd get a lot of inconsistencies otherwise. If you're thinking of decoding changes in parallel (rather than streaming out large changes before commit when possible), you'd only be able to do that in cases when transactions haven't performed catalog changes, I think. In which case there'd also be no issue wrt transactional sequence changes. Greetings, Andres Freund
On 1/11/23 21:58, Andres Freund wrote: > Hi, > > On 2023-01-11 15:41:45 -0500, Robert Haas wrote: >> I wonder, then, what happens if somebody wants to do parallel apply. That >> would seem to require some relaxation of this rule, but then doesn't that >> break what this patch wants to do? > > I don't think it'd pose a direct problem - presumably you'd only parallelize > applying changes, not committing the transactions containing them. You'd get a > lot of inconsistencies otherwise. > Right. It's the commit order that matters - as long as that's maintained, the result should be consistent etc. There are plenty of other hard problems, though - for example it's trivial for the apply workers to apply the changes in the incorrect order (contradicting commit order) and then hit a deadlock. And the deadlock detector may easily keep aborting the incorrect worker (the oldest one), so that the replication grinds down to a halt. I was wondering recently how far we would get by just doing prefetch for logical apply - instead of applying the changes, just try doing a lookup on the replica identity values, and then do a simple serial apply. > If you're thinking of decoding changes in parallel (rather than streaming out > large changes before commit when possible), you'd only be able to do that in > cases when transactions haven't performed catalog changes, I think. In which > case there'd also be no issue wrt transactional sequence changes. > Perhaps, although it's not clear to me how you would know that in advance. I mean, you could start decoding changes in parallel, and then you find that one of the earlier transactions touched a catalog. But maybe I misunderstand what "decoding" refers to - don't we need the snapshot only in reorderbuffer? In which case all the other stuff could be parallelized (not sure if that's really expensive). Anyway, all of this is far out of scope of this patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/11/23 21:12, Andres Freund wrote: > Hi, > > > Heikki, CCed you due to the point about 2c03216d8311 below. > > > On 2023-01-10 19:32:12 +0100, Tomas Vondra wrote: >> 0001 is a fix for the pre-existing issue in logicalmsg_decode, >> attempting to build a snapshot before getting into a consistent state. >> AFAICS this only affects assert-enabled builds and is otherwise >> harmless, because we are not actually using the snapshot (apply gets a >> valid snapshot from the transaction). > > LGTM. > > >> 0002 is a rebased version of the original approach, committed as >> 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix >> as 0001 (for the sequence messages), the primary reason for the revert. >> >> The rebase was not quite straightforward, due to extensive changes in >> how publications deal with tables/schemas, and so on. So this adopts >> them, but other than that it behaves just like the original patch. > > This is a huge diff: >> 72 files changed, 4715 insertions(+), 612 deletions(-) > > It'd be nice to split it to make review easier. Perhaps the sequence decoding > support could be split from the whole publication rigamarole? > > >> This does not include any changes to test_decoding and/or the built-in >> replication - those will be committed in separate patches. > > Looks like that's not the case anymore? > Ah, right! Now I realized I originally committed this in chunks, but the revert was a single commit. And I just "reverted the revert" to create this patch. I'll definitely split this into smaller patches. This also explains the obsolete commit message about test_decoding not being included, etc. > >> +/* >> + * Update the sequence state by modifying the existing sequence data row. >> + * >> + * This keeps the same relfilenode, so the behavior is non-transactional. >> + */ >> +static void >> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) >> +{ >> + SeqTable elm; >> + Relation seqrel; >> + Buffer buf; >> + HeapTupleData seqdatatuple; >> + Form_pg_sequence_data seq; >> + >> + /* open and lock sequence */ >> + init_sequence(seqrelid, &elm, &seqrel); >> + >> + /* lock page' buffer and read tuple */ >> + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); >> + >> + /* check the comment above nextval_internal()'s equivalent call. 
*/ >> + if (RelationNeedsWAL(seqrel)) >> + { >> + GetTopTransactionId(); >> + >> + if (XLogLogicalInfoActive()) >> + GetCurrentTransactionId(); >> + } >> + >> + /* ready to change the on-disk (or really, in-buffer) tuple */ >> + START_CRIT_SECTION(); >> + >> + seq->last_value = last_value; >> + seq->is_called = is_called; >> + seq->log_cnt = log_cnt; >> + >> + MarkBufferDirty(buf); >> + >> + /* XLOG stuff */ >> + if (RelationNeedsWAL(seqrel)) >> + { >> + xl_seq_rec xlrec; >> + XLogRecPtr recptr; >> + Page page = BufferGetPage(buf); >> + >> + XLogBeginInsert(); >> + XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); >> + >> + xlrec.locator = seqrel->rd_locator; >> + xlrec.created = false; >> + >> + XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec)); >> + XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len); >> + >> + recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); >> + >> + PageSetLSN(page, recptr); >> + } >> + >> + END_CRIT_SECTION(); >> + >> + UnlockReleaseBuffer(buf); >> + >> + /* Clear local cache so that we don't think we have cached numbers */ >> + /* Note that we do not change the currval() state */ >> + elm->cached = elm->last; >> + >> + relation_close(seqrel, NoLock); >> +} >> + >> +/* >> + * Update the sequence state by creating a new relfilenode. >> + * >> + * This creates a new relfilenode, to allow transactional behavior. >> + */ >> +static void >> +SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called) >> +{ >> + SeqTable elm; >> + Relation seqrel; >> + Buffer buf; >> + HeapTupleData seqdatatuple; >> + Form_pg_sequence_data seq; >> + HeapTuple tuple; >> + >> + /* open and lock sequence */ >> + init_sequence(seq_relid, &elm, &seqrel); >> + >> + /* lock page' buffer and read tuple */ >> + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); >> + >> + /* Copy the existing sequence tuple. */ >> + tuple = heap_copytuple(&seqdatatuple); >> + >> + /* Now we're done with the old page */ >> + UnlockReleaseBuffer(buf); >> + >> + /* >> + * Modify the copied tuple to update the sequence state (similar to what >> + * ResetSequence does). >> + */ >> + seq = (Form_pg_sequence_data) GETSTRUCT(tuple); >> + seq->last_value = last_value; >> + seq->is_called = is_called; >> + seq->log_cnt = log_cnt; >> + >> + /* >> + * Create a new storage file for the sequence - this is needed for the >> + * transactional behavior. >> + */ >> + RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence); >> + >> + /* >> + * Ensure sequence's relfrozenxid is at 0, since it won't contain any >> + * unfrozen XIDs. Same with relminmxid, since a sequence will never >> + * contain multixacts. >> + */ >> + Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId); >> + Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId); >> + >> + /* >> + * Insert the modified tuple into the new storage file. This does all the >> + * necessary WAL-logging etc. >> + */ >> + fill_seq_with_data(seqrel, tuple); >> + >> + /* Clear local cache so that we don't think we have cached numbers */ >> + /* Note that we do not change the currval() state */ >> + elm->cached = elm->last; >> + >> + relation_close(seqrel, NoLock); >> +} >> + >> +/* >> + * Set a sequence to a specified internal state. >> + * >> + * The change is made transactionally, so that on failure of the current >> + * transaction, the sequence will be restored to its previous state. >> + * We do that by creating a whole new relfilenode for the sequence; so this >> + * works much like the rewriting forms of ALTER TABLE. 
>> + * >> + * Caller is assumed to have acquired AccessExclusiveLock on the sequence, >> + * which must not be released until end of transaction. Caller is also >> + * responsible for permissions checking. >> + */ >> +void >> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) >> +{ >> + if (transactional) >> + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); >> + else >> + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); >> +} > > That's a lot of duplication with existing code. There's no explanation why > SetSequence() as well as do_setval() exists. > Thanks, I'll look into this. > >> /* >> * Initialize a sequence's relation with the specified tuple as content >> * >> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) >> >> /* check the comment above nextval_internal()'s equivalent call. */ >> if (RelationNeedsWAL(rel)) >> + { >> GetTopTransactionId(); >> >> + if (XLogLogicalInfoActive()) >> + GetCurrentTransactionId(); >> + } > > Is it actually possible to reach this without an xid already having been > assigned for the current xact? > I believe it is. That's probably how I found this change is needed, actually. > > >> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) >> * It's sufficient to ensure the toplevel transaction has an xid, no need >> * to assign xids subxacts, that'll already trigger an appropriate wait. >> * (Have to do that here, so we're outside the critical section) >> + * >> + * We have to ensure we have a proper XID, which will be included in >> + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() >> + * in a subxact (without any preceding changes) would get XID 0, and it >> + * would then be impossible to decide which top xact it belongs to. >> + * It'd also trigger assert in DecodeSequence. We only do that with >> + * wal_level=logical, though. >> + * >> + * XXX This might seem unnecessary, because if there's no XID the xact >> + * couldn't have done anything important yet, e.g. it could not have >> + * created a sequence. But that's incorrect, because of subxacts. The >> + * current subtransaction might not have done anything yet (thus no XID), >> + * but an earlier one might have created the sequence. >> */ > > What about restricting this to the case you're mentioning, > i.e. subtransactions? > That might work, but I need to think about it a bit. I don't think it'd save us much, though. I mean, vast majority of transactions (and subtransactions) calling nextval() will then do something else which requires a XID. This just moves the XID a bit, that's all. > >> @@ -845,6 +1023,7 @@ nextval_internal(Oid relid, bool check_permissions) >> seq->log_cnt = 0; >> >> xlrec.locator = seqrel->rd_locator; > > I realize this isn't from this patch, but: > > Why do we include the locator in the record? We already have it via > XLogRegisterBuffer(), no? And afaict we don't even use it, as we read the page > via XLogInitBufferForRedo() during recovery. > > Kinda looks like an oversight in 2c03216d8311 > I don't know, it's what the code did. > > > >> +/* >> + * Handle sequence decode >> + * >> + * Decoding sequences is a bit tricky, because while most sequence actions >> + * are non-transactional (not subject to rollback), some need to be handled >> + * as transactional. 
>> + * >> + * By default, a sequence increment is non-transactional - we must not queue >> + * it in a transaction as other changes, because the transaction might get >> + * rolled back and we'd discard the increment. The downstream would not be >> + * notified about the increment, which is wrong. >> + * >> + * On the other hand, the sequence may be created in a transaction. In this >> + * case we *should* queue the change as other changes in the transaction, >> + * because we don't want to send the increments for unknown sequence to the >> + * plugin - it might get confused about which sequence it's related to etc. >> + */ >> +void >> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) >> +{ > >> + /* extract the WAL record, with "created" flag */ >> + xlrec = (xl_seq_rec *) XLogRecGetData(r); >> + >> + /* XXX how could we have sequence change without data? */ >> + if(!datalen || !tupledata) >> + return; > > Yea, I think we should error out here instead, something has gone quite wrong > if this happens. > OK > >> + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); >> + DecodeSeqTuple(tupledata, datalen, tuplebuf); >> + >> + /* >> + * Should we handle the sequence increment as transactional or not? >> + * >> + * If the sequence was created in a still-running transaction, treat >> + * it as transactional and queue the increments. Otherwise it needs >> + * to be treated as non-transactional, in which case we send it to >> + * the plugin right away. >> + */ >> + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, >> + target_locator, >> + xlrec->created); > > Why re-create this information during decoding, when we basically already have > it available on the primary? I think we already pay the price for that > tracking, which we e.g. use for doing a non-transactional truncate: > > /* > * Normally, we need a transaction-safe truncation here. However, if > * the table was either created in the current (sub)transaction or has > * a new relfilenumber in the current (sub)transaction, then we can > * just truncate it in-place, because a rollback would cause the whole > * table or the current physical file to be thrown away anyway. > */ > if (rel->rd_createSubid == mySubid || > rel->rd_newRelfilelocatorSubid == mySubid) > { > /* Immediate, non-rollbackable truncation is OK */ > heap_truncate_one_rel(rel); > } > > Afaict we could do something similar for sequences, except that I think we > would just check if the sequence was created in the current transaction > (i.e. any of the fields are set). > Hmm, good point. > >> +/* >> + * A transactional sequence increment is queued to be processed upon commit >> + * and a non-transactional increment gets processed immediately. >> + * >> + * A sequence update may be both transactional and non-transactional. When >> + * created in a running transaction, treat it as transactional and queue >> + * the change in it. Otherwise treat it as non-transactional, so that we >> + * don't forget the increment in case of a rollback. >> + */ >> +void >> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, >> + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, >> + RelFileLocator rlocator, bool transactional, bool created, >> + ReorderBufferTupleBuf *tuplebuf) > > >> + /* >> + * Decoding needs access to syscaches et al., which in turn use >> + * heavyweight locks and such. Thus we need to have enough state around to >> + * keep track of those. The easiest way is to simply use a transaction >> + * internally. 
That also allows us to easily enforce that nothing writes >> + * to the database by checking for xid assignments. >> + * >> + * When we're called via the SQL SRF there's already a transaction >> + * started, so start an explicit subtransaction there. >> + */ >> + using_subtxn = IsTransactionOrTransactionBlock(); > > This duplicates a lot of the code from ReorderBufferProcessTXN(). But only > does so partially. It's hard to tell whether some of the differences are > intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? > > Maybe something like > > void > ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); > > void > ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); > Thanks for the suggestion, I'll definitely consider that in the next version of the patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
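A minimal SQL illustration of the transactional vs. non-transactional distinction discussed above (this is plain PostgreSQL behavior, nothing patch-specific):

CREATE SEQUENCE s;

BEGIN;
SELECT nextval('s');    -- returns 1
ROLLBACK;
SELECT nextval('s');    -- returns 2; the increment survived the rollback

BEGIN;
CREATE SEQUENCE s2;
SELECT nextval('s2');   -- returns 1
ROLLBACK;               -- s2 and its increment are both discarded

The first case is why increments generally have to be sent to the output plugin right away, and the second is why increments of a sequence created in a still-running transaction have to be queued with that transaction.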
Hi, On 2023-01-11 22:30:42 +0100, Tomas Vondra wrote: > On 1/11/23 21:58, Andres Freund wrote: > > If you're thinking of decoding changes in parallel (rather than streaming out > > large changes before commit when possible), you'd only be able to do that in > > cases when transactions haven't performed catalog changes, I think. In which > > case there'd also be no issue wrt transactional sequence changes. > > > > Perhaps, although it's not clear to me how you would know that in > advance? I mean, you could start decoding changes in parallel, and then > you find one of the earlier transactions touched a catalog. You could have a running count of in-progress catalog modifying transactions and not allow parallelized processing when that's not 0. > But maybe I misunderstand what "decoding" refers to - don't we need the > snapshot only in reorderbuffer? In which case all the other stuff could > be parallelized (not sure if that's really expensive). Calling output functions is pretty expensive, so being able to call those in parallel has some benefits. But I don't think we're there. > Anyway, all of this is far out of scope of this patch. Yea, clearly that's independent work. And I don't think relying on commit order in one more place, i.e. for sequences, would make it harder. Greetings, Andres Freund
Hi, here's a slightly updated version - the main change is splitting the patch into multiple parts, along the lines of the original patch reverted in 2c7ea57e56ca5f668c32d4266e0a3e45b455bef5: - basic sequence decoding infrastructure - support in test_decoding - support in built-in logical replication The revert mentions a couple additional parts, but those were mostly fixes / improvements. And those are not merged into the three parts. On 1/11/23 22:46, Tomas Vondra wrote: > >>... >> >>> +/* >>> + * Update the sequence state by modifying the existing sequence data row. >>> + * >>> + * This keeps the same relfilenode, so the behavior is non-transactional. >>> + */ >>> +static void >>> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) >>> +{ >>> ... >>> >>> +void >>> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) >>> +{ >>> + if (transactional) >>> + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); >>> + else >>> + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); >>> +} >> >> That's a lot of duplication with existing code. There's no explanation why >> SetSequence() as well as do_setval() exists. >> > > Thanks, I'll look into this. > I haven't done anything about this yet. The functions are doing similar things, but there's also a fair number of differences so I haven't found a good way to merge them yet. >> >>> /* >>> * Initialize a sequence's relation with the specified tuple as content >>> * >>> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) >>> >>> /* check the comment above nextval_internal()'s equivalent call. */ >>> if (RelationNeedsWAL(rel)) >>> + { >>> GetTopTransactionId(); >>> >>> + if (XLogLogicalInfoActive()) >>> + GetCurrentTransactionId(); >>> + } >> >> Is it actually possible to reach this without an xid already having been >> assigned for the current xact? >> > > I believe it is. That's probably how I found this change is needed, > actually. > I've added a comment explaining why this is needed. I don't think it's worth trying to optimize this, because in plausible workloads we'd just delay the work a little bit. >> >> >>> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) >>> * It's sufficient to ensure the toplevel transaction has an xid, no need >>> * to assign xids subxacts, that'll already trigger an appropriate wait. >>> * (Have to do that here, so we're outside the critical section) >>> + * >>> + * We have to ensure we have a proper XID, which will be included in >>> + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() >>> + * in a subxact (without any preceding changes) would get XID 0, and it >>> + * would then be impossible to decide which top xact it belongs to. >>> + * It'd also trigger assert in DecodeSequence. We only do that with >>> + * wal_level=logical, though. >>> + * >>> + * XXX This might seem unnecessary, because if there's no XID the xact >>> + * couldn't have done anything important yet, e.g. it could not have >>> + * created a sequence. But that's incorrect, because of subxacts. The >>> + * current subtransaction might not have done anything yet (thus no XID), >>> + * but an earlier one might have created the sequence. >>> */ >> >> What about restricting this to the case you're mentioning, >> i.e. subtransactions? >> > > That might work, but I need to think about it a bit. > > I don't think it'd save us much, though.
I mean, vast majority of > transactions (and subtransactions) calling nextval() will then do > something else which requires a XID. This just moves the XID a bit, > that's all. > After thinking about this a bit more, I don't think the optimization is worth it, for the reasons explained above. >> >>> +/* >>> + * Handle sequence decode >>> + * >>> + * Decoding sequences is a bit tricky, because while most sequence actions >>> + * are non-transactional (not subject to rollback), some need to be handled >>> + * as transactional. >>> + * >>> + * By default, a sequence increment is non-transactional - we must not queue >>> + * it in a transaction as other changes, because the transaction might get >>> + * rolled back and we'd discard the increment. The downstream would not be >>> + * notified about the increment, which is wrong. >>> + * >>> + * On the other hand, the sequence may be created in a transaction. In this >>> + * case we *should* queue the change as other changes in the transaction, >>> + * because we don't want to send the increments for unknown sequence to the >>> + * plugin - it might get confused about which sequence it's related to etc. >>> + */ >>> +void >>> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) >>> +{ >> >>> + /* extract the WAL record, with "created" flag */ >>> + xlrec = (xl_seq_rec *) XLogRecGetData(r); >>> + >>> + /* XXX how could we have sequence change without data? */ >>> + if(!datalen || !tupledata) >>> + return; >> >> Yea, I think we should error out here instead, something has gone quite wrong >> if this happens. >> > > OK > Done. >> >>> + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); >>> + DecodeSeqTuple(tupledata, datalen, tuplebuf); >>> + >>> + /* >>> + * Should we handle the sequence increment as transactional or not? >>> + * >>> + * If the sequence was created in a still-running transaction, treat >>> + * it as transactional and queue the increments. Otherwise it needs >>> + * to be treated as non-transactional, in which case we send it to >>> + * the plugin right away. >>> + */ >>> + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, >>> + target_locator, >>> + xlrec->created); >> >> Why re-create this information during decoding, when we basically already have >> it available on the primary? I think we already pay the price for that >> tracking, which we e.g. use for doing a non-transactional truncate: >> >> /* >> * Normally, we need a transaction-safe truncation here. However, if >> * the table was either created in the current (sub)transaction or has >> * a new relfilenumber in the current (sub)transaction, then we can >> * just truncate it in-place, because a rollback would cause the whole >> * table or the current physical file to be thrown away anyway. >> */ >> if (rel->rd_createSubid == mySubid || >> rel->rd_newRelfilelocatorSubid == mySubid) >> { >> /* Immediate, non-rollbackable truncation is OK */ >> heap_truncate_one_rel(rel); >> } >> >> Afaict we could do something similar for sequences, except that I think we >> would just check if the sequence was created in the current transaction >> (i.e. any of the fields are set). >> > > Hmm, good point. > But rd_createSubid/rd_newRelfilelocatorSubid fields are available only in the original transaction, not during decoding. So we'd have to do this check there and add the result to the WAL record. Is that what you had in mind? 
>> >>> +/* >>> + * A transactional sequence increment is queued to be processed upon commit >>> + * and a non-transactional increment gets processed immediately. >>> + * >>> + * A sequence update may be both transactional and non-transactional. When >>> + * created in a running transaction, treat it as transactional and queue >>> + * the change in it. Otherwise treat it as non-transactional, so that we >>> + * don't forget the increment in case of a rollback. >>> + */ >>> +void >>> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, >>> + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, >>> + RelFileLocator rlocator, bool transactional, bool created, >>> + ReorderBufferTupleBuf *tuplebuf) >> >> >>> + /* >>> + * Decoding needs access to syscaches et al., which in turn use >>> + * heavyweight locks and such. Thus we need to have enough state around to >>> + * keep track of those. The easiest way is to simply use a transaction >>> + * internally. That also allows us to easily enforce that nothing writes >>> + * to the database by checking for xid assignments. >>> + * >>> + * When we're called via the SQL SRF there's already a transaction >>> + * started, so start an explicit subtransaction there. >>> + */ >>> + using_subtxn = IsTransactionOrTransactionBlock(); >> >> This duplicates a lot of the code from ReorderBufferProcessTXN(). But only >> does so partially. It's hard to tell whether some of the differences are >> intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? >> >> Maybe something like >> >> void >> ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); >> >> void >> ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); >> > > Thanks for the suggestion, I'll definitely consider that in the next > version of the patch. I did look at the code a bit, but I'm not sure there really is a lot of duplicated code - yes, we start/abort the (sub)transaction, setup and tear down the snapshot, etc. Or what else would you put into the two new functions? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
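For reference, the subxact case the XXX comment describes can be constructed like this (a sketch, assuming wal_level=logical with the patch applied):

BEGIN;
SAVEPOINT a;
CREATE SEQUENCE s3;     -- subxact "a" gets an XID assigned
RELEASE SAVEPOINT a;
SAVEPOINT b;
SELECT nextval('s3');   -- first action in subxact "b"; without forcing an
                        -- XID assignment here, the WAL record would carry
                        -- XID 0, and decoding could not tell which
                        -- top-level xact the increment belongs to
COMMIT;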
Attachment
cfbot didn't like the rebased / split patch, and after looking at it I believe it's a bug in parallel apply of large transactions (216a784829), which seems to have changed interpretation of in_remote_transaction and in_streamed_transaction. I've reported the issue on that thread [1], but here's a version with a temporary workaround so that we can continue reviewing it. regards [1] https://www.postgresql.org/message-id/984ff689-adde-9977-affe-cd6029e850be%40enterprisedb.com On 1/15/23 00:39, Tomas Vondra wrote: > Hi, > > here's a slightly updated version - the main change is splitting the > patch into multiple parts, along the lines of the original patch > reverted in 2c7ea57e56ca5f668c32d4266e0a3e45b455bef5: > > - basic sequence decoding infrastructure > - support in test_decoding > - support in built-in logical replication > > The revert mentions a couple additional parts, but those were mostly > fixes / improvements. And those are not merged into the three parts. > > > On 1/11/23 22:46, Tomas Vondra wrote: >> >>> ... >>> >>>> +/* >>>> + * Update the sequence state by modifying the existing sequence data row. >>>> + * >>>> + * This keeps the same relfilenode, so the behavior is non-transactional. >>>> + */ >>>> +static void >>>> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) >>>> +{ >>>> ... >>>> >>>> +void >>>> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) >>>> +{ >>>> + if (transactional) >>>> + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); >>>> + else >>>> + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); >>>> +} >>> >>> That's a lot of duplication with existing code. There's no explanation why >>> SetSequence() as well as do_setval() exists. >>> >> >> Thanks, I'll look into this. >> > > I haven't done anything about this yet. The functions are doing similar > things, but there's also a fair amount of differences so I haven't found > a good way to merge them yet. > >>> >>>> /* >>>> * Initialize a sequence's relation with the specified tuple as content >>>> * >>>> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) >>>> >>>> /* check the comment above nextval_internal()'s equivalent call. */ >>>> if (RelationNeedsWAL(rel)) >>>> + { >>>> GetTopTransactionId(); >>>> >>>> + if (XLogLogicalInfoActive()) >>>> + GetCurrentTransactionId(); >>>> + } >>> >>> Is it actually possible to reach this without an xid already having been >>> assigned for the current xact? >>> >> >> I believe it is. That's probably how I found this change is needed, >> actually. >> > > I've added a comment explaining why this needed. I don't think it's > worth trying to optimize this, because in plausible workloads we'd just > delay the work a little bit. > >>> >>> >>>> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) >>>> * It's sufficient to ensure the toplevel transaction has an xid, no need >>>> * to assign xids subxacts, that'll already trigger an appropriate wait. >>>> * (Have to do that here, so we're outside the critical section) >>>> + * >>>> + * We have to ensure we have a proper XID, which will be included in >>>> + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() >>>> + * in a subxact (without any preceding changes) would get XID 0, and it >>>> + * would then be impossible to decide which top xact it belongs to. >>>> + * It'd also trigger assert in DecodeSequence. 
We only do that with >>>> + * wal_level=logical, though. >>>> + * >>>> + * XXX This might seem unnecessary, because if there's no XID the xact >>>> + * couldn't have done anything important yet, e.g. it could not have >>>> + * created a sequence. But that's incorrect, because of subxacts. The >>>> + * current subtransaction might not have done anything yet (thus no XID), >>>> + * but an earlier one might have created the sequence. >>>> */ >>> >>> What about restricting this to the case you're mentioning, >>> i.e. subtransactions? >>> >> >> That might work, but I need to think about it a bit. >> >> I don't think it'd save us much, though. I mean, vast majority of >> transactions (and subtransactions) calling nextval() will then do >> something else which requires a XID. This just moves the XID a bit, >> that's all. >> > > After thinking about this a bit more, I don't think the optimization is > worth it, for the reasons explained above. > >>> >>>> +/* >>>> + * Handle sequence decode >>>> + * >>>> + * Decoding sequences is a bit tricky, because while most sequence actions >>>> + * are non-transactional (not subject to rollback), some need to be handled >>>> + * as transactional. >>>> + * >>>> + * By default, a sequence increment is non-transactional - we must not queue >>>> + * it in a transaction as other changes, because the transaction might get >>>> + * rolled back and we'd discard the increment. The downstream would not be >>>> + * notified about the increment, which is wrong. >>>> + * >>>> + * On the other hand, the sequence may be created in a transaction. In this >>>> + * case we *should* queue the change as other changes in the transaction, >>>> + * because we don't want to send the increments for unknown sequence to the >>>> + * plugin - it might get confused about which sequence it's related to etc. >>>> + */ >>>> +void >>>> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) >>>> +{ >>> >>>> + /* extract the WAL record, with "created" flag */ >>>> + xlrec = (xl_seq_rec *) XLogRecGetData(r); >>>> + >>>> + /* XXX how could we have sequence change without data? */ >>>> + if(!datalen || !tupledata) >>>> + return; >>> >>> Yea, I think we should error out here instead, something has gone quite wrong >>> if this happens. >>> >> >> OK >> > > Done. > >>> >>>> + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); >>>> + DecodeSeqTuple(tupledata, datalen, tuplebuf); >>>> + >>>> + /* >>>> + * Should we handle the sequence increment as transactional or not? >>>> + * >>>> + * If the sequence was created in a still-running transaction, treat >>>> + * it as transactional and queue the increments. Otherwise it needs >>>> + * to be treated as non-transactional, in which case we send it to >>>> + * the plugin right away. >>>> + */ >>>> + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, >>>> + target_locator, >>>> + xlrec->created); >>> >>> Why re-create this information during decoding, when we basically already have >>> it available on the primary? I think we already pay the price for that >>> tracking, which we e.g. use for doing a non-transactional truncate: >>> >>> /* >>> * Normally, we need a transaction-safe truncation here. However, if >>> * the table was either created in the current (sub)transaction or has >>> * a new relfilenumber in the current (sub)transaction, then we can >>> * just truncate it in-place, because a rollback would cause the whole >>> * table or the current physical file to be thrown away anyway. 
>>> */ >>> if (rel->rd_createSubid == mySubid || >>> rel->rd_newRelfilelocatorSubid == mySubid) >>> { >>> /* Immediate, non-rollbackable truncation is OK */ >>> heap_truncate_one_rel(rel); >>> } >>> >>> Afaict we could do something similar for sequences, except that I think we >>> would just check if the sequence was created in the current transaction >>> (i.e. any of the fields are set). >>> >> >> Hmm, good point. >> > > But rd_createSubid/rd_newRelfilelocatorSubid fields are available only > in the original transaction, not during decoding. So we'd have to do > this check there and add the result to the WAL record. Is that what you > had in mind? > >>> >>>> +/* >>>> + * A transactional sequence increment is queued to be processed upon commit >>>> + * and a non-transactional increment gets processed immediately. >>>> + * >>>> + * A sequence update may be both transactional and non-transactional. When >>>> + * created in a running transaction, treat it as transactional and queue >>>> + * the change in it. Otherwise treat it as non-transactional, so that we >>>> + * don't forget the increment in case of a rollback. >>>> + */ >>>> +void >>>> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, >>>> + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, >>>> + RelFileLocator rlocator, bool transactional, bool created, >>>> + ReorderBufferTupleBuf *tuplebuf) >>> >>> >>>> + /* >>>> + * Decoding needs access to syscaches et al., which in turn use >>>> + * heavyweight locks and such. Thus we need to have enough state around to >>>> + * keep track of those. The easiest way is to simply use a transaction >>>> + * internally. That also allows us to easily enforce that nothing writes >>>> + * to the database by checking for xid assignments. >>>> + * >>>> + * When we're called via the SQL SRF there's already a transaction >>>> + * started, so start an explicit subtransaction there. >>>> + */ >>>> + using_subtxn = IsTransactionOrTransactionBlock(); >>> >>> This duplicates a lot of the code from ReorderBufferProcessTXN(). But only >>> does so partially. It's hard to tell whether some of the differences are >>> intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? >>> >>> Maybe something like >>> >>> void >>> ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); >>> >>> void >>> ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); >>> >> >> Thanks for the suggestion, I'll definitely consider that in the next >> version of the patch. > > I did look at the code a bit, but I'm not sure there really is a lot of > duplicated code - yes, we start/abort the (sub)transaction, setup and > tear down the snapshot, etc. Or what else would you put into the two new > functions? > > > regards > -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Fix-snapshot-handling-in-logicalmsg_decode-20230116.patch
- 0002-Logical-decoding-of-sequences-20230116.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230116.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230116.patch
- 0005-WIP-workaround-for-issue-in-parallel-apply-20230116.patch
On Mon, 16 Jan 2023 at 04:49, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > cfbot didn't like the rebased / split patch, and after looking at it I > believe it's a bug in parallel apply of large transactions (216a784829), > which seems to have changed interpretation of in_remote_transaction and > in_streamed_transaction. I've reported the issue on that thread [1], but > here's a version with a temporary workaround so that we can continue > reviewing it. > The patch does not apply on top of HEAD as in [1]; please post a rebased patch: === Applying patches on top of PostgreSQL commit ID 17e72ec45d313b98bd90b95bc71b4cc77c2c89c3 === === applying patch ./0001-Fix-snapshot-handling-in-logicalmsg_decode-20230116.patch patching file src/backend/replication/logical/decode.c patching file src/backend/replication/logical/reorderbuffer.c === applying patch ./0002-Logical-decoding-of-sequences-20230116.patch patching file doc/src/sgml/logicaldecoding.sgml Hunk #3 FAILED at 483. Hunk #4 FAILED at 494. Hunk #7 succeeded at 1252 (offset 4 lines). 2 out of 7 hunks FAILED -- saving rejects to file doc/src/sgml/logicaldecoding.sgml.rej [1] - http://cfbot.cputube.org/patch_41_3823.log Regards, Vignesh
Hi, Here's a rebased patch, without the last bit which is now unnecessary thanks to c981d9145dea. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hi, On 2/16/23 10:50 AM, Tomas Vondra wrote: > Hi, > > Here's a rebased patch, without the last bit which is now unnecessary > thanks to c981d9145dea. Thanks for continuing to work on this patch! I tested the latest version and have some feedback/clarifications. I did some testing using a demo-app-based-on-a-real-world app I had conjured up[1]. This uses integer sequences as surrogate keys. In general things seemed to work, but I had a couple of observations/questions. 1. Sequence IDs after a "failover". I believe this is a design decision, but I noticed that after simulating a failover, the IDs were replicating from a higher value, e.g. INSERT INTO room (name) VALUES ('room 1'); INSERT INTO room (name) VALUES ('room 2'); INSERT INTO room (name) VALUES ('room 3'); INSERT INTO room (name) VALUES ('room 4'); The values of room_id_seq on each instance: instance 1: last_value | log_cnt | is_called ------------+---------+----------- 4 | 29 | t instance 2: last_value | log_cnt | is_called ------------+---------+----------- 33 | 0 | t After the switchover on instance 2: INSERT INTO room (name) VALUES ('room 5') RETURNING id; id ---- 34 I don't see this as an issue for most applications, but we should at least document the behavior somewhere. 2. Using origin=none with nonconflicting sequences. I modified the example in [1] to set up two schemas with non-conflicting sequences[2], e.g. on instance 1: CREATE TABLE public.room ( id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) PRIMARY KEY, name text NOT NULL ); and instance 2: CREATE TABLE public.room ( id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) PRIMARY KEY, name text NOT NULL ); I ran the following on instance 1: INSERT INTO public.room (name) VALUES ('room 1-e'); This committed and successfully replicated. However, when I ran the following on instance 2, I received a conflict error: INSERT INTO public.room (name) VALUES ('room 1-w'); The conflict came further down the trigger chain, i.e. from a change in the `public.calendar` table: 2023-02-22 01:49:12.293 UTC [87235] ERROR: duplicate key value violates unique constraint "calendar_pkey" 2023-02-22 01:49:12.293 UTC [87235] DETAIL: Key (id)=(661) already exists. After futzing with the logging and restarting, I was also able to reproduce a similar conflict with the same insert pattern into 'room'. I did notice that the sequence values kept bouncing around between the servers. Without any activity, this is what "SELECT * FROM room_id_seq" would return with queries run ~4s apart: last_value | log_cnt | is_called ------------+---------+----------- 131 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 65 | 0 | t The values varied more on "calendar". Again, this is under no additional write activity, these numbers kept fluctuating: last_value | log_cnt | is_called ------------+---------+----------- 197 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 461 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 263 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 527 | 0 | t To handle this case for now, I adapted the schema to create sequences that were clearly independently named[3].
I did learn that I had to create sequences on both instances to support this behavior, e.g.: -- instance 1 CREATE SEQUENCE public.room_id_1_seq AS int INCREMENT BY 2 START WITH 1; CREATE SEQUENCE public.room_id_2_seq AS int INCREMENT BY 2 START WITH 2; CREATE TABLE public.room ( id int DEFAULT nextval('room_id_1_seq') PRIMARY KEY, name text NOT NULL ); -- instance 2 CREATE SEQUENCE public.room_id_1_seq AS int INCREMENT BY 2 START WITH 1; CREATE SEQUENCE public.room_id_2_seq AS int INCREMENT BY 2 START WITH 2; CREATE TABLE public.room ( id int DEFAULT nextval('room_id_2_seq') PRIMARY KEY, name text NOT NULL ); After building out [3] this did work, but it was more tedious. Is it possible to support IDENTITY columns (or serial columns) where the values of the sequence are set to different intervals on the publisher/subscriber? Thanks, Jonathan [1] https://github.com/CrunchyData/postgres-realtime-demo/blob/main/examples/demo/demo1.sql [2] https://gist.github.com/jkatz/5c34bf1e401b3376dfe8e627fcd30af3 [3] https://gist.github.com/jkatz/1599e467d55abec88ab487d8ac9dc7c3
Attachment
On 2/22/23 03:28, Jonathan S. Katz wrote: > Hi, > > On 2/16/23 10:50 AM, Tomas Vondra wrote: >> Hi, >> >> Here's a rebased patch, without the last bit which is now unnecessary >> thanks to c981d9145dea. > > Thanks for continuing to work on this patch! I tested the latest version > and have some feedback/clarifications. > Thanks! > I did some testing using a demo-app-based-on-a-real-world app I had > conjured up[1]. This uses integer sequences as surrogate keys. > > In general things seemed to work, but I had a couple of > observations/questions. > > 1. Sequence IDs after a "failover". I believe this is a design decision, > but I noticed that after simulating a failover, the IDs were replicating > from a higher value, e.g. > > INSERT INTO room (name) VALUES ('room 1'); > INSERT INTO room (name) VALUES ('room 2'); > INSERT INTO room (name) VALUES ('room 3'); > INSERT INTO room (name) VALUES ('room 4'); > > The values of room_id_seq on each instance: > > instance 1: > > last_value | log_cnt | is_called > ------------+---------+----------- > 4 | 29 | t > > instance 2: > > last_value | log_cnt | is_called > ------------+---------+----------- > 33 | 0 | t > > After the switchover on instance 2: > > INSERT INTO room (name) VALUES ('room 5') RETURNING id; > > id > ---- > 34 > > I don't see this as an issue for most applications, but we should at > least document the behavior somewhere. > Yes, this is due to how we WAL-log sequences. We don't log individual increments, but every 32nd increment and we log the "future" sequence state so that after a crash/recovery we don't generate duplicates. So you do nextval() and it returns 1. But into WAL we record 32. And there will be no WAL records until nextval reaches 32 and needs to generate another batch. And because logical replication relies on these WAL records, it inherits this batching behavior with a "jump" on recovery/failover. IMHO it's OK, it works for the "logical failover" use case and if you need gapless sequences then regular sequences are not an issue anyway. It's possible to reduce the jump a bit by reducing the batch size (from 32 to 0) so that every increment is logged. But it doesn't eliminate it because of rollbacks. > 2. Using with origin=none with nonconflicting sequences. > > I modified the example in [1] to set up two schemas with non-conflicting > sequences[2], e.g. on instance 1: > > CREATE TABLE public.room ( > id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) > PRIMARY KEY, > name text NOT NULL > ); > > and instance 2: > > CREATE TABLE public.room ( > id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) > PRIMARY KEY, > name text NOT NULL > ); > Well, yeah. We don't support active-active logical replication (at least not with the built-in). You can easily get into similar issues without sequences. Replicating a sequence overwrites the state of the sequence on the other side, which may result in it generating duplicate values with the other node, etc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
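To spell out the batching with a concrete sketch (the exact numbers depend on the sequence's cache setting and on rollbacks, so take it as an illustration only):

-- on the publisher
CREATE SEQUENCE s;
SELECT nextval('s');    -- returns 1; WAL records the prefetched state
                        -- 32 values ahead (last_value = 33 in the test
                        -- above)
-- the following ~31 nextval() calls generate no sequence WAL at all

-- on the subscriber, after a failover
SELECT nextval('s');    -- continues from the logged state (34 in the
                        -- test above), not from 2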
On 2/22/23 5:02 AM, Tomas Vondra wrote: > > On 2/22/23 03:28, Jonathan S. Katz wrote: > >>> Thanks for continuing to work on this patch! I tested the latest version >>> and have some feedback/clarifications. >>> >> >> Thanks! Also I should mention I've been testing with both async/sync logical replication. I didn't have any specific comments on either as it seemed to just work and behaviors aligned with existing expectations. Generally it's been a good experience and it seems to be working. :) At this point I'm trying to understand the limitations and tripwires so we can guide users appropriately. > Yes, this is due to how we WAL-log sequences. We don't log individual > increments, but every 32nd increment and we log the "future" sequence > state so that after a crash/recovery we don't generate duplicates. > > So you do nextval() and it returns 1. But into WAL we record 32. And > there will be no WAL records until nextval reaches 32 and needs to > generate another batch. > > And because logical replication relies on these WAL records, it inherits > this batching behavior with a "jump" on recovery/failover. IMHO it's OK, > it works for the "logical failover" use case and if you need gapless > sequences then regular sequences are not an issue anyway. > > It's possible to reduce the jump a bit by reducing the batch size (from > 32 to 0) so that every increment is logged. But it doesn't eliminate it > because of rollbacks. I generally agree. I think it's mainly something we should capture in the user docs that there can be a jump on the subscriber side, so people are not surprised. Interestingly, in systems that tend to have higher rates of failover (I'm thinking of a few distributed systems), this may cause int4 sequences to exhaust numbers slightly (marginally?) more quickly. Likely not too big of an issue, but something to keep in mind. >> 2. Using with origin=none with nonconflicting sequences. >> >> I modified the example in [1] to set up two schemas with non-conflicting >> sequences[2], e.g. on instance 1: >> >> CREATE TABLE public.room ( >> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) >> PRIMARY KEY, >> name text NOT NULL >> ); >> >> and instance 2: >> >> CREATE TABLE public.room ( >> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) >> PRIMARY KEY, >> name text NOT NULL >> ); >> > > Well, yeah. We don't support active-active logical replication (at least > not with the built-in). You can easily get into similar issues without > sequences. The "origin=none" feature lets you replicate tables bidirectionally. While it's not full "active-active", this is a starting point and a feature for v16. We'll definitely have users replicating data bidirectionally with this. > Replicating a sequence overwrites the state of the sequence on the other > side, which may result in it generating duplicate values with the other > node, etc. I understand that we don't currently support global sequences, but I am concerned there may be a tripwire here in the origin=none case given it's fairly common to use serial/GENERATED BY to set primary keys. And it's fairly trivial to set them to be nonconflicting, or at least give the user the appearance that they are nonconflicting. From my high-level understanding of how sequences work, this sounds like it would be a lift to support the example in [1]. Or maybe the answer is that you can bidirectionally replicate the changes in the tables, but not sequences?
In any case, we should update the restrictions in [2] to state: while sequences can be replicated, there is additional work required if you are bidirectionally replicating tables that use sequences, esp. if used in a PK or a constraint. We can provide alternatives to how a user could set that up, i.e. not replicating the sequences or do something like in [3]. Thanks, Jonathan [1] https://gist.github.com/jkatz/5c34bf1e401b3376dfe8e627fcd30af3 [2] https://www.postgresql.org/docs/devel/logical-replication-restrictions.html [3] https://gist.github.com/jkatz/1599e467d55abec88ab487d8ac9dc7c3
Attachment
On 2/22/23 18:04, Jonathan S. Katz wrote: > On 2/22/23 5:02 AM, Tomas Vondra wrote: >> >> On 2/22/23 03:28, Jonathan S. Katz wrote: > >>> Thanks for continuing to work on this patch! I tested the latest version >>> and have some feedback/clarifications. >>> >> >> Thanks! > > Also I should mention I've been testing with both async/sync logical > replication. I didn't have any specific comments on either as it seemed > to just work and behaviors aligned with existing expectations. > > Generally it's been a good experience and it seems to be working. :) At > this point I'm trying to understand the limitations and tripwires so we > can guide users appropriately. > Good to hear. >> Yes, this is due to how we WAL-log sequences. We don't log individual >> increments, but every 32nd increment and we log the "future" sequence >> state so that after a crash/recovery we don't generate duplicates. >> >> So you do nextval() and it returns 1. But into WAL we record 32. And >> there will be no WAL records until nextval reaches 32 and needs to >> generate another batch. >> >> And because logical replication relies on these WAL records, it inherits >> this batching behavior with a "jump" on recovery/failover. IMHO it's OK, >> it works for the "logical failover" use case and if you need gapless >> sequences then regular sequences are not an issue anyway. >> >> It's possible to reduce the jump a bit by reducing the batch size (from >> 32 to 0) so that every increment is logged. But it doesn't eliminate it >> because of rollbacks. > > I generally agree. I think it's mainly something we should capture in > the user docs that they can be a jump on the subscriber side, so people > are not surprised. > > Interestingly, in systems that tend to have higher rates of failover > (I'm thinking of a few distributed systems), this may cause int4 > sequences to exhaust numbers slightly (marginally?) more quickly. Likely > not too big of an issue, but something to keep in mind. > IMHO the number of systems that would work fine with int4 sequences but where this change results in the sequences being "exhausted" too quickly is indistinguishable from 0. I don't think this is an issue. >>> 2. Using with origin=none with nonconflicting sequences. >>> >>> I modified the example in [1] to set up two schemas with non-conflicting >>> sequences[2], e.g. on instance 1: >>> >>> CREATE TABLE public.room ( >>> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) >>> PRIMARY KEY, >>> name text NOT NULL >>> ); >>> >>> and instance 2: >>> >>> CREATE TABLE public.room ( >>> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) >>> PRIMARY KEY, >>> name text NOT NULL >>> ); >>> >> >> Well, yeah. We don't support active-active logical replication (at least >> not with the built-in). You can easily get into similar issues without >> sequences. > > The "origin=none" feature lets you replicate tables bidirectionally. > While it's not full "active-active", this is a starting point and a > feature for v16. We'll definitely have users replicating data > bidirectionally with this. > Well, then the users need to use some other way to generate IDs, not local sequences. Either some sort of distributed/global sequence, UUIDs or something like that. >> Replicating a sequence overwrites the state of the sequence on the other >> side, which may result in it generating duplicate values with the other >> node, etc.
> > I understand that we don't currently support global sequences, but I am > concerned there may be a tripwire here in the origin=none case given > it's fairly common to use serial/GENERATED BY to set primary keys. And > it's fairly trivial to set them to be nonconflicting, or at least give > the user the appearance that they are nonconflicting. > > From my high level understand of how sequences work, this sounds like it > would be a lift to support the example in [1]. Or maybe the answer is > that you can bidirectionally replicate the changes in the tables, but > not sequences? > Yes, local sequences don't and can't work in such setups. > In any case, we should update the restrictions in [2] to state: while > sequences can be replicated, there is additional work required if you > are bidirectionally replicating tables that use sequences, esp. if used > in a PK or a constraint. We can provide alternatives to how a user could > set that up, i.e. not replicates the sequences or do something like in [3]. > I agree. I see this as mostly a documentation issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
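For the docs, the simplest alternative to show is probably a key type that needs no cross-node coordination at all, e.g. (just one option, of course):

CREATE TABLE room (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name text NOT NULL
);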
On 2/23/23 7:56 AM, Tomas Vondra wrote: > On 2/22/23 18:04, Jonathan S. Katz wrote: >> On 2/22/23 5:02 AM, Tomas Vondra wrote: >>> >> Interestingly, in systems that tend to have higher rates of failover >> (I'm thinking of a few distributed systems), this may cause int4 >> sequences to exhaust numbers slightly (marginally?) more quickly. Likely >> not too big of an issue, but something to keep in mind. >> > > IMHO the number of systems that would work fine with int4 sequences but > this change results in the sequences being "exhausted" too quickly is > indistinguishable from 0. I don't think this is an issue. I agree it's an edge case. I do think it's a number greater than 0, having seen some incredibly flaky setups, particularly in distributed systems. I would not worry about it, but only mentioned it to try and probe edge cases. >>> Well, yeah. We don't support active-active logical replication (at least >>> not with the built-in). You can easily get into similar issues without >>> sequences. >> >> The "origin=none" feature lets you replicate tables bidirectionally. >> While it's not full "active-active", this is a starting point and a >> feature for v16. We'll definitely have users replicating data >> bidirectionally with this. >> > > Well, then the users need to use some other way to generate IDs, not > local sequences. Either some sort of distributed/global sequence, UUIDs > or something like that. [snip] >> In any case, we should update the restrictions in [2] to state: while >> sequences can be replicated, there is additional work required if you >> are bidirectionally replicating tables that use sequences, esp. if used >> in a PK or a constraint. We can provide alternatives to how a user could >> set that up, i.e. not replicates the sequences or do something like in [3]. >> > > I agree. I see this as mostly a documentation issue. Great. I agree that users need other mechanisms to generate IDs, but we should ensure we document that. If needed, I'm happy to help with the docs here. Thanks, Jonathan
Attachment
Hi, here's a rebased patch to make cfbot happy, dropping the first part that is now unnecessary thanks to 7fe1aa991b. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Wed, Mar 1, 2023 at 1:02 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> here's a rebased patch to make cfbot happy, dropping the first part that
> is now unnecessary thanks to 7fe1aa991b.
Hi Tomas,
I'm looking into doing some "in situ" testing, but for now I'll mention some minor nits I found:
0001
+ * so we simply do a lookup (the sequence is identified by relfilende). If
relfilenode? Or should it be called a relfilelocator, which is the parameter type? I see some other references to relfilenode in comments and commit message, and I'm not sure which need to be updated.
+ /* XXX Maybe check that we're still in the same top-level xact? */
Any ideas on what should happen here?
+ /* XXX how could we have sequence change without data? */
+ if(!datalen || !tupledata)
+ elog(ERROR, "sequence decode missing tuple data");
Since the ERROR is new based on feedback, we can get rid of XXX I think.
More generally, I associate XXX comments to highlight problems or unpleasantness in the code that don't quite rise to the level of FIXME, but are perhaps more serious than "NB:", "Note:", or "Important:"
+ * When we're called via the SQL SRF there's already a transaction
I see this was copied from existing code, but I found it confusing -- does this function have a stable name?
+ /* Only ever called from ReorderBufferApplySequence, so transational. */
Typo: transactional
0002
I see a few SERIAL types in the tests but no GENERATED ... AS IDENTITY -- not sure if it matters, but seems good for completeness.
Reminder for later: Patches 0002 and 0003 still refer to 0da92dc530, which is a reverted commit -- I assume it intends to refer to the content of 0001?
--
John Naylor
EDB: http://www.enterprisedb.com
I tried a couple toy examples with various combinations of use styles.
Three with "automatic" reading from sequences:
create table test(i serial);
create table test(i int GENERATED BY DEFAULT AS IDENTITY);
create table test(i int default nextval('s1'));
...where s1 has some non-default parameters:
CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
...and then two with explicit use of s1, one inserting the 'nextval' into a table with no default, and one with no table at all, just selecting from the sequence.
The last two seem to work similarly to the first three, so it seems like FOR ALL TABLES adds all sequences as well. Is that expected? The documentation for CREATE PUBLICATION mentions sequence options, but doesn't really say how these options should be used.
Here's the script:
# alter system set wal_level='logical';
# restart
# port 7777 is subscriber
echo
echo "PUB:"
psql -c "drop sequence if exists s1;"
psql -c "drop publication if exists pub1;"
echo
echo "SUB:"
psql -p 7777 -c "drop sequence if exists s1;"
psql -p 7777 -c "drop subscription if exists sub1 ;"
echo
echo "PUB:"
psql -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"
echo
echo "SUB:"
psql -p 7777 -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -p 7777 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost dbname=john application_name=sub1 port=5432' PUBLICATION pub1;"
echo
echo "PUB:"
psql -c "select nextval('s1');"
psql -c "select nextval('s1');"
psql -c "select * from s1;"
sleep 1
echo
echo "SUB:"
psql -p 7777 -c "select * from s1;"
psql -p 7777 -c "drop subscription sub1 ;"
psql -p 7777 -c "select nextval('s1');"
psql -p 7777 -c "select * from s1;"
...with the last two queries returning
nextval
---------
67
(1 row)
last_value | log_cnt | is_called
------------+---------+-----------
67 | 32 | t
So, I interpret that the decrement by 32 got logged here.
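(For the record, the numbers are consistent with logging 32 values ahead: the first nextval(), returning 100, logged the state 32 decrements ahead, i.e. last_value = 100 - 32 = 68; the subscriber applied that, so its first local nextval() returned 68 - 1 = 67, and that call in turn logged 32 values ahead again, hence the fresh log_cnt of 32.)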
Also, running
CREATE PUBLICATION pub2 FOR ALL SEQUENCES WITH (publish = 'insert, update, delete, truncate, sequence');
...reports success, but do non-default values of "publish = ..." have an effect (or should they), or are these just ignored? It seems like these cases shouldn't be treated orthogonally.
--
John Naylor
EDB: http://www.enterprisedb.com
On 3/10/23 11:03, John Naylor wrote: > > On Wed, Mar 1, 2023 at 1:02 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> > wrote: >> here's a rebased patch to make cfbot happy, dropping the first part that >> is now unnecessary thanks to 7fe1aa991b. > > Hi Tomas, > > I'm looking into doing some "in situ" testing, but for now I'll mention > some minor nits I found: > > 0001 > > + * so we simply do a lookup (the sequence is identified by relfilende). If > > relfilenode? Or should it be called a relfilelocator, which is the > parameter type? I see some other references to relfilenode in comments > and commit message, and I'm not sure which need to be updated. > Yeah, that's a leftover from the original patch, before the relfilenode was renamed to relfilelocator. > + /* XXX Maybe check that we're still in the same top-level xact? */ > > Any ideas on what should happen here? > I don't recall why I added this comment, but I don't think there's anything we need to do (so drop the comment). > + /* XXX how could we have sequence change without data? */ > + if(!datalen || !tupledata) > + elog(ERROR, "sequence decode missing tuple data"); > > Since the ERROR is new based on feedback, we can get rid of XXX I think. > > More generally, I associate XXX comments to highlight problems or > unpleasantness in the code that don't quite rise to the level of FIXME, > but are perhaps more serious than "NB:", "Note:", or "Important:" > Understood. I keep adding XXX in places where I have some open questions, or something that may need to be improved (so kinda less serious than a FIXME). > + * When we're called via the SQL SRF there's already a transaction > > I see this was copied from existing code, but I found it confusing -- > does this function have a stable name? > What do you mean by "stable name"? It certainly is not exposed as a user-callable SQL function, so I think this comment is misleading and should be removed. > + /* Only ever called from ReorderBufferApplySequence, so transational. */ > > Typo: transactional > > 0002 > > I see a few SERIAL types in the tests but no GENERATED ... AS IDENTITY > -- not sure if it matters, but seems good for completeness. > That's a good point. Adding tests for GENERATED ... AS IDENTITY is a good idea. > Reminder for later: Patches 0002 and 0003 still refer to 0da92dc530, > which is a reverted commit -- I assume it intends to refer to the > content of 0001? > Correct. That needs to be adjusted at commit time. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
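A minimal test case along those lines might look like this (a sketch mirroring the existing SERIAL tests):

CREATE TABLE test_identity (
    id int GENERATED ALWAYS AS IDENTITY,
    val text
);
INSERT INTO test_identity (val) VALUES ('a'), ('b');
-- the implicit sequence test_identity_id_seq should decode and replicate
-- exactly like an explicitly created sequence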
On 3/14/23 08:30, John Naylor wrote:
> I tried a couple toy examples with various combinations of use styles.
>
> Three with "automatic" reading from sequences:
>
> create table test(i serial);
> create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> create table test(i int default nextval('s1'));
>
> ...where s1 has some non-default parameters:
>
> CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
>
> ...and then two with explicit use of s1, one inserting the 'nextval'
> into a table with no default, and one with no table at all, just
> selecting from the sequence.
>
> The last two seem to work similarly to the first three, so it seems like
> FOR ALL TABLES adds all sequences as well. Is that expected?

Yeah, that's a bug - we shouldn't replicate the sequence changes, unless the sequence is actually added to the publication. I tracked this down to a thinko in get_rel_sync_entry() which failed to check the object type when puballtables or puballsequences was set.

Attached is a patch fixing this.

> The documentation for CREATE PUBLICATION mentions sequence options,
> but doesn't really say how these options should be used.

Good point. The idea is that we handle tables and sequences the same way, i.e. if you specify 'sequence' then we'll replicate increments for sequences explicitly added to the publication.

If this is not clear, the docs may need some improvements.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
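To make the intended behavior after this fix concrete, here is a minimal sketch (object names hypothetical) of how the two publication kinds are meant to behave once get_rel_sync_entry() checks the object type:

-- On the publisher:
CREATE SEQUENCE s1;
CREATE TABLE t1 (a int);

-- Publishes changes to t1; after the fix it must NOT publish
-- increments of s1 (that was the reported bug).
CREATE PUBLICATION pub_tables FOR ALL TABLES;

-- Publishes increments of s1 (and any other sequence).
CREATE PUBLICATION pub_seqs FOR ALL SEQUENCES;

-- This increment should be decoded and sent only to subscribers
-- of pub_seqs, not to subscribers of pub_tables.
SELECT nextval('s1');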
Hi,

On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/14/23 08:30, John Naylor wrote:
> > I tried a couple toy examples with various combinations of use styles.
> >
> > ...
> >
> > The last two seem to work similarly to the first three, so it seems like
> > FOR ALL TABLES adds all sequences as well. Is that expected?
>
> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> the sequence is actually added to the publication. I tracked this down
> to a thinko in get_rel_sync_entry() which failed to check the object
> type when puballtables or puballsequences was set.
>
> Attached is a patch fixing this.
>
> > The documentation for CREATE PUBLICATION mentions sequence options,
> > but doesn't really say how these options should be used.
>
> Good point. The idea is that we handle tables and sequences the same
> way, i.e. if you specify 'sequence' then we'll replicate increments for
> sequences explicitly added to the publication.
>
> If this is not clear, the docs may need some improvements.

I'm late to this thread, but I have some questions and review comments.

Regarding sequence logical replication, it seems that changes of a sequence created after CREATE SUBSCRIPTION are applied on the subscriber even without a REFRESH PUBLICATION command on the subscriber, which is a different behavior than tables. For example, I set up both publisher and subscriber as follows:

1. On publisher
create publication test_pub for all sequences;

2. On subscriber
create subscription test_sub connection 'dbname=postgres port=5551' publication test_pub; -- port=5551 is the publisher

3. On publisher
create sequence s1;
select nextval('s1');

I got the error "ERROR: relation "public.s1" does not exist" on the subscriber. Probably we need to do the should_apply_changes_for_rel() check in apply_handle_sequence().

If my understanding is correct, is there any case where the subscriber needs to apply transactional sequence changes? The commit message of the 0001 patch says:

* Changes for sequences created in the same top-level transaction are treated as transactional, i.e. just like any other change from that transaction, and discarded in case of a rollback.

IIUC such sequences are not visible to the subscriber, so it cannot subscribe to them until the commit.

---
I got an assertion failure. The reproducible steps are:

1. On publisher
alter system set logical_replication_mode = 'immediate';
select pg_reload_conf();
create publication test_pub for all sequences;

2. On subscriber
create subscription test_sub connection 'dbname=postgres port=5551' publication test_pub with (streaming='parallel')

3. On publisher
begin;
create table bar (c int, d serial);
insert into bar(c) values (100);
commit;

I got the following assertion failure:

TRAP: failed Assert("(!seq.transactional) || in_remote_transaction"), File: "worker.c", Line: 1458, PID: 508056
postgres: logical replication parallel apply worker for subscription 16388 (ExceptionalCondition+0x9e)[0xb6c0af]
postgres: logical replication parallel apply worker for subscription 16388 [0x92f7fe]
postgres: logical replication parallel apply worker for subscription 16388 (apply_dispatch+0xed)[0x932925]
postgres: logical replication parallel apply worker for subscription 16388 [0x90d927]
postgres: logical replication parallel apply worker for subscription 16388 (ParallelApplyWorkerMain+0x34f)[0x90dd8d]
postgres: logical replication parallel apply worker for subscription 16388 (StartBackgroundWorker+0x1f3)[0x8e7b19]
postgres: logical replication parallel apply worker for subscription 16388 [0x8f1798]
postgres: logical replication parallel apply worker for subscription 16388 [0x8f1b53]
postgres: logical replication parallel apply worker for subscription 16388 [0x8f0bed]
postgres: logical replication parallel apply worker for subscription 16388 [0x8ecca4]
postgres: logical replication parallel apply worker for subscription 16388 (PostmasterMain+0x1246)[0x8ec6d7]
postgres: logical replication parallel apply worker for subscription 16388 [0x7bbe5c]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f69094cbcf3]
postgres: logical replication parallel apply worker for subscription 16388 (_start+0x2e)[0x49d15e]
2023-03-16 12:33:19.471 JST [507974] LOG: background worker "logical replication parallel worker" (PID 508056) was terminated by signal 6: Aborted

seq.transactional is true and in_remote_transaction is false. It might be an issue of the parallel apply feature rather than this patch.

---
There is no documentation about the new 'sequence' value of the publish option in CREATE/ALTER PUBLICATION. It seems to be possible to specify something like "CREATE PUBLICATION ... FOR ALL SEQUENCES WITH (publish = 'truncate')" (i.e., not specifying the 'sequence' value in the publish option). How does logical replication work with this setting? Nothing is replicated?

---
It seems that sequence replication doesn't work well together with the ALTER SUBSCRIPTION ... SKIP command. IIUC these changes are not skipped even if they are transactional changes. The reproducible steps are:

1. On both nodes
create table a (c int primary key);

2. On publisher
create publication hoge_pub for all sequences, tables

3. On subscriber
create subscription hoge_sub connection 'dbname=postgres port=5551' publication hoge_pub;
insert into a values (1);

4. On publisher
begin;
create sequence s2;
insert into a values (nextval('s2'));
commit;

At step 4, applying the INSERT conflicts with the existing row on the subscriber. If I skip this transaction using the ALTER SUBSCRIPTION ... SKIP command, I get:

ERROR: relation "public.s2" does not exist
CONTEXT: processing remote data for replication origin "pg_16390" during message type "BEGIN" in transaction 734, finished at 0/1751698

If I create the sequence s2 in advance on the subscriber, the sequence change is applied on the subscriber. If the subscriber doesn't need to apply transactional sequence changes in the first place, this problem will disappear.

---
There are two typos in the 0001 patch:

In the commit message:

ensure the sequence record has a valid XID - until now the the increment

s/the the/the/

And,

+ /* Only ever called from ReorderBufferApplySequence, so transational. */

s/transational/transactional/

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
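For reference, the skip attempt above would use the stock subscriber-side command, with the LSN taken from the CONTEXT line in the error (subscription name as in the repro); the report is that the relation-does-not-exist error fires before the skip logic can discard the sequence change:

ALTER SUBSCRIPTION hoge_sub SKIP (lsn = '0/1751698');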
On Thu, Mar 16, 2023 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 3/14/23 08:30, John Naylor wrote:
>
> ---
> I got an assertion failure. The reproducible steps are:
>
> 1. On publisher
> alter system set logical_replication_mode = 'immediate';
> select pg_reload_conf();
> create publication test_pub for all sequences;
>
> 2. On subscriber
> create subscription test_sub connection 'dbname=postgres port=5551'
> publication test_pub with (streaming='parallel')
>
> 3. On publisher
> begin;
> create table bar (c int, d serial);
> insert into bar(c) values (100);
> commit;
>
> I got the following assertion failure:
>
> TRAP: failed Assert("(!seq.transactional) || in_remote_transaction"), ...
>
> seq.transactional is true and in_remote_transaction is false. It might
> be an issue of the parallel apply feature rather than this patch.

During parallel apply we didn't need to rely on in_remote_transaction, so it was not set. I haven't checked the patch in detail but am wondering, isn't it sufficient to instead check IsTransactionState() and/or IsTransactionOrTransactionBlock()?

--
With Regards,
Amit Kapila.
Hi!

On 3/16/23 08:38, Masahiko Sawada wrote:
> Hi,
>
> On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/14/23 08:30, John Naylor wrote:
>>> ...
>>>
>>> The last two seem to work similarly to the first three, so it seems like
>>> FOR ALL TABLES adds all sequences as well. Is that expected?
>>
>> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
>> the sequence is actually added to the publication. ...
>
> I'm late to this thread, but I have some questions and review comments.
>
> Regarding sequence logical replication, it seems that changes of
> sequence created after CREATE SUBSCRIPTION are applied on the
> subscriber even without REFRESH PUBLICATION command on the subscriber.
> Which is a different behavior than tables. For example, I set both
> publisher and subscriber as follows:
>
> 1. On publisher
> create publication test_pub for all sequences;
>
> 2. On subscriber
> create subscription test_sub connection 'dbname=postgres port=5551'
> publication test_pub; -- port=5551 is the publisher
>
> 3. On publisher
> create sequence s1;
> select nextval('s1');
>
> I got the error "ERROR: relation "public.s1" does not exist" on the
> subscriber. Probably we need to do should_apply_changes_for_rel()
> check in apply_handle_sequence().

Yes, you're right - the sequence handling should have been calling the should_apply_changes_for_rel() etc.

The attached 0005 patch should fix that - I still need to test it a bit more and maybe clean it up a bit, but hopefully it'll allow you to continue the review.

I had to tweak the protocol a bit, so that this uses the same cache as tables. I wonder if maybe we should make it even more similar, by essentially treating sequences as tables with (last_value, log_cnt, called) columns.

> If my understanding is correct, is there any case where the subscriber
> needs to apply transactional sequence changes? The commit message of
> 0001 patch says:
>
> * Changes for sequences created in the same top-level transaction are
> treated as transactional, i.e. just like any other change from that
> transaction, and discarded in case of a rollback.
>
> IIUC such sequences are not visible to the subscriber, so it cannot
> subscribe to them until the commit.

The comment is slightly misleading, as it talks about creation of sequences, but it should be talking about relfilenodes. For example, if you create a sequence, add it to publication, and then in a later transaction you do

ALTER SEQUENCE x RESTART

or something else that creates a new relfilenode, then the subsequent increments are visible only in that transaction. But we still need to apply those on the subscriber, but only as part of the transaction, because it might roll back.

> ---
> I got an assertion failure. The reproducible steps are:

I do believe this was due to a thinko in apply_handle_sequence, which sometimes started a transaction and didn't terminate it correctly. I've changed it to use the begin_replication_step() etc. and it seems to be working fine now.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230316.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230316.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230316.patch
- 0004-puballtables-fixup-20230316.patch
- 0005-fixup-syncing-refresh-sequences-20230316.patch
- 0006-john-s-review-20230316.patch
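A small sketch of the transactional case described above (sequence name hypothetical): increments made after an in-transaction ALTER land in the new relfilenode, so they exist only inside that transaction and must be applied, or discarded, together with it:

CREATE SEQUENCE s;
SELECT nextval('s');       -- non-transactional increment, replicated as such

BEGIN;
ALTER SEQUENCE s RESTART;  -- creates a new relfilenode, visible only here
SELECT nextval('s');       -- this increment hits the new relfilenode
ROLLBACK;                  -- the relfilenode and its increments are discarded

-- After ROLLBACK the sequence continues from its pre-transaction state,
-- so the subscriber must not keep the increments from the aborted xact.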
On Thu, 16 Mar 2023 at 21:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi!
>
> On 3/16/23 08:38, Masahiko Sawada wrote:
> > ...
>
> Yes, you're right - the sequence handling should have been calling the
> should_apply_changes_for_rel() etc.
>
> The attached 0005 patch should fix that - I still need to test it a bit
> more and maybe clean it up a bit, but hopefully it'll allow you to
> continue the review.
>
> ...
>
> I do believe this was due to a thinko in apply_handle_sequence, which
> sometimes started a transaction and didn't terminate it correctly. I've
> changed it to use the begin_replication_step() etc. and it seems to be
> working fine now.

One of the patches does not apply on HEAD because of a recent commit; we might have to rebase the patch:

git am 0005-fixup-syncing-refresh-sequences-20230316.patch
Applying: fixup syncing/refresh sequences
error: patch failed: src/backend/replication/pgoutput/pgoutput.c:711
error: src/backend/replication/pgoutput/pgoutput.c: patch does not apply
Patch failed at 0001 fixup syncing/refresh sequences

Regards,
Vignesh
On Wed, Mar 15, 2023 at 7:51 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 3/14/23 08:30, John Naylor wrote:
> > I tried a couple toy examples with various combinations of use styles.
> >
> > Three with "automatic" reading from sequences:
> >
> > create table test(i serial);
> > create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> > create table test(i int default nextval('s1'));
> >
> > ...where s1 has some non-default parameters:
> >
> > CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
> >
> > ...and then two with explicit use of s1, one inserting the 'nextval'
> > into a table with no default, and one with no table at all, just
> > selecting from the sequence.
> >
> > The last two seem to work similarly to the first three, so it seems like
> > FOR ALL TABLES adds all sequences as well. Is that expected?
>
> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> the sequence is actually added to the publication. I tracked this down
> to a thinko in get_rel_sync_entry() which failed to check the object
> type when puballtables or puballsequences was set.
>
> Attached is a patch fixing this.
Okay, I can verify that with 0001-0006, sequences don't replicate unless specified. I do see an additional change that doesn't make sense: On the subscriber I no longer see a jump to the logged 32 increment, I see the very next value:
# alter system set wal_level='logical';
# port 7777 is subscriber
echo
echo "PUB:"
psql -c "drop table if exists test;"
psql -c "drop publication if exists pub1;"
echo
echo "SUB:"
psql -p 7777 -c "drop table if exists test;"
psql -p 7777 -c "drop subscription if exists sub1 ;"
echo
echo "PUB:"
psql -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"
psql -c "CREATE PUBLICATION pub2 FOR ALL SEQUENCES;"
echo
echo "SUB:"
psql -p 7777 -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
psql -p 7777 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost dbname=postgres application_name=sub1 port=5432' PUBLICATION pub1;"
psql -p 7777 -c "CREATE SUBSCRIPTION sub2 CONNECTION 'host=localhost dbname=postgres application_name=sub2 port=5432' PUBLICATION pub2;"
echo
echo "PUB:"
psql -c "insert into test default values;"
psql -c "insert into test default values;"
psql -c "select * from test;"
psql -c "select * from test_i_seq;"
sleep 1
echo
echo "SUB:"
psql -p 7777 -c "select * from test;"
psql -p 7777 -c "select * from test_i_seq;"
psql -p 7777 -c "drop subscription sub1 ;"
psql -p 7777 -c "drop subscription sub2 ;"
psql -p 7777 -c "insert into test default values;"
psql -p 7777 -c "select * from test;"
psql -p 7777 -c "select * from test_i_seq;"
The last two queries on the subscriber show:
i
---
1
2
3
(3 rows)
last_value | log_cnt | is_called
------------+---------+-----------
3 | 30 | t
(1 row)
...whereas before with 0001-0003 I saw:
i
----
1
2
34
(3 rows)
last_value | log_cnt | is_called
------------+---------+-----------
34 | 32 | t
> > The documentation for CREATE PUBLICATION mentions sequence options,
> > but doesn't really say how these options should be used.
> Good point. The idea is that we handle tables and sequences the same
> way, i.e. if you specify 'sequence' then we'll replicate increments for
> sequences explicitly added to the publication.
>
> If this is not clear, the docs may need some improvements.
Aside from docs, I'm not clear what some of the tests are doing:
+CREATE PUBLICATION testpub_forallsequences FOR ALL SEQUENCES WITH (publish = 'sequence');
+RESET client_min_messages;
+ALTER PUBLICATION testpub_forallsequences SET (publish = 'insert, sequence');
What does it mean to add 'insert' to a sequence publication?
Likewise, from a brief change in my test above, 'sequence' seems to be a noise word for table publications. I'm not fully read up on the background of this topic, but wanted to make sure I understood the design of the syntax.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 15, 2023 at 7:00 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/10/23 11:03, John Naylor wrote:
> > + * When we're called via the SQL SRF there's already a transaction
> >
> > I see this was copied from existing code, but I found it confusing --
> > does this function have a stable name?
>
> What do you mean by "stable name"? It certainly is not exposed as a
> user-callable SQL function, so I think this comment is misleading and
> should be removed.
Okay, I was just trying to think of why it was phrased this way...
On Thu, 16 Mar 2023 at 21:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi!
>
> On 3/16/23 08:38, Masahiko Sawada wrote:
> > ...
>
> ...
>
> I do believe this was due to a thinko in apply_handle_sequence, which
> sometimes started a transaction and didn't terminate it correctly. I've
> changed it to use the begin_replication_step() etc. and it seems to be
> working fine now.

Few comments:

1) One of the tests is failing for me; I had also seen the same failure in CFBOT at [1]:

# Failed test 'create sequence, advance it in rolled-back transaction, but commit the create'
# at t/030_sequences.pl line 152.
# got: '1|0|f'
# expected: '132|0|t'
t/030_sequences.pl ................. 5/? ?
# Failed test 'advance the new sequence in a transaction and roll it back'
# at t/030_sequences.pl line 175.
# got: '1|0|f'
# expected: '231|0|t'
# Failed test 'advance sequence in a subtransaction'
# at t/030_sequences.pl line 198.
# got: '1|0|f'
# expected: '330|0|t'
# Looks like you failed 3 tests of 6.

2) We could replace the below:

$node_publisher->wait_for_catchup('seq_sub');

# Wait for initial sync to finish as well
my $synced_query = "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('s', 'r');";
$node_subscriber->poll_query_until('postgres', $synced_query)
  or die "Timed out while waiting for subscriber to synchronize data";

with:

$node_subscriber->wait_for_subscription_sync;

3) We could change 030_sequences.pl to 033_sequences.pl as 030 is already used:

diff --git a/src/test/subscription/t/030_sequences.pl b/src/test/subscription/t/030_sequences.pl
new file mode 100644
index 00000000000..9ae3c03d7d1
--- /dev/null
+++ b/src/test/subscription/t/030_sequences.pl

4) The copyright year should be changed to 2023:

@@ -0,0 +1,202 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# This tests that sequences are replicated correctly by logical replication
+use strict;
+use warnings;

[1] - https://cirrus-ci.com/task/5032679352041472

Regards,
Vignesh
On 3/17/23 06:53, John Naylor wrote:
> On Wed, Mar 15, 2023 at 7:51 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> Attached is a patch fixing this.
>
> Okay, I can verify that with 0001-0006, sequences don't replicate unless
> specified. I do see an additional change that doesn't make sense: On the
> subscriber I no longer see a jump to the logged 32 increment, I see the
> very next value:
>
> ...
>
> The last two queries on the subscriber show:
>
> i
> ---
> 1
> 2
> 3
> (3 rows)
>
> last_value | log_cnt | is_called
> ------------+---------+-----------
> 3 | 30 | t
> (1 row)
>
> ...whereas before with 0001-0003 I saw:
>
> i
> ----
> 1
> 2
> 34
> (3 rows)
>
> last_value | log_cnt | is_called
> ------------+---------+-----------
> 34 | 32 | t

Oh, this is a silly thinko in how sequences are synced at the beginning (or maybe a combination of two issues). fetch_sequence_data() simply runs a select from the sequence

SELECT last_value, log_cnt, is_called

but that's wrong, because that's the *current* state of the sequence, at the moment it's initially synced. To make this "correct" with respect to the decoding, we'd need to deduce what the last WAL record was, so something like

last_value += log_cnt + 1

That should produce 34 again. FWIW the older patch has this issue too, I believe the difference is merely due to a slightly different timing between the sync and decoding the first insert. If you insert a sleep after the CREATE SUBSCRIPTION commands, it should disappear.

This however made me realize the initial sync of sequences may not be correct. I mean, the idea of tablesync is syncing the data in a REPEATABLE READ transaction, and then applying decoded changes. But sequences are not transactional in this way - if you select from a sequence, you'll always see the latest data, even in REPEATABLE READ.

I wonder if this might result in losing some of the sequence increments, and/or applying them in the wrong order (so that the sequence goes backward for a while).

>> > The documentation for CREATE PUBLICATION mentions sequence options,
>> > but doesn't really say how these options should be used.
>> Good point. The idea is that we handle tables and sequences the same
>> way, i.e. if you specify 'sequence' then we'll replicate increments for
>> sequences explicitly added to the publication.
>>
>> If this is not clear, the docs may need some improvements.
>
> Aside from docs, I'm not clear what some of the tests are doing:
>
> +CREATE PUBLICATION testpub_forallsequences FOR ALL SEQUENCES WITH
> (publish = 'sequence');
> +RESET client_min_messages;
> +ALTER PUBLICATION testpub_forallsequences SET (publish = 'insert,
> sequence');
>
> What does it mean to add 'insert' to a sequence publication?

I don't recall why this particular test exists, but you can still add tables to a "for all sequences" publication. IMO it's fine to allow adding actions that are irrelevant for currently published objects, we don't have a cross-check to prevent that (how would you even do that e.g. for FOR ALL TABLES publications?).

> Likewise, from a brief change in my test above, 'sequence' seems to be a
> noise word for table publications. I'm not fully read up on the
> background of this topic, but wanted to make sure I understood the
> design of the syntax.

I think it's fine, for the same reason as above.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
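Per the observation above, the non-MVCC read behavior is easy to see directly (two sessions, sequence name hypothetical); a table read would be isolated here, but the sequence read is not:

-- Session 1
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT last_value FROM s1;     -- say this returns 10

-- Session 2, meanwhile
SELECT nextval('s1') FROM generate_series(1, 100);

-- Session 1, same snapshot
SELECT last_value FROM s1;     -- now ~110: the read bypasses the snapshot
COMMIT;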
On 3/17/23 18:55, Tomas Vondra wrote:
>
> ...
>
> This however made me realize the initial sync of sequences may not be
> correct. I mean, the idea of tablesync is syncing the data in REPEATABLE
> READ transaction, and then applying decoded changes. But sequences are
> not transactional in this way - if you select from a sequence, you'll
> always see the latest data, even in REPEATABLE READ.
>
> I wonder if this might result in losing some of the sequence increments,
> and/or applying them in the wrong order (so that the sequence goes
> backward for a while).

Yeah, I think my suspicion was warranted - it's pretty easy to make the sequence go backwards for a while by adding a sleep between the slot creation and the copy_sequence() call, and incrementing the sequence in between (enough to do some WAL logging).

The copy_sequence() then reads the current on-disk state (because of the non-transactional nature w.r.t. REPEATABLE READ), applies it, and then we start processing the WAL added since the slot creation. But those records are older, so stuff like this happens:

21:52:54.147 CET [35404] WARNING: copy_sequence 1222 0 1
21:52:54.163 CET [35404] WARNING: apply_handle_sequence 990 0 1
21:52:54.163 CET [35404] WARNING: apply_handle_sequence 1023 0 1
21:52:54.163 CET [35404] WARNING: apply_handle_sequence 1056 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1089 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1122 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1155 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1188 0 1
21:52:54.175 CET [35404] WARNING: apply_handle_sequence 1221 0 1
21:52:54.898 CET [35402] WARNING: apply_handle_sequence 1254 0 1

Clearly, for sequences we can't quite rely on snapshots/slots, we need to get the LSN to decide what changes to apply/skip from somewhere else. I wonder if we can just ignore the queued changes in tablesync, but I guess not - there can be queued increments after reading the sequence state, and we need to apply those. But maybe we could use the page LSN from the relfilenode - that should be the LSN of the last WAL record.

Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we use to read the sequence state ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
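A sketch of that last idea (hypothetical query; whether the insert LSN is a safe cutoff is exactly what the following messages debate): fetch the state and a cutoff LSN in one round trip, then have the apply side discard older decoded increments:

-- Run by the sync worker on the publisher:
SELECT last_value, log_cnt, is_called,
       pg_current_wal_insert_lsn() AS cutoff_lsn
FROM s1;

-- Apply side: discard decoded changes for s1 with lsn <= cutoff_lsn,
-- apply anything newer.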
On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/17/23 18:55, Tomas Vondra wrote:
> >
> > ...
> >
> > I wonder if this might result in losing some of the sequence increments,
> > and/or applying them in the wrong order (so that the sequence goes
> > backward for a while).
>
> Yeah, I think my suspicion was warranted - it's pretty easy to make the
> sequence go backwards for a while by adding a sleep between the slot
> creation and the copy_sequence() call, and incrementing the sequence in
> between (enough to do some WAL logging).
>
> The copy_sequence() then reads the current on-disk state (because of the
> non-transactional nature w.r.t. REPEATABLE READ), applies it, and then
> we start processing the WAL added since the slot creation. But those
> records are older, so stuff like this happens:
>
> ...
>
> Clearly, for sequences we can't quite rely on snapshots/slots, we need
> to get the LSN to decide what changes to apply/skip from somewhere else.
> I wonder if we can just ignore the queued changes in tablesync, but I
> guess not - there can be queued increments after reading the sequence
> state, and we need to apply those. But maybe we could use the page LSN
> from the relfilenode - that should be the LSN of the last WAL record.
>
> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
> use to read the sequence state ...

What if some Alter Sequence is performed before the copy starts, and the containing transaction is rolled back after the copy is finished? Won't it copy something which shouldn't have been copied?

--
With Regards,
Amit Kapila.
On 3/18/23 06:35, Amit Kapila wrote:
> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> Clearly, for sequences we can't quite rely on snapshots/slots, we need
>> to get the LSN to decide what changes to apply/skip from somewhere else.
>> I wonder if we can just ignore the queued changes in tablesync, but I
>> guess not - there can be queued increments after reading the sequence
>> state, and we need to apply those. But maybe we could use the page LSN
>> from the relfilenode - that should be the LSN of the last WAL record.
>>
>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
>> use to read the sequence state ...
>
> What if some Alter Sequence is performed before the copy starts, and the
> containing transaction is rolled back after the copy is finished? Won't
> it copy something which shouldn't have been copied?

That shouldn't be possible - the alter creates a new relfilenode and it's invisible until commit. So either it gets committed (and then replicated), or it remains invisible to the SELECT during sync.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/18/23 06:35, Amit Kapila wrote:
> > On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> ...
> >>
> >> Clearly, for sequences we can't quite rely on snapshots/slots, we need
> >> to get the LSN to decide what changes to apply/skip from somewhere else.
> >> I wonder if we can just ignore the queued changes in tablesync, but I
> >> guess not - there can be queued increments after reading the sequence
> >> state, and we need to apply those. But maybe we could use the page LSN
> >> from the relfilenode - that should be the LSN of the last WAL record.
> >>
> >> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
> >> use to read the sequence state ...
> >
> > What if some Alter Sequence is performed before the copy starts, and the
> > containing transaction is rolled back after the copy is finished? Won't
> > it copy something which shouldn't have been copied?
>
> That shouldn't be possible - the alter creates a new relfilenode and
> it's invisible until commit. So either it gets committed (and then
> replicated), or it remains invisible to the SELECT during sync.

Okay, however, we need to ensure that such a change will later be replicated, and also need to ensure that the required WAL doesn't get removed.

Say, if we use your first idea of the page LSN from the relfilenode, then how do we ensure that the corresponding WAL doesn't get removed when the sync worker later tries to start replication from that LSN? I am imagining here that the sync_sequence_slot will be created before copy_sequence, but even then it is possible that the sequence has not been updated for a long time and the LSN location will be in the past (as compared to the slot's LSN), which means the corresponding WAL could be removed. Now, here we can't directly start using the slot's LSN to stream changes because there is no correlation of it with the LSN (page LSN of the sequence's relfilenode) where we want to start streaming.

Now, for the second idea, which is to directly use pg_current_wal_insert_lsn(), I think we won't be able to ensure that the changes covered by in-progress transactions, like the one with the Alter Sequence example I have given, would be streamed later after the initial copy. Because the LSN returned by pg_current_wal_insert_lsn() could be an LSN after the LSN associated with the Alter Sequence but before the corresponding xact's commit.

--
With Regards,
Amit Kapila.
On 3/20/23 04:42, Amit Kapila wrote:
> On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> That shouldn't be possible - the alter creates a new relfilenode and
>> it's invisible until commit. So either it gets committed (and then
>> replicated), or it remains invisible to the SELECT during sync.
>
> Okay, however, we need to ensure that such a change will later be
> replicated, and also need to ensure that the required WAL doesn't get
> removed.
>
> Say, if we use your first idea of the page LSN from the relfilenode, then
> how do we ensure that the corresponding WAL doesn't get removed when
> the sync worker later tries to start replication from that LSN? ...

I don't understand why we'd need WAL from before the slot is created, which happens before copy_sequence, so the sync will see a more recent state (reflecting all changes up to the slot LSN).

I think the only "issue" are the WAL records after the slot LSN, or more precisely deciding which of the decoded changes to apply.

> Now, for the second idea, which is to directly use
> pg_current_wal_insert_lsn(), I think we won't be able to ensure that
> the changes covered by in-progress transactions, like the one with the
> Alter Sequence example I have given, would be streamed later after the
> initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
> could be an LSN after the LSN associated with the Alter Sequence but
> before the corresponding xact's commit.

Yeah, I think you're right - the locking itself is not sufficient to prevent this ordering of operations. copy_sequence would have to lock the sequence exclusively, which seems a bit disruptive.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
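For reference, both pieces of state discussed here are easy to inspect from SQL (sequence name hypothetical; the page LSN check assumes the pageinspect extension is available):

-- ALTER SEQUENCE creates a new relfilenode:
SELECT pg_relation_filenode('s1');
ALTER SEQUENCE s1 RESTART;
SELECT pg_relation_filenode('s1');   -- different value after the ALTER

-- Page LSN of the (single) sequence page, i.e. the LSN of the last
-- WAL record that touched it:
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lsn FROM page_header(get_raw_page('s1', 0));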
On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/20/23 04:42, Amit Kapila wrote:
> > ...
> >
> > Say, if we use your first idea of the page LSN from the relfilenode, then
> > how do we ensure that the corresponding WAL doesn't get removed when
> > the sync worker later tries to start replication from that LSN? ...
>
> I don't understand why we'd need WAL from before the slot is created,
> which happens before copy_sequence, so the sync will see a more recent
> state (reflecting all changes up to the slot LSN).

Imagine the following sequence of events:

1. An operation on a sequence seq-1 which requires WAL. Say, this is done at LSN 1000.
2. Some other random operations on unrelated objects. This would increase the LSN to 2000.
3. Create a slot that uses the current LSN 2000.
4. Copy sequence seq-1, where you will get the LSN value as 1000. Then you will use LSN 1000 as a starting point to start replication in the sequence sync worker.

It is quite possible that WAL from LSN 1000 may not be present. Now, it may be possible that we use the slot's LSN in this case but currently, it may not be possible without some changes in the slot machinery. Even if we somehow solve this, we have the below problem where we can miss some concurrent activity.

> I think the only "issue" are the WAL records after the slot LSN, or more
> precisely deciding which of the decoded changes to apply.
>
> > Now, for the second idea, which is to directly use
> > pg_current_wal_insert_lsn(), I think we won't be able to ensure that
> > the changes covered by in-progress transactions, like the one with the
> > Alter Sequence example I have given, would be streamed later after the
> > initial copy. ...
>
> Yeah, I think you're right - the locking itself is not sufficient to
> prevent this ordering of operations. copy_sequence would have to lock
> the sequence exclusively, which seems a bit disruptive.

Right, that doesn't sound like a good idea.

--
With Regards,
Amit Kapila.
On 3/20/23 12:00, Amit Kapila wrote: > On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> >> On 3/20/23 04:42, Amit Kapila wrote: >>> On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> On 3/18/23 06:35, Amit Kapila wrote: >>>>> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra >>>>> <tomas.vondra@enterprisedb.com> wrote: >>>>>> >>>>>> ... >>>>>> >>>>>> Clearly, for sequences we can't quite rely on snapshots/slots, we need >>>>>> to get the LSN to decide what changes to apply/skip from somewhere else. >>>>>> I wonder if we can just ignore the queued changes in tablesync, but I >>>>>> guess not - there can be queued increments after reading the sequence >>>>>> state, and we need to apply those. But maybe we could use the page LSN >>>>>> from the relfilenode - that should be the LSN of the last WAL record. >>>>>> >>>>>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we >>>>>> use to read the sequence state ... >>>>>> >>>>> >>>>> What if some Alter Sequence is performed before the copy starts and >>>>> after the copy is finished, the containing transaction rolled back? >>>>> Won't it copy something which shouldn't have been copied? >>>>> >>>> >>>> That shouldn't be possible - the alter creates a new relfilenode and >>>> it's invisible until commit. So either it gets committed (and then >>>> replicated), or it remains invisible to the SELECT during sync. >>>> >>> >>> Okay, however, we need to ensure that such a change will later be >>> replicated and also need to ensure that the required WAL doesn't get >>> removed. >>> >>> Say, if we use your first idea of page LSN from the relfilenode, then >>> how do we ensure that the corresponding WAL doesn't get removed when >>> later the sync worker tries to start replication from that LSN? I am >>> imagining here the sync_sequence_slot will be created before >>> copy_sequence but even then it is possible that the sequence has not >>> been updated for a long time and the LSN location will be in the past >>> (as compared to the slot's LSN) which means the corresponding WAL >>> could be removed. Now, here we can't directly start using the slot's >>> LSN to stream changes because there is no correlation of it with the >>> LSN (page LSN of sequence's relfilnode) where we want to start >>> streaming. >>> >> >> I don't understand why we'd need WAL from before the slot is created, >> which happens before copy_sequence so the sync will see a more recent >> state (reflecting all changes up to the slot LSN). >> > > Imagine the following sequence of events: > 1. Operation on a sequence seq-1 which requires WAL. Say, this is done > at LSN 1000. > 2. Some other random operations on unrelated objects. This would > increase LSN to 2000. > 3. Create a slot that uses current LSN 2000. > 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then > you will use LSN 1000 as a starting point to start replication in > sequence sync worker. > > It is quite possible that WAL from LSN 1000 may not be present. Now, > it may be possible that we use the slot's LSN in this case but > currently, it may not be possible without some changes in the slot > machinery. Even, if we somehow solve this, we have the below problem > where we can miss some concurrent activity. > I think the question is what would be the WAL-requiring operation at LSN 1000. 
If it's just regular nextval(), then we *will* see it during copy_sequence - sequences are not transactional in the MVCC sense. If it's an ALTER SEQUENCE, I guess it might create a new relfilenode, and then we might fail to apply this - that'd be bad. I wonder whether we'd actually allow the WAL to be discarded while building the consistent snapshot, though. You're right, however, that we can't just decide this based on LSN - we'd probably need to compare the relfilenodes too, or something like that ... >> I think the only "issue" are the WAL records after the slot LSN, or more >> precisely deciding which of the decoded changes to apply. >> >> >>> Now, for the second idea which is to directly use >>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that >>> the changes covered by in-progress transactions like the one with >>> Alter Sequence I have given example would be streamed later after the >>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn() >>> could be an LSN after the LSN associated with Alter Sequence but >>> before the corresponding xact's commit. >> >> Yeah, I think you're right - the locking itself is not sufficient to >> prevent this ordering of operations. copy_sequence would have to lock >> the sequence exclusively, which seems bit disruptive. >> > > Right, that doesn't sound like a good idea. > Although, maybe we could use a less strict lock level? I mean, one that allows nextval() to continue, but would conflict with ALTER SEQUENCE. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
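To make the second idea concrete, copy_sequence could read the sequence state and the insert LSN in a single statement, roughly like this (just a sketch - 's' is a placeholder sequence name and the actual query in the patch may differ):

    SELECT last_value, log_cnt, is_called, pg_current_wal_insert_lsn()
      FROM s;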
On Mon, Mar 20, 2023 at 5:13 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/20/23 12:00, Amit Kapila wrote: > > On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> > >> I don't understand why we'd need WAL from before the slot is created, > >> which happens before copy_sequence so the sync will see a more recent > >> state (reflecting all changes up to the slot LSN). > >> > > > > Imagine the following sequence of events: > > 1. Operation on a sequence seq-1 which requires WAL. Say, this is done > > at LSN 1000. > > 2. Some other random operations on unrelated objects. This would > > increase LSN to 2000. > > 3. Create a slot that uses current LSN 2000. > > 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then > > you will use LSN 1000 as a starting point to start replication in > > sequence sync worker. > > > > It is quite possible that WAL from LSN 1000 may not be present. Now, > > it may be possible that we use the slot's LSN in this case but > > currently, it may not be possible without some changes in the slot > > machinery. Even, if we somehow solve this, we have the below problem > > where we can miss some concurrent activity. > > > > I think the question is what would be the WAL-requiring operation at LSN > 1000. If it's just regular nextval(), then we *will* see it during > copy_sequence - sequences are not transactional in the MVCC sense. > > If it's an ALTER SEQUENCE, I guess it might create a new relfilenode, > and then we might fail to apply this - that'd be bad. > > I wonder if we'd allow actually discarding the WAL while building the > consistent snapshot, though. > No, as soon as we reserve the WAL location, we update the slot's minLSN (replicationSlotMinLSN) which would prevent the required WAL from being removed. > You're however right we can't just decide > this based on LSN, we'd probably need to compare the relfilenodes too or > something like that ... > > >> I think the only "issue" are the WAL records after the slot LSN, or more > >> precisely deciding which of the decoded changes to apply. > >> > >> > >>> Now, for the second idea which is to directly use > >>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that > >>> the changes covered by in-progress transactions like the one with > >>> Alter Sequence I have given example would be streamed later after the > >>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn() > >>> could be an LSN after the LSN associated with Alter Sequence but > >>> before the corresponding xact's commit. > >> > >> Yeah, I think you're right - the locking itself is not sufficient to > >> prevent this ordering of operations. copy_sequence would have to lock > >> the sequence exclusively, which seems bit disruptive. > >> > > > > Right, that doesn't sound like a good idea. > > > > Although, maybe we could use a less strict lock level? I mean, one that > allows nextval() to continue, but would conflict with ALTER SEQUENCE. > I don't know if that is a good idea but are you imagining a special interface/mechanism just for logical replication because as far as I can see you have used SELECT to fetch the sequence values? -- With Regards, Amit Kapila.
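For reference, this slot behavior is visible at the SQL level - once a slot exists, its restart_lsn marks the position from which the server has to retain WAL:

    -- WAL at or after restart_lsn is kept around for the slot
    SELECT slot_name, restart_lsn, confirmed_flush_lsn
      FROM pg_replication_slots;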
On 3/20/23 13:26, Amit Kapila wrote: > On Mon, Mar 20, 2023 at 5:13 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 3/20/23 12:00, Amit Kapila wrote: >>> On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> >>>> I don't understand why we'd need WAL from before the slot is created, >>>> which happens before copy_sequence so the sync will see a more recent >>>> state (reflecting all changes up to the slot LSN). >>>> >>> >>> Imagine the following sequence of events: >>> 1. Operation on a sequence seq-1 which requires WAL. Say, this is done >>> at LSN 1000. >>> 2. Some other random operations on unrelated objects. This would >>> increase LSN to 2000. >>> 3. Create a slot that uses current LSN 2000. >>> 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then >>> you will use LSN 1000 as a starting point to start replication in >>> sequence sync worker. >>> >>> It is quite possible that WAL from LSN 1000 may not be present. Now, >>> it may be possible that we use the slot's LSN in this case but >>> currently, it may not be possible without some changes in the slot >>> machinery. Even, if we somehow solve this, we have the below problem >>> where we can miss some concurrent activity. >>> >> >> I think the question is what would be the WAL-requiring operation at LSN >> 1000. If it's just regular nextval(), then we *will* see it during >> copy_sequence - sequences are not transactional in the MVCC sense. >> >> If it's an ALTER SEQUENCE, I guess it might create a new relfilenode, >> and then we might fail to apply this - that'd be bad. >> >> I wonder if we'd allow actually discarding the WAL while building the >> consistent snapshot, though. >> > > No, as soon as we reserve the WAL location, we update the slot's > minLSN (replicationSlotMinLSN) which would prevent the required WAL > from being removed. > >> You're however right we can't just decide >> this based on LSN, we'd probably need to compare the relfilenodes too or >> something like that ... >> >>>> I think the only "issue" are the WAL records after the slot LSN, or more >>>> precisely deciding which of the decoded changes to apply. >>>> >>>> >>>>> Now, for the second idea which is to directly use >>>>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that >>>>> the changes covered by in-progress transactions like the one with >>>>> Alter Sequence I have given example would be streamed later after the >>>>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn() >>>>> could be an LSN after the LSN associated with Alter Sequence but >>>>> before the corresponding xact's commit. >>>> >>>> Yeah, I think you're right - the locking itself is not sufficient to >>>> prevent this ordering of operations. copy_sequence would have to lock >>>> the sequence exclusively, which seems bit disruptive. >>>> >>> >>> Right, that doesn't sound like a good idea. >>> >> >> Although, maybe we could use a less strict lock level? I mean, one that >> allows nextval() to continue, but would conflict with ALTER SEQUENCE. >> > > I don't know if that is a good idea but are you imagining a special > interface/mechanism just for logical replication because as far as I > can see you have used SELECT to fetch the sequence values? > Not sure what the special mechanism would be. I don't think it could read the sequence from somewhere else, and due to the lack of MVCC we'd just read the same sequence data from the current relfilenode. Or what else would it do?
The one thing we can't quite do at the moment is locking the sequence, because LOCK is only supported for tables. So we could either provide a function that locks a sequence, or one that locks it and then returns the current state (as if we did a SELECT). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
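To make that concrete: the first statement below fails on current releases, and the second shows the rough shape such a function could have (the name matches the function introduced later in this thread, but the exact signature is an assumption):

    LOCK TABLE s;                                     -- fails today, LOCK only accepts tables (and views)
    SELECT pg_sequence_lock_for_sync('s'::regclass);  -- hypothetical: lock the sequence, possibly
                                                      -- returning its current state as well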
On 3/20/23 18:03, Tomas Vondra wrote: > > ... >> >> I don't know if that is a good idea but are you imagining a special >> interface/mechanism just for logical replication because as far as I >> can see you have used SELECT to fetch the sequence values? >> > > Not sure what would the special mechanism be? I don't think it could > read the sequence from somewhere else, and due the lack of MVCC we'd > just read same sequence data from the current relfilenode. Or what else > would it do? > I was thinking about alternative ways to do this, but I couldn't think of anything. The non-MVCC behavior of sequences means it's not really possible to do this based on snapshots / slots or stuff like that ... > The one thing we can't quite do at the moment is locking the sequence, > because LOCK is only supported for tables. So we could either provide a > function to lock a sequence, or locks it and then returns the current > state (as if we did a SELECT). > ... so I took a stab at doing it like this. I didn't feel relaxing LOCK restrictions to also allow locking sequences would be the right choice, so I added a new function pg_sequence_lock_for_sync(). I wonder if we could/should restrict this to logical replication use, somehow. The interlock happens right after creating the slot - I was thinking about doing it even before the slot gets created, but that's not possible, because slot creation installs a snapshot (so it has to be the first command in the transaction). It acquires RowExclusiveLock, which is enough to conflict with ALTER SEQUENCE, but allows nextval(). AFAICS this does the trick - if there's an ALTER SEQUENCE, we'll wait for it to complete. And copy_sequence() will read the resulting state, even though this is REPEATABLE READ - remember, sequences are not subject to that consistency. The one anomaly I can think of is that the sequence might seem to go "backwards" for a little bit during the sync. Imagine this sequence of operations: 1) tablesync creates slot 2) S1 does ALTER SEQUENCE ... RESTART WITH 20 (gets lock) 3) S2 tries ALTER SEQUENCE ... RESTART WITH 100 (waits for lock) 4) tablesync requests lock 5) S1 does the thing, commits 6) S2 acquires lock, does the thing, commits 7) tablesync gets lock, reads current sequence state 8) tablesync decodes changes from S1 and S2, applies them But I think this is fine - it's part of the catchup, and until that's done the sync is not considered completed. I merged the earlier "fixup" patches into the relevant parts, and left two patches with new tweaks (deducing the correct "WAL" state from the current state read by copy_sequence), and the interlock discussed here. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
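Put together, the interlock in the sync worker would look roughly like this (a sketch - the slot creation happens over the replication protocol and is only hinted at here, and the locking function's signature is an assumption):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    -- the replication slot is created here, as the first command,
    -- because creating it installs the transaction snapshot
    SELECT pg_sequence_lock_for_sync('s'::regclass);  -- RowExclusiveLock: waits for
                                                      -- ALTER SEQUENCE, allows nextval()
    SELECT last_value, log_cnt, is_called FROM s;     -- copy_sequence reads this state
    COMMIT;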
Attachment
Hi, On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I merged the earlier "fixup" patches into the relevant parts, and left > two patches with new tweaks (deducing the corrent "WAL" state from the > current state read by copy_sequence), and the interlock discussed here. > Apart from that, how does a publication having sequences work with subscribers that are not able to handle sequence changes, e.g. in a case where the publisher's PostgreSQL version is newer than the subscriber's? As far as I tested the latest patches, the subscriber (v15) errors out with the error 'invalid logical replication message type "Q"' when receiving a sequence change. I'm not sure that's sensible behavior. I think we should instead either (1) deny starting the replication if the subscriber isn't able to handle sequence changes and the publication includes them, or (2) not send sequence changes to such subscribers. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 3/27/23 03:32, Masahiko Sawada wrote: > Hi, > > On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I merged the earlier "fixup" patches into the relevant parts, and left >> two patches with new tweaks (deducing the corrent "WAL" state from the >> current state read by copy_sequence), and the interlock discussed here. >> > > Apart from that, how does the publication having sequences work with > subscribers who are not able to handle sequence changes, e.g. in a > case where PostgreSQL version of publication is newer than the > subscriber? As far as I tested the latest patches, the subscriber > (v15) errors out with the error 'invalid logical replication message > type "Q"' when receiving a sequence change. I'm not sure it's sensible > behavior. I think we should instead either (1) deny starting the > replication if the subscriber isn't able to handle sequence changes > and the publication includes that, or (2) not send sequence changes to > such subscribers. > I agree the "invalid message" error is not great, but it's not clear to me how to do (1) - the trouble is we don't really know if the publication contains (or will contain) sequences. I mean, what would happen if the replication starts and then someone adds a sequence? For (2), I think that's not something we should do - silently discarding some messages seems error-prone. If the publication includes sequences, presumably the user wanted to replicate those. If they want to replicate to an older subscriber, they can create a publication without sequences. Perhaps the right solution would be to check whether the subscriber supports replication of sequences in the output plugin, while attempting to write the "Q" message, and error out if the subscriber does not support it. What do you think? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 3/27/23 03:32, Masahiko Sawada wrote: > > Hi, > > > > On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> I merged the earlier "fixup" patches into the relevant parts, and left > >> two patches with new tweaks (deducing the corrent "WAL" state from the > >> current state read by copy_sequence), and the interlock discussed here. > >> > > > > Apart from that, how does the publication having sequences work with > > subscribers who are not able to handle sequence changes, e.g. in a > > case where PostgreSQL version of publication is newer than the > > subscriber? As far as I tested the latest patches, the subscriber > > (v15) errors out with the error 'invalid logical replication message > > type "Q"' when receiving a sequence change. I'm not sure it's sensible > > behavior. I think we should instead either (1) deny starting the > > replication if the subscriber isn't able to handle sequence changes > > and the publication includes that, or (2) not send sequence changes to > > such subscribers. > > > > I agree the "invalid message" error is not great, but it's not clear to > me how to do either (1). The trouble is we don't really know if the > publication contains (or will contain) sequences. I mean, what would > happen if the replication starts and then someone adds a sequence? > > For (2), I think that's not something we should do - silently discarding > some messages seems error-prone. If the publication includes sequences, > presumably the user wanted to replicate those. If they want to replicate > to an older subscriber, create a publication without sequences. > > Perhaps the right solution would be to check if the subscriber supports > replication of sequences in the output plugin, while attempting to write > the "Q" message. And error-out if the subscriber does not support it. It might be related to this topic; do we need to bump the protocol version? The commit 64824323e57d introduced new streaming callbacks and bumped the protocol version. I think the same seems to be true for this change as it adds sequence_cb callback. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 3/28/23 18:34, Masahiko Sawada wrote: > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> >> >> On 3/27/23 03:32, Masahiko Sawada wrote: >>> Hi, >>> >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> I merged the earlier "fixup" patches into the relevant parts, and left >>>> two patches with new tweaks (deducing the corrent "WAL" state from the >>>> current state read by copy_sequence), and the interlock discussed here. >>>> >>> >>> Apart from that, how does the publication having sequences work with >>> subscribers who are not able to handle sequence changes, e.g. in a >>> case where PostgreSQL version of publication is newer than the >>> subscriber? As far as I tested the latest patches, the subscriber >>> (v15) errors out with the error 'invalid logical replication message >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible >>> behavior. I think we should instead either (1) deny starting the >>> replication if the subscriber isn't able to handle sequence changes >>> and the publication includes that, or (2) not send sequence changes to >>> such subscribers. >>> >> >> I agree the "invalid message" error is not great, but it's not clear to >> me how to do either (1). The trouble is we don't really know if the >> publication contains (or will contain) sequences. I mean, what would >> happen if the replication starts and then someone adds a sequence? >> >> For (2), I think that's not something we should do - silently discarding >> some messages seems error-prone. If the publication includes sequences, >> presumably the user wanted to replicate those. If they want to replicate >> to an older subscriber, create a publication without sequences. >> >> Perhaps the right solution would be to check if the subscriber supports >> replication of sequences in the output plugin, while attempting to write >> the "Q" message. And error-out if the subscriber does not support it. > > It might be related to this topic; do we need to bump the protocol > version? The commit 64824323e57d introduced new streaming callbacks > and bumped the protocol version. I think the same seems to be true for > this change as it adds sequence_cb callback. > It's not clear to me what the exact behavior should be. I mean, imagine we're opening a connection for logical replication, and the subscriber does not handle sequences. What should the publisher do? (Note: The correct commit hash is 464824323e57d.) I don't think streaming is a good match for sequences, because of a couple of important differences ... Firstly, streaming determines *how* the changes are replicated, not what gets replicated. It doesn't (silently) filter out "bad" events that the subscriber doesn't know how to apply. If the subscriber does not know how to deal with streamed xacts, it'll still get the same changes exactly per the publication definition. Secondly, the default value is "streaming=off", i.e. the subscriber has to explicitly request streaming when opening the connection. And we simply check it against the negotiated protocol version, i.e. the check in pgoutput_startup() protects against a subscriber requesting protocol v1 but also streaming=on. I don't think we can/should do more checks at this point - we don't know what's included in the requested publications at that point, and I doubt it's worth adding because we certainly can't predict if the publication will be altered to include/decode sequences in the future.
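For comparison, this is what the streaming negotiation looks like on the wire today - the subscriber opts in explicitly when starting the stream, and pgoutput validates the option against the negotiated protocol version (slot and publication names are placeholders):

    START_REPLICATION SLOT "sub" LOGICAL 0/0
        (proto_version '2', streaming 'on', publication_names '"pub"')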
Speaking of precedents, TRUNCATE is probably a better one, because it's a new action and it determines *what* the subscriber can handle. But that does exactly the thing we do for sequences - if you open a connection from a PG10 subscriber (truncate was added in PG11), and the publisher decodes a truncate, the subscriber will do: 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical replication message type "T" 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical replication worker for subscription 16390 (PID 2357609) exited with exit code 1 I don't see why sequences should do anything else. If you need to replicate to such a subscriber, create a publication that does not have 'sequence' in the publish option ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
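With the patch, replicating to such an older subscriber would then simply mean leaving 'sequence' out of the publish option when creating the publication (a sketch, assuming the option name used by the patch series):

    CREATE PUBLICATION pub FOR ALL TABLES
        WITH (publish = 'insert, update, delete, truncate');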
On Wed, Mar 29, 2023 at 3:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/28/23 18:34, Masahiko Sawada wrote: > > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> > >> > >> On 3/27/23 03:32, Masahiko Sawada wrote: > >>> Hi, > >>> > >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > >>> <tomas.vondra@enterprisedb.com> wrote: > >>>> > >>>> I merged the earlier "fixup" patches into the relevant parts, and left > >>>> two patches with new tweaks (deducing the corrent "WAL" state from the > >>>> current state read by copy_sequence), and the interlock discussed here. > >>>> > >>> > >>> Apart from that, how does the publication having sequences work with > >>> subscribers who are not able to handle sequence changes, e.g. in a > >>> case where PostgreSQL version of publication is newer than the > >>> subscriber? As far as I tested the latest patches, the subscriber > >>> (v15) errors out with the error 'invalid logical replication message > >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible > >>> behavior. I think we should instead either (1) deny starting the > >>> replication if the subscriber isn't able to handle sequence changes > >>> and the publication includes that, or (2) not send sequence changes to > >>> such subscribers. > >>> > >> > >> I agree the "invalid message" error is not great, but it's not clear to > >> me how to do either (1). The trouble is we don't really know if the > >> publication contains (or will contain) sequences. I mean, what would > >> happen if the replication starts and then someone adds a sequence? > >> > >> For (2), I think that's not something we should do - silently discarding > >> some messages seems error-prone. If the publication includes sequences, > >> presumably the user wanted to replicate those. If they want to replicate > >> to an older subscriber, create a publication without sequences. > >> > >> Perhaps the right solution would be to check if the subscriber supports > >> replication of sequences in the output plugin, while attempting to write > >> the "Q" message. And error-out if the subscriber does not support it. > > > > It might be related to this topic; do we need to bump the protocol > > version? The commit 64824323e57d introduced new streaming callbacks > > and bumped the protocol version. I think the same seems to be true for > > this change as it adds sequence_cb callback. > > > > It's not clear to me what should be the exact behavior? > > I mean, imagine we're opening a connection for logical replication, and > the subscriber does not handle sequences. What should the publisher do? > > (Note: The correct commit hash is 464824323e57d.) Thanks. > > I don't think the streaming is a good match for sequences, because of a > couple important differences ... > > Firstly, streaming determines *how* the changes are replicated, not what > gets replicated. It doesn't (silently) filter out "bad" events that the > subscriber doesn't know how to apply. If the subscriber does not know > how to deal with streamed xacts, it'll still get the same changes > exactly per the publication definition. > > Secondly, the default value is "streming=off", i.e. the subscriber has > to explicitly request streaming when opening the connection. And we > simply check it against the negotiated protocol version, i.e. the check > in pgoutput_startup() protects against subscriber requesting a protocol > v1 but also streaming=on. 
> > I don't think we can/should do more check at this point - we don't know > what's included in the requested publications at that point, and I doubt > it's worth adding because we certainly can't predict if the publication > will be altered to include/decode sequences in the future. True. That's a valid argument. > > Speaking of precedents, TRUNCATE is probably a better one, because it's > a new action and it determines *what* the subscriber can handle. But > that does exactly the thing we do for sequences - if you open a > connection from PG10 subscriber (truncate was added in PG11), and the > publisher decodes a truncate, subscriber will do: > > 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > replication message type "T" > 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > replication worker for subscription 16390 (PID 2357609) exited with > exit code 1 > > I don't see why sequences should do anything else. If you need to > replicate to such subscriber, create a publication that does not have > 'sequence' in the publish option ... > I didn't check the TRUNCATE case - yes, that's a good match for sequence replication. So it seems we don't need to do anything. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 29, 2023 at 12:04 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/28/23 18:34, Masahiko Sawada wrote: > > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >>> > >>> Apart from that, how does the publication having sequences work with > >>> subscribers who are not able to handle sequence changes, e.g. in a > >>> case where PostgreSQL version of publication is newer than the > >>> subscriber? As far as I tested the latest patches, the subscriber > >>> (v15) errors out with the error 'invalid logical replication message > >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible > >>> behavior. I think we should instead either (1) deny starting the > >>> replication if the subscriber isn't able to handle sequence changes > >>> and the publication includes that, or (2) not send sequence changes to > >>> such subscribers. > >>> > >> > >> I agree the "invalid message" error is not great, but it's not clear to > >> me how to do either (1). The trouble is we don't really know if the > >> publication contains (or will contain) sequences. I mean, what would > >> happen if the replication starts and then someone adds a sequence? > >> > >> For (2), I think that's not something we should do - silently discarding > >> some messages seems error-prone. If the publication includes sequences, > >> presumably the user wanted to replicate those. If they want to replicate > >> to an older subscriber, create a publication without sequences. > >> > >> Perhaps the right solution would be to check if the subscriber supports > >> replication of sequences in the output plugin, while attempting to write > >> the "Q" message. And error-out if the subscriber does not support it. > > > > It might be related to this topic; do we need to bump the protocol > > version? The commit 64824323e57d introduced new streaming callbacks > > and bumped the protocol version. I think the same seems to be true for > > this change as it adds sequence_cb callback. > > > > It's not clear to me what should be the exact behavior? > > I mean, imagine we're opening a connection for logical replication, and > the subscriber does not handle sequences. What should the publisher do? > I think deciding anything at the publisher would be tricky. But wouldn't it be better if, by default, we disallowed connections from a subscriber when the publisher's version is higher, and then allowed them only based on some subscription option? Or maybe allow the connection to a higher version by default, but disallow it based on an option. > > Speaking of precedents, TRUNCATE is probably a better one, because it's > a new action and it determines *what* the subscriber can handle. But > that does exactly the thing we do for sequences - if you open a > connection from PG10 subscriber (truncate was added in PG11), and the > publisher decodes a truncate, subscriber will do: > > 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > replication message type "T" > 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > replication worker for subscription 16390 (PID 2357609) exited with > exit code 1 > > I don't see why sequences should do anything else. > Is this behavior of TRUNCATE known or discussed previously? I can't see any mention of it in the docs or commit message. I guess if we want to follow such behavior, it should be well documented so that it won't be a surprise for users. I think we would face such cases in the future as well.
One of the similar cases we are discussing is DDL replication, where a higher-version publisher could send some DDL syntax that lower-version subscribers won't support, which will lead to an error [1]. [1] - https://www.postgresql.org/message-id/OS0PR01MB5716088E497BDCBCED7FC3DA94849%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On 3/29/23 11:51, Amit Kapila wrote: > On Wed, Mar 29, 2023 at 12:04 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 3/28/23 18:34, Masahiko Sawada wrote: >>> On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>>> >>>>> Apart from that, how does the publication having sequences work with >>>>> subscribers who are not able to handle sequence changes, e.g. in a >>>>> case where PostgreSQL version of publication is newer than the >>>>> subscriber? As far as I tested the latest patches, the subscriber >>>>> (v15) errors out with the error 'invalid logical replication message >>>>> type "Q"' when receiving a sequence change. I'm not sure it's sensible >>>>> behavior. I think we should instead either (1) deny starting the >>>>> replication if the subscriber isn't able to handle sequence changes >>>>> and the publication includes that, or (2) not send sequence changes to >>>>> such subscribers. >>>>> >>>> >>>> I agree the "invalid message" error is not great, but it's not clear to >>>> me how to do either (1). The trouble is we don't really know if the >>>> publication contains (or will contain) sequences. I mean, what would >>>> happen if the replication starts and then someone adds a sequence? >>>> >>>> For (2), I think that's not something we should do - silently discarding >>>> some messages seems error-prone. If the publication includes sequences, >>>> presumably the user wanted to replicate those. If they want to replicate >>>> to an older subscriber, create a publication without sequences. >>>> >>>> Perhaps the right solution would be to check if the subscriber supports >>>> replication of sequences in the output plugin, while attempting to write >>>> the "Q" message. And error-out if the subscriber does not support it. >>> >>> It might be related to this topic; do we need to bump the protocol >>> version? The commit 64824323e57d introduced new streaming callbacks >>> and bumped the protocol version. I think the same seems to be true for >>> this change as it adds sequence_cb callback. >>> >> >> It's not clear to me what should be the exact behavior? >> >> I mean, imagine we're opening a connection for logical replication, and >> the subscriber does not handle sequences. What should the publisher do? >> > > I think deciding anything at the publisher would be tricky but won't > it be better if by default we disallow connection from subscriber to > the publisher when the publisher's version is higher? And then allow > it only based on some subscription option or maybe by default allow > the connection to a higher version but based on option disallows the > connection. > >> >> Speaking of precedents, TRUNCATE is probably a better one, because it's >> a new action and it determines *what* the subscriber can handle. But >> that does exactly the thing we do for sequences - if you open a >> connection from PG10 subscriber (truncate was added in PG11), and the >> publisher decodes a truncate, subscriber will do: >> >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical >> replication message type "T" >> 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical >> replication worker for subscription 16390 (PID 2357609) exited with >> exit code 1 >> >> I don't see why sequences should do anything else. >> > > Is this behavior of TRUNCATE known or discussed previously? I can't > see any mention of this in the docs or commit message. 
I guess if we > want to follow such behavior it should be well documented so that it > won't be a surprise for users. I think we would face such cases in the > future as well. One of the similar cases we are discussing for DDL > replication where a higher version publisher could send some DDL > syntax that lower version subscribers won't support and will lead to > an error [1]. > I don't know where/how it's documented, TBH. FWIW I agree the TRUNCATE-like behavior (failing on the subscriber after receiving an unknown message type) is a bit annoying. Perhaps it'd be reasonable to tie the "protocol version" to subscriber capabilities, so that a protocol version guarantees what message types the subscriber understands. So we could increment the protocol version, check it in pgoutput_startup and then error out in the sequence callback if the subscriber version is too old. That'd be nicer in the sense that we'd generate a nicer error message on the publisher, not an "unknown message type" error on the subscriber. That's doable, the main problem being it'd be inconsistent with the TRUNCATE behavior. OTOH that was introduced in PG11, which is the oldest version still under support ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 29.03.23 16:28, Tomas Vondra wrote: > Perhaps it'd be reasonable to tie the "protocol version" to subscriber > capabilities, so that a protocol version guarantees what message types > the subscriber understands. So we could increment the protocol version, > check it in pgoutput_startup and then error-out in the sequence callback > if the subscriber version is too old. That would make sense. > That'd be nicer in the sense that we'd generate nicer error message on > the publisher, not an "unknown message type" on the subscriber. That's > doable, the main problem being it'd be inconsistent with the TRUNCATE > behavior. OTOH that was introduced in PG11, which is the oldest version > still under support ... I think at the time TRUNCATE support was added, we didn't have a strong sense of how the protocol versioning would work or whether it would work at all, so doing nothing was the easiest way out.
On Wed, Mar 29, 2023 at 7:58 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/29/23 11:51, Amit Kapila wrote: > >> > >> It's not clear to me what should be the exact behavior? > >> > >> I mean, imagine we're opening a connection for logical replication, and > >> the subscriber does not handle sequences. What should the publisher do? > >> > > > > I think deciding anything at the publisher would be tricky but won't > > it be better if by default we disallow connection from subscriber to > > the publisher when the publisher's version is higher? And then allow > > it only based on some subscription option or maybe by default allow > > the connection to a higher version but based on option disallows the > > connection. > > > >> > >> Speaking of precedents, TRUNCATE is probably a better one, because it's > >> a new action and it determines *what* the subscriber can handle. But > >> that does exactly the thing we do for sequences - if you open a > >> connection from PG10 subscriber (truncate was added in PG11), and the > >> publisher decodes a truncate, subscriber will do: > >> > >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > >> replication message type "T" > >> 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > >> replication worker for subscription 16390 (PID 2357609) exited with > >> exit code 1 > >> > >> I don't see why sequences should do anything else. > >> > > > > Is this behavior of TRUNCATE known or discussed previously? I can't > > see any mention of this in the docs or commit message. I guess if we > > want to follow such behavior it should be well documented so that it > > won't be a surprise for users. I think we would face such cases in the > > future as well. One of the similar cases we are discussing for DDL > > replication where a higher version publisher could send some DDL > > syntax that lower version subscribers won't support and will lead to > > an error [1]. > > > > I don't know where/how it's documented, TBH. > > FWIW I agree the TRUNCATE-like behavior (failing on subscriber after > receiving unknown message type) is a bit annoying. > > Perhaps it'd be reasonable to tie the "protocol version" to subscriber > capabilities, so that a protocol version guarantees what message types > the subscriber understands. So we could increment the protocol version, > check it in pgoutput_startup and then error-out in the sequence callback > if the subscriber version is too old. > > That'd be nicer in the sense that we'd generate nicer error message on > the publisher, not an "unknown message type" on the subscriber. > Agreed. So, we can probably formalize this rule such that whenever in a newer version publisher we want to send additional information which the old version subscriber won't be able to handle, the error should be raised at the publisher by using protocol version number. -- With Regards, Amit Kapila.
On Thu, Mar 30, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 29, 2023 at 7:58 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 3/29/23 11:51, Amit Kapila wrote: > > >> > > >> It's not clear to me what should be the exact behavior? > > >> > > >> I mean, imagine we're opening a connection for logical replication, and > > >> the subscriber does not handle sequences. What should the publisher do? > > >> > > > > > > I think deciding anything at the publisher would be tricky but won't > > > it be better if by default we disallow connection from subscriber to > > > the publisher when the publisher's version is higher? And then allow > > > it only based on some subscription option or maybe by default allow > > > the connection to a higher version but based on option disallows the > > > connection. > > > > > >> > > >> Speaking of precedents, TRUNCATE is probably a better one, because it's > > >> a new action and it determines *what* the subscriber can handle. But > > >> that does exactly the thing we do for sequences - if you open a > > >> connection from PG10 subscriber (truncate was added in PG11), and the > > >> publisher decodes a truncate, subscriber will do: > > >> > > >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > > >> replication message type "T" > > >> 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > > >> replication worker for subscription 16390 (PID 2357609) exited with > > >> exit code 1 > > >> > > >> I don't see why sequences should do anything else. > > >> > > > > > > Is this behavior of TRUNCATE known or discussed previously? I can't > > > see any mention of this in the docs or commit message. I guess if we > > > want to follow such behavior it should be well documented so that it > > > won't be a surprise for users. I think we would face such cases in the > > > future as well. One of the similar cases we are discussing for DDL > > > replication where a higher version publisher could send some DDL > > > syntax that lower version subscribers won't support and will lead to > > > an error [1]. > > > > > > > I don't know where/how it's documented, TBH. > > > > FWIW I agree the TRUNCATE-like behavior (failing on subscriber after > > receiving unknown message type) is a bit annoying. > > > > Perhaps it'd be reasonable to tie the "protocol version" to subscriber > > capabilities, so that a protocol version guarantees what message types > > the subscriber understands. So we could increment the protocol version, > > check it in pgoutput_startup and then error-out in the sequence callback > > if the subscriber version is too old. > > > > That'd be nicer in the sense that we'd generate nicer error message on > > the publisher, not an "unknown message type" on the subscriber. > > > > Agreed. So, we can probably formalize this rule such that whenever in > a newer version publisher we want to send additional information which > the old version subscriber won't be able to handle, the error should > be raised at the publisher by using protocol version number. +1 Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 3/30/23 05:15, Masahiko Sawada wrote: > > ... > >>> >>> Perhaps it'd be reasonable to tie the "protocol version" to subscriber >>> capabilities, so that a protocol version guarantees what message types >>> the subscriber understands. So we could increment the protocol version, >>> check it in pgoutput_startup and then error-out in the sequence callback >>> if the subscriber version is too old. >>> >>> That'd be nicer in the sense that we'd generate nicer error message on >>> the publisher, not an "unknown message type" on the subscriber. >>> >> >> Agreed. So, we can probably formalize this rule such that whenever in >> a newer version publisher we want to send additional information which >> the old version subscriber won't be able to handle, the error should >> be raised at the publisher by using protocol version number. > > +1 > OK, I took a stab at this, see the attached 0007 patch which bumps the protocol version, and allows the subscriber to specify "sequences" when starting the replication, similar to what we do for the two-phase stuff. The patch essentially adds 'sequences' to the replication start command, depending on the server version, but it can be overridden by the "sequences" subscription option. The patch is pretty small, but I wonder how much smarter this should be ... I think there are about 4 cases that we need to consider: 1) there are no sequences in the publication -> OK 2) publication with sequences, subscriber knows how to apply (and specifies "sequences on" either automatically or explicitly) -> OK 3) publication with sequences, subscriber explicitly disabled them by specifying "sequences off" in startup -> OK 4) publication with sequences, subscriber without sequence support (e.g. older Postgres release) -> PROBLEM (?) The reason I think (4) may be a problem is that, in my opinion, we shouldn't silently drop stuff that is meant to be part of the publication. That is, if someone creates a publication and adds a sequence to it, they want to replicate the sequence. But with the current behavior, an old subscriber connects without specifying 'sequences on', so the publisher disables sequences and then simply ignores sequence increments during decoding. I think we might want to detect this and error out instead of just skipping the change, but that needs to happen later, only when the publication actually has any sequences ... I don't want to over-think / over-engineer this, though, so I wonder what your opinions on this are. There are a couple of XXX comments in the code, mostly about stuff I left out when copying the two-phase stuff. For example, we store two-phase stuff in the replication slot itself - I don't think we need to do that for sequences, though. Another question is what to do about ALTER SUBSCRIPTION - at the moment it's not possible to change the "sequences" option, but maybe we should allow that? But then we'd need to re-sync all the sequences, somehow ... Aside from that, I've also added 0005, which does the sync interlock in a slightly different way - instead of a custom function for locking a sequence, it allows LOCK on sequences. Peter Eisentraut suggested doing it like this - it's simpler, and I can't see what issues it might cause. The patch should update the LOCK documentation, I haven't done that yet. Ultimately it should all be merged into 0003, of course. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230402.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230402.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230402.patch
- 0004-add-interlock-with-ALTER-SEQUENCE-20230402.patch
- 0005-Support-LOCK-for-sequences-instead-of-funct-20230402.patch
- 0006-Reconstruct-the-right-state-from-the-on-dis-20230402.patch
- 0007-protocol-changes-20230402.patch
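For reference, with the 0007 patch the start of replication would look roughly like this on the wire (a sketch - the 'sequences' option name and the bumped version number follow the thread, the slot and publication names are placeholders):

    START_REPLICATION SLOT "sub" LOGICAL 0/0
        (proto_version '5', sequences 'on', publication_names '"pub"')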
Fwiw the cfbot seems to have some failing tests with this patch: [19:05:11.398] # Failed test 'initial test data replicated' [19:05:11.398] # at t/030_sequences.pl line 75. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '132|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'advance sequence in rolled-back transaction' [19:05:11.398] # at t/030_sequences.pl line 98. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '231|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'create sequence, advance it in rolled-back transaction, but commit the create' [19:05:11.398] # at t/030_sequences.pl line 152. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '132|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'advance the new sequence in a transaction and roll it back' [19:05:11.398] # at t/030_sequences.pl line 175. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '231|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'advance sequence in a subtransaction' [19:05:11.398] # at t/030_sequences.pl line 198. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '330|0|t' [19:05:11.398] # Looks like you failed 5 tests of 6. -- Gregory Stark As Commitfest Manager
Patch 0002 is very annoying to scroll, and I realized that it's because psql is writing 200kB of dashes in one of the test_decoding test cases. I propose to set psql's printing format to 'unaligned' to avoid that, which should cut the size of that patch to a tenth. I wonder if there's a similar issue in 0003, but I didn't check. It's annoying that git doesn't seem to have a way of reporting the length of the longest lines. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
Attachment
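The psql setting in question - presumably what the attached patch adds to the test_decoding scripts:

    \pset format unaligned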
On 4/5/23 12:39, Alvaro Herrera wrote: > Patch 0002 is very annoying to scroll, and I realized that it's because > psql is writing 200kB of dashes in one of the test_decoding test cases. > I propose to set psql's printing format to 'unaligned' to avoid that, > which should cut the size of that patch to a tenth. > Yeah, that's a good idea, I think. It shrunk the diff to ~90kB, which is much better. > I wonder if there's a similar issue in 0003, but I didn't check. > I don't think so - there are just enough code changes to generate a ~260kB diff with all the context. As for the cfbot failures reported by Greg, those turned out to be a minor thinko in the protocol version negotiation, introduced by part 0008 (current part, after adding Alvaro's patch tweaking test output). The subscriber failed to send 'sequences on' when starting the stream. It also forgot to refresh the subscription after a sequence was added. The attached patch version fixes all of this, but I think at this point it's better to just postpone this for PG17 - if it was something we could fix within a single release, maybe. But the replication protocol is something we can't easily change after release, so if we find out the versioning (and sequence negotiation) should work differently, we can't change it. In fact, we'd probably be stuck with it until PG16 gets out of support, not just until PG17 ... I've thought about pushing at least the first two parts (adding the sequence decoding infrastructure and test_decoding support), but I'm not sure that's quite worth it without the built-in replication stuff. Or we could push it and then tweak it after feature freeze, if we conclude the protocol versioning should work differently. I recall we made changes to the column and row filtering in PG15. But that seems quite wrong, obviously. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230405.patch
- 0002-make-test_decoding-ddl.out-shorter-20230405.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230405.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230405.patch
- 0005-add-interlock-with-ALTER-SEQUENCE-20230405.patch
- 0006-Support-LOCK-for-sequences-instead-of-funct-20230405.patch
- 0007-Reconstruct-the-right-state-from-the-on-dis-20230405.patch
- 0008-protocol-changes-20230405.patch
On 02.04.23 19:46, Tomas Vondra wrote: > OK, I took a stab at this, see the attached 0007 patch which bumps the > protocol version, and allows the subscriber to specify "sequences" when > starting the replication, similar to what we do for the two-phase stuff. > > The patch essentially adds 'sequences' to the replication start command, > depending on the server version, but it can be overridden by "sequences" > subscription option. The patch is pretty small, but I wonder how much > smarter this should be ... I think this should actually be much simpler. All the code needs to do is: - Raise protocol version (4->5) (Your patch does that.) - pgoutput_sequence() checks whether the protocol version is >=5 and if not it raises an error. - Subscriber uses old protocol if the remote end is an older PG version. (Your patch does that.) I don't see the need for the subscriber to toggle sequences explicitly or anything like that.
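Under that scheme the wire negotiation stays minimal: a sequence-aware subscriber simply requests the newer protocol version, with no separate toggle (a sketch with placeholder slot and publication names):

    START_REPLICATION SLOT "sub" LOGICAL 0/0
        (proto_version '5', publication_names '"pub"')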
Hi, Sorry for jumping late in this thread. I started experimenting with the functionality. Some of this may already have been discussed earlier. Given that the thread has been going for so long and has gone through several changes, revalidating the functionality is useful. I considered the following aspects: Changes to the sequence on subscriber ----------------------------------------------------- 1. Since this is logical decoding, the logical replica is writable. So the logically replicated sequence can be manipulated on the subscriber as well. This implementation consolidates the changes on subscriber and publisher rather than replicating the publisher state as is. That's good. See the example command sequence below: a. publisher calls nextval() - this sets the sequence state on publisher as (1, 32, t) which is replicated to the subscriber. b. subscriber calls nextval() once - this sets the sequence state on subscriber as (34, 32, t) c. subscriber calls nextval() 32 times - on-disk state of sequence doesn't change on subscriber d. subscriber calls nextval() 33 times - this sets the sequence state on subscriber as (99, 0, t) e. publisher calls nextval() 32 times - this sets the sequence state on publisher as (33, 0, t) The on-disk state on the publisher at the end of e. is replicated to the subscriber, but the subscriber doesn't apply it. The state there is still (99, 0, t). I think this is closer to how logical replication of sequences should look. This is also good enough as long as we expect the replication of sequences to be used for failover and switchover. But it might not help if we want to consolidate the INSERTs that use nextval(). If we were to treat sequences as accumulating the increments, we might be able to resolve the conflicts by adjusting the column values considering the increments made on the subscriber. IIUC, conflict resolution is not part of built-in logical replication. So we may not want to go this route. But worth considering. Implementation agnostic decoded change -------------------------------------------------------- The current method of decoding and replicating the sequences is tied to the implementation - it replicates the sequence row as is. If the implementation changes in the future, we might need to revise the decoded presentation of sequences. I think only nextval() matters for a sequence. So as long as we are replicating enough information to calculate the nextval we should be good. The current implementation does that by replicating the log_value and is_called. is_called can be consolidated into log_value itself. The implemented protocol thus requires two extra values to be replicated. Those can be ignored right now. But they might pose a problem in the future, if some downstream starts using them. We will be forced to provide fake but sane values even if a future upstream implementation does not produce those values. Of course, we can't predict the future implementation enough to decide what an implementation-independent format would be. E.g. if pluggable storage were used to implement sequences, or if we got around to implementing distributed sequences, their shape can't be predicted right now. So a change in protocol seems to be unavoidable whatever we do. But starting with the bare minimum might save us from larger troubles. I think it's better to just replicate the nextval() and craft the representation on the subscriber so that it produces that nextval(). 3. Primary key sequences ----------------------------------- I have not experimented with this.
But I think we will need to add the sequences associated with primary keys to the publications publishing the owner tables. Otherwise, we will have problems with failover. And it needs to be done automatically, since a. the names of these sequences are generated automatically, and b. publications with FOR ALL TABLES will add tables automatically and start replicating the changes. Users may not be able to intercept the replication activity to make sure the associated sequences are also added to the publication. -- Best Wishes, Ashutosh Bapat
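As an illustration of the kind of implicitly-created sequence this is about (table and column names are placeholders, the sequence name is the default one PostgreSQL generates):

    CREATE TABLE t (id serial PRIMARY KEY, payload text);
    SELECT pg_get_serial_sequence('t', 'id');  -- returns public.t_id_seq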
Patch set needs a rebase, PFA rebased patch-set. The conflict was in commit "Add decoding of sequences to built-in replication", in files tablesync.c and 002_pg_dump.pl.

On Thu, May 18, 2023 at 7:53 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> Hi, Sorry for jumping late in this thread.
>
> I started experimenting with the functionality. Maybe something that was already discussed earlier. Given that the thread has been discussed for so long and has gone through several changes, revalidating the functionality is useful.
>
> I considered the following aspects:
>
> Changes to the sequence on subscriber
> -----------------------------------------------------
> 1. Since this is logical decoding, a logical replica is writable. So the logically replicated sequence can be manipulated on the subscriber as well. This implementation consolidates the changes on subscriber and publisher rather than replicating the publisher state as is. That's good. See example command sequence below:
> a. publisher calls nextval() - this sets the sequence state on publisher as (1, 32, t), which is replicated to the subscriber.
> b. subscriber calls nextval() once - this sets the sequence state on subscriber as (34, 32, t)
> c. subscriber calls nextval() 32 times - on-disk state of sequence doesn't change on subscriber
> d. subscriber calls nextval() 33 times - this sets the sequence state on subscriber as (99, 0, t)
> e. publisher calls nextval() 32 times - this sets the sequence state on publisher as (33, 0, t)
>
> The on-disk state on publisher at the end of e. is replicated to the subscriber, but the subscriber doesn't apply it. The state there is still (99, 0, t). I think this is closer to how logical replication of a sequence should look like. This is also good enough as long as we expect the replication of sequences to be used for failover and switchover.
>
> But it might not help if we want to consolidate the INSERTs that use nextval(). If we were to treat sequences as accumulating the increments, we might be able to resolve the conflicts by adjusting the column values considering the increments made on the subscriber. IIUC, conflict resolution is not part of built-in logical replication. So we may not want to go this route. But worth considering.
>
> Implementation agnostic decoded change
> --------------------------------------------------------
> The current method of decoding and replicating the sequences is tied to the implementation - it replicates the sequence row as is. If the implementation changes in future, we might need to revise the decoded presentation of the sequence. I think only nextval() matters for a sequence. So as long as we are replicating enough information to calculate the nextval, we should be good. The current implementation does that by replicating the last_value and is_called. is_called can be consolidated into last_value itself. The implemented protocol thus requires two extra values to be replicated. Those can be ignored right now. But they might pose a problem in future, if some downstream starts using them. We will be forced to provide fake but sane values even if a future upstream implementation does not produce those values. Of course we can't predict the future implementation enough to decide what would be an implementation-independent format. E.g. if a pluggable storage were to be used to implement sequences, or if we come around to implementing distributed sequences, their shape can't be predicted right now. So a change in protocol seems to be unavoidable whatever we do. But starting with the bare minimum might save us from larger troubles. I think it's better to just replicate the nextval() and craft the representation on the subscriber so that it produces that nextval().
>
> 3. Primary key sequences
> -----------------------------------
> I have not experimented with this. But I think we will need to add the sequences associated with the primary keys to the publications publishing the owner tables. Otherwise, we will have problems with the failover. And it needs to be done automatically, since a. the names of these sequences are generated automatically, b. publications with FOR ALL TABLES will add tables automatically and start replicating the changes. Users may not be able to intercept the replication activity to make sure the associated sequences are also added to the publication.
>
> -- Best Wishes, Ashutosh Bapat

-- Best Wishes, Ashutosh Bapat
Attachment
- 0001-Logical-decoding-of-sequences-20230613.patch
- 0002-make-test_decoding-ddl.out-shorter-20230613.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230613.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230613.patch
- 0005-add-interlock-with-ALTER-SEQUENCE-20230613.patch
- 0006-Support-LOCK-for-sequences-instead-of-funct-20230613.patch
- 0007-Reconstruct-the-right-state-from-the-on-dis-20230613.patch
- 0008-protocol-changes-20230613.patch
On 5/18/23 16:23, Ashutosh Bapat wrote:
> Hi, Sorry for jumping late in this thread.
>
> I started experimenting with the functionality. Maybe something that was already discussed earlier. Given that the thread has been discussed for so long and has gone through several changes, revalidating the functionality is useful.
>
> I considered the following aspects:
>
> Changes to the sequence on subscriber
> -----------------------------------------------------
> 1. Since this is logical decoding, a logical replica is writable. So the logically replicated sequence can be manipulated on the subscriber as well. This implementation consolidates the changes on subscriber and publisher rather than replicating the publisher state as is. That's good. See example command sequence below:
> a. publisher calls nextval() - this sets the sequence state on publisher as (1, 32, t), which is replicated to the subscriber.
> b. subscriber calls nextval() once - this sets the sequence state on subscriber as (34, 32, t)
> c. subscriber calls nextval() 32 times - on-disk state of sequence doesn't change on subscriber
> d. subscriber calls nextval() 33 times - this sets the sequence state on subscriber as (99, 0, t)
> e. publisher calls nextval() 32 times - this sets the sequence state on publisher as (33, 0, t)
>
> The on-disk state on publisher at the end of e. is replicated to the subscriber, but the subscriber doesn't apply it. The state there is still (99, 0, t). I think this is closer to how logical replication of a sequence should look like. This is also good enough as long as we expect the replication of sequences to be used for failover and switchover.

I'm really confused - are you describing what the patch is doing, or what you think it should be doing? Because right now there's nothing that'd "consolidate" the changes (in the sense of reconciling write conflicts), and there's absolutely no way to do that.

So if the subscriber advances the sequence (which it technically can), the subscriber state will eventually be discarded and overwritten when the next increment gets decoded from WAL on the publisher.

There's no way to fix this with this type of sequence - it requires some sort of global consensus (consensus on range assignment, locking or whatever), which we don't have.

If the sequence is the only thing replicated, this may go unnoticed. But chances are the user is also replicating the table with a PK populated by the sequence, at which point it'll lead to a constraint violation.

> But it might not help if we want to consolidate the INSERTs that use nextval(). If we were to treat sequences as accumulating the increments, we might be able to resolve the conflicts by adjusting the column values considering the increments made on the subscriber. IIUC, conflict resolution is not part of built-in logical replication. So we may not want to go this route. But worth considering.

We can't just adjust values in columns that may be used externally.

> Implementation agnostic decoded change
> --------------------------------------------------------
> The current method of decoding and replicating the sequences is tied to the implementation - it replicates the sequence row as is. If the implementation changes in future, we might need to revise the decoded presentation of the sequence. I think only nextval() matters for a sequence. So as long as we are replicating enough information to calculate the nextval, we should be good. The current implementation does that by replicating the last_value and is_called. is_called can be consolidated into last_value itself. The implemented protocol thus requires two extra values to be replicated. Those can be ignored right now. But they might pose a problem in future, if some downstream starts using them. We will be forced to provide fake but sane values even if a future upstream implementation does not produce those values. Of course we can't predict the future implementation enough to decide what would be an implementation-independent format. E.g. if a pluggable storage were to be used to implement sequences, or if we come around to implementing distributed sequences, their shape can't be predicted right now. So a change in protocol seems to be unavoidable whatever we do. But starting with the bare minimum might save us from larger troubles. I think it's better to just replicate the nextval() and craft the representation on the subscriber so that it produces that nextval().

Yes, I agree with this. It's probably better to replicate just the next value, without the log_cnt / is_called fields (which are implementation specific).

> 3. Primary key sequences
> -----------------------------------
> I have not experimented with this. But I think we will need to add the sequences associated with the primary keys to the publications publishing the owner tables. Otherwise, we will have problems with the failover. And it needs to be done automatically, since a. the names of these sequences are generated automatically, b. publications with FOR ALL TABLES will add tables automatically and start replicating the changes. Users may not be able to intercept the replication activity to make sure the associated sequences are also added to the publication.

Right, this idea was mentioned before, and I agree maybe we should consider adding some of those "automatic" sequences automatically.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
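To make the "just the next value" idea concrete, here is a minimal sketch (an illustration based on this discussion, not code from the patch) of how the on-disk (last_value, log_cnt, is_called) triple could be collapsed into a single replicated value, roughly (last_value + log_cnt):

    /*
     * Sketch only: collapse the sequence state into one replicated value.
     * If is_called is false, last_value itself has not been handed out yet;
     * otherwise values up to last_value + log_cnt may have been handed out
     * by the publisher without further WAL records, so the replicated value
     * must cover all of them.
     */
    static int64
    sequence_next_safe_value(int64 last_value, int64 log_cnt, bool is_called)
    {
        if (!is_called)
            return last_value;
        return last_value + log_cnt;
    }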
On Tue, Jun 13, 2023 at 11:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 5/18/23 16:23, Ashutosh Bapat wrote:
> > [...]
>
> I'm really confused - are you describing what the patch is doing, or what you think it should be doing? Because right now there's nothing that'd "consolidate" the changes (in the sense of reconciling write conflicts), and there's absolutely no way to do that.
>
> So if the subscriber advances the sequence (which it technically can), the subscriber state will eventually be discarded and overwritten when the next increment gets decoded from WAL on the publisher.

I described what I observed in my experiments. My observation doesn't agree with your description. I will revisit this when I review the output plugin changes and the WAL receiver changes.

> Yes, I agree with this. It's probably better to replicate just the next value, without the log_cnt / is_called fields (which are implementation specific).

Ok. I will review the logic once you revise the patches.

> > 3. Primary key sequences
> > [...]
>
> Right, this idea was mentioned before, and I agree maybe we should consider adding some of those "automatic" sequences automatically.
Are you planning to add this in the same patch set or separately?

I reviewed 0001 and related parts of 0004 and 0008 in detail.

I have only one major change request, about

typedef struct xl_seq_rec
{
    RelFileLocator locator;
+   bool created; /* creates a new relfilenode (CREATE/ALTER) */

I am not sure what the repercussions of adding a member to an existing WAL record are. I didn't see any code which handles the old WAL format which doesn't contain the "created" flag. IIUC, the logical decoding may come across a WAL record written in the old format after upgrade and restart. Is that not possible?

But I don't think it's necessary. We can add a decoding routine for RM_SMGR_ID. The decoding routine will add the relfilelocator in the XLOG_SMGR_CREATE record to the txn->sequences hash. The rest of the logic will work as is. Of course we will add non-sequence relfilelocators as well, but that should be fine. Creating a new relfilelocator shouldn't be a frequent operation. If at all we are worried about that, we can add only the relfilenodes associated with sequences to the hash table.

If this idea has been discussed earlier, please point me to the relevant discussion.

Some other minor comments and nitpicks.

  <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
  <function>stream_commit_cb</function>, and <function>stream_change_cb</function>
- are required, while <function>stream_message_cb</function> and
+ are required, while <function>stream_message_cb</function>,
+ <function>stream_sequence_cb</function> and

Like the non-streaming counterpart, should we also mention what happens if those callbacks are not defined? That applies to stream_message_cb and stream_truncate_cb too.

+ /*
+  * Make sure the subtransaction has a XID assigned, so that the sequence
+  * increment WAL record is properly associated with it. This matters for
+  * increments of sequences created/altered in the transaction, which are
+  * handled as transactional.
+  */
+ if (XLogLogicalInfoActive())
+     GetCurrentTransactionId();

GetCurrentTransactionId() will also assign xids to all the parents, so it doesn't seem necessary to call both GetTopTransactionId() and GetCurrentTransactionId(). Calling only the latter should suffice. Applies to all the calls to GetCurrentTransactionId().

+ memcpy(((char *) tuple->tuple.t_data),
+        data + sizeof(xl_seq_rec),
+        SizeofHeapTupleHeader);
+
+ memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
+        data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
+        datalen);

The memory chunks being copied in these memcpy calls are contiguous. Why don't we use a single memcpy? For readability?

+ * If we don't have snapshot or we are just fast-forwarding, there is no
+ * point in decoding messages.

s/decoding messages/decoding sequence changes/

+ tupledata = XLogRecGetData(r);
+ datalen = XLogRecGetDataLen(r);
+ tuplelen = datalen - SizeOfHeapHeader - sizeof(xl_seq_rec);
+
+ /* extract the WAL record, with "created" flag */
+ xlrec = (xl_seq_rec *) XLogRecGetData(r);

I think we should set tupledata = xlrec + sizeof(xl_seq_rec) so that it points to the actual tuple data. This will also simplify the calculations in DecodeSeqTuple().

+/* entry for hash table we use to track sequences created in running xacts */

s/running/transaction being decoded/ ?

+ /* search the lookup table (we ignore the return value, found is enough) */
+ ent = hash_search(rb->sequences,
+                   (void *) &rlocator,
+                   created ? HASH_ENTER : HASH_FIND,
+                   &found);

Misleading comment. We seem to be using the return value later.

+ /*
+  * When creating the sequence, remember the XID of the transaction
+  * that created id.
+  */
+ if (created)
+     ent->xid = xid;

Should we set ent->locator as well? The sequence won't get cleaned up otherwise.

+ TeardownHistoricSnapshot(false);
+
+ AbortCurrentTransaction();

This call to AbortCurrentTransaction() in PG_TRY should be made only if this block started the transaction?

+ PG_CATCH();
+ {
+     TeardownHistoricSnapshot(true);
+
+     AbortCurrentTransaction();

Shouldn't we do this only if this block started the transaction? And in that case, wouldn't PG_RE_THROW take care of it?

+/*
+ * Helper function for ReorderBufferProcessTXN for applying sequences.
+ */
+static inline void
+ReorderBufferApplySequence(ReorderBuffer *rb, ReorderBufferTXN *txn,
+                           Relation relation, ReorderBufferChange *change,
+                           bool streaming)

Possibly we should find a way to call this function from ReorderBufferQueueSequence() when processing a non-transactional sequence change. It should probably absorb logic common to both the cases.

+ if (RelationIsLogicallyLogged(relation))
+     ReorderBufferApplySequence(rb, txn, relation, change, streaming);

This condition is not used in ReorderBufferQueueSequence() when processing a non-transactional change there. Why?

+ if (len)
+ {
+     memcpy(data, &tup->tuple, sizeof(HeapTupleData));
+     data += sizeof(HeapTupleData);
+
+     memcpy(data, tup->tuple.t_data, len);
+     data += len;
+ }

We are just copying the sequence data. Shouldn't we copy the file locator as well, or is that not needed once the change has been queued? Similarly for ReorderBufferChangeSize().

+ /*
+  * relfilenode => XID lookup table for sequences created in a transaction
+  * (also includes altered sequences, which assigns new relfilenode)
+  */
+ HTAB *sequences;

Better renamed as seq_rel_locator or some such. Shouldn't this be part of ReorderBufferTXN, which has similar transaction-specific hashes?

I will continue reviewing the remaining patches.

-- Best Wishes, Ashutosh Bapat
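For illustration, the single-memcpy form suggested in the review might look like this (a sketch assuming the layout in the quoted hunk, where header and data are contiguous in both the source and the destination):

    memcpy((char *) tuple->tuple.t_data,
           data + sizeof(xl_seq_rec),
           SizeofHeapTupleHeader + datalen);    /* one copy instead of two */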
Regarding the patchsets, I think we will need to rearrange the commits. Right now 0004 has some parts that should have been in 0001. Also, the logic to assign an XID to a subtransaction would be better as a separate commit. That piece is independent of logical decoding of sequences.

On Fri, Jun 23, 2023 at 6:48 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> [...]

-- Best Wishes, Ashutosh Bapat
This is a review of the 0003 patch. Overall the patch looks good and helps understand the decoding logic better.

+ data
+----------------------------------------------------------------------------------------
+ BEGIN
+ sequence public.test_sequence: transactional:1 last_value: 1 log_cnt: 0 is_called:0
+ COMMIT

Looking at this output, I am wondering how this patch would work with DDL replication. I should have noticed this earlier, sorry. A sequence DDL has two parts, changes to the catalogs and changes to the data file. Support for replicating the data file changes is added by these patches. The catalog changes will need to be supported by the DDL replication patch. When applying the DDL changes, there are two ways: 1. just apply the catalog changes and let the support added here apply the data changes, 2. apply both the changes. If the second route is chosen, all the "transactional" decoding and application support added by this patch will need to be ripped out. That will make the "transactional" field in the protocol useless. It has the potential to waste bandwidth in future.

OTOH, I feel that waiting for the DDL replication patch set to be committed will cause this patchset to be delayed for an unknown duration. That's undesirable too.

One solution I see is to use the Storage RMID WAL again. While decoding it we send a message to the subscriber telling it that a new relfilenode is being allocated to a sequence. The subscriber too then allocates a new relfilenode to the sequence. The sequence data changes are decoded without the "transactional" flag; but they are decoded as transactional or non-transactional using the same logic as the current patch-set. The subscriber will always apply these changes to the relfilenode associated with the sequence at that point in time. This would have the same effect as the current patch-set. But then there is potential that the DDL replication patchset will render the Storage decoding useless. So not an option. But anyway, I will leave this as a comment as an alternative thought and discarded. Also this might trigger a better idea.

What do you think?

+-- savepoint test on table with serial column
+BEGIN;
+CREATE TABLE test_table (a SERIAL, b INT);
+INSERT INTO test_table (b) VALUES (100);
+INSERT INTO test_table (b) VALUES (200);
+SAVEPOINT a;
+INSERT INTO test_table (b) VALUES (300);
+ROLLBACK TO SAVEPOINT a;

The third implicit nextval won't be logged, so whether the subtransaction is rolled back or committed, it won't have much effect on the decoding. Adding a subtransaction around the first INSERT itself might be useful to test that the subtransaction rollback does not roll back the sequence changes.

After adding {'include_sequences', false} to the calls to pg_logical_slot_get_changes() in other tests, the SQL statement has grown beyond 80 characters. Need to split it into multiple lines.

  }
+ else if (strcmp(elem->defname, "include-sequences") == 0)
+ {
+
+     if (elem->arg == NULL)
+         data->include_sequences = false;

By default include_sequences = true. Shouldn't it then be set to true here?

After looking at the option processing code in pg_logical_slot_get_changes_guts(), it looks like an argument can never be NULL. But I see we have checks for NULL values of other arguments, so it's ok to keep a NULL check here.

I will look at 0004 next.

-- Best Wishes, Ashutosh Bapat
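For reference, the other boolean options in test_decoding's pg_decode_startup() treat a bare option (elem->arg == NULL) as true; a sketch of that convention applied to this branch (illustrative, not the patch's actual code):

    else if (strcmp(elem->defname, "include-sequences") == 0)
    {
        if (elem->arg == NULL)
            data->include_sequences = true;    /* bare option means "on" */
        else if (!parse_bool(strVal(elem->arg), &data->include_sequences))
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("could not parse value \"%s\" for parameter \"%s\"",
                            strVal(elem->arg), elem->defname)));
    }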
On 6/26/23 15:18, Ashutosh Bapat wrote:
> This is a review of the 0003 patch. Overall the patch looks good and helps understand the decoding logic better.
>
> + data
> +----------------------------------------------------------------------------------------
> + BEGIN
> + sequence public.test_sequence: transactional:1 last_value: 1 log_cnt: 0 is_called:0
> + COMMIT
>
> Looking at this output, I am wondering how this patch would work with DDL replication. I should have noticed this earlier, sorry. A sequence DDL has two parts, changes to the catalogs and changes to the data file. Support for replicating the data file changes is added by these patches. The catalog changes will need to be supported by the DDL replication patch. When applying the DDL changes, there are two ways: 1. just apply the catalog changes and let the support added here apply the data changes, 2. apply both the changes. If the second route is chosen, all the "transactional" decoding and application support added by this patch will need to be ripped out. That will make the "transactional" field in the protocol useless. It has the potential to waste bandwidth in future.

I don't understand why it would need to be ripped out. Why would it make the transactional behavior useless? Can you explain?

IMHO we replicate either changes (and then DDL replication does not interfere with that), or DDL (and then this patch should not interfere).

> OTOH, I feel that waiting for the DDL replication patch set to be committed will cause this patchset to be delayed for an unknown duration. That's undesirable too.
>
> One solution I see is to use the Storage RMID WAL again. While decoding it we send a message to the subscriber telling it that a new relfilenode is being allocated to a sequence. The subscriber too then allocates a new relfilenode to the sequence. The sequence data changes are decoded without the "transactional" flag; but they are decoded as transactional or non-transactional using the same logic as the current patch-set. The subscriber will always apply these changes to the relfilenode associated with the sequence at that point in time. This would have the same effect as the current patch-set. But then there is potential that the DDL replication patchset will render the Storage decoding useless. So not an option. But anyway, I will leave this as a comment as an alternative thought and discarded. Also this might trigger a better idea.
>
> What do you think?

I don't understand what the problem with DDL is, so I can't judge how this is supposed to solve it.

> +-- savepoint test on table with serial column
> +BEGIN;
> +CREATE TABLE test_table (a SERIAL, b INT);
> +INSERT INTO test_table (b) VALUES (100);
> +INSERT INTO test_table (b) VALUES (200);
> +SAVEPOINT a;
> +INSERT INTO test_table (b) VALUES (300);
> +ROLLBACK TO SAVEPOINT a;
>
> The third implicit nextval won't be logged, so whether the subtransaction is rolled back or committed, it won't have much effect on the decoding. Adding a subtransaction around the first INSERT itself might be useful to test that the subtransaction rollback does not roll back the sequence changes.
>
> After adding {'include_sequences', false} to the calls to pg_logical_slot_get_changes() in other tests, the SQL statement has grown beyond 80 characters. Need to split it into multiple lines.

>   }
> + else if (strcmp(elem->defname, "include-sequences") == 0)
> + {
> +
> +     if (elem->arg == NULL)
> +         data->include_sequences = false;
>
> By default include_sequences = true. Shouldn't it then be set to true here?

I don't follow. Is this still related to the DDL replication, or are you describing some new issue with savepoints?

> After looking at the option processing code in pg_logical_slot_get_changes_guts(), it looks like an argument can never be NULL. But I see we have checks for NULL values of other arguments, so it's ok to keep a NULL check here.
>
> I will look at 0004 next.

OK

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 26, 2023 at 8:35 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 6/26/23 15:18, Ashutosh Bapat wrote:
> > [...]
>
> I don't understand why it would need to be ripped out. Why would it make the transactional behavior useless? Can you explain?
>
> IMHO we replicate either changes (and then DDL replication does not interfere with that), or DDL (and then this patch should not interfere).
>
> > [...]
>
> I don't understand what the problem with DDL is, so I can't judge how this is supposed to solve it.

I have not looked at the DDL replication patch in detail, so I may be missing something. IIUC, that patch replicates the DDL statement in some form: parse tree or statement. But it doesn't replicate some or all of the WAL records that the DDL execution generates.

Consider DDL "ALTER SEQUENCE test_sequence RESTART WITH 4000;". It updates the catalogs with a new relfilenode and also the START VALUE. It also writes to the new relfilenode. When the publisher replicates the DDL and the subscriber applies it, it will do the same - update the catalogs and write to the new relfilenode. We don't want the sequence data to be replicated again when it's changed by a DDL. All the transactional changes are associated with a DDL. Other changes to the sequence data are non-transactional. So when replicating the sequence data changes, the "transactional" field becomes useless.

What I am pointing to is: if we add a "transactional" field in the protocol today and in future DDL replication is implemented in a way that makes the "transactional" field redundant, we have introduced a redundant field which will eat a byte on the wire. Of course we can remove it by bumping the protocol version, but that's some work. Please note we will still need the code to determine whether a change in sequence data is transactional or not, IOW whether it's associated with a DDL or not. So that code remains.

> > [...]
>
> I don't follow. Is this still related to the DDL replication, or are you describing some new issue with savepoints?

Not related to DDL replication. Not an issue with savepoints either. Just a comment about that particular change. Sorry for not being clear.

-- Best Wishes, Ashutosh Bapat
On Mon, Jun 26, 2023 at 8:35 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 6/26/23 15:18, Ashutosh Bapat wrote:
> > I will look at 0004 next.
>
> OK

0004 is quite large. I think if we split this into two or even three - 1. publication and subscription catalog handling, 2. built-in replication protocol changes - it might be easier to review. But anyway, I have given it one read. I have reviewed the parts which deal with the replication proper in detail. I have *not* thoroughly reviewed the parts which deal with the catalogs, pg_dump, describe and tab completion. Similarly tests. If those parts need a thorough review, please let me know.

But before jumping into the comments, a weird scenario I tried. On the publisher I created a table t1(a int, b int) and a sequence s, and added both to a publication. On the subscriber I swapped their names, i.e. created a table s(a int, b int) and a sequence t1, and subscribed to the publication. The subscription was created, and during replication it threw the errors "logical replication target relation "public.t1" is missing replicated columns: "a", "b"" and "logical replication target relation "public.s" is missing replicated columns: "last_value", "log_cnt", "is_called"". I think it's good that it at least threw an error. But it would be good if it detected that the reltypes themselves are different and mentioned that in the error. Something like "logical replication target public.s is not a sequence, unlike source public.s".

Comments on the patch itself.

I didn't find any mention of 'sequence' in the documentation of the publish option in CREATE or ALTER PUBLICATION. Something missing in the documentation? But do we really need to record "sequence" as an operation? Just adding the sequences to the publication should be fine, right? There's only one operation on sequences: updating the sequence row.

+CREATE VIEW pg_publication_sequences AS
+    SELECT
+        P.pubname AS pubname,
+        N.nspname AS schemaname,
+        C.relname AS sequencename

If we report oid or regclass for sequences, it might be easier to join the view further. We don't have reg* for publication, so we report both oid and name of the publication.

+/*
+ * Update the sequence state by modifying the existing sequence data row.
+ *
+ * This keeps the same relfilenode, so the behavior is non-transactional.
+ */
+static void
+SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called)

This function has some code similar to nextval, but with the sequence of operations (viz. changes to buffer, WAL insert and cache update) changed. Given the comments in nextval_internal(), the difference in the sequence of operations should not make a difference in the end result. But I think it will be good to deduplicate the code to avoid confusion and also for ease of maintenance.

+/*
+ * Update the sequence state by creating a new relfilenode.
+ *
+ * This creates a new relfilenode, to allow transactional behavior.
+ */
+static void
+SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called)

Need some deduplication here as well. But the similarities with AlterSequence, ResetSequence or DefineSequence are less.

@@ -730,9 +731,9 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 {
     /*
-     * Get the table list from publisher and build local table status
-     * info.
+     * Get the table and sequence list from publisher and build
+     * local relation sync status info.
      */
-    tables = fetch_table_list(wrconn, publications);
-    foreach(lc, tables)
+    relations = fetch_table_list(wrconn, publications);

Is it allowed to connect a newer subscriber to an old publisher? If yes, the query to fetch sequences will throw an error since it won't find the catalog.

@@ -882,8 +886,10 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
-    /* Get the table list from publisher. */
+    /* Get the list of relations from publisher. */
     pubrel_names = fetch_table_list(wrconn, sub->publications);
+    pubrel_names = list_concat(pubrel_names,
+                               fetch_sequence_list(wrconn, sub->publications));

Similarly here.

+void
+logicalrep_write_sequence(StringInfo out, Relation rel, TransactionId xid,
... snip ...
+    pq_sendint8(out, flags);
+    pq_sendint64(out, lsn);
... snip ...
+LogicalRepRelId
+logicalrep_read_sequence(StringInfo in, LogicalRepSequence *seqdata)
+{
... snip ...
+    /* XXX skipping flags and lsn */
+    pq_getmsgint(in, 1);
+    pq_getmsgint64(in);

We are ignoring these two fields on the WAL receiver side. I don't see such fields being part of INSERT, UPDATE or DELETE messages. Should we just drop those, or do they have some future use? Two LSNs are written by OutputPrepareWrite() as a prologue to the logical message. If this LSN is one of them, it could be dropped anyway.

+static void
+fetch_sequence_data(char *nspname, char *relname,
... snip ...
+    appendStringInfo(&cmd, "SELECT last_value, log_cnt, is_called\n"
+                     "  FROM %s", quote_qualified_identifier(nspname, relname));

We are using an undocumented interface here. SELECT ... FROM <sequence> is not documented. This code will break if we change the way a sequence is stored. That is quite unlikely but not impossible. Ideally we should use one of the methods documented at [1]. But none of them provide us what is needed per your comment in copy_sequence(), i.e. the state of the sequence as of the last WAL record on that sequence. So I don't have any better ideas than what's done in the patch. Maybe we can use "nextval() + 32" as an approximation.

Some minor comments and nitpicks:

@@ -1958,12 +1958,14 @@ get_object_address_publication_schema(List *object, bool missing_ok)

Need an update to the function prologue with the description of the third element. Also, the error message at the end of the function needs to mention the object type.

-    appendStringInfo(&buffer, _("publication of schema %s in publication %s"),
-                     nspname, pubname);
+    appendStringInfo(&buffer, _("publication of schema %s in publication %s type %s"),
+                     nspname, pubname, objtype);

s/type/for object type/ ?

@@ -5826,18 +5842,24 @@ getObjectIdentityParts(const ObjectAddress *object,
     break;
-    appendStringInfo(&buffer, "%s in publication %s",
-                     nspname, pubname);
+    appendStringInfo(&buffer, "%s in publication %s type %s",
+                     nspname, pubname, objtype);

s/type/object type/? ... in some other places as well?

+/*
+ * Check the character is a valid object type for schema publication.
+ *
+ * This recognizes either 't' for tables or 's' for sequences. Places that
+ * need to handle 'u' for unsupported relkinds need to do that explicitlyl

s/explicitlyl/explicitly/

+Datum
+pg_get_publication_sequences(PG_FUNCTION_ARGS)
+{
... snip ...
+    /*
+     * Publications support partitioned tables, although all changes are
+     * replicated using leaf partition identity and schema, so we only
+     * need those.
+     */

Not relevant here.

+    if (publication->allsequences)
+        sequences = GetAllSequencesPublicationRelations();
+    else
+    {
+        List *relids,
+             *schemarelids;
+
+        relids = GetPublicationRelations(publication->oid,
+                                         PUB_OBJTYPE_SEQUENCE,
+                                         publication->pubviaroot ?
+                                         PUBLICATION_PART_ROOT :
+                                         PUBLICATION_PART_LEAF);
+        schemarelids = GetAllSchemaPublicationRelations(publication->oid,
+                                                        PUB_OBJTYPE_SEQUENCE,
+                                                        publication->pubviaroot ?
+                                                        PUBLICATION_PART_ROOT :
+                                                        PUBLICATION_PART_LEAF);

I think we should just pass PUBLICATION_PART_ALL, since that parameter is irrelevant to sequences anyway. Otherwise this code would be confusing.

I think we should rename the PublicationTable structure to PublicationRelation, since it can now contain information about a table or a sequence, both of which are relations.

+/*
+ * Add or remove table to/from publication.

s/table/sequence/. Generally this applies to all the code, working for tables, copied and modified for sequences.

@@ -18826,6 +18867,30 @@ preprocess_pubobj_list(List *pubobjspec_list, core_yyscan_t yyscanner)
             errmsg("invalid schema name"),
             parser_errposition(pubobj->location));
     }
+    else if (pubobj->pubobjtype == PUBLICATIONOBJ_SEQUENCES_IN_SCHEMA ||
+             pubobj->pubobjtype == PUBLICATIONOBJ_SEQUENCES_IN_CUR_SCHEMA)
+    {
+        /* WHERE clause is not allowed on a schema object */
+        if (pubobj->pubtable && pubobj->pubtable->whereClause)
+            ereport(ERROR,
+                    errcode(ERRCODE_SYNTAX_ERROR),
+                    errmsg("WHERE clause not allowed for schema"),
+                    parser_errposition(pubobj->location));

The grammar doesn't allow specifying a whereClause with the ALL TABLES IN SCHEMA specification, but we have code to throw an error if that happens. We also have similar code for ALL SEQUENCES IN SCHEMA. Should we add it for the SEQUENCE specification as well?

+static void
+fetch_sequence_data(char *nspname, char *relname,
... snip ...
+    /* tablesync sets the sequences in non-transactional way */
+    SetSequence(RelationGetRelid(rel), false, last_value, log_cnt, is_called);

Why? In case of a regular table, if the sync fails, the table will retain its state before the sync. Similarly it would be expected that the sequence retains its state before the sync, no?

@@ -1467,10 +1557,21 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)

Now that it syncs sequences as well, should we rename this as LogicalRepSyncRelationStart?

+static void
+apply_handle_sequence(StringInfo s)
... snip ...
+    /*
+     * Commit the per-stream transaction (we only do this when not in
+     * remote transaction, i.e. for non-transactional sequence updates.)
+     */
+    if (!in_remote_transaction)
+        CommitTransactionCommand();

I understand the purpose of the if block. It commits the transaction that was started when applying a non-transactional sequence change. But I didn't understand the term "per-stream transaction".

@@ -5683,8 +5686,15 @@ RelationBuildPublicationDesc(Relation relation, PublicationDesc *pubdesc)

Thanks for the additional comments. Those are useful.

@@ -1716,28 +1716,19 @@ describeOneTableDetails(const char *schemaname,

I think these changes make it easy to print the publication description per the code changes later. But maybe we should commit the refactoring patch separately.
-DECLARE_UNIQUE_INDEX(pg_publication_namespace_pnnspid_pnpubid_index, 6239, PublicationNamespacePnnspidPnpubidIndexId, on pg_publication_namespace using btree(pnnspid oid_ops, pnpubid oid_ops)); +DECLARE_UNIQUE_INDEX(pg_publication_namespace_pnnspid_pnpubid_pntype_index, 8903, PublicationNamespacePnnspidPnpubidPntypeIndexId, on pg_publication_namespace using btree(pnnspid oid_ops, pnpubid oid_ops, pntype char_ops)); Why do we need a new OID? The old index should not be there in a cluster created using this version and hence this OID will not be used. [1] https://www.postgresql.org/docs/current/functions-sequence.html Next I will review 0005. -- Best Wishes, Ashutosh Bapat
0005, 0006 and 0007 are all related to the initial sequence sync. [3] resulted in 0007 and I think we need it. That leaves 0005 and 0006 to be reviewed in this response.

I followed the discussion starting at [1] till [2]. The second one mentions the interlock mechanism which has been implemented in 0005 and 0006. While I don't have an objection to allowing LOCKing a sequence using the LOCK command, I am not sure whether it will actually work or is even needed. The problem described in [1] seems to be the same as the problem described in [2]. In both cases we see the sequence moving backwards during CATCHUP. At the end of catchup the sequence is in the right state in both the cases. [2] actually deems this behaviour OK. I also agree that the behaviour is ok. I am confused about whether we have solved anything using the interlocking, and whether it's really needed.

I see that the idea of using an LSN to decide whether or not to apply a change to a sequence started in [4]. In [5] Tomas proposed to use the page LSN. Looking at [6], it actually seems like a good idea. In [7] Tomas agreed that the LSN won't be sufficient. But I don't understand why. There are three LSNs in the picture - the restart LSN of the sync slot, the confirmed_flush LSN of the sync slot, and the page LSN of the sequence page from where we read the initial state of the sequence. I think they can be used with the following rules:

1. The publisher will not send any changes with LSN less than confirmed_flush, so we are good there.
2. Any non-transactional changes that happened between confirmed_flush and the page LSN should be discarded while syncing. They are already visible to SELECT.
3. Any transactional changes with commit LSN between confirmed_flush and the page LSN should be discarded while syncing. They are already visible to SELECT.
4. A DDL acquires a lock on the sequence. Thus no other change to that sequence can have an LSN between the LSN of the change made by the DDL and the commit LSN of that transaction. Only DDL changes to a sequence are transactional. Hence any transactional changes with commit LSN beyond the page LSN would not have been seen by the SELECT, otherwise SELECT would see the page LSN committed by that transaction. So they need to be applied while syncing.
5. Any non-transactional changes beyond the page LSN should be applied. They are not seen by SELECT.

Am I missing something?

I don't have an idea how to get the page LSN via a SQL query (while also fetching the data on that page). That may or may not be a challenge.

[1] https://www.postgresql.org/message-id/c2799362-9098-c7bf-c315-4d7975acafa3%40enterprisedb.com
[2] https://www.postgresql.org/message-id/2d4bee7b-31be-8b36-2847-a21a5d56e04f%40enterprisedb.com
[3] https://www.postgresql.org/message-id/f5a9d63d-a6fe-59a9-d1ed-38f6a5582c13%40enterprisedb.com
[4] https://www.postgresql.org/message-id/CAA4eK1KUYrXFq25xyjBKU1UDh7Dkzw74RXN1d3UAYhd4NzDcsg%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAA4eK1LiA8nV_ZT7gNHShgtFVpoiOvwoxNsmP_fryP%3DPsYPvmA%40mail.gmail.com
[6] https://www.postgresql.org/docs/current/storage-page-layout.html

-- Best Wishes, Ashutosh Bapat
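As a sketch of how the page LSN might be obtained together with the data (hypothetical, reading the page at the C level with existing buffer-manager APIs rather than a SQL query, and assuming seqrel is the already-opened sequence relation; such a helper could then be exposed as a SQL function):

    Buffer      buf;
    Page        page;
    XLogRecPtr  page_lsn;

    buf = ReadBuffer(seqrel, 0);            /* a sequence has a single block */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);
    page_lsn = PageGetLSN(page);
    /* ... read the sequence tuple from the same page while holding the lock,
     * so the value and the LSN are guaranteed to be consistent ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);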
And the last patch, 0008.

@@ -1180,6 +1194,13 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
... snip ...
+    if (IsSet(opts.specified_opts, SUBOPT_SEQUENCES))
+    {
+        values[Anum_pg_subscription_subsequences - 1] =
+            BoolGetDatum(opts.sequences);
+        replaces[Anum_pg_subscription_subsequences - 1] = true;
+    }
+

The list of allowed options set a few lines above this code does not contain "sequences". Is this option missing there, or is this code unnecessary? If we intend to add "sequences" at a later time after a subscription is created, will the sequences be synced after ALTER SUBSCRIPTION?

+    /*
+     * ignore sequences when not requested
+     *
+     * XXX Maybe we should differentiate between "callbacks not defined" or
+     * "subscriber disabled sequence replication" and "subscriber does not
+     * know about sequence replication" (e.g. old subscriber version).
+     *
+     * For the first two it'd be fine to bail out here, but for the last it

It's not clear which two you are talking about. Maybe that's because the paragraph above is ambiguous. It is in the form of A or B and C, so it's not clear which cases we are differentiating between: (A, B, C), ((A or B) and C), (A or (B and C)), or something else.

+     * might be better to continue and error out only when the sequence
+     * would be replicated (e.g. as part of the publication). We don't know
+     * that here, unfortunately.

Please see comments on the changes to pgoutput_startup() below. We may want to change the paragraph accordingly.

@@ -298,6 +298,20 @@ StartupDecodingContext(List *output_plugin_options,
     */
     ctx->reorder->update_progress_txn = update_progress_txn_cb_wrapper;
+
+    /*
+     * To support logical decoding of sequences, we require the sequence
+     * callback. We decide it here, but only check it later in the wrappers.
+     *
+     * XXX Isn't it wrong to define only one of those callbacks? Say we
+     * only define the stream_sequence_cb() - that may get strange results
+     * depending on what gets streamed. Either none or both?

I don't think the current condition is correct; it will consider sequence changes to be streamable even when sequence_cb is not defined, and then actually not send those. sequence_cb is needed to send sequence changes irrespective of whether transaction streaming is supported. But stream_sequence_cb is required only if the other stream callbacks are available. Something like:

if (ctx->callbacks.sequence_cb)
{
    if (ctx->streaming)
    {
        if (ctx->callbacks.stream_sequence_cb == NULL)
            ctx->sequences = false;
        else
            ctx->sequences = true;
    }
    else
        ctx->sequences = true;
}
else
    ctx->sequences = false;

+     *
+     * XXX Shouldn't sequence be defined at slot creation time, similar
+     * to two_phase?

Probably not. I don't know why two_phase is defined at slot creation time, so I can't comment on this. But this looks like something we need to answer before committing the patches.

+    /*
+     * We allow decoding of sequences when the option is given at the streaming
+     * start, provided the plugin supports all the callbacks for two-phase.

s/two-phase/sequences/

+     *
+     * XXX Similar behavior to the two-phase block below.

I think we need to describe the sequence-specific behaviour instead of pointing to the two-phase one. two-phase is part of the replication slot's on-disk specification, but sequence is not. Given that it's an XXX, I think you are planning to do that.

+     *
+     * XXX Shouldn't this error out if the callbacks are not defined?

Isn't this already being done in pgoutput_startup()? Should we remove this XXX?

+    /*
+     * Here, we just check whether the sequences decoding option is passed
+     * by plugin and decide whether to enable it at later point of time. It
+     * remains enabled if the previous start-up has done so. But we only
+     * allow the option to be passed in with sufficient version of the
+     * protocol, and when the output plugin supports it.
+     */
+    if (!data->sequences)
+        ctx->sequences_opt_given = false;
+    else if (data->protocol_version < LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("requested proto_version=%d does not support sequences, need %d or higher",
+                        data->protocol_version, LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)));
+    else if (!ctx->sequences)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("sequences requested, but not supported by output plugin")));

If a given output plugin doesn't implement the callbacks but the subscription specifies sequences, the code will throw an error whether or not the publication is publishing sequences. Instead, I think the behaviour should be the same as the case when the publication doesn't include sequences even though the publisher node has sequences. In either case the publisher (the plugin or the publication) doesn't want to publish sequence data, so the subscriber's request can be ignored. What might be good is to throw an error if the publication publishes the sequences but there are no callbacks - both the output plugin and the publication are part of the publisher node, thus it's easy for users to set them up consistently. GetPublicationRelations can be tweaked a bit to return just tables or sequences. That, along with the publication's all-sequences flag, should tell us whether the publication publishes any sequences or not.

That ends my first round of reviews.

-- Best Wishes, Ashutosh Bapat
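The proposed logic above can also be condensed into a single expression (a sketch, equivalent to the if/else form):

    ctx->sequences = (ctx->callbacks.sequence_cb != NULL) &&
        (!ctx->streaming || ctx->callbacks.stream_sequence_cb != NULL);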
On Tue, Jun 27, 2023 at 11:30 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I have not looked at the DDL replication patch in detail so I may be
> missing something. IIUC, that patch replicates the DDL statement in
> some form: parse tree or statement. But it doesn't replicate some or
> all of the WAL records that the DDL execution generates.
>

Yes, the DDL replication patch uses the parse tree and catalog
information to generate a deparsed form of the DDL statement, which is
WAL-logged and used to replicate DDLs.

--
With Regards,
Amit Kapila.
Hi,

here's a rebased and significantly reworked version of this patch
series, based on the recent reviews and discussion. Let me go through
the main differences:

1) reorder the patches to have the "shortening" of test output first

2) merge the various "fix" patches into the three main patches

0002 - introduce sequence decoding infrastructure
0003 - add sequences to test_decoding
0004 - add sequences to built-in replication

I've kept those patches separate to make the evolution easier to follow
and discuss, but it was necessary to clean up the patch series and make
it clearer what the current state is.

3) simplify the replicated state

As suggested by Ashutosh, it may not be a good idea to replicate the
(last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to
our internal implementation, which may not be the right thing for other
plugins. So this new patch replicates just "value", which is pretty much
(last_value + log_cnt), representing the next value that should be safe
to generate on the subscriber (in case of a failover). A sketch of this
is included below.

4) simplify test_decoding code & tests

I realized I can ditch some of the test_decoding changes, because at
some point we chose to only include sequences in test_decoding when
explicitly requested. So the tests don't need to disable that; it's the
other way around - one test needs to enable it. This now also prints the
single value, instead of the three values.

5) minor tweaks in the built-in replication

This adopts the relaxed LOCK code to allow locking sequences during the
initial sync, and also adopts the replication of a single value (this
affects the "apply" side of that change too).

6) simplified protocol versioning

The main open question I had was what to do about protocol versioning
for the built-in replication - how to decide whether the subscriber can
apply sequences, and what should happen if we decode a sequence but the
subscriber does not support that.

I was not entirely sure we want to handle this by a simple version
check, because that maps capabilities to a linear scale, which seems
pretty limiting. That is, each protocol version just grows, and a new
version number means support for a new capability - like replication of
two-phase commits, or sequences. Which is nice, but it does not allow
supporting just the latter feature, for example - you can't skip one.
Which is why 2PC decoding has both a version and a subscription flag,
which allows exactly that ...

When discussing this off-list with Peter Eisentraut, he reminded me of
his old message in the thread:

https://www.postgresql.org/message-id/8046273f-ea88-5c97-5540-0ccd5d244fd4@enterprisedb.com

where he advocates for exactly this simplified behavior. So I took a
stab at it, and 0005 should be doing that. I keep it as a separate patch
for now, to make the changes clearer, but ultimately it should be merged
into the 0003 and 0004 parts.

It's not a particularly complex change. It mostly ditches the
subscription option (which also means columns in the pg_subscription
catalog) and a flag in the decoding context. But the main change is in
pgoutput_sequence(), where we check protocol_version and error out if
it's not the right version (instead of just ignoring the sequence).

AFAICS this behaves as expected - with a PG15 subscriber, I get an ERROR
on the publisher side from the sequence callback.

But it now occurred to me we could do the same thing with the original
approach - allow the per-subscription "sequences" flag, but error out
when the subscriber did not enable that capability ...
Hopefully, I haven't forgotten to address any important point from the reviews ... The one thing I'm not really sure about is how it interferes with the replication of DDL. But in principle, if it decodes DDL for ALTER SEQUENCE, I don't see why it would be a problem that we then decode and replicate the WAL for the sequence state. But if it is a problem, we should be able to skip this WAL record with the initial sequence state (which I think should be possible thanks to the "created" flag this patch adds to the WAL record). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
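Regarding (3) above, here is roughly what the replicated state boils
down to - a minimal sketch, assuming the fields of Form_pg_sequence_data
(commands/sequence.h); the helper name is made up, and the is_called
corner case is glossed over:

    #include "commands/sequence.h"      /* Form_pg_sequence_data */

    /*
     * Sketch only: the single "value" replicated for a sequence, i.e.
     * the next value that should be safe to generate on the subscriber
     * after a failover.  last_value is the last value written to disk,
     * log_cnt the number of values still covered by the last WAL record;
     * their sum is the first value guaranteed not to have been handed
     * out yet.
     */
    static int64
    sequence_replicated_value(Form_pg_sequence_data seq)
    {
        return seq->last_value + seq->log_cnt;
    }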
Attachment
Thanks for the updated patches. I haven't looked at the patches yet but have some responses below. On Thu, Jul 13, 2023 at 12:35 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > 3) simplify the replicated state > > As suggested by Ashutosh, it may not be a good idea to replicate the > (last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to > our internal implementation. Which may not be the right thing for other > plugins. So this new patch replicates just "value" which is pretty much > (last_value + log_cnt), representing the next value that should be safe > to generate on the subscriber (in case of a failover). > Thanks. That will help. > 5) minor tweaks in the built-in replication > > This adopts the relaxed LOCK code to allow locking sequences during the > initial sync, and also adopts the replication of a single value (this > affects the "apply" side of that change too). > I think the problem we are trying to solve with LOCK is not actually getting solved. See [2]. Instead your earlier idea of using page LSN looks better. > > 6) simplified protocol versioning I had tested the cross-version logical replication with older set of patches. Didn't see any unexpected behaviour then. I will test again. > > The one thing I'm not really sure about is how it interferes with the > replication of DDL. But in principle, if it decodes DDL for ALTER > SEQUENCE, I don't see why it would be a problem that we then decode and > replicate the WAL for the sequence state. But if it is a problem, we > should be able to skip this WAL record with the initial sequence state > (which I think should be possible thanks to the "created" flag this > patch adds to the WAL record). I had suggested a solution in [1] to avoid adding a flag to the WAL record. Did you consider it? If you considered it and rejected, I would be interested in knowing reasons behind rejecting it. Let me repeat here again: ``` We can add a decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work as is. Of course we will add non-sequence relfilelocators as well but that should be fine. Creating a new relfilelocator shouldn't be a frequent operation. If at all we are worried about that, we can add only the relfilenodes associated with sequences to the hash table. ``` If the DDL replication takes care of replicating and applying sequence changes, I think we don't need the changes tracking "transactional" sequence changes in this patch-set. That also makes a case for not adding a new field to WAL which may not be used. [1] https://www.postgresql.org/message-id/CAExHW5v_vVqkhF4ehST9EzpX1L3bemD1S%2BkTk_-ZVu_ir-nKDw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAExHW5vHRgjWzi6zZbgCs97eW9U7xMtzXEQK%2BaepuzoGDsDNtg%40mail.gmail.com -- Best Wishes, Ashutosh Bapat
On 6/23/23 15:18, Ashutosh Bapat wrote:
> ...
>
> I reviewed 0001 and related parts of 0004 and 0008 in detail.
>
> I have only one major change request, about
> typedef struct xl_seq_rec
> {
>     RelFileLocator locator;
> +   bool created; /* creates a new relfilenode (CREATE/ALTER) */
>
> I am not sure what are the repercussions of adding a member to an existing WAL
> record. I didn't see any code which handles the old WAL format which doesn't
> contain the "created" flag. IIUC, the logical decoding may come across
> a WAL record written in the old format after upgrade and restart. Is
> that not possible?
>

I don't understand why adding a new field to xl_seq_rec would be an
issue, considering it's done in a new major version. Sure, if you
generate WAL with an old build, and start with a patched version, that
would break things. But that's true for many other patches, and it's
irrelevant for releases.

> But I don't think it's necessary. We can add a
> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> as is. Of course we will add non-sequence relfilelocators as well but that
> should be fine. Creating a new relfilelocator shouldn't be a frequent
> operation. If at all we are worried about that, we can add only the
> relfilenodes associated with sequences to the hash table.
>

Hmmmm, that might work. I feel a bit uneasy about having to keep all
relfilenodes, not just sequences ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/5/23 16:51, Ashutosh Bapat wrote: > 0005, 0006 and 0007 are all related to the initial sequence sync. [3] > resulted in 0007 and I think we need it. That leaves 0005 and 0006 to > be reviewed in this response. > > I followed the discussion starting [1] till [2]. The second one > mentions the interlock mechanism which has been implemented in 0005 > and 0006. While I don't have an objection to allowing LOCKing a > sequence using the LOCK command, I am not sure whether it will > actually work or is even needed. > > The problem described in [1] seems to be the same as the problem > described in [2]. In both cases we see the sequence moving backwards > during CATCHUP. At the end of catchup the sequence is in the right > state in both the cases. [2] actually deems this behaviour OK. I also > agree that the behaviour is ok. I am confused whether we have solved > anything using interlocking and it's really needed. > > I see that the idea of using an LSN to decide whether or not to apply > a change to sequence started in [4]. In [5] Tomas proposed to use page > LSN. Looking at [6], it actually seems like a good idea. In [7] Tomas > agreed that LSN won't be sufficient. But I don't understand why. There > are three LSNs in the picture - restart LSN of sync slot, > confirmed_flush LSN of sync slot and page LSN of the sequence page > from where we read the initial state of the sequence. I think they can > be used with the following rules: > 1. The publisher will not send any changes with LSN less than > confirmed_flush so we are good there. > 2. Any non-transactional changes that happened between confirmed_flush > and page LSN should be discarded while syncing. They are already > visible to SELECT. > 3. Any transactional changes with commit LSN between confirmed_flush > and page LSN should be discarded while syncing. They are already > visible to SELECT. > 4. A DDL acquires a lock on sequence. Thus no other change to that > sequence can have an LSN between the LSN of the change made by DDL and > the commit LSN of that transaction. Only DDL changes to sequence are > transactional. Hence any transactional changes with commit LSN beyond > page LSN would not have been seen by the SELECT otherwise SELECT would > see the page LSN committed by that transaction. so they need to be > applied while syncing. > 5. Any non-transactional changes beyond page LSN should be applied. > They are not seen by SELECT. > > Am I missing something? > Hmmm, I think you're onto something and the interlock may not be actually necessary ... IIRC there were two examples of the non-MVCC sequence behavior, leading me to add the interlock. 1) going "backwards" during catchup Sequences are not MVCC, and if there are increments between the slot creation and the SELECT, the sequence will go backwards. But it will ultimately end with the correct value. The LSN checks were an attempt to prevent this. I don't recall why I concluded this would not be sufficient (there's no link for [7] in your message), but maybe it was related to the sequence increments not being WAL-logged and thus not guaranteed to update the page LSN, or something like that. But if we agree we only guarantee consistency at the end of the catchup, this does not matter - it's OK to go backwards as long as the sequence ends with the correct value. 
2) missing an increment because of ALTER SEQUENCE My concern here was that we might have a transaction that does ALTER SEQUENCE before the tablesync slot gets created, and the SELECT still sees the old sequence state because we start decoding after the ALTER. But now that I think about it again, this probably can't happen, because the slot won't be created until the ALTER commits. So we shouldn't miss anything. I suspect I got confused by some other bug in the patch at that time, leading me to a faulty conclusion. I'll try removing the interlock, and make sure it actually works OK. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
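If the interlock indeed goes away, the catchup decision could reduce to
a page-LSN comparison, along the lines of this sketch (both parameter
names are made up; the rules are the ones listed above):

    #include "access/xlogdefs.h"        /* XLogRecPtr */

    /*
     * Sketch only: decide whether to apply a decoded sequence change
     * during catchup.  page_lsn is the LSN of the sequence page read by
     * the initial sync's SELECT, change_lsn the LSN of the decoded
     * change.  Changes up to page_lsn were already visible to that
     * SELECT, so they can be discarded; anything beyond it must be
     * applied.
     */
    static bool
    should_apply_sequence_change(XLogRecPtr page_lsn, XLogRecPtr change_lsn)
    {
        return (change_lsn > page_lsn);
    }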
On 7/13/23 16:24, Ashutosh Bapat wrote: > Thanks for the updated patches. I haven't looked at the patches yet > but have some responses below. > > On Thu, Jul 13, 2023 at 12:35 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >> >> 3) simplify the replicated state >> >> As suggested by Ashutosh, it may not be a good idea to replicate the >> (last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to >> our internal implementation. Which may not be the right thing for other >> plugins. So this new patch replicates just "value" which is pretty much >> (last_value + log_cnt), representing the next value that should be safe >> to generate on the subscriber (in case of a failover). >> > > Thanks. That will help. > > >> 5) minor tweaks in the built-in replication >> >> This adopts the relaxed LOCK code to allow locking sequences during the >> initial sync, and also adopts the replication of a single value (this >> affects the "apply" side of that change too). >> > > I think the problem we are trying to solve with LOCK is not actually > getting solved. See [2]. Instead your earlier idea of using page LSN > looks better. > Thanks. I think you may be right, and the interlock may not be necessary. I've responded to the linked threads, that's probably easier to follow as it keeps the context. >> >> 6) simplified protocol versioning > > I had tested the cross-version logical replication with older set of > patches. Didn't see any unexpected behaviour then. I will test again. >> I think the question is what's the expected behavior. What behavior did you expect/observe? IIRC with the previous version of the patch, if you connected an old subscriber (without sequence replication), it just ignored/skipped the sequence increments and replicated the other changes. The new patch detects that, and triggers ERROR on the publisher. And I think that's the correct thing to do. There was a lengthy discussion about making this more flexible (by not tying this to "linear" protocol version) and/or permissive. I tried doing that by doing similar thing to decoding of 2PC, which allows choosing when creating a subscription. But ultimately that just chooses where to throw an error - whether on the publisher (in the output plugin callback) or on apply side (when trying to apply change to non-existent sequence). I still think it might be useful to have these "capabilities" orthogonal to the protocol version, but it's a matter for a separate patch. It's enough not to fail with "unknown message" on the subscriber. >> The one thing I'm not really sure about is how it interferes with the >> replication of DDL. But in principle, if it decodes DDL for ALTER >> SEQUENCE, I don't see why it would be a problem that we then decode and >> replicate the WAL for the sequence state. But if it is a problem, we >> should be able to skip this WAL record with the initial sequence state >> (which I think should be possible thanks to the "created" flag this >> patch adds to the WAL record). > > I had suggested a solution in [1] to avoid adding a flag to the WAL > record. Did you consider it? If you considered it and rejected, I > would be interested in knowing reasons behind rejecting it. Let me > repeat here again: > > ``` > We can add a > decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator > in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work > as is. Of course we will add non-sequence relfilelocators as well but that > should be fine. 
Creating a new relfilelocator shouldn't be a frequent > operation. If at all we are worried about that, we can add only the > relfilenodes associated with sequences to the hash table. > ``` > Thanks for reminding me. In principle I'm not against using the proposed approach - tracking all relfilenodes created by a transaction, although I don't think the new flag in xl_seq_rec is a problem, and it's probably cheaper than having to decode all relfilenode creations. > If the DDL replication takes care of replicating and applying sequence > changes, I think we don't need the changes tracking "transactional" > sequence changes in this patch-set. That also makes a case for not > adding a new field to WAL which may not be used. > Maybe, but the DDL replication patch is not there yet, and I'm not sure it's a good idea to make this patch wait for a much larger/complex patch. If the DDL replication patch gets committed, it may ditch this part (assuming it happens in the same development cycle). However, my impression was DDL replication would be optional. In which case we still need to handle the transactional case, to support sequence replication without DDL replication enabled. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 13, 2023 at 8:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 6/23/23 15:18, Ashutosh Bapat wrote:
> > ...
> >
> > I reviewed 0001 and related parts of 0004 and 0008 in detail.
> >
> > I have only one major change request, about
> > typedef struct xl_seq_rec
> > {
> >     RelFileLocator locator;
> > +   bool created; /* creates a new relfilenode (CREATE/ALTER) */
> >
> > I am not sure what are the repercussions of adding a member to an existing WAL
> > record. I didn't see any code which handles the old WAL format which doesn't
> > contain the "created" flag. IIUC, the logical decoding may come across
> > a WAL record written in the old format after upgrade and restart. Is
> > that not possible?
> >
>
> I don't understand why adding a new field to xl_seq_rec would be an
> issue, considering it's done in a new major version. Sure, if you
> generate WAL with an old build, and start with a patched version, that
> would break things. But that's true for many other patches, and it's
> irrelevant for releases.

There are two issues:

1. The name of the field "created" - what does "created" mean in a
"sequence status" WAL record? Consider the following sequence of events:

    Begin;
    Create sequence ('s');
    select nextval('s') from generate_series(1, 1000);
    ...
    commit

This is going to create 1000/32 WAL records with "created" = true. But
only the first one created the relfilenode. We might fix this little
annoyance by changing the name to "transactional".

2. Consider the following scenario:

- v15 running logical decoding has restart_lsn before a "sequence
  change" WAL record written in the old format
- stop the server
- upgrade to v16
- logical decoding will start from restart_lsn pointing to a WAL record
  written by v15

When it tries to read the "sequence change" WAL record it won't be able
to get the "created" flag. Am I missing something here?

> > But I don't think it's necessary. We can add a
> > decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> > in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> > as is. Of course we will add non-sequence relfilelocators as well but that
> > should be fine. Creating a new relfilelocator shouldn't be a frequent
> > operation. If at all we are worried about that, we can add only the
> > relfilenodes associated with sequences to the hash table.
> >
>
> Hmmmm, that might work. I feel a bit uneasy about having to keep all
> relfilenodes, not just sequences ...

From the relfilenode it should be easy to get to the rel and then see if
it's a sequence. Only add relfilenodes for sequences.

--
Best Wishes,
Ashutosh Bapat
On Thu, Jul 13, 2023 at 9:47 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> >>
> >> 6) simplified protocol versioning
> >
> > I had tested the cross-version logical replication with older set of
> > patches. Didn't see any unexpected behaviour then. I will test again.
> >>
>
> I think the question is what's the expected behavior. What behavior did
> you expect/observe?

Let me run my test again and respond.

>
> IIRC with the previous version of the patch, if you connected an old
> subscriber (without sequence replication), it just ignored/skipped the
> sequence increments and replicated the other changes.

I liked that.

>
> The new patch detects that, and triggers ERROR on the publisher. And I
> think that's the correct thing to do.

With this behaviour users will never be able to setup logical
replication between old and new servers, considering almost every setup
has sequences.

>
> There was a lengthy discussion about making this more flexible (by not
> tying this to "linear" protocol version) and/or permissive. I tried
> doing that by doing similar thing to decoding of 2PC, which allows
> choosing when creating a subscription.
>
> But ultimately that just chooses where to throw an error - whether on
> the publisher (in the output plugin callback) or on apply side (when
> trying to apply change to non-existent sequence).

I had some comments on throwing an error in [1], esp. towards the end.

>
> I still think it might be useful to have these "capabilities" orthogonal
> to the protocol version, but it's a matter for a separate patch. It's
> enough not to fail with "unknown message" on the subscriber.

Yes, we should avoid breaking replication with "unknown message". I also
agree that improving things in this area can be done in a separate
patch, but as far as possible in this release itself.

> > If the DDL replication takes care of replicating and applying sequence
> > changes, I think we don't need the changes tracking "transactional"
> > sequence changes in this patch-set. That also makes a case for not
> > adding a new field to WAL which may not be used.
> >
>
> Maybe, but the DDL replication patch is not there yet, and I'm not sure
> it's a good idea to make this patch wait for a much larger/complex
> patch. If the DDL replication patch gets committed, it may ditch this
> part (assuming it happens in the same development cycle).
>
> However, my impression was DDL replication would be optional. In which
> case we still need to handle the transactional case, to support sequence
> replication without DDL replication enabled.

As I said before, I don't think this patchset needs to wait for the DDL
replication patch. Let's hope that the latter lands in the same release
and straightens the protocol, instead of carrying it forever.

[1] https://www.postgresql.org/message-id/CAExHW5vScYKKb0RZoiNEPfbaQ60hihfuWeLuZF4JKrwPJXPcUw%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
On 7/14/23 09:34, Ashutosh Bapat wrote: > On Thu, Jul 13, 2023 at 9:47 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >>>> >>>> 6) simplified protocol versioning >>> >>> I had tested the cross-version logical replication with older set of >>> patches. Didn't see any unexpected behaviour then. I will test again. >>>> >> >> I think the question is what's the expected behavior. What behavior did >> you expect/observe? > > Let me run my test again and respond. > >> >> IIRC with the previous version of the patch, if you connected an old >> subscriber (without sequence replication), it just ignored/skipped the >> sequence increments and replicated the other changes. > > I liked that. > I liked that too, initially (which is why I did it that way). But I changed my mind, because it's likely to cause more harm than good. >> >> The new patch detects that, and triggers ERROR on the publisher. And I >> think that's the correct thing to do. > > With this behaviour users will never be able to setup logical > replication between old and new servers considering almost every setup > has sequences. > That's not true. Replication to older versions works fine as long as the publication does not include sequences (which need to be added explicitly). If you have a publication with sequences, you clearly want to replicate them, ignoring it is just confusing "magic". If you have a publication with sequences and still want to replicate to an older server, create a new publication without sequences. >> >> There was a lengthy discussion about making this more flexible (by not >> tying this to "linear" protocol version) and/or permissive. I tried >> doing that by doing similar thing to decoding of 2PC, which allows >> choosing when creating a subscription. >> >> But ultimately that just chooses where to throw an error - whether on >> the publisher (in the output plugin callback) or on apply side (when >> trying to apply change to non-existent sequence). > > I had some comments on throwing error in [1], esp. towards the end. > Yes. You said: If a given output plugin doesn't implement the callbacks but subscription specifies sequences, the code will throw an error whether or not publication is publishing sequences. This refers to situation when the subscriber says "sequences" when opening the connection. And this happens *in the plugin* which also defines the callbacks, so I don't see how we could not have the callbacks defined ... Furthermore, the simplified protocol versioning does away with the "sequences" option, so in that case this can't even happen. >> >> I still think it might be useful to have these "capabilities" orthogonal >> to the protocol version, but it's a matter for a separate patch. It's >> enough not to fail with "unknown message" on the subscriber. > > Yes, We should avoid breaking replication with "unknown message". > > I also agree that improving things in this area can be done in a > separate patch, but as far as possible in this release itself. > >>> If the DDL replication takes care of replicating and applying sequence >>> changes, I think we don't need the changes tracking "transactional" >>> sequence changes in this patch-set. That also makes a case for not >>> adding a new field to WAL which may not be used. >>> >> >> Maybe, but the DDL replication patch is not there yet, and I'm not sure >> it's a good idea to make this patch wait for a much larger/complex >> patch. 
If the DDL replication patch gets committed, it may ditch this >> part (assuming it happens in the same development cycle). >> >> However, my impression was DDL replication would be optional. In which >> case we still need to handle the transactional case, to support sequence >> replication without DDL replication enabled. > > As I said before, I don't think this patchset needs to wait for DDL > replication patch. Let's hope that the later lands in the same release > and straightens protocol instead of carrying it forever. > OK, I agree with that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/14/23 08:51, Ashutosh Bapat wrote: > On Thu, Jul 13, 2023 at 8:29 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 6/23/23 15:18, Ashutosh Bapat wrote: >>> ... >>> >>> I reviewed 0001 and related parts of 0004 and 0008 in detail. >>> >>> I have only one major change request, about >>> typedef struct xl_seq_rec >>> { >>> RelFileLocator locator; >>> + bool created; /* creates a new relfilenode (CREATE/ALTER) */ >>> >>> I am not sure what are the repercussions of adding a member to an existing WAL >>> record. I didn't see any code which handles the old WAL format which doesn't >>> contain the "created" flag. IIUC, the logical decoding may come across >>> a WAL record written in the old format after upgrade and restart. Is >>> that not possible? >>> >> >> I don't understand why would adding a new field to xl_seq_rec be an >> issue, considering it's done in a new major version. Sure, if you >> generate WAL with old build, and start with a patched version, that >> would break things. But that's true for many other patches, and it's >> irrelevant for releases. > > There are two issues > 1. the name of the field "created" - what does created mean in a > "sequence status" WAL record? Consider following sequence of events > Begin; > Create sequence ('s'); > select nextval('s') from generate_series(1, 1000); > > ... > commit > > This is going to create 1000/32 WAL records with "created" = true. But > only the first one created the relfilenode. We might fix this little > annoyance by changing the name to "transactional". > I don't think that's true - this will create 1 record with "created=true" (the one right after the CREATE SEQUENCE) and the rest will have "created=false". I realized I haven't modified seq_desc to show this flag, so I did that in the updated patch version, which makes this easy to see. And all of them need to be handled in a transactional way, because they modify relfilenode visible only to that transaction. So calling the flag "transactional" would be misleading, because the increments can be transactional even with "created=false". > 2. Consider following scenario > v15 running logical decoding has restart_lsn before a "sequence > change" WAL record written in old format > stop the server > upgrade to v16 > logical decoding will stat from restart_lsn pointing to a WAL record > written by v15. When it tries to read "sequence change" WAL record it > won't be able to get "created" flag. > > Am I missing something here? > You're missing the fact that pg_upgrade does not copy replication slots, so the restart_lsn does not matter. (Yes, this is pretty annoying consequence of using pg_upgrade. And maybe we'll improve that in the future - but I'm pretty sure we won't allow decoding old WAL.) >> >>> But I don't think it's necessary. We can add a >>> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator >>> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work >>> as is. Of course we will add non-sequence relfilelocators as well but that >>> should be fine. Creating a new relfilelocator shouldn't be a frequent >>> operation. If at all we are worried about that, we can add only the >>> relfilenodes associated with sequences to the hash table. >>> >> >> Hmmmm, that might work. I feel a bit uneasy about having to keep all >> relfilenodes, not just sequences ... > > From relfilenode it should be easy to get to rel and then see if it's > a sequence. Only add relfilenodes for the sequence. > Will try. 
Attached is an updated version with pg_waldump printing the "created" flag in seq_desc, and removing the unnecessary interlock. I've kept the protocol changes in a separate commit for now. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Fri, Jul 14, 2023 at 3:59 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> >>
> >> The new patch detects that, and triggers ERROR on the publisher. And I
> >> think that's the correct thing to do.
> >
> > With this behaviour users will never be able to setup logical
> > replication between old and new servers considering almost every setup
> > has sequences.
> >
>
> That's not true.
>
> Replication to older versions works fine as long as the publication does
> not include sequences (which need to be added explicitly). If you have a
> publication with sequences, you clearly want to replicate them, ignoring
> it is just confusing "magic".

I was looking at it from a different angle. Publishers publish what
they want, subscribers choose what they want, and what gets replicated
is the intersection of these two sets. Both live happily.

But I am fine with that too. It's just that users need to create more
publications.

>
> If you have a publication with sequences and still want to replicate to
> an older server, create a new publication without sequences.
>

I tested the current patches with the subscriber at PG 14 and the
publisher at master + these patches. I created one table and a sequence
on both publisher and subscriber. I created two publications, one with
the sequence and the other without it. Both have the table in it. When
the subscriber subscribes to the publication with the sequence, the
following ERROR is repeated in the subscriber logs and nothing gets
replicated:

```
[2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOG: 00000: logical replication apply worker for subscription "sub5433" has started
[2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOCATION: ApplyWorkerMain, worker.c:3169
[2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] ERROR: 08P01: could not receive data from WAL stream: ERROR: protocol version does not support sequence replication
        CONTEXT: slot "sub5433", output plugin "pgoutput", in the sequence callback, associated LSN 0/1513718
[2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] LOCATION: libpqrcv_receive, libpqwalreceiver.c:818
[2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOG: 00000: background worker "logical replication worker" (PID 916293) exited with exit code 1
[2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOCATION: LogChildExit, postmaster.c:3737
```

When the subscriber subscribes to the publication without the sequence,
things work normally.

The cross-version replication is working as expected then.

--
Best Wishes,
Ashutosh Bapat
On Fri, Jul 14, 2023 at 4:10 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I don't think that's true - this will create 1 record with > "created=true" (the one right after the CREATE SEQUENCE) and the rest > will have "created=false". I may have misread the code. > > I realized I haven't modified seq_desc to show this flag, so I did that > in the updated patch version, which makes this easy to see. Now I see it. Thanks for the clarification. > > > > Am I missing something here? > > > > You're missing the fact that pg_upgrade does not copy replication slots, > so the restart_lsn does not matter. > > (Yes, this is pretty annoying consequence of using pg_upgrade. And maybe > we'll improve that in the future - but I'm pretty sure we won't allow > decoding old WAL.) Ah, I see. Thanks for correcting me. > >>> > >> > >> Hmmmm, that might work. I feel a bit uneasy about having to keep all > >> relfilenodes, not just sequences ... > > > > From relfilenode it should be easy to get to rel and then see if it's > > a sequence. Only add relfilenodes for the sequence. > > > > Will try. > Actually, adding all relfilenodes to hash may not be that bad. There shouldn't be many of those. So the extra step to lookup reltype may not be necessary. What's your reason for uneasiness? But yeah, there's a way to avoid that as well. Should I wait for this before the second round of review? -- Best Wishes, Ashutosh Bapat
On 7/14/23 15:50, Ashutosh Bapat wrote: > On Fri, Jul 14, 2023 at 3:59 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >>>> >>>> The new patch detects that, and triggers ERROR on the publisher. And I >>>> think that's the correct thing to do. >>> >>> With this behaviour users will never be able to setup logical >>> replication between old and new servers considering almost every setup >>> has sequences. >>> >> >> That's not true. >> >> Replication to older versions works fine as long as the publication does >> not include sequences (which need to be added explicitly). If you have a >> publication with sequences, you clearly want to replicate them, ignoring >> it is just confusing "magic". > > I was looking at it from a different angle. Publishers publish what > they want, subscribers choose what they want and what gets replicated > is intersection of these two sets. Both live happily. > > But I am fine with that too. It's just that users need to create more > publications. > I think you might make essentially the same argument about replicating just some of the tables in the publication. That is, the publication has tables t1 and t2, but subscriber only has t1. That will fail too, we don't allow the subscriber to ignore changes for t2. I think it'd be rather weird (and confusing) to do this differently for different types of replicated objects. >> >> If you have a publication with sequences and still want to replicate to >> an older server, create a new publication without sequences. >> > > I tested the current patches with subscriber at PG 14 and publisher at > master + these patches. I created one table and a sequence on both > publisher and subscriber. I created two publications, one with > sequence and other without it. Both have the table in it. When the > subscriber subscribes to the publication with sequence, following > ERROR is repeated in the subscriber logs and nothing gets replicated > ``` > [2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOG: 00000: > logical replication apply worker for subscription "sub5433" has > started > [2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOCATION: > ApplyWorkerMain, worker.c:3169 > [2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] ERROR: 08P01: > could not receive data from WAL stream: ERROR: protocol version does > not support sequence replication > CONTEXT: slot "sub5433", output plugin "pgoutput", in the > sequence callback, associated LSN 0/1513718 > [2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] LOCATION: > libpqrcv_receive, libpqwalreceiver.c:818 > [2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOG: 00000: > background worker "logical replication worker" (PID 916293) exited > with exit code 1 > [2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOCATION: > LogChildExit, postmaster.c:3737 > ``` > > When the subscriber subscribes to the publication without sequence, > things work normally. > > The cross-version replication is working as expected then. > Thanks for testing / confirming this! So, do we agree this behavior is reasonable? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/14/23 16:02, Ashutosh Bapat wrote: > ... >>>>> >>>> >>>> Hmmmm, that might work. I feel a bit uneasy about having to keep all >>>> relfilenodes, not just sequences ... >>> >>> From relfilenode it should be easy to get to rel and then see if it's >>> a sequence. Only add relfilenodes for the sequence. >>> >> >> Will try. >> > > Actually, adding all relfilenodes to hash may not be that bad. There > shouldn't be many of those. So the extra step to lookup reltype may > not be necessary. What's your reason for uneasiness? But yeah, there's > a way to avoid that as well. > > Should I wait for this before the second round of review? > I don't think you have to wait - just ignore the part that changes the WAL record, which is a pretty tiny bit of the patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Here's a slightly improved version of the patch, fixing two minor issues
reported by cfbot:

- compiler warning about fetch_sequence_data maybe not initializing a
  variable (not true, but silence the warning)

- missing "id" for an element in the SGML docs

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Fri, Jul 14, 2023 at 7:33 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Thanks for testing / confirming this! So, do we agree this behavior is
> reasonable?
>

This behaviour doesn't need any on-disk changes or has nothing in it
which prohibits us from changing it in future. So I think it's good as
a v0. If required we can add the protocol option to provide more
flexible behaviour.

One thing I am worried about is that the subscriber will get an error
only when a sequence change is decoded. All the prior changes will be
replicated and applied on the subscriber. Thus by the time the user
realises this mistake, they may have replicated data. At this point if
they want to subscribe to a publication without sequences they will
need to clean the already replicated data. But they may not be in a
position to know which is which esp when the subscriber has its own
data in those tables. Example:

publisher: create publication pub with sequences and tables
subscriber: subscribe to pub
publisher: modify data in tables and sequences
subscriber: replicates some data and errors out
publisher: delete some data from tables
publisher: create a publication pub_tab without sequences
subscriber: subscribe to pub_tab
subscriber: replicates the data but rows which were deleted on
publisher remain on the subscriber

--
Best Wishes,
Ashutosh Bapat
On 7/18/23 15:52, Ashutosh Bapat wrote: > On Fri, Jul 14, 2023 at 7:33 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >> Thanks for testing / confirming this! So, do we agree this behavior is >> reasonable? >> > > This behaviour doesn't need any on-disk changes or has nothing in it > which prohibits us from changing it in future. So I think it's good as > a v0. If required we can add the protocol option to provide more > flexible behaviour. > True, although "no on-disk changes" does not exactly mean we can just change it at will. Essentially, once it gets released, the behavior is somewhat fixed for the next ~5 years, until that release gets EOL. And likely longer, because more features are likely to do the same thing. That's essentially why the patch was reverted from PG16 - I was worried the elaborate protocol versioning/negotiation was not the right thing. > One thing I am worried about is that the subscriber will get an error > only when a sequence change is decoded. All the prior changes will be > replicated and applied on the subscriber. Thus by the time the user > realises this mistake, they may have replicated data. At this point if > they want to subscribe to a publication without sequences they will > need to clean the already replicated data. But they may not be in a > position to know which is which esp when the subscriber has its own > data in those tables. Example, > > publisher: create publication pub with sequences and tables > subscriber: subscribe to pub > publisher: modify data in tables and sequences > subscriber: replicates some data and errors out > publisher: delete some data from tables > publisher: create a publication pub_tab without sequences > subscriber: subscribe to pub_tab > subscriber: replicates the data but rows which were deleted on > publisher remain on the subscriber > Sure, but I'd argue that's correct. If the replication stream has something the subscriber can't apply, what else would you do? We had exactly the same thing with TRUNCATE, for example (except that it failed with "unknown message" on the subscriber). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 19, 2023 at 1:20 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> >
> > This behaviour doesn't need any on-disk changes or has nothing in it
> > which prohibits us from changing it in future. So I think it's good as
> > a v0. If required we can add the protocol option to provide more
> > flexible behaviour.
> >
>
> True, although "no on-disk changes" does not exactly mean we can just
> change it at will. Essentially, once it gets released, the behavior is
> somewhat fixed for the next ~5 years, until that release gets EOL. And
> likely longer, because more features are likely to do the same thing.
>
> That's essentially why the patch was reverted from PG16 - I was worried
> the elaborate protocol versioning/negotiation was not the right thing.

I agree that an elaborate protocol would pose roadblocks in the future.
It's better not to add that burden right now, esp. when the usage is
not clear.

Here's the behaviour and extension matrix as I understand it, as of the
last set of patches:

Publisher PG 17, Subscriber PG 17 - changes to sequences are replicated,
downstream is capable of applying them

Publisher PG 16-, Subscriber PG 17 - changes to sequences are never
replicated

Publisher PG 18+, Subscriber PG 17 - same as the 17/17 case. Any changes
in PG 18+ need to make sure that a PG 17 subscriber receives sequence
changes irrespective of changes in the protocol. That may pose some
maintenance burden, but doesn't seem to be any harder than the usual
backward compatibility burden.

Moreover, users can control whether changes to sequences get replicated
or not by controlling the objects contained in the publication.

I don't see any downside to this. Looks all good. Please correct me if
wrong.

>
> > One thing I am worried about is that the subscriber will get an error
> > only when a sequence change is decoded. All the prior changes will be
> > replicated and applied on the subscriber. Thus by the time the user
> > realises this mistake, they may have replicated data. At this point if
> > they want to subscribe to a publication without sequences they will
> > need to clean the already replicated data. But they may not be in a
> > position to know which is which esp when the subscriber has its own
> > data in those tables. Example,
> >
> > publisher: create publication pub with sequences and tables
> > subscriber: subscribe to pub
> > publisher: modify data in tables and sequences
> > subscriber: replicates some data and errors out
> > publisher: delete some data from tables
> > publisher: create a publication pub_tab without sequences
> > subscriber: subscribe to pub_tab
> > subscriber: replicates the data but rows which were deleted on
> > publisher remain on the subscriber
> >
>
> Sure, but I'd argue that's correct. If the replication stream has
> something the subscriber can't apply, what else would you do? We had
> exactly the same thing with TRUNCATE, for example (except that it failed
> with "unknown message" on the subscriber).

When the replication starts, the publisher knows what publication is
being used, and it also knows what protocol is being used. From the
publication it knows what objects will be replicated.
So we should fail the START_REPLICATION command before sending any change rather than when a change is being replicated. That's more deterministic and easy to handle. Of course any changes that were sent before ALTER PUBLICATION can not be reverted, but that's expected. Coming back to TRUNCATE, I don't think it's possible to know whether a publication will send a truncate downstream or not. So we can't throw an error before TRUNCATE change is decoded. Anyway, I think this behaviour should be documented. I didn't see this mentioned in PUBLICATION or SUBSCRIPTION documentation. [1] https://www.postgresql.org/docs/current/sql-alterpublication.html -- Best Wishes, Ashutosh Bapat
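For illustration, a failure at START_REPLICATION time could look roughly
like this sketch in pgoutput's startup path
(publication_publishes_sequences() is the hypothetical helper sketched
earlier in the thread; the list handling assumes publication names
parsed from the subscription options):

    /*
     * Sketch only: error out at startup if a requested publication
     * publishes sequences but the subscriber's protocol version cannot
     * handle them.  publication_publishes_sequences() is a hypothetical
     * helper; data->publication_names is assumed to be a list of
     * publication name strings.
     */
    ListCell   *lc;

    foreach(lc, data->publication_names)
    {
        char       *pubname = (char *) lfirst(lc);
        Publication *pub = GetPublicationByName(pubname, false);

        if (publication_publishes_sequences(pub) &&
            data->protocol_version < LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("publication \"%s\" publishes sequences, which the subscription does not support",
                            pub->name)));
    }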
On 7/19/23 07:42, Ashutosh Bapat wrote: > On Wed, Jul 19, 2023 at 1:20 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >>>> >>> >>> This behaviour doesn't need any on-disk changes or has nothing in it >>> which prohibits us from changing it in future. So I think it's good as >>> a v0. If required we can add the protocol option to provide more >>> flexible behaviour. >>> >> >> True, although "no on-disk changes" does not exactly mean we can just >> change it at will. Essentially, once it gets released, the behavior is >> somewhat fixed for the next ~5 years, until that release gets EOL. And >> likely longer, because more features are likely to do the same thing. >> >> That's essentially why the patch was reverted from PG16 - I was worried >> the elaborate protocol versioning/negotiation was not the right thing. > > I agree that elaborate protocol would pose roadblocks in future. It's > better not to add that burden right now, esp. when usage is not clear. > > Here's behavriour and extension matrix as I understand it and as of > the last set of patches. > > Publisher PG 17, Subscriber PG 17 - changes to sequences are > replicated, downstream is capable of applying them > > Publisher PG 16-, Subscriber PG 17 changes to sequences are never replicated > > Publisher PG 18+, Subscriber PG 17 - same as 17, 17 case. Any changes > in PG 18+ need to make sure that PG 17 subscriber receives sequence > changes irrespective of changes in protocol. That may pose some > maintenance burden but doesn't seem to be any harder than usual > backward compatibility burden. > > Moreover users can control whether changes to sequences get replicated > or not by controlling the objects contained in publication. > > I don't see any downside to this. Looks all good. Please correct me if wrong. > I think this is an accurate description of what the current patch does. And I think it's a reasonable behavior. My point is that if this gets released in PG17, it'll be difficult to change, even if it does not change on-disk format. >> >>> One thing I am worried about is that the subscriber will get an error >>> only when a sequence change is decoded. All the prior changes will be >>> replicated and applied on the subscriber. Thus by the time the user >>> realises this mistake, they may have replicated data. At this point if >>> they want to subscribe to a publication without sequences they will >>> need to clean the already replicated data. But they may not be in a >>> position to know which is which esp when the subscriber has its own >>> data in those tables. Example, >>> >>> publisher: create publication pub with sequences and tables >>> subscriber: subscribe to pub >>> publisher: modify data in tables and sequences >>> subscriber: replicates some data and errors out >>> publisher: delete some data from tables >>> publisher: create a publication pub_tab without sequences >>> subscriber: subscribe to pub_tab >>> subscriber: replicates the data but rows which were deleted on >>> publisher remain on the subscriber >>> >> >> Sure, but I'd argue that's correct. If the replication stream has >> something the subscriber can't apply, what else would you do? We had >> exactly the same thing with TRUNCATE, for example (except that it failed >> with "unknown message" on the subscriber). > > When the replication starts, the publisher knows what publication is > being used, it also knows what protocol is being used. From > publication it knows what objects will be replicated. 
> So we could fail
> before any changes are replicated when executing START_REPLICATION
> command. According to [1], if an object is added or removed from
> publication the subscriber is required to REFRESH SUBSCRIPTION in
> which case there will be fresh START_REPLICATION command sent. So we
> should fail the START_REPLICATION command before sending any change
> rather than when a change is being replicated. That's more
> deterministic and easy to handle. Of course any changes that were sent
> before ALTER PUBLICATION can not be reverted, but that's expected.
>
> Coming back to TRUNCATE, I don't think it's possible to know whether a
> publication will send a truncate downstream or not. So we can't throw
> an error before TRUNCATE change is decoded.
>
> Anyway, I think this behaviour should be documented. I didn't see this
> mentioned in PUBLICATION or SUBSCRIPTION documentation.
>

I need to think about this behavior a bit more, and maybe check how
difficult it would be to implement.

I did however look at the proposed alternative to the "created" flag.
The attached 0006 part ditches the flag in favor of XLOG_SMGR_CREATE
decoding. The smgr_decode code needs a review (I'm not sure the
skipping/fast-forwarding part is correct), but it seems to be working
fine overall, although we need to ensure the WAL record has the correct
XID. The gist of the approach is sketched below.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
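For context, the gist of the 0006 approach is roughly the following
(simplified sketch; the real smgr_decode has the snapshot and
fast-forward checks mentioned above, and ReorderBufferAddRelFileLocator
is the new function the patch adds):

    #include "catalog/storage_xlog.h"   /* xl_smgr_create */
    #include "replication/decode.h"

    /*
     * Simplified sketch: on XLOG_SMGR_CREATE, remember the relfilenode
     * in the transaction, so that later sequence changes touching it are
     * treated as transactional.
     */
    static void
    smgr_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
    {
        XLogReaderState *r = buf->record;
        uint8       info = XLogRecGetInfo(r) & ~XLR_INFO_MASK;
        xl_smgr_create *xlrec;

        if (info != XLOG_SMGR_CREATE)
            return;

        xlrec = (xl_smgr_create *) XLogRecGetData(r);

        /* only the main fork is interesting for sequences */
        if (xlrec->forkNum != MAIN_FORKNUM)
            return;

        ReorderBufferAddRelFileLocator(ctx->reorder, XLogRecGetXid(r),
                                       xlrec->rlocator);
    }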
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230719.patch
- 0002-Logical-decoding-of-sequences-20230719.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230719.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230719.patch
- 0005-Simplify-protocol-versioning-20230719.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230719.patch
On 7/19/23 12:53, Tomas Vondra wrote:
> ...
>
> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag in favor of XLOG_SMGR_CREATE
> decoding. The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.
>

cfbot reported two issues in the patch - a compilation warning, due to
an unused variable in sequence_decode, and a failing test in
test_decoding.

The second thing happens because the relfilenode may be created before
we know the XID. The patch already ensures the WAL with the sequence
data has an XID, but that happens later. And when the CREATE record did
not have the correct XID, that broke the logic deciding which increments
should be "transactional".

This forces us to assign the XID a bit earlier (it'd happen anyway, when
logging the increment). There's a bit of a drawback, because we don't
have the relation yet, so we can't do RelationNeedsWAL ... (see the
sketch below).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
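For reference, the earlier XID assignment amounts to a hunk like this in
DefineSequence(), before the relation (and its relfilenode) is created -
a sketch of the idea:

    /*
     * Make sure the transaction has an XID, so that the XLOG_SMGR_CREATE
     * record for the new relfilenode is associated with it.  We cannot
     * consult RelationNeedsWAL() here - the relation does not exist yet -
     * so the check is only on wal_level being "logical".
     */
    if (XLogLogicalInfoActive())
        (void) GetCurrentTransactionId();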
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230719b.patch
- 0002-Logical-decoding-of-sequences-20230719b.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230719b.patch
- 0004-Add-decoding-of-sequences-to-built-in-repl-20230719b.patch
- 0005-Simplify-protocol-versioning-20230719b.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230719b.patch
Thanks Tomas for the updated patches.

Here are my comments on the 0006 patch as well as the 0002 patch.

On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I think this is an accurate description of what the current patch does.
> And I think it's a reasonable behavior.
>
> My point is that if this gets released in PG17, it'll be difficult to
> change, even if it does not change on-disk format.
>

Yes. I agree. And I don't see any problem even if we are not able to
change it.

>
> I need to think about this behavior a bit more, and maybe check how
> difficult it would be to implement.

Ok.

In most of the comments and in the documentation, there are some phrases
which do not look accurate.

A change to a sequence is being referred to as a "sequence increment".
While ascending sequences are common, PostgreSQL supports descending
sequences as well. The changes there will be decrements. But that's not
the only case. A sequence may be restarted with an older value, in which
case the change could be an increment or a decrement. I think the
correct usage is "changes to a sequence" or "sequence changes".

A sequence being assigned a new relfilenode is referred to as a sequence
being created. This is confusing. When an existing sequence is ALTERed,
we will not "create" a new sequence, but we will "create" a new
relfilenode and "assign" it to that sequence.

PFA such edits in the 0002 and 0006 patches. Let me know if those look
correct. I think we need similar changes to the documentation and
comments in other places.

>
> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag in favor of XLOG_SMGR_CREATE
> decoding. The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.
>

Briefly describing the patch: when decoding an XLOG_SMGR_CREATE WAL
record, it adds the relfilenode mentioned in it to the sequences hash.
When decoding a sequence change record, it checks whether the
relfilenode in the WAL record is in the hash table. If it is, the
sequence change is deemed transactional, otherwise non-transactional.
The change looks good to me. It simplifies the logic to decide whether a
sequence change is transactional or not.

In sequence_decode() we skip sequence changes when fast-forwarding.
Given that smgr_decode() is only there to supplement sequence_decode(),
I think it's correct to do the same in smgr_decode() as well. Similarly
for skipping when we don't have a full snapshot.

Some minor comments on the 0006 patch:

+    /* make sure the relfilenode creation is associated with the XID */
+    if (XLogLogicalInfoActive())
+        GetCurrentTransactionId();

I think this change is correct and is in line with similar changes in
0002. But I looked at other places from where DefineRelation() is
called. For regular tables it is called from ProcessUtilitySlow(), which
in turn does not call GetCurrentTransactionId(). I am wondering whether
we are just discovering a class of bugs caused by not associating an xid
with a newly created relfilenode.

+    /*
+     * If we don't have snapshot or we are just fast-forwarding, there is no
+     * point in decoding changes.
+     */
+    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+        ctx->fast_forward)
+        return;

This code block is repeated.

+void
+ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
+                               RelFileLocator rlocator)
+{

... snip ...
+
+   /* sequence changes require a transaction */
+   if (xid == InvalidTransactionId)
+       return;

IIUC, with your changes in DefineSequence() in this patch, this should not happen, so this condition will never be true. But in case it happens, this code will not add the relfilelocator to the hash table and we will deem the sequence change as non-transactional. Isn't it better to just throw an error and stop replication if that (ever) happens?

Also some comments on the 0002 patch:

@@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)

    /* check the comment above nextval_internal()'s equivalent call. */
    if (RelationNeedsWAL(rel))
+   {
        GetTopTransactionId();

+       /*
+        * Make sure the subtransaction has a XID assigned, so that the
+        * sequence increment WAL record is properly associated with it.
+        * This matters for increments of sequences created/altered in the
+        * transaction, which are handled as transactional.
+        */
+       if (XLogLogicalInfoActive())
+           GetCurrentTransactionId();
+   }
+

I think we should separately commit the changes which add a call to GetCurrentTransactionId(). That looks like an existing bug/anomaly which can stay irrespective of this patch.

+   /*
+    * To support logical decoding of sequences, we require the sequence
+    * callback. We decide it here, but only check it later in the wrappers.
+    *
+    * XXX Isn't it wrong to define only one of those callbacks? Say we
+    * only define the stream_sequence_cb() - that may get strange results
+    * depending on what gets streamed. Either none or both?
+    *
+    * XXX Shouldn't sequence be defined at slot creation time, similar
+    * to two_phase? Probably not.
+    */

Do you intend to keep these XXX's as is? My previous comments on this comment block are in [1].

In fact, given that whether or not sequences are replicated is decided by the protocol version, do we really need LogicalDecodingContext::sequences? Drawing a parallel with WAL messages, I don't think it's needed.

[1] https://www.postgresql.org/message-id/CAExHW5vScYKKb0RZoiNEPfbaQ60hihfuWeLuZF4JKrwPJXPcUw%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
Attachment
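[Editorial note: to make the hash-based check reviewed above easier to follow, here is a minimal sketch of the decode-time decision, assuming a plain dynahash keyed by relfilenode. The entry layout and function names are hypothetical simplifications for illustration, not the actual patch code:]

#include "postgres.h"
#include "storage/relfilelocator.h"
#include "utils/hsearch.h"

/* hypothetical entry: one per relfilenode created by an in-progress xact */
typedef struct SeqCreateEntry
{
    RelFileLocator rlocator;    /* hash key */
    TransactionId xid;          /* transaction that created the relfilenode */
} SeqCreateEntry;

/* called while decoding XLOG_SMGR_CREATE */
static void
remember_created_relfilenode(HTAB *seqhash, RelFileLocator rlocator,
                             TransactionId xid)
{
    bool        found;
    SeqCreateEntry *entry;

    entry = (SeqCreateEntry *) hash_search(seqhash, &rlocator,
                                           HASH_ENTER, &found);
    if (!found)
        entry->xid = xid;
}

/* called while decoding a sequence change record */
static bool
sequence_change_is_transactional(HTAB *seqhash, RelFileLocator rlocator)
{
    /* created by an in-progress transaction => decode it transactionally */
    return hash_search(seqhash, &rlocator, HASH_FIND, NULL) != NULL;
}

The point of the scheme is that this lookup replaces the WAL-logged "created" flag: the same information is reconstructed on the decoding side from XLOG_SMGR_CREATE records.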
On 7/20/23 09:24, Ashutosh Bapat wrote: > Thanks Tomas for the updated patches. > > Here are my comments on 0006 patch as well as 0002 patch. > > On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I think this is an accurate description of what the current patch does. >> And I think it's a reasonable behavior. >> >> My point is that if this gets released in PG17, it'll be difficult to >> change, even if it does not change on-disk format. >> > > Yes. I agree. And I don't see any problem even if we are not able to change it. > >> >> I need to think behavior about this a bit more, and maybe check how >> difficult would be implementing it. > > Ok. > > In most of the comments and in documentation, there are some phrases > which do not look accurate. > > Change to a sequence is being refered to as "sequence increment". While > ascending sequences are common, PostgreSQL supports descending sequences as > well. The changes there will be decrements. But that's not the only case. A > sequence may be restarted with an older value, in which case the change could > increment or a decrement. I think correct usage is 'changes to sequence' or > 'sequence changes'. > > Sequence being assigned a new relfilenode is referred to as sequence > being created. This is confusing. When an existing sequence is ALTERed, we > will not "create" a new sequence but we will "create" a new relfilenode and > "assign" it to that sequence. > > PFA such edits in 0002 and 0006 patches. Let me know if those look > correct. I think we > need similar changes to the documentation and comments in other places. > OK, I merged the changes into the patches, with some minor changes to the wording etc. >> >> I did however look at the proposed alternative to the "created" flag. >> The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding. >> The smgr_decode code needs a review (I'm not sure the >> skipping/fast-forwarding part is correct), but it seems to be working >> fine overall, although we need to ensure the WAL record has the correct XID. >> > > Briefly describing the patch. When decoding a XLOG_SMGR_CREATE WAL > record, it adds the relfilenode mentioned in it to the sequences hash. > When decoding a sequence change record, it checks whether the > relfilenode in the WAL record is in hash table. If it is the sequence > changes is deemed transactional otherwise non-transactional. The > change looks good to me. It simplifies the logic to decide whether a > sequence change is transactional or not. > Right. > In sequence_decode() we skip sequence changes when fast forwarding. > Given that smgr_decode() is only to supplement sequence_decode(), I > think it's correct to do the same in smgr_decode() as well. Simillarly > skipping when we don't have full snapshot. > I don't follow, smgr_decode already checks ctx->fast_forward. > Some minor comments on 0006 patch > > + /* make sure the relfilenode creation is associated with the XID */ > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > > I think this change is correct and is inline with similar changes in 0002. But > I looked at other places from where DefineRelation() is called. For regular > tables it is called from ProcessUtilitySlow() which in turn does not call > GetCurrentTransactionId(). I am wondering whether we are just discovering a > class of bugs caused by not associating an xid with a newly created > relfilenode. > Not sure. Why would it be a bug? 
> + /* > + * If we don't have snapshot or we are just fast-forwarding, there is no > + * point in decoding changes. > + */ > + if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT || > + ctx->fast_forward) > + return; > > This code block is repeated. > Fixed. > +void > +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid, > + RelFileLocator rlocator) > +{ > ... snip ... > + > + /* sequence changes require a transaction */ > + if (xid == InvalidTransactionId) > + return; > > IIUC, with your changes in DefineSequence() in this patch, this should not > happen. So this condition will never be true. But in case it happens, this code > will not add the relfilelocation to the hash table and we will deem the > sequence change as non-transactional. Isn't it better to just throw an error > and stop replication if that (ever) happens? > It can't happen for sequence, but it may happen when creating a non-sequence relfilenode. In a way, it's a way to skip (some) unnecessary relfilenodes. > Also some comments on 0002 patch > > @@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple > tuple, ForkNumber forkNum) > > /* check the comment above nextval_internal()'s equivalent call. */ > if (RelationNeedsWAL(rel)) > + { > GetTopTransactionId(); > > + /* > + * Make sure the subtransaction has a XID assigned, so that > the sequence > + * increment WAL record is properly associated with it. This > matters for > + * increments of sequences created/altered in the > transaction, which are > + * handled as transactional. > + */ > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > + } > + > > I think we should separately commit the changes which add a call to > GetCurrentTransactionId(). That looks like an existing bug/anomaly > which can stay irrespective of this patch. > Not sure, but I don't see this as a bug. > + /* > + * To support logical decoding of sequences, we require the sequence > + * callback. We decide it here, but only check it later in the wrappers. > + * > + * XXX Isn't it wrong to define only one of those callbacks? Say we > + * only define the stream_sequence_cb() - that may get strange results > + * depending on what gets streamed. Either none or both? > + * > + * XXX Shouldn't sequence be defined at slot creation time, similar > + * to two_phase? Probably not. > + */ > > Do you intend to keep these XXX's as is? My previous comments on this comment > block are in [1]. > > In fact, given that whether or not sequences are replicated is decided by the > protocol version, do we really need LogicalDecodingContext::sequences? Drawing > parallel with WAL messages, I don't think it's needed. > Right. We do that for two_phase because you can override that when creating the subscription - sequences allowed that too initially, but then we ditched that. So I don't think we need this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230720.patch
- 0002-Logical-decoding-of-sequences-20230720.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230720.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230720.patch
- 0005-Simplify-protocol-versioning-20230720.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230720.patch
FWIW there are two questions related to the switch to XLOG_SMGR_CREATE.

1) Does smgr_decode() need to do the same block as sequence_decode()?

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

I don't think it does. Also, we don't have any transactional flag here. Or rather, everything is transactional ...

2) Currently, the sequences hash table is in reorderbuffer, i.e. global. I was thinking maybe we should have it in the transaction (because we need to do cleanup at the end). It seems a bit inconvenient, because then we'd need to either search htabs in all subxacts, or transfer the entries to the top-level xact (otoh, we already do that with snapshots), and clean up on abort.

What do you think?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> OK, I merged the changes into the patches, with some minor changes to
> the wording etc.
>

I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720 even without the rest of the patches. Isn't it a separate improvement?

I see that origin filtering (origin=none) doesn't work with this patch. You can see this by using the following statements:

Node-1:
postgres=# create sequence s;
CREATE SEQUENCE
postgres=# create publication mypub for all sequences;
CREATE PUBLICATION

Node-2:
postgres=# create sequence s;
CREATE SEQUENCE
postgres=# create subscription mysub_sub connection '....' publication mypub with (origin=none);
NOTICE: created replication slot "mysub_sub" on publisher
CREATE SUBSCRIPTION
postgres=# create publication mypub_sub for all sequences;
CREATE PUBLICATION

Node-1:
create subscription mysub_pub connection '...' publication mypub_sub with (origin=none);
NOTICE: created replication slot "mysub_pub" on publisher
CREATE SUBSCRIPTION

SELECT nextval('s') FROM generate_series(1,100);

After that, you can check on the subscriber that the sequence values are overridden with older values:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         67 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        100 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        133 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         67 |       0 | t
(1 row)

I haven't verified all the details but I think that is because we don't set XLOG_INCLUDE_ORIGIN while logging sequence values.

--
With Regards,
Amit Kapila.
On Mon, Jul 24, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > OK, I merged the changes into the patches, with some minor changes to
> > the wording etc.
> >
>
> I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720
> even without the rest of the patches. Isn't it a separate improvement?

+1. Yes, it can go separately. It would be even better if the test could be modified to capture the toasted data into a psql variable before inserting it into the table, and then compare it with the output of pg_logical_slot_get_changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
> be reviewed in this response.
>
> I followed the discussion starting [1] till [2]. The second one
> mentions the interlock mechanism which has been implemented in 0005
> and 0006. While I don't have an objection to allowing LOCKing a
> sequence using the LOCK command, I am not sure whether it will
> actually work or is even needed.
>
> The problem described in [1] seems to be the same as the problem
> described in [2]. In both cases we see the sequence moving backwards
> during CATCHUP. At the end of catchup the sequence is in the right
> state in both the cases.
>

I think we could see a backward sequence value even after the catchup phase (after the sync worker has exited and/or the state of the rel is marked as 'ready' in pg_subscription_rel). The point is that there is no guarantee that we will process all the pending WAL before considering the sequence state 'SYNCDONE' and/or 'READY'. For example, after copy_sequence, I see values like:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        165 |       0 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
     166
(1 row)
postgres=# select nextval('s');
 nextval
---------
     167
(1 row)
postgres=# select currval('s');
 currval
---------
     167
(1 row)

Then during the catchup phase:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         33 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16394 |   16390 | r          | 0/16374E8
   16394 |   16393 | s          | 0/1637700
(2 rows)

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16394 |   16390 | r          | 0/16374E8
   16394 |   16393 | r          | 0/1637700
(2 rows)

Here the sequence relid is 16393. You can see the sequence state is marked as ready.

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)

Even after that, as seen above, the value of the sequence is still not caught up. Later, when the apply worker processes all the WAL, the sequence state will be caught up.

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        165 |       0 | t
(1 row)

So, there will be a window where the sequence won't be caught up for a certain period of time, and any usage of it (even after the sync is finished) during that time could result in inconsistent behaviour.

The other question is whether it is okay to allow the sequence to go backwards even during the initial sync phase. The reason I am asking this question is that for the time the sequence value moves backwards, one is allowed to use it on the subscriber, which will result in using out-of-sequence values. For example, immediately after copy_sequence, the values look like this:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        133 |      32 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
     134
(1 row)
postgres=# select currval('s');
 currval
---------
     134
(1 row)

But then during the sync phase, it can go backwards and one is allowed to use it on the subscriber:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
      67
(1 row)

--
With Regards,
Amit Kapila.
On 7/24/23 12:40, Amit Kapila wrote: > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: >> >> 0005, 0006 and 0007 are all related to the initial sequence sync. [3] >> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to >> be reviewed in this response. >> >> I followed the discussion starting [1] till [2]. The second one >> mentions the interlock mechanism which has been implemented in 0005 >> and 0006. While I don't have an objection to allowing LOCKing a >> sequence using the LOCK command, I am not sure whether it will >> actually work or is even needed. >> >> The problem described in [1] seems to be the same as the problem >> described in [2]. In both cases we see the sequence moving backwards >> during CATCHUP. At the end of catchup the sequence is in the right >> state in both the cases. >> > > I think we could see backward sequence value even after the catchup > phase (after the sync worker is exited and or the state of rel is > marked as 'ready' in pg_subscription_rel). The point is that there is > no guarantee that we will process all the pending WAL before > considering the sequence state is 'SYNCDONE' and or 'READY'. For > example, after copy_sequence, I see values like: > > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 165 | 0 | t > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 166 > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 167 > (1 row) > postgres=# select currval('s'); > currval > --------- > 167 > (1 row) > > Then during the catchup phase: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 33 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 66 | 0 | t > (1 row) > > postgres=# select * from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn > ---------+---------+------------+----------- > 16394 | 16390 | r | 0/16374E8 > 16394 | 16393 | s | 0/1637700 > (2 rows) > > postgres=# select * from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn > ---------+---------+------------+----------- > 16394 | 16390 | r | 0/16374E8 > 16394 | 16393 | r | 0/1637700 > (2 rows) > > Here Sequence relid id 16393. You can see sequence state is marked as ready. > > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 66 | 0 | t > (1 row) > > Even after that, see below the value of the sequence is still not > caught up. Later, when the apply worker processes all the WAL, the > sequence state will be caught up. > > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 165 | 0 | t > (1 row) > > So, there will be a window where the sequence won't be caught up for a > certain period of time and any usage of it (even after the sync is > finished) during that time could result in inconsistent behaviour. > I'm rather confused about which node these queries are executed on. Presumably some of it is on publisher, some on subscriber? Can you create a reproducer (TAP test demonstrating this?) I guess it might require adding some sleeps to hit the right timing ... > The other question is whether it is okay to allow the sequence to go > backwards even during the initial sync phase? 
The reason I am asking > this question is that for the time sequence value moves backwards, one > is allowed to use it on the subscriber which will result in using > out-of-sequence values. For example, immediately, after copy_sequence > the values look like this: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 133 | 32 | t > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 134 > (1 row) > postgres=# select currval('s'); > currval > --------- > 134 > (1 row) > > But then during the sync phase, it can go backwards and one is allowed > to use it on the subscriber: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 66 | 0 | t > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 67 > (1 row) > Well, as for going back during the sync phase, I think the agreement was that's acceptable, as we don't make guarantees about that. The question is what's the state at the end of the sync (which I think leads to the first part of your message). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/24/23 08:31, Amit Kapila wrote: > On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> OK, I merged the changes into the patches, with some minor changes to >> the wording etc. >> > > I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720 > even without the rest of the patches. Isn't it a separate improvement? > True. > I see that origin filtering (origin=none) doesn't work with this > patch. You can see this by using the following statements: > Node-1: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create publication mypub for all sequences; > CREATE PUBLICATION > > Node-2: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create subscription mysub_sub connection '....' publication > mypub with (origin=none); > NOTICE: created replication slot "mysub_sub" on publisher > CREATE SUBSCRIPTION > postgres=# create publication mypub_sub for all sequences; > CREATE PUBLICATION > > Node-1: > create subscription mysub_pub connection '...' publication mypub_sub > with (origin=none); > NOTICE: created replication slot "mysub_pub" on publisher > CREATE SUBSCRIPTION > > SELECT nextval('s') FROM generate_series(1,100); > > After that, you can check on the subscriber that sequences values are > overridden with older values: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 100 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 133 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > > I haven't verified all the details but I think that is because we > don't set XLOG_INCLUDE_ORIGIN while logging sequence values. > Hmmm, yeah. I guess we'll need to set XLOG_INCLUDE_ORIGIN with wal_level=logical. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
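[Editorial note: for context on what such a fix involves, making origin filtering work is essentially a matter of stamping the sequence WAL records with the session's replication origin. Below is a rough sketch modeled on the existing WAL-logging code in commands/sequence.c; the wrapper function is hypothetical and error handling is elided, but XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN) is the same mechanism other record types already use:]

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xloginsert.h"
#include "commands/sequence.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch: WAL-log a sequence page the way nextval_internal() does,
 * but include the replication origin in the record, so a subscriber
 * created with origin=none can filter out already-replicated changes.
 */
static XLogRecPtr
log_sequence_change(Relation rel, Buffer buf, HeapTuple seqdatatuple)
{
    xl_seq_rec  xlrec;

    XLogBeginInsert();
    XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);

    xlrec.locator = rel->rd_locator;
    XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
    XLogRegisterData((char *) seqdatatuple->t_data, seqdatatuple->t_len);

    /* the actual fix: stamp the record with the replication origin */
    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);

    return XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
}

Without the flag the origin is not recorded at all, so decoding reports InvalidRepOriginId and origin filtering has nothing to match against, which is consistent with the behavior Amit observed.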
On 2023-Jul-20, Tomas Vondra wrote: > From 809d60be7e636b8505027ad87bcb9fc65224c47b Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <tomas.vondra@postgresql.org> > Date: Wed, 5 Apr 2023 22:49:41 +0200 > Subject: [PATCH 1/6] Make test_decoding ddl.out shorter > > Some of the test_decoding test output was extremely wide, because it > deals with toasted values, and the aligned mode causes psql to produce > 200kB of dashes. Turn that off temporarily using \pset to avoid it. Do you mind if I get this one pushed later today? Or feel free to push it yourself, if you want. It's an annoying patch to keep seeing posted over and over, with no further value. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> >
> > PFA such edits in 0002 and 0006 patches. Let me know if those look
> > correct. I think we
> > need similar changes to the documentation and comments in other places.
> >
>
> OK, I merged the changes into the patches, with some minor changes to
> the wording etc.

Thanks.

>
> > In sequence_decode() we skip sequence changes when fast forwarding.
> > Given that smgr_decode() is only to supplement sequence_decode(), I
> > think it's correct to do the same in smgr_decode() as well. Similarly
> > skipping when we don't have full snapshot.
> >
>
> I don't follow, smgr_decode already checks ctx->fast_forward.

In your earlier email you seemed to express some doubts about the change skipping code in smgr_decode(). To that, I gave my own perspective of why the change skipping code in smgr_decode() is correct. smgr_decode() is doing the right thing, IMO. No change required there.

>
> > Some minor comments on 0006 patch
> >
> > + /* make sure the relfilenode creation is associated with the XID */
> > + if (XLogLogicalInfoActive())
> > + GetCurrentTransactionId();
> >
> > I think this change is correct and is in line with similar changes in 0002. But
> > I looked at other places from where DefineRelation() is called. For regular
> > tables it is called from ProcessUtilitySlow() which in turn does not call
> > GetCurrentTransactionId(). I am wondering whether we are just discovering a
> > class of bugs caused by not associating an xid with a newly created
> > relfilenode.
> >
>
> Not sure. Why would it be a bug?

This discussion is unrelated to sequence decoding but let me add it here. If we don't know the transaction ID that created a relfilenode, we wouldn't know whether to roll back that creation if the transaction gets rolled back during recovery. But maybe that doesn't matter since the relfilenode is not visible in any of the catalogs, so it just lies there unused.

>
> > +void
> > +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
> > + RelFileLocator rlocator)
> > +{
> > ... snip ...
> > +
> > + /* sequence changes require a transaction */
> > + if (xid == InvalidTransactionId)
> > + return;
> >
> > IIUC, with your changes in DefineSequence() in this patch, this should not
> > happen. So this condition will never be true. But in case it happens, this code
> > will not add the relfilelocator to the hash table and we will deem the
> > sequence change as non-transactional. Isn't it better to just throw an error
> > and stop replication if that (ever) happens?
> >
>
> It can't happen for sequence, but it may happen when creating a
> non-sequence relfilenode. In a way, it's a way to skip (some)
> unnecessary relfilenodes.

Ah! The comment is correct but cryptic. I didn't read it to mean this.

> > + /*
> > + * To support logical decoding of sequences, we require the sequence
> > + * callback. We decide it here, but only check it later in the wrappers.
> > + *
> > + * XXX Isn't it wrong to define only one of those callbacks? Say we
> > + * only define the stream_sequence_cb() - that may get strange results
> > + * depending on what gets streamed. Either none or both?
> > + *
> > + * XXX Shouldn't sequence be defined at slot creation time, similar
> > + * to two_phase? Probably not.
> > + */
> >
> > Do you intend to keep these XXX's as is? My previous comments on this comment
> > block are in [1].

This comment remains unanswered.
>
> > In fact, given that whether or not sequences are replicated is decided by the
> > protocol version, do we really need LogicalDecodingContext::sequences? Drawing
> > a parallel with WAL messages, I don't think it's needed.
> >
>
> Right. We do that for two_phase because you can override that when
> creating the subscription - sequences allowed that too initially, but
> then we ditched that. So I don't think we need this.

Then we should just remove that member and its references.

--
Best Wishes,
Ashutosh Bapat
On 7/24/23 13:14, Alvaro Herrera wrote: > On 2023-Jul-20, Tomas Vondra wrote: > >> From 809d60be7e636b8505027ad87bcb9fc65224c47b Mon Sep 17 00:00:00 2001 >> From: Tomas Vondra <tomas.vondra@postgresql.org> >> Date: Wed, 5 Apr 2023 22:49:41 +0200 >> Subject: [PATCH 1/6] Make test_decoding ddl.out shorter >> >> Some of the test_decoding test output was extremely wide, because it >> deals with toasted values, and the aligned mode causes psql to produce >> 200kB of dashes. Turn that off temporarily using \pset to avoid it. > > Do you mind if I get this one pushed later today? Or feel free to push > it yourself, if you want. It's an annoying patch to keep seeing posted > over and over, with no further value. > Feel free to push. It's your patch, after all. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 20, 2023 at 10:19 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> FWIW there are two questions related to the switch to XLOG_SMGR_CREATE.
>
> 1) Does smgr_decode() need to do the same block as sequence_decode()?
>
>     /* Skip the change if already processed (per the snapshot). */
>     if (transactional &&
>         !SnapBuildProcessChange(builder, xid, buf->origptr))
>         return;
>     else if (!transactional &&
>              (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
>               SnapBuildXactNeedsSkip(builder, buf->origptr)))
>         return;
>
> I don't think it does. Also, we don't have any transactional flag here.
> Or rather, everything is transactional ...

Right.

>
> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> I was thinking maybe we should have it in the transaction (because we
> need to do cleanup at the end). It seems a bit inconvenient, because then
> we'd need to either search htabs in all subxacts, or transfer the
> entries to the top-level xact (otoh, we already do that with snapshots),
> and clean up on abort.
>
> What do you think?

A hash table per transaction seems a saner design. Adding it to the top-level transaction should be fine. The entry will contain an XID anyway. If we add it to every subtransaction, we will need to search the hash table in each of the subtransactions when deciding whether a sequence change is transactional or not. The top transaction is a reasonable trade-off.

--
Best Wishes,
Ashutosh Bapat
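[Editorial note: a sketch of what the per-transaction variant could look like, for concreteness. The sequences_hash field and the helper are hypothetical (ReorderBufferTXN has no such member today); rbtxn_get_toptxn() is the existing macro in reorderbuffer.h for reaching the top-level transaction:]

#include "postgres.h"
#include "replication/reorderbuffer.h"
#include "utils/hsearch.h"

/*
 * Sketch: decide whether a sequence change is transactional by looking
 * up the relfilenode in a hash kept on the *top-level* transaction, so
 * the hash is freed together with the transaction (including on abort)
 * and subtransactions don't each need their own copy.
 */
static bool
sequence_created_in_xact(ReorderBufferTXN *txn, RelFileLocator rlocator)
{
    ReorderBufferTXN *toptxn = rbtxn_get_toptxn(txn);

    /* hypothetical field: hash of relfilenodes created by this xact */
    if (toptxn->sequences_hash == NULL)
        return false;

    return hash_search(toptxn->sequences_hash, &rlocator,
                       HASH_FIND, NULL) != NULL;
}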
On 7/24/23 08:31, Amit Kapila wrote: > On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> OK, I merged the changes into the patches, with some minor changes to >> the wording etc. >> > > I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720 > even without the rest of the patches. Isn't it a separate improvement? > > I see that origin filtering (origin=none) doesn't work with this > patch. You can see this by using the following statements: > Node-1: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create publication mypub for all sequences; > CREATE PUBLICATION > > Node-2: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create subscription mysub_sub connection '....' publication > mypub with (origin=none); > NOTICE: created replication slot "mysub_sub" on publisher > CREATE SUBSCRIPTION > postgres=# create publication mypub_sub for all sequences; > CREATE PUBLICATION > > Node-1: > create subscription mysub_pub connection '...' publication mypub_sub > with (origin=none); > NOTICE: created replication slot "mysub_pub" on publisher > CREATE SUBSCRIPTION > > SELECT nextval('s') FROM generate_series(1,100); > > After that, you can check on the subscriber that sequences values are > overridden with older values: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 100 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 133 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > > I haven't verified all the details but I think that is because we > don't set XLOG_INCLUDE_ORIGIN while logging sequence values. > Good point. Attached is a patch that adds XLOG_INCLUDE_ORIGIN to sequence changes. I considered doing that only for wal_level=logical, but we don't do that elsewhere. Also, I didn't do that for smgr_create, because we don't actually replicate that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230724.patch
- 0002-Logical-decoding-of-sequences-20230724.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230724.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230724.patch
- 0005-Simplify-protocol-versioning-20230724.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230724.patch
- 0007-add-XLOG_INCLUDE_ORIGIN-for-sequences-20230724.patch
On 7/24/23 14:53, Ashutosh Bapat wrote:
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>>>
>>> PFA such edits in 0002 and 0006 patches. Let me know if those look
>>> correct. I think we
>>> need similar changes to the documentation and comments in other places.
>>>
>>
>> OK, I merged the changes into the patches, with some minor changes to
>> the wording etc.
>
> Thanks.
>
>>
>>> In sequence_decode() we skip sequence changes when fast forwarding.
>>> Given that smgr_decode() is only to supplement sequence_decode(), I
>>> think it's correct to do the same in smgr_decode() as well. Similarly
>>> skipping when we don't have full snapshot.
>>>
>>
>> I don't follow, smgr_decode already checks ctx->fast_forward.
>
> In your earlier email you seemed to express some doubts about the
> change skipping code in smgr_decode(). To that, I gave my own
> perspective of why the change skipping code in smgr_decode() is
> correct. smgr_decode() is doing the right thing, IMO. No change
> required there.
>

I think that was referring to the skipping we do for logical messages:

    if (message->transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!message->transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

I concluded we don't need to do that here.

>>
>>> Some minor comments on 0006 patch
>>>
>>> + /* make sure the relfilenode creation is associated with the XID */
>>> + if (XLogLogicalInfoActive())
>>> + GetCurrentTransactionId();
>>>
>>> I think this change is correct and is in line with similar changes in 0002. But
>>> I looked at other places from where DefineRelation() is called. For regular
>>> tables it is called from ProcessUtilitySlow() which in turn does not call
>>> GetCurrentTransactionId(). I am wondering whether we are just discovering a
>>> class of bugs caused by not associating an xid with a newly created
>>> relfilenode.
>>>
>>
>> Not sure. Why would it be a bug?
>
> This discussion is unrelated to sequence decoding but let me add it
> here. If we don't know the transaction ID that created a relfilenode,
> we wouldn't know whether to roll back that creation if the transaction
> gets rolled back during recovery. But maybe that doesn't matter since
> the relfilenode is not visible in any of the catalogs, so it just lies
> there unused.
>

I think that's unrelated to this patch.

>
>>
>>> +void
>>> +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
>>> + RelFileLocator rlocator)
>>> +{
>>> ... snip ...
>>> +
>>> + /* sequence changes require a transaction */
>>> + if (xid == InvalidTransactionId)
>>> + return;
>>>
>>> IIUC, with your changes in DefineSequence() in this patch, this should not
>>> happen. So this condition will never be true. But in case it happens, this code
>>> will not add the relfilelocator to the hash table and we will deem the
>>> sequence change as non-transactional. Isn't it better to just throw an error
>>> and stop replication if that (ever) happens?
>>>
>>
>> It can't happen for sequence, but it may happen when creating a
>> non-sequence relfilenode. In a way, it's a way to skip (some)
>> unnecessary relfilenodes.
>
> Ah! The comment is correct but cryptic. I didn't read it to mean this.
>

OK, I'll improve the comment.

>>> + /*
>>> + * To support logical decoding of sequences, we require the sequence
>>> + * callback. We decide it here, but only check it later in the wrappers.
>>> + *
>>> + * XXX Isn't it wrong to define only one of those callbacks? Say we
>>> + * only define the stream_sequence_cb() - that may get strange results
>>> + * depending on what gets streamed. Either none or both?
>>> + *
>>> + * XXX Shouldn't sequence be defined at slot creation time, similar
>>> + * to two_phase? Probably not.
>>> + */
>>>
>>> Do you intend to keep these XXX's as is? My previous comments on this comment
>>> block are in [1].
>
> This comment remains unanswered.
>

I think the conclusion was we don't need to do that. I forgot to remove the comment, though.

>>>
>>> In fact, given that whether or not sequences are replicated is decided by the
>>> protocol version, do we really need LogicalDecodingContext::sequences? Drawing
>>> a parallel with WAL messages, I don't think it's needed.
>>>
>>
>> Right. We do that for two_phase because you can override that when
>> creating the subscription - sequences allowed that too initially, but
>> then we ditched that. So I don't think we need this.
>
> Then we should just remove that member and its references.
>

The member is still needed - it says whether the plugin has callbacks for sequence decoding or not (just like we have a flag for streaming, for example). I see the XXX comment in sequence_decode() is no longer needed; we rely on protocol versioning.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
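[Editorial note: for context on the "flag for streaming" parallel, in logical.c the decoding context records whether the output plugin supplied the streaming callbacks, and the sequence flag would presumably be derived the same way. A sketch in that style, where sequence_cb and stream_sequence_cb are the callbacks added by the patch (not in core):]

/*
 * Sketch, mirroring how ctx->streaming is derived from the presence of
 * the stream_*_cb callbacks: remember whether the output plugin
 * implements sequence decoding at all.
 */
ctx->sequences = (ctx->callbacks.sequence_cb != NULL) ||
                 (ctx->callbacks.stream_sequence_cb != NULL);

/* later, in the decode wrappers, sequence changes are simply skipped */
if (!ctx->sequences)
    return;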
On 2023-Jul-24, Tomas Vondra wrote: > On 7/24/23 13:14, Alvaro Herrera wrote: > > Do you mind if I get this one pushed later today? Or feel free to push > > it yourself, if you want. It's an annoying patch to keep seeing posted > > over and over, with no further value. > > Feel free to push. It's your patch, after all. Thanks, done. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Learn about compilers. Then everything looks like either a compiler or a database, and now you have two problems but one of them is fun." https://twitter.com/thingskatedid/status/1456027786158776329
On 7/24/23 12:40, Amit Kapila wrote:
> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
>> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
>> be reviewed in this response.
>>
>> I followed the discussion starting [1] till [2]. The second one
>> mentions the interlock mechanism which has been implemented in 0005
>> and 0006. While I don't have an objection to allowing LOCKing a
>> sequence using the LOCK command, I am not sure whether it will
>> actually work or is even needed.
>>
>> The problem described in [1] seems to be the same as the problem
>> described in [2]. In both cases we see the sequence moving backwards
>> during CATCHUP. At the end of catchup the sequence is in the right
>> state in both the cases.
>>
>
> I think we could see a backward sequence value even after the catchup
> phase (after the sync worker has exited and/or the state of the rel is
> marked as 'ready' in pg_subscription_rel). The point is that there is
> no guarantee that we will process all the pending WAL before
> considering the sequence state 'SYNCDONE' and/or 'READY'. For
> example, after copy_sequence, I see values like:
>
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      166
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      167
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      167
> (1 row)
>
> Then during the catchup phase:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          33 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
>
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | s          | 0/1637700
> (2 rows)
>
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | r          | 0/1637700
> (2 rows)
>
> Here the sequence relid is 16393. You can see the sequence state is marked as ready.
>

Right, but "READY" just means the apply caught up to the LSN where the sync finished ...

> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
>
> Even after that, see below, the value of the sequence is still not
> caught up. Later, when the apply worker processes all the WAL, the
> sequence state will be caught up.
>
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
>
> So, there will be a window where the sequence won't be caught up for a
> certain period of time and any usage of it (even after the sync is
> finished) during that time could result in inconsistent behaviour.
>

And how is this different from what tablesync does for tables? For that 'r' also does not mean it's fully caught up, IIRC. What matters is whether the sequence can go back from this moment on. And I don't think it can, because that would require replaying changes from before we did copy_sequence ...

> The other question is whether it is okay to allow the sequence to go
> backwards even during the initial sync phase? The reason I am asking
> this question is that for the time the sequence value moves backwards, one
> is allowed to use it on the subscriber which will result in using
> out-of-sequence values. For example, immediately after copy_sequence
> the values look like this:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         133 |      32 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      134
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      134
> (1 row)
>
> But then during the sync phase, it can go backwards and one is allowed
> to use it on the subscriber:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>       67
> (1 row)
>

As I wrote earlier, I think the agreement was we make no guarantees about what happens during the sync. Also, not sure what you mean by "no one is allowed to use it on subscriber" - that is only allowed after a failover/switchover, after sequence sync completes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/24/23 12:40, Amit Kapila wrote:
> > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Even after that, see below the value of the sequence is still not
> > caught up. Later, when the apply worker processes all the WAL, the
> > sequence state will be caught up.
> >
>
> And how is this different from what tablesync does for tables? For that
> 'r' also does not mean it's fully caught up, IIRC. What matters is
> whether the sequence since this moment can go back. And I don't think it
> can, because that would require replaying changes from before we did
> copy_sequence ...
>

For sequences, it is quite possible that we replay WAL from before the copy_sequence(), whereas the same is not true for tables (w.r.t. copy_table()). This is because for tables we have a kind of interlock w.r.t. the LSN returned via create_slot (say this value is LSN1): the walsender corresponding to the tablesync worker on the publisher won't send any WAL before that LSN, whereas the same is not true for sequences. Also, even if the apply worker can receive WAL from before copy_table(), it won't apply it, as that would be behind LSN1; again, the same is not true for sequences. So, for tables, we will never go back to a state before the copy_table(), but for sequences, we can go back to a state before copy_sequence().

--
With Regards,
Amit Kapila.
On Mon, Jul 24, 2023 at 4:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/24/23 12:40, Amit Kapila wrote: > > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat > > <ashutosh.bapat.oss@gmail.com> wrote: > >> > >> 0005, 0006 and 0007 are all related to the initial sequence sync. [3] > >> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to > >> be reviewed in this response. > >> > >> I followed the discussion starting [1] till [2]. The second one > >> mentions the interlock mechanism which has been implemented in 0005 > >> and 0006. While I don't have an objection to allowing LOCKing a > >> sequence using the LOCK command, I am not sure whether it will > >> actually work or is even needed. > >> > >> The problem described in [1] seems to be the same as the problem > >> described in [2]. In both cases we see the sequence moving backwards > >> during CATCHUP. At the end of catchup the sequence is in the right > >> state in both the cases. > >> > > > > I think we could see backward sequence value even after the catchup > > phase (after the sync worker is exited and or the state of rel is > > marked as 'ready' in pg_subscription_rel). The point is that there is > > no guarantee that we will process all the pending WAL before > > considering the sequence state is 'SYNCDONE' and or 'READY'. For > > example, after copy_sequence, I see values like: > > > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 165 | 0 | t > > (1 row) > > postgres=# select nextval('s'); > > nextval > > --------- > > 166 > > (1 row) > > postgres=# select nextval('s'); > > nextval > > --------- > > 167 > > (1 row) > > postgres=# select currval('s'); > > currval > > --------- > > 167 > > (1 row) > > > > Then during the catchup phase: > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 33 | 0 | t > > (1 row) > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 66 | 0 | t > > (1 row) > > > > postgres=# select * from pg_subscription_rel; > > srsubid | srrelid | srsubstate | srsublsn > > ---------+---------+------------+----------- > > 16394 | 16390 | r | 0/16374E8 > > 16394 | 16393 | s | 0/1637700 > > (2 rows) > > > > postgres=# select * from pg_subscription_rel; > > srsubid | srrelid | srsubstate | srsublsn > > ---------+---------+------------+----------- > > 16394 | 16390 | r | 0/16374E8 > > 16394 | 16393 | r | 0/1637700 > > (2 rows) > > > > Here Sequence relid id 16393. You can see sequence state is marked as ready. > > > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 66 | 0 | t > > (1 row) > > > > Even after that, see below the value of the sequence is still not > > caught up. Later, when the apply worker processes all the WAL, the > > sequence state will be caught up. > > > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 165 | 0 | t > > (1 row) > > > > So, there will be a window where the sequence won't be caught up for a > > certain period of time and any usage of it (even after the sync is > > finished) during that time could result in inconsistent behaviour. > > > > I'm rather confused about which node these queries are executed on. > Presumably some of it is on publisher, some on subscriber? > These are all on the subscriber. > Can you create a reproducer (TAP test demonstrating this?) 
> I guess it might require adding some sleeps to hit the right timing ...
>

I have used the debugger to reproduce this, as it needs quite some coordination. I just wanted to see whether the sequence can go backward and not catch up completely before the sequence state is marked 'ready'.

On the publisher side, I created a publication with a table and a sequence. Then I did the following steps:

SELECT nextval('s') FROM generate_series(1,50);
insert into t1 values(1);
SELECT nextval('s') FROM generate_series(51,150);

Then on the subscriber side, with some debugging aid, I could observe the sequence values shown in the previous email. Sorry, I haven't recorded each and every step, but if you think it helps, I can try to reproduce it again and share the steps.

--
With Regards,
Amit Kapila.
On 7/25/23 08:28, Amit Kapila wrote:
> On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/24/23 12:40, Amit Kapila wrote:
>>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
>>> <ashutosh.bapat.oss@gmail.com> wrote:
>>>
>>> Even after that, see below the value of the sequence is still not
>>> caught up. Later, when the apply worker processes all the WAL, the
>>> sequence state will be caught up.
>>>
>>
>> And how is this different from what tablesync does for tables? For that
>> 'r' also does not mean it's fully caught up, IIRC. What matters is
>> whether the sequence since this moment can go back. And I don't think it
>> can, because that would require replaying changes from before we did
>> copy_sequence ...
>>
>
> For sequences, it is quite possible that we replay WAL from before the
> copy_sequence(), whereas the same is not true for tables (w.r.t.
> copy_table()). This is because for tables we have a kind of interlock
> w.r.t. the LSN returned via create_slot (say this value is LSN1): the
> walsender corresponding to the tablesync worker on the publisher won't
> send any WAL before that LSN, whereas the same is not true for
> sequences. Also, even if the apply worker can receive WAL from before
> copy_table(), it won't apply it, as that would be behind LSN1; again,
> the same is not true for sequences. So, for tables, we will never go
> back to a state before the copy_table(), but for sequences, we can go
> back to a state before copy_sequence().
>

Right. I think the important detail is that during sync we have three important LSNs:

- LSN1 where the slot is created
- LSN2 where the copy happens
- LSN3 where we consider the sync completed

For tables, LSN1 == LSN2, because the copy is done using the snapshot from the temporary slot. And (LSN1 <= LSN3).

But for sequences, the copy happens after the slot creation, possibly with (LSN1 < LSN2). And because LSN3 comes from the main subscription (which may be a bit behind, for whatever reason), it may happen that

(LSN1 < LSN3 < LSN2)

The sync then ends at LSN3, but that means all sequence changes between LSN3 and LSN2 will be applied "again", making the sequence go backwards.

IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
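[Editorial note: in code terms, the condition Tomas proposes could look roughly like the sketch below. finish_sequence_sync() is a hypothetical helper (the real patch wires this into the tablesync machinery); UpdateSubscriptionRelState() and SUBREL_STATE_SYNCDONE are the existing catalog interface:]

#include "postgres.h"
#include "access/xlogdefs.h"
#include "catalog/pg_subscription_rel.h"

/*
 * Sketch: only mark the sequence as SYNCDONE once the main apply
 * position (LSN3) has reached the publisher insert LSN taken right
 * after copy_sequence() (LSN2). That guarantees no change from before
 * the copy can be replayed afterwards, so the sequence cannot move
 * backwards once the sync is considered done.
 */
static bool
finish_sequence_sync(Oid subid, Oid relid,
                     XLogRecPtr copy_lsn,   /* LSN2 */
                     XLogRecPtr apply_lsn)  /* LSN3 */
{
    if (apply_lsn < copy_lsn)
        return false;           /* keep catching up */

    UpdateSubscriptionRelState(subid, relid,
                               SUBREL_STATE_SYNCDONE, copy_lsn);
    return true;
}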
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Right. I think the important detail is that during sync we have three
> important LSNs:
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the copy is done using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
> (LSN1 < LSN3 < LSN2)
>
> The sync then ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again", making the sequence go backwards.
>
> IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

Back in this thread, an approach was proposed to use the page LSN (LSN2 above) to make sure that no change before LSN2 is applied on the subscriber. The approach was discussed in emails around [1] and discarded later for no reason. I think that approach has some merit.

[1] https://www.postgresql.org/message-id/flat/21c87ea8-86c9-80d6-bc78-9b95033ca00b%40enterprisedb.com#36bb9c7968b7af577dc080950761290d

--
Best Wishes,
Ashutosh Bapat
On 7/25/23 15:18, Ashutosh Bapat wrote:
>
> ...
>
>> But for sequences, the copy happens after the slot creation, possibly
>> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
>> (which may be a bit behind, for whatever reason), it may happen that
>>
>> (LSN1 < LSN3 < LSN2)
>>
>> The sync then ends at LSN3, but that means all sequence changes between
>> LSN3 and LSN2 will be applied "again", making the sequence go backwards.
>>
>> IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

Do you agree this scheme would be correct?

> Back in this thread, an approach was proposed to use the page LSN (LSN2
> above) to make sure that no change before LSN2 is applied on the
> subscriber. The approach was discussed in emails around [1] and
> discarded later for no reason. I think that approach has some merit.
>
> [1] https://www.postgresql.org/message-id/flat/21c87ea8-86c9-80d6-bc78-9b95033ca00b%40enterprisedb.com#36bb9c7968b7af577dc080950761290d
>

That doesn't seem to be the correct link ...

IIRC the page LSN was discussed as a way to skip changes up to the point when the COPY was done. I believe it might work with the scheme I described above too. The trouble is we don't have an interface to select both the sequence state and the page LSN. It's probably not hard to add one (extend read_seq_tuple() to also return the LSN, and add a SQL function), but I don't think it'd add much value compared to just getting the current insert LSN after the COPY. Yes, the current LSN may be a bit higher, so we may need to apply a couple of changes to get into the "ready" state. But we read it right after copy_sequence(), so how much can happen in between? Also, we can get into a similar state anyway - the main subscription can get ahead, at which point the sync has to catch up to it.

The attached patch (part 0007) does it this way. Can you check whether you can still reproduce the "backwards" movement with this version?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230725.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230725.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230725.patch
- 0004-Simplify-protocol-versioning-20230725.patch
- 0005-replace-created-flag-with-XLOG_SMGR_CREATE-20230725.patch
- 0006-add-XLOG_INCLUDE_ORIGIN-for-sequences-20230725.patch
- 0007-Catchup-up-to-a-LSN-after-copy-of-the-seque-20230725.patch
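[Editorial note: for concreteness, the interface extension mentioned in the email above (returning the page LSN together with the sequence state) could look roughly like this, as a new function next to the existing static helper in commands/sequence.c. The wrapper name and the extra output parameter are hypothetical; read_seq_tuple() and PageGetLSN() are the existing pieces:]

#include "postgres.h"
#include "commands/sequence.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Sketch: return the sequence tuple together with the LSN of its page,
 * i.e. the LSN of the last WAL record that changed the sequence. A
 * SQL-callable wrapper could then expose both values, allowing the
 * subscriber to discard any decoded change with a lower LSN.
 */
static Form_pg_sequence_data
read_seq_tuple_with_lsn(Relation rel, Buffer *buf,
                        HeapTuple seqdatatuple, XLogRecPtr *lsn)
{
    Form_pg_sequence_data seq;

    seq = read_seq_tuple(rel, buf, seqdatatuple);

    /* the page LSN reflects the last WAL-logged change of this sequence */
    *lsn = PageGetLSN(BufferGetPage(*buf));

    return seq;
}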
On 7/24/23 14:57, Ashutosh Bapat wrote:
> ...
>
>>
>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
>> I was thinking maybe we should have it in the transaction (because we
>> need to do cleanup at the end). It seems a bit inconvenient, because then
>> we'd need to either search htabs in all subxacts, or transfer the
>> entries to the top-level xact (otoh, we already do that with snapshots),
>> and clean up on abort.
>>
>> What do you think?
>
> A hash table per transaction seems a saner design. Adding it to the top-level
> transaction should be fine. The entry will contain an XID
> anyway. If we add it to every subtransaction, we will need to search
> the hash table in each of the subtransactions when deciding whether a
> sequence change is transactional or not. The top transaction is a
> reasonable trade-off.
>

It's not clear to me what design you're proposing, exactly. If we track it in top-level transactions, then we'd need to copy the data whenever a transaction is assigned as a child, and perhaps also remove it when there's a subxact abort. And we'd still need to search the hashes in all toplevel transactions on every sequence increment - in principle we can't have an increment for a sequence created in another in-progress transaction, but maybe it's just not assigned yet.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Here's a somewhat cleaned-up version of the patch series, with some of the smaller "rework" patches (protocol versioning, origins, smgr_create, ...) merged into the appropriate part. I've kept the bit adding a separate tablesync LSN.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/25/23 08:28, Amit Kapila wrote:
> > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 7/24/23 12:40, Amit Kapila wrote:
> >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> >>> <ashutosh.bapat.oss@gmail.com> wrote:
> >>>
> >>> Even after that, see below the value of the sequence is still not
> >>> caught up. Later, when the apply worker processes all the WAL, the
> >>> sequence state will be caught up.
> >>>
> >>
> >> And how is this different from what tablesync does for tables? For that,
> >> 'r' also does not mean it's fully caught up, IIRC. What matters is
> >> whether the sequence can go back from this moment on. And I don't think
> >> it can, because that would require replaying changes from before we did
> >> copy_sequence ...
> >>
> >
> > For sequences, it is quite possible that we replay WAL from before the
> > copy_sequence, whereas the same is not true for tables (w.r.t.
> > copy_table()). This is because for tables we have a kind of interlock
> > w.r.t. the LSN returned via create_slot (say this value of LSN is LSN1);
> > basically, the walsender corresponding to the tablesync worker on the
> > publisher won't send any WAL before that LSN, whereas the same is not
> > true for sequences. Also, even if the apply worker can receive WAL before
> > copy_table, it won't apply that, as it would be behind LSN1, and
> > the same is not true for sequences. So, for tables, we will never go
> > back to a state before the copy_table(), but for sequences, we can go
> > back to a state before copy_sequence().
> >
>
> Right. I think the important detail is that during sync we have three
> important LSNs:
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the copy is done using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
> (LSN1 < LSN3 < LSN2)
>
> Then the sync ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again", making the sequence go backwards.
>

Yeah, the problem is essentially as you explained, but an additional minor point is that for sequences we also end up applying the WAL between LSN1 and LSN3, which makes the sequence go backwards. Ideally, sequences on subscribers should never go backward in a way that is visible to users. I will study your proposal and share my thoughts in a later email.

--
With Regards,
Amit Kapila.
On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 7/25/23 08:28, Amit Kapila wrote: > > > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra > > > <tomas.vondra@enterprisedb.com> wrote: > > >> > > >> On 7/24/23 12:40, Amit Kapila wrote: > > >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat > > >>> <ashutosh.bapat.oss@gmail.com> wrote: > > >>> > > >>> Even after that, see below the value of the sequence is still not > > >>> caught up. Later, when the apply worker processes all the WAL, the > > >>> sequence state will be caught up. > > >>> > > >> > > >> And how is this different from what tablesync does for tables? For that > > >> 'r' also does not mean it's fully caught up, IIRC. What matters is > > >> whether the sequence since this moment can go back. And I don't think it > > >> can, because that would require replaying changes from before we did > > >> copy_sequence ... > > >> > > > > > > For sequences, it is quite possible that we replay WAL from before the > > > copy_sequence whereas the same is not true for tables (w.r.t > > > copy_table()). This is because for tables we have a kind of interlock > > > w.r.t LSN returned via create_slot (say this value of LSN is LSN1), > > > basically, the walsender corresponding to tablesync worker in > > > publisher won't send any WAL before that LSN whereas the same is not > > > true for sequences. Also, even if apply worker can receive WAL before > > > copy_table, it won't apply that as that would be behind the LSN1 and > > > the same is not true for sequences. So, for tables, we will never go > > > back to a state before the copy_table() but for sequences, we can go > > > back to a state before copy_sequence(). > > > > > > > Right. I think the important detail is that during sync we have three > > important LSNs > > > > - LSN1 where the slot is created > > - LSN2 where the copy happens > > - LSN3 where we consider the sync completed > > > > For tables, LSN1 == LSN2, because the data is completed using the > > snapshot from the temporary slot. And (LSN1 <= LSN3). > > > > But for sequences, the copy happens after the slot creation, possibly > > with (LSN1 < LSN2). And because LSN3 comes from the main subscription > > (which may be a bit behind, for whatever reason), it may happen that > > > > (LSN1 < LSN3 < LSN2) > > > > The the sync ends at LSN3, but that means all sequence changes between > > LSN3 and LSN2 will be applied "again" making the sequence go away. > > > > Yeah, the problem is something as you explained but an additional > minor point is that for sequences we also do end up applying the WAL > between LSN1 and LSN3 which makes it go backwards. > I was reading this email thread and found the email by Andres [1] which seems to me to say the same thing: "I assume that part of the initial sync would have to be a new sequence synchronization step that reads all the sequence states on the publisher and ensures that the subscriber sequences are at the same point. There's a bit of trickiness there, but it seems entirely doable. The logical replication replay support for sequences will have to be a bit careful about not decreasing the subscriber's sequence values - the standby initially will be ahead of the increments we'll see in the WAL.". Now, IIUC this means that even before the sequence is marked as SYNCDONE, it shouldn't go backward. 
[1]: "https://www.postgresql.org/message-id/20221117024357.ljjme6v75mny2j6u%40awork3.anarazel.de With Regards, Amit Kapila.
On 7/26/23 09:27, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> ...
>>
>
> I was reading this email thread and found the email by Andres [1]
> which seems to me to say the same thing: "I assume that part of the
> initial sync would have to be a new sequence synchronization step that
> reads all the sequence states on the publisher and ensures that the
> subscriber sequences are at the same point. There's a bit of
> trickiness there, but it seems entirely doable. The logical
> replication replay support for sequences will have to be a bit careful
> about not decreasing the subscriber's sequence values - the standby
> initially will be ahead of the increments we'll see in the WAL.".
> Now, IIUC this means that even before the sequence is marked as
> SYNCDONE, it shouldn't go backward.
>

Well, I could argue that's more an opinion, and I'm not sure it really contradicts the idea that the sequence only needs to not go backwards after the sync completes.

Anyway, I was thinking about this a bit more, and it seems it's not as difficult to use the page LSN to ensure sequences don't go backwards. The 0005 change does that, by:

1) adding pg_sequence_state, which returns both the sequence state and the page LSN

2) making copy_sequence return the page LSN

3) making tablesync set this LSN as origin_startpos (which for tables is just the LSN of the replication slot)

AFAICS this makes it work - we start decoding at the page LSN, so that we skip the increments that could lead to the sequence going backwards.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
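For illustration, here is a hypothetical example of what using the new function from 0005 might look like; the exact signature and output columns are assumptions based on the description above, not taken from the patch:

----
-- assumed to return the sequence state plus the LSN of its page
SELECT * FROM pg_sequence_state('s'::regclass);

-- copy_sequence() would return that page LSN, and tablesync would use
-- it as origin_startpos, so decoding starts exactly at the page LSN
----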
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/26/23 09:27, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
>

While studying the changes for this proposal and related areas, I have a few comments:

1. I think you need to advance the origin if it is changed due to copy_sequence(), otherwise, if the sync worker restarts after SUBREL_STATE_FINISHEDCOPY, it will restart from the slot's LSN value.

2. Between the time of the SYNCDONE and READY states, the patch can skip applying non-transactional sequence changes even when it should apply them. The reason is that during that state change should_apply_changes_for_rel() decides whether to apply a change based on the value of remote_final_lsn, which won't be set for a non-transactional change. I think we need to send the start LSN of a non-transactional record and then use that as remote_final_lsn for such a change.

3. For non-transactional sequence change apply, we don't set replorigin_session_origin_lsn/replorigin_session_origin_timestamp as we are doing in apply_handle_commit_internal() before calling CommitTransactionCommand(). So, that can lead to the origin moving backwards after a restart, which will lead to requesting and applying the same changes again, and for that period of time the sequence can go backwards. This needs some more thought as to what the correct behaviour/solution is.

4. BTW, while checking this behaviour, I noticed that the initial sync worker for a sequence mentions the table in the LOG message: "LOG: logical replication table synchronization worker for subscription "mysub", table "s" has finished". Won't it be better here to refer to it as a sequence?

--
With Regards,
Amit Kapila.
On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/24/23 14:57, Ashutosh Bapat wrote:
> > ...
> >
> >> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> >> I was thinking maybe we should have it in the transaction (because we
> >> need to do cleanup at the end). It seems a bit inconvenient, because then
> >> we'd need to either search htabs in all subxacts, or transfer the
> >> entries to the top-level xact (otoh, we already do that with snapshots),
> >> and clean up on abort.
> >>
> >> What do you think?
> >
> > A hash table per transaction seems like the saner design. Adding it to the
> > top-level transaction should be fine. The entry will contain an XID
> > anyway. If we add it to every subtransaction we will need to search the
> > hash table in each of the subtransactions when deciding whether a
> > sequence change is transactional or not. The top transaction is a
> > reasonable trade-off.
> >
>
> It's not clear to me what design you're proposing, exactly.
>
> If we track it in top-level transactions, then we'd need to copy the data
> whenever a transaction is assigned as a child, and perhaps also remove
> it when there's a subxact abort.

I thought, esp. with your changes to assign xid, we will always know the top-level transaction when a sequence is assigned a relfilenode. So the relfilenodes will always get added to the correct hash directly. I didn't imagine a case where we would need to copy the hash table from a sub-transaction to the top transaction. If that's true, yes, it's inconvenient.

As to the abort, don't we already remove entries on subtxn abort? Having a per-transaction hash table doesn't seem to change anything much.

> And we'd still need to search the hashes in all top-level transactions on
> every sequence increment - in principle we can't have an increment for a
> sequence created in another in-progress transaction, but maybe it's just
> not assigned yet.

We hold a strong lock on the sequence when changing its relfilenode. The sequence whose relfilenode is being changed can not be accessed by any concurrent transaction. So I am not able to understand what you are trying to say.

I think a per (top-level) transaction hash table is the cleaner design. It puts the hash table where it should be. But if that makes the code difficult, the current design works too.

--
Best Wishes,
Ashutosh Bapat
On 7/28/23 11:42, Amit Kapila wrote: > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/26/23 09:27, Amit Kapila wrote: >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> Anyway, I was thinking about this a bit more, and it seems it's not as >> difficult to use the page LSN to ensure sequences don't go backwards. >> > > While studying the changes for this proposal and related areas, I have > a few comments: > 1. I think you need to advance the origin if it is changed due to > copy_sequence(), otherwise, if the sync worker restarts after > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN > value. > True, we want to restart at the new origin_startpos. > 2. Between the time of SYNCDONE and READY state, the patch can skip > applying non-transactional sequence changes even if it should apply > it. The reason is that during that state change > should_apply_changes_for_rel() decides whether to apply change based > on the value of remote_final_lsn which won't be set for > non-transactional change. I think we need to send the start LSN of a > non-transactional record and then use that as remote_final_lsn for > such a change. Good catch. remote_final_lsn is set in apply_handle_begin, but that won't happen for sequences. We're already sending the LSN, but logicalrep_read_sequence ignores it - it should be enough to add it to LogicalRepSequence and then set it in apply_handle_sequence(). > > 3. For non-transactional sequence change apply, we don't set > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as > we are doing in apply_handle_commit_internal() before calling > CommitTransactionCommand(). So, that can lead to the origin moving > backwards after restart which will lead to requesting and applying the > same changes again and for that period of time sequence can go > backwards. This needs some more thought as to what is the correct > behaviour/solution for this. > I think saying "origin moves backwards" is a bit misleading. AFAICS the origin position is not actually moving backwards, it's more that we don't (and can't) move it forwards for each non-transactional change. So yeah, we may re-apply those, and IMHO that's expected - the sequence is allowed to be "ahead" on the subscriber. I don't see a way to improve this, except maybe having a separate LSN for non-transactional changes (for each origin). > 4. BTW, while checking this behaviour, I noticed that the initial sync > worker for sequence mentions the table in the LOG message: "LOG: > logical replication table synchronization worker for subscription > "mysub", table "s" has finished". Won't it be better here to refer to > it as a sequence? > Thanks, I'll fix that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
> The 0005 change does that, by:
>
> 1) adding pg_sequence_state, which returns both the sequence state and
> the page LSN
>
> 2) making copy_sequence return the page LSN
>
> 3) making tablesync set this LSN as origin_startpos (which for tables is
> just the LSN of the replication slot)
>
> AFAICS this makes it work - we start decoding at the page LSN, so that
> we skip the increments that could lead to the sequence going backwards.
>

I like this design very much. It makes things simpler rather than more complex. Thanks for doing this.

I am wondering whether we could reuse pg_sequence_last_value() instead of adding a new function. But the name of that function doesn't leave much space for expanding its functionality, so we are good with a new one. Probably with some code deduplication.

--
Best Wishes,
Ashutosh Bapat
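For comparison, the existing function only returns the last value (it backs the pg_sequences view) and has no room for the page LSN:

----
-- existing function: last value only (NULL if nextval was never called)
SELECT pg_sequence_last_value('s'::regclass);
----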
On 7/28/23 14:35, Ashutosh Bapat wrote: > On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/24/23 14:57, Ashutosh Bapat wrote: >>> ... >>> >>>> >>>> >>>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global. >>>> I was thinking maybe we should have it in the transaction (because we >>>> need to do cleanup at the end). It seem a bit inconvenient, because then >>>> we'd need to either search htabs in all subxacts, or transfer the >>>> entries to the top-level xact (otoh, we already do that with snapshots), >>>> and cleanup on abort. >>>> >>>> What do you think? >>> >>> Hash table per transaction seems saner design. Adding it to the top >>> level transaction should be fine. The entry will contain an XID >>> anyway. If we add it to every subtransaction we will need to search >>> hash table in each of the subtransactions when deciding whether a >>> sequence change is transactional or not. Top transaction is a >>> reasonable trade off. >>> >> >> It's not clear to me what design you're proposing, exactly. >> >> If we track it in top-level transactions, then we'd need copy the data >> whenever a transaction is assigned as a child, and perhaps also remove >> it when there's a subxact abort. > > I thought, esp. with your changes to assign xid, we will always know > the top level transaction when a sequence is assigned a relfilenode. > So the refilenodes will always get added to the correct hash directly. > I didn't imagine a case where we will need to copy the hash table from > sub-transaction to top transaction. If that's true, yes it's > inconvenient. > Well, it's a matter of efficiency. To check if a sequence change is transactional, we need to check if it's for a relfilenode created in the current transaction (it can't be for relfilenode created in a concurrent top-level transaction, due to MVCC). If you don't copy the entries into the top-level xact, you have to walk all subxacts and search all of those, for each sequence change. And there may be quite a few of both subxacts and sequence changes ... I wonder if we need to search the other top-level xacts, but we probably need to do that. Because it might be a subxact without an assignment, or something like that. > As to the abort, don't we already remove entries on subtxn abort? > Having per transaction hash table doesn't seem to change anything > much. > What entries are we removing? My point is that if we copy the entries to the top-level xact, we probably need to remove them on abort. Or we could leave them in the top-level xact hash. >> >> And we'd need to still search the hashes in all toplevel transactions on >> every sequence increment - in principle we can't have increment for a >> sequence created in another in-progress transaction, but maybe it's just >> not assigned yet. > > We hold a strong lock on sequence when changing its relfilenode. The > sequence whose relfilenode is being changed can not be accessed by any > concurrent transaction. So I am not able to understand what you are > trying to say. > How do you know the subxact has already been recognized as such? It may be treated as top-level transaction for a while, until the assignment. > I think per (top level) transaction hash table is cleaner design. It > puts the hash table where it should be. But if that makes code > difficult, current design works too. > regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/28/23 11:42, Amit Kapila wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 7/26/23 09:27, Amit Kapila wrote: > >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> > > > > While studying the changes for this proposal and related areas, I have > > a few comments: > > 1. I think you need to advance the origin if it is changed due to > > copy_sequence(), otherwise, if the sync worker restarts after > > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN > > value. > > > > True, we want to restart at the new origin_startpos. > > > 2. Between the time of SYNCDONE and READY state, the patch can skip > > applying non-transactional sequence changes even if it should apply > > it. The reason is that during that state change > > should_apply_changes_for_rel() decides whether to apply change based > > on the value of remote_final_lsn which won't be set for > > non-transactional change. I think we need to send the start LSN of a > > non-transactional record and then use that as remote_final_lsn for > > such a change. > > Good catch. remote_final_lsn is set in apply_handle_begin, but that > won't happen for sequences. We're already sending the LSN, but > logicalrep_read_sequence ignores it - it should be enough to add it to > LogicalRepSequence and then set it in apply_handle_sequence(). > As per my understanding, the LSN sent is EndRecPtr of record which is the beginning of the next record (means current_record_end + 1). For comparing the current record, we use the start_position of the record as we do when we use the remote_final_lsn via apply_handle_begin(). > > > > 3. For non-transactional sequence change apply, we don't set > > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as > > we are doing in apply_handle_commit_internal() before calling > > CommitTransactionCommand(). So, that can lead to the origin moving > > backwards after restart which will lead to requesting and applying the > > same changes again and for that period of time sequence can go > > backwards. This needs some more thought as to what is the correct > > behaviour/solution for this. > > > > I think saying "origin moves backwards" is a bit misleading. AFAICS the > origin position is not actually moving backwards, it's more that we > don't (and can't) move it forwards for each non-transactional change. So > yeah, we may re-apply those, and IMHO that's expected - the sequence is > allowed to be "ahead" on the subscriber. > But, if this happens then for a period of time the sequence will go backwards relative to what one would have observed before restart. -- With Regards, Amit Kapila.
On 7/28/23 14:44, Ashutosh Bapat wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>> The 0005 change does that, by:
>>
>> 1) adding pg_sequence_state, which returns both the sequence state and
>> the page LSN
>>
>> 2) making copy_sequence return the page LSN
>>
>> 3) making tablesync set this LSN as origin_startpos (which for tables is
>> just the LSN of the replication slot)
>>
>> AFAICS this makes it work - we start decoding at the page LSN, so that
>> we skip the increments that could lead to the sequence going backwards.
>>
>
> I like this design very much. It makes things simpler rather than more
> complex. Thanks for doing this.
>

I agree it seems simpler. It'd be good to test / review it a bit more, to make sure it doesn't misbehave in some way.

> I am wondering whether we could reuse pg_sequence_last_value() instead
> of adding a new function. But the name of that function doesn't leave
> much space for expanding its functionality, so we are good with a new
> one. Probably with some code deduplication.
>

I don't think we should do that - pg_sequence_last_value() is meant to do something different, and I don't think making it also do what pg_sequence_state() does would make anything simpler.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/29/23 06:54, Amit Kapila wrote: > On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/28/23 11:42, Amit Kapila wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> On 7/26/23 09:27, Amit Kapila wrote: >>>>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> Anyway, I was thinking about this a bit more, and it seems it's not as >>>> difficult to use the page LSN to ensure sequences don't go backwards. >>>> >>> >>> While studying the changes for this proposal and related areas, I have >>> a few comments: >>> 1. I think you need to advance the origin if it is changed due to >>> copy_sequence(), otherwise, if the sync worker restarts after >>> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN >>> value. >>> >> >> True, we want to restart at the new origin_startpos. >> >>> 2. Between the time of SYNCDONE and READY state, the patch can skip >>> applying non-transactional sequence changes even if it should apply >>> it. The reason is that during that state change >>> should_apply_changes_for_rel() decides whether to apply change based >>> on the value of remote_final_lsn which won't be set for >>> non-transactional change. I think we need to send the start LSN of a >>> non-transactional record and then use that as remote_final_lsn for >>> such a change. >> >> Good catch. remote_final_lsn is set in apply_handle_begin, but that >> won't happen for sequences. We're already sending the LSN, but >> logicalrep_read_sequence ignores it - it should be enough to add it to >> LogicalRepSequence and then set it in apply_handle_sequence(). >> > > As per my understanding, the LSN sent is EndRecPtr of record which is > the beginning of the next record (means current_record_end + 1). For > comparing the current record, we use the start_position of the record > as we do when we use the remote_final_lsn via apply_handle_begin(). > >>> >>> 3. For non-transactional sequence change apply, we don't set >>> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as >>> we are doing in apply_handle_commit_internal() before calling >>> CommitTransactionCommand(). So, that can lead to the origin moving >>> backwards after restart which will lead to requesting and applying the >>> same changes again and for that period of time sequence can go >>> backwards. This needs some more thought as to what is the correct >>> behaviour/solution for this. >>> >> >> I think saying "origin moves backwards" is a bit misleading. AFAICS the >> origin position is not actually moving backwards, it's more that we >> don't (and can't) move it forwards for each non-transactional change. So >> yeah, we may re-apply those, and IMHO that's expected - the sequence is >> allowed to be "ahead" on the subscriber. >> > > But, if this happens then for a period of time the sequence will go > backwards relative to what one would have observed before restart. > That is true, but is it really a problem? This whole sequence decoding thing was meant to allow logical failover - make sure that after switch to the subscriber, the sequences don't generate duplicate values. From this POV, the sequence going backwards (back to the confirmed origin position) is not an issue - it's still far enough (ahead of publisher). Is that great / ideal? No, I agree with that. But it was considered acceptable and good enough for the failover use case ... 
The only idea I have for improving that is that we could keep the non-transactional changes (instead of applying them immediately), and then apply them at the nearest "commit". That'd mean they're subject to the position tracking, and the sequence would not go backwards, I think.

So every time we decode a commit, we'd check if we decoded any sequence changes since the last commit, and merge them (a bit like a subxact).

This would however also mean sequence changes from rolled-back xacts may not be replicated. I think that'd be fine, but IIRC Andres suggested it's a valid use case.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/28/23 14:35, Ashutosh Bapat wrote:
>
> ...
>
> We hold a strong lock on the sequence when changing its relfilenode. The
> sequence whose relfilenode is being changed can not be accessed by any
> concurrent transaction. So I am not able to understand what you are
> trying to say.
>
> I think a per (top-level) transaction hash table is the cleaner design. It
> puts the hash table where it should be. But if that makes the code
> difficult, the current design works too.
>

I was thinking about switching to the per-txn hash, so here's a patch adopting that approach (in part 0006). I can't say it's much simpler, but maybe it can be simplified a bit. Most of the complexity comes from assignments possibly happening with a delay, which makes it hard to say what the top-level xact is.

The patch essentially does this:

1) the HTAB is moved to ReorderBufferTXN

2) after decoding SMGR_CREATE, we add an entry to the current TXN and (for subtransactions) to the parent TXN (even the copy references the subxact)

3) when processing an assignment, we copy the HTAB entries from the subxact to the parent

4) after a subxact abort, we remove the HTAB entries from the parent

5) while searching for a relfilenode, we only scan the HTAB in the top-level xacts (this is possible due to the copying)

This could work without the copy in the parent HTAB, but then we'd have to scan all the transactions for every increment. And there may be many lookups and many (sub)transactions, but only a small number of new relfilenodes. So it seems like a good tradeoff.

If we could convince ourselves that the subxact has to be already assigned while decoding the sequence change, then we could simply search only the current transaction (and its parent). But I've been unable to convince myself that's guaranteed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230729.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230729.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230729.patch
- 0004-Catchup-up-to-a-LSN-after-copy-of-the-seque-20230729.patch
- 0005-use-page-LSN-for-sequences-20230729.patch
- 0006-per-transaction-hash-of-sequences-20230729.patch
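As a reminder of what the hash is for: decoding has to classify each sequence change as transactional (the relfilenode was created by a transaction that is still in progress) or non-transactional (everything else). A minimal SQL illustration of the two cases; the sequence name is arbitrary:

----
-- transactional: the sequence gets a new relfilenode in this xact, so
-- its increment is decoded as part of the transaction, and is discarded
-- if the transaction aborts
BEGIN;
CREATE SEQUENCE s2;
SELECT nextval('s2');
COMMIT;

-- non-transactional: 's2' already has a pre-existing relfilenode, so
-- this increment is decoded and applied immediately, even though the
-- surrounding transaction rolls back
BEGIN;
SELECT nextval('s2');
ROLLBACK;
----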
On 7/29/23 14:38, Tomas Vondra wrote:
>
> ...
>
> The only idea I have for improving that is that we could keep the
> non-transactional changes (instead of applying them immediately), and
> then apply them at the nearest "commit". That'd mean they're subject to
> the position tracking, and the sequence would not go backwards, I think.
>
> So every time we decode a commit, we'd check if we decoded any sequence
> changes since the last commit, and merge them (a bit like a subxact).
>
> This would however also mean sequence changes from rolled-back xacts may
> not be replicated. I think that'd be fine, but IIRC Andres suggested
> it's a valid use case.
>

I wasn't sure how difficult this approach would be, so I experimented with it today, and it's waaaay more complicated than I thought. In fact, I'm not even sure how to do it at all ...

Part 0008 is a WIP patch where ReorderBufferQueueSequence does not apply the non-transactional changes immediately, and instead adds the changes to a top-level list. ReorderBufferCommit then adds a fake subxact with all sequence changes up to the commit LSN.

The challenging part is snapshot management - when applying the changes immediately, we can simply build and use the current snapshot. But with 0008 it's not that simple - we don't even know into which transaction the sequence change will get "injected". In fact, we don't even know if the parent transaction will have a snapshot (if it only does nextval() it may seem empty).

I was thinking maybe we could "keep" the snapshots for non-transactional changes, but I suspect that might confuse the main transaction in some way.

I'm still not convinced this behavior would actually be desirable ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230730.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230730.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230730.patch
- 0004-Catchup-up-to-a-LSN-after-copy-of-the-seque-20230730.patch
- 0005-use-page-LSN-for-sequences-20230730.patch
- 0006-per-transaction-hash-of-sequences-20230730.patch
- 0007-assert-checking-sequence-hash-20230730.patch
- 0008-try-adding-fake-transaction-with-sequence-c-20230730.patch
On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/28/23 14:44, Ashutosh Bapat wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> The 0005 change does that, by: > >> > >> 1) adding pg_sequence_state, that returns both the sequence state and > >> the page LSN > >> > >> 2) copy_sequence returns the page LSN > >> > >> 3) tablesync then sets this LSN as origin_startpos (which for tables is > >> just the LSN of the replication slot) > >> > >> AFAICS this makes it work - we start decoding at the page LSN, so that > >> we skip the increments that could lead to the sequence going backwards. > >> > > > > I like this design very much. It makes things simpler than complex. > > Thanks for doing this. > > > > I agree it seems simpler. It'd be good to try testing / reviewing it a > bit more, so that it doesn't misbehave in some way. > Yeah, I also think this needs a review. This is a sort of new concept where we don't use the LSN of the slot (for cases where copy returned a larger value of LSN) or a full_snapshot created corresponding to the sync slot by Walsender. For the case of the table, we build a full snapshot because we use that for copying the table but why do we need to build that for copying the sequence especially when we directly copy it from the sequence relation without caring for any snapshot? -- With Regards, Amit Kapila.
On 7/31/23 11:25, Amit Kapila wrote: > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/28/23 14:44, Ashutosh Bapat wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> Anyway, I was thinking about this a bit more, and it seems it's not as >>>> difficult to use the page LSN to ensure sequences don't go backwards. >>>> The 0005 change does that, by: >>>> >>>> 1) adding pg_sequence_state, that returns both the sequence state and >>>> the page LSN >>>> >>>> 2) copy_sequence returns the page LSN >>>> >>>> 3) tablesync then sets this LSN as origin_startpos (which for tables is >>>> just the LSN of the replication slot) >>>> >>>> AFAICS this makes it work - we start decoding at the page LSN, so that >>>> we skip the increments that could lead to the sequence going backwards. >>>> >>> >>> I like this design very much. It makes things simpler than complex. >>> Thanks for doing this. >>> >> >> I agree it seems simpler. It'd be good to try testing / reviewing it a >> bit more, so that it doesn't misbehave in some way. >> > > Yeah, I also think this needs a review. This is a sort of new concept > where we don't use the LSN of the slot (for cases where copy returned > a larger value of LSN) or a full_snapshot created corresponding to the > sync slot by Walsender. For the case of the table, we build a full > snapshot because we use that for copying the table but why do we need > to build that for copying the sequence especially when we directly > copy it from the sequence relation without caring for any snapshot? > We need the slot to decode/apply changes during catchup. The main subscription may get ahead, and we need to ensure the WAL is not discarded or something like that. This applies even if the initial sync step does not use the slot/snapshot directly. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/31/23 11:25, Amit Kapila wrote: > > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 7/28/23 14:44, Ashutosh Bapat wrote: > >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > >>> <tomas.vondra@enterprisedb.com> wrote: > >>>> > >>>> Anyway, I was thinking about this a bit more, and it seems it's not as > >>>> difficult to use the page LSN to ensure sequences don't go backwards. > >>>> The 0005 change does that, by: > >>>> > >>>> 1) adding pg_sequence_state, that returns both the sequence state and > >>>> the page LSN > >>>> > >>>> 2) copy_sequence returns the page LSN > >>>> > >>>> 3) tablesync then sets this LSN as origin_startpos (which for tables is > >>>> just the LSN of the replication slot) > >>>> > >>>> AFAICS this makes it work - we start decoding at the page LSN, so that > >>>> we skip the increments that could lead to the sequence going backwards. > >>>> > >>> > >>> I like this design very much. It makes things simpler than complex. > >>> Thanks for doing this. > >>> > >> > >> I agree it seems simpler. It'd be good to try testing / reviewing it a > >> bit more, so that it doesn't misbehave in some way. > >> > > > > Yeah, I also think this needs a review. This is a sort of new concept > > where we don't use the LSN of the slot (for cases where copy returned > > a larger value of LSN) or a full_snapshot created corresponding to the > > sync slot by Walsender. For the case of the table, we build a full > > snapshot because we use that for copying the table but why do we need > > to build that for copying the sequence especially when we directly > > copy it from the sequence relation without caring for any snapshot? > > > > We need the slot to decode/apply changes during catchup. The main > subscription may get ahead, and we need to ensure the WAL is not > discarded or something like that. This applies even if the initial sync > step does not use the slot/snapshot directly. > AFAIK, none of these needs a full_snapshot (see usage of SnapBuild->building_full_snapshot). The full_snapshot tracks both catalog and non-catalog xacts in the snapshot where we require to track non-catalog ones because we want to copy the table using that snapshot. It is relatively expensive to build a full snapshot and we don't do that unless it is required. For the current usage of this patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient. -- With Regards, Amit Kapila.
On 8/1/23 04:59, Amit Kapila wrote: > On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/31/23 11:25, Amit Kapila wrote: >>> ... >>> >>> Yeah, I also think this needs a review. This is a sort of new concept >>> where we don't use the LSN of the slot (for cases where copy returned >>> a larger value of LSN) or a full_snapshot created corresponding to the >>> sync slot by Walsender. For the case of the table, we build a full >>> snapshot because we use that for copying the table but why do we need >>> to build that for copying the sequence especially when we directly >>> copy it from the sequence relation without caring for any snapshot? >>> >> >> We need the slot to decode/apply changes during catchup. The main >> subscription may get ahead, and we need to ensure the WAL is not >> discarded or something like that. This applies even if the initial sync >> step does not use the slot/snapshot directly. >> > > AFAIK, none of these needs a full_snapshot (see usage of > SnapBuild->building_full_snapshot). The full_snapshot tracks both > catalog and non-catalog xacts in the snapshot where we require to > track non-catalog ones because we want to copy the table using that > snapshot. It is relatively expensive to build a full snapshot and we > don't do that unless it is required. For the current usage of this > patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient. > Yeah, you may be right we don't need a full snapshot, because we don't need to export it. We however still need a snapshot, and it wasn't clear to me whether you suggest we don't need the slot / snapshot at all. Anyway, I think this is "just" a matter of efficiency, not correctness. IMHO there are bigger questions regarding the "going back" behavior after apply restart. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
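The distinction maps to the snapshot option used when creating the slot over a replication connection. A sketch using the walsender grammar; the slot and plugin names are examples, and the exact option spelling depends on the server version (older releases use bare keywords like USE_SNAPSHOT):

----
-- what tablesync uses for tables: a snapshot the copy can use
CREATE_REPLICATION_SLOT "sync_slot" TEMPORARY LOGICAL "pgoutput" (SNAPSHOT 'use');

-- what would suffice for sequences per the above: no snapshot export/use
CREATE_REPLICATION_SLOT "sync_slot" TEMPORARY LOGICAL "pgoutput" (SNAPSHOT 'nothing');
----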
On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Anyway, I think this is "just" a matter of efficiency, not correctness.
> IMHO there are bigger questions regarding the "going back" behavior
> after apply restart.

sequence_decode() has the following code:

/* Skip the change if already processed (per the snapshot). */
if (transactional &&
    !SnapBuildProcessChange(builder, xid, buf->origptr))
    return;
else if (!transactional &&
         (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
          SnapBuildXactNeedsSkip(builder, buf->origptr)))
    return;

This means that if the subscription restarts, the upstream will *not* send any non-transactional sequence changes with an LSN prior to the LSN specified by the START_REPLICATION command. That should avoid replicating all the non-transactional sequence changes since ReplicationSlot::restart_lsn if the subscription restarts.

But in apply_handle_sequence(), we do not update replorigin_session_origin_lsn with the LSN of a non-transactional sequence change when it's applied. This means that if a subscription restarts while it is halfway through applying a transaction, those changes will be replicated again. This will move the sequence backward. If the subscription keeps restarting again and again while applying that transaction, we will see the sequence "rubber banding" [1] on the subscriber. So until the transaction is completely applied, the other users of the sequence may see duplicate values during this time. I think this is undesirable.

But I am not able to find a case where this can lead to conflicting values after failover. If there's only one transaction which is repeatedly being applied, the rows which use sequence values were never committed, so there's no conflicting value present on the subscriber. The same reasoning can be extended to multiple in-flight transactions. If another transaction (T2) uses the sequence values changed by in-flight transaction T1, and T2 commits before T1, the sequence changes used by T2 must have LSNs before the commit of T2 and thus they will never be replicated. (See example below.)

T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q1
T2
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q2
COMMIT;
T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q3
COMMIT;

So I am not able to imagine a case when a sequence going backward can cause conflicting values.

But whether or not that's the case, the downstream should not request (and hence receive) any changes that have been already applied (and committed) downstream, as a principle. I think a way to achieve this is to update replorigin_session_origin_lsn so that a sequence change applied once is not requested (and hence sent) again.

[1] https://en.wikipedia.org/wiki/Rubber_banding

--
Best Wishes,
Ashutosh Bapat
On 8/11/23 08:32, Ashutosh Bapat wrote:
> On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Anyway, I think this is "just" a matter of efficiency, not correctness.
>> IMHO there are bigger questions regarding the "going back" behavior
>> after apply restart.
>
> sequence_decode() has the following code:
>
> /* Skip the change if already processed (per the snapshot). */
> if (transactional &&
>     !SnapBuildProcessChange(builder, xid, buf->origptr))
>     return;
> else if (!transactional &&
>          (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
>           SnapBuildXactNeedsSkip(builder, buf->origptr)))
>     return;
>
> This means that if the subscription restarts, the upstream will *not*
> send any non-transactional sequence changes with an LSN prior to the LSN
> specified by the START_REPLICATION command. That should avoid replicating
> all the non-transactional sequence changes since
> ReplicationSlot::restart_lsn if the subscription restarts.
>

Ah, right, I got confused and mixed up restart_lsn and the LSN passed in the START_REPLICATION command. Thanks for the details, I think this works fine.

> But in apply_handle_sequence(), we do not update
> replorigin_session_origin_lsn with the LSN of a non-transactional
> sequence change when it's applied. This means that if a subscription
> restarts while it is halfway through applying a transaction, those
> changes will be replicated again. This will move the sequence
> backward. If the subscription keeps restarting again and again while
> applying that transaction, we will see the sequence "rubber banding"
> [1] on the subscriber. So until the transaction is completely applied,
> the other users of the sequence may see duplicate values during this
> time. I think this is undesirable.
>

Well, but as I said earlier, this is not expected to support using the sequence on the subscriber until after the failover, so there's no real risk of "duplicate values". Yes, you might select the data from the sequence directly, but that would have all sorts of issues even without replication - users are required to use nextval/currval and so on.

> But I am not able to find a case where this can lead to conflicting
> values after failover. If there's only one transaction which is
> repeatedly being applied, the rows which use sequence values were
> never committed, so there's no conflicting value present on the
> subscriber. The same reasoning can be extended to multiple in-flight
> transactions. If another transaction (T2) uses the sequence values
> changed by in-flight transaction T1, and T2 commits before T1, the
> sequence changes used by T2 must have LSNs before the commit of T2 and
> thus they will never be replicated. (See example below.)
>
> T1
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q1
> T2
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q2
> COMMIT;
> T1
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q3
> COMMIT;
>
> So I am not able to imagine a case when a sequence going backward can
> cause conflicting values.

Right, I agree this "rubber banding" can happen. But as long as we don't go back too far (before the last applied commit) I think that'd be fine. We only need to make guarantees about committed transactions, and I don't think we need to worry about this too much ...

> But whether or not that's the case, the downstream should not request (and
> hence receive) any changes that have been already applied (and
> committed) downstream, as a principle. I think a way to achieve this is
> to update replorigin_session_origin_lsn so that a sequence change
> applied once is not requested (and hence sent) again.
>

I guess we could update the origin, per attached 0004. We don't have a timestamp to set replorigin_session_origin_timestamp, but it seems we don't need that.

The attached patch merges the earlier improvements, except for the part that experimented with adding a "fake" transaction (which turned out to have a number of difficult issues).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > > But whether or not that's the case, downstream should not request (and > > hence receive) any changes that have been already applied (and > > committed) downstream as a principle. I think a way to achieve this is > > to update the replorigin_session_origin_lsn so that a sequence change > > applied once is not requested (and hence sent) again. > > > > I guess we could update the origin, per attached 0004. We don't have > timestamp to set replorigin_session_origin_timestamp, but it seems we > don't need that. > > The attached patch merges the earlier improvements, except for the part > that experimented with adding a "fake" transaction (which turned out to > have a number of difficult issues). 0004 looks good to me. But I need to review the impact of not setting replorigin_session_origin_timestamp. What fake transaction experiment are you talking about? -- Best Wishes, Ashutosh Bapat
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > > But whether or not that's the case, the downstream should not request (and
> > > hence receive) any changes that have been already applied (and
> > > committed) downstream, as a principle. I think a way to achieve this is
> > > to update replorigin_session_origin_lsn so that a sequence change
> > > applied once is not requested (and hence sent) again.
> > >
> >
> > I guess we could update the origin, per attached 0004. We don't have a
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me.
>

+ {
  CommitTransactionCommand();
+
+ /*
+  * Update origin state so we don't try applying this sequence
+  * change in case of crash.
+  *
+  * XXX We don't have replorigin_session_origin_timestamp, but we
+  * can just leave that set to 0.
+  */
+ replorigin_session_origin_lsn = seq.lsn;

IIUC, your proposal is to update replorigin_session_origin_lsn, so that after a restart, it doesn't use some prior origin LSN to start with, which can in turn lead the sequence to go backward. If so, it should be updated before calling CommitTransactionCommand() as we are doing in apply_handle_commit_internal(). If that is not the intention then it is not clear to me how updating replorigin_session_origin_lsn after the commit is helpful.

> But I need to review the impact of not setting
> replorigin_session_origin_timestamp.
>

This may not have a direct impact on built-in replication as I think we don't rely on it yet, but we need to think of out-of-core solutions. I am not sure if I understood your proposal as per my previous comment, but once you clarify it, I'll also try to think about the same.

--
With Regards,
Amit Kapila.
On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > > But whether or not that's the case, the downstream should not request (and
> > > > hence receive) any changes that have been already applied (and
> > > > committed) downstream, as a principle. I think a way to achieve this is
> > > > to update replorigin_session_origin_lsn so that a sequence change
> > > > applied once is not requested (and hence sent) again.
> > > >
> > >
> > > I guess we could update the origin, per attached 0004. We don't have a
> > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > don't need that.
> > >
> > > The attached patch merges the earlier improvements, except for the part
> > > that experimented with adding a "fake" transaction (which turned out to
> > > have a number of difficult issues).
> >
> > 0004 looks good to me.
> >
>
> + {
>   CommitTransactionCommand();
> +
> + /*
> +  * Update origin state so we don't try applying this sequence
> +  * change in case of crash.
> +  *
> +  * XXX We don't have replorigin_session_origin_timestamp, but we
> +  * can just leave that set to 0.
> +  */
> + replorigin_session_origin_lsn = seq.lsn;
>
> IIUC, your proposal is to update replorigin_session_origin_lsn, so
> that after a restart, it doesn't use some prior origin LSN to start with,
> which can in turn lead the sequence to go backward. If so, it should
> be updated before calling CommitTransactionCommand() as we are doing
> in apply_handle_commit_internal(). If that is not the intention then
> it is not clear to me how updating replorigin_session_origin_lsn after
> the commit is helpful.
>

typedef struct ReplicationState
{
...
	/*
	 * Location of the latest commit from the remote side.
	 */
	XLogRecPtr	remote_lsn;

This is the variable that will be updated with the value of replorigin_session_origin_lsn. This means we will now track some arbitrary LSN location of the remote side in this variable. The above comment makes me wonder if there is anything we are missing, or if it is just a matter of updating this comment, because before the patch we always adhered to what is written in the comment.

--
With Regards,
Amit Kapila.
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me. But I need to review the impact of not setting
> replorigin_session_origin_timestamp.

I think it will be good to set replorigin_session_origin_timestamp = 0 explicitly, so as not to pick up a garbage value. The timestamp is written to the commit record; beyond that I don't see any use of it. It is further passed downstream if there is a cascaded logical replication setup, but I don't see it being used there either. So it should be fine to leave it 0.

I don't think we can use logically replicated sequences in a multi-master environment where the timestamp may be used to resolve conflicts. Such a setup would require distributed sequence management, which can not be achieved by logical replication alone.

In short, I didn't find any hazard in leaving replorigin_session_origin_timestamp as 0.

--
Best Wishes,
Ashutosh Bapat
On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > > >
> > > > > But whether or not that's the case, the downstream should not request (and
> > > > > hence receive) any changes that have been already applied (and
> > > > > committed) downstream, as a principle. I think a way to achieve this is
> > > > > to update replorigin_session_origin_lsn so that a sequence change
> > > > > applied once is not requested (and hence sent) again.
> > > > >
> > > >
> > > > I guess we could update the origin, per attached 0004. We don't have a
> > > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > > don't need that.
> > > >
> > > > The attached patch merges the earlier improvements, except for the part
> > > > that experimented with adding a "fake" transaction (which turned out to
> > > > have a number of difficult issues).
> > >
> > > 0004 looks good to me.
> >
> > + {
> >   CommitTransactionCommand();
> > +
> > + /*
> > +  * Update origin state so we don't try applying this sequence
> > +  * change in case of crash.
> > +  *
> > +  * XXX We don't have replorigin_session_origin_timestamp, but we
> > +  * can just leave that set to 0.
> > +  */
> > + replorigin_session_origin_lsn = seq.lsn;
> >
> > IIUC, your proposal is to update replorigin_session_origin_lsn, so
> > that after a restart, it doesn't use some prior origin LSN to start with,
> > which can in turn lead the sequence to go backward. If so, it should
> > be updated before calling CommitTransactionCommand() as we are doing
> > in apply_handle_commit_internal(). If that is not the intention then
> > it is not clear to me how updating replorigin_session_origin_lsn after
> > the commit is helpful.
>
> typedef struct ReplicationState
> {
> ...
> 	/*
> 	 * Location of the latest commit from the remote side.
> 	 */
> 	XLogRecPtr	remote_lsn;
>
> This is the variable that will be updated with the value of
> replorigin_session_origin_lsn. This means we will now track some
> arbitrary LSN location of the remote side in this variable. The above
> comment makes me wonder if there is anything we are missing, or if it
> is just a matter of updating this comment, because before the patch we
> always adhered to what is written in the comment.

I don't think we are missing anything. This value is used to track the remote LSN up to which all the commits from upstream have been applied locally. Since a non-transactional sequence change is like a single-WAL-record transaction, its LSN acts as the LSN of the mini-commit. So it should be fine to update remote_lsn with the sequence WAL record's end LSN. That's what the patches do. I don't see any hazard. But you are right, we need to update the comments - here and also at other places like replorigin_session_advance(), which uses remote_commit as the name of the argument that gets assigned to ReplicationState::remote_lsn.

--
Best Wishes,
Ashutosh Bapat
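To see what ReplicationState actually tracks, the subscriber's existing functions and views can be used; the origin name below is an example (origins created for subscriptions are named pg_<suboid>):

----
-- remote_lsn per origin, from the in-memory ReplicationState
SELECT external_id, remote_lsn, local_lsn FROM pg_replication_origin_status;

-- the same value for a single origin
SELECT pg_replication_origin_progress('pg_16403', false);
----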
On Wednesday, August 16, 2023 10:27 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

>
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we don't
> need that.
>
> The attached patch merges the earlier improvements, except for the part that
> experimented with adding a "fake" transaction (which turned out to have a
> number of difficult issues).

I tried to test the patch and found a crash when calling
pg_logical_slot_get_changes() to consume sequence changes.

Steps:
----
create table t1_seq(a int);
create sequence seq1;
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'include-xids', 'false', 'skip-empty-xacts', '1');
----

The backtrace is attached in bt.txt.

Best Regards,
Hou zj
Attachment
On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>

I was reading through 0001, and I noticed this comment in the
ReorderBufferSequenceIsTransactional() function

+ * To decide if a sequence change should be handled as transactional or applied
+ * immediately, we track (sequence) relfilenodes created by each transaction.
+ * We don't know if the current sub-transaction was already assigned to the
+ * top-level transaction, so we need to check all transactions.

It says "We don't know if the current sub-transaction was already
assigned to the top-level transaction, so we need to check all
transactions". But IIRC as part of the streaming of in-progress
transactions we have ensured that whenever we are logging the first
change by any subtransaction we include the top transaction ID in it.

Refer to this code:

LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
{
...
    /*
     * If the top-level xid is valid, we need to assign the subxact to the
     * top-level xact. We need to do this for all records, hence we do it
     * before the switch.
     */
    if (TransactionIdIsValid(txid))
    {
        ReorderBufferAssignChild(ctx->reorder,
                                 txid,
                                 XLogRecGetXid(record),
                                 buf.origptr);
    }
}

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
>
> I was reading through 0001, and I noticed this comment in the
> ReorderBufferSequenceIsTransactional() function
>
> + * To decide if a sequence change should be handled as transactional or applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
>
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
>
> Refer to this code:
>
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
> {
> ...
>     /*
>      * If the top-level xid is valid, we need to assign the subxact to the
>      * top-level xact. We need to do this for all records, hence we do it
>      * before the switch.
>      */
>     if (TransactionIdIsValid(txid))
>     {
>         ReorderBufferAssignChild(ctx->reorder,
>                                  txid,
>                                  XLogRecGetXid(record),
>                                  buf.origptr);
>     }
> }

Some more comments:

1.
ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid
are duplicated, except that the first one just confirms whether the
relfilelocator was created in the transaction and the other returns the
XID as well. I think these two could easily be merged so that we can
avoid the duplicate code (see the sketch below).

2.
/*
+ * ReorderBufferTransferSequencesToParent
+ *      Copy the relfilenode entries to the parent after assignment.
+ */
+static void
+ReorderBufferTransferSequencesToParent(ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       ReorderBufferTXN *subtxn)

If we agree with my comment in the previous email (i.e. the first WAL
record by a subxid will always include the topxid) then we do not need
this function at all - we can always add the relfilelocator directly to
the top transaction and never need to transfer.

That is all I have for now from the first pass of 0001; later I will do
a more detailed review and will look into the other patches also.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
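For illustration, a minimal sketch of what the merged helper from comment
(1) might look like - the signature is hypothetical; sequences_hash,
ReorderBufferSequenceEnt and the scan over toplevel_by_lsn follow the
structure of the patch under review, and this sketch assumes every
top-level transaction has its hash table allocated:

    /*
     * Hypothetical merged helper: do the relfilenode lookup once, report
     * whether the change is transactional, and return the XID of the
     * (sub)transaction that created the relfilenode via an out parameter.
     */
    static bool
    ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
                                         RelFileLocator rlocator,
                                         TransactionId *xid)
    {
        bool        found = false;
        dlist_iter  iter;

        *xid = InvalidTransactionId;

        /* entries from subxacts are copied to the top-level transaction */
        dlist_foreach(iter, &rb->toplevel_by_lsn)
        {
            ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node, iter.cur);
            ReorderBufferSequenceEnt *ent;

            ent = (ReorderBufferSequenceEnt *) hash_search(txn->sequences_hash,
                                                           &rlocator,
                                                           HASH_FIND, &found);
            if (found)
            {
                *xid = ent->xid;    /* XID of the (sub)xact that created it */
                break;
            }
        }

        return found;
    }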
On Friday, September 15, 2023 11:11 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, August 16, 2023 10:27 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> >
> > I guess we could update the origin, per attached 0004. We don't have
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the
> > part that experimented with adding a "fake" transaction (which turned
> > out to have a number of difficult issues).
>
> I tried to test the patch and found a crash when calling
> pg_logical_slot_get_changes() to consume sequence changes.

Oh, after checking again, I realize it's my fault - my build environment
was not clean. This case passed after rebuilding. Sorry for the noise.

Best Regards,
Hou zj
On 9/22/23 13:24, Dilip Kumar wrote:
> On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>
>> I was reading through 0001, and I noticed this comment in the
>> ReorderBufferSequenceIsTransactional() function
>>
>> + * To decide if a sequence change should be handled as transactional or applied
>> + * immediately, we track (sequence) relfilenodes created by each transaction.
>> + * We don't know if the current sub-transaction was already assigned to the
>> + * top-level transaction, so we need to check all transactions.
>>
>> It says "We don't know if the current sub-transaction was already
>> assigned to the top-level transaction, so we need to check all
>> transactions". But IIRC as part of the streaming of in-progress
>> transactions we have ensured that whenever we are logging the first
>> change by any subtransaction we include the top transaction ID in it.
>>
>> Refer to this code:
>>
>> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
>> {
>> ...
>>     /*
>>      * If the top-level xid is valid, we need to assign the subxact to the
>>      * top-level xact. We need to do this for all records, hence we do it
>>      * before the switch.
>>      */
>>     if (TransactionIdIsValid(txid))
>>     {
>>         ReorderBufferAssignChild(ctx->reorder,
>>                                  txid,
>>                                  XLogRecGetXid(record),
>>                                  buf.origptr);
>>     }
>> }
>
> Some more comments:
>
> 1.
> ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid
> are duplicated, except that the first one just confirms whether the
> relfilelocator was created in the transaction and the other returns
> the XID as well. I think these two could easily be merged so that we
> can avoid the duplicate code.
>

Right. The attached patch modifies the IsTransactional function to also
return the XID, and removes the GetXid one. It feels a bit weird because
now the IsTransactional function is called even in places where we know
the change is transactional. But it's true the two separate functions
duplicated a bit of code, ofc.

> 2.
> /*
> + * ReorderBufferTransferSequencesToParent
> + *      Copy the relfilenode entries to the parent after assignment.
> + */
> +static void
> +ReorderBufferTransferSequencesToParent(ReorderBuffer *rb,
> +                                       ReorderBufferTXN *txn,
> +                                       ReorderBufferTXN *subtxn)
>
> If we agree with my comment in the previous email (i.e. the first WAL
> record by a subxid will always include the topxid) then we do not need
> this function at all - we can always add the relfilelocator directly to
> the top transaction and never need to transfer.
>

Good point! I don't recall why I thought this was necessary. I suspect it
was before I added the GetCurrentTransactionId() calls to ensure the
subxact has a XID. I replaced the ReorderBufferTransferSequencesToParent
call with an assert that the relfilenode hash table is empty, and I've
been unable to trigger any failures.

> That is all I have for now from the first pass of 0001; later I will
> do a more detailed review and will look into the other patches also.
>

Thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 9/20/23 11:53, Dilip Kumar wrote:
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>
> I was reading through 0001, and I noticed this comment in the
> ReorderBufferSequenceIsTransactional() function
>
> + * To decide if a sequence change should be handled as transactional or applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
>
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
>

Yeah, that's a stale comment - the actual code only searches through the
top-level transactions (and thus relies on the immediate assignment). As
I wrote in the earlier response, I suspect this code originates from
before I added the GetCurrentTransactionId() calls.

That being said, I do wonder why, with the immediate assignments, we
still need the bit in ReorderBufferAssignChild that says:

    /*
     * We already saw this transaction, but initially added it to the
     * list of top-level txns. Now that we know it's not top-level,
     * remove it from there.
     */
    dlist_delete(&subtxn->node);

I don't think that affects this patch, but it's a bit confusing.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9/13/23 15:18, Ashutosh Bapat wrote:
> On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
>>> <ashutosh.bapat.oss@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>>>
>>>>>> But whether or not that's the case, downstream should not request (and
>>>>>> hence receive) any changes that have been already applied (and
>>>>>> committed) downstream as a principle. I think a way to achieve this is
>>>>>> to update the replorigin_session_origin_lsn so that a sequence change
>>>>>> applied once is not requested (and hence sent) again.
>>>>>>
>>>>>
>>>>> I guess we could update the origin, per attached 0004. We don't have
>>>>> timestamp to set replorigin_session_origin_timestamp, but it seems we
>>>>> don't need that.
>>>>>
>>>>> The attached patch merges the earlier improvements, except for the part
>>>>> that experimented with adding a "fake" transaction (which turned out to
>>>>> have a number of difficult issues).
>>>>
>>>> 0004 looks good to me.
>>>
>>> +    {
>>>          CommitTransactionCommand();
>>> +
>>> +        /*
>>> +         * Update origin state so we don't try applying this sequence
>>> +         * change in case of crash.
>>> +         *
>>> +         * XXX We don't have replorigin_session_origin_timestamp, but we
>>> +         * can just leave that set to 0.
>>> +         */
>>> +        replorigin_session_origin_lsn = seq.lsn;
>>>
>>> IIUC, your proposal is to update the replorigin_session_origin_lsn, so
>>> that after restart, it doesn't use some prior origin LSN to start with
>>> which can in turn lead the sequence to go backward. If so, it should
>>> be updated before calling CommitTransactionCommand() as we are doing
>>> in apply_handle_commit_internal(). If that is not the intention then
>>> it is not clear to me how updating replorigin_session_origin_lsn after
>>> commit is helpful.
>>>
>>
>> typedef struct ReplicationState
>> {
>> ...
>> /*
>>  * Location of the latest commit from the remote side.
>>  */
>> XLogRecPtr remote_lsn;
>>
>> This is the variable that will be updated with the value of
>> replorigin_session_origin_lsn. This means we will now track some
>> arbitrary LSN location of the remote side in this variable. The above
>> comment makes me wonder if there is anything we are missing or if it
>> is just a matter of updating this comment because before the patch we
>> always adhere to what is written in the comment.
>
> I don't think we are missing anything. This value is used to track the
> remote LSN up to which all the commits from upstream have been applied
> locally. Since a non-transactional sequence change is like a
> single-WAL-record transaction, its LSN acts as the LSN of the
> mini-commit. So it should be fine to update remote_lsn with the
> sequence WAL record's end LSN. That's what the patches do. I don't see
> any hazard. But you are right, we need to update the comments - here
> and also at other places like replorigin_session_advance(), which uses
> remote_commit as the name of the argument that gets assigned to
> ReplicationState::remote_lsn.
>

I agree - updating the replorigin_session_origin_lsn shouldn't break
anything. As you write, it's essentially a "mini-commit" and the commit
order remains the same.

I'm not sure about resetting replorigin_session_origin_timestamp to 0,
though. It's not something we rely on very much (it may not correlate
with the commit order etc.). But why should we set it to 0? We don't do
that for regular commits, right? And IMO it makes sense to just use the
timestamp of the last commit before the sequence change.

FWIW I've left this in a separate commit, but I'll merge that into 0002
in the next patch version.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/25/23 12:20, Amit Kapila wrote:
> ...
>
> I have used the debugger to reproduce this as it needs quite some
> coordination. I just wanted to see if the sequence can go backward and
> didn't catch up completely before the sequence state is marked
> 'ready'. On the publisher side, I created a publication with a table
> and a sequence. Then did the following steps:
>
> SELECT nextval('s') FROM generate_series(1,50);
> insert into t1 values(1);
> SELECT nextval('s') FROM generate_series(51,150);
>
> Then on the subscriber side with some debugging aid, I could find the
> values in the sequence shown in the previous email. Sorry, I haven't
> recorded each and every step but, if you think it helps, I can again
> try to reproduce it and share the steps.
>

Amit, can you try to reproduce this backwards movement with the latest
version of the patch? I have tried triggering that (mis)behavior, but I
haven't been successful so far. I'm hesitant to declare it resolved, as
it's dependent on timing etc. and you mentioned it required quite some
coordination.

Thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 12, 2023 at 9:03 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/25/23 12:20, Amit Kapila wrote:
> > ...
> >
> > I have used the debugger to reproduce this as it needs quite some
> > coordination. I just wanted to see if the sequence can go backward and
> > didn't catch up completely before the sequence state is marked
> > 'ready'. On the publisher side, I created a publication with a table
> > and a sequence. Then did the following steps:
> >
> > SELECT nextval('s') FROM generate_series(1,50);
> > insert into t1 values(1);
> > SELECT nextval('s') FROM generate_series(51,150);
> >
> > Then on the subscriber side with some debugging aid, I could find the
> > values in the sequence shown in the previous email. Sorry, I haven't
> > recorded each and every step but, if you think it helps, I can again
> > try to reproduce it and share the steps.
> >
>
> Amit, can you try to reproduce this backwards movement with the latest
> version of the patch?
>

I lost touch with this patch, but IIRC the quoted problem per se
shouldn't occur after the idea to use the page LSN instead of the slot's
LSN for synchronization between the sync and apply workers.

--
With Regards,
Amit Kapila.
On Thursday, October 12, 2023 11:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>

Hi,

I have been reviewing the patch set, and here are some initial comments.

1.

I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for a
transactional sequence change in ReorderBufferQueueChange().

2.

ReorderBufferSequenceIsTransactional

It seems we call the above function once in sequence_decode() and call
it again in ReorderBufferQueueSequence(); would it be better to avoid
the second call, as the hashtable search looks not cheap?

3.

The patch cleans up the sequence hash table when a transaction COMMITs
or ABORTs (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while
it doesn't seem to destroy the hash table when the transaction is
PREPAREd. It's not a big problem, but would it be better to release the
memory earlier by destroying the table on prepare?

4.

+pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
...
+   /* output BEGIN if we haven't yet, but only for the transactional case */
+   if (transactional)
+   {
+       if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+       {
+           pg_output_begin(ctx, data, txn, false);
+       }
+       txndata->xact_wrote_changes = true;
+   }

I think we should call pg_output_stream_start() instead of
pg_output_begin() for streaming sequence changes (see the sketch below).

5.

+   /*
+    * Schema should be sent using the original relation because it
+    * also sends the ancestor's relation.
+    */
+   maybe_send_schema(ctx, txn, relation, relentry);

The comment seems a bit misleading here; I think it was meant for the
partition logic in pgoutput_change().

Best Regards,
Hou zj
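For comment (4), the suggested fix is a one-line change; a minimal sketch
using the names from the quoted hunk (pg_output_stream_start() is the
existing helper the other stream callbacks in test_decoding use):

    /* In pg_decode_stream_sequence(), for the transactional case: */
    if (transactional)
    {
        if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
            pg_output_stream_start(ctx, data, txn, false);

        txndata->xact_wrote_changes = true;
    }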
Hi!

On 10/24/23 13:31, Zhijie Hou (Fujitsu) wrote:
> On Thursday, October 12, 2023 11:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>
> Hi,
>
> I have been reviewing the patch set, and here are some initial comments.
>
> 1.
>
> I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for a
> transactional sequence change in ReorderBufferQueueChange().
>

True. It's unlikely for a transaction to only have sequence increments
and be large enough to get streamed, and any other changes would set
this flag anyway. But it's certainly more correct to set the flag even
for sequence changes. The updated patch modifies
ReorderBufferQueueChange to do this.

> 2.
>
> ReorderBufferSequenceIsTransactional
>
> It seems we call the above function once in sequence_decode() and call
> it again in ReorderBufferQueueSequence(); would it be better to avoid
> the second call, as the hashtable search looks not cheap?
>

In principle yes, but I don't think it's worth it - I doubt the overhead
is going to be measurable. Based on earlier reviews I tried to reduce
the code duplication (there used to be two separate functions doing the
lookup), and I did consider doing just one call in sequence_decode() and
passing the XID to ReorderBufferQueueSequence() - determining the XID is
the only purpose of the call there. But it didn't seem nice/worth it.

> 3.
>
> The patch cleans up the sequence hash table when a transaction COMMITs
> or ABORTs (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while
> it doesn't seem to destroy the hash table when the transaction is
> PREPAREd. It's not a big problem, but would it be better to release the
> memory earlier by destroying the table on prepare?
>

I think you're right. I added the sequence cleanup to a couple of
places, right before the cleanup of the transaction. I wonder if we
should simply call ReorderBufferSequenceCleanup() from
ReorderBufferCleanupTXN().

> 4.
>
> +pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> ...
> +   /* output BEGIN if we haven't yet, but only for the transactional case */
> +   if (transactional)
> +   {
> +       if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
> +       {
> +           pg_output_begin(ctx, data, txn, false);
> +       }
> +       txndata->xact_wrote_changes = true;
> +   }
>
> I think we should call pg_output_stream_start() instead of
> pg_output_begin() for streaming sequence changes.
>

Good catch! Fixed.

> 5.
> +   /*
> +    * Schema should be sent using the original relation because it
> +    * also sends the ancestor's relation.
> +    */
> +   maybe_send_schema(ctx, txn, relation, relentry);
>
> The comment seems a bit misleading here; I think it was meant for the
> partition logic in pgoutput_change().
>

True. I've removed the comment.

Attached is an updated patch, with all those tweaks/fixes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

I've been cleaning up the first two patches to get them committed soon
(adding the decoding infrastructure + test_decoding), cleaning up stale
comments, updating commit messages etc. And I think it's ready to go,
but it's too late here, so I plan to go over it once more tomorrow and
then likely push. But if someone wants to take a look, I'd welcome that.

The one issue I found during this cleanup is that the patch was missing
the changes introduced by 29d0a77fa660 for decoding of other stuff.

commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
Author: Amit Kapila <akapila@postgresql.org>
Date:   Thu Oct 26 06:54:16 2023 +0530

    Migrate logical slots to the new node during an upgrade.
    ...

I fixed that, but perhaps someone might want to double check ...

0003 is here just for completeness - that's the part adding sequences to
built-in replication. I haven't done much with it, it needs some cleanup
too to get it committable. I don't intend to push that right after
0001+0002, though.

While going over 0001, I realized there might be an optimization for
ReorderBufferSequenceIsTransactional. As coded in 0001, it always
searches through all top-level transactions, and if there are many of
them that might be expensive, even if very few of them have any
relfilenodes in the hash table. It's still a linear search, and it
needs to happen for each sequence change.

But can the relfilenode even be in some other top-level transaction? How
could it be - our transaction would not see it, and wouldn't be able to
generate the sequence change. So we should be able to simply check *our*
transaction (or if it's a subxact, the top-level transaction). Either
it's there (and it's a transactional change), or not (and then it's a
non-transactional change).

The 0004 does this.

This of course hinges on when exactly the transactions get created, and
assignments processed. For example, if this fired before the txn gets
assigned to the top-level one, this would break. I don't think this can
happen thanks to the immediate logging of assignments, but I'm too
tired to think about it now.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
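The optimization sketched above might look roughly like this - a
hypothetical restatement of the 0004 idea, reusing names from the patch,
and assuming every top-level transaction has its sequences_hash allocated:

    /*
     * Sketch: instead of scanning all top-level transactions, look only
     * at the hash table of the transaction the change belongs to (or its
     * top-level parent, since subxact entries are copied there).
     */
    static bool
    ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
                                         TransactionId xid,
                                         RelFileLocator rlocator)
    {
        bool        found = false;
        ReorderBufferTXN *txn;

        txn = ReorderBufferTXNByXid(rb, xid, false, NULL,
                                    InvalidXLogRecPtr, false);

        /* unknown transaction => it cannot have created the relfilenode */
        if (txn == NULL)
            return false;

        /* for subxacts, the entries live in the top-level transaction */
        if (txn->toptxn)
            txn = txn->toptxn;

        (void) hash_search(txn->sequences_hash, &rlocator, HASH_FIND, &found);

        return found;
    }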
On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I've been cleaning up the first two patches to get them committed soon
> (adding the decoding infrastructure + test_decoding), cleaning up stale
> comments, updating commit messages etc. And I think it's ready to go,
> but it's too late here, so I plan to go over it once more tomorrow and
> then likely push. But if someone wants to take a look, I'd welcome that.
>
> The one issue I found during this cleanup is that the patch was missing
> the changes introduced by 29d0a77fa660 for decoding of other stuff.
>
> commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
> Author: Amit Kapila <akapila@postgresql.org>
> Date:   Thu Oct 26 06:54:16 2023 +0530
>
>     Migrate logical slots to the new node during an upgrade.
>     ...
>
> I fixed that, but perhaps someone might want to double check ...
>
> 0003 is here just for completeness - that's the part adding sequences to
> built-in replication. I haven't done much with it, it needs some cleanup
> too to get it committable. I don't intend to push that right after
> 0001+0002, though.
>
> While going over 0001, I realized there might be an optimization for
> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> searches through all top-level transactions, and if there are many of
> them that might be expensive, even if very few of them have any
> relfilenodes in the hash table. It's still a linear search, and it
> needs to happen for each sequence change.
>
> But can the relfilenode even be in some other top-level transaction? How
> could it be - our transaction would not see it, and wouldn't be able to
> generate the sequence change. So we should be able to simply check *our*
> transaction (or if it's a subxact, the top-level transaction). Either
> it's there (and it's a transactional change), or not (and then it's a
> non-transactional change).
>

I also think the relfilenode should be part of either the current
top-level xact or one of its subxacts, so looking at all the top-level
transactions for each change doesn't seem advisable.

> The 0004 does this.
>
> This of course hinges on when exactly the transactions get created, and
> assignments processed. For example, if this fired before the txn gets
> assigned to the top-level one, this would break. I don't think this can
> happen thanks to the immediate logging of assignments, but I'm too
> tired to think about it now.
>

This needs some thought because I think we can't guarantee the
association till we reach the point where we can actually decode the
xact. See the comments in AssertTXNLsnOrder() [1].

I noticed a few minor comments while reading the patch:

1.
+ * turned on here because the non-transactional logical message is
+ * decoded without waiting for these records.

Instead of '.. logical message', shouldn't we say sequence change message?

2.
+       /*
+        * If we found an entry with matchine relfilenode,

typo (matchine)

3.
+   Note that this may not the value obtained by the process updating the
+   process, but the future sequence value written to WAL (typically about
+   32 values ahead).

/may not the value/may not be the value

[1] -
/*
 * Skip the verification if we don't reach the LSN at which we start
 * decoding the contents of transactions yet because until we reach the
 * LSN, we could have transactions that don't have the association between
 * the top-level transaction and subtransaction yet and consequently have
 * the same LSN. We don't guarantee this association until we try to
 * decode the actual contents of transaction.

--
With Regards,
Amit Kapila.
On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > While going over 0001, I realized there might be an optimization for
> > ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> > searches through all top-level transactions, and if there are many of
> > them that might be expensive, even if very few of them have any
> > relfilenodes in the hash table. It's still a linear search, and it
> > needs to happen for each sequence change.
> >
> > But can the relfilenode even be in some other top-level transaction? How
> > could it be - our transaction would not see it, and wouldn't be able to
> > generate the sequence change. So we should be able to simply check *our*
> > transaction (or if it's a subxact, the top-level transaction). Either
> > it's there (and it's a transactional change), or not (and then it's a
> > non-transactional change).
> >
>
> I also think the relfilenode should be part of either the current
> top-level xact or one of its subxacts, so looking at all the top-level
> transactions for each change doesn't seem advisable.
>
> > The 0004 does this.
> >
> > This of course hinges on when exactly the transactions get created, and
> > assignments processed. For example, if this fired before the txn gets
> > assigned to the top-level one, this would break. I don't think this can
> > happen thanks to the immediate logging of assignments, but I'm too
> > tired to think about it now.
> >
>
> This needs some thought because I think we can't guarantee the
> association till we reach the point where we can actually decode the
> xact. See the comments in AssertTXNLsnOrder() [1].

Instead of building the infrastructure to know whether a particular
change is transactional on the decoding side, can't we have some flag in
the WAL record noting whether the change is transactional or not? I have
discussed this point with my colleague Kuroda-San and we thought that it
may be worth exploring whether we can use
rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine if
the sequence was created/changed in the current (sub)transaction and
then record that in the WAL record. This means we'd need additional
information in WAL records like XLOG_SEQ_LOG, but we can probably do
that only with wal_level=logical.

One minor point:

It'd also
+ * trigger assert in DecodeSequence.

I don't see DecodeSequence() in the patch. Which exact assert/function
are you referring to here?

--
With Regards,
Amit Kapila.
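To make the write-side idea concrete, a minimal sketch of the check being
proposed (seqrel stands for the sequence relation open at WAL-insert time
in sequence.c; the is_transactional field on xl_seq_rec is hypothetical):

    /*
     * Hypothetical write-side check, as floated above: the change is
     * transactional if the sequence (or its current relfilenumber) was
     * created in the running (sub)transaction - which is exactly what
     * the relcache already tracks for us.
     */
    xlrec.is_transactional =                    /* hypothetical new field */
        (seqrel->rd_createSubid != InvalidSubTransactionId ||
         seqrel->rd_newRelfilelocatorSubid != InvalidSubTransactionId);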
On 11/27/23 11:13, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> While going over 0001, I realized there might be an optimization for
>>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
>>> searches through all top-level transactions, and if there are many of
>>> them that might be expensive, even if very few of them have any
>>> relfilenodes in the hash table. It's still a linear search, and it
>>> needs to happen for each sequence change.
>>>
>>> But can the relfilenode even be in some other top-level transaction? How
>>> could it be - our transaction would not see it, and wouldn't be able to
>>> generate the sequence change. So we should be able to simply check *our*
>>> transaction (or if it's a subxact, the top-level transaction). Either
>>> it's there (and it's a transactional change), or not (and then it's a
>>> non-transactional change).
>>>
>>
>> I also think the relfilenode should be part of either the current
>> top-level xact or one of its subxacts, so looking at all the top-level
>> transactions for each change doesn't seem advisable.
>>
>>> The 0004 does this.
>>>
>>> This of course hinges on when exactly the transactions get created, and
>>> assignments processed. For example, if this fired before the txn gets
>>> assigned to the top-level one, this would break. I don't think this can
>>> happen thanks to the immediate logging of assignments, but I'm too
>>> tired to think about it now.
>>>
>>
>> This needs some thought because I think we can't guarantee the
>> association till we reach the point where we can actually decode the
>> xact. See the comments in AssertTXNLsnOrder() [1].
>>

I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
which says:

    /*
     * Skip the verification if we don't reach the LSN at which we start
     * decoding the contents of transactions yet because until we reach
     * the LSN, we could have transactions that don't have the association
     * between the top-level transaction and subtransaction yet and
     * consequently have the same LSN. We don't guarantee this
     * association until we try to decode the actual contents of
     * transaction. The ordering of the records prior to the
     * start_decoding_at LSN should have been checked before the restart.
     */

But doesn't this say that after we actually start decoding / stop
skipping, we should have seen the assignment? We're already decoding
transaction contents (because the sequence change *is* part of the xact,
even if we decide to replay it in the non-transactional way).

>
> Instead of building the infrastructure to know whether a particular
> change is transactional on the decoding side, can't we have some flag
> in the WAL record noting whether the change is transactional or not?
> I have discussed this point with my colleague Kuroda-San and we
> thought that it may be worth exploring whether we can use
> rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
> if the sequence was created/changed in the current (sub)transaction
> and then record that in the WAL record. This means we'd need
> additional information in WAL records like XLOG_SEQ_LOG, but we can
> probably do that only with wal_level=logical.
>

I may not understand the proposal exactly, but it's not enough to know
if it was created in the same subxact. It might have been created in
some earlier subxact in the same top-level xact.

FWIW I think one of the earlier patch versions did something like this,
by adding a "created" flag in the xlog record. And we concluded doing
this on the decoding side is a better solution.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/27/23 11:13, Amit Kapila wrote:
> > On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
> >> <tomas.vondra@enterprisedb.com> wrote:
> >>>
> >>> While going over 0001, I realized there might be an optimization for
> >>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> >>> searches through all top-level transactions, and if there are many of
> >>> them that might be expensive, even if very few of them have any
> >>> relfilenodes in the hash table. It's still a linear search, and it
> >>> needs to happen for each sequence change.
> >>>
> >>> But can the relfilenode even be in some other top-level transaction? How
> >>> could it be - our transaction would not see it, and wouldn't be able to
> >>> generate the sequence change. So we should be able to simply check *our*
> >>> transaction (or if it's a subxact, the top-level transaction). Either
> >>> it's there (and it's a transactional change), or not (and then it's a
> >>> non-transactional change).
> >>>
> >>
> >> I also think the relfilenode should be part of either the current
> >> top-level xact or one of its subxacts, so looking at all the top-level
> >> transactions for each change doesn't seem advisable.
> >>
> >>> The 0004 does this.
> >>>
> >>> This of course hinges on when exactly the transactions get created, and
> >>> assignments processed. For example, if this fired before the txn gets
> >>> assigned to the top-level one, this would break. I don't think this can
> >>> happen thanks to the immediate logging of assignments, but I'm too
> >>> tired to think about it now.
> >>>
> >>
> >> This needs some thought because I think we can't guarantee the
> >> association till we reach the point where we can actually decode the
> >> xact. See the comments in AssertTXNLsnOrder() [1].
> >>
>
> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
> which says:
>
>     /*
>      * Skip the verification if we don't reach the LSN at which we start
>      * decoding the contents of transactions yet because until we reach
>      * the LSN, we could have transactions that don't have the association
>      * between the top-level transaction and subtransaction yet and
>      * consequently have the same LSN. We don't guarantee this
>      * association until we try to decode the actual contents of
>      * transaction. The ordering of the records prior to the
>      * start_decoding_at LSN should have been checked before the restart.
>      */
>
> But doesn't this say that after we actually start decoding / stop
> skipping, we should have seen the assignment? We're already decoding
> transaction contents (because the sequence change *is* part of the xact,
> even if we decide to replay it in the non-transactional way).
>

It means to say that the assignment is decided after the
start_decoding_at point. We haven't decided that we are past
start_decoding_at by the time the patch is computing the transactional
flag.

> >
> > Instead of building the infrastructure to know whether a particular
> > change is transactional on the decoding side, can't we have some flag
> > in the WAL record noting whether the change is transactional or not?
> > I have discussed this point with my colleague Kuroda-San and we
> > thought that it may be worth exploring whether we can use
> > rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
> > if the sequence was created/changed in the current (sub)transaction
> > and then record that in the WAL record. This means we'd need
> > additional information in WAL records like XLOG_SEQ_LOG, but we can
> > probably do that only with wal_level=logical.
> >
>
> I may not understand the proposal exactly, but it's not enough to know
> if it was created in the same subxact. It might have been created in
> some earlier subxact in the same top-level xact.
>

We should be able to detect even some earlier subxact or top-level xact
based on rd_createSubid/rd_newRelfilelocatorSubid.

> FWIW I think one of the earlier patch versions did something like this,
> by adding a "created" flag in the xlog record. And we concluded doing
> this on the decoding side is a better solution.
>

Oh, I thought it would be much simpler than what we are doing on the
decoding side. Can you please point me to the email discussion where
this was concluded, or share the reason?

--
With Regards,
Amit Kapila.
On Mon, Nov 27, 2023 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > FWIW I think one of the earlier patch versions did something like this,
> > by adding a "created" flag in the xlog record. And we concluded doing
> > this on the decoding side is a better solution.
> >
>
> Oh, I thought it would be much simpler than what we are doing on the
> decoding side. Can you please point me to the email discussion where
> this was concluded, or share the reason?
>

I'll check the thread about this point by myself as well, but if by
chance you remember it then kindly share it.

--
With Regards,
Amit Kapila.
Dear Amit, Tomas,

> > >
> > > Instead of building the infrastructure to know whether a particular
> > > change is transactional on the decoding side, can't we have some flag
> > > in the WAL record noting whether the change is transactional or not?
> > > I have discussed this point with my colleague Kuroda-San and we
> > > thought that it may be worth exploring whether we can use
> > > rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
> > > if the sequence was created/changed in the current (sub)transaction
> > > and then record that in the WAL record. This means we'd need
> > > additional information in WAL records like XLOG_SEQ_LOG, but we can
> > > probably do that only with wal_level=logical.
> > >
> >
> > I may not understand the proposal exactly, but it's not enough to know
> > if it was created in the same subxact. It might have been created in
> > some earlier subxact in the same top-level xact.
> >
>
> We should be able to detect even some earlier subxact or top-level
> xact based on rd_createSubid/rd_newRelfilelocatorSubid.

Here is a small PoC patchset to help your understanding. Please see the
attached files.

0001, 0002 were not changed, and 0004 was renumbered to 0003. (For now,
I focused only on test_decoding, because it is only for evaluation
purposes.)

0004 is what we really wanted to say. An is_transactional flag is added
to the WAL record, and it stores whether the operation is transactional.
In order to distinguish the status, rd_createSubid and
rd_newRelfilelocatorSubid are used. According to the comments, they have
a valid value only when the relation was changed within the current
transaction. Also, sequences_hash is not needed anymore, so it and the
related functions were removed.

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On 11/27/23 12:11, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/27/23 11:13, Amit Kapila wrote:
>>> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> While going over 0001, I realized there might be an optimization for
>>>>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
>>>>> searches through all top-level transactions, and if there are many of
>>>>> them that might be expensive, even if very few of them have any
>>>>> relfilenodes in the hash table. It's still a linear search, and it
>>>>> needs to happen for each sequence change.
>>>>>
>>>>> But can the relfilenode even be in some other top-level transaction? How
>>>>> could it be - our transaction would not see it, and wouldn't be able to
>>>>> generate the sequence change. So we should be able to simply check *our*
>>>>> transaction (or if it's a subxact, the top-level transaction). Either
>>>>> it's there (and it's a transactional change), or not (and then it's a
>>>>> non-transactional change).
>>>>>
>>>>
>>>> I also think the relfilenode should be part of either the current
>>>> top-level xact or one of its subxacts, so looking at all the top-level
>>>> transactions for each change doesn't seem advisable.
>>>>
>>>>> The 0004 does this.
>>>>>
>>>>> This of course hinges on when exactly the transactions get created, and
>>>>> assignments processed. For example, if this fired before the txn gets
>>>>> assigned to the top-level one, this would break. I don't think this can
>>>>> happen thanks to the immediate logging of assignments, but I'm too
>>>>> tired to think about it now.
>>>>>
>>>>
>>>> This needs some thought because I think we can't guarantee the
>>>> association till we reach the point where we can actually decode the
>>>> xact. See the comments in AssertTXNLsnOrder() [1].
>>>>
>>
>> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
>> which says:
>>
>>     /*
>>      * Skip the verification if we don't reach the LSN at which we start
>>      * decoding the contents of transactions yet because until we reach
>>      * the LSN, we could have transactions that don't have the association
>>      * between the top-level transaction and subtransaction yet and
>>      * consequently have the same LSN. We don't guarantee this
>>      * association until we try to decode the actual contents of
>>      * transaction. The ordering of the records prior to the
>>      * start_decoding_at LSN should have been checked before the restart.
>>      */
>>
>> But doesn't this say that after we actually start decoding / stop
>> skipping, we should have seen the assignment? We're already decoding
>> transaction contents (because the sequence change *is* part of the xact,
>> even if we decide to replay it in the non-transactional way).
>>
>
> It means to say that the assignment is decided after the
> start_decoding_at point. We haven't decided that we are past
> start_decoding_at by the time the patch is computing the transactional
> flag.
>

Ah, I see. We're deciding if the change is transactional before calling
SnapBuildXactNeedsSkip. That's a bit unfortunate.

>>>
>>> Instead of building the infrastructure to know whether a particular
>>> change is transactional on the decoding side, can't we have some flag
>>> in the WAL record noting whether the change is transactional or not?
>>> I have discussed this point with my colleague Kuroda-San and we
>>> thought that it may be worth exploring whether we can use
>>> rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
>>> if the sequence was created/changed in the current (sub)transaction
>>> and then record that in the WAL record. This means we'd need
>>> additional information in WAL records like XLOG_SEQ_LOG, but we can
>>> probably do that only with wal_level=logical.
>>>
>>
>> I may not understand the proposal exactly, but it's not enough to know
>> if it was created in the same subxact. It might have been created in
>> some earlier subxact in the same top-level xact.
>>
>
> We should be able to detect even some earlier subxact or top-level
> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
>

Interesting. I admit I haven't considered using these fields before, so
I need to familiarize myself with them a bit, and try whether it'd work.

>> FWIW I think one of the earlier patch versions did something like this,
>> by adding a "created" flag in the xlog record. And we concluded doing
>> this on the decoding side is a better solution.
>>
>
> Oh, I thought it would be much simpler than what we are doing on the
> decoding side. Can you please point me to the email discussion where
> this was concluded, or share the reason?
>

I think the discussion started around [1], and then continued in a bunch
of following messages (search for "relfilenode").

regards

[1] https://www.postgresql.org/message-id/CAExHW5v_vVqkhF4ehST9EzpX1L3bemD1S%2BkTk_-ZVu_ir-nKDw%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/27/23 13:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Amit, Tomas,
>
>>>>
>>>> Instead of building the infrastructure to know whether a particular
>>>> change is transactional on the decoding side, can't we have some flag
>>>> in the WAL record noting whether the change is transactional or not?
>>>> I have discussed this point with my colleague Kuroda-San and we
>>>> thought that it may be worth exploring whether we can use
>>>> rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
>>>> if the sequence was created/changed in the current (sub)transaction
>>>> and then record that in the WAL record. This means we'd need
>>>> additional information in WAL records like XLOG_SEQ_LOG, but we can
>>>> probably do that only with wal_level=logical.
>>>>
>>>
>>> I may not understand the proposal exactly, but it's not enough to know
>>> if it was created in the same subxact. It might have been created in
>>> some earlier subxact in the same top-level xact.
>>>
>>
>> We should be able to detect even some earlier subxact or top-level
>> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
>
> Here is a small PoC patchset to help your understanding. Please see the
> attached files.
>
> 0001, 0002 were not changed, and 0004 was renumbered to 0003. (For now,
> I focused only on test_decoding, because it is only for evaluation
> purposes.)
>
> 0004 is what we really wanted to say. An is_transactional flag is added
> to the WAL record, and it stores whether the operation is transactional.
> In order to distinguish the status, rd_createSubid and
> rd_newRelfilelocatorSubid are used. According to the comments, they have
> a valid value only when the relation was changed within the current
> transaction. Also, sequences_hash is not needed anymore, so it and the
> related functions were removed.
>
> What do you think?
>

I think it's a very nice idea, assuming it maintains the current
behavior. It makes a lot of code unnecessary, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I spent a bit of time looking at the proposed change, and unfortunately
logging just the boolean flag does not work. A good example is this bit
from a TAP test added by the patch for built-in replication (which was
not included with the WIP patch):

BEGIN;
ALTER SEQUENCE s RESTART WITH 1000;
SAVEPOINT sp1;
INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
ROLLBACK TO sp1;
COMMIT;

This is expected to produce:

1131|0|t

but produces

1000|0|f

instead. The reason is very simple - as implemented, the patch simply
checks if the relfilenode is from the same top-level transaction, which
it is, and sets the flag to "true". So we know the sequence changes need
to be queued and replayed as part of this transaction.

But then during decoding, we still queue the changes into the subxact,
which then aborts, and the changes are discarded. That is not how it's
supposed to work, because the new relfilenode is still valid, someone
might do nextval() and commit. And the nextval() may not get WAL-logged,
so we'd lose this.

What I guess we might do is log not just a boolean flag, but the XID of
the subtransaction that created the relfilenode. And then during
decoding we'd queue the changes into this subtransaction ...

0006 in the attached patch series does this, and it seems to fix the TAP
test failure. I left it at the end, to make it easier to run tests
without the patch applied.

There's a couple of open questions, though:

- I'm not sure it's a good idea to log XIDs of subxacts into WAL like
this. I think it'd be OK, and there are other records that do that (like
RunningXacts or the commit record), but maybe I'm missing something.

- We need the actual XID, not just the SubTransactionId. I wrote
SubTransactionGetXid() to do this, but I have not worked with subxacts
much, so it'd be better if someone checked it's dealing with XID and
FullTransactionId correctly.

- I'm a bit concerned how this will perform with deeply nested
subtransactions. SubTransactionGetXid() does pretty much a linear
search, which might be somewhat expensive. And it's a cost put on
everyone who writes WAL, not just the decoding process. Maybe we should
at least limit this to wal_level=logical?

- seq_decode() then uses this XID (for transactional changes) instead of
the XID logged in the record itself. I think that's fine - it's the TXN
where we want to queue the change, after all, right?

- (unrelated) I also noticed that maybe ReorderBufferQueueSequence()
should always expect a valid XID. The code seems to suggest people can
pass InvalidTransactionId in the non-transactional case, but that's not
true because the rb->sequence() callback then fails.

The attached patches should also fix all the typos reported by Amit
earlier today.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- v20231127-3-0001-Logical-decoding-of-sequences.patch
- v20231127-3-0002-tweak-ReorderBufferSequenceIsTransaction.patch
- v20231127-3-0003-WIP-add-is_transactional-attribute-in-xl.patch
- v20231127-3-0004-Add-decoding-of-sequences-to-test_decodi.patch
- v20231127-3-0005-Add-decoding-of-sequences-to-built-in-re.patch
- v20231127-3-0006-log-XID-instead-of-a-boolean-flag.patch
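For readers following along, this is roughly what a SubTransactionGetXid()
as described above has to do - a hypothetical reconstruction (not the code
from the 0006 patch), written as if it lived in xact.c where
CurrentTransactionState and TransactionStateData are visible:

    /*
     * Hypothetical sketch: walk the stack of active (sub)transaction
     * states looking for the given SubTransactionId, and return the
     * TransactionId assigned to it. With deeply nested subxacts this
     * walk is linear, which is the cost concern raised above.
     */
    TransactionId
    SubTransactionGetXid(SubTransactionId subxid)
    {
        TransactionState s;

        for (s = CurrentTransactionState; s != NULL; s = s->parent)
        {
            if (s->subTransactionId == subxid)
                return XidFromFullTransactionId(s->fullTransactionId);
        }

        elog(ERROR, "subtransaction %u is not active", subxid);
        return InvalidTransactionId;    /* keep compiler quiet */
    }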
FWIW, here are some more minor review comments for v20231127-3-0001.

======
doc/src/sgml/logicaldecoding.sgml

1.
+     The <parameter>txn</parameter> parameter contains meta information about
+     the transaction the sequence change is part of. Note however that for
+     non-transactional updates, the transaction may be NULL, depending on
+     if the transaction already has an XID assigned.
+     The <parameter>sequence_lsn</parameter> has the WAL location of the
+     sequence update. <parameter>transactional</parameter> says if the
+     sequence has to be replayed as part of the transaction or directly.

/says if/specifies whether/

======
src/backend/commands/sequence.c

2. DecodeSeqTuple

+   memcpy(((char *) tuple->tuple.t_data),
+          data + sizeof(xl_seq_rec),
+          SizeofHeapTupleHeader);
+
+   memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
+          data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
+          datalen);

Maybe I am misreading, but isn't this just copying 2 contiguous pieces
of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen)
achieve the same?

======
.../replication/logical/reorderbuffer.c

3.
+ * To decide if a sequence change is transactional, we maintain a hash
+ * table of relfilenodes created in each (sub)transactions, along with
+ * the XID of the (sub)transaction that created the relfilenode. The
+ * entries from substransactions are copied to the top-level transaction
+ * to make checks cheaper. The hash table gets cleaned up when the
+ * transaction completes (commit/abort).

/substransactions/subtransactions/

~~~

4.
+ * A naive approach would be to just loop through all transactions and check
+ * each of them, but there may be (easily thousands) of subtransactions, and
+ * the check happens for each sequence change. So this could be very costly.

/may be (easily thousands) of/may be (easily thousands of)/

~~~

5. ReorderBufferSequenceCleanup

+   while ((ent = (ReorderBufferSequenceEnt *) hash_seq_search(&scan_status)) != NULL)
+   {
+       (void) hash_search(txn->toptxn->sequences_hash,
+                          (void *) &ent->rlocator,
+                          HASH_REMOVE, NULL);
+   }

Typically, other HASH_REMOVE code I have seen checks the result for NULL
to give elog(ERROR, "hash table corrupted");

~~~

6. ReorderBufferQueueSequence

+   if (xid != InvalidTransactionId)
+       txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

How about using the macro: TransactionIdIsValid

~~~

7. ReorderBufferQueueSequence

+   if (reloid == InvalidOid)
+       elog(ERROR, "could not map filenode \"%s\" to relation OID",
+            relpathperm(rlocator,
+                        MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

8.
+   /*
+    * Calculate the first value of the next batch (at which point we
+    * generate and decode another WAL record.
+    */

Missing ')'

~~~

9. ReorderBufferAddRelFileLocator

+   /*
+    * We only care about sequence relfilenodes for now, and those always have
+    * a XID. So if there's no XID, don't bother adding them to the hash.
+    */
+   if (xid == InvalidTransactionId)
+       return;

How about using the macro: TransactionIdIsValid

~~~

10. ReorderBufferProcessTXN

+   if (reloid == InvalidOid)
+       elog(ERROR, "could not map filenode \"%s\" to relation OID",
+            relpathperm(change->data.sequence.locator,
+                        MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

11. ReorderBufferChangeSize

+   if (tup)
+   {
+       sz += sizeof(HeapTupleData);
+       len = tup->tuple.t_len;
+       sz += len;
+   }

Why is the 'sz' increment split into 2 parts?

======
Kind Regards,
Peter Smith.
Fujitsu Australia
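Regarding review comment 2, the suggested simplification would look
roughly like this (a sketch using the names from the quoted hunk - the
header and payload are contiguous on both the source and destination
side, so one copy suffices):

    /* copy the tuple header and data in one go - they are adjacent */
    memcpy((char *) tuple->tuple.t_data,
           data + sizeof(xl_seq_rec),
           SizeofHeapTupleHeader + datalen);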
On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I spent a bit of time looking at the proposed change, and unfortunately
> logging just the boolean flag does not work. A good example is this bit
> from a TAP test added by the patch for built-in replication (which was
> not included with the WIP patch):
>
> BEGIN;
> ALTER SEQUENCE s RESTART WITH 1000;
> SAVEPOINT sp1;
> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
> ROLLBACK TO sp1;
> COMMIT;
>
> This is expected to produce:
>
> 1131|0|t
>
> but produces
>
> 1000|0|f
>
> instead. The reason is very simple - as implemented, the patch simply
> checks if the relfilenode is from the same top-level transaction, which
> it is, and sets the flag to "true". So we know the sequence changes need
> to be queued and replayed as part of this transaction.
>
> But then during decoding, we still queue the changes into the subxact,
> which then aborts, and the changes are discarded. That is not how it's
> supposed to work, because the new relfilenode is still valid, someone
> might do nextval() and commit. And the nextval() may not get WAL-logged,
> so we'd lose this.
>
> What I guess we might do is log not just a boolean flag, but the XID of
> the subtransaction that created the relfilenode. And then during
> decoding we'd queue the changes into this subtransaction ...
>
> 0006 in the attached patch series does this, and it seems to fix the TAP
> test failure. I left it at the end, to make it easier to run tests
> without the patch applied.
>

Offhand, I don't have any better idea than what you have suggested for
the problem, but this needs some thought, including the questions asked
by you. I'll spend some time on it and respond back.

--
With Regards,
Amit Kapila.
On 11/28/23 12:32, Amit Kapila wrote: > On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I spent a bit of time looking at the proposed change, and unfortunately >> logging just the boolean flag does not work. A good example is this bit >> from a TAP test added by the patch for built-in replication (which was >> not included with the WIP patch): >> >> BEGIN; >> ALTER SEQUENCE s RESTART WITH 1000; >> SAVEPOINT sp1; >> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100); >> ROLLBACK TO sp1; >> COMMIT; >> >> This is expected to produce: >> >> 1131|0|t >> >> but produces >> >> 1000|0|f >> >> instead. The reason is very simple - as implemented, the patch simply >> checks if the relfilenode is from the same top-level transaction, which >> it is, and sets the flag to "true". So we know the sequence changes need >> to be queued and replayed as part of this transaction. >> >> But then during decoding, we still queue the changes into the subxact, >> which then aborts, and the changes are discarded. That is not how it's >> supposed to work, because the new relfilenode is still valid, someone >> might do nextval() and commit. And the nextval() may not get WAL-logged, >> so we'd lose this. >> >> What I guess we might do is log not just a boolean flag, but the XID of >> the subtransaction that created the relfilenode. And then during >> decoding we'd queue the changes into this subtransaction ... >> >> 0006 in the attached patch series does this, and it seems to fix the TAP >> test failure. I left it at the end, to make it easier to run tests >> without the patch applied. >> > > Offhand, I don't have any better idea than what you have suggested for > the problem but this needs some thoughts including the questions asked > by you. I'll spend some time on it and respond back. > I've been experimenting with the idea to log the XID, and for a moment I was worried it actually can't work, because subtransactions may not actually be just nested in simple way, but form a tree. And what if the sequence was altered in a different branch (sibling subxact), not in the immediate parent. In which case the new SubTransactionGetXid() would fail, because it just walks the current chain of subtransactions. I've been thinking about cases like this: BEGIN; CREATE SEQUENCE s; # XID 1000 SELECT alter_sequence(); # XID 1001 SAVEPOINT s1; SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000 ROLLBACK TO s1; SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000 COMMIT; The XID values are what the sequence wal record will reference, assuming that the main transaction XID is 1000. Initially, I thought it's wrong that the nextval() calls reference XID of the main transaction, because the last relfilenode comes from 1001, which is the subxact created by alter_sequence() thanks to the exception handling block. And that's where the approach in reorderbuffer would queue the changes. But I think this is actually correct too. When a subtransaction commits (e.g. when alter_sequence() completes), it essentially becomes part of the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid accordingly, setting it to parentSubid. This also means that SubTransactionGetXid() can't actually fail, because the ID has to reference an active subtransaction in the current stack. 
I'm still concerned about the cost of the lookup, because the list may be long and the subxact we're looking for may be quite deep in it, but I guess we might add another field, caching the XID. It'd need to be updated only in AtEOSubXact_cleanup, and at that point we know it's the immediate parent, so it'd be pretty cheap, I think. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
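To make the lookup discussed above concrete, here is a rough sketch of what such a SubTransactionGetXid() might look like. This is an illustration of the idea only, not the actual patch code; it assumes it lives in xact.c, where the transaction stack (CurrentTransactionState) is visible:

/*
 * Illustration only: resolve a SubTransactionId to the XID of the
 * (sub)transaction that owns it, by walking the active transaction
 * stack. Because aborted subxacts are popped off the stack, and
 * committed ones propagate rd_newRelfilelocatorSubid to their parent
 * in AtEOSubXact_cleanup, the subid always references an entry that
 * is still on the stack.
 */
static TransactionId
SubTransactionGetXid(SubTransactionId subid)
{
	TransactionState s;

	for (s = CurrentTransactionState; s != NULL; s = s->parent)
	{
		if (s->subTransactionId == subid)
			return XidFromFullTransactionId(s->fullTransactionId);
	}

	elog(ERROR, "subtransaction %u is not on the current stack", subid);

	return InvalidTransactionId;	/* keep compiler quiet */
}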
Hi, I have been hacking on implementing the improvements outlined in my preceding e-mail, but I have some bad news - I ran into an issue that I don't know how to solve :-( Consider this transaction: BEGIN; ALTER SEQUENCE s RESTART 1000; SAVEPOINT s1; ALTER SEQUENCE s RESTART 2000; ROLLBACK TO s1; INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); COMMIT; If you try this with the approach relying on rd_newRelfilelocatorSubid and rd_createSubid, it fails like this on the subscriber: ERROR: could not map filenode "base/5/16394" to relation OID This happens because ReorderBufferQueueSequence tries to do this in the non-transactional branch: reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); and the relfilenode is the one created by the first ALTER. But this is obviously wrong - the changes should have been treated as transactional, because they are tied to the first ALTER. So how did we get there? Well, the whole problem is that in case of abort, AtEOSubXact_cleanup resets the two fields to InvalidSubTransactionId. Which means the rollback in the above transaction also forgets about the first ALTER. Now that I look at the RelationData comments, they actually describe exactly this situation: * * rd_newRelfilelocatorSubid is the ID of the highest subtransaction * the most-recent relfilenumber change has survived into or zero if * not changed in the current transaction (or we have forgotten * changing it). This field is accurate when non-zero, but it can be * zero when a relation has multiple new relfilenumbers within a * single transaction, with one of them occurring in a subsequently * aborted subtransaction, e.g. * BEGIN; * TRUNCATE t; * SAVEPOINT save; * TRUNCATE t; * ROLLBACK TO save; * -- rd_newRelfilelocatorSubid is now forgotten * The root of this problem is that we'd need some sort of "history" for the field, so that when a subxact aborts, we can restore the previous value. But we obviously don't have that, and I doubt we want to add that to relcache - for example, it'd either need to impose some limit on the history (and thus a failure when we reach the limit), or it'd need to handle histories of arbitrary length. At this point I don't see a solution for this, which means the best way forward with the sequence decoding patch seems to be the original approach, on the decoding side. I'm attaching the patch with 0005 and 0006, adding two simple tests (no other changes compared to yesterday's version). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v20231128-0001-Logical-decoding-of-sequences.patch
- v20231128-0002-tweak-ReorderBufferSequenceIsTransactional.patch
- v20231128-0003-Add-decoding-of-sequences-to-test_decoding.patch
- v20231128-0004-Add-decoding-of-sequences-to-built-in-repl.patch
- v20231128-0005-subxact-alter-rollback-test.patch
- v20231128-0006-subxact-test.patch
- v20231128-0007-WIP-add-is_transactional-attribute-in-xl_s.patch
- v20231128-0008-log-XID-instead-of-a-boolean-flag.patch
On 11/27/23 23:06, Peter Smith wrote: > FWIW, here are some more minor review comments for v20231127-3-0001 > > ====== > doc/src/sgml/logicaldecoding.sgml > > 1. > + The <parameter>txn</parameter> parameter contains meta information about > + the transaction the sequence change is part of. Note however that for > + non-transactional updates, the transaction may be NULL, depending on > + if the transaction already has an XID assigned. > + The <parameter>sequence_lsn</parameter> has the WAL location of the > + sequence update. <parameter>transactional</parameter> says if the > + sequence has to be replayed as part of the transaction or directly. > > /says if/specifies whether/ > Will fix. > ====== > src/backend/commands/sequence.c > > 2. DecodeSeqTuple > > + memcpy(((char *) tuple->tuple.t_data), > + data + sizeof(xl_seq_rec), > + SizeofHeapTupleHeader); > + > + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader, > + data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader, > + datalen); > > Maybe I am misreading but isn't this just copying 2 contiguous pieces > of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen) > achieve the same? > You're right, will fix. I think the code looked different before, got simplified, and I didn't notice this can be a single memcpy(). > ====== > .../replication/logical/reorderbuffer.c > > 3. > + * To decide if a sequence change is transactional, we maintain a hash > + * table of relfilenodes created in each (sub)transactions, along with > + * the XID of the (sub)transaction that created the relfilenode. The > + * entries from substransactions are copied to the top-level transaction > + * to make checks cheaper. The hash table gets cleaned up when the > + * transaction completes (commit/abort). > > /substransactions/subtransactions/ > Will fix. > ~~~ > > 4. > + * A naive approach would be to just loop through all transactions and check > + * each of them, but there may be (easily thousands) of subtransactions, and > + * the check happens for each sequence change. So this could be very costly. > > /may be (easily thousands) of/may be (easily thousands of)/ > > ~~~ Thanks. I've reworded this to ... may be many (easily thousands of) subtransactions ... > > 5. ReorderBufferSequenceCleanup > > + while ((ent = (ReorderBufferSequenceEnt *) hash_seq_search(&scan_status)) != NULL) > + { > + (void) hash_search(txn->toptxn->sequences_hash, > + (void *) &ent->rlocator, > + HASH_REMOVE, NULL); > + } > > Typically, other HASH_REMOVE code I saw would check result for NULL to > give elog(ERROR, "hash table corrupted"); > Good point, I'll add the error check. > ~~~ > > 6. ReorderBufferQueueSequence > > + if (xid != InvalidTransactionId) > + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > How about using the macro: TransactionIdIsValid > Actually, as I wrote in some other message, I think the check is not necessary. Or rather, it should be an assert that the XID is valid. And yeah, the macro is a good idea. > ~~~ > > 7. ReorderBufferQueueSequence > > + if (reloid == InvalidOid) > + elog(ERROR, "could not map filenode \"%s\" to relation OID", > + relpathperm(rlocator, > + MAIN_FORKNUM)); > > How about using the macro: OidIsValid > I chose to keep this consistent with other places in reorderbuffer, and all of them use the equality check. > ~~~ > > 8. > + /* > + * Calculate the first value of the next batch (at which point we > + * generate and decode another WAL record. > + */ > > Missing ')' > Will fix. > ~~~ > > 9. 
ReorderBufferAddRelFileLocator > > + /* > + * We only care about sequence relfilenodes for now, and those always have > + * a XID. So if there's no XID, don't bother adding them to the hash. > + */ > + if (xid == InvalidTransactionId) > + return; > > How about using the macro: TransactionIdIsValid > Will change. > ~~~ > > 10. ReorderBufferProcessTXN > > + if (reloid == InvalidOid) > + elog(ERROR, "could not map filenode \"%s\" to relation OID", > + relpathperm(change->data.sequence.locator, > + MAIN_FORKNUM)); > > How about using the macro: OidIsValid > Same as the other Oid check - consistency. > ~~~ > > 11. ReorderBufferChangeSize > > + if (tup) > + { > + sz += sizeof(HeapTupleData); > + len = tup->tuple.t_len; > + sz += len; > + } > > Why is the 'sz' increment split into 2 parts? > Because the other branches in ReorderBufferChangeSize do it that way. You're right it might be coded on a single line. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
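Regarding review item 2 above (DecodeSeqTuple copying two contiguous pieces), the merged copy might look like this - a sketch using the variable names from the quoted snippet:

/*
 * The two memcpy() calls read from adjacent source offsets and write
 * to adjacent destination offsets, so a single call is equivalent:
 */
memcpy((char *) tuple->tuple.t_data,
       data + sizeof(xl_seq_rec),
       SizeofHeapTupleHeader + datalen);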
Hi! Considering my findings about issues with the rd_newRelfilelocatorSubid field and how it makes that approach impossible, I decided to rip out those patches, and go back to the approach where reorderbuffer tracks new relfilenodes. This means the open questions I listed two days ago disappear, because all of that was about the alternative approach. I've also added a couple more tests into 034_sequences.pl, testing the basic cases with subtransactions that rollback (or not), etc. The attached patch also addresses the review comments by Peter Smith. The one remaining open question is ReorderBufferSequenceIsTransactional and whether it can do better than searching through all top-level transactions. The idea of 0002 was to only search the current top-level xact, but Amit pointed out we can't rely on seeing the assignment until we know we're in a consistent snapshot. I've yet to do some tests to measure how expensive this lookup can be in practice. But let's assume it's measurable and significant enough to matter. I wonder if we could salvage this optimization somehow. I'm thinking about three options: 1) Could ReorderBufferSequenceIsTransactional check the snapshot is already consistent etc. and use the optimized variant (looking only at the same top-level xact) in that case? And if not, fall back to searching all top-level xacts. In practice, the full search would be used only for a short initial period. 2) We could also make ReorderBufferSequenceIsTransactional always check the same top-level transaction first and then fall back, no matter whether the snapshot is consistent or not. The problem is this doesn't really optimize the common case where there are no new relfilenodes, so we won't find a match in the top-level xact, and will always search everything anyway. 3) Alternatively, we could maintain a global hash table, instead of in the top-level transaction (a rough sketch of this variant follows below). So there'd always be two copies, one in the xact itself and then in the global hash. Now there's either one (in current top-level xact), or two (subxact + top-level xact). I kinda like (3), because it just works and doesn't require the snapshot being consistent etc. Opinions? -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
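To make option (3) a bit more concrete, here is a minimal sketch of the global variant. The names are illustrative, reusing ReorderBufferSequenceEnt from the patch; the rb->sequences_hash field is hypothetical:

/*
 * Sketch of option (3): one reorderbuffer-wide hash of relfilenodes
 * created by any in-progress transaction, so the check is a single
 * lookup regardless of how many top-level transactions exist.
 */
typedef struct ReorderBufferSequenceEnt
{
	RelFileLocator rlocator;	/* hash key: the new relfilenode */
	TransactionId xid;			/* (sub)xact that created it */
} ReorderBufferSequenceEnt;

static bool
ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
									 RelFileLocator rlocator)
{
	bool		found;

	(void) hash_search(rb->sequences_hash, &rlocator, HASH_FIND, &found);

	/* found => relfilenode was created in this decoding session */
	return found;
}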
On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I have been hacking on improving the improvements outlined in my > preceding e-mail, but I have some bad news - I ran into an issue that I > don't know how to solve :-( > > Consider this transaction: > > BEGIN; > ALTER SEQUENCE s RESTART 1000; > > SAVEPOINT s1; > ALTER SEQUENCE s RESTART 2000; > ROLLBACK TO s1; > > INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); > COMMIT; > > If you try this with the approach relying on rd_newRelfilelocatorSubid > and rd_createSubid, it fails like this on the subscriber: > > ERROR: could not map filenode "base/5/16394" to relation OID > > This happens because ReorderBufferQueueSequence tries to do this in the > non-transactional branch: > > reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); > > and the relfilenode is the one created by the first ALTER. But this is > obviously wrong - the changes should have been treated as transactional, > because they are tied to the first ALTER. So how did we get there? > > Well, the whole problem is that in case of abort, AtEOSubXact_cleanup > resets the two fields to InvalidSubTransactionId. Which means the > rollback in the above transaction also forgets about the first ALTER. > Now that I look at the RelationData comments, it actually describes > exactly this situation: > > * > * rd_newRelfilelocatorSubid is the ID of the highest subtransaction > * the most-recent relfilenumber change has survived into or zero if > * not changed in the current transaction (or we have forgotten > * changing it). This field is accurate when non-zero, but it can be > * zero when a relation has multiple new relfilenumbers within a > * single transaction, with one of them occurring in a subsequently > * aborted subtransaction, e.g. > * BEGIN; > * TRUNCATE t; > * SAVEPOINT save; > * TRUNCATE t; > * ROLLBACK TO save; > * -- rd_newRelfilelocatorSubid is now forgotten > * > > The root of this problem is that we'd need some sort of "history" for > the field, so that when a subxact aborts, we can restore the previous > value. But we obviously don't have that, and I doubt we want to add that > to relcache - for example, it'd either need to impose some limit on the > history (and thus a failure when we reach the limit), or it'd need to > handle histories of arbitrary length. > Yeah, I think that would be really tricky and we may not want to go there. > At this point I don't see a solution for this, which means the best way > forward with the sequence decoding patch seems to be the original > approach, on the decoding side. > One thing that worries me about that approach is that it can suck with the workload that has a lot of DDLs that create XLOG_SMGR_CREATE records. We have previously fixed some such workloads in logical decoding where decoding a transaction containing truncation of a table with a lot of partitions (1000 or more) used to take a very long time. Don't we face performance issues in such scenarios? How do we see this work w.r.t to some sort of global sequences? There is some recent discussion where I have raised a similar point [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com -- With Regards, Amit Kapila.
On 11/29/23 14:42, Amit Kapila wrote: > On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I have been hacking on implementing the improvements outlined in my >> preceding e-mail, but I have some bad news - I ran into an issue that I >> don't know how to solve :-( >> >> Consider this transaction: >> >> BEGIN; >> ALTER SEQUENCE s RESTART 1000; >> >> SAVEPOINT s1; >> ALTER SEQUENCE s RESTART 2000; >> ROLLBACK TO s1; >> >> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); >> COMMIT; >> >> If you try this with the approach relying on rd_newRelfilelocatorSubid >> and rd_createSubid, it fails like this on the subscriber: >> >> ERROR: could not map filenode "base/5/16394" to relation OID >> >> This happens because ReorderBufferQueueSequence tries to do this in the >> non-transactional branch: >> >> reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); >> >> and the relfilenode is the one created by the first ALTER. But this is >> obviously wrong - the changes should have been treated as transactional, >> because they are tied to the first ALTER. So how did we get there? >> >> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup >> resets the two fields to InvalidSubTransactionId. Which means the >> rollback in the above transaction also forgets about the first ALTER. >> Now that I look at the RelationData comments, they actually describe >> exactly this situation: >> >> * >> * rd_newRelfilelocatorSubid is the ID of the highest subtransaction >> * the most-recent relfilenumber change has survived into or zero if >> * not changed in the current transaction (or we have forgotten >> * changing it). This field is accurate when non-zero, but it can be >> * zero when a relation has multiple new relfilenumbers within a >> * single transaction, with one of them occurring in a subsequently >> * aborted subtransaction, e.g. >> * BEGIN; >> * TRUNCATE t; >> * SAVEPOINT save; >> * TRUNCATE t; >> * ROLLBACK TO save; >> * -- rd_newRelfilelocatorSubid is now forgotten >> * >> >> The root of this problem is that we'd need some sort of "history" for >> the field, so that when a subxact aborts, we can restore the previous >> value. But we obviously don't have that, and I doubt we want to add that >> to relcache - for example, it'd either need to impose some limit on the >> history (and thus a failure when we reach the limit), or it'd need to >> handle histories of arbitrary length. >> > > Yeah, I think that would be really tricky and we may not want to go there. > >> At this point I don't see a solution for this, which means the best way >> forward with the sequence decoding patch seems to be the original >> approach, on the decoding side. >> > > One thing that worries me about that approach is that it can suck with > the workload that has a lot of DDLs that create XLOG_SMGR_CREATE > records. We have previously fixed some such workloads in logical > decoding where decoding a transaction containing truncation of a table > with a lot of partitions (1000 or more) used to take a very long time. > Don't we face performance issues in such scenarios? > I don't think we do, really. We will have to decode the SMGR records and add the relfilenodes to the hash table(s), but I don't think that affects the lookup performance too much. What I think might be a problem is if we have many top-level transactions, especially if those transactions do something that creates a relfilenode. 
Because then we'll have to do a hash_search for each of them, and that might be measurable even if each lookup is O(1). And we do the lookup for every sequence change ... > How do we see this work w.r.t to some sort of global sequences? There > is some recent discussion where I have raised a similar point [1]. > > [1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com > I think those are very different things, even though called "sequences". AFAIK solutions like snowflakeID or UUIDs don't require replication of any shared state (that's kinda the whole point), so I don't see why would it need some special support in logical decoding. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
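To illustrate where that cost would come from, this is roughly what the check described above has to do for every single sequence change - a sketch based on the description in this thread, not the exact patch code:

/*
 * With one relfilenode hash per top-level transaction, deciding whether
 * a sequence change is transactional means probing the hash of every
 * top-level transaction - each lookup is O(1), but the loop is linear
 * in the number of concurrent top-level transactions.
 */
static bool
ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
									 RelFileLocator rlocator)
{
	dlist_iter	iter;

	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
												iter.cur);
		bool		found;

		(void) hash_search(txn->sequences_hash, &rlocator, HASH_FIND, &found);
		if (found)
			return true;
	}

	return false;
}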
On 11/29/23 15:41, Tomas Vondra wrote: > ... >> >> One thing that worries me about that approach is that it can suck with >> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE >> records. We have previously fixed some such workloads in logical >> decoding where decoding a transaction containing truncation of a table >> with a lot of partitions (1000 or more) used to take a very long time. >> Don't we face performance issues in such scenarios? >> > > I don't think we do, really. We will have to decode the SMGR records and > add the relfilenodes to the hash table(s), but I don't think that affects the > lookup performance too much. What I think might be a problem is if we > have many top-level transactions, especially if those transactions do > something that creates a relfilenode. Because then we'll have to do a > hash_search for each of them, and that might be measurable even if each > lookup is O(1). And we do the lookup for every sequence change ... > I did some micro-benchmarking today, trying to identify cases where this would cause unexpected problems, either due to having to maintain all the relfilenodes, or due to having to do hash lookups for every sequence change. But I think it's fine, mostly ... I did all the following tests with 64 clients. I may try more, but even with this there should be a fair number of concurrent transactions, which determines the number of top-level transactions in reorderbuffer. I'll try with more clients tomorrow, but I don't think it'll change stuff. The test is fairly simple - run a particular number of transactions (might be 1000 * 64, or more). And then measure how long it takes to decode the changes using test_decoding. Now, the various workloads I tried: 1) "good case" - small OLTP transactions, a couple nextval('s') calls begin; insert into t values (1); select nextval('s'); insert into t values (1); commit; This is pretty fine, the sequence part of reorderbuffer is really not measurable, it's like 1% of the total CPU time. Which is expected, because we only WAL-log every 32nd increment or so. 2) "good case" - same as (1) but with enough nextval calls to always WAL-log begin; insert into t values (1); select nextval('s') from generate_series(1,40); insert into t values (1); commit; Here sequences are more measurable, it's like 15% of CPU time, but most of that comes from AbortCurrentTransaction() in the non-transactional branch of ReorderBufferQueueSequence. I don't think there's a way around that, and it's entirely unrelated to relfilenodes. The function checking if the change is transactional (ReorderBufferSequenceIsTransactional) is less than 1% of the profile - and this is the version that always walks all top-level transactions. 3) "bad case" - small transactions that generate a lot of relfilenodes select alter_sequence(); where the function is defined like this (I did create 1000 sequences before the test): CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ DECLARE v INT; BEGIN v := 1 + (random() * 999)::int; execute format('alter sequence s%s restart with 1000', v); perform nextval('s'); END; $$ LANGUAGE plpgsql; This performs terribly, but it's entirely unrelated to sequences. Current master has exactly the same problem, if transactions do DDL. Like this, for example: CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ DECLARE v INT; BEGIN v := 1 + (random() * 999)::int; execute format('create table t%s (a int)', v); execute format('drop table t%s', v); insert into t values (1); END; $$ LANGUAGE plpgsql; This has the same impact on master. 
The perf report shows this: --98.06%--pg_logical_slot_get_changes_guts | --97.88%--LogicalDecodingProcessRecord | --97.56%--xact_decode | --97.51%--DecodeCommit | |--91.92%--SnapBuildCommitTxn | | | --91.65%--SnapBuildBuildSnapshot | | | --91.14%--pg_qsort The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot takes so long is because: ----------------- Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) at snapbuild.c:498 498 + sizeof(TransactionId) * builder->committed.xcnt (gdb) p builder->committed.xcnt $4 = 11532 ----------------- And with each iteration it grows by 1. That looks quite weird, possibly a bug worth fixing, but unrelated to this patch. I can't investigate this more at the moment, not sure when/if I'll get to that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
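As a side note, the reason case (1) is so cheap is the WAL batching in nextval() - sequence.c emits one WAL record per SEQ_LOG_VALS (32) values fetched, pre-reserving the rest. Here's a self-contained toy model of that batching (SEQ_LOG_VALS is the real constant from src/backend/commands/sequence.c; everything else is a simulation, not server code):

#include <stdio.h>

#define SEQ_LOG_VALS 32

int
main(void)
{
	long		next = 0;		/* current sequence value */
	int			log_cnt = 0;	/* values covered by the last WAL record */
	int			wal_records = 0;

	for (int i = 0; i < 1000; i++)	/* simulate 1000 nextval() calls */
	{
		if (log_cnt == 0)
		{
			/* allowance exhausted: emit one record reserving a new batch */
			wal_records++;
			log_cnt = SEQ_LOG_VALS;
		}
		next++;
		log_cnt--;
	}

	/* prints "1000 values fetched, 32 WAL records" */
	printf("%ld values fetched, %d WAL records\n", next, wal_records);
	return 0;
}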
On Wed, Nov 29, 2023 at 11:45 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 11/27/23 23:06, Peter Smith wrote: > > FWIW, here are some more minor review comments for v20231127-3-0001 > > > > ====== > > .../replication/logical/reorderbuffer.c > > > > 3. > > + * To decide if a sequence change is transactional, we maintain a hash > > + * table of relfilenodes created in each (sub)transactions, along with > > + * the XID of the (sub)transaction that created the relfilenode. The > > + * entries from substransactions are copied to the top-level transaction > > + * to make checks cheaper. The hash table gets cleaned up when the > > + * transaction completes (commit/abort). > > > > /substransactions/subtransactions/ > > > > Will fix. FYI - I think this typo still exists in the patch v20231128-0001. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > 3) "bad case" - small transactions that generate a lot of relfilenodes > > select alter_sequence(); > > where the function is defined like this (I did create 1000 sequences > before the test): > > CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ > DECLARE > v INT; > BEGIN > v := 1 + (random() * 999)::int; > execute format('alter sequence s%s restart with 1000', v); > perform nextval('s'); > END; > $$ LANGUAGE plpgsql; > > This performs terribly, but it's entirely unrelated to sequences. > Current master has exactly the same problem, if transactions do DDL. > Like this, for example: > > CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ > DECLARE > v INT; > BEGIN > v := 1 + (random() * 999)::int; > execute format('create table t%s (a int)', v); > execute format('drop table t%s', v); > insert into t values (1); > END; > $$ LANGUAGE plpgsql; > > This has the same impact on master. The perf report shows this: > > --98.06%--pg_logical_slot_get_changes_guts > | > --97.88%--LogicalDecodingProcessRecord > | > --97.56%--xact_decode > | > --97.51%--DecodeCommit > | > |--91.92%--SnapBuildCommitTxn > | | > | --91.65%--SnapBuildBuildSnapshot > | | > | --91.14%--pg_qsort > > The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot > takes so long is because: > > ----------------- > Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) > at snapbuild.c:498 > 498 + sizeof(TransactionId) * builder->committed.xcnt > (gdb) p builder->committed.xcnt > $4 = 11532 > ----------------- > > And with each iteration it grows by 1. > Can we somehow avoid this either by keeping DDL-related xacts open or aborting them? Also, will it make any difference to use setval as do_setval() seems to be logging each time? If possible, can you share the scripts? Kuroda-San has access to the performance machine, he may be able to try it as well. -- With Regards, Amit Kapila.
Dear Tomas, > I did some micro-benchmarking today, trying to identify cases where this > would cause unexpected problems, either due to having to maintain all > the relfilenodes, or due to having to do hash lookups for every sequence > change. But I think it's fine, mostly ... > I also did performance tests (especially case 3). First of all, there are some differences from yours. 1. Patch 0002 was reverted because it has an issue. So this test checks whether the refactoring around ReorderBufferSequenceIsTransactional is really needed. 2. Per comments from Amit, I also measured the abort case. In this case, alter_sequence() is called but the transaction is aborted. 3. I measured with a varying number of clients {8, 16, 32, 64, 128}. In all cases, clients executed 1000 transactions. The performance machine has 128 cores, so the result for 128 clients might be saturated. 4. A short sleep (0.1s) was added in alter_sequence(), between "alter sequence" and nextval(), because while testing I found that the transaction is too short to execute in parallel. I think it is reasonable because ReorderBufferSequenceIsTransactional() might get worse as parallelism increases. I attached one backend process via perf and executed pg_logical_slot_get_changes(). The attached txt file shows which functions occupied CPU time, especially from pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional(). Here are my observations about them. * In the commit case, as you said, SnapBuildCommitTxn() seems dominant for the 8-64 clients cases. * For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges() wastes a lot of time. I think this is because the changes exceed logical_decoding_work_mem, so we do not have to analyze this case further. * In the abort case, the CPU time used by ReorderBufferSequenceIsTransactional() grows linearly. This means that we need to think of some solution to avoid the overhead of ReorderBufferSequenceIsTransactional(). ``` 8 clients 3.73% occupied time 16 7.26% 32 15.82% 64 29.14% 128 46.27% ``` * In the abort case, I also checked the CPU time used by ReorderBufferAddRelFileLocator(), but it does not seem to depend much on the number of clients. ``` 8 clients 3.66% occupied time 16 6.94% 32 4.65% 64 5.39% 128 3.06% ``` As a next step, I plan to run the case which uses the setval() function, because it generates more WAL than normal nextval(). What do you think? Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On 11/30/23 12:56, Amit Kapila wrote: > On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> 3) "bad case" - small transactions that generate a lot of relfilenodes >> >> select alter_sequence(); >> >> where the function is defined like this (I did create 1000 sequences >> before the test): >> >> CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ >> DECLARE >> v INT; >> BEGIN >> v := 1 + (random() * 999)::int; >> execute format('alter sequence s%s restart with 1000', v); >> perform nextval('s'); >> END; >> $$ LANGUAGE plpgsql; >> >> This performs terribly, but it's entirely unrelated to sequences. >> Current master has exactly the same problem, if transactions do DDL. >> Like this, for example: >> >> CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ >> DECLARE >> v INT; >> BEGIN >> v := 1 + (random() * 999)::int; >> execute format('create table t%s (a int)', v); >> execute format('drop table t%s', v); >> insert into t values (1); >> END; >> $$ LANGUAGE plpgsql; >> >> This has the same impact on master. The perf report shows this: >> >> --98.06%--pg_logical_slot_get_changes_guts >> | >> --97.88%--LogicalDecodingProcessRecord >> | >> --97.56%--xact_decode >> | >> --97.51%--DecodeCommit >> | >> |--91.92%--SnapBuildCommitTxn >> | | >> | --91.65%--SnapBuildBuildSnapshot >> | | >> | --91.14%--pg_qsort >> >> The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot >> takes so long is because: >> >> ----------------- >> Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) >> at snapbuild.c:498 >> 498 + sizeof(TransactionId) * builder->committed.xcnt >> (gdb) p builder->committed.xcnt >> $4 = 11532 >> ----------------- >> >> And with each iteration it grows by 1. >> > > Can we somehow avoid this either by keeping DDL-related xacts open or > aborting them? I'm not sure why the snapshot builder does this, i.e. why we end up accumulating that many xids, and I didn't have time to look closer. So I don't know if this would be a solution or not. > Also, will it make any difference to use setval as > do_setval() seems to be logging each time? > I think that's pretty much what case (2) does, as it calls nextval() enough times for each transaction to generate WAL. But I don't think this is a very sensible benchmark - it's an extreme case, but practical cases are far closer to case (1) because sequences are intermixed with other activity. No one really does just nextval() calls. > If possible, can you share the scripts? Kuroda-San has access to the > performance machine, he may be able to try it as well. > Sure, attached. But it's a very primitive script, nothing fancy. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 12/1/23 12:08, Hayato Kuroda (Fujitsu) wrote: > Dear Tomas, > >> I did some micro-benchmarking today, trying to identify cases where this >> would cause unexpected problems, either due to having to maintain all >> the relfilenodes, or due to having to do hash lookups for every sequence >> change. But I think it's fine, mostly ... >> > > I also did performance tests (especially case 3). First of all, there are some > differences from yours. > > 1. Patch 0002 was reverted because it has an issue. So this test checks whether > the refactoring around ReorderBufferSequenceIsTransactional is really needed. FWIW I also did the benchmarks without the 0002 patch, for the same reason. I forgot to mention that. > 2. Per comments from Amit, I also measured the abort case. In this case, > alter_sequence() is called but the transaction is aborted. > 3. I measured with a varying number of clients {8, 16, 32, 64, 128}. In all cases, > clients executed 1000 transactions. The performance machine has 128 cores, so the > result for 128 clients might be saturated. > 4. A short sleep (0.1s) was added in alter_sequence(), between > "alter sequence" and nextval(), because while testing I found that the > transaction is too short to execute in parallel. I think it is reasonable > because ReorderBufferSequenceIsTransactional() might get worse as parallelism > increases. > > I attached one backend process via perf and executed pg_logical_slot_get_changes(). > The attached txt file shows which functions occupied CPU time, especially from > pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional(). > Here are my observations about them. > > * In the commit case, as you said, SnapBuildCommitTxn() seems dominant for the 8-64 > clients cases. > * For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges() wastes > a lot of time. I think this is because the changes exceed logical_decoding_work_mem, > so we do not have to analyze this case further. > * In the abort case, the CPU time used by ReorderBufferSequenceIsTransactional() grows > linearly. This means that we need to think of some solution to avoid the overhead of > ReorderBufferSequenceIsTransactional(). > > ``` > 8 clients 3.73% occupied time > 16 7.26% > 32 15.82% > 64 29.14% > 128 46.27% > ``` Interesting, so what exactly does the transaction do? Anyway, I don't think this is very surprising - I believe it behaves like this because of having to search in many hash tables (one in each toplevel xact). And I think the solution I explained before (maintaining a single toplevel hash, instead of many per-top-level hashes) would address that. FWIW I find this case interesting, but not very practical, because no practical workload has that many aborts. > > * In the abort case, I also checked the CPU time used by ReorderBufferAddRelFileLocator(), but > it does not seem to depend much on the number of clients. > > ``` > 8 clients 3.66% occupied time > 16 6.94% > 32 4.65% > 64 5.39% > 128 3.06% > ``` > > As a next step, I plan to run the case which uses the setval() function, because it > generates more WAL than normal nextval(). > What do you think? Sure, although I don't think it's much different from the test selecting 40 values from the sequence (in each transaction). That generates about the same amount of WAL. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Dear Tomas, > > I also did performance tests (especially case 3). First of all, there are some > > differences from yours. > > > > 1. Patch 0002 was reverted because it has an issue. So this test checks whether > > the refactoring around ReorderBufferSequenceIsTransactional is really > needed. > > FWIW I also did the benchmarks without the 0002 patch, for the same > reason. I forgot to mention that. Oh, good news. So your benchmarks are quite meaningful. > > Interesting, so what exactly does the transaction do? It is quite simple - PSA the script file. It was executed with 64 clients. The definition of alter_sequence() is the same as you said. (I used a normal bash script to run them, but your approach may be smarter) > Anyway, I don't > think this is very surprising - I believe it behaves like this because > of having to search in many hash tables (one in each toplevel xact). And > I think the solution I explained before (maintaining a single toplevel > hash, instead of many per-top-level hashes) would address that. Agreed. And I can benchmark again for the new ones, maybe once we decide on a new approach. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On 12/3/23 13:55, Hayato Kuroda (Fujitsu) wrote: > Dear Tomas, > >>> I also did performance tests (especially case 3). First of all, there are some >>> differences from yours. >>> >>> 1. Patch 0002 was reverted because it has an issue. So this test checks whether >>> the refactoring around ReorderBufferSequenceIsTransactional is really >> needed. >> >> FWIW I also did the benchmarks without the 0002 patch, for the same >> reason. I forgot to mention that. > > Oh, good news. So your benchmarks are quite meaningful. > >> >> Interesting, so what exactly does the transaction do? > > It is quite simple - PSA the script file. It was executed with 64 clients. > The definition of alter_sequence() is the same as you said. > (I used a normal bash script to run them, but your approach may be smarter) > >> Anyway, I don't >> think this is very surprising - I believe it behaves like this because >> of having to search in many hash tables (one in each toplevel xact). And >> I think the solution I explained before (maintaining a single toplevel >> hash, instead of many per-top-level hashes) would address that. > > Agreed. And I can benchmark again for the new ones, maybe once we decide > on a new approach. > Thanks for the script. Are you also measuring the time it takes to decode this using test_decoding? FWIW I did a more comprehensive suite of tests over the weekend, with a couple more variations. I'm attaching the updated scripts, running it should be as simple as ./run.sh BRANCH TRANSACTIONS RUNS so perhaps ./run.sh master 1000 3 to do 3 runs with 1000 transactions per client. And it'll run a bunch of combinations hard-coded in the script, and write the timings into a CSV file (with "master" in each row). I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I did this with current master, the basic patch (without the 0002 part), and then with the optimized approach (single global hash table, see the 0004 part). That's what master / patched / optimized in the results is. Interestingly enough, the i5 handled this much faster, it seems to be better in single-core tasks. The xeon is still running, so the results for "optimized" only have one run (out of 3), but shouldn't change much. Attached is also a table summarizing this, and visualizing the timing change (vs. master) in the last couple columns. Green is "faster" than master (but we don't really expect that), and "red" means slower than master (the more red, the slower). The results are grouped by script (see the attached .tgz), with either 32 or 96 clients (which does affect the timing, but not between master and patch). Some executions have no pg_sleep() calls, some have 0.001 wait (but that doesn't seem to make much difference). Overall, I'd group the results into about three groups: 1) good cases [nextval, nextval-40, nextval-abort] These are cases that slow down a bit, but the slowdown is mostly within reasonable bounds (we're making the decoding do more stuff, so it'd be a bit silly to require that extra work to have no impact). And I do think this is reasonable, because this is pretty much an extreme / worst case behavior. People don't really do just nextval() calls, without doing anything else. Not to mention doing aborts for 100% of transactions. So in practice this is going to be within noise (and in those cases the results even show speedup, which seems a bit surprising). It's somewhat dependent on CPU too - on xeon there's hardly any regression. 
2) nextval-40-abort Here the slowdown is clear, but I'd argue it generally falls in the same group as (1). Yes, I'd be happier if it didn't behave like this, but if someone can show me a practical workload affected by this ... 3) irrelevant cases [all the alters taking insane amounts of time] I absolutely refuse to care about these extreme cases where decoding 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on xeon). If this was a problem for some practical workload, we'd have already heard about it I guess. And even if there was such a workload, it wouldn't be up to this patch to fix that. There's clearly something misbehaving in the snapshot builder. I was hopeful the global hash table would be an improvement, but that doesn't seem to be the case. I haven't done much profiling yet, but I'd guess most of the overhead is due to ReorderBufferQueueSequence() starting and aborting a transaction in the non-transactional case. Which is unfortunate, but I don't know if there's a way to optimize that. Some time ago I floated the idea of maybe "queuing" the sequence changes and only replaying them on the next commit, somehow. But we ran into problems with which snapshot to use that I didn't know how to solve. Maybe we should try again. The idea is we'd queue the non-transactional changes somewhere (can't be in the transaction, because we must keep them even if it aborts), and then "inject" them into the next commit. That'd mean we wouldn't do the separate start/abort for each change. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 12/3/23 18:52, Tomas Vondra wrote: > ... > > Some time ago I floated the idea of maybe "queuing" the sequence changes > and only replay them on the next commit, somehow. But we did ran into > problems with which snapshot to use, that I didn't know how to solve. > Maybe we should try again. The idea is we'd queue the non-transactional > changes somewhere (can't be in the transaction, because we must keep > them even if it aborts), and then "inject" them into the next commit. > That'd mean we wouldn't do the separate start/abort for each change. > Another idea is that maybe we could somehow inform ReorderBuffer whether the output plugin even is interested in sequences. That'd help with cases where we don't even want/need to replicate sequences, e.g. because the publication does not specify (publish=sequence). What happens now in that case is we call ReorderBufferQueueSequence(), it does the whole dance with starting/aborting the transaction, calls rb->sequence() which just does "meh" and doesn't do anything. Maybe we could just short-circuit this by asking the output plugin somehow. In an extreme case the plugin may not even specify the sequence callbacks, and we're still doing all of this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
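To sketch that short-circuit (the helper and its placement are hypothetical; rb->sequence is the output plugin callback mentioned above):

/*
 * Hypothetical check at the top of ReorderBufferQueueSequence(): if the
 * output plugin never registered a sequence callback, return before
 * doing the start/abort transaction dance or any relfilenode lookups.
 */
static bool
ReorderBufferWantsSequences(ReorderBuffer *rb)
{
	return (rb->sequence != NULL);
}

In practice the flag might also reflect the publication's publish=sequence option, but either way the test stays a cheap boolean check.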
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Thanks for the script. Are you also measuring the time it takes to > decode this using test_decoding? > > FWIW I did more comprehensive suite of tests over the weekend, with a > couple more variations. I'm attaching the updated scripts, running it > should be as simple as > > ./run.sh BRANCH TRANSACTIONS RUNS > > so perhaps > > ./run.sh master 1000 3 > > to do 3 runs with 1000 transactions per client. And it'll run a bunch of > combinations hard-coded in the script, and write the timings into a CSV > file (with "master" in each row). > > I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I > did this with current master, the basic patch (without the 0002 part), > and then with the optimized approach (single global hash table, see the > 0004 part). That's what master / patched / optimized in the results is. > > Interestingly enough, the i5 handled this much faster, it seems to be > better in single-core tasks. The xeon is still running, so the results > for "optimized" only have one run (out of 3), but shouldn't change much. > > Attached is also a table summarizing this, and visualizing the timing > change (vs. master) in the last couple columns. Green is "faster" than > master (but we don't really expect that), and "red" means slower than > master (the more red, the slower). > > There results are grouped by script (see the attached .tgz), with either > 32 or 96 clients (which does affect the timing, but not between master > and patch). Some executions have no pg_sleep() calls, some have 0.001 > wait (but that doesn't seem to make much difference). > > Overall, I'd group the results into about three groups: > > 1) good cases [nextval, nextval-40, nextval-abort] > > These are cases that slow down a bit, but the slowdown is mostly within > reasonable bounds (we're making the decoding to do more stuff, so it'd > be a bit silly to require that extra work to make no impact). And I do > think this is reasonable, because this is pretty much an extreme / worst > case behavior. People don't really do just nextval() calls, without > doing anything else. Not to mention doing aborts for 100% transactions. > > So in practice this is going to be within noise (and in those cases the > results even show speedup, which seems a bit surprising). It's somewhat > dependent on CPU too - on xeon there's hardly any regression. > > > 2) nextval-40-abort > > Here the slowdown is clear, but I'd argue it generally falls in the same > group as (1). Yes, I'd be happier if it didn't behave like this, but if > someone can show me a practical workload affected by this ... > > > 3) irrelevant cases [all the alters taking insane amounts of time] > > I absolutely refuse to care about these extreme cases where decoding > 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on > xeon). If this was a problem for some practical workload, we'd have > already heard about it I guess. And even if there was such workload, it > wouldn't be up to this patch to fix that. There's clearly something > misbehaving in the snapshot builder. > > > I was hopeful the global hash table would be an improvement, but that > doesn't seem to be the case. I haven't done much profiling yet, but I'd > guess most of the overhead is due to ReorderBufferQueueSequence() > starting and aborting a transaction in the non-transactinal case. Which > is unfortunate, but I don't know if there's a way to optimize that. 
> Before discussing the alternative ideas you shared, let me try to clarify my understanding so that we are on the same page. I see two observations based on the testing and discussion we had: (a) for non-transactional cases, the overhead observed is mainly due to starting/aborting a transaction for each change; (b) for transactional cases, we see overhead due to traversing all the top-level txns and checking the hash table for each one to find whether a change is transactional. Am I missing something? -- With Regards, Amit Kapila.
On 12/5/23 13:17, Amit Kapila wrote: > ... >> I was hopeful the global hash table would be an improvement, but that >> doesn't seem to be the case. I haven't done much profiling yet, but I'd >> guess most of the overhead is due to ReorderBufferQueueSequence() >> starting and aborting a transaction in the non-transactional case. Which >> is unfortunate, but I don't know if there's a way to optimize that. >> > > Before discussing the alternative ideas you shared, let me try to > clarify my understanding so that we are on the same page. I see two > observations based on the testing and discussion we had: (a) for > non-transactional cases, the overhead observed is mainly due to > starting/aborting a transaction for each change; Yes, I believe that's true. See the attached profiles for nextval.sql and nextval-40.sql from master and optimized build (with the global hash), and also a perf-diff. I only include the top 1000 lines for each profile, that should be enough. master - current master without patches applied optimized - master + sequence decoding with global hash table For nextval, there's almost no difference in the profile. Decoding the other changes (inserts) is the dominant part, as we only log sequences every 32 increments. For nextval-40, the main increase is likely due to this part |--11.09%--seq_decode | | | |--9.25%--ReorderBufferQueueSequence | | | | | |--3.56%--AbortCurrentTransaction | | | | | | | --3.53%--AbortSubTransaction | | | | | | | |--0.95%--AtSubAbort_Portals | | | | | | | | | --0.83%--hash_seq_search | | | | | | | --0.83%--ResourceOwnerReleaseInternal | | | | | |--2.06%--BeginInternalSubTransaction | | | | | | | --1.10%--CommitTransactionCommand | | | | | | | --1.07%--StartSubTransaction | | | | | |--1.28%--CleanupSubTransaction | | | | | | | --0.64%--AtSubCleanup_Portals | | | | | | | --0.55%--hash_seq_search | | | | | --0.67%--RelidByRelfilenumber So yeah, that's the transaction stuff in ReorderBufferQueueSequence. There's also a perf-diff, comparing individual functions. > (b) for transactional > cases, we see overhead due to traversing all the top-level txns and > checking the hash table for each one to find whether a change is > transactional. > Not really, no. As I explained in my preceding e-mail, this check makes almost no difference - I did expect it to matter, but it doesn't. And I was a bit disappointed the global hash table didn't move the needle. Most of the time is spent in 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) | ---DecodeCommit (inlined) | |--72.65%--SnapBuildCommitTxn | | | --72.61%--SnapBuildBuildSnapshot | | | --72.09%--pg_qsort | | | |--66.24%--pg_qsort | | | And there's almost no difference between master and build with sequence decoding - see the attached diff-alter-sequence.perf, comparing the two branches (perf diff -c delta-abs). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Some time ago I floated the idea of maybe "queuing" the sequence changes > and only replay them on the next commit, somehow. But we did ran into > problems with which snapshot to use, that I didn't know how to solve. > Maybe we should try again. The idea is we'd queue the non-transactional > changes somewhere (can't be in the transaction, because we must keep > them even if it aborts), and then "inject" them into the next commit. > That'd mean we wouldn't do the separate start/abort for each change. Why can't we use the same concept of SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the non-transactional changes (have some base snapshot before the first change), and whenever there is any catalog change, queue new snapshot change also in the queue of the non-transactional sequence change so that while sending it to downstream whenever it is necessary we will change the historic snapshot? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/5/23 13:17, Amit Kapila wrote: > > > (b) for transactional > > cases, we see overhead due to traversing all the top-level txns and > > checking the hash table for each one to find whether a change is > > transactional. > > > > Not really, no. As I explained in my preceding e-mail, this check makes > almost no difference - I did expect it to matter, but it doesn't. And I > was a bit disappointed the global hash table didn't move the needle. > > Most of the time is spent in > > 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) > | > ---DecodeCommit (inlined) > | > |--72.65%--SnapBuildCommitTxn > | | > | --72.61%--SnapBuildBuildSnapshot > | | > | --72.09%--pg_qsort > | | > | |--66.24%--pg_qsort > | | | > > And there's almost no difference between master and build with sequence > decoding - see the attached diff-alter-sequence.perf, comparing the two > branches (perf diff -c delta-abs). > I think in this case the commit time predominates, which hides the overhead. We didn't investigate in detail if that can be improved, but if we look at a similar case with aborts [1], it shows the overhead of ReorderBufferSequenceIsTransactional(). I understand that aborts won't be frequent and it is a sort of unrealistic test, but it still helps to show that there is overhead in ReorderBufferSequenceIsTransactional(). Now, I am not sure if we can ignore that case because, theoretically, the overhead can increase based on the number of top-level transactions. [1]: https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > I was also wondering what happens if the sequence changes are transactional but somehow the snap builder state changes to SNAPBUILD_FULL_SNAPSHOT in between processing of the smgr_decode() and the seq_decode(), which means the RelFileLocator will not be added to the hash table, and during seq_decode() we will consider the change non-transactional. I haven't fully analyzed what the real problem is in this case, but have we considered it? What happens if the transaction having both ALTER SEQUENCE and nextval() gets aborted, but the nextval() has been considered non-transactional because the smgr_decode() changes were not processed, as the snapshot builder state was not yet SNAPBUILD_FULL_SNAPSHOT? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > > Some time ago I floated the idea of maybe "queuing" the sequence changes > > and only replay them on the next commit, somehow. But we did ran into > > problems with which snapshot to use, that I didn't know how to solve. > > Maybe we should try again. The idea is we'd queue the non-transactional > > changes somewhere (can't be in the transaction, because we must keep > > them even if it aborts), and then "inject" them into the next commit. > > That'd mean we wouldn't do the separate start/abort for each change. > > Why can't we use the same concept of > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > non-transactional changes (have some base snapshot before the first > change), and whenever there is any catalog change, queue new snapshot > change also in the queue of the non-transactional sequence change so > that while sending it to downstream whenever it is necessary we will > change the historic snapshot? > Oh, do you mean maintain different historic snapshots and then switch based on the change we are processing? I guess the other thing we need to consider is the order of processing the changes if we maintain separate queues that need to be processed. -- With Regards, Amit Kapila.
On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/3/23 18:52, Tomas Vondra wrote: > > ... > > > > Another idea is that maybe we could somehow inform ReorderBuffer whether > the output plugin even is interested in sequences. That'd help with > cases where we don't even want/need to replicate sequences, e.g. because > the publication does not specify (publish=sequence). > > What happens now in that case is we call ReorderBufferQueueSequence(), > it does the whole dance with starting/aborting the transaction, calls > rb->sequence() which just does "meh" and doesn't do anything. Maybe we > could just short-circuit this by asking the output plugin somehow. > > In an extreme case the plugin may not even specify the sequence > callbacks, and we're still doing all of this. > We could explore this, but I guess it won't solve the problem we are facing in cases where all sequences are published and the plugin has specified the sequence callbacks. I think it would add some overhead for this check in the positive cases, where we decide to send the changes anyway. -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Why can't we use the same concept of > > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > > non-transactional changes (have some base snapshot before the first > > change), and whenever there is any catalog change, queue new snapshot > > change also in the queue of the non-transactional sequence change so > > that while sending it to downstream whenever it is necessary we will > > change the historic snapshot? > > > > Oh, do you mean maintain different historic snapshots and then switch > based on the change we are processing? I guess the other thing we need > to consider is the order of processing the changes if we maintain > separate queues that need to be processed. I mean we will not specifically maintain the historic changes, but if there is any catalog change where we are pushing the snapshot to all the transactions' change queues, at the same time we will push this snapshot in the non-transactional sequence queue as well. I am not sure what the problem with the ordering is, because we will be queueing all non-transactional sequence changes in a separate queue in the order they arrive, and as soon as we process the next commit we will process all the non-transactional changes at that time. Do you see an issue with that? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
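A rough sketch of the data structure this idea implies (entirely hypothetical, just to make the proposal concrete):

/*
 * Hypothetical queue for non-transactional sequence changes: changes
 * and snapshot updates are interleaved in arrival order, so replaying
 * them at the next commit can switch the historic snapshot at the
 * right points (mirroring what SnapBuildDistributeNewCatalogSnapshot
 * does for per-transaction change queues).
 */
typedef enum SeqQueueKind
{
	SEQ_QUEUE_CHANGE,			/* a non-transactional sequence change */
	SEQ_QUEUE_SNAPSHOT			/* switch historic snapshot from here on */
} SeqQueueKind;

typedef struct SeqQueueEntry
{
	SeqQueueKind kind;
	union
	{
		ReorderBufferChange *change;	/* SEQ_QUEUE_CHANGE */
		Snapshot	snapshot;			/* SEQ_QUEUE_SNAPSHOT */
	}			data;
	dlist_node	node;			/* kept in arrival order */
} SeqQueueEntry;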
On 12/6/23 10:05, Dilip Kumar wrote: > On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> > > I was also wondering what happens if the sequence changes are > transactional but somehow the snap builder state changes to > SNAPBUILD_FULL_SNAPSHOT in between processing of the smgr_decode() and > the seq_decode(), which means the RelFileLocator will not be added to the > hash table, and during seq_decode() we will consider the change > non-transactional. I haven't fully analyzed what the real problem is in > this case, but have we considered it? What happens if > the transaction having both ALTER SEQUENCE and nextval() gets aborted, > but the nextval() has been considered non-transactional because > the smgr_decode() changes were not processed, as the snapshot builder state > was not yet SNAPBUILD_FULL_SNAPSHOT? > Yes, if something like this happens, that'd be a problem: 1) decoding starts, with SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT 2) transaction that creates a new relfilenode gets decoded, but we skip it because we don't have the correct snapshot 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT 4) we decode a sequence change from nextval() for the sequence This would lead to us attempting to apply a sequence change for a relfilenode that's not visible yet (and may even get aborted). But can this even happen? Can we start decoding in the middle of a transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, which is also skipped until SNAPBUILD_FULL_SNAPSHOT? Or logical messages, where we also call the output plugin in non-transactional cases? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/6/23 12:05, Dilip Kumar wrote: > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >>> Why can't we use the same concept of >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the >>> non-transactional changes (have some base snapshot before the first >>> change), and whenever there is any catalog change, queue new snapshot >>> change also in the queue of the non-transactional sequence change so >>> that while sending it to downstream whenever it is necessary we will >>> change the historic snapshot? >>> >> >> Oh, do you mean maintain different historic snapshots and then switch >> based on the change we are processing? I guess the other thing we need >> to consider is the order of processing the changes if we maintain >> separate queues that need to be processed. > > I mean we will not specifically maintain the historic changes, but if > there is any catalog change where we are pushing the snapshot to all > the transaction's change queue, at the same time we will push this > snapshot in the non-transactional sequence queue as well. I am not > sure what is the problem with the ordering? because we will be > queueing all non-transactional sequence changes in a separate queue in > the order they arrive and as soon as we process the next commit we > will process all the non-transactional changes at that time. Do you > see issue with that? > Isn't this (in principle) the idea of queuing the non-transactional changes and then applying them on the next commit? Yes, I didn't get very far with that, but I got stuck exactly on tracking which snapshot to use, so if there's a way to do that, that'd fix my issue. Also, would this mean we don't need to track the relfilenodes, if we're able to query the catalog? Would we be able to check if the relfilenode was created by the current xact? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/6/23 11:19, Amit Kapila wrote: > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 12/3/23 18:52, Tomas Vondra wrote: >>> ... >>> >> >> Another idea is that maybe we could somehow inform ReorderBuffer whether >> the output plugin even is interested in sequences. That'd help with >> cases where we don't even want/need to replicate sequences, e.g. because >> the publication does not specify (publish=sequence). >> >> What happens now in that case is we call ReorderBufferQueueSequence(), >> it does the whole dance with starting/aborting the transaction, calls >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we >> could just short-circuit this by asking the output plugin somehow. >> >> In an extreme case the plugin may not even specify the sequence >> callbacks, and we're still doing all of this. >> > > We could explore this but I guess it won't solve the problem we are > facing in cases where all sequences are published and plugin has > specified the sequence callbacks. I think it would add some overhead > of this check in positive cases where we decide to anyway do send the > changes. Well, the idea is the check would be very simple (essentially just a boolean flag somewhere), so not really measurable. And if the plugin requests decoding sequences, I guess it's natural it may have a bit of overhead. It needs to do more things, after all. It needs to be acceptable, ofc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
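The early-exit check being discussed could look roughly like this minimal standalone model (ReorderBuffer and ReorderBufferQueueSequence mirror the names used in the thread; everything else is assumed for illustration):

```
#include <stddef.h>
#include <stdio.h>

typedef void (*sequence_cb) (const char *seqname, long last_value);

typedef struct ReorderBuffer
{
	sequence_cb sequence;		/* NULL if the plugin has no sequence callback */
} ReorderBuffer;

static void
ReorderBufferQueueSequence(ReorderBuffer *rb, const char *seqname, long last_value)
{
	/* the proposed cheap test: bail out before any transaction dance */
	if (rb->sequence == NULL)
		return;

	/* ... otherwise: start a xact, build the tuple, call the callback, abort ... */
	rb->sequence(seqname, last_value);
}

int
main(void)
{
	ReorderBuffer rb = {NULL};

	ReorderBufferQueueSequence(&rb, "s", 42);	/* returns immediately */
	printf("plugin not interested in sequences, decoding work skipped\n");
	return 0;
}
```

The point is that the negative case costs a single pointer test, while plugins that do register the callback pay the full (and then unavoidable) cost.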
On 12/6/23 09:56, Amit Kapila wrote: > On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 12/5/23 13:17, Amit Kapila wrote: >> >>> (b) for transactional >>> cases, we see overhead due to traversing all the top-level txns and >>> check the hash table for each one to find whether change is >>> transactional. >>> >> >> Not really, no. As I explained in my preceding e-mail, this check makes >> almost no difference - I did expect it to matter, but it doesn't. And I >> was a bit disappointed the global hash table didn't move the needle. >> >> Most of the time is spent in >> >> 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) >> | >> ---DecodeCommit (inlined) >> | >> |--72.65%--SnapBuildCommitTxn >> | | >> | --72.61%--SnapBuildBuildSnapshot >> | | >> | --72.09%--pg_qsort >> | | >> | |--66.24%--pg_qsort >> | | | >> >> And there's almost no difference between master and build with sequence >> decoding - see the attached diff-alter-sequence.perf, comparing the two >> branches (perf diff -c delta-abs). >> > > I think in this the commit time predominates which hides the overhead. > We didn't investigate in detail if that can be improved but if we see > a similar case of abort [1], it shows the overhead of > ReorderBufferSequenceIsTransactional(). I understand that aborts won't > be frequent and it is sort of unrealistic test but still helps to show > that there is overhead in ReorderBufferSequenceIsTransactional(). Now, > I am not sure if we can ignore that case because theoretically, the > overhead can increase based on the number of top-level transactions. > > [1]: https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com > But those profiles were with the "old" patch, with one hash table per top-level transaction. I see nothing like that with the patch [1] that replaces that with a single global hash table. With that patch, the ReorderBufferSequenceIsTransactional() took ~0.5% in any tests I did. What did have bigger impact is this: 46.12% 1.47% postgres [.] pg_logical_slot_get_changes_guts | |--45.12%--pg_logical_slot_get_changes_guts | | | |--42.34%--LogicalDecodingProcessRecord | | | | | |--12.82%--xact_decode | | | | | | | |--9.46%--DecodeAbort (inlined) | | | | | | | | | |--8.44%--ReorderBufferCleanupTXN | | | | | | | | | | | |--3.25%--ReorderBufferSequenceCleanup (in) | | | | | | | | | | | | | |--1.59%--hash_seq_search | | | | | | | | | | | | | |--0.80%--hash_search_with_hash_value | | | | | | | | | | | | | --0.59%--hash_search | | | | | | hash_bytes I guess that could be optimized, but it's also a direct consequence of the huge number of aborts for transactions that create relfilenode. For any other workload this will be negligible. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 6, 2023 at 7:20 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/6/23 11:19, Amit Kapila wrote: > > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 12/3/23 18:52, Tomas Vondra wrote: > >>> ... > >>> > >> > >> Another idea is that maybe we could somehow inform ReorderBuffer whether > >> the output plugin even is interested in sequences. That'd help with > >> cases where we don't even want/need to replicate sequences, e.g. because > >> the publication does not specify (publish=sequence). > >> > >> What happens now in that case is we call ReorderBufferQueueSequence(), > >> it does the whole dance with starting/aborting the transaction, calls > >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we > >> could just short-circuit this by asking the output plugin somehow. > >> > >> In an extreme case the plugin may not even specify the sequence > >> callbacks, and we're still doing all of this. > >> > > > > We could explore this but I guess it won't solve the problem we are > > facing in cases where all sequences are published and plugin has > > specified the sequence callbacks. I think it would add some overhead > > of this check in positive cases where we decide to anyway do send the > > changes. > > Well, the idea is the check would be very simple (essentially just a > boolean flag somewhere), so not really measurable. > > And if the plugin requests decoding sequences, I guess it's natural it > may have a bit of overhead. It needs to do more things, after all. It > needs to be acceptable, ofc. > I agree with you that if it can be done cheaply or without a measurable overhead then it would be a good idea and can serve other purposes as well. For example, see the discussion in [1]. I had in mind more of what the patch in email [1] is doing, where it needs to start/stop a xact, do relcache access, etc., which it seems can add some overhead if done for each change, though I haven't measured so can't be sure. [1] - https://www.postgresql.org/message-id/CAGfChW5Qo2SrjJ7rU9YYtZbRaWv6v-Z8MJn%3DdQNx4uCSqDEOHA%40mail.gmail.com -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/6/23 12:05, Dilip Kumar wrote: > > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >>> Why can't we use the same concept of > >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > >>> non-transactional changes (have some base snapshot before the first > >>> change), and whenever there is any catalog change, queue new snapshot > >>> change also in the queue of the non-transactional sequence change so > >>> that while sending it to downstream whenever it is necessary we will > >>> change the historic snapshot? > >>> > >> > >> Oh, do you mean maintain different historic snapshots and then switch > >> based on the change we are processing? I guess the other thing we need > >> to consider is the order of processing the changes if we maintain > >> separate queues that need to be processed. > > > > I mean we will not specifically maintain the historic changes, but if > > there is any catalog change where we are pushing the snapshot to all > > the transaction's change queue, at the same time we will push this > > snapshot in the non-transactional sequence queue as well. I am not > > sure what is the problem with the ordering? because we will be > > queueing all non-transactional sequence changes in a separate queue in > > the order they arrive and as soon as we process the next commit we > > will process all the non-transactional changes at that time. Do you > > see issue with that? > > > > Isn't this (in principle) the idea of queuing the non-transactional > changes and then applying them on the next commit? Yes, it is. > Yes, I didn't get > very far with that, but I got stuck exactly on tracking which snapshot > to use, so if there's a way to do that, that'd fix my issue. Thinking more about the snapshot issue: do we even need to bother about changing the snapshot at all while streaming the non-transactional sequence changes, or can we send all the non-transactional changes with a single snapshot? So mainly the snapshot logically gets changed due to these 2 events. Case 1: when any transaction which has done a catalog operation gets committed (this changes the global snapshot), and case 2: when within a transaction, there is some catalog change (this just updates the 'curcid' in the base snapshot of the transaction). Now, if we are thinking that we are streaming all the non-transactional sequence changes right before the next commit, then we are not bothered about case 1 at all, because all the changes we have queued so far are before this commit. And if we come to case 2: if we are performing any catalog change on the sequence itself, then the following changes on the same sequence will be considered transactional; and if the changes are just on some other catalog (not relevant to our sequence operation), then we should also not be worried about the command_id change, because the visibility of the catalog lookup for our sequence will be unaffected by it. In short, I am trying to say that we can safely queue the non-transactional sequence changes and stream them based on the snapshot we got when we decoded the first change; as long as we are planning to stream just before the next commit (or the next in-progress stream), we don't ever need to update the snapshot. > Also, would this mean we don't need to track the relfilenodes, if we're > able to query the catalog? Would we be able to check if the relfilenode > was created by the current xact?
I think by querying the catalog and checking the xmin we should be able to figure that out, but isn't that costlier than looking up the relfilenode in the hash? Because just for identifying whether the changes are transactional or non-transactional we would have to query the catalog; that means for each change, before we decide whether to add it to the transaction's change queue or the non-transactional change queue, we would have to query the catalog, i.e. we would have to start/stop a transaction? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
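For illustration, here is a minimal standalone sketch (simplified and assumed; not patch code) of the queue-and-replay-at-next-commit approach discussed above, where a later change to the same sequence supersedes earlier ones, which is what makes batching them attractive:

```
#include <stdio.h>
#include <string.h>

#define MAXQ 64

static struct
{
	char		seq[32];
	long		last_value;
}			queue[MAXQ];
static int	qlen = 0;

/*
 * Queue a non-transactional change; a later change of the same sequence
 * supersedes earlier ones, so at most one entry per sequence survives.
 */
static void
queue_sequence_change(const char *seq, long last_value)
{
	for (int i = 0; i < qlen; i++)
	{
		if (strcmp(queue[i].seq, seq) == 0)
		{
			queue[i].last_value = last_value;
			return;
		}
	}
	if (qlen < MAXQ)
	{
		snprintf(queue[qlen].seq, sizeof(queue[qlen].seq), "%s", seq);
		queue[qlen].last_value = last_value;
		qlen++;
	}
}

/* replay everything when the next commit is decoded */
static void
flush_on_commit(void)
{
	for (int i = 0; i < qlen; i++)
		printf("replay %s -> %ld\n", queue[i].seq, queue[i].last_value);
	qlen = 0;
}

int
main(void)
{
	queue_sequence_change("s", 33);
	queue_sequence_change("s", 66);	/* supersedes 33; only one change replayed */
	flush_on_commit();
	return 0;
}
```

As noted in the thread, short OLTP transactions rarely touch the same sequence twice before a commit, so the collapsing only pays off for larger transactions.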
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/6/23 12:05, Dilip Kumar wrote: > > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >>> Why can't we use the same concept of > >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > >>> non-transactional changes (have some base snapshot before the first > >>> change), and whenever there is any catalog change, queue new snapshot > >>> change also in the queue of the non-transactional sequence change so > >>> that while sending it to downstream whenever it is necessary we will > >>> change the historic snapshot? > >>> > >> > >> Oh, do you mean maintain different historic snapshots and then switch > >> based on the change we are processing? I guess the other thing we need > >> to consider is the order of processing the changes if we maintain > >> separate queues that need to be processed. > > > > I mean we will not specifically maintain the historic changes, but if > > there is any catalog change where we are pushing the snapshot to all > > the transaction's change queue, at the same time we will push this > > snapshot in the non-transactional sequence queue as well. I am not > > sure what is the problem with the ordering? > > Currently, we set up the historic snapshot before starting a transaction to process the change and then adapt the updates to it while processing the changes for the transaction. Now, while processing this new queue of non-transactional sequence messages, we probably need a separate snapshot and updates to it. So, either we need some sort of switching between snapshots or do it in different transactions. > > because we will be > > queueing all non-transactional sequence changes in a separate queue in > > the order they arrive and as soon as we process the next commit we > > will process all the non-transactional changes at that time. Do you > > see issue with that? > > > > Isn't this (in principle) the idea of queuing the non-transactional > changes and then applying them on the next commit? Yes, I didn't get > very far with that, but I got stuck exactly on tracking which snapshot > to use, so if there's a way to do that, that'd fix my issue. > > Also, would this mean we don't need to track the relfilenodes, if we're > able to query the catalog? Would we be able to check if the relfilenode > was created by the current xact? > I thought this new mechanism was for processing a queue of non-transactional sequence changes. The tracking of relfilenodes is to distinguish between transactional and non-transactional messages, so I think we probably still need that. -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Yes, if something like this happens, that'd be a problem: > > 1) decoding starts, with > > SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT > > 2) a transaction that creates a new relfilenode gets decoded, but we skip > it because we don't have the correct snapshot > > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT > > 4) we decode a sequence change from nextval() for the sequence > > This would lead to us attempting to apply a sequence change for a > relfilenode that's not visible yet (and may even get aborted). > > But can this even happen? Can we start decoding in the middle of a > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > which is also skipped until SNAPBUILD_FULL_SNAPSHOT? Or logical > messages, where we also call the output plugin in non-transactional cases? It's not a problem for logical messages, because whether the message is transactional or non-transactional is decided when the message itself is WAL-logged. But here our problem starts with deciding whether the change is transactional vs non-transactional, because if we insert the 'relfilenode' into the hash then a subsequent sequence change in the same transaction would be considered transactional, otherwise non-transactional. And XLOG_HEAP2_NEW_CID is just for changing the snapshot->curcid, which will only affect the catalog visibility of the upcoming operations in the same transaction; but that's not an issue, because if some of the changes of this transaction are seen when the snapbuild state is < SNAPBUILD_FULL_SNAPSHOT, then this transaction has to get committed before the state changes to SNAPBUILD_CONSISTENT_SNAPSHOT, i.e. the commit LSN of this transaction is going to be < start_decoding_at. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
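To make the decision rule concrete, here is a minimal standalone model (smgr_decode and seq_decode mirror the thread's names; everything else is assumed for illustration) showing how skipping the smgr record before SNAPBUILD_FULL_SNAPSHOT misclassifies a later change in the same transaction:

```
#include <stdbool.h>
#include <stdio.h>

typedef enum
{
	SNAPBUILD_BUILDING_SNAPSHOT,
	SNAPBUILD_FULL_SNAPSHOT,
	SNAPBUILD_CONSISTENT_SNAPSHOT
} SnapState;

#define MAX_RELFILENODES 16
static unsigned created[MAX_RELFILENODES];	/* relfilenodes created by the xact */
static int	ncreated = 0;

/* modeled smgr_decode(): remember relfilenodes created by the xact */
static void
smgr_decode(SnapState state, unsigned relfilenode)
{
	if (state < SNAPBUILD_FULL_SNAPSHOT)
		return;					/* record skipped: the hash never learns of it */
	if (ncreated < MAX_RELFILENODES)
		created[ncreated++] = relfilenode;
}

/* modeled seq_decode(): transactional iff the xact created the relfilenode */
static bool
seq_is_transactional(unsigned relfilenode)
{
	for (int i = 0; i < ncreated; i++)
		if (created[i] == relfilenode)
			return true;
	return false;
}

int
main(void)
{
	/* ALTER SEQUENCE creates relfilenode 1234 while still BUILDING ... */
	smgr_decode(SNAPBUILD_BUILDING_SNAPSHOT, 1234);
	/* ... the snapshot then reaches FULL, and setval()/nextval() is decoded */
	printf("sequence change treated as %s\n",
		   seq_is_transactional(1234) ? "transactional" : "non-transactional (wrong)");
	return 0;
}
```

Dropping the state check in the modeled smgr_decode() makes the classification come out right, which is one of the directions explored later in the thread.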
Hi, There's been a lot discussed over the past month or so, and it's become difficult to get a good idea of the current state - what issues remain to be solved, what's unrelated to this patch, and how to move it forward. Long-running threads tend to be confusing, so I had a short call with Amit yesterday to discuss the current state, and to make sure we're on the same page. I believe it was very helpful, and I've promised to post a short summary of the call - issues, what we agreed seems like a path forward, etc. Obviously, I might have misunderstood something, in which case Amit can correct me. And I'd certainly welcome opinions from others. In general, we discussed three areas - desirability of the feature, correctness and performance. I believe a brief summary of the agreement would be this: - desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...). - correctness: There's one possible correctness issue, when the snapshot changes to FULL between the record creating a sequence relfilenode and a record advancing that sequence. This needs to be verified/reproduced, and fixed. - performance issues: We've agreed the case with a lot of aborts (when DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've discussed whether the overhead with many sequence changes (nextval-40) is acceptable, and/or how to improve it. Next, I'll go over these points in more detail, with my understanding of what the challenges are, possible solutions etc. Most of this was discussed/agreed on the call, but some are ideas I had only after the call when writing this summary. 1) desirability of the feature Firstly, do we actually want/need this feature? I believe that's very much a question of what use cases we're targeting. If we only focus on distributed databases (particularly those with multiple active nodes), then we probably agree that the right solution is to not use sequences (~generators of incrementing values) but UUIDs or similar random identifiers (better not call them sequences, there's not much sequential about them). The huge advantage is that this does not require replicating any state between the nodes, so logical decoding can simply ignore them and replicate just the generated values. I don't think there's any argument about that. If I was building such a distributed system, I'd certainly use such random IDs. The question is what to do about the other use cases - online upgrades relying on logical decoding, failovers to logical replicas, and so on. Or what to do about existing systems that can't be easily changed to use different/random identifiers. Those are not really distributed systems and therefore don't quite need random IDs. Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can easily lead to massive write amplification, for example. There are variants like UUIDv7 reducing the impact, but there are other trade-offs too. My takeaway from this is there's still value in having this feature. 2) correctness The only correctness issue I'm aware of is the question of what happens when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding the relfilenode creation and the sequence increment, pointed out by Dilip in [1].
If this happens (and while I don't have a reproducer, I also don't have a very clear idea why it couldn't happen), it breaks how the patch decides between transactional and non-transactional sequence changes. So this seems like a fatal flaw - it definitely needs to be solved. I don't have a good idea how to do that, unfortunately. The problem is the dependency on an earlier record, and that this needs to be evaluated immediately (in the decode phase). Logical messages don't have the same issue because the "transactional" flag does not depend on earlier stuff, and other records are not interpreted until apply/commit, when we know everything relevant was decoded. I don't know what the solution is. Either we find a way to make sure not to lose/skip the smgr record, or we need to rethink how we determine the transactional flag (perhaps even try again adding it to the WAL record, but we didn't find a way to do that earlier). 3) performance issues We have discussed two cases - "ddl-abort" and "nextval-40". The "ddl-abort" is when the workload does a lot of DDL and then aborts them, leading to profiles dominated by DecodeCommit. The agreement here is that while this is a valid issue and we should try fixing it, it's unrelated to this patch. The issue exists even on master. So in the context of this patch we can ignore this issue. The "nextval-40" applies to workloads doing a lot of regular sequence changes. We only decode/apply changes written to WAL, and that happens only for every 32 increments or so. The test was with a very simple transaction (just a sequence advanced enough to write WAL + a 1-row insert), which means it's pretty much a worst-case impact. For larger transactions, it's going to be hardly measurable. Also, this only measured decoding, not apply (which will also make this less significant). Most of the overhead comes from ReorderBufferQueueSequence() starting and then aborting a transaction, per the profile in [2]. This only happens in the non-transactional case, but we expect that to be the common case in regular workloads. Anyway, let's say we want to mitigate this overhead. I think there are three ways to do that: a) find a way to not have to apply sequence changes immediately, but queue them until the next commit This would give a chance to combine multiple sequence changes into a single "replay change", reducing the overhead. There's a couple of problems with this, though. Firstly, it can't help OLTP workloads, because the transactions are short, so sequence changes are unlikely to combine. It's also not clear how expensive this would be - could it be expensive enough to outweigh the benefits? All of this is assuming it can be implemented; we don't have such a patch yet. I was speculating about something like this earlier, but I haven't managed to make that work. Doesn't mean it's impossible, ofc. b) provide a way for the output plugin to skip sequence decoding early The way the decoding is coded now, ReorderBufferQueueSequence does all the expensive dance even if the output plugin does not implement the sequence callbacks. Maybe we should have a way to allow skipping all of this early, right at the beginning of ReorderBufferQueueSequence (and thus before we even try to start/abort the transaction). Ofc, this is not a perfect solution either - it won't help workloads that actually need/want sequence decoding but where the decoding has significant overhead, or plugins that choose to support decoding sequences in general.
For example the built-in output plugin would certainly support sequences - and the overhead would still be there (even if no sequences are added to the publication). c) instruct people to increase the sequence cache from 32 to 1024 This would reduce the number of WAL messages that need to be decoded and replayed, reducing the overhead proportionally. Of course, this also means the sequence will "jump forward" more in case of a crash or failover to the logical replica, but I think that's an acceptable tradeoff. People should not expect sequences to be gap-less anyway. Considering nextval-40 is pretty much worst-case behavior, I think this might actually be an acceptable solution/workaround. regards [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com [2] https://www.postgresql.org/message-id/0bc34f71-7745-dc16-d765-5ba1f0776a3f%40enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
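For a rough sense of option c), a tiny standalone calculation, assuming one sequence WAL record is written per cache of values handed out (32 being the default discussed here):

```
#include <stdio.h>

/* one WAL record per "cache" of values handed out */
static long
wal_records(long nextval_calls, long cache)
{
	return (nextval_calls + cache - 1) / cache;
}

int
main(void)
{
	long		calls = 1000000;

	printf("cache=32:   %ld sequence WAL records\n", wal_records(calls, 32));	/* 31250 */
	printf("cache=1024: %ld sequence WAL records\n", wal_records(calls, 1024));	/* 977 */
	return 0;
}
```

The per-record decoding overhead shrinks by the same ~32x factor, at the cost of larger jumps after a crash or failover, as noted above.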
On Thu, Dec 7, 2023 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > Yes, if something like this happens, that'd be a problem: > > > > 1) decoding starts, with > > > > SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT > > > > 2) transaction that creates a new refilenode gets decoded, but we skip > > it because we don't have the correct snapshot > > > > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT > > > > 4) we decode sequence change from nextval() for the sequence > > > > This would lead to us attempting to apply sequence change for a > > relfilenode that's not visible yet (and may even get aborted). > > > > But can this even happen? Can we start decoding in the middle of a > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical > > messages, where we also call the output plugin in non-transactional cases. > > It's not a problem for logical messages because whether the message is > transaction or non-transactional is decided while WAL logs the message > itself. But here our problem starts with deciding whether the change > is transactional vs non-transactional, because if we insert the > 'relfilenode' in hash then the subsequent sequence change in the same > transaction would be considered transactional otherwise > non-transactional. > It is correct that we can make a wrong decision about whether a change is transactional or non-transactional when sequence DDL happens before the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens after that state. However, one thing to note here is that we won't try to stream such a change, because for non-transactional cases we don't proceed unless the snapshot is in a consistent state. Now, if the decision had been correct then we would probably have queued the sequence change and discarded it at commit. One thing where we deviate here is that for non-sequence transactional cases (including logical messages), we immediately start queuing the changes as soon as we reach the SNAPBUILD_FULL_SNAPSHOT state (provided SnapBuildProcessChange() returns true, which is quite possible) and take the final decision at commit/prepare/abort time. However, that won't be the case for sequences, because of the dependency of determining transactional cases on one of the prior records. Now, I am not completely sure at this stage if such a deviation can cause any problem and/or whether we are okay to have such a deviation for sequences. -- With Regards, Amit Kapila.
Dear hackers,

> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state.

I found a workload which the decoder distinguishes wrongly.

# Prerequisite

Apply the attached patch for inspecting the sequence status. It can be applied atop the v20231203 patch set. Also, a table and a sequence must be defined:

```
CREATE TABLE foo (var int);
CREATE SEQUENCE s;
```

# Workload

Then, you can execute concurrent transactions from three clients like below:

Client-1

BEGIN;
INSERT INTO foo VALUES (1);

Client-2

SELECT pg_create_logical_replication_slot('slot', 'test_decoding');

Client-3

BEGIN;
ALTER SEQUENCE s MAXVALUE 5000;

Client-1

COMMIT;

Client-3

SAVEPOINT s1;
SELECT setval('s', 2000);
ROLLBACK;

SELECT pg_logical_slot_get_changes('slot', 'test_decoding');

# Result and analysis

At first, the lines below are output to the log. This means that the WAL records for ALTER SEQUENCE were decoded but skipped because the snapshot was still being built.

```
...
LOG: logical decoding found initial starting point at 0/154D238
DETAIL: Waiting for transactions (approximately 1) older than 741 to end.
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: skipped
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: skipped
...
```

Note that the `seq_decode` line above was not emitted via `setval()`; it came from the ALTER SEQUENCE statement. Below is the call stack for inserting that WAL record.

```
XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
fill_seq_fork_with_data
fill_seq_with_data
AlterSequence
```

Then, the subsequent lines look like the following. This means that the snapshot has become FULL and `setval()` is wrongly regarded as non-transactional.

```
LOG: logical decoding found initial consistent point at 0/154D658
DETAIL: Waiting for transactions (approximately 1) older than 742 to end.
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: the sequence is non-transactional
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: not consistent: skipped
```

The change is then discarded because the snapshot has not become CONSISTENT yet, per the code below. If the change had been considered transactional, we would have queued it, though the transaction would then be skipped at commit.

```
else if (!transactional &&
         (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
          SnapBuildXactNeedsSkip(builder, buf->origptr)))
    return;
```

But anyway, we have found a case where we make a wrong decision. This example is lucky - it does not produce wrong output - but I'm not sure all cases are like that.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > But can this even happen? Can we start decoding in the middle of a > > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical > > > messages, where we also call the output plugin in non-transactional cases. > > > > It's not a problem for logical messages because whether the message is > > transaction or non-transactional is decided while WAL logs the message > > itself. But here our problem starts with deciding whether the change > > is transactional vs non-transactional, because if we insert the > > 'relfilenode' in hash then the subsequent sequence change in the same > > transaction would be considered transactional otherwise > > non-transactional. > > > > It is correct that we can make a wrong decision about whether a change > is transactional or non-transactional when sequence DDL happens before > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > after that state. However, one thing to note here is that we won't try > to stream such a change because for non-transactional cases we don't > proceed unless the snapshot is in a consistent state. Now, if the > decision had been correct then we would probably have queued the > sequence change and discarded at commit. > > One thing that we deviate here is that for non-sequence transactional > cases (including logical messages), we immediately start queuing the > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > SnapBuildProcessChange() returns true which is quite possible) and > take final decision at commit/prepare/abort time. However, that won't > be the case for sequences because of the dependency of determining > transactional cases on one of the prior records. Now, I am not > completely sure at this stage if such a deviation can cause any > problem and or whether we are okay to have such a deviation for > sequences. Okay, so this particular scenario that I raised is somehow safe. I mean, although we are considering a transactional sequence operation as non-transactional, we also know that if some of the changes for a transaction are skipped because the snapshot was not FULL, that means the transaction cannot be streamed, because that transaction has to be committed before the snapshot becomes CONSISTENT (based on the snapshot state change machinery). Ideally, based on the same logic that the snapshot is not consistent, the non-transactional sequence changes are also skipped. But the only thing that makes me a bit uncomfortable is that even though the result is not wrong, we have made some wrong intermediate decisions, i.e. considered a transactional change as non-transactional. One solution to this issue is that, even if the snapshot state does not reach FULL, just add the sequence relids to the hash; I mean, that hash is only maintained for deciding whether the sequence is changed in that transaction or not. So not adding such relids to the hash seems like the root cause of the issue. Honestly, I haven't analyzed this idea in detail about how easy it would be to add only these changes to the hash and what the other dependencies are, but this seems like a worthwhile direction IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > It is correct that we can make a wrong decision about whether a change > > is transactional or non-transactional when sequence DDL happens before > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > after that state. However, one thing to note here is that we won't try > > to stream such a change because for non-transactional cases we don't > > proceed unless the snapshot is in a consistent state. Now, if the > > decision had been correct then we would probably have queued the > > sequence change and discarded at commit. > > > > One thing that we deviate here is that for non-sequence transactional > > cases (including logical messages), we immediately start queuing the > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > SnapBuildProcessChange() returns true which is quite possible) and > > take final decision at commit/prepare/abort time. However, that won't > > be the case for sequences because of the dependency of determining > > transactional cases on one of the prior records. Now, I am not > > completely sure at this stage if such a deviation can cause any > > problem and or whether we are okay to have such a deviation for > > sequences. > > Okay, so this particular scenario that I raised is somehow saved, I > mean although we are considering transactional sequence operation as > non-transactional we also know that if some of the changes for a > transaction are skipped because the snapshot was not FULL that means > that transaction can not be streamed because that transaction has to > be committed before snapshot become CONSISTENT (based on the snapshot > state change machinery). Ideally based on the same logic that the > snapshot is not consistent the non-transactional sequence changes are > also skipped. But the only thing that makes me a bit uncomfortable is > that even though the result is not wrong we have made some wrong > intermediate decisions i.e. considered transactional change as > non-transactions. > > One solution to this issue is that, even if the snapshot state does > not reach FULL just add the sequence relids to the hash, I mean that > hash is only maintained for deciding whether the sequence is changed > in that transaction or not. So no adding such relids to hash seems > like a root cause of the issue. Honestly, I haven't analyzed this > idea in detail about how easy it would be to add only these changes to > the hash and what are the other dependencies, but this seems like a > worthwhile direction IMHO. I also thought about the same solution. I tried this solution as the attached patch on top of Hayato's diagnostic changes. Following log messages are seen in server error log. Those indicate that the sequence change was correctly deemed as a transactional change (line 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional). 2023-12-14 12:12:50.550 IST [321229] ERROR: relation "pg_replication_slot" does not exist at character 15 2023-12-14 12:12:50.550 IST [321229] STATEMENT: select * from pg_replication_slot; 2023-12-14 12:12:57.289 IST [321229] LOG: logical decoding found initial starting point at 0/1598D50 2023-12-14 12:12:57.289 IST [321229] DETAIL: Waiting for transactions (approximately 1) older than 759 to end. 2023-12-14 12:12:57.289 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: smgr_decode. 
snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: skipped 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.552 IST [321229] LOG: logical decoding found initial consistent point at 0/1599170 2023-12-14 12:13:49.552 IST [321229] DETAIL: Waiting for transactions (approximately 1) older than 760 to end. 2023-12-14 12:13:49.552 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT 2023-12-14 12:14:55.591 IST [321230] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional 2023-12-14 12:14:55.591 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.813 IST [321229] LOG: logical decoding found consistent point at 0/15992E8 2023-12-14 12:14:55.813 IST [321229] DETAIL: There are no running transactions. 2023-12-14 12:14:55.813 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); It looks like the solution works. But this is the only place where we process a change before the SNAPSHOT reaches FULL. But this is also the only record which affects a decision to queue/not a following change. So it should be ok. The sequence_hash'es are separate for each transaction and they are cleaned when processing the COMMIT record. So I think we don't have any side effects of adding a relfilenode to the sequence hash even though the snapshot is not FULL. As a side note: 1. The prologue of ReorderBufferSequenceCleanup() mentions only abort, but this function will be called for COMMIT as well. The prologue needs to be fixed. 2. Now that sequence hashes are per transaction, do we need ReorderBufferTXN in ReorderBufferSequenceEnt? -- Best Wishes, Ashutosh Bapat
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > It is correct that we can make a wrong decision about whether a change > > > is transactional or non-transactional when sequence DDL happens before > > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > > after that state. However, one thing to note here is that we won't try > > > to stream such a change because for non-transactional cases we don't > > > proceed unless the snapshot is in a consistent state. Now, if the > > > decision had been correct then we would probably have queued the > > > sequence change and discarded at commit. > > > > > > One thing that we deviate here is that for non-sequence transactional > > > cases (including logical messages), we immediately start queuing the > > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > > SnapBuildProcessChange() returns true which is quite possible) and > > > take final decision at commit/prepare/abort time. However, that won't > > > be the case for sequences because of the dependency of determining > > > transactional cases on one of the prior records. Now, I am not > > > completely sure at this stage if such a deviation can cause any > > > problem and or whether we are okay to have such a deviation for > > > sequences. > > > > Okay, so this particular scenario that I raised is somehow saved, I > > mean although we are considering transactional sequence operation as > > non-transactional we also know that if some of the changes for a > > transaction are skipped because the snapshot was not FULL that means > > that transaction can not be streamed because that transaction has to > > be committed before snapshot become CONSISTENT (based on the snapshot > > state change machinery). Ideally based on the same logic that the > > snapshot is not consistent the non-transactional sequence changes are > > also skipped. But the only thing that makes me a bit uncomfortable is > > that even though the result is not wrong we have made some wrong > > intermediate decisions i.e. considered transactional change as > > non-transactions. > > > > One solution to this issue is that, even if the snapshot state does > > not reach FULL just add the sequence relids to the hash, I mean that > > hash is only maintained for deciding whether the sequence is changed > > in that transaction or not. So no adding such relids to hash seems > > like a root cause of the issue. Honestly, I haven't analyzed this > > idea in detail about how easy it would be to add only these changes to > > the hash and what are the other dependencies, but this seems like a > > worthwhile direction IMHO. > > I also thought about the same solution. I tried this solution as the > attached patch on top of Hayato's diagnostic changes. I think you forgot to attach the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > It is correct that we can make a wrong decision about whether a change > > > is transactional or non-transactional when sequence DDL happens before > > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > > after that state. However, one thing to note here is that we won't try > > > to stream such a change because for non-transactional cases we don't > > > proceed unless the snapshot is in a consistent state. Now, if the > > > decision had been correct then we would probably have queued the > > > sequence change and discarded at commit. > > > > > > One thing that we deviate here is that for non-sequence transactional > > > cases (including logical messages), we immediately start queuing the > > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > > SnapBuildProcessChange() returns true which is quite possible) and > > > take final decision at commit/prepare/abort time. However, that won't > > > be the case for sequences because of the dependency of determining > > > transactional cases on one of the prior records. Now, I am not > > > completely sure at this stage if such a deviation can cause any > > > problem and or whether we are okay to have such a deviation for > > > sequences. > > > > Okay, so this particular scenario that I raised is somehow saved, I > > mean although we are considering transactional sequence operation as > > non-transactional we also know that if some of the changes for a > > transaction are skipped because the snapshot was not FULL that means > > that transaction can not be streamed because that transaction has to > > be committed before snapshot become CONSISTENT (based on the snapshot > > state change machinery). Ideally based on the same logic that the > > snapshot is not consistent the non-transactional sequence changes are > > also skipped. But the only thing that makes me a bit uncomfortable is > > that even though the result is not wrong we have made some wrong > > intermediate decisions i.e. considered transactional change as > > non-transactions. > > > > One solution to this issue is that, even if the snapshot state does > > not reach FULL just add the sequence relids to the hash, I mean that > > hash is only maintained for deciding whether the sequence is changed > > in that transaction or not. So no adding such relids to hash seems > > like a root cause of the issue. Honestly, I haven't analyzed this > > idea in detail about how easy it would be to add only these changes to > > the hash and what are the other dependencies, but this seems like a > > worthwhile direction IMHO. > > ... > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es as separate for each > transaction and they are cleaned when processing COMMIT record. But it is possible that even commit or abort also happens before the snapshot reaches the full state, in which case the hash table will have stale or invalid (for aborts) entries. That will probably be cleaned at a later point by running_xact records. Now, I think in theory, it is possible that the same RelFileLocator can again be allocated before we clean up the existing entry, which can probably confuse the system. It might or might not be a problem in practice, but I think the more assumptions we add for sequences, the more difficult it will become to ensure its correctness. -- With Regards, Amit Kapila.
On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I think you forgot to attach the patch. Sorry. Here it is. On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es as separate for each > transaction and they are cleaned when processing COMMIT record. > > > > But it is possible that even commit or abort also happens before the > snapshot reaches full state in which case the hash table will have > stale or invalid (for aborts) entries. That will probably be cleaned > at a later point by running_xact records. Why would cleaning wait till running_xact records? Won't the txn entry itself be removed when processing the commit/abort record? At the same time the sequence hash will be cleaned as well. > Now, I think in theory, it > is possible that the same RelFileLocator can again be allocated before > we clean up the existing entry which can probably confuse the system. How? The transaction allocating the first time would be cleaned before it happens the second time. So it shouldn't matter. -- Best Wishes, Ashutosh Bapat
On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I think you forgot to attach the patch. > > Sorry. Here it is. > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > It looks like the solution works. But this is the only place where we > > process a change before SNAPSHOT reaches FULL. But this is also the > > only record which affects a decision to queue/not a following change. > > So it should be ok. The sequence_hash'es as separate for each > > transaction and they are cleaned when processing COMMIT record. > > > > > > > But it is possible that even commit or abort also happens before the > > snapshot reaches full state in which case the hash table will have > > stale or invalid (for aborts) entries. That will probably be cleaned > > at a later point by running_xact records. > > Why would cleaning wait till running_xact records? Won't txn entry > itself be removed when processing commit/abort record? At the same the > sequence hash will be cleaned as well. > > > Now, I think in theory, it > > is possible that the same RelFileLocator can again be allocated before > > we clean up the existing entry which can probably confuse the system. > > How? The transaction allocating the first time would be cleaned before > it happens the second time. So shouldn't matter. > It can only be cleaned if we process it, but xact_decode won't allow us to process it, and I don't think it would be a good idea to add another hack for sequences here. See the code below:

xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
    SnapBuild  *builder = ctx->snapshot_builder;
    ReorderBuffer *reorder = ctx->reorder;
    XLogReaderState *r = buf->record;
    uint8       info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;

    /*
     * If the snapshot isn't yet fully built, we cannot decode anything, so
     * bail out.
     */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
        return;

-- With Regards, Amit Kapila.
On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I think you forgot to attach the patch. > > > > Sorry. Here it is. > > > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > It looks like the solution works. But this is the only place where we > > > process a change before SNAPSHOT reaches FULL. But this is also the > > > only record which affects a decision to queue/not a following change. > > > So it should be ok. The sequence_hash'es as separate for each > > > transaction and they are cleaned when processing COMMIT record. > > > > > > > > > > But it is possible that even commit or abort also happens before the > > > snapshot reaches full state in which case the hash table will have > > > stale or invalid (for aborts) entries. That will probably be cleaned > > > at a later point by running_xact records. > > > > Why would cleaning wait till running_xact records? Won't txn entry > > itself be removed when processing commit/abort record? At the same the > > sequence hash will be cleaned as well. > > > > > Now, I think in theory, it > > > is possible that the same RelFileLocator can again be allocated before > > > we clean up the existing entry which can probably confuse the system. > > > > How? The transaction allocating the first time would be cleaned before > > it happens the second time. So shouldn't matter. > > > > It can only be cleaned if we process it but xact_decode won't allow us > to process it and I don't think it would be a good idea to add another > hack for sequences here. See below code: > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > { > SnapBuild *builder = ctx->snapshot_builder; > ReorderBuffer *reorder = ctx->reorder; > XLogReaderState *r = buf->record; > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > /* > * If the snapshot isn't yet fully built, we cannot decode anything, so > * bail out. > */ > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > return; That may be true for a transaction which is decoded, but I think all the transactions which are added to ReorderBuffer should be cleaned up once they have been processed irrespective of whether they are decoded/sent downstream or not. In this case I see the sequence hash being cleaned up for the sequence related transaction in Hayato's reproducer. See attached patch with a diagnostic change and the output below (notice sequence cleanup called on transaction 767). 2023-12-14 21:06:36.756 IST [386957] LOG: logical decoding found initial starting point at 0/15B2F68 2023-12-14 21:06:36.756 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 767 to end. 2023-12-14 21:06:36.756 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: seq_decode. 
snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: skipped 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.710 IST [386957] LOG: logical decoding found initial consistent point at 0/15B3388 2023-12-14 21:07:05.710 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 768 to end. 2023-12-14 21:07:05.710 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:39.292 IST [386298] LOG: checkpoint starting: time 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: the sequence is transactional 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: sequence cleanup called on transaction 767 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: logical decoding found consistent point at 0/15B3518 2023-12-14 21:07:40.919 IST [386957] DETAIL: There are no running transactions. 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); We see similar output when pg_logical_slot_get_changes() is called. I haven't found the code path from where the sequence cleanup gets called. But it's being called. Am I missing something? -- Best Wishes, Ashutosh Bapat
On Thu, Dec 14, 2023, at 12:44 PM, Ashutosh Bapat wrote:
> I haven't found the code path from where the sequence cleanup gets
> called. But it's being called. Am I missing something?

ReorderBufferCleanupTXN.
On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > It can only be cleaned if we process it but xact_decode won't allow us > > to process it and I don't think it would be a good idea to add another > > hack for sequences here. See below code: > > > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > > { > > SnapBuild *builder = ctx->snapshot_builder; > > ReorderBuffer *reorder = ctx->reorder; > > XLogReaderState *r = buf->record; > > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > > > /* > > * If the snapshot isn't yet fully built, we cannot decode anything, so > > * bail out. > > */ > > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > > return; > > That may be true for a transaction which is decoded, but I think all > the transactions which are added to ReorderBuffer should be cleaned up > once they have been processed irrespective of whether they are > decoded/sent downstream or not. In this case I see the sequence hash > being cleaned up for the sequence related transaction in Hayato's > reproducer. > It was because the test you are using was not designed to show the problem I mentioned. In this case, the rollback was after a full snapshot state was reached. -- With Regards, Amit Kapila.
Hi,

I wanted to hop in here on one particular issue:

> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> better solution for distributed (esp. active-active) systems. But there
> are important use cases that are likely to keep using regular sequences
> (online upgrades of single-node instances, existing systems, ...).

+1.

Right now, the lack of sequence replication is a rather large foot-gun on logical replication upgrades. Copying the sequences over during the cutover period is doable, of course, but:

(a) There's no out-of-the-box tooling that does it, so everyone has to write some scripts just for that one function.

(b) It's one more thing that extends the cutover window.

I don't think it is a good idea to make it mandatory: for example, there's a strong use case for replicating a table but not a sequence associated with it. But it's definitely a missing feature in logical replication.
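(As an aside, the ad-hoc scripts mentioned in (a) typically boil down to something like the sketch below - run on the publisher, with the generated statements executed on the subscriber during cutover. This assumes the pg_sequences view, skips never-called sequences, and a production script would need to handle permissions and other edge cases:)

    -- Generate setval() calls for every sequence with a known value;
    -- illustrative only.
    SELECT format('SELECT setval(%L, %s, true);',
                  quote_ident(schemaname) || '.' || quote_ident(sequencename),
                  last_value)
      FROM pg_sequences
     WHERE last_value IS NOT NULL;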
On 12/19/23 13:54, Christophe Pettus wrote:
> Hi,
>
> I wanted to hop in here on one particular issue:
>
>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>> better solution for distributed (esp. active-active) systems. But there
>> are important use cases that are likely to keep using regular sequences
>> (online upgrades of single-node instances, existing systems, ...).
>
> +1.
>
> Right now, the lack of sequence replication is a rather large
> foot-gun on logical replication upgrades. Copying the sequences
> over during the cutover period is doable, of course, but:
>
> (a) There's no out-of-the-box tooling that does it, so everyone has
> to write some scripts just for that one function.
>
> (b) It's one more thing that extends the cutover window.
>

I agree it's an annoying gap for this use case. But if this is the only use case, maybe a better solution would be to provide such tooling instead of adding it to the logical decoding?

It might seem a bit strange if most data is copied by replication directly, while sequences need special handling, ofc.

> I don't think it is a good idea to make it mandatory: for example,
> there's a strong use case for replicating a table but not a sequence
> associated with it. But it's definitely a missing feature in
> logical replication.

I don't think the plan was to make replication of sequences mandatory, certainly not with the built-in replication. If you don't add sequences to the publication, the sequence changes will be skipped.

But it still needs to be part of the decoding, which adds overhead for all logical decoding uses, even if the sequence changes end up being discarded. That's somewhat annoying, especially considering sequences are a fairly common part of the WAL stream.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/15/23 03:33, Amit Kapila wrote:
> On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> It can only be cleaned if we process it but xact_decode won't allow us
>>> to process it and I don't think it would be a good idea to add another
>>> hack for sequences here. See below code:
>>>
>>> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>>> {
>>>     SnapBuild  *builder = ctx->snapshot_builder;
>>>     ReorderBuffer *reorder = ctx->reorder;
>>>     XLogReaderState *r = buf->record;
>>>     uint8       info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
>>>
>>>     /*
>>>      * If the snapshot isn't yet fully built, we cannot decode anything, so
>>>      * bail out.
>>>      */
>>>     if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
>>>         return;
>>
>> That may be true for a transaction which is decoded, but I think all
>> the transactions which are added to ReorderBuffer should be cleaned up
>> once they have been processed irrespective of whether they are
>> decoded/sent downstream or not. In this case I see the sequence hash
>> being cleaned up for the sequence related transaction in Hayato's
>> reproducer.
>>
>
> It was because the test you are using was not designed to show the
> problem I mentioned. In this case, the rollback was after a full
> snapshot state was reached.
>

Right, I haven't tried to reproduce this, but it very much looks like the entry would not be removed if the xact aborts/commits before the snapshot reaches FULL state.

I suppose one way to deal with this would be to first check if an entry for the same relfilenode exists. If it does, the original transaction must have terminated, but we haven't cleaned it up yet - in which case we can just "move" the relfilenode to the new one.

However, can't that happen even with full snapshots? I mean, let's say a transaction creates a relfilenode and terminates without writing an abort record (surely that's possible, right?). And then another xact comes and generates the same relfilenode (presumably that's unlikely, but perhaps possible?). Aren't we in pretty much the same situation, until the next RUNNING_XACTS cleans up the hash table?

I think tracking all relfilenodes would fix the original issue (with treating some changes as transactional), and the tweak that "moves" the relfilenode to the new xact would fix this other issue too.

That being said, I feel a bit uneasy about it, for similar reasons as Amit. If we start processing records before the full snapshot, that seems like moving the assumptions a bit. For example it means we'd create ReorderBufferTXN entries for cases that we'd have skipped before. OTOH this is (or should be) only a very temporary period while starting the replication, I believe.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

Here's a new version of this patch series. It rebases the 2023/12/03 version, and there are a couple of improvements to address the performance and correctness questions.

Since the 2023/12/03 version was posted, there were a couple of off-list discussions with several people - with Amit, as mentioned in [1], and then also internally and at pgconf.eu. My personal (very brief) takeaway from these discussions is this:

1) desirability: We want a built-in way to handle sequences in logical replication. I think everyone agrees this is not a way to do distributed sequences in active-active setups, but that there are other use cases that need this feature - typically upgrades / logical failover.

Multiple approaches were discussed (support in logical replication or a separate tool to be executed on the logical replica). Both might work; people usually end up with some sort of custom tool anyway. But it's cumbersome, and the consensus seems to be that the logical rep feature is better.

2) performance: There was concern about the performance impact, and that it affects everyone, including those who don't replicate sequences (as the overhead is mostly incurred before calls to output plugin etc.). I do agree with this, but I don't think sequences can be decoded in a much cheaper way.

There was a proposal [2] that maybe we could batch the non-transactional sequence changes in the "next" transaction, and distribute them similarly to how SnapBuildDistributeNewCatalogSnapshot() distributes catalog snapshots. But I doubt that'd actually work. Or more precisely - if we can make the code work, I think it would not solve the issue for some common cases. Consider for example a case with many concurrent top-level transactions, making this quite expensive. And I'd bet sequence changes are far more common than catalog changes.

However, I think we ultimately agreed that the overhead is acceptable if it only applies to use cases that actually need to decode sequences. So if there was a way to skip sequence decoding when not necessary, that would work. Unfortunately, that can't be based on simply checking which callbacks are defined by the output plugin, because e.g. pgoutput needs to handle both cases (so the callbacks need to be defined). Nor can it be determined based on what's included in the publication (as that's not available that early).

The agreement was that the best way is to have a CREATE SUBSCRIPTION option that would instruct the upstream to decode sequences. By default this option is 'off' (because that's the no-overhead case), but it can be enabled for each subscription. This is what 0005 implements, and interestingly enough, this is what an earlier version [3] from 2023/04/02 did. (There's a usage sketch after the attachment list below.)

This means that if you add a sequence to the publication, but leave "sequences=off" in CREATE SUBSCRIPTION, the sequence won't be replicated after all. That may seem a bit surprising, and I don't like it, but I don't think there's a better way to do this.

3) correctness: The last point is about making the "transactional" flag correct when the snapshot state changes mid-transaction, originally pointed out by Dilip [4]. Per [5] this however happens to work correctly, because while we identify the change as 'non-transactional' (which is incorrect), we immediately throw it away (so we don't try to apply it, which would error-out).

One option would be to document/describe this in the comments, per 0006. This means that when ReorderBufferSequenceIsTransactional() returns true, it's correct. But if it returns 'false', it means 'maybe'.

I agree it seems a bit strange, but with the extra comments I think it's OK. It simply means that if we get transactional=false incorrectly, we're guaranteed to not process it. Maybe we could rename the function to make this clear from the name.

The other solution proposed in the thread [6] was to always decode the relfilenode, and add it to the hash table. 0007 does this, and it works. But I agree this seems possibly worse than 0006 - it means we may be adding entries to the hash table, and it's not clear when exactly we'll clean them up etc. It'd be the only place processing stuff before the snapshot reaches FULL.

I personally would go with 0006, i.e. just explaining why doing it this way is correct.

regards

[1] https://www.postgresql.org/message-id/12822961-b7de-9d59-dd27-2e3dc3980c7e%40enterprisedb.com
[2] https://www.postgresql.org/message-id/CAFiTN-vm3-bGfm-uJdzRLERMHozW8xjZHu4rdmtWR-rP-SJYMQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/1f96b282-cb90-8302-cee8-7b3f5576a31c%40enterprisedb.com
[4] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAA4eK1LFise9iN%2BNN%3Dagrk4prR1qD%2BebvzNjKAWUog2%2Bhy3HxQ%40mail.gmail.com
[6] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v20240111-0001-Logical-decoding-of-sequences.patch
- v20240111-0002-Add-decoding-of-sequences-to-test_decoding.patch
- v20240111-0003-Add-decoding-of-sequences-to-built-in-repl.patch
- v20240111-0004-global-hash-table-of-sequence-relfilenodes.patch
- v20240111-0005-CREATE-SUBSCRIPTION-flag-to-enable-sequenc.patch
- v20240111-0006-add-comment-about-SNAPBUILD_FULL.patch
- v20240111-0007-decode-all-sequence-relfilenodes.patch
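For illustration, here is roughly how the subscription option from 0005 is meant to be used. The option name follows the patch discussion above, and the sequence-publication syntax follows the patch series rather than released PostgreSQL, so treat the exact spelling as tentative:

    -- On the publisher: include the sequence in a publication.
    CREATE PUBLICATION seq_pub FOR TABLE t, SEQUENCE t_id_seq;

    -- On the subscriber: sequence decoding is off by default (the
    -- no-overhead case) and has to be enabled per subscription.
    CREATE SUBSCRIPTION seq_sub
        CONNECTION 'host=publisher dbname=test'
        PUBLICATION seq_pub
        WITH (sequences = on);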
On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> 1) desirability: We want a built-in way to handle sequences in logical
> replication. I think everyone agrees this is not a way to do distributed
> sequences in active-active setups, but that there are other use cases
> that need this feature - typically upgrades / logical failover.

Yeah. I find it extremely hard to take seriously the idea that this isn't a valuable feature. How else are you supposed to do a logical failover without having your entire application break?

> 2) performance: There was concern about the performance impact, and that
> it affects everyone, including those who don't replicate sequences (as
> the overhead is mostly incurred before calls to output plugin etc.).
>
> The agreement was that the best way is to have a CREATE SUBSCRIPTION
> option that would instruct the upstream to decode sequences. By default
> this option is 'off' (because that's the no-overhead case), but it can
> be enabled for each subscription.

Seems reasonable, at least unless and until we come up with something better.

> 3) correctness: The last point is about making the "transactional" flag
> correct when the snapshot state changes mid-transaction, originally
> pointed out by Dilip [4]. Per [5] this however happens to work
> correctly, because while we identify the change as 'non-transactional'
> (which is incorrect), we immediately throw it away (so we don't try to
> apply it, which would error-out).

I've said this before, but I still find this really scary. It's unclear to me that we can simply classify updates as transactional or non-transactional and expect things to work. If it's possible, I hope we have a really good explanation somewhere of how and why it's possible. If we do, can somebody point me to it so I can read it?

To be possibly slightly more clear about my concern, I think the scary case is where we have transactional and non-transactional things happening to the same sequence in close temporal proximity, either within the same session or across two or more sessions. If a non-transactional change can get reordered ahead of some transactional change upon which it logically depends, or behind some transactional change that logically depends on it, then we have trouble. I also wonder if there are any cases where the same operation is partly transactional and partly non-transactional.

-- Robert Haas EDB: http://www.enterprisedb.com
On 1/23/24 21:47, Robert Haas wrote:
> On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> 1) desirability: We want a built-in way to handle sequences in logical
>> replication. I think everyone agrees this is not a way to do distributed
>> sequences in active-active setups, but that there are other use cases
>> that need this feature - typically upgrades / logical failover.
>
> Yeah. I find it extremely hard to take seriously the idea that this
> isn't a valuable feature. How else are you supposed to do a logical
> failover without having your entire application break?
>
>> 2) performance: There was concern about the performance impact, and that
>> it affects everyone, including those who don't replicate sequences (as
>> the overhead is mostly incurred before calls to output plugin etc.).
>>
>> The agreement was that the best way is to have a CREATE SUBSCRIPTION
>> option that would instruct the upstream to decode sequences. By default
>> this option is 'off' (because that's the no-overhead case), but it can
>> be enabled for each subscription.
>
> Seems reasonable, at least unless and until we come up with something better.
>
>> 3) correctness: The last point is about making the "transactional" flag
>> correct when the snapshot state changes mid-transaction, originally
>> pointed out by Dilip [4]. Per [5] this however happens to work
>> correctly, because while we identify the change as 'non-transactional'
>> (which is incorrect), we immediately throw it away (so we don't try to
>> apply it, which would error-out).
>
> I've said this before, but I still find this really scary. It's
> unclear to me that we can simply classify updates as transactional or
> non-transactional and expect things to work. If it's possible, I hope
> we have a really good explanation somewhere of how and why it's
> possible. If we do, can somebody point me to it so I can read it?
>

I did try to explain how this works (and why) in a couple places:

1) the commit message
2) reorderbuffer header comment
3) ReorderBufferSequenceIsTransactional comment (and nearby)

It's possible this does not meet your expectations, ofc. Maybe there should be a separate README for this - I haven't found anything like that for logical decoding in general, which is why I did (1)-(3).

> To be possibly slightly more clear about my concern, I think the scary
> case is where we have transactional and non-transactional things
> happening to the same sequence in close temporal proximity, either
> within the same session or across two or more sessions. If a
> non-transactional change can get reordered ahead of some transactional
> change upon which it logically depends, or behind some transactional
> change that logically depends on it, then we have trouble. I also
> wonder if there are any cases where the same operation is partly
> transactional and partly non-transactional.
>

I certainly understand this concern, and to some extent I even share it. Having to differentiate between transactional and non-transactional changes certainly confused me more than once. It's especially confusing, because the decoding implicitly changes the perceived ordering/atomicity of the events.

That being said, I don't think it can get reordered the way you're concerned about. The "transactionality" is determined by the relfilenode change, so how could the reordering happen? We'd have to misidentify a change in either direction - and for a nontransactional->transactional change that's clearly not possible.
There has to be a new relfilenode in that xact. In the other direction (transactional->nontransactional), it can happen if we fail to decode the relfilenode record. Which is what we discussed earlier, but came to the conclusion that it actually works OK. Of course, there might be bugs. I spent quite a bit of effort reviewing and testing this, but there still might be something wrong. But I think that applies to any feature. What would be worse is some sort of thinko in the approach in general. I don't have a good answer to that, unfortunately - I think it works, but how would I know for sure? We explored multiple alternative approaches and all of them crashed and burned ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I did try to explain how this works (and why) in a couple places:
>
> 1) the commit message
> 2) reorderbuffer header comment
> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>
> It's possible this does not meet your expectations, ofc. Maybe there
> should be a separate README for this - I haven't found anything like
> that for logical decoding in general, which is why I did (1)-(3).

I read over these and I do think they answer a bunch of questions, but I don't think they answer all of the questions.

Suppose T1 creates a sequence and commits. Then T2 calls nextval(). Then T3 drops the sequence. According to the commit message, T2's change will be "replayed immediately after decoding". But it's essential to replay T2's change after we replay T1 and before we replay T3, and the comments don't explain why that's guaranteed.

The answer might be "locks". If we always replay a transaction immediately when we see its commit record, then in the example above we're fine, because the commit record for the transaction that creates the sequence must precede the nextval() call, since the sequence won't be visible until the transaction commits, and also because T1 holds a lock on it at that point sufficient to lock out nextval. And the nextval record must precede the point where T3 takes an exclusive lock on the sequence.

Note, however, that this chain of reasoning critically depends on us never delaying application of a transaction. If we might reach T1's commit record and say "hey, let's hold on to this for a bit and replay it after we've decoded some more," everything immediately breaks, unless we also delay application of T2's non-transactional update in such a way that it's still guaranteed to happen after T1. I wonder if this kind of situation would be a problem for a future parallel-apply feature. It wouldn't work, for example, to hand T1 and T3 off (in that order) to a separate apply process but handle T2's "non-transactional" message directly, because it might handle that message before the application of T1 got completed.

This also seems to depend on every transactional operation that might affect a future non-transactional operation holding a lock that would conflict with that non-transactional operation. For example, if ALTER SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence, then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what the semantics of that would be -- would T1's second nextval() see the sequence restart, or what? But if the effect of T1's second nextval does depend in some way on the ALTER SEQUENCE operation which precedes it in the WAL stream, then we might have some trouble here, because both nextvals precede the commit of T2. Fortunately, this sequence of events is foreclosed by locking.

But I did find one somewhat-similar case in which that's not so.

S1: create table withseq (a bigint generated always as identity);
S1: begin;
S2: select nextval('withseq_a_seq');
S1: alter table withseq set unlogged;
S2: select nextval('withseq_a_seq');

I think this is a bug in the code that supports owned sequences rather than a problem that this patch should have to do something about. When a sequence is flipped between logged and unlogged directly, we take a stronger lock than we do here when it's done in this indirect way.

Also, I'm not quite sure if it would pose a problem for sequence decoding anyway: it changes the relfilenode, but not the value. But this is the *kind* of problem that could make the approach unsafe: supposedly transactional changes being interleaved with supposedly non-transactional changes, in such a way that the non-transactional changes might get applied at the wrong time relative to the transactional changes.

-- Robert Haas EDB: http://www.enterprisedb.com
On 1/26/24 15:39, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I did try to explain how this works (and why) in a couple places:
>>
>> 1) the commit message
>> 2) reorderbuffer header comment
>> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>>
>> It's possible this does not meet your expectations, ofc. Maybe there
>> should be a separate README for this - I haven't found anything like
>> that for logical decoding in general, which is why I did (1)-(3).
>
> I read over these and I do think they answer a bunch of questions, but
> I don't think they answer all of the questions.
>
> Suppose T1 creates a sequence and commits. Then T2 calls nextval().
> Then T3 drops the sequence. According to the commit message, T2's
> change will be "replayed immediately after decoding". But it's
> essential to replay T2's change after we replay T1 and before we
> replay T3, and the comments don't explain why that's guaranteed.
>
> The answer might be "locks". If we always replay a transaction
> immediately when we see its commit record, then in the example above
> we're fine, because the commit record for the transaction that creates
> the sequence must precede the nextval() call, since the sequence won't
> be visible until the transaction commits, and also because T1 holds a
> lock on it at that point sufficient to lock out nextval. And the
> nextval record must precede the point where T3 takes an exclusive lock
> on the sequence.
>

Right, locks + apply in commit order gives us this guarantee (I can't think of a case where it wouldn't be the case).

> Note, however, that this chain of reasoning critically depends on us
> never delaying application of a transaction. If we might reach T1's
> commit record and say "hey, let's hold on to this for a bit and replay
> it after we've decoded some more," everything immediately breaks,
> unless we also delay application of T2's non-transactional update in
> such a way that it's still guaranteed to happen after T1. I wonder if
> this kind of situation would be a problem for a future parallel-apply
> feature. It wouldn't work, for example, to hand T1 and T3 off (in that
> order) to a separate apply process but handle T2's "non-transactional"
> message directly, because it might handle that message before the
> application of T1 got completed.
>

Doesn't the whole logical replication critically depend on the commit order? If you decide to arbitrarily reorder/delay the transactions, all kinds of really bad things can happen. That's a generic problem, it applies to all kinds of objects, not just sequences - a parallel apply would need to detect this sort of dependency (e.g. INSERT + DELETE of the same key), and do something about it.

Similar for sequences, where the important event is allocation of a new relfilenode. If anything, it's easier for sequences, because the relfilenode tracking gives us an explicit (and easy) way to detect these dependencies between transactions.

> This also seems to depend on every transactional operation that might
> affect a future non-transactional operation holding a lock that would
> conflict with that non-transactional operation. For example, if ALTER
> SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence,
> then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART
> WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what
> the semantics of that would be -- would T1's second nextval() see the
> sequence restart, or what? But if the effect of T1's second nextval
> does depend in some way on the ALTER SEQUENCE operation which precedes
> it in the WAL stream, then we might have some trouble here, because
> both nextvals precede the commit of T2. Fortunately, this sequence of
> events is foreclosed by locking.
>

I don't quite follow :-( AFAIK this theory hinges on not having the right lock, but I believe ALTER SEQUENCE does obtain the lock (at least in cases that assign a new relfilenode). Which means such reordering should not be possible, because nextval() in other transactions will then wait until commit. And all nextval() calls in the same transaction will be treated as transactional.

So I think this works OK. If something does not lock the sequence in a way that would prevent other xacts from doing nextval() on it, it's not a change that would change the relfilenode - and so it does not switch the sequence into a transactional mode.

> But I did find one somewhat-similar case in which that's not so.
>
> S1: create table withseq (a bigint generated always as identity);
> S1: begin;
> S2: select nextval('withseq_a_seq');
> S1: alter table withseq set unlogged;
> S2: select nextval('withseq_a_seq');
>
> I think this is a bug in the code that supports owned sequences rather
> than a problem that this patch should have to do something about. When
> a sequence is flipped between logged and unlogged directly, we take a
> stronger lock than we do here when it's done in this indirect way.

Yes, I think this is a bug in handling of owned sequences - from the moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions generate duplicate values (until S1 is committed, at which point the values generated in S2 get "forgotten").

It seems we end up updating both relfilenodes, which is clearly wrong.

Seems like a bug independent of the decoding, IMO.

> Also, I'm not quite sure if it would pose a problem for sequence
> decoding anyway: it changes the relfilenode, but not the value. But
> this is the *kind* of problem that could make the approach unsafe:
> supposedly transactional changes being interleaved with supposedly
> non-transactional changes, in such a way that the non-transactional
> changes might get applied at the wrong time relative to the
> transactional changes.
>

I'm not sure what you mean by "changes relfilenode, not value" but I suspect it might break the sequence decoding - or at least confuse it.

I haven't checked what exactly happens when we change logged/unlogged for a sequence, but I assume it does change the relfilenode, which already is a change of a value - we WAL-log the new sequence state, at least. But it should be treated as "transactional" in the transaction that did the ALTER TABLE, because it created the relfilenode.

However, I'm not sure this is a valid argument against the sequence decoding patch. If something does not acquire the correct lock, it's not surprising something else breaks, if it relies on the lock.

Of course, I understand you're trying to make a broader point - that if something like this could happen in a "correct" case, it'd be a problem. But I don't think that's possible. The whole "transactional" thing is determined by having a new relfilenode for the sequence, and I can't imagine a case where we could assign a new relfilenode without a lock.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Right, locks + apply in commit order gives us this guarantee (I can't
> think of a case where it wouldn't be the case).

I couldn't find any cases of inadequate locking other than the one I mentioned.

> Doesn't the whole logical replication critically depend on the commit
> order? If you decide to arbitrarily reorder/delay the transactions, all
> kinds of really bad things can happen. That's a generic problem, it
> applies to all kinds of objects, not just sequences - a parallel apply
> would need to detect this sort of dependency (e.g. INSERT + DELETE of
> the same key), and do something about it.

Yes, but here I'm not just talking about the commit order. I'm talking about the order of applying non-transactional operations relative to commits.

Consider:

T1: CREATE SEQUENCE s;
T2: BEGIN;
T2: SELECT nextval('s');
T3: SELECT nextval('s');
T2: ALTER SEQUENCE s INCREMENT 2;
T2: SELECT nextval('s');
T2: COMMIT;

The commit order is T1 < T3 < T2, but T3 makes no transactional changes, so the commit order is really just T1 < T2. But it's completely wrong to say that all we need to do is apply T1 before we apply T2. The correct order of application is:

1. T1.
2. T2's first nextval
3. T3's nextval
4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and the subsequent nextval)

In other words, the fact that some sequence changes are non-transactional creates ordering hazards that don't exist if there are no non-transactional changes. So in that way, sequences are different from table modifications, where applying the transactions in order of commit is all we need to do. Here we need to apply the transactions in order of commit and also apply the non-transactional changes at the right point in the sequence. Consider the following alternative apply sequence:

1. T1.
2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and the subsequent nextval)
3. T3's nextval
4. T2's first nextval

That's still in commit order. It's also wrong.

Imagine that you commit this patch and someone later wants to do parallel logical apply. So every time they finish decoding a transaction, they stick it in a queue to be applied by the next available worker. But, non-transactional changes are very simple, so we just directly apply those in the main process. Well, kaboom! But now this can happen with the above example.

1. Decode T1. Add to queue for apply.
2. Before the (idle) apply worker has a chance to pull T1 out of the queue, decode the first nextval and try to apply it.

Oops. We're trying to apply a modification to a sequence that hasn't been created yet. I'm not saying that this kind of hypothetical is a reason not to commit the patch. But it seems like we're not on the same page about what the ordering requirements are here. I'm just making the argument that those non-transactional operations actually act like mini-transactions. They need to happen at the right time relative to the real transactions. A non-transactional operation needs to be applied after any transactions that commit before it is logged, and before any transactions that commit after it's logged.

> Yes, I think this is a bug in handling of owned sequences - from the
> moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions
> generate duplicate values (until S1 is committed, at which point the
> values generated in S2 get "forgotten").
>
> It seems we end up updating both relfilenodes, which is clearly wrong.
> > Seems like a bug independent of the decoding, IMO. Yeah. -- Robert Haas EDB: http://www.enterprisedb.com
On 2/13/24 17:37, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Right, locks + apply in commit order gives us this guarantee (I can't
>> think of a case where it wouldn't be the case).
>
> I couldn't find any cases of inadequate locking other than the one I mentioned.
>
>> Doesn't the whole logical replication critically depend on the commit
>> order? If you decide to arbitrarily reorder/delay the transactions, all
>> kinds of really bad things can happen. That's a generic problem, it
>> applies to all kinds of objects, not just sequences - a parallel apply
>> would need to detect this sort of dependency (e.g. INSERT + DELETE of
>> the same key), and do something about it.
>
> Yes, but here I'm not just talking about the commit order. I'm talking
> about the order of applying non-transactional operations relative to
> commits.
>
> Consider:
>
> T1: CREATE SEQUENCE s;
> T2: BEGIN;
> T2: SELECT nextval('s');
> T3: SELECT nextval('s');
> T2: ALTER SEQUENCE s INCREMENT 2;
> T2: SELECT nextval('s');
> T2: COMMIT;
>

It's not clear to me if you're talking about nextval() that happens to generate WAL, or nextval() covered by WAL generated by a previous call. I'm going to assume it's the former, i.e. nextval() that generated WAL describing the *next* sequence chunk, because without WAL there's nothing to apply and therefore no issue with T3 ordering.

The way I think about non-transactional sequence changes is as if they were tiny transactions that happen "fully" (including commit) at the LSN where the change is logged.

> The commit order is T1 < T3 < T2, but T3 makes no transactional
> changes, so the commit order is really just T1 < T2. But it's
> completely wrong to say that all we need to do is apply T1 before we
> apply T2. The correct order of application is:
>
> 1. T1.
> 2. T2's first nextval
> 3. T3's nextval
> 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
>

Is that quite true? If T3 generated WAL (for the nextval call), it will be applied at that particular LSN. AFAIK that guarantees it happens after the first T2 change (which is also non-transactional) and before the transactional T2 change (because that creates a new relfilenode).

> In other words, the fact that some sequence changes are
> non-transactional creates ordering hazards that don't exist if there
> are no non-transactional changes. So in that way, sequences are
> different from table modifications, where applying the transactions in
> order of commit is all we need to do. Here we need to apply the
> transactions in order of commit and also apply the non-transactional
> changes at the right point in the sequence. Consider the following
> alternative apply sequence:
>
> 1. T1.
> 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 3. T3's nextval
> 4. T2's first nextval
>
> That's still in commit order. It's also wrong.
>

Yes, this would be wrong. Thankfully the apply is not allowed to reorder the changes like this, because that's not what "non-transactional" means in this context. It does not mean we can arbitrarily reorder the changes, it only means the changes are applied as if they were independent transactions (but in the same order as they were executed originally). Both with respect to the other non-transactional changes, and to "commits" of other stuff.

(for serial apply, at least)

> Imagine that you commit this patch and someone later wants to do
> parallel logical apply. So every time they finish decoding a
> transaction, they stick it in a queue to be applied by the next
> available worker. But, non-transactional changes are very simple, so
> we just directly apply those in the main process. Well, kaboom! But
> now this can happen with the above example.
>
> 1. Decode T1. Add to queue for apply.
> 2. Before the (idle) apply worker has a chance to pull T1 out of the
> queue, decode the first nextval and try to apply it.
>
> Oops. We're trying to apply a modification to a sequence that hasn't
> been created yet. I'm not saying that this kind of hypothetical is a
> reason not to commit the patch. But it seems like we're not on the
> same page about what the ordering requirements are here. I'm just
> making the argument that those non-transactional operations actually
> act like mini-transactions. They need to happen at the right time
> relative to the real transactions. A non-transactional operation needs
> to be applied after any transactions that commit before it is logged,
> and before any transactions that commit after it's logged.
>

How is this issue specific to sequences? AFAIK this is a general problem with transactions that depend on each other. Consider for example this:

T1: INSERT INTO t (id) VALUES (1);
T2: DELETE FROM t WHERE id = 1;

If you parallelize this in a naive way, maybe T2 gets applied before T1. In which case the DELETE won't find the row yet.

There are different ways to address this. You can detect this type of conflict (e.g. a DELETE that doesn't find a match), drain the apply queue and retry the transaction. Or you may compare keysets of the transactions and make sure the apply waits until the conflicting one gets fully applied first. AFAIK for sequences it's not any different, except the key we'd have to compare is the sequence itself.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The way I think about non-transactional sequence changes is as if they
> were tiny transactions that happen "fully" (including commit) at the LSN
> where the change is logged.

100% this.

> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally). Both with respect to
> the other non-transactional changes, and to "commits" of other stuff.

Right, this is very important and I agree completely.

I'm feeling more confident about this now that I heard you say that stuff -- this is really the key issue I've been worried about since I first looked at this, and I wasn't sure that you were in agreement, but it sounds like you are. I think we should (a) fix the locking bug I found (but that can be independent of this patch) and (b) make sure that this patch documents the points from the quoted material above so that everyone who reads the code (and maybe tries to enhance it) is clear on what the assumptions are.

(I haven't checked whether it documents that stuff or not. I'm just saying it should, because I think it's a subtlety that someone might miss.)

-- Robert Haas EDB: http://www.enterprisedb.com
On 2/15/24 05:16, Robert Haas wrote:
> On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The way I think about non-transactional sequence changes is as if they
>> were tiny transactions that happen "fully" (including commit) at the LSN
>> where the change is logged.
>
> 100% this.
>
>> It does not mean we can arbitrarily reorder the changes, it only means
>> the changes are applied as if they were independent transactions (but in
>> the same order as they were executed originally). Both with respect to
>> the other non-transactional changes, and to "commits" of other stuff.
>
> Right, this is very important and I agree completely.
>
> I'm feeling more confident about this now that I heard you say that
> stuff -- this is really the key issue I've been worried about since I
> first looked at this, and I wasn't sure that you were in agreement,
> but it sounds like you are. I think we should (a) fix the locking bug
> I found (but that can be independent of this patch) and (b) make sure
> that this patch documents the points from the quoted material above so
> that everyone who reads the code (and maybe tries to enhance it) is
> clear on what the assumptions are.
>
> (I haven't checked whether it documents that stuff or not. I'm just
> saying it should, because I think it's a subtlety that someone might
> miss.)
>

Thanks for thinking about these issues with reordering events. Good that we seem to be in agreement, and that you feel more confident about this. I'll check if there's a good place to document this.

For me, the part that I feel most uneasy about is the decoding while the snapshot is still being built (and can flip to a consistent snapshot between the relfilenode creation and the sequence change, confusing the logic that decides which changes are transactional).

It seems "a bit weird" that we keep the "simple" logic that may end up with an incorrect "non-transactional" result, but happens to then work fine because we immediately discard the change.

But it still feels better than the alternative, which requires us to start decoding stuff (relfilenode creation) before the snapshot being built is consistent, which we didn't do before - or at least not in this particular way. While I don't have a practical example where it would cause trouble now, I have a nagging feeling it might easily cause trouble in the future by making some new features harder to implement.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 16, 2024 at 1:57 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> For me, the part that I feel most uneasy about is the decoding while the
> snapshot is still being built (and can flip to a consistent snapshot
> between the relfilenode creation and the sequence change, confusing the
> logic that decides which changes are transactional).
>
> It seems "a bit weird" that we keep the "simple" logic that may end up
> with an incorrect "non-transactional" result, but happens to then work
> fine because we immediately discard the change.
>
> But it still feels better than the alternative, which requires us to
> start decoding stuff (relfilenode creation) before the snapshot being
> built is consistent, which we didn't do before - or at least not in
> this particular way. While I don't have a practical example where it
> would cause trouble now, I have a nagging feeling it might easily cause
> trouble in the future by making some new features harder to implement.

I don't understand the issues here well enough to comment. Is there a good write-up someplace I can read to understand the design here?

Is the rule that changes are transactional if and only if the current transaction has assigned a new relfilenode to the sequence? Why does the logic get confused if the state of the snapshot changes?

My naive reaction is that it kinda sounds like you're relying on two different mistakes cancelling each other out, and that might be a bad idea, because maybe there's some situation where they don't. But I don't understand the issue well enough to have an educated opinion at this point.

-- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/19/23 13:54, Christophe Pettus wrote:
> > Hi,
> >
> > I wanted to hop in here on one particular issue:
> >
> >> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >> better solution for distributed (esp. active-active) systems. But there
> >> are important use cases that are likely to keep using regular sequences
> >> (online upgrades of single-node instances, existing systems, ...).
> >
> > +1.
> >
> > Right now, the lack of sequence replication is a rather large
> > foot-gun on logical replication upgrades. Copying the sequences
> > over during the cutover period is doable, of course, but:
> >
> > (a) There's no out-of-the-box tooling that does it, so everyone has
> > to write some scripts just for that one function.
> >
> > (b) It's one more thing that extends the cutover window.
> >
>
> I agree it's an annoying gap for this use case. But if this is the only
> use case, maybe a better solution would be to provide such tooling
> instead of adding it to the logical decoding?
>
> It might seem a bit strange if most data is copied by replication
> directly, while sequences need special handling, ofc.
>

One difference between the logical replication of tables and sequences is that we can guarantee with synchronous_commit (and synchronous_standby_names) that after failover a transaction's data is replicated or not, whereas for sequences we can't guarantee that because of their non-transactional nature. Say, there are two transactions T1 and T2: it is possible that T1's entire table data and sequence data are committed and replicated, but only T2's sequence data is replicated. So, after failover to the logical subscriber in such a case, if one routes T2 again to the new node (as it was not successful previously), it would needlessly perform the sequence changes again. I don't know how much that matters, but that would probably be the difference between the replication of tables and sequences.

I agree with your point above that for upgrades some tool like pg_copysequence, where we provide a way to copy sequence data to subscribers from the publisher, would suffice the need.

-- With Regards, Amit Kapila.
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Is the rule that changes are transactional if and only if the current
> transaction has assigned a new relfilenode to the sequence?

Yes, that's the rule.

> Why does the logic get confused if the state of the snapshot changes?

The rule doesn't get changed, but the way this identification is implemented in the decoding gets confused and misidentifies a transactional change as non-transactional. Whether the sequence change is transactional or not is identified based on what WAL we have decoded from the particular transaction, and whether we decode a particular WAL record or not depends upon the snapshot state (it's about what we decode, not necessarily what we send). So if the snapshot state changed mid-transaction, that means we haven't decoded the WAL record which created a new relfilenode, but we will decode the WAL record which is operating on the sequence. So here we will assume the change is non-transactional whereas it was actually transactional, because we did not decode some of the changes of the transaction which we rely on for identifying whether it is transactional or not.

> My naive reaction is that it kinda sounds like you're relying on two
> different mistakes cancelling each other out, and that might be a bad
> idea, because maybe there's some situation where they don't. But I
> don't understand the issue well enough to have an educated opinion at
> this point.

I would say the first one is a mistake in identifying a transactional change as non-transactional during decoding, and that mistake happens only when we decode the transaction partially. But we never stream partially decoded transactions downstream, which means that even though we have made a mistake in decoding it, we are not streaming it, so our mistake is not getting converted into a real problem. But again, I agree there is a temporary wrong decision, and if we try to do something else based on this decision then it could be an issue.

You might be interested in more detail [1] where I first reported this problem and also [2] where we concluded why this is not creating a real problem.

[1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
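To make the rule concrete, a small sketch (assuming a pre-existing sequence 's'; the comments describe the decoding behavior discussed above):

    -- Non-transactional: 's' keeps its old relfilenode, so the increment
    -- is replayed immediately during decoding - even though the
    -- transaction aborts (sequence increments survive a rollback anyway).
    BEGIN;
    SELECT nextval('s');
    ROLLBACK;

    -- Transactional: the sequence is created in this transaction (new
    -- relfilenode), so its increments are queued in the reorder buffer
    -- and replayed only if/when the transaction commits.
    BEGIN;
    CREATE SEQUENCE s2;
    SELECT nextval('s2');
    COMMIT;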
On Tue, Feb 20, 2024 at 1:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> You might be interested in more detail [1] where I first reported this
> problem and also [2] where we concluded why this is not creating a
> real problem.
>
> [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

Thanks. Dilip and I just spent a lot of time talking this through on a call. One of the key bits of logic is here:

+ /* Skip the change if already processed (per the snapshot). */
+ if (transactional &&
+     !SnapBuildProcessChange(builder, xid, buf->origptr))
+     return;
+ else if (!transactional &&
+          (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
+           SnapBuildXactNeedsSkip(builder, buf->origptr)))
+     return;

As a stylistic note, I think this would be more clear if it were written as:

if (transactional)
{
    if (!SnapBuildProcessChange())
        return;
}
else
{
    if (something else)
        return;
}

Now, on to correctness. It's possible for us to identify a transactional change as non-transactional if smgr_decode() was called for the relfilenode before SNAPBUILD_FULL_SNAPSHOT was reached. In that case, if !SnapBuildProcessChange() would have been true, then we need SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT || SnapBuildXactNeedsSkip(builder, buf->origptr) to also be true. Otherwise, we'll process this change when we wouldn't have otherwise.

But Dilip made an argument to me about this which seems correct to me. snapbuild.h says that SNAPBUILD_CONSISTENT is reached only when we find a point where all transactions that were running at the time we reached SNAPBUILD_FULL_SNAPSHOT have finished. So if this transaction is one for which we incorrectly identified the sequence change as non-transactional, then we cannot be in the SNAPBUILD_CONSISTENT state yet, so SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT will be true, and hence the whole "or" condition will be true and we'll return.

So far, so good. I think, anyway. I haven't comprehensively verified that the comment in snapbuild.h accurately reflects what the code actually does. But if it does, then presumably we shouldn't see a record for which we might have mistakenly identified a change as non-transactional after reaching SNAPBUILD_CONSISTENT, which seems to be good enough to guarantee that the mistake won't matter.

However, the logic in smgr_decode() doesn't only care about the snapshot state. It also cares about the fast-forward flag:

+ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+     ctx->fast_forward)
+     return;

Let's say fast_forward is true. Then smgr_decode() is going to skip recording anything about the relfilenode, so we'll identify all sequence changes as non-transactional. But look at how this case is handled in seq_decode():

+ if (ctx->fast_forward)
+ {
+     /*
+      * We need to set processing_required flag to notify the sequence
+      * change existence to the caller. Usually, the flag is set when
+      * either the COMMIT or ABORT records are decoded, but this must be
+      * turned on here because the non-transactional logical message is
+      * decoded without waiting for these records.
+      */
+     if (!transactional)
+         ctx->processing_required = true;
+
+     return;
+ }

This seems suspicious. Why are we testing the transactional flag here if it's guaranteed to be false?
My guess is that the person who wrote this code thought that the flag would be accurate even in this case, but that doesn't seem to be true. So this case probably needs some more thought. It's definitely not great that this logic is so complicated; it's really hard to verify that all the tests match up well enough to keep us out of trouble. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Let's say fast_forward is true. Then smgr_decode() is going to skip
> recording anything about the relfilenode, so we'll identify all
> sequence changes as non-transactional. But look at how this case is
> handled in seq_decode():
>
> + if (ctx->fast_forward)
> + {
> +     /*
> +      * We need to set processing_required flag to notify the sequence
> +      * change existence to the caller. Usually, the flag is set when
> +      * either the COMMIT or ABORT records are decoded, but this must be
> +      * turned on here because the non-transactional logical message is
> +      * decoded without waiting for these records.
> +      */
> +     if (!transactional)
> +         ctx->processing_required = true;
> +
> +     return;
> + }

It appears that the 'processing_required' flag was introduced as part of supporting upgrades for logical replication slots. Its purpose is to determine whether a slot is fully caught up, meaning that there are no pending decodable changes left before it can be upgraded.

So now if some change was transactional but we identified it as non-transactional, we will mark this flag 'ctx->processing_required = true;'. We temporarily set this flag incorrectly, but even if the change had been correctly identified initially, the flag would have been set to true again in the DecodeTXNNeedSkip() function, regardless of whether the transaction is committed or aborted. As a result, the flag would eventually be set to 'true', and the behavior would align with the intended logic.

But I am wondering why this flag is always set to true in DecodeTXNNeedSkip() irrespective of commit or abort. Because the aborted transactions are not supposed to be replayed? So if my observation is correct that this shouldn't be set to true for an aborted transaction, then we have a problem with sequences, where we identify transactional changes as non-transactional ones - for transactional changes this should depend upon the commit status.

On another thought, can there be a situation where we wrongly identified a change as non-transactional and set this flag, and the commit/abort record never appeared in the WAL and so was never decoded? That can also lead to an incorrect decision during the upgrade.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 2/20/24 06:54, Amit Kapila wrote:
> On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/19/23 13:54, Christophe Pettus wrote:
>>> Hi,
>>>
>>> I wanted to hop in here on one particular issue:
>>>
>>>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>>>> better solution for distributed (esp. active-active) systems. But there
>>>> are important use cases that are likely to keep using regular sequences
>>>> (online upgrades of single-node instances, existing systems, ...).
>>>
>>> +1.
>>>
>>> Right now, the lack of sequence replication is a rather large
>>> foot-gun on logical replication upgrades. Copying the sequences
>>> over during the cutover period is doable, of course, but:
>>>
>>> (a) There's no out-of-the-box tooling that does it, so everyone has
>>> to write some scripts just for that one function.
>>>
>>> (b) It's one more thing that extends the cutover window.
>>>
>>
>> I agree it's an annoying gap for this use case. But if this is the only
>> use case, maybe a better solution would be to provide such tooling
>> instead of adding it to the logical decoding?
>>
>> It might seem a bit strange if most data is copied by replication
>> directly, while sequences need special handling, ofc.
>>
>
> One difference between the logical replication of tables and sequences
> is that we can guarantee with synchronous_commit (and
> synchronous_standby_names) that after failover a transaction's data is
> replicated or not, whereas for sequences we can't guarantee that
> because of their non-transactional nature. Say, there are two
> transactions T1 and T2: it is possible that T1's entire table data and
> sequence data are committed and replicated, but only T2's sequence data
> is replicated. So, after failover to the logical subscriber in such a
> case, if one routes T2 again to the new node (as it was not successful
> previously), it would needlessly perform the sequence changes again. I
> don't know how much that matters, but that would probably be the
> difference between the replication of tables and sequences.
>

I don't quite follow what the problem with synchronous_commit is :-(

For sequences, we log the changes ahead, i.e. even if nextval() did not write anything into WAL, it's still safe because these changes are covered by the WAL generated some time ago (up to ~32 values back). And that's certainly subject to synchronous_commit, right?

There certainly are issues with sequences and syncrep:

https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com

but that's unrelated to logical replication.

FWIW I don't think we'd re-apply sequence changes needlessly, because the worker does update the origin after applying non-transactional changes. So after the replication gets restarted, we'd skip what we already applied, no?

But maybe there is an issue and I'm just not getting it. Could you maybe share an example of T1/T2, with a replication restart and what you think would happen?

> I agree with your point above that for upgrades some tool like
> pg_copysequence, where we provide a way to copy sequence data to
> subscribers from the publisher, would suffice the need.
>

Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet another tool users would need to use.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
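(The pre-logging is easy to observe from SQL, by the way - log_cnt on the sequence relation shows how many future values the last WAL record already covers. A minimal sketch, assuming a fresh default sequence:)

    CREATE SEQUENCE s3;
    SELECT nextval('s3');                -- WAL-logs this value plus ~32 ahead
    SELECT last_value, log_cnt FROM s3;  -- log_cnt should be 32 at this point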
On Tue, Feb 20, 2024 at 5:39 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/20/24 06:54, Amit Kapila wrote:
> > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 12/19/23 13:54, Christophe Pettus wrote:
> >>> Hi,
> >>>
> >>> I wanted to hop in here on one particular issue:
> >>>
> >>>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >>>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >>>> better solution for distributed (esp. active-active) systems. But there
> >>>> are important use cases that are likely to keep using regular sequences
> >>>> (online upgrades of single-node instances, existing systems, ...).
> >>>
> >>> +1.
> >>>
> >>> Right now, the lack of sequence replication is a rather large
> >>> foot-gun on logical replication upgrades. Copying the sequences
> >>> over during the cutover period is doable, of course, but:
> >>>
> >>> (a) There's no out-of-the-box tooling that does it, so everyone has
> >>> to write some scripts just for that one function.
> >>>
> >>> (b) It's one more thing that extends the cutover window.
> >>>
> >>
> >> I agree it's an annoying gap for this use case. But if this is the only
> >> use cases, maybe a better solution would be to provide such tooling
> >> instead of adding it to the logical decoding?
> >>
> >> It might seem a bit strange if most data is copied by replication
> >> directly, while sequences need special handling, ofc.
> >>
> >
> > One difference between the logical replication of tables and sequences
> > is that we can guarantee with synchronous_commit (and
> > synchronous_standby_names) that after failover transactions data is
> > replicated or not whereas for sequences we can't guarantee that
> > because of their non-transactional nature. Say, there are two
> > transactions T1 and T2, it is possible that T1's entire table data and
> > sequence data are committed and replicated but T2's sequence data is
> > replicated. So, after failover to logical subscriber in such a case if
> > one routes T2 again to the new node as it was not successful
> > previously then it would needlessly perform the sequence changes
> > again. I don't how much that matters but that would probably be the
> > difference between the replication of tables and sequences.
> >
>
> I don't quite follow what the problem with synchronous_commit is :-(
>
> For sequences, we log the changes ahead, i.e. even if nextval() did not
> write anything into WAL, it's still safe because these changes are
> covered by the WAL generated some time ago (up to ~32 values back). And
> that's certainly subject to synchronous_commit, right?
>
> There certainly are issues with sequences and syncrep:
>
> https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com
>
> but that's unrelated to logical replication.
>
> FWIW I don't think we'd re-apply sequence changes needlessly, because
> the worker does update the origin after applying non-transactional
> changes. So after the replication gets restarted, we'd skip what we
> already applied, no?
>

It will work for restarts, but I was trying to discuss what happens
when the publisher node goes down and we fail over to the subscriber
node and make it the primary node. After that, all unfinished
transactions will be re-routed to the new primary.
Consider a theoretical case where we send sequence changes of the yet
uncommitted transactions directly from the WAL buffers (something like
91f2cae7a4 does for physical replication) and then the primary or
publisher node crashes immediately. After failover to the subscriber
node, the application will re-route the unfinished transactions to the
new primary. In such a situation, I think there is a chance that we
will update the sequence value even though the subscriber has already
received/applied that update via replication. This is what I meant by
there probably being a difference between tables and sequences: for
tables, such a replicated change would be rolled back. Having said
that, this is probably no different from what would happen in the case
of physical replication.

> But maybe there is an issue and I'm just not getting it. Could you maybe
> share an example of T1/T2, with a replication restart and what you think
> would happen?
>
> > I agree with your point above that for upgrades some tool like
> > pg_copysequence where we can provide a way to copy sequence data to
> > subscribers from the publisher would suffice the need.
> >
>
> Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
> another tool users would need to use.
>

But can a logical replica be used for failover? We don't have any way
to replicate/sync the slots on subscribers, nor do we have a mechanism
to replicate existing publications. I think if we want to achieve
failover to a logical subscriber we need to replicate/sync the
required logical and physical slots to the subscribers. I haven't
thought it through completely, so there are probably more things to
consider before logical subscribers can be used as failover
candidates.

--
With Regards,
Amit Kapila.
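PS: To spell out the scenario I have in mind, a hypothetical timeline,
assuming the theoretical "send from WAL buffers" behavior above:

-- on the publisher
BEGIN;                                -- T2
INSERT INTO t VALUES (nextval('s'));  -- sequence increment WAL-logged
-- the (non-transactional) sequence change is streamed and applied on
-- the subscriber; the publisher then crashes before T2 commits

-- on the subscriber, now the new primary
BEGIN;                                -- application retries T2
INSERT INTO t VALUES (nextval('s'));  -- sequence advances again, on
COMMIT;                               -- top of the already-replicated
                                      -- change; the table row is
                                      -- inserted fresh, since T2's
                                      -- original row was never applied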
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/13/24 17:37, Robert Haas wrote:
>
> > In other words, the fact that some sequence changes are
> > non-transactional creates ordering hazards that don't exist if there
> > are no non-transactional changes. So in that way, sequences are
> > different from table modifications, where applying the transactions in
> > order of commit is all we need to do. Here we need to apply the
> > transactions in order of commit and also apply the non-transactional
> > changes at the right point in the sequence. Consider the following
> > alternative apply sequence:
> >
> > 1. T1.
> > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> > the subsequent nextval)
> > 3. T3's nextval
> > 4. T2's first nextval
> >
> > That's still in commit order. It's also wrong.
> >
>
> Yes, this would be wrong. Thankfully the apply is not allowed to reorder
> the changes like this, because that's not what "non-transactional" means
> in this context.
>
> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally).
>

In this regard, I have another scenario in mind where the apply order
could be different for changes within the same transaction. For
example, in transaction T1:

Begin;
Insert ..
Insert ..
nextval ..  -- assume this generates WAL
..
Insert ..
nextval ..  -- assume this generates WAL

In this case, if the nextval operations are applied in a different
order (i.e. before the inserts), there could be some inconsistency.
Say the apply does not follow the above order; then a trigger fired
for each row insert on both pub and sub, one that refers to the
current sequence value to make some decision, could behave differently
on the publisher and the subscriber. If this is not how the patch
behaves then that's fine, but otherwise isn't this something we should
be worried about?

--
With Regards,
Amit Kapila.
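PS: A minimal sketch of the kind of trigger I mean (hypothetical
names; it has to be marked ENABLE ALWAYS so it also fires when the
apply worker inserts the row on the subscriber):

CREATE SEQUENCE s;
CREATE TABLE t (v int, seq_snapshot bigint);

CREATE FUNCTION capture_seq() RETURNS trigger AS $$
BEGIN
    -- record the sequence's current state at the moment this row
    -- insert is applied; the value depends on whether the pending
    -- nextval changes were applied before or after the insert
    NEW.seq_snapshot := (SELECT last_value FROM s);
    RETURN NEW;
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER t_capture BEFORE INSERT ON t
    FOR EACH ROW EXECUTE FUNCTION capture_seq();
ALTER TABLE t ENABLE ALWAYS TRIGGER t_capture;

If the nextval-vs-insert ordering differs between the two nodes, the
seq_snapshot the trigger computes would differ as well.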
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 3:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Let's say fast_forward is true. Then smgr_decode() is going to skip
> > recording anything about the relfilenode, so we'll identify all
> > sequence changes as non-transactional. But look at how this case is
> > handled in seq_decode():
> >
> > + if (ctx->fast_forward)
> > + {
> > + /*
> > + * We need to set processing_required flag to notify the sequence
> > + * change existence to the caller. Usually, the flag is set when
> > + * either the COMMIT or ABORT records are decoded, but this must be
> > + * turned on here because the non-transactional logical message is
> > + * decoded without waiting for these records.
> > + */
> > + if (!transactional)
> > + ctx->processing_required = true;
> > +
> > + return;
> > + }
>
> It appears that the 'processing_required' flag was introduced as part
> of supporting upgrades for logical replication slots. Its purpose is
> to determine whether a slot is fully caught up, meaning that there are
> no pending decodable changes left before it can be upgraded.
>
> So if some change was transactional but we have identified it as
> non-transactional, we will set 'ctx->processing_required = true' here,
> i.e. we temporarily set the flag for the wrong reason. But even if the
> change had been classified correctly from the start, the flag would
> have been set to true anyway, in the DecodeTXNNeedSkip() function,
> regardless of whether the transaction is committed or aborted. As a
> result, the flag would eventually be set to 'true' either way, and the
> behavior would align with the intended logic.
>
> But I am wondering why this flag is always set to true in
> DecodeTXNNeedSkip() irrespective of the commit or abort. Is that
> because the aborted transactions are not supposed to be replayed? If
> my observation is correct and the flag should not be set to true for
> an aborted transaction, then we have a problem with sequences: we are
> identifying transactional changes as non-transactional ones, while for
> transactional changes this should depend upon the commit status.

I have checked this case with Amit Kapila. It seems that in the cases
where we have sent a prepared transaction or streamed an in-progress
transaction, we would need to send the abort as well. For that reason
we set 'ctx->processing_required' to true, so that if these WAL
records have not been streamed we do not allow the upgrade of such
slots.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > But I am wondering why this flag is always set to true in
> > DecodeTXNNeedSkip() irrespective of the commit or abort. Is that
> > because the aborted transactions are not supposed to be replayed? If
> > my observation is correct and the flag should not be set to true for
> > an aborted transaction, then we have a problem with sequences: we are
> > identifying transactional changes as non-transactional ones, while for
> > transactional changes this should depend upon the commit status.
>
> I have checked this case with Amit Kapila. It seems that in the cases
> where we have sent a prepared transaction or streamed an in-progress
> transaction, we would need to send the abort as well. For that reason
> we set 'ctx->processing_required' to true, so that if these WAL
> records have not been streamed we do not allow the upgrade of such
> slots.

I don't find this explanation clear enough for me to understand.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > But I am wondering why this flag is always set to true in
> > > DecodeTXNNeedSkip() irrespective of the commit or abort. Is that
> > > because the aborted transactions are not supposed to be replayed? If
> > > my observation is correct and the flag should not be set to true for
> > > an aborted transaction, then we have a problem with sequences: we are
> > > identifying transactional changes as non-transactional ones, while for
> > > transactional changes this should depend upon the commit status.
> >
> > I have checked this case with Amit Kapila. It seems that in the cases
> > where we have sent a prepared transaction or streamed an in-progress
> > transaction, we would need to send the abort as well. For that reason
> > we set 'ctx->processing_required' to true, so that if these WAL
> > records have not been streamed we do not allow the upgrade of such
> > slots.
>
> I don't find this explanation clear enough for me to understand.

An explanation of why we set 'ctx->processing_required' to true from
DecodeCommit as well as DecodeAbort:
--------------------------------------------------------------------
For upgrading logical replication slots, it's essential to ensure
these slots are completely synchronized with the subscriber. To
determine that, we process all the pending WAL in 'fast_forward' mode
and check whether there is any decodable WAL left. In short, any WAL
record type that we would stream to the subscriber in normal mode (not
fast_forward mode) is considered decodable, and that includes the
abort record. That's why, at the end of a transaction, commit or
abort, we need to set 'ctx->processing_required' to true: some
decodable WAL exists, so we cannot upgrade this slot.

Why is the below check safe?

> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

So the problem is that we might consider the transaction change as
non-transactional and mark this flag as true. But what would have
happened if we had identified it correctly as transactional? In such
cases we wouldn't have set this flag here, but then we would have set
it while processing DecodeAbort/DecodeCommit, so the net effect would
be the same, no? You may ask what happens if the abort/commit record
never appears in the WAL. But this flag is specifically for the
upgrade case, and in that case we have to do a clean shutdown, so that
may not be an issue. However, if in the future we try to use
'ctx->processing_required' for something else where a clean shutdown
is not guaranteed, then this flag could be set incorrectly.

I am not arguing that this is a perfect design, but I am just making a
point about why it would work.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
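PS: For reference, as I understand it this is the check that
pg_upgrade drives (the function can only be called while the server is
in binary upgrade mode, so this is illustrative rather than something
to run directly):

SELECT binary_upgrade_logical_slot_has_caught_up('myslot');
-- expected to return false when fast-forward decoding found decodable
-- WAL, i.e. when processing_required got set, in which case pg_upgrade
-- refuses to migrate the slot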
On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> So the problem is that we might consider the transaction change as
> non-transactional and mark this flag as true.

But it's not "might", right? It's absolutely 100% certain that we will
consider that transaction's changes as non-transactional ... because
when we're in fast-forward mode, the table of new relfilenodes is not
built, and so whenever we check whether any transaction made a new
relfilenode for this sequence, the answer will be no.

> But what would have happened if we had identified it correctly as
> transactional? In such cases we wouldn't have set this flag here, but
> then we would have set it while processing DecodeAbort/DecodeCommit,
> so the net effect would be the same, no? You may ask what happens if
> the abort/commit record never appears in the WAL. But this flag is
> specifically for the upgrade case, and in that case we have to do a
> clean shutdown, so that may not be an issue. However, if in the future
> we try to use 'ctx->processing_required' for something else where a
> clean shutdown is not guaranteed, then this flag could be set
> incorrectly.
>
> I am not arguing that this is a perfect design, but I am just making a
> point about why it would work.

Even if this argument is correct (and I don't know if it is), the code
and comments need some updating. We should not be testing a flag that
is guaranteed false with comments that make it sound like the value of
the flag is trustworthy when it isn't.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > So the problem is that we might consider the transaction change as
> > non-transactional and mark this flag as true.
>
> But it's not "might", right? It's absolutely 100% certain that we will
> consider that transaction's changes as non-transactional ... because
> when we're in fast-forward mode, the table of new relfilenodes is not
> built, and so whenever we check whether any transaction made a new
> relfilenode for this sequence, the answer will be no.
>
> > But what would have happened if we had identified it correctly as
> > transactional? In such cases we wouldn't have set this flag here, but
> > then we would have set it while processing DecodeAbort/DecodeCommit,
> > so the net effect would be the same, no? You may ask what happens if
> > the abort/commit record never appears in the WAL. But this flag is
> > specifically for the upgrade case, and in that case we have to do a
> > clean shutdown, so that may not be an issue. However, if in the future
> > we try to use 'ctx->processing_required' for something else where a
> > clean shutdown is not guaranteed, then this flag could be set
> > incorrectly.
> >
> > I am not arguing that this is a perfect design, but I am just making a
> > point about why it would work.
>
> Even if this argument is correct (and I don't know if it is), the code
> and comments need some updating. We should not be testing a flag that
> is guaranteed false with comments that make it sound like the value of
> the flag is trustworthy when it isn't.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,

Let me share a bit of an update regarding this patch and PG17.

I have discussed this patch and how to move it forward with a couple
of hackers (both within EDB and outside), and my takeaway is that the
patch is not quite baked yet, not enough to make it into PG17 :-(

There are two main reasons / concerns leading to this conclusion:

* correctness of the decoding part

There are (were) doubts about decoding during startup, before the
snapshot gets consistent, when we can get "temporarily incorrect"
decisions on whether a change is transactional. While the behavior is
ultimately correct (we treat all such changes as non-transactional and
discard them), it seems "dirty", and it's unclear to me whether it
might cause more serious issues down the line (not necessarily bugs,
but perhaps making it harder to implement future changes).

* handling of sequences in built-in replication

Per the patch, sequences need to be added to the publication
explicitly. But there were suggestions we might (should) add certain
sequences automatically - e.g. sequences backing SERIAL/BIGSERIAL
columns, etc. I'm not sure we really want to do that, and so far I
assumed we would start with the manual approach and move to automatic
addition in the future. But the agreement seems to be that such a
later switch would be a pretty significant "breaking change", and
something we probably don't want to do.

If someone has an opinion on either of the two issues (in either
direction), I'd like to hear it.

Obviously, I'm not particularly happy about this outcome. And I'm also
somewhat cautious because this patch was already committed and then
reverted in the PG16 cycle, and doing the same thing in PG17 is not on
my wish list.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company