Thread: logical decoding and replication of sequences, take 2
Hi,

Here's a rebased version of the patch adding logical decoding of sequences. The previous attempt [1] ended up getting reverted, due to running into issues with the non-transactional nature of sequences when decoding the existing WAL records. See [2] for details.

This patch uses a different approach, proposed by Hannu Krosing [3], based on tracking sequences actually modified in each transaction, and then WAL-logging the state at the end.

This does work, but I'm not very happy about WAL-logging all sequences at the end. The "problem" is we have to re-read the current state of the sequence from disk, because it might be concurrently updated by another transaction. Imagine two transactions, T1 and T2:

T1: BEGIN
T1: SELECT nextval('s') FROM generate_series(1,1000)
T2: BEGIN
T2: SELECT nextval('s') FROM generate_series(1,1000)
T2: COMMIT
T1: COMMIT

The expected outcome is that the sequence value is ~2000. We must not blindly apply the increments from T1 over the changes from T2. So the patch simply reads the "current" state of the sequence at commit time. Which is annoying, because it involves I/O, increases the commit duration, etc. On the other hand, this is likely cheaper than the other approach based on WAL-logging every sequence increment (that would have to be careful about obsoleted increments too, when applying them transactionally).

I wonder if we might deal with this by simply WAL-logging the LSN of the last change for each sequence (in the given xact), which would allow discarding the "obsolete" changes quite easily, I think. nextval() would simply look at the LSN in the page header.

And maybe we could then use the LSN to read the increment from the WAL during decoding, instead of having to read it and WAL-log it during commit. Essentially, we'd run a local XLogReader. Of course, we'd have to be careful about checkpoints, not sure what to do about that.

Another idea that just occurred to me is that if we end up having to read the sequence state during commit, maybe we could at least optimize it somehow. For example we might track the LSN of the last logged state for each sequence (in shared memory or something), and the other sessions could just skip the WAL-log if their "local" LSN is <= this LSN.

regards

[1] https://www.postgresql.org/message-id/flat/d045f3c2-6cfb-06d3-5540-e63c320df8bc@enterprisedb.com
[2] https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com
[3] https://www.postgresql.org/message-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
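To make the hazard in the message above concrete, here is a tiny standalone simulation (illustrative C with made-up values, not PostgreSQL code): replaying each transaction's own increments in commit order can move the sequence backwards, while replaying the state read at commit time cannot.

    #include <stdio.h>

    int
    main(void)
    {
        long seq;

        /*
         * T1 draws values 1..1000, T2 draws values 1001..2000, and T2
         * commits first. Naively replaying each transaction's own last
         * value in commit order moves the sequence backwards:
         */
        seq = 2000;                 /* apply T2 (commits first) */
        seq = 1000;                 /* apply T1 - sequence goes backwards! */
        printf("increment replay ends at %ld (wrong)\n", seq);

        /*
         * Replaying the on-disk state read at commit time instead: both
         * transactions read ~2000 at commit, so the apply order no
         * longer matters.
         */
        seq = 2000;                 /* T2 commits, logs current state */
        seq = 2000;                 /* T1 commits, logs current state */
        printf("state-at-commit replay ends at %ld (correct)\n", seq);

        return 0;
    }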
I've been thinking about the two optimizations mentioned at the end a bit more, so let me share my thoughts before I forget them:

On 8/18/22 23:10, Tomas Vondra wrote:
>
> ...
>
> And maybe we could then use the LSN to read the increment from the WAL
> during decoding, instead of having to read it and WAL-log it during
> commit. Essentially, we'd run a local XLogReader. Of course, we'd have
> to be careful about checkpoints, not sure what to do about that.
>

I think logging just the LSN is workable. I was worried about dealing with checkpoints, because imagine you do nextval() on a sequence that was last WAL-logged a couple of checkpoints back. Then you wouldn't be able to read the LSN (when decoding), because the WAL might have been recycled. But that can't happen, because we always force WAL-logging the first time nextval() is called after a checkpoint. So we know the LSN is guaranteed to be available.

Of course, this would not reduce the number of WAL messages, because we'd still log all sequences touched by the transaction. We wouldn't need to read the state from disk, though, and we could ignore "old" stuff in decoding (with an LSN lower than the last LSN we decoded). For frequently used sequences that seems like a win.

> Another idea that just occurred to me is that if we end up having to
> read the sequence state during commit, maybe we could at least optimize
> it somehow. For example we might track the LSN of the last logged state
> for each sequence (in shared memory or something), and the other
> sessions could just skip the WAL-log if their "local" LSN is <= this
> LSN.
>

Tracking the last LSN for each sequence (in an SLRU or something) should work too, I guess. In principle this just moves the skipping of "old" increments from decoding to writing, so that we don't even have to write those into WAL. We don't even need persistence, nor to keep all the records, I think. If you don't find a record for a given sequence, assume it wasn't logged yet and just log it.

Of course, it requires a bit of shared memory for each sequence, say ~32B. Not sure about the overhead, but I'd bet if you have many (~thousands of) frequently used sequences, there'll be a lot of other overhead making this irrelevant.

Of course, if we're doing the skipping when writing the WAL, maybe we should just read the sequence state - we'd do the I/O, but only in a fraction of the transactions, and we wouldn't need to read old WAL in logical decoding.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
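A minimal standalone sketch of that last idea (all names here are stand-ins - a real implementation would keep one such entry per sequence in shared memory, not a single global): a session WAL-logs the sequence state only if its local LSN is newer than the last LSN anyone logged.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

    /* LSN of the last logged state; per-sequence shared memory in reality */
    static XLogRecPtr last_logged_lsn = 0;

    /* return true if the caller still needs to WAL-log the sequence state */
    static bool
    need_to_log(XLogRecPtr local_lsn)
    {
        if (local_lsn <= last_logged_lsn)
            return false;           /* a newer state is already logged, skip */
        last_logged_lsn = local_lsn;
        return true;
    }

    int
    main(void)
    {
        printf("log at LSN 100? %d\n", need_to_log(100));   /* 1 - log it */
        printf("log at LSN  90? %d\n", need_to_log(90));    /* 0 - skip */
        printf("log at LSN 120? %d\n", need_to_log(120));   /* 1 - log it */
        return 0;
    }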
Hi,

I noticed on cfbot that the patch no longer applies, so here's a rebased version. Most of the breakage was due to the column filtering reworks, grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

I haven't looked into the optimizations / improvements I discussed in my previous post (logging only the LSN of the last WAL-logged increment), because while fixing "make check-world" I ran into a more serious issue that I think needs to be discussed first. And I suspect it might also affect the feasibility of the LSN optimization.

So, what's the issue - the current solution is based on WAL-logging the state of all sequences incremented by the transaction at COMMIT. To do that, we read the state from disk, and write that into WAL. However, these WAL messages are not necessarily correlated to COMMIT records, so stuff like this might happen:

1. transaction T1 increments sequence S
2. transaction T2 increments sequence S
3. both T1 and T2 start to COMMIT
4. T1 reads state of S from disk, writes it into WAL
5. transaction T3 increments sequence S
6. T2 reads state of S from disk, writes it into WAL
7. T2 writes COMMIT into WAL
8. T1 writes COMMIT into WAL

Because the apply order is determined by the ordering of COMMIT records, this means we'd apply the increments logged by T2, and then by T1. But that undoes the increment by T3, and the sequence would go backwards.

The previous patch version addressed that by acquiring a lock on the sequence, holding it until transaction end. This effectively ensures the order of sequence messages and COMMIT records matches. But that's problematic for a number of reasons:

1) throughput reduction, because the COMMIT records need to serialize

2) deadlock risk, if we happen to lock sequences in different order (in different transactions)

3) problem for prepared transactions - the sequences are locked and logged in PrepareTransaction, because we may not have seqhashtab beyond that point. This is a much worse variant of (1).

Note: I also wonder what happens if someone does DISCARD SEQUENCES. I guess we'll forget the sequences, which is bad - so we'd have to invent a separate cache that does not have this issue.

I realized (3) because one of the test_decoding TAP tests got stuck exactly because of a sequence locked by a prepared transaction. This patch simply releases the lock after writing the WAL message, but that just makes it vulnerable to the reordering. And this would have been true even with the LSN optimization.

However, I was thinking that maybe we could use the LSN of the WAL message (XLOG_LOGICAL_SEQUENCE) to deal with the ordering issue, because *this* is the sensible sequence increment ordering. In the example above, we'd first apply the WAL message from T2 (because that commits first). And then we'd get to apply T1, but the WAL message has an older LSN, so we'd skip it.

But this requires us to remember the LSN of the already applied WAL sequence messages, which could be tricky - we'd need to persist it in some way because of restarts, etc. We can't do this while decoding, only on the apply side, I think, because of streaming, aborts.

The other option might be to make these messages non-transactional, in which case we'd separate the ordering from COMMIT ordering, evading the reordering problem. That'd mean we'd ignore rollbacks (which seems fine), we could probably optimize this by checking if the state actually changed, etc. But we'd also need to deal with sequences created in the (still uncommitted) transaction.
But I'm also worried it might lead to the same issue with non-transactional behaviors that forced the revert in v15.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
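Here's a minimal sketch of the LSN-based skipping described in the message above (assumed names, not the actual apply-worker code; a real version would track this per sequence and persist it across restarts, which is exactly the tricky part mentioned): the apply side remembers the LSN of the last sequence message it applied and drops anything older.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

    /* LSN of the last applied sequence message (per sequence in reality) */
    static XLogRecPtr applied_lsn = 0;

    static bool
    apply_sequence_message(XLogRecPtr msg_lsn, long value)
    {
        if (msg_lsn <= applied_lsn)
            return false;           /* older message decoded later - skip */
        applied_lsn = msg_lsn;
        printf("applied sequence value %ld (LSN %llu)\n",
               value, (unsigned long long) msg_lsn);
        return true;
    }

    int
    main(void)
    {
        /* T2 committed first, so its newer message is applied first */
        apply_sequence_message(200, 2000);  /* applied */
        apply_sequence_message(100, 1000);  /* T1's stale message - skipped */
        return 0;
    }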
On Sat, Nov 12, 2022 at 7:49 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I noticed on cfbot that the patch no longer applies, so here's a rebased
> version. Most of the breakage was due to the column filtering reworks,
> grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

(...)

Hi

Thanks for the updated patch. While reviewing the patch backlog, we have determined that this patch adds one or more TAP tests but has not added the test to the "meson.build" file. To do this, locate the relevant "meson.build" file for each test and add it in the 'tests' dictionary, which will look something like this:

    'tap': {
      'tests': [
        't/001_basic.pl',
      ],
    },

For some additional details please see this Wiki article:

https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

https://wiki.postgresql.org/wiki/Meson

Regards

Ian Barwick
On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The other option might be to make these messages non-transactional, in
> which case we'd separate the ordering from COMMIT ordering, evading the
> reordering problem.
>
> That'd mean we'd ignore rollbacks (which seems fine), we could probably
> optimize this by checking if the state actually changed, etc. But we'd
> also need to deal with sequences created in the (still uncommitted)
> transaction. But I'm also worried it might lead to the same issue with
> non-transactional behaviors that forced the revert in v15.

I think it might be a good idea to step back slightly from implementation details and try to agree on a theoretical model of what's happening here. Let's start by banishing the words transactional and non-transactional from the conversation and talk about what logical replication is trying to do.

We can imagine that the replicated objects on the primary pass through a series of states S1, S2, ..., Sn, where n keeps going up as new state changes occur. The state, for our purposes here, is the contents of the database as they could be observed by a user running SELECT queries at some moment in time chosen by the user. For instance, if the initial state of the database is S1, and then the user executes BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the state that differs from S1 in that both of those rows are now part of the database contents. There is no state where one of those rows is visible and the other is not. That was never observable by the user, except from within the transaction as it was executing, which we can and should discount.

I believe that the goal of logical replication is to bring about a state of affairs where the set of states observable on the standby is a subset of the states observable on the primary. That is, if the primary goes from S1 to S2 to S3, the standby can do the same thing, or it can go straight from S1 to S3 without ever making it possible for the user to observe S2. Either is correct behavior. But the standby cannot invent any new states that didn't occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to S3, or something like that. It can only consolidate changes that occurred separately on the primary, never split them up. Neither can it reorder them.

Now, if you accept this as a reasonable definition of correctness, then the next question is what consequences it has for transactional and non-transactional behavior. If all behavior is transactional, then we've basically got to replay each primary transaction in a single standby transaction, and commit those transactions in the same order that the corresponding primary transactions committed. We could legally choose to merge a group of transactions that committed one after the other on the primary into a single transaction on the standby, and it might even be a good idea if they're all very tiny, but it's not required. But if there are non-transactional things happening, then there are changes that become visible at some time other than at a transaction commit. For example, consider this sequence of events, in which each "thing" that happens is transactional except where the contrary is noted:

T1: BEGIN;
T2: BEGIN;
T1: Do thing 1;
T2: Do thing 2;
T1: Do a non-transactional thing;
T1: Do thing 3;
T2: Do thing 4;
T2: COMMIT;
T1: COMMIT;

From the point of view of the user, there are 4 observable states here:

S1: Initial state.
S2: State after the non-transactional thing happens.
S3: State after T2 commits (reflects the non-transactional thing plus things 2 and 4).
S4: State after T1 commits.

Basically, the non-transactional thing behaves a whole lot like a separate transaction. That non-transactional operation ought to be replicated before T2, which ought to be replicated before T1. Maybe logical replication ought to treat it in exactly that way: as a separate operation that needs to be replicated after any earlier transactions that completed prior to the history shown here, but before T2 or T1. Alternatively, you can merge the non-transactional change into T2, i.e. the first transaction that committed after it happened. But you can't merge it into T1, even though it happened in T1. If you do that, then you're creating states on the standby that never existed on the primary, which is wrong.

You could argue that this is just nitpicking: who cares if the change in the sequence value doesn't get replicated at exactly the right moment? But I don't think it's a technicality at all: I think if we don't make the operation appear to happen at the same point in the sequence as it became visible on the master, then there will be endless artifacts and corner cases to the bottom of which we will never get. Just like if we replicated the actual transactions out of order, chaos would ensue, because there can be logical dependencies between them, so too can there be logical dependencies between non-transactional operations, or between a non-transactional operation and a transactional operation.

To make it more concrete, consider two sessions concurrently running this SQL:

insert into t1 select nextval('s1') from generate_series(1,1000000) g;

There are, in effect, 2000002 transaction-like things here. The sequence gets incremented 2 million times, and then there are 2 commits that each insert a million rows. Perhaps the actual order of events looks something like this:

1. nextval the sequence N times, where N >= 1 million
2. commit the first transaction, adding a million rows to t1
3. nextval the sequence 2 million - N times
4. commit the second transaction, adding another million rows to t1

Unless we replicate all of the nextval operations that occur in step 1 at the same time or prior to replicating the first transaction in step 2, we might end up making visible a state where the next value of the sequence is less than the highest value present in the table, which would be bad.

With that perhaps overly-long set of preliminaries, I'm going to move on to talking about the implementation ideas which you mention. You write that "the current solution is based on WAL-logging the state of all sequences incremented by the transaction at COMMIT" and then, it seems to me, go on to demonstrate that it's simply incorrect. In my opinion, the fundamental problem is that it doesn't look at the order that things happened on the primary and do them in the same order on the standby. Instead, it accepts that the non-transactional operations are going to be replicated at the wrong time, and then tries to patch around the issue by attempting to scrounge up the correct values at some convenient point and use that data to compensate for our failure to do the right thing at an earlier point. That doesn't seem like a satisfying solution, and I think it will be hard to make it fully correct.

Your alternative proposal says "The other option might be to make these messages non-transactional, in which case we'd separate the ordering from COMMIT ordering, evading the reordering problem."
But I don't think that avoids the reordering problem at all. Nor do I think it's correct. I don't think you *can* separate the ordering of these operations from the COMMIT ordering. They are, as I argue here, essentially mini-commits that only bump the sequence value, and they need to be replicated after the transactions that commit prior to the sequence value bump and before those that commit afterward. If they aren't handled that way, I don't think you're going to get fully correct behavior.

I'm going to confess that I have no really specific idea how to implement that. I'm just not sufficiently familiar with this code. However, I suspect that the solution lies in changing things on the decoding side rather than in the WAL format. I feel like the information that we need in order to do the right thing must already be present in the WAL. If it weren't, then how could crash recovery work correctly, or physical replication? At any given moment, you can choose to promote a physical standby, and at that point the state you observe on the new primary had better be some state that existed on the primary at some point in its history. At any moment, you can unplug the primary, restart it, and run crash recovery, and if you do, you had better end up with some state that existed on the primary at some point shortly before the crash. I think that there are actually a few subtle inaccuracies in the last two sentences, because actually the order in which transactions become visible on a physical standby can differ from the order in which it happens on the primary, but I don't think that actually changes the picture much. The point is that the WAL is the definitive source of information about what happened and in what order it happened, and we use it in that way already in the context of physical replication, and of standbys. If logical decoding has a problem with some case that those systems handle correctly, the problem is with logical decoding, not the WAL format.

In particular, I think it's likely that the "non-transactional messages" that you mention earlier don't get applied at the point in the commit sequence where they were found in the WAL. Not sure why exactly, but perhaps the point at which we're reading WAL runs ahead of the decoding per se, or something like that, and thus those non-transactional messages arrive too early relative to the commit ordering. Possibly that could be changed, and they could be buffered until earlier commits are replicated. Or else, when we see a WAL record for a non-transactional sequence operation, we could arrange to bundle that operation into an "adjacent" replicated transaction, i.e. the transaction whose commit record occurs most nearly prior to, or most nearly after, the WAL record for the operation itself. Or else, we could create "virtual" transactions for such operations and make sure those get replayed at the right point in the commit sequence. Or else, I don't know, maybe something else. But I think the overall picture is that we need to approach the problem by replicating changes in WAL order, as a physical standby would do. Saying that a change is "nontransactional" doesn't mean that it's exempt from ordering requirements; rather, it means that that change has its own place in that ordering, distinct from the transaction in which it occurred.

--
Robert Haas
EDB: http://www.enterprisedb.com
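One way to picture the "mini-commit" ordering argued for in the message above is a sketch like the following (assumed names and a fixed-size buffer for brevity, not a proposal for the actual reorder-buffer API): non-transactional sequence records are buffered as they are decoded, and before any commit is replayed, every buffered record with an older LSN is flushed first, preserving WAL order.

    #include <stdio.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

    typedef struct SeqOp
    {
        XLogRecPtr  lsn;
        long        value;
    } SeqOp;

    static SeqOp pending[8];        /* decoded but not yet replayed seq ops */
    static int   npending = 0;

    static void
    queue_seq_op(XLogRecPtr lsn, long value)
    {
        pending[npending].lsn = lsn;
        pending[npending].value = value;
        npending++;
    }

    /* replay, ahead of this commit, all sequence ops preceding it in WAL */
    static void
    replay_commit(XLogRecPtr commit_lsn, const char *xact)
    {
        int     i, keep = 0;

        for (i = 0; i < npending; i++)
        {
            if (pending[i].lsn < commit_lsn)
                printf("  apply sequence value %ld (LSN %llu)\n",
                       pending[i].value,
                       (unsigned long long) pending[i].lsn);
            else
                pending[keep++] = pending[i];   /* belongs to a later commit */
        }
        npending = keep;
        printf("replay commit of %s (LSN %llu)\n",
               xact, (unsigned long long) commit_lsn);
    }

    int
    main(void)
    {
        /* WAL order: T1's commit, two sequence advancements, T4's commit */
        replay_commit(100, "T1");
        queue_seq_op(110, 32);      /* advancement caused by T2's nextval */
        queue_seq_op(120, 64);      /* advancement caused by T3's nextval */
        replay_commit(130, "T4");   /* both advancements are flushed first */
        return 0;
    }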
On 11/16/22 22:05, Robert Haas wrote:
> On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The other option might be to make these messages non-transactional, in
>> which case we'd separate the ordering from COMMIT ordering, evading the
>> reordering problem.
>>
>> That'd mean we'd ignore rollbacks (which seems fine), we could probably
>> optimize this by checking if the state actually changed, etc. But we'd
>> also need to deal with sequences created in the (still uncommitted)
>> transaction. But I'm also worried it might lead to the same issue with
>> non-transactional behaviors that forced the revert in v15.
>
> I think it might be a good idea to step back slightly from
> implementation details and try to agree on a theoretical model of
> what's happening here. Let's start by banishing the words
> transactional and non-transactional from the conversation and talk
> about what logical replication is trying to do.
>

OK, let's try.

> We can imagine that the replicated objects on the primary pass through
> a series of states S1, S2, ..., Sn, where n keeps going up as new
> state changes occur. The state, for our purposes here, is the contents
> of the database as they could be observed by a user running SELECT
> queries at some moment in time chosen by the user. For instance, if
> the initial state of the database is S1, and then the user executes
> BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
> state that differs from S1 in that both of those rows are now part of
> the database contents. There is no state where one of those rows is
> visible and the other is not. That was never observable by the user,
> except from within the transaction as it was executing, which we can
> and should discount. I believe that the goal of logical replication is
> to bring about a state of affairs where the set of states observable
> on the standby is a subset of the states observable on the primary.
> That is, if the primary goes from S1 to S2 to S3, the standby can do
> the same thing, or it can go straight from S1 to S3 without ever
> making it possible for the user to observe S2. Either is correct
> behavior. But the standby cannot invent any new states that didn't
> occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
> S3, or something like that. It can only consolidate changes that
> occurred separately on the primary, never split them up. Neither can
> it reorder them.
>

I mostly agree, and in a way the last patch aims to do roughly this, i.e. make sure that the state after each transaction matches the state a user might observe on the primary (modulo implementation challenges).

There's a couple of caveats, though:

1) Maybe we should focus more on "actually observed" state instead of "observable". Who cares if the sequence moved forward in a transaction that was ultimately rolled back? No committed transaction should have observed those values - in a way, the last "valid" state of the sequence is the last value generated in a transaction that ultimately committed.

2) I think what matters more is that we never generate duplicate values. That is, if you generate a value from a sequence, commit a transaction and replicate it, then the logical standby should not generate the same value from the sequence. This guarantee seems necessary for "failover" to a logical standby.
> Now, if you accept this as a reasonable definition of correctness,
> then the next question is what consequences it has for transactional
> and non-transactional behavior. If all behavior is transactional, then
> we've basically got to replay each primary transaction in a single
> standby transaction, and commit those transactions in the same order
> that the corresponding primary transactions committed. We could
> legally choose to merge a group of transactions that committed one
> after the other on the primary into a single transaction on the
> standby, and it might even be a good idea if they're all very tiny,
> but it's not required. But if there are non-transactional things
> happening, then there are changes that become visible at some time
> other than at a transaction commit. For example, consider this
> sequence of events, in which each "thing" that happens is
> transactional except where the contrary is noted:
>
> T1: BEGIN;
> T2: BEGIN;
> T1: Do thing 1;
> T2: Do thing 2;
> T1: Do a non-transactional thing;
> T1: Do thing 3;
> T2: Do thing 4;
> T2: COMMIT;
> T1: COMMIT;
>
> From the point of view of the user, there are 4 observable states here:
>
> S1: Initial state.
> S2: State after the non-transactional thing happens.
> S3: State after T2 commits (reflects the non-transactional thing plus
> things 2 and 4).
> S4: State after T1 commits.
>
> Basically, the non-transactional thing behaves a whole lot like a
> separate transaction. That non-transactional operation ought to be
> replicated before T2, which ought to be replicated before T1. Maybe
> logical replication ought to treat it in exactly that way: as a
> separate operation that needs to be replicated after any earlier
> transactions that completed prior to the history shown here, but
> before T2 or T1. Alternatively, you can merge the non-transactional
> change into T2, i.e. the first transaction that committed after it
> happened. But you can't merge it into T1, even though it happened in
> T1. If you do that, then you're creating states on the standby that
> never existed on the primary, which is wrong. You could argue that
> this is just nitpicking: who cares if the change in the sequence value
> doesn't get replicated at exactly the right moment? But I don't think
> it's a technicality at all: I think if we don't make the operation
> appear to happen at the same point in the sequence as it became
> visible on the master, then there will be endless artifacts and corner
> cases to the bottom of which we will never get. Just like if we
> replicated the actual transactions out of order, chaos would ensue,
> because there can be logical dependencies between them, so too can
> there be logical dependencies between non-transactional operations, or
> between a non-transactional operation and a transactional operation.
>

Well, yeah - we can either try to perform the stuff independently of the transactions that triggered it, or we can try making it part of some of the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the increments and apply them independently. But:

(a) What would you do with increments of sequences created/reset in a transaction? Can't apply those outside the transaction, because it might be rolled back (and that state is not visible on primary).

(b) What about increments created before we have a proper snapshot? There may be transactions dependent on the increment. This is what ultimately led to the revert of the patch.
This version of the patch tries to do the opposite thing - make sure that the state after each commit matches what the transaction might have seen (for sequences it accessed). It's imperfect, because it might log a state generated "after" the sequence got accessed - it focuses on the guarantee not to generate duplicate values.

> To make it more concrete, consider two sessions concurrently running this SQL:
>
> insert into t1 select nextval('s1') from generate_series(1,1000000) g;
>
> There are, in effect, 2000002 transaction-like things here. The
> sequence gets incremented 2 million times, and then there are 2
> commits that each insert a million rows. Perhaps the actual order of
> events looks something like this:
>
> 1. nextval the sequence N times, where N >= 1 million
> 2. commit the first transaction, adding a million rows to t1
> 3. nextval the sequence 2 million - N times
> 4. commit the second transaction, adding another million rows to t1
>
> Unless we replicate all of the nextval operations that occur in step 1
> at the same time or prior to replicating the first transaction in step
> 2, we might end up making visible a state where the next value of the
> sequence is less than the highest value present in the table, which
> would be bad.
>

Right, that's the "guarantee" I've mentioned above, more or less.

> With that perhaps overly-long set of preliminaries, I'm going to move
> on to talking about the implementation ideas which you mention. You
> write that "the current solution is based on WAL-logging the state of
> all sequences incremented by the transaction at COMMIT" and then, it
> seems to me, go on to demonstrate that it's simply incorrect. In my
> opinion, the fundamental problem is that it doesn't look at the order
> that things happened on the primary and do them in the same order on
> the standby. Instead, it accepts that the non-transactional operations
> are going to be replicated at the wrong time, and then tries to patch
> around the issue by attempting to scrounge up the correct values at
> some convenient point and use that data to compensate for our failure
> to do the right thing at an earlier point. That doesn't seem like a
> satisfying solution, and I think it will be hard to make it fully
> correct.
>

I understand what you're saying, but I'm not sure I agree with you. Yes, this would mean we accept we may end up with something like this:

1: T1 logs sequence state S1
2: someone increments sequence
3: T2 logs sequence state S2
4: T2 commits
5: T1 commits

which "inverts" the apply order of S1 vs. S2, because we first apply S2 and then the "old" S1. But as long as we're smart enough to "discard" applying S1, I think that's acceptable - because it guarantees we'll not generate duplicate values (with values in the committed transaction).

I'd also argue it does not actually generate invalid state, because once we commit either transaction, S2 is what's visible. Yes, if you do "SELECT * FROM sequence" you'll see some intermediate state, but that's not how sequences are accessed. And you can't do currval('s') from a transaction that never accessed the sequence. And if it did, we'd write S2 (or whatever it saw) as part of its commit.

So I think the main issue of this approach is how to decide which sequence states are obsolete and should be skipped.

> Your alternative proposal says "The other option might be to make
> these messages non-transactional, in which case we'd separate the
> ordering from COMMIT ordering, evading the reordering problem."
> But I don't think that avoids the reordering problem at all.

I don't understand why. Why would it not address the reordering issue?

> Nor do I think it's correct.

Nor do I understand this. I mean, isn't it essentially the option you mentioned earlier - treating the non-transactional actions as independent transactions? Yes, we'd be batching them so that we'd not see "intermediate" states, but those are not observed by anyone.

> I don't think you *can* separate the ordering of these
> operations from the COMMIT ordering. They are, as I argue here,
> essentially mini-commits that only bump the sequence value, and they
> need to be replicated after the transactions that commit prior to the
> sequence value bump and before those that commit afterward. If they
> aren't handled that way, I don't think you're going to get fully
> correct behavior.

I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine you have something like this:

1: T1 does something and also increments a sequence
2: T1 logs state of the sequence (right before commit)
3: T1 writes COMMIT

Now when we decode/apply this, we end up doing this:

1: decode all T1 changes, stash them
2: decode the sequence state and apply it separately
3: decode COMMIT, apply all T1 changes

There might be other transactions interleaving with this, but I think it'd behave correctly. What example would not work?

>
> I'm going to confess that I have no really specific idea how to
> implement that. I'm just not sufficiently familiar with this code.
> However, I suspect that the solution lies in changing things on the
> decoding side rather than in the WAL format. I feel like the
> information that we need in order to do the right thing must already
> be present in the WAL. If it weren't, then how could crash recovery
> work correctly, or physical replication? At any given moment, you can
> choose to promote a physical standby, and at that point the state you
> observe on the new primary had better be some state that existed on
> the primary at some point in its history. At any moment, you can
> unplug the primary, restart it, and run crash recovery, and if you do,
> you had better end up with some state that existed on the primary at
> some point shortly before the crash. I think that there are actually a
> few subtle inaccuracies in the last two sentences, because actually
> the order in which transactions become visible on a physical standby
> can differ from the order in which it happens on the primary, but I
> don't think that actually changes the picture much. The point is that
> the WAL is the definitive source of information about what happened
> and in what order it happened, and we use it in that way already in
> the context of physical replication, and of standbys. If logical
> decoding has a problem with some case that those systems handle
> correctly, the problem is with logical decoding, not the WAL format.
>

The problem lies in how we log sequences. If we wrote each individual increment to WAL, it might work the way you propose (except for cases with sequences created in a transaction, etc.). But that's not what we do - we log sequence increments in batches of 32 values, and then only modify the sequence relfilenode.

This works for physical replication, because the WAL describes the "next" state of the sequence (so if you do "SELECT * FROM sequence" you'll not see the same state, and the sequence value may "jump ahead" after a failover).
But for logical replication this does not work, because the transaction might depend on a state created (WAL-logged) by some other transaction. And perhaps that transaction actually happened *before* we even built the first snapshot for decoding :-/

There's also the issue with what snapshot to use when decoding these transactional changes in logical decoding (see

> In particular, I think it's likely that the "non-transactional
> messages" that you mention earlier don't get applied at the point in
> the commit sequence where they were found in the WAL. Not sure why
> exactly, but perhaps the point at which we're reading WAL runs ahead
> of the decoding per se, or something like that, and thus those
> non-transactional messages arrive too early relative to the commit
> ordering. Possibly that could be changed, and they could be buffered

I'm not sure which case of "non-transactional messages" this refers to, so I can't quite respond to these comments. Perhaps you mean the problems that killed the previous patch [1]?

[1] https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com

> until earlier commits are replicated. Or else, when we see a WAL
> record for a non-transactional sequence operation, we could arrange to
> bundle that operation into an "adjacent" replicated transaction i.e.

IIRC moving stuff between transactions during decoding is problematic, because of snapshots.

> the transaction whose commit record occurs most nearly prior to, or
> most nearly after, the WAL record for the operation itself. Or else,
> we could create "virtual" transactions for such operations and make
> sure those get replayed at the right point in the commit sequence. Or
> else, I don't know, maybe something else. But I think the overall
> picture is that we need to approach the problem by replicating changes
> in WAL order, as a physical standby would do. Saying that a change is
> "nontransactional" doesn't mean that it's exempt from ordering
> requirements; rather, it means that that change has its own place in
> that ordering, distinct from the transaction in which it occurred.
>

But doesn't the approach with WAL-logging sequence state before COMMIT, and then applying it independently in WAL-order, do pretty much this?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
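For readers not familiar with the batching discussed above, here is a simplified standalone model of the pre-logging behavior (SEQ_LOG_VALS really is 32 in PostgreSQL; everything else here is a stand-in): one WAL record covers the next 32 values, so most nextval() calls - including calls made by other transactions - emit no WAL at all, which is exactly the cross-transaction dependency being described.

    #include <stdio.h>

    #define SEQ_LOG_VALS 32

    static long seq_value = 0;      /* current in-memory sequence state */
    static long wal_logged_upto = 0;/* highest value covered by WAL so far */

    static long
    nextval(void)
    {
        seq_value++;
        if (seq_value > wal_logged_upto)
        {
            /* log SEQ_LOG_VALS ahead; the record covers future calls,
             * even calls made by other transactions */
            wal_logged_upto = seq_value + SEQ_LOG_VALS;
            printf("WAL record: sequence is at least %ld\n",
                   wal_logged_upto);
        }
        return seq_value;
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < 70; i++)
            nextval();              /* emits only 3 WAL records in total */

        printf("value %ld, but WAL claims %ld\n", seq_value, wal_logged_upto);
        return 0;
    }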
Hi,

On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply them independently. But:
>
> (a) What would you do with increments of sequences created/reset in a
> transaction? Can't apply those outside the transaction, because it
> might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL logging for that case. It'll require a bit of machinery, but could actually result in *less* WAL logging overall, because we don't need to emit a WAL record for each SEQ_LOG_VALS sequence values.

> (b) What about increments created before we have a proper snapshot?
> There may be transactions dependent on the increment. This is what
> ultimately led to the revert of the patch.

I don't understand this - why would we ever need to process those increments from before we have a snapshot? Wouldn't they, by definition, be before the slot was active?

To me this is the rough equivalent of logical decoding not giving the initial state of all tables. You need some process outside of logical decoding to get that (obviously we have some support for that via the exported data snapshot during slot creation).

I assume that part of the initial sync would have to be a new sequence synchronization step that reads all the sequence states on the publisher and ensures that the subscriber sequences are at the same point. There's a bit of trickiness there, but it seems entirely doable. The logical replication replay support for sequences will have to be a bit careful about not decreasing the subscriber's sequence values - the standby initially will be ahead of the increments we'll see in the WAL. But that seems inevitable given the non-transactional nature of sequences.

> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

That approach seems quite wrong to me.

> > I'm going to confess that I have no really specific idea how to
> > implement that. I'm just not sufficiently familiar with this code.
> > However, I suspect that the solution lies in changing things on the
> > decoding side rather than in the WAL format. I feel like the
> > information that we need in order to do the right thing must already
> > be present in the WAL. If it weren't, then how could crash recovery
> > work correctly, or physical replication? At any given moment, you can
> > choose to promote a physical standby, and at that point the state you
> > observe on the new primary had better be some state that existed on
> > the primary at some point in its history. At any moment, you can
> > unplug the primary, restart it, and run crash recovery, and if you do,
> > you had better end up with some state that existed on the primary at
> > some point shortly before the crash.

One minor exception here is that there's no real time bound to see the last few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL flush.
> > I think that there are actually a
> > few subtle inaccuracies in the last two sentences, because actually
> > the order in which transactions become visible on a physical standby
> > can differ from the order in which it happens on the primary, but I
> > don't think that actually changes the picture much. The point is that
> > the WAL is the definitive source of information about what happened
> > and in what order it happened, and we use it in that way already in
> > the context of physical replication, and of standbys. If logical
> > decoding has a problem with some case that those systems handle
> > correctly, the problem is with logical decoding, not the WAL format.
>
> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.
>
> This works for physical replication, because the WAL describes the
> "next" state of the sequence (so if you do "SELECT * FROM sequence"
> you'll not see the same state, and the sequence value may "jump ahead"
> after a failover).
>
> But for logical replication this does not work, because the transaction
> might depend on a state created (WAL-logged) by some other transaction.
> And perhaps that transaction actually happened *before* we even built
> the first snapshot for decoding :-/

I really can't follow the "depend on state ... by some other transaction" aspect.

Even the case of a sequence that is renamed inside a transaction that did *not* create / reset the sequence and then also triggers an increment of the sequence seems to be dealt with reasonably by processing sequence increments outside a transaction - the old name will be used for the increments, replay of the renaming transaction would then implement the rename in a hypothetical DDL-replay future.

> There's also the issue with what snapshot to use when decoding these
> transactional changes in logical decoding (see

Incomplete parenthetical? Or were you referencing the next paragraph?

What are the transactional changes you're referring to here?

I did some skimming of the referenced thread about the reversal of the last approach, but I couldn't really understand what the fundamental issues were with the reverted implementation - it's a very long thread and references other threads.

Greetings,

Andres Freund
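A sketch of what "different WAL logging for that case" might look like at the decision level (assumed names; the real machinery would have to track which relfilenodes each in-progress transaction created): an increment is treated as transactional only when it touches a relfilenode created by the same, still-uncommitted transaction, and is replayed as a standalone change otherwise.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId; /* stand-in for the PostgreSQL typedef */

    /*
     * If the relfilenode the increment touches was created by the same
     * transaction (CREATE SEQUENCE, or an ALTER SEQUENCE that creates a
     * new relfilenode), the increment is only visible if that transaction
     * commits, so replay it inside the transaction. Otherwise it is
     * visible immediately and can be replayed on its own.
     */
    static bool
    increment_is_transactional(TransactionId relfilenode_creator_xid,
                               TransactionId incrementing_xid)
    {
        return relfilenode_creator_xid == incrementing_xid;
    }

    int
    main(void)
    {
        /* xact 700 increments a sequence whose relfilenode it created */
        printf("own sequence: %s\n",
               increment_is_transactional(700, 700)
               ? "replay inside the transaction"
               : "replay immediately");

        /* xact 700 increments a pre-existing sequence */
        printf("pre-existing: %s\n",
               increment_is_transactional(650, 700)
               ? "replay inside the transaction"
               : "replay immediately");

        return 0;
    }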
On 11/17/22 03:43, Andres Freund wrote:
> Hi,
>
>
> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
>> Well, yeah - we can either try to perform the stuff independently of the
>> transactions that triggered it, or we can try making it part of some of
>> the transactions. Each of those options has problems, though :-(
>>
>> The first version of the patch tried the first approach, i.e. decode the
>> increments and apply them independently. But:
>>
>> (a) What would you do with increments of sequences created/reset in a
>> transaction? Can't apply those outside the transaction, because it
>> might be rolled back (and that state is not visible on primary).
>
> I think a reasonable approach could be to actually perform different WAL
> logging for that case. It'll require a bit of machinery, but could actually
> result in *less* WAL logging overall, because we don't need to emit a WAL
> record for each SEQ_LOG_VALS sequence values.
>

Could you elaborate? Hard to comment without knowing more ...

My point was that stuff like this (creating a new sequence or at least a new relfilenode) means we can't apply that independently of the transaction (unlike regular increments). I'm not sure how a change to WAL logging would make that go away.

>> (b) What about increments created before we have a proper snapshot?
>> There may be transactions dependent on the increment. This is what
>> ultimately led to the revert of the patch.
>
> I don't understand this - why would we ever need to process those increments
> from before we have a snapshot? Wouldn't they, by definition, be before the
> slot was active?
>
> To me this is the rough equivalent of logical decoding not giving the initial
> state of all tables. You need some process outside of logical decoding to get
> that (obviously we have some support for that via the exported data snapshot
> during slot creation).
>

Which is what already happens during tablesync, no? We more or less copy sequences as if they were tables.

> I assume that part of the initial sync would have to be a new sequence
> synchronization step that reads all the sequence states on the publisher and
> ensures that the subscriber sequences are at the same point. There's a bit of
> trickiness there, but it seems entirely doable. The logical replication replay
> support for sequences will have to be a bit careful about not decreasing the
> subscriber's sequence values - the standby initially will be ahead of the
> increments we'll see in the WAL. But that seems inevitable given the
> non-transactional nature of sequences.
>

See fetch_sequence_data / copy_sequence in the patch. The bit about ensuring the sequence does not go backwards (say, using the page LSN and/or the LSN of the increment) is not there, however isn't that pretty much what I proposed doing for "reconciling" the sequence state logged at COMMIT?

>> This version of the patch tries to do the opposite thing - make sure
>> that the state after each commit matches what the transaction might have
>> seen (for sequences it accessed). It's imperfect, because it might log a
>> state generated "after" the sequence got accessed - it focuses on the
>> guarantee not to generate duplicate values.
>
> That approach seems quite wrong to me.
>

Why? Because it might log a state for the sequence as of COMMIT, when the transaction accessed the sequence much earlier? That is, this may happen:

T1: nextval('s') -> 1
T2: call nextval('s') 1000000x
T1: commit

and T1 will log sequence state ~1000001, give or take.
I don't think there's a way around that, given the non-transactional nature of sequences. And I'm not convinced this is an issue, as it ensures uniqueness of the values generated on the subscriber. And I think it's reasonable to replicate the sequence state as of the commit (because that's what you'd see on the primary).

>>> I'm going to confess that I have no really specific idea how to
>>> implement that. I'm just not sufficiently familiar with this code.
>>> However, I suspect that the solution lies in changing things on the
>>> decoding side rather than in the WAL format. I feel like the
>>> information that we need in order to do the right thing must already
>>> be present in the WAL. If it weren't, then how could crash recovery
>>> work correctly, or physical replication? At any given moment, you can
>>> choose to promote a physical standby, and at that point the state you
>>> observe on the new primary had better be some state that existed on
>>> the primary at some point in its history. At any moment, you can
>>> unplug the primary, restart it, and run crash recovery, and if you do,
>>> you had better end up with some state that existed on the primary at
>>> some point shortly before the crash.
>
> One minor exception here is that there's no real time bound to see the last
> few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
> flush.
>

Right. Another issue is we ignore stuff that happened in aborted transactions, so then nextval('s') in another transaction may not wait for syncrep to confirm receiving that WAL. Which is a data loss case, see [1]:

[1] https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9%40enterprisedb.com

>>> I think that there are actually a
>>> few subtle inaccuracies in the last two sentences, because actually
>>> the order in which transactions become visible on a physical standby
>>> can differ from the order in which it happens on the primary, but I
>>> don't think that actually changes the picture much. The point is that
>>> the WAL is the definitive source of information about what happened
>>> and in what order it happened, and we use it in that way already in
>>> the context of physical replication, and of standbys. If logical
>>> decoding has a problem with some case that those systems handle
>>> correctly, the problem is with logical decoding, not the WAL format.
>>>
>>
>> The problem lies in how we log sequences. If we wrote each individual
>> increment to WAL, it might work the way you propose (except for cases
>> with sequences created in a transaction, etc.). But that's not what we
>> do - we log sequence increments in batches of 32 values, and then only
>> modify the sequence relfilenode.
>
>> This works for physical replication, because the WAL describes the
>> "next" state of the sequence (so if you do "SELECT * FROM sequence"
>> you'll not see the same state, and the sequence value may "jump ahead"
>> after a failover).
>>
>> But for logical replication this does not work, because the transaction
>> might depend on a state created (WAL-logged) by some other transaction.
>> And perhaps that transaction actually happened *before* we even built
>> the first snapshot for decoding :-/
>
> I really can't follow the "depend on state ... by some other transaction"
> aspect.
>

T1: nextval('s') -> writes WAL, covering the next 32 increments
T2: nextval('s') -> no WAL generated, covered by T1's WAL record

This is what I mean by "dependency" on state logged by another transaction.
It already causes problems with streaming replication (see the reference to syncrep above), and logical replication has the same issue.

> Even the case of a sequence that is renamed inside a transaction that did
> *not* create / reset the sequence and then also triggers an increment of the
> sequence seems to be dealt with reasonably by processing sequence increments
> outside a transaction - the old name will be used for the increments, replay
> of the renaming transaction would then implement the rename in a hypothetical
> DDL-replay future.
>
>
>> There's also the issue with what snapshot to use when decoding these
>> transactional changes in logical decoding (see
>
> Incomplete parenthetical? Or were you referencing the next paragraph?
>
> What are the transactional changes you're referring to here?
>

Sorry, IIRC I merely wanted to mention/reference the snapshot issue in the thread [2] that I ended up referencing in the next paragraph.

[2] https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com

> I did some skimming of the referenced thread about the reversal of the last
> approach, but I couldn't really understand what the fundamental issues were
> with the reverted implementation - it's a very long thread and references
> other threads.
>

Yes, it's long/complex, but I intentionally linked to a specific message which describes the issue ... It's entirely possible there is a simple fix for the issue, and I just got confused / unable to see the solution.

The whole issue was due to having a mix of transactional and non-transactional cases, similarly to logical messages - and logicalmsg_decode() has the same issue, so maybe let's talk about that for a moment.

See [3] and imagine you're dealing with a transactional message, but you're still building a consistent snapshot. So the first branch applies:

    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;

but because we don't have a snapshot, SnapBuildProcessChange does this:

    if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
        return false;

which however means logicalmsg_decode() does

    snapshot = SnapBuildGetOrBuildSnapshot(builder);

which crashes, because it hits this assert:

    Assert(builder->state == SNAPBUILD_CONSISTENT);

The sequence decoding did almost the same thing, with the same issue. Maybe the correct thing to do is to just ignore the change in this case? Presumably it'd be replicated by tablesync. But we've been unable to convince ourselves that's correct, or what snapshot to pass to ReorderBufferQueueMessage/ReorderBufferQueueSequence.

[3] https://github.com/postgres/postgres/blob/master/src/backend/replication/logical/decode.c#L585

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 16, 2022 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> There's a couple of caveats, though:
>
> 1) Maybe we should focus more on "actually observed" state instead of
> "observable". Who cares if the sequence moved forward in a transaction
> that was ultimately rolled back? No committed transaction should have
> observed those values - in a way, the last "valid" state of the sequence
> is the last value generated in a transaction that ultimately committed.

When I say "observable" I mean from a separate transaction, not one that is making changes to things. I said "observable" rather than "actually observed" because we neither know nor care whether someone actually ran a SELECT statement at any given moment in time, just what they would have seen if they did.

> 2) I think what matters more is that we never generate duplicate values.
> That is, if you generate a value from a sequence, commit a transaction
> and replicate it, then the logical standby should not generate the same
> value from the sequence. This guarantee seems necessary for "failover"
> to a logical standby.

I think that matters, but I don't think it's sufficient. We need to preserve the order in which things appear to happen, and which changes are and are not atomic, not just the final result.

> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply them independently. But:
>
> (a) What would you do with increments of sequences created/reset in a
> transaction? Can't apply those outside the transaction, because it
> might be rolled back (and that state is not visible on primary).

If the state isn't going to be visible until the transaction commits, it has to be replicated as part of the transaction. If I create a sequence and then nextval it a bunch of times, I can't replicate that by first creating the sequence, and then later, as a separate operation, replicating the nextvals. If I do that, then there's an intermediate state visible on the replica that was never visible on the origin server. That's broken.

> (b) What about increments created before we have a proper snapshot?
> There may be transactions dependent on the increment. This is what
> ultimately led to the revert of the patch.

Whatever problem exists here is with the implementation, not the concept. If you copy the initial state as it exists at some moment in time to a replica, and then replicate all the changes that happen afterward to that replica without messing up the order, the replica WILL be in sync with the origin server. The things that happen before you copy the initial state do not and cannot matter. But what you're describing sounds like the changes aren't really replicated in visibility order, and then it is easy to see how a problem like this can happen. Because now, an operation that actually became visible just before or just after the initial copy was taken might be thought to belong on the other side of that boundary, and then everything will break. And it sounds like that is what you are describing.

> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed).
> It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

Like Andres, I just can't imagine this being correct. It feels like it's trying to paper over the failure to do the replication properly during the transaction by overwriting state at the end.

> Yes, this would mean we accept we may end up with something like this:
>
> 1: T1 logs sequence state S1
> 2: someone increments sequence
> 3: T2 logs sequence state S2
> 4: T2 commits
> 5: T1 commits
>
> which "inverts" the apply order of S1 vs. S2, because we first apply S2
> and then the "old" S1. But as long as we're smart enough to "discard"
> applying S1, I think that's acceptable - because it guarantees we'll not
> generate duplicate values (with values in the committed transaction).
>
> I'd also argue it does not actually generate invalid state, because once
> we commit either transaction, S2 is what's visible.

I agree that it's OK if the sequence increment gets merged into the commit that immediately follows. However, I disagree with the idea of discarding the second update on the grounds that it would make the sequence go backward and we know that can't be right. That algorithm works in the really specific case where the only operations are increments. As soon as anyone does anything else to the sequence, such an algorithm can no longer work. Nor can it work for objects that are not sequences. The alternative strategy of replicating each change exactly once and in the correct order works for all current and future object types in all cases.

> > Your alternative proposal says "The other option might be to make
> > these messages non-transactional, in which case we'd separate the
> > ordering from COMMIT ordering, evading the reordering problem." But I
> > don't think that avoids the reordering problem at all.
>
> I don't understand why. Why would it not address the reordering issue?
>
> > Nor do I think it's correct.
>
> Nor do I understand this. I mean, isn't it essentially the option you
> mentioned earlier - treating the non-transactional actions as
> independent transactions? Yes, we'd be batching them so that we'd not
> see "intermediate" states, but those are not observed by anyone.

I don't think that batching them is a bad idea, in fact I think it's necessary. But those batches still have to be applied at the right time relative to the sequence of commits.

> I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
> you have something like this:
>
> 1: T1 does something and also increments a sequence
> 2: T1 logs state of the sequence (right before commit)
> 3: T1 writes COMMIT
>
> Now when we decode/apply this, we end up doing this:
>
> 1: decode all T1 changes, stash them
> 2: decode the sequence state and apply it separately
> 3: decode COMMIT, apply all T1 changes
>
> There might be other transactions interleaving with this, but I think
> it'd behave correctly. What example would not work?

What if one of the other transactions renames the sequence, or changes the current value, or does basically anything to it other than nextval?

> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.
> > This works for physical replication, because the WAL describes the > "next" state of the sequence (so if you do "SELECT * FROM sequence" > you'll not see the same state, and the sequence value may "jump ahead" > after a failover). > > But for logical replication this does not work, because the transaction > might depend on a state created (WAL-logged) by some other transaction. > And perhaps that transaction actually happened *before* we even built > the first snapshot for decoding :-/ I agree that there's a problem here but I don't think that it's a huge problem. I think that it's not QUITE right to think about what state is visible on the primary. It's better to think about what state would be visible on the primary if it crashed and restarted after writing any given amount of WAL, or what would be visible on a physical standby after replaying any given amount of WAL. If logical replication mimics that, I think it's as correct as it needs to be. If not, those other systems are broken, too. So I think what should happen is that when we write a WAL record saying that the sequence has been incremented by 32, that should be logically replicated after all commits whose commit record precedes that WAL record and before commits whose commit record follows that WAL record. It is OK to merge the replication of that record into one of either the immediately preceding or the immediately following commit, but you can't do it as part of any other commit because then you're changing the order of operations. For instance, consider: T1: BEGIN; INSERT; COMMIT; T2: BEGIN; nextval('a_seq') causing a logged advancement to the sequence; T3: BEGIN; nextval('b_seq') causing a logged advancement to the sequence; T4: BEGIN; INSERT; COMMIT; T2: COMMIT; T3: COMMIT; The sequence increments can be replicated as part of T1 or part of T4 or in between applying T1 and T4. They cannot be applied as part of T2 or T3. Otherwise, suppose T4 read the current value of one of those sequences and included that value in the inserted row, and the target table happened to be the sequence_value_at_end_of_period table. Then imagine that after receiving the data for T4 and replicating it, the primary server is hit by a meteor and the replica is promoted. Well, it's now possible for some new transaction to get a value from that sequence lower than what has already been written to the sequence_value_at_end_of_period table, which will presumably break the application. > > In particular, I think it's likely that the "non-transactional > > messages" that you mention earlier don't get applied at the point in > > the commit sequence where they were found in the WAL. Not sure why > > exactly, but perhaps the point at which we're reading WAL runs ahead > > of the decoding per se, or something like that, and thus those > > non-transactional messages arrive too early relative to the commit > > ordering. Possibly that could be changed, and they could be buffered > I'm not sure which case of "non-transactional messages" this refers to, > so I can't quite respond to these comments. Perhaps you mean the > problems that killed the previous patch [1]? In http://postgr.es/m/8bf1c518-b886-fe1b-5c42-09f9c663146d@enterprisedb.com you said "The other option might be to make these messages non-transactional". I was referring to that. > > the transaction whose commit record occurs most nearly prior to, or > > most nearly after, the WAL record for the operation itself.
Or else, > > we could create "virtual" transactions for such operations and make > > sure those get replayed at the right point in the commit sequence. Or > > else, I don't know, maybe something else. But I think the overall > > picture is that we need to approach the problem by replicating changes > > in WAL order, as a physical standby would do. Saying that a change is > > "nontransactional" doesn't mean that it's exempt from ordering > > requirements; rather, it means that that change has its own place in > > that ordering, distinct from the transaction in which it occurred. > > But doesn't the approach with WAL-logging sequence state before COMMIT, > and then applying it independently in WAL-order, do pretty much this? I'm sort of repeating myself here, but: only if the only operations that ever get performed on sequences are increments. Which is just not true. -- Robert Haas EDB: http://www.enterprisedb.com
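To make the "replicate in WAL order" rule above concrete, here is a minimal sketch of how the decoding loop could dispatch records under it - every name in it is an assumption for illustration, not actual patch code:

    switch (record_type)
    {
        case RECORD_COMMIT:
            /* commits are replayed immediately, in WAL (= commit) order */
            replay_transaction(txn);
            break;

        case RECORD_SEQ_ADVANCE:
            if (sequence_created_by_running_xact(locator))
                queue_change_in_transaction(txn, change);  /* transactional */
            else
                apply_sequence_state(locator, change);     /* right here, at this LSN */
            break;
    }

Under that rule, the advancements in the T1..T4 example above are applied between T1 and T4, no matter when T2 and T3 eventually commit.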
Hi, On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: > On 11/17/22 03:43, Andres Freund wrote: > > On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote: > >> Well, yeah - we can either try to perform the stuff independently of the > >> transactions that triggered it, or we can try making it part of some of > >> the transactions. Each of those options has problems, though :-( > >> > >> The first version of the patch tried the first approach, i.e. decode the > >> increments and apply that independently. But: > >> > >> (a) What would you do with increments of sequences created/reset in a > >> transaction? Can't apply those outside the transaction, because it > >> might be rolled back (and that state is not visible on primary). > > > > I think a reasonable approach could be to actually perform different WAL > > logging for that case. It'll require a bit of machinery, but could actually > > result in *less* WAL logging overall, because we don't need to emit a WAL > > record for each SEQ_LOG_VALS sequence values. > > > > Could you elaborate? Hard to comment without knowing more ... > > My point was that stuff like this (creating a new sequence or at least a > new relfilenode) means we can't apply that independently of the > transaction (unlike regular increments). I'm not sure how a change to > WAL logging would make that go away. Different WAL logging would make it easy to handle that on the logical decoding level. We don't need to emit WAL records each time a created-in-this-toplevel-xact sequence gets incremented as they're not persisting anyway if the surrounding xact aborts. We already need to remember the filenode so it can be dropped at the end of the transaction, so we could emit a single record for each sequence at that point. > >> (b) What about increments created before we have a proper snapshot? > >> There may be transactions dependent on the increment. This is what > >> ultimately led to revert of the patch. > > > > I don't understand this - why would we ever need to process those increments > > from before we have a snapshot? Wouldn't they, by definition, be before the > > slot was active? > > > > To me this is the rough equivalent of logical decoding not giving the initial > > state of all tables. You need some process outside of logical decoding to get > > that (obviously we have some support for that via the exported data snapshot > > during slot creation). > > > > Which is what already happens during tablesync, no? We more or less copy > sequences as if they were tables. I think you might have to copy sequences after tables, but I'm not sure. But otherwise, yea. > > I assume that part of the initial sync would have to be a new sequence > > synchronization step that reads all the sequence states on the publisher and > > ensures that the subscriber sequences are at the same point. There's a bit of > > trickiness there, but it seems entirely doable. The logical replication replay > > support for sequences will have to be a bit careful about not decreasing the > > subscriber's sequence values - the standby initially will be ahead of the > > increments we'll see in the WAL. But that seems inevitable given the > > non-transactional nature of sequences. > > > > See fetch_sequence_data / copy_sequence in the patch. The bit about > ensuring the sequence does not go away (say, using page LSN and/or LSN > of the increment) is not there, however isn't that pretty much what I > proposed doing for "reconciling" the sequence state logged at COMMIT?
Well, I think the approach of logging all sequence increments at commit is the wrong idea... Creating a new relfilenode whenever a sequence is incremented seems like a complete no-go to me. That increases sequence overhead by several orders of magnitude and will lead to *awful* catalog bloat on the subscriber. > > > >> This version of the patch tries to do the opposite thing - make sure > >> that the state after each commit matches what the transaction might have > >> seen (for sequences it accessed). It's imperfect, because it might log a > >> state generated "after" the sequence got accessed - it focuses on the > >> guarantee not to generate duplicate values. > > > > That approach seems quite wrong to me. > > > > Why? Because it might log a state for a sequence as of COMMIT, when the > transaction accessed the sequence much earlier? Mainly because sequences aren't transactional and trying to make them transactional will require awful contortions. While there are cases where we don't flush the WAL / wait for syncrep for sequences, we do replicate their state correctly on physical replication. If an LSN has been acknowledged as having been replicated, we won't just lose a prior sequence increment after promotion, even if the transaction didn't [yet] commit. It's completely valid for an application to call nextval() in one transaction, potentially even abort it, and then only use that sequence value in another transaction. > > I did some skimming of the referenced thread about the reversal of the last > > approach, but I couldn't really understand what the fundamental issues were > > with the reverted implementation - it's a very long thread and references > > other threads. > > > > Yes, it's long/complex, but I intentionally linked to a specific message > which describes the issue ... > > It's entirely possible there is a simple fix for the issue, and I just > got confused / unable to see the solution. The whole issue was due to > having a mix of transactional and non-transactional cases, similarly to > logical messages - and logicalmsg_decode() has the same issue, so maybe > let's talk about that for a moment. > > See [3] and imagine you're dealing with a transactional message, but > you're still building a consistent snapshot. So the first branch applies: > > if (transactional && > !SnapBuildProcessChange(builder, xid, buf->origptr)) > return; > > but because we don't have a snapshot, SnapBuildProcessChange does this: > > if (builder->state < SNAPBUILD_FULL_SNAPSHOT) > return false; In this case we'd just return without further work in logicalmsg_decode(). The problematic case presumably is when we have a full snapshot but aren't yet consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns true. And we reach: > which however means logicalmsg_decode() does > > snapshot = SnapBuildGetOrBuildSnapshot(builder); > > which crashes, because it hits this assert: > > Assert(builder->state == SNAPBUILD_CONSISTENT); I think the problem here is just that we shouldn't even try to get a snapshot in the transactional case - note that it's not even used in ReorderBufferQueueMessage() for transactional messages. The transactional case needs to behave like a "normal" change - we might never decode the message if the transaction ends up committing before we've reached a consistent point.
No, I don't think that'd be correct, the message | sequence needs to be queued for the transaction. If the transaction ends up committing after we've reached consistency, we'll get the correct snapshot from the base snapshot set in SnapBuildProcessChange(). Greetings, Andres Freund
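Spelled out, the fix implied here would make the snapshot handling in logicalmsg_decode() look roughly like the following - a sketch of the intended shape, not committed code:

    Snapshot    snapshot = NULL;

    if (message->transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!message->transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

    /*
     * Only non-transactional messages, which go to the output plugin
     * immediately, need a snapshot - and the branch above already
     * guarantees we are consistent at this point. Transactional messages
     * behave like normal changes and rely on the base snapshot set up by
     * SnapBuildProcessChange(), if the transaction ever gets decoded.
     */
    if (!message->transactional)
        snapshot = SnapBuildGetOrBuildSnapshot(builder);

    ReorderBufferQueueMessage(ctx->reorder, xid, snapshot, buf->endptr,
                              message->transactional,
                              message->message,     /* prefix */
                              message->message_size,
                              message->message + message->prefix_size);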
On 11/17/22 18:07, Andres Freund wrote: > Hi, > > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: >> On 11/17/22 03:43, Andres Freund wrote: >>> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote: >>>> Well, yeah - we can either try to perform the stuff independently of the >>>> transactions that triggered it, or we can try making it part of some of >>>> the transactions. Each of those options has problems, though :-( >>>> >>>> The first version of the patch tried the first approach, i.e. decode the >>>> increments and apply that independently. But: >>>> >>>> (a) What would you do with increments of sequences created/reset in a >>>> transaction? Can't apply those outside the transaction, because it >>>> might be rolled back (and that state is not visible on primary). >>> >>> I think a reasonable approach could be to actually perform different WAL >>> logging for that case. It'll require a bit of machinery, but could actually >>> result in *less* WAL logging overall, because we don't need to emit a WAL >>> record for each SEQ_LOG_VALS sequence values. >>> >> >> Could you elaborate? Hard to comment without knowing more ... >> >> My point was that stuff like this (creating a new sequence or at least a >> new relfilenode) means we can't apply that independently of the >> transaction (unlike regular increments). I'm not sure how a change to >> WAL logging would make that go away. > > Different WAL logging would make it easy to handle that on the logical > decoding level. We don't need to emit WAL records each time a > created-in-this-toplevel-xact sequences gets incremented as they're not > persisting anyway if the surrounding xact aborts. We already need to remember > the filenode so it can be dropped at the end of the transaction, so we could > emit a single record for each sequence at that point. > > >>>> (b) What about increments created before we have a proper snapshot? >>>> There may be transactions dependent on the increment. This is what >>>> ultimately led to revert of the patch. >>> >>> I don't understand this - why would we ever need to process those increments >>> from before we have a snapshot? Wouldn't they, by definition, be before the >>> slot was active? >>> >>> To me this is the rough equivalent of logical decoding not giving the initial >>> state of all tables. You need some process outside of logical decoding to get >>> that (obviously we have some support for that via the exported data snapshot >>> during slot creation). >>> >> >> Which is what already happens during tablesync, no? We more or less copy >> sequences as if they were tables. > > I think you might have to copy sequences after tables, but I'm not sure. But > otherwise, yea. > > >>> I assume that part of the initial sync would have to be a new sequence >>> synchronization step that reads all the sequence states on the publisher and >>> ensures that the subscriber sequences are at the same point. There's a bit of >>> trickiness there, but it seems entirely doable. The logical replication replay >>> support for sequences will have to be a bit careful about not decreasing the >>> subscriber's sequence values - the standby initially will be ahead of the >>> increments we'll see in the WAL. But that seems inevitable given the >>> non-transactional nature of sequences. >>> >> >> See fetch_sequence_data / copy_sequence in the patch. 
The bit about >> ensuring the sequence does not go away (say, using page LSN and/or LSN >> of the increment) is not there, however isn't that pretty much what I >> proposed doing for "reconciling" the sequence state logged at COMMIT? > > Well, I think the approach of logging all sequence increments at commit is the > wrong idea... > But we're not logging all sequence increments, no? We're logging the state for each sequence touched by the transaction, but only once - if the transaction incremented the sequence 1000000x times, we'll still log it just once (at least for this particular purpose). Yes, if transactions touch each sequence just once, then we're logging individual increments. The only more efficient solution would be to decode the existing WAL (every ~32 increments), and perhaps also track which sequences were accessed by a transaction. We'd then simply stash the increments in a global reorderbuffer hash table, and apply only the last one at commit time. This would require the transactional / non-transactional behavior (I think), but perhaps we can make that work. Or are you thinking about some other scheme? > Creating a new relfilenode whenever a sequence is incremented seems like a > complete no-go to me. That increases sequence overhead by several orders of > magnitude and will lead to *awful* catalog bloat on the subscriber. > You mean on the apply side? Yes, I agree this needs a better approach, I've focused on the decoding side so far. > >>> > >>>> This version of the patch tries to do the opposite thing - make sure > >>>> that the state after each commit matches what the transaction might have > >>>> seen (for sequences it accessed). It's imperfect, because it might log a > >>>> state generated "after" the sequence got accessed - it focuses on the > >>>> guarantee not to generate duplicate values. > >>> > >>> That approach seems quite wrong to me. > >>> > >> > >> Why? Because it might log a state for a sequence as of COMMIT, when the > >> transaction accessed the sequence much earlier? > > Mainly because sequences aren't transactional and trying to make them transactional will > require awful contortions. > > While there are cases where we don't flush the WAL / wait for syncrep for > sequences, we do replicate their state correctly on physical replication. If > an LSN has been acknowledged as having been replicated, we won't just lose a > prior sequence increment after promotion, even if the transaction didn't [yet] > commit. > True, I agree we should aim to achieve that. > It's completely valid for an application to call nextval() in one transaction, > potentially even abort it, and then only use that sequence value in another > transaction. > I don't quite agree with that - we make no promises about what happens to sequence changes in aborted transactions. I don't think I've ever seen an application using such a pattern either. And I'd argue we already fail to uphold such a guarantee, because we don't wait for syncrep if the sequence WAL happened in an aborted transaction. So if you use the value elsewhere (outside PG), you may lose it. Anyway, I think the scheme I outlined above (stashing decoded increments, logged once every ~32 values, and applying the latest increment for each sequence at commit) would work. > > >>> I did some skimming of the referenced thread about the reversal of the last >>> approach, but I couldn't really understand what the fundamental issues were >>> with the reverted implementation - it's a very long thread and references >>> other threads.
>>> >> >> Yes, it's long/complex, but I intentionally linked to a specific message >> which describes the issue ... >> >> It's entirely possible there is a simple fix for the issue, and I just >> got confused / unable to see the solution. The whole issue was due to >> having a mix of transactional and non-transactional cases, similarly to >> logical messages - and logicalmsg_decode() has the same issue, so maybe >> let's talk about that for a moment. >> >> See [3] and imagine you're dealing with a transactional message, but >> you're still building a consistent snapshot. So the first branch applies: >> >> if (transactional && >> !SnapBuildProcessChange(builder, xid, buf->origptr)) >> return; >> >> but because we don't have a snapshot, SnapBuildProcessChange does this: >> >> if (builder->state < SNAPBUILD_FULL_SNAPSHOT) >> return false; > > In this case we'd just return without further work in logicalmsg_decode(). The > problematic case presumably is when we have a full snapshot but aren't yet > consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns > true. And we reach: > >> which however means logicalmsg_decode() does >> >> snapshot = SnapBuildGetOrBuildSnapshot(builder); >> >> which crashes, because it hits this assert: >> >> Assert(builder->state == SNAPBUILD_CONSISTENT); > > I think the problem here is just that we shouldn't even try to get a snapshot > in the transactional case - note that it's not even used in > ReorderBufferQueueMessage() for transactional messages. The transactional case > needs to behave like a "normal" change - we might never decode the message if > the transaction ends up committing before we've reached a consistent point. > > >> The sequence decoding did almost the same thing, with the same issue. >> Maybe the correct thing to do is to just ignore the change in this case? > > No, I don't think that'd be correct, the message | sequence needs to be queued > for the transaction. If the transaction ends up committing after we've reached > consistency, we'll get the correct snapshot from the base snapshot set in > SnapBuildProcessChange(). > Yeah, I think you're right. I looked at this again, with a fresh mind, and I came to the same conclusion. Roughly what the attached patch does. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
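For reference, a minimal sketch of the "stash the latest increment" scheme Tomas outlines in the message above - the struct and function names here are assumptions for illustration, not patch code:

    typedef struct ReorderBufferSequenceEnt
    {
        RelFileLocator locator;             /* hash key: which sequence */
        XLogRecPtr  lsn;                    /* LSN of last decoded increment */
        ReorderBufferTupleBuf *state;       /* last decoded sequence tuple */
    } ReorderBufferSequenceEnt;

    /* On decoding an increment, remember only the newest state per sequence. */
    static void
    stash_sequence_increment(HTAB *seqhash, RelFileLocator *locator,
                             XLogRecPtr lsn, ReorderBufferTupleBuf *state)
    {
        bool        found;
        ReorderBufferSequenceEnt *ent;

        ent = (ReorderBufferSequenceEnt *)
            hash_search(seqhash, locator, HASH_ENTER, &found);

        if (!found || lsn > ent->lsn)
        {
            ent->lsn = lsn;
            ent->state = state;     /* the replaced state could be freed here */
        }
    }

    /* At commit, walk the hash and apply each stashed (latest) state. */

The decoding side then pays one hash probe per logged increment (i.e. one per ~32 nextval() calls), and the apply side only ever sees the final state of each sequence.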
Hi, On 2022-11-17 22:13:23 +0100, Tomas Vondra wrote: > On 11/17/22 18:07, Andres Freund wrote: > > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: > >> On 11/17/22 03:43, Andres Freund wrote: > >>> I assume that part of the initial sync would have to be a new sequence > >>> synchronization step that reads all the sequence states on the publisher and > >>> ensures that the subscriber sequences are at the same point. There's a bit of > >>> trickiness there, but it seems entirely doable. The logical replication replay > >>> support for sequences will have to be a bit careful about not decreasing the > >>> subscriber's sequence values - the standby initially will be ahead of the > >>> increments we'll see in the WAL. But that seems inevitable given the > >>> non-transactional nature of sequences. > >>> > >> > >> See fetch_sequence_data / copy_sequence in the patch. The bit about > >> ensuring the sequence does not go away (say, using page LSN and/or LSN > >> of the increment) is not there, however isn't that pretty much what I > >> proposed doing for "reconciling" the sequence state logged at COMMIT? > > > > Well, I think the approach of logging all sequence increments at commit is the > > wrong idea... > > > > But we're not logging all sequence increments, no? I was imprecise - I meant streaming them out at commit. > Yeah, I think you're right. I looked at this again, with a fresh mind, and > I came to the same conclusion. Roughly what the attached patch does. To me it seems a bit nicer to keep the SnapBuildGetOrBuildSnapshot() call in decode.c instead of moving it to reorderbuffer.c. Perhaps we should add a snapbuild.c helper similar to SnapBuildProcessChange() for non-transactional changes that also gets a snapshot? Could look something like Snapshot snapshot = NULL; if (message->transactional && !SnapBuildProcessChange(builder, xid, buf->origptr)) return; else if (!SnapBuildProcessStateNonTx(builder, &snapshot)) return; ... Or perhaps we should just bite the bullet and add an argument to SnapBuildProcessChange to deal with that? Greetings, Andres Freund
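One possible body for such a helper, just to make the proposed shape concrete - SnapBuildProcessStateNonTx() does not exist; this is an assumption built on the existing snapbuild.c functions:

    static bool
    SnapBuildProcessStateNonTx(SnapBuild *builder, Snapshot *snapshot)
    {
        /*
         * Non-transactional counterpart of SnapBuildProcessChange(): the
         * change is handed to the output plugin immediately, so we can
         * only proceed once the snapshot is consistent. (A real version
         * would presumably also honor SnapBuildXactNeedsSkip().)
         */
        if (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT)
            return false;

        *snapshot = SnapBuildGetOrBuildSnapshot(builder);
        return true;
    }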
Hi, Here's a rebased version of the sequence decoding patch. 0001 is a fix for the pre-existing issue in logicalmsg_decode, attempting to build a snapshot before getting into a consistent state. AFAICS this only affects assert-enabled builds and is otherwise harmless, because we are not actually using the snapshot (apply gets a valid snapshot from the transaction). This is mostly the fix I shared in November, except that I kept the call in decode.c (per comment from Andres). I haven't added any argument to SnapBuildProcessChange because we may need to backpatch this (and it didn't seem much simpler, IMHO). 0002 is a rebased version of the original approach, committed as 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix as 0001 (for the sequence messages), the primary reason for the revert. The rebase was not quite straightforward, due to extensive changes in how publications deal with tables/schemas, and so on. So this adopts them, but other than that it behaves just like the original patch. So this abandons the approach with COMMIT-time logging for sequences accessed/modified by the transaction, proposed in response to the revert. It seemed like a good (and simpler) alternative, but there were far too many issues - higher overhead, ordering of records for concurrent transactions, making it reliable, etc. I think the main remaining question is what's the goal of this patch, or rather what "guarantees" we expect from it - what we expect to see on the replica after incrementing a sequence on the primary. Robert described [1] a model and argued the standby should not "invent" new states. It's a long / detailed explanation; I'm not going to try to shorten it here because that'd inevitably omit various details. So better read it whole ... Anyway, I don't think this approach (essentially treating most sequence increments as non-transactional) breaks any consistency guarantees or introduces any "new" states that would not be observable on the primary. In a way, this treats non-transactional sequence increments as separate transactions, and applies them directly. If you read the sequence in between two commits, you might see any "intermediate" state of the sequence - that's the nature of non-transactional changes. We could "postpone" applying the decoded changes until the next commit, which might improve performance if a transaction is long enough to cover many sequence increments. But that's more a performance optimization than a matter of correctness, IMHO. One caveat is that because of how WAL works for sequences, we're actually decoding changes "ahead" so if you read the sequence on the subscriber it'll actually seem to be slightly ahead (up to ~32 values). This could be eliminated by setting SEQ_LOG_VALS to 0, which however increases the sequence costs, of course. This however brings me to the original question what's the purpose of this patch - and that's essentially keeping sequences up to date to make them usable after a failover. We can't generate values from the sequence on the subscriber, because it'd just get overwritten. And from this point of view, it's also fine that the sequence is slightly ahead, because that's what happens after crash recovery anyway. And we're not guaranteeing the sequences to be gap-less. regards [1] https://www.postgresql.org/message-id/CA%2BTgmoaYG7672OgdwpGm5cOwy8_ftbs%3D3u-YMvR9fiJwQUzgrQ%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
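The "up to ~32 values" caveat mentioned above comes from how nextval() batches its WAL logging; the relevant logic in nextval_internal() in sequence.c looks roughly like this:

    /* We don't log each fetching of a value from a sequence; instead we
     * pre-log enough fetches to cover SEQ_LOG_VALS future calls. */
    #define SEQ_LOG_VALS    32

    ...
    if (log < fetch || !seq->is_called)
    {
        /* forced log to satisfy local demand for values */
        fetch = log = fetch + SEQ_LOG_VALS;
        logit = true;
    }

Since each WAL record describes the state the sequence will reach after SEQ_LOG_VALS more calls, anything reconstructed purely from WAL - a crash-recovered primary, a physical standby, or a logical subscriber - can legitimately appear ahead of the last value actually handed out.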
On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > 0001 is a fix for the pre-existing issue in logicalmsg_decode, > attempting to build a snapshot before getting into a consistent state. > AFAICS this only affects assert-enabled builds and is otherwise > harmless, because we are not actually using the snapshot (apply gets a > valid snapshot from the transaction). > > This is mostly the fix I shared in November, except that I kept the call > in decode.c (per comment from Andres). I haven't added any argument to > SnapBuildProcessChange because we may need to backpatch this (and it > didn't seem much simpler, IMHO). I tend to associate transactional behavior with snapshots, so it looks odd to see code that builds a snapshot only when the message is non-transactional. I think that a more detailed comment spelling out the reasoning would be useful here. > This however brings me to the original question what's the purpose of > this patch - and that's essentially keeping sequences up to date to make > them usable after a failover. We can't generate values from the sequence > on the subscriber, because it'd just get overwritten. And from this > point of view, it's also fine that the sequence is slightly ahead, > because that's what happens after crash recovery anyway. And we're not > guaranteeing the sequences to be gap-less. I agree that it's fine for the sequence to be slightly ahead, but I think that it can't be too far ahead without causing problems. Suppose for example that transaction #1 creates a sequence. Transaction #2 does nextval on the sequence a bunch of times and inserts rows into a table using the sequence values as the PK. It's fine if the nextval operations are replicated ahead of the commit of transaction #2 -- in fact I'd say it's necessary for correctness -- but they can't precede the commit of transaction #1, since then the sequence won't exist yet. Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode, I think that needs to act as a barrier: non-transactional changes that happened before that transaction must also be replicated before that transaction is replicated, and those that happened after that transaction is replicated must be replayed after that transaction is replicated. Otherwise, at the very least, there will be states visible on the standby that were never visible on the origin server, and maybe we'll just straight up get the wrong answer. For instance: 1. nextval, setting last_value to 3 2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19 3. nextval, setting last_value to 20 If 3 happens before 2, the sequence ends up in the wrong state. Maybe you've already got this and similar cases totally correctly handled, I'm not sure, just throwing it out there. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/10/23 20:52, Robert Haas wrote: > On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> 0001 is a fix for the pre-existing issue in logicalmsg_decode, >> attempting to build a snapshot before getting into a consistent state. >> AFAICS this only affects assert-enabled builds and is otherwise >> harmless, because we are not actually using the snapshot (apply gets a >> valid snapshot from the transaction). >> >> This is mostly the fix I shared in November, except that I kept the call >> in decode.c (per comment from Andres). I haven't added any argument to >> SnapBuildProcessChange because we may need to backpatch this (and it >> didn't seem much simpler, IMHO). > > I tend to associate transactional behavior with snapshots, so it looks > odd to see code that builds a snapshot only when the message is > non-transactional. I think that a more detailed comment spelling out > the reasoning would be useful here. > I'll try adding a comment explaining this, but the reasoning is fairly simple AFAICS: 1) We don't actually need to build the snapshot for transactional changes, because if we end up applying the change, we'll use the snapshot provided/maintained by the reorderbuffer. 2) But we don't know if we end up applying the change - it may happen that this is one of the transactions we're waiting on to finish or have skipped, in which case the snapshot is kinda bogus anyway. What "saved" us is that we'll not actually use the snapshot in the end. It's just the assert that causes issues. 3) For non-transactional changes, we need a snapshot because we're going to execute the callback right away. But in this case the code actually protects against building inconsistent snapshots. >> This however brings me to the original question what's the purpose of >> this patch - and that's essentially keeping sequences up to date to make >> them usable after a failover. We can't generate values from the sequence >> on the subscriber, because it'd just get overwritten. And from this >> point of view, it's also fine that the sequence is slightly ahead, >> because that's what happens after crash recovery anyway. And we're not >> guaranteeing the sequences to be gap-less. > > I agree that it's fine for the sequence to be slightly ahead, but I > think that it can't be too far ahead without causing problems. Suppose > for example that transaction #1 creates a sequence. Transaction #2 > does nextval on the sequence a bunch of times and inserts rows into a > table using the sequence values as the PK. It's fine if the nextval > operations are replicated ahead of the commit of transaction #2 -- in > fact I'd say it's necessary for correctness -- but they can't precede > the commit of transaction #1, since then the sequence won't exist yet. It's not clear to me how that could even happen. If transaction #1 creates a sequence, it's invisible to transaction #2. So how could it do nextval() on it? #2 has to wait for #1 to commit before it can do anything on the sequence, which enforces the correct ordering, no? > Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode, > I think that needs to act as a barrier: non-transactional changes that > happened before that transaction must also be replicated before that > transaction is replicated, and those that happened after that > transaction is replicated must be replayed after that transaction is > replicated.
Otherwise, at the very least, there will be states visible > on the standby that were never visible on the origin server, and maybe > we'll just straight up get the wrong answer. For instance: > > 1. nextval, setting last_value to 3 > 2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19 > 3. nextval, setting last_value to 20 > > If 3 happens before 2, the sequence ends up in the wrong state. > > Maybe you've already got this and similar cases totally correctly > handled, I'm not sure, just throwing it out there. > I believe this should behave correctly too, thanks to locking. If a transaction does ALTER SEQUENCE, that locks the sequence, so only that transaction can do stuff with that sequence (and changes from that point are treated as transactional). And everyone else is waiting for #1 to commit. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
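For context, the serialization appealed to here comes from the lock levels in sequence.c: nextval() goes through lock_and_open_sequence(), which takes RowExclusiveLock and keeps it until end of transaction, while ALTER SEQUENCE holds the conflicting AccessExclusiveLock - so step 3 in the example cannot run until the ALTER in step 2 has committed or aborted. Slightly elided, with comments paraphrased:

    static Relation
    lock_and_open_sequence(SeqTable seq)
    {
        LocalTransactionId thislxid = MyProc->lxid;

        /* take the lock once per transaction; it is held to commit/abort */
        if (seq->lxid != thislxid)
        {
            ...                     /* resource-owner bookkeeping elided */
            LockRelationOid(seq->relid, RowExclusiveLock);
            seq->lxid = thislxid;
        }

        /* the lock is already held, so the rel can be opened without one */
        return relation_open(seq->relid, NoLock);
    }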
Hi, Heikki, CCed you due to the point about 2c03216d8311 below. On 2023-01-10 19:32:12 +0100, Tomas Vondra wrote: > 0001 is a fix for the pre-existing issue in logicalmsg_decode, > attempting to build a snapshot before getting into a consistent state. > AFAICS this only affects assert-enabled builds and is otherwise > harmless, because we are not actually using the snapshot (apply gets a > valid snapshot from the transaction). LGTM. > 0002 is a rebased version of the original approach, committed as > 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix > as 0001 (for the sequence messages), the primary reason for the revert. > > The rebase was not quite straightforward, due to extensive changes in > how publications deal with tables/schemas, and so on. So this adopts > them, but other than that it behaves just like the original patch. This is a huge diff: > 72 files changed, 4715 insertions(+), 612 deletions(-) It'd be nice to split it to make review easier. Perhaps the sequence decoding support could be split from the whole publication rigamarole? > This does not include any changes to test_decoding and/or the built-in > replication - those will be committed in separate patches. Looks like that's not the case anymore? > +/* > + * Update the sequence state by modifying the existing sequence data row. > + * > + * This keeps the same relfilenode, so the behavior is non-transactional. > + */ > +static void > +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) > +{ > + SeqTable elm; > + Relation seqrel; > + Buffer buf; > + HeapTupleData seqdatatuple; > + Form_pg_sequence_data seq; > + > + /* open and lock sequence */ > + init_sequence(seqrelid, &elm, &seqrel); > + > + /* lock page' buffer and read tuple */ > + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); > + > + /* check the comment above nextval_internal()'s equivalent call. */ > + if (RelationNeedsWAL(seqrel)) > + { > + GetTopTransactionId(); > + > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > + } > + > + /* ready to change the on-disk (or really, in-buffer) tuple */ > + START_CRIT_SECTION(); > + > + seq->last_value = last_value; > + seq->is_called = is_called; > + seq->log_cnt = log_cnt; > + > + MarkBufferDirty(buf); > + > + /* XLOG stuff */ > + if (RelationNeedsWAL(seqrel)) > + { > + xl_seq_rec xlrec; > + XLogRecPtr recptr; > + Page page = BufferGetPage(buf); > + > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); > + > + xlrec.locator = seqrel->rd_locator; > + xlrec.created = false; > + > + XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec)); > + XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len); > + > + recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); > + > + PageSetLSN(page, recptr); > + } > + > + END_CRIT_SECTION(); > + > + UnlockReleaseBuffer(buf); > + > + /* Clear local cache so that we don't think we have cached numbers */ > + /* Note that we do not change the currval() state */ > + elm->cached = elm->last; > + > + relation_close(seqrel, NoLock); > +} > + > +/* > + * Update the sequence state by creating a new relfilenode. > + * > + * This creates a new relfilenode, to allow transactional behavior. 
> + */ > +static void > +SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called) > +{ > + SeqTable elm; > + Relation seqrel; > + Buffer buf; > + HeapTupleData seqdatatuple; > + Form_pg_sequence_data seq; > + HeapTuple tuple; > + > + /* open and lock sequence */ > + init_sequence(seq_relid, &elm, &seqrel); > + > + /* lock page' buffer and read tuple */ > + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); > + > + /* Copy the existing sequence tuple. */ > + tuple = heap_copytuple(&seqdatatuple); > + > + /* Now we're done with the old page */ > + UnlockReleaseBuffer(buf); > + > + /* > + * Modify the copied tuple to update the sequence state (similar to what > + * ResetSequence does). > + */ > + seq = (Form_pg_sequence_data) GETSTRUCT(tuple); > + seq->last_value = last_value; > + seq->is_called = is_called; > + seq->log_cnt = log_cnt; > + > + /* > + * Create a new storage file for the sequence - this is needed for the > + * transactional behavior. > + */ > + RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence); > + > + /* > + * Ensure sequence's relfrozenxid is at 0, since it won't contain any > + * unfrozen XIDs. Same with relminmxid, since a sequence will never > + * contain multixacts. > + */ > + Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId); > + Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId); > + > + /* > + * Insert the modified tuple into the new storage file. This does all the > + * necessary WAL-logging etc. > + */ > + fill_seq_with_data(seqrel, tuple); > + > + /* Clear local cache so that we don't think we have cached numbers */ > + /* Note that we do not change the currval() state */ > + elm->cached = elm->last; > + > + relation_close(seqrel, NoLock); > +} > + > +/* > + * Set a sequence to a specified internal state. > + * > + * The change is made transactionally, so that on failure of the current > + * transaction, the sequence will be restored to its previous state. > + * We do that by creating a whole new relfilenode for the sequence; so this > + * works much like the rewriting forms of ALTER TABLE. > + * > + * Caller is assumed to have acquired AccessExclusiveLock on the sequence, > + * which must not be released until end of transaction. Caller is also > + * responsible for permissions checking. > + */ > +void > +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) > +{ > + if (transactional) > + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); > + else > + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); > +} That's a lot of duplication with existing code. There's no explanation why SetSequence() as well as do_setval() exists. > /* > * Initialize a sequence's relation with the specified tuple as content > * > @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) > > /* check the comment above nextval_internal()'s equivalent call. */ > if (RelationNeedsWAL(rel)) > + { > GetTopTransactionId(); > > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > + } Is it actually possible to reach this without an xid already having been assigned for the current xact? > @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) > * It's sufficient to ensure the toplevel transaction has an xid, no need > * to assign xids subxacts, that'll already trigger an appropriate wait. 
> * (Have to do that here, so we're outside the critical section) > + * > + * We have to ensure we have a proper XID, which will be included in > + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() > + * in a subxact (without any preceding changes) would get XID 0, and it > + * would then be impossible to decide which top xact it belongs to. > + * It'd also trigger assert in DecodeSequence. We only do that with > + * wal_level=logical, though. > + * > + * XXX This might seem unnecessary, because if there's no XID the xact > + * couldn't have done anything important yet, e.g. it could not have > + * created a sequence. But that's incorrect, because of subxacts. The > + * current subtransaction might not have done anything yet (thus no XID), > + * but an earlier one might have created the sequence. > */ What about restricting this to the case you're mentioning, i.e. subtransactions? > @@ -845,6 +1023,7 @@ nextval_internal(Oid relid, bool check_permissions) > seq->log_cnt = 0; > > xlrec.locator = seqrel->rd_locator; I realize this isn't from this patch, but: Why do we include the locator in the record? We already have it via XLogRegisterBuffer(), no? And afaict we don't even use it, as we read the page via XLogInitBufferForRedo() during recovery. Kinda looks like an oversight in 2c03216d8311 > +/* > + * Handle sequence decode > + * > + * Decoding sequences is a bit tricky, because while most sequence actions > + * are non-transactional (not subject to rollback), some need to be handled > + * as transactional. > + * > + * By default, a sequence increment is non-transactional - we must not queue > + * it in a transaction as other changes, because the transaction might get > + * rolled back and we'd discard the increment. The downstream would not be > + * notified about the increment, which is wrong. > + * > + * On the other hand, the sequence may be created in a transaction. In this > + * case we *should* queue the change as other changes in the transaction, > + * because we don't want to send the increments for unknown sequence to the > + * plugin - it might get confused about which sequence it's related to etc. > + */ > +void > +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > +{ > + /* extract the WAL record, with "created" flag */ > + xlrec = (xl_seq_rec *) XLogRecGetData(r); > + > + /* XXX how could we have sequence change without data? */ > + if(!datalen || !tupledata) > + return; Yea, I think we should error out here instead, something has gone quite wrong if this happens. > + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); > + DecodeSeqTuple(tupledata, datalen, tuplebuf); > + > + /* > + * Should we handle the sequence increment as transactional or not? > + * > + * If the sequence was created in a still-running transaction, treat > + * it as transactional and queue the increments. Otherwise it needs > + * to be treated as non-transactional, in which case we send it to > + * the plugin right away. > + */ > + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, > + target_locator, > + xlrec->created); Why re-create this information during decoding, when we basically already have it available on the primary? I think we already pay the price for that tracking, which we e.g. use for doing a non-transactional truncate: /* * Normally, we need a transaction-safe truncation here. 
However, if * the table was either created in the current (sub)transaction or has * a new relfilenumber in the current (sub)transaction, then we can * just truncate it in-place, because a rollback would cause the whole * table or the current physical file to be thrown away anyway. */ if (rel->rd_createSubid == mySubid || rel->rd_newRelfilelocatorSubid == mySubid) { /* Immediate, non-rollbackable truncation is OK */ heap_truncate_one_rel(rel); } Afaict we could do something similar for sequences, except that I think we would just check if the sequence was created in the current transaction (i.e. any of the fields are set). > +/* > + * A transactional sequence increment is queued to be processed upon commit > + * and a non-transactional increment gets processed immediately. > + * > + * A sequence update may be both transactional and non-transactional. When > + * created in a running transaction, treat it as transactional and queue > + * the change in it. Otherwise treat it as non-transactional, so that we > + * don't forget the increment in case of a rollback. > + */ > +void > +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, > + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, > + RelFileLocator rlocator, bool transactional, bool created, > + ReorderBufferTupleBuf *tuplebuf) > + /* > + * Decoding needs access to syscaches et al., which in turn use > + * heavyweight locks and such. Thus we need to have enough state around to > + * keep track of those. The easiest way is to simply use a transaction > + * internally. That also allows us to easily enforce that nothing writes > + * to the database by checking for xid assignments. > + * > + * When we're called via the SQL SRF there's already a transaction > + * started, so start an explicit subtransaction there. > + */ > + using_subtxn = IsTransactionOrTransactionBlock(); This duplicates a lot of the code from ReorderBufferProcessTXN(). But only does so partially. It's hard to tell whether some of the differences are intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? Maybe something like void ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); void ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); Greetings, Andres Freund
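Following that suggestion through, the primary-side check could be as simple as the sketch below - the function is an assumption, not existing code; the rd_* fields are the same ones the quoted truncate logic relies on:

    /*
     * Decide at WAL-logging time whether a sequence change has to be
     * decoded transactionally: if the sequence was created, or got a new
     * relfilenumber, in the current (sub)transaction, a rollback throws
     * it away entirely, so its changes belong inside the transaction.
     */
    static inline bool
    sequence_change_is_transactional(Relation seqrel)
    {
        return seqrel->rd_createSubid != InvalidSubTransactionId ||
               seqrel->rd_newRelfilelocatorSubid != InvalidSubTransactionId ||
               seqrel->rd_firstRelfilelocatorSubid != InvalidSubTransactionId;
    }

The result could then travel in the WAL record itself (say, next to the existing "created" flag in xl_seq_rec), sparing the decoder the relfilenode tracking that ReorderBufferSequenceIsTransactional() has to do during decoding.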
On Wed, Jan 11, 2023 at 1:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I agree that it's fine for the sequence to be slightly ahead, but I > > think that it can't be too far ahead without causing problems. Suppose > > for example that transaction #1 creates a sequence. Transaction #2 > > does nextval on the sequence a bunch of times and inserts rows into a > > table using the sequence values as the PK. It's fine if the nextval > > operations are replicated ahead of the commit of transaction #2 -- in > > fact I'd say it's necessary for correctness -- but they can't precede > > the commit of transaction #1, since then the sequence won't exist yet. > > It's not clear to me how could that even happen. If transaction #1 > creates a sequence, it's invisible for transaction #2. So how could it > do nextval() on it? #2 has to wait for #1 to commit before it can do > anything on the sequence, which enforces the correct ordering, no? Yeah, I meant if #1 had committed and then #2 started to do its thing. I was worried that decoding might reach the nextval operations in transaction #2 before it replayed #1. This worry may be entirely based on me not understanding how this actually works. Do we always apply a transaction as soon as we see the commit record for it, before decoding any further? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-01-11 15:23:18 -0500, Robert Haas wrote: > Yeah, I meant if #1 had committed and then #2 started to do its thing. > I was worried that decoding might reach the nextval operations in > transaction #2 before it replayed #1. > > This worry may be entirely based on me not understanding how this > actually works. Do we always apply a transaction as soon as we see the > commit record for it, before decoding any further? Yes. Otherwise we'd have a really hard time figuring out the correct historical snapshot to use for subsequent transactions - they'd have been able to see the catalog modifications made by the committing transaction. Greetings, Andres Freund
On Wed, Jan 11, 2023 at 3:28 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-11 15:23:18 -0500, Robert Haas wrote: > > Yeah, I meant if #1 had committed and then #2 started to do its thing. > > I was worried that decoding might reach the nextval operations in > > transaction #2 before it replayed #1. > > > > This worry may be entirely based on me not understanding how this > > actually works. Do we always apply a transaction as soon as we see the > > commit record for it, before decoding any further? > > Yes. > > Otherwise we'd have a really hard time figuring out the correct historical > snapshot to use for subsequent transactions - they'd have been able to see the > catalog modifications made by the committing transaction. I wonder, then, what happens if somebody wants to do parallel apply. That would seem to require some relaxation of this rule, but then doesn't that break what this patch wants to do? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-01-11 15:41:45 -0500, Robert Haas wrote: > I wonder, then, what happens if somebody wants to do parallel apply. That > would seem to require some relaxation of this rule, but then doesn't that > break what this patch wants to do? I don't think it'd pose a direct problem - presumably you'd only parallelize applying changes, not committing the transactions containing them. You'd get a lot of inconsistencies otherwise. If you're thinking of decoding changes in parallel (rather than streaming out large changes before commit when possible), you'd only be able to do that in cases when transactions haven't performed catalog changes, I think. In which case there'd also be no issue wrt transactional sequence changes. Greetings, Andres Freund
On 1/11/23 21:58, Andres Freund wrote: > Hi, > > On 2023-01-11 15:41:45 -0500, Robert Haas wrote: >> I wonder, then, what happens if somebody wants to do parallel apply. That >> would seem to require some relaxation of this rule, but then doesn't that >> break what this patch wants to do? > > I don't think it'd pose a direct problem - presumably you'd only parallelize > applying changes, not committing the transactions containing them. You'd get a > lot of inconsistencies otherwise. > Right. It's the commit order that matters - as long as that's maintained, the result should be consistent etc. There are plenty of other hard problems, though - for example it's trivial for the apply workers to apply the changes in the incorrect order (contradicting commit order) and then hit a deadlock. And the deadlock detector may easily keep aborting the incorrect worker (the oldest one), so that the replication grinds down to a halt. I was wondering recently how far we would get by just doing prefetch for logical apply - instead of applying the changes, just try doing a lookup on the replica identity values, and then do a simple serial apply. > If you're thinking of decoding changes in parallel (rather than streaming out > large changes before commit when possible), you'd only be able to do that in > cases when transactions haven't performed catalog changes, I think. In which > case there'd also be no issue wrt transactional sequence changes. > Perhaps, although it's not clear to me how you would know that in advance. I mean, you could start decoding changes in parallel, and then you find that one of the earlier transactions touched a catalog. But maybe I misunderstand what "decoding" refers to - don't we need the snapshot only in reorderbuffer? In which case all the other stuff could be parallelized (not sure if that's really expensive). Anyway, all of this is far out of scope of this patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/11/23 21:12, Andres Freund wrote: > Hi, > > > Heikki, CCed you due to the point about 2c03216d8311 below. > > > On 2023-01-10 19:32:12 +0100, Tomas Vondra wrote: >> 0001 is a fix for the pre-existing issue in logicalmsg_decode, >> attempting to build a snapshot before getting into a consistent state. >> AFAICS this only affects assert-enabled builds and is otherwise >> harmless, because we are not actually using the snapshot (apply gets a >> valid snapshot from the transaction). > > LGTM. > > >> 0002 is a rebased version of the original approach, committed as >> 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix >> as 0001 (for the sequence messages), the primary reason for the revert. >> >> The rebase was not quite straightforward, due to extensive changes in >> how publications deal with tables/schemas, and so on. So this adopts >> them, but other than that it behaves just like the original patch. > > This is a huge diff: >> 72 files changed, 4715 insertions(+), 612 deletions(-) > > It'd be nice to split it to make review easier. Perhaps the sequence decoding > support could be split from the whole publication rigamarole? > > >> This does not include any changes to test_decoding and/or the built-in >> replication - those will be committed in separate patches. > > Looks like that's not the case anymore? > Ah, right! Now I realized I originally committed this in chunks, but the revert was a single commit. And I just "reverted the revert" to create this patch. I'll definitely split this into smaller patches. This also explains the obsolete commit message about test_decoding not being included, etc. > >> +/* >> + * Update the sequence state by modifying the existing sequence data row. >> + * >> + * This keeps the same relfilenode, so the behavior is non-transactional. >> + */ >> +static void >> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) >> +{ >> + SeqTable elm; >> + Relation seqrel; >> + Buffer buf; >> + HeapTupleData seqdatatuple; >> + Form_pg_sequence_data seq; >> + >> + /* open and lock sequence */ >> + init_sequence(seqrelid, &elm, &seqrel); >> + >> + /* lock page' buffer and read tuple */ >> + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); >> + >> + /* check the comment above nextval_internal()'s equivalent call. 
*/ >> + if (RelationNeedsWAL(seqrel)) >> + { >> + GetTopTransactionId(); >> + >> + if (XLogLogicalInfoActive()) >> + GetCurrentTransactionId(); >> + } >> + >> + /* ready to change the on-disk (or really, in-buffer) tuple */ >> + START_CRIT_SECTION(); >> + >> + seq->last_value = last_value; >> + seq->is_called = is_called; >> + seq->log_cnt = log_cnt; >> + >> + MarkBufferDirty(buf); >> + >> + /* XLOG stuff */ >> + if (RelationNeedsWAL(seqrel)) >> + { >> + xl_seq_rec xlrec; >> + XLogRecPtr recptr; >> + Page page = BufferGetPage(buf); >> + >> + XLogBeginInsert(); >> + XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); >> + >> + xlrec.locator = seqrel->rd_locator; >> + xlrec.created = false; >> + >> + XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec)); >> + XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len); >> + >> + recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); >> + >> + PageSetLSN(page, recptr); >> + } >> + >> + END_CRIT_SECTION(); >> + >> + UnlockReleaseBuffer(buf); >> + >> + /* Clear local cache so that we don't think we have cached numbers */ >> + /* Note that we do not change the currval() state */ >> + elm->cached = elm->last; >> + >> + relation_close(seqrel, NoLock); >> +} >> + >> +/* >> + * Update the sequence state by creating a new relfilenode. >> + * >> + * This creates a new relfilenode, to allow transactional behavior. >> + */ >> +static void >> +SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called) >> +{ >> + SeqTable elm; >> + Relation seqrel; >> + Buffer buf; >> + HeapTupleData seqdatatuple; >> + Form_pg_sequence_data seq; >> + HeapTuple tuple; >> + >> + /* open and lock sequence */ >> + init_sequence(seq_relid, &elm, &seqrel); >> + >> + /* lock page' buffer and read tuple */ >> + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); >> + >> + /* Copy the existing sequence tuple. */ >> + tuple = heap_copytuple(&seqdatatuple); >> + >> + /* Now we're done with the old page */ >> + UnlockReleaseBuffer(buf); >> + >> + /* >> + * Modify the copied tuple to update the sequence state (similar to what >> + * ResetSequence does). >> + */ >> + seq = (Form_pg_sequence_data) GETSTRUCT(tuple); >> + seq->last_value = last_value; >> + seq->is_called = is_called; >> + seq->log_cnt = log_cnt; >> + >> + /* >> + * Create a new storage file for the sequence - this is needed for the >> + * transactional behavior. >> + */ >> + RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence); >> + >> + /* >> + * Ensure sequence's relfrozenxid is at 0, since it won't contain any >> + * unfrozen XIDs. Same with relminmxid, since a sequence will never >> + * contain multixacts. >> + */ >> + Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId); >> + Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId); >> + >> + /* >> + * Insert the modified tuple into the new storage file. This does all the >> + * necessary WAL-logging etc. >> + */ >> + fill_seq_with_data(seqrel, tuple); >> + >> + /* Clear local cache so that we don't think we have cached numbers */ >> + /* Note that we do not change the currval() state */ >> + elm->cached = elm->last; >> + >> + relation_close(seqrel, NoLock); >> +} >> + >> +/* >> + * Set a sequence to a specified internal state. >> + * >> + * The change is made transactionally, so that on failure of the current >> + * transaction, the sequence will be restored to its previous state. >> + * We do that by creating a whole new relfilenode for the sequence; so this >> + * works much like the rewriting forms of ALTER TABLE. 
>> + * >> + * Caller is assumed to have acquired AccessExclusiveLock on the sequence, >> + * which must not be released until end of transaction. Caller is also >> + * responsible for permissions checking. >> + */ >> +void >> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) >> +{ >> + if (transactional) >> + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); >> + else >> + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); >> +} > > That's a lot of duplication with existing code. There's no explanation why > SetSequence() as well as do_setval() exists. > Thanks, I'll look into this. > >> /* >> * Initialize a sequence's relation with the specified tuple as content >> * >> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) >> >> /* check the comment above nextval_internal()'s equivalent call. */ >> if (RelationNeedsWAL(rel)) >> + { >> GetTopTransactionId(); >> >> + if (XLogLogicalInfoActive()) >> + GetCurrentTransactionId(); >> + } > > Is it actually possible to reach this without an xid already having been > assigned for the current xact? > I believe it is. That's probably how I found this change is needed, actually. > > >> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) >> * It's sufficient to ensure the toplevel transaction has an xid, no need >> * to assign xids subxacts, that'll already trigger an appropriate wait. >> * (Have to do that here, so we're outside the critical section) >> + * >> + * We have to ensure we have a proper XID, which will be included in >> + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() >> + * in a subxact (without any preceding changes) would get XID 0, and it >> + * would then be impossible to decide which top xact it belongs to. >> + * It'd also trigger assert in DecodeSequence. We only do that with >> + * wal_level=logical, though. >> + * >> + * XXX This might seem unnecessary, because if there's no XID the xact >> + * couldn't have done anything important yet, e.g. it could not have >> + * created a sequence. But that's incorrect, because of subxacts. The >> + * current subtransaction might not have done anything yet (thus no XID), >> + * but an earlier one might have created the sequence. >> */ > > What about restricting this to the case you're mentioning, > i.e. subtransactions? > That might work, but I need to think about it a bit. I don't think it'd save us much, though. I mean, vast majority of transactions (and subtransactions) calling nextval() will then do something else which requires a XID. This just moves the XID a bit, that's all. > >> @@ -845,6 +1023,7 @@ nextval_internal(Oid relid, bool check_permissions) >> seq->log_cnt = 0; >> >> xlrec.locator = seqrel->rd_locator; > > I realize this isn't from this patch, but: > > Why do we include the locator in the record? We already have it via > XLogRegisterBuffer(), no? And afaict we don't even use it, as we read the page > via XLogInitBufferForRedo() during recovery. > > Kinda looks like an oversight in 2c03216d8311 > I don't know, it's what the code did. > > > >> +/* >> + * Handle sequence decode >> + * >> + * Decoding sequences is a bit tricky, because while most sequence actions >> + * are non-transactional (not subject to rollback), some need to be handled >> + * as transactional. 
>> + * >> + * By default, a sequence increment is non-transactional - we must not queue >> + * it in a transaction as other changes, because the transaction might get >> + * rolled back and we'd discard the increment. The downstream would not be >> + * notified about the increment, which is wrong. >> + * >> + * On the other hand, the sequence may be created in a transaction. In this >> + * case we *should* queue the change as other changes in the transaction, >> + * because we don't want to send the increments for unknown sequence to the >> + * plugin - it might get confused about which sequence it's related to etc. >> + */ >> +void >> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) >> +{ > >> + /* extract the WAL record, with "created" flag */ >> + xlrec = (xl_seq_rec *) XLogRecGetData(r); >> + >> + /* XXX how could we have sequence change without data? */ >> + if(!datalen || !tupledata) >> + return; > > Yea, I think we should error out here instead, something has gone quite wrong > if this happens. > OK > >> + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); >> + DecodeSeqTuple(tupledata, datalen, tuplebuf); >> + >> + /* >> + * Should we handle the sequence increment as transactional or not? >> + * >> + * If the sequence was created in a still-running transaction, treat >> + * it as transactional and queue the increments. Otherwise it needs >> + * to be treated as non-transactional, in which case we send it to >> + * the plugin right away. >> + */ >> + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, >> + target_locator, >> + xlrec->created); > > Why re-create this information during decoding, when we basically already have > it available on the primary? I think we already pay the price for that > tracking, which we e.g. use for doing a non-transactional truncate: > > /* > * Normally, we need a transaction-safe truncation here. However, if > * the table was either created in the current (sub)transaction or has > * a new relfilenumber in the current (sub)transaction, then we can > * just truncate it in-place, because a rollback would cause the whole > * table or the current physical file to be thrown away anyway. > */ > if (rel->rd_createSubid == mySubid || > rel->rd_newRelfilelocatorSubid == mySubid) > { > /* Immediate, non-rollbackable truncation is OK */ > heap_truncate_one_rel(rel); > } > > Afaict we could do something similar for sequences, except that I think we > would just check if the sequence was created in the current transaction > (i.e. any of the fields are set). > Hmm, good point. > >> +/* >> + * A transactional sequence increment is queued to be processed upon commit >> + * and a non-transactional increment gets processed immediately. >> + * >> + * A sequence update may be both transactional and non-transactional. When >> + * created in a running transaction, treat it as transactional and queue >> + * the change in it. Otherwise treat it as non-transactional, so that we >> + * don't forget the increment in case of a rollback. >> + */ >> +void >> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, >> + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, >> + RelFileLocator rlocator, bool transactional, bool created, >> + ReorderBufferTupleBuf *tuplebuf) > > >> + /* >> + * Decoding needs access to syscaches et al., which in turn use >> + * heavyweight locks and such. Thus we need to have enough state around to >> + * keep track of those. The easiest way is to simply use a transaction >> + * internally. 
That also allows us to easily enforce that nothing writes >> + * to the database by checking for xid assignments. >> + * >> + * When we're called via the SQL SRF there's already a transaction >> + * started, so start an explicit subtransaction there. >> + */ >> + using_subtxn = IsTransactionOrTransactionBlock(); > > This duplicates a lot of the code from ReorderBufferProcessTXN(). But only > does so partially. It's hard to tell whether some of the differences are > intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? > > Maybe something like > > void > ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); > > void > ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); > Thanks for the suggestion, I'll definitely consider that in the next version of the patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
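A minimal SQL illustration of the transactional vs. non-transactional distinction discussed above (this is plain PostgreSQL behavior, nothing patch-specific):

CREATE SEQUENCE s;

BEGIN;
SELECT nextval('s');    -- returns 1
ROLLBACK;
SELECT nextval('s');    -- returns 2; the increment survived the rollback

BEGIN;
CREATE SEQUENCE s2;
SELECT nextval('s2');   -- returns 1
ROLLBACK;               -- s2 and its increment are both discarded

The first case is why increments generally have to be sent to the output plugin right away, and the second is why increments of a sequence created in a still-running transaction have to be queued with that transaction.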
Hi, On 2023-01-11 22:30:42 +0100, Tomas Vondra wrote: > On 1/11/23 21:58, Andres Freund wrote: > > If you're thinking of decoding changes in parallel (rather than streaming out > > large changes before commit when possible), you'd only be able to do that in > > cases when transactions haven't performed catalog changes, I think. In which > > case there'd also be no issue wrt transactional sequence changes. > > > > Perhaps, although it's not clear to me how you would know that in > advance? I mean, you could start decoding changes in parallel, and then > you find one of the earlier transactions touched a catalog. You could have a running count of in-progress catalog modifying transactions and not allow parallelized processing when that's not 0. > But maybe I misunderstand what "decoding" refers to - don't we need the > snapshot only in reorderbuffer? In which case all the other stuff could > be parallelized (not sure if that's really expensive). Calling output functions is pretty expensive, so being able to call those in parallel has some benefits. But I don't think we're there. > Anyway, all of this is far out of scope of this patch. Yea, clearly that's independent work. And I don't think relying on commit order in one more place, i.e. for sequences, would make it harder. Greetings, Andres Freund
Hi, here's a slightly updated version - the main change is splitting the patch into multiple parts, along the lines of the original patch reverted in 2c7ea57e56ca5f668c32d4266e0a3e45b455bef5: - basic sequence decoding infrastructure - support in test_decoding - support in built-in logical replication The revert mentions a couple additional parts, but those were mostly fixes / improvements. And those are not merged into the three parts. On 1/11/23 22:46, Tomas Vondra wrote: > >>... >> >>> +/* >>> + * Update the sequence state by modifying the existing sequence data row. >>> + * >>> + * This keeps the same relfilenode, so the behavior is non-transactional. >>> + */ >>> +static void >>> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) >>> +{ >>> ... >>> >>> +void >>> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) >>> +{ >>> + if (transactional) >>> + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); >>> + else >>> + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); >>> +} >> >> That's a lot of duplication with existing code. There's no explanation why >> SetSequence() as well as do_setval() exists. >> > > Thanks, I'll look into this. > I haven't done anything about this yet. The functions are doing similar things, but there's also a fair number of differences so I haven't found a good way to merge them yet. >> >>> /* >>> * Initialize a sequence's relation with the specified tuple as content >>> * >>> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) >>> >>> /* check the comment above nextval_internal()'s equivalent call. */ >>> if (RelationNeedsWAL(rel)) >>> + { >>> GetTopTransactionId(); >>> >>> + if (XLogLogicalInfoActive()) >>> + GetCurrentTransactionId(); >>> + } >> >> Is it actually possible to reach this without an xid already having been >> assigned for the current xact? >> > > I believe it is. That's probably how I found this change is needed, > actually. > I've added a comment explaining why this is needed. I don't think it's worth trying to optimize this, because in plausible workloads we'd just delay the work a little bit. >> >> >>> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) >>> * It's sufficient to ensure the toplevel transaction has an xid, no need >>> * to assign xids subxacts, that'll already trigger an appropriate wait. >>> * (Have to do that here, so we're outside the critical section) >>> + * >>> + * We have to ensure we have a proper XID, which will be included in >>> + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() >>> + * in a subxact (without any preceding changes) would get XID 0, and it >>> + * would then be impossible to decide which top xact it belongs to. >>> + * It'd also trigger assert in DecodeSequence. We only do that with >>> + * wal_level=logical, though. >>> + * >>> + * XXX This might seem unnecessary, because if there's no XID the xact >>> + * couldn't have done anything important yet, e.g. it could not have >>> + * created a sequence. But that's incorrect, because of subxacts. The >>> + * current subtransaction might not have done anything yet (thus no XID), >>> + * but an earlier one might have created the sequence. >>> */ >> >> What about restricting this to the case you're mentioning, >> i.e. subtransactions? >> > > That might work, but I need to think about it a bit. > > I don't think it'd save us much, though.
I mean, vast majority of > transactions (and subtransactions) calling nextval() will then do > something else which requires a XID. This just moves the XID a bit, > that's all. > After thinking about this a bit more, I don't think the optimization is worth it, for the reasons explained above. >> >>> +/* >>> + * Handle sequence decode >>> + * >>> + * Decoding sequences is a bit tricky, because while most sequence actions >>> + * are non-transactional (not subject to rollback), some need to be handled >>> + * as transactional. >>> + * >>> + * By default, a sequence increment is non-transactional - we must not queue >>> + * it in a transaction as other changes, because the transaction might get >>> + * rolled back and we'd discard the increment. The downstream would not be >>> + * notified about the increment, which is wrong. >>> + * >>> + * On the other hand, the sequence may be created in a transaction. In this >>> + * case we *should* queue the change as other changes in the transaction, >>> + * because we don't want to send the increments for unknown sequence to the >>> + * plugin - it might get confused about which sequence it's related to etc. >>> + */ >>> +void >>> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) >>> +{ >> >>> + /* extract the WAL record, with "created" flag */ >>> + xlrec = (xl_seq_rec *) XLogRecGetData(r); >>> + >>> + /* XXX how could we have sequence change without data? */ >>> + if(!datalen || !tupledata) >>> + return; >> >> Yea, I think we should error out here instead, something has gone quite wrong >> if this happens. >> > > OK > Done. >> >>> + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); >>> + DecodeSeqTuple(tupledata, datalen, tuplebuf); >>> + >>> + /* >>> + * Should we handle the sequence increment as transactional or not? >>> + * >>> + * If the sequence was created in a still-running transaction, treat >>> + * it as transactional and queue the increments. Otherwise it needs >>> + * to be treated as non-transactional, in which case we send it to >>> + * the plugin right away. >>> + */ >>> + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, >>> + target_locator, >>> + xlrec->created); >> >> Why re-create this information during decoding, when we basically already have >> it available on the primary? I think we already pay the price for that >> tracking, which we e.g. use for doing a non-transactional truncate: >> >> /* >> * Normally, we need a transaction-safe truncation here. However, if >> * the table was either created in the current (sub)transaction or has >> * a new relfilenumber in the current (sub)transaction, then we can >> * just truncate it in-place, because a rollback would cause the whole >> * table or the current physical file to be thrown away anyway. >> */ >> if (rel->rd_createSubid == mySubid || >> rel->rd_newRelfilelocatorSubid == mySubid) >> { >> /* Immediate, non-rollbackable truncation is OK */ >> heap_truncate_one_rel(rel); >> } >> >> Afaict we could do something similar for sequences, except that I think we >> would just check if the sequence was created in the current transaction >> (i.e. any of the fields are set). >> > > Hmm, good point. > But rd_createSubid/rd_newRelfilelocatorSubid fields are available only in the original transaction, not during decoding. So we'd have to do this check there and add the result to the WAL record. Is that what you had in mind? 
>> >>> +/* >>> + * A transactional sequence increment is queued to be processed upon commit >>> + * and a non-transactional increment gets processed immediately. >>> + * >>> + * A sequence update may be both transactional and non-transactional. When >>> + * created in a running transaction, treat it as transactional and queue >>> + * the change in it. Otherwise treat it as non-transactional, so that we >>> + * don't forget the increment in case of a rollback. >>> + */ >>> +void >>> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, >>> + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, >>> + RelFileLocator rlocator, bool transactional, bool created, >>> + ReorderBufferTupleBuf *tuplebuf) >> >> >>> + /* >>> + * Decoding needs access to syscaches et al., which in turn use >>> + * heavyweight locks and such. Thus we need to have enough state around to >>> + * keep track of those. The easiest way is to simply use a transaction >>> + * internally. That also allows us to easily enforce that nothing writes >>> + * to the database by checking for xid assignments. >>> + * >>> + * When we're called via the SQL SRF there's already a transaction >>> + * started, so start an explicit subtransaction there. >>> + */ >>> + using_subtxn = IsTransactionOrTransactionBlock(); >> >> This duplicates a lot of the code from ReorderBufferProcessTXN(). But only >> does so partially. It's hard to tell whether some of the differences are >> intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? >> >> Maybe something like >> >> void >> ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); >> >> void >> ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); >> > > Thanks for the suggestion, I'll definitely consider that in the next > version of the patch. I did look at the code a bit, but I'm not sure there really is a lot of duplicated code - yes, we start/abort the (sub)transaction, setup and tear down the snapshot, etc. Or what else would you put into the two new functions? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
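For reference, the subxact case the XXX comment describes can be constructed like this (a sketch, assuming wal_level=logical with the patch applied):

BEGIN;
SAVEPOINT a;
CREATE SEQUENCE s3;     -- subxact "a" gets an XID assigned
RELEASE SAVEPOINT a;
SAVEPOINT b;
SELECT nextval('s3');   -- first action in subxact "b"; without forcing an
                        -- XID assignment here, the WAL record would carry
                        -- XID 0, and decoding could not tell which
                        -- top-level xact the increment belongs to
COMMIT;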
Attachment
cfbot didn't like the rebased / split patch, and after looking at it I believe it's a bug in parallel apply of large transactions (216a784829), which seems to have changed interpretation of in_remote_transaction and in_streamed_transaction. I've reported the issue on that thread [1], but here's a version with a temporary workaround so that we can continue reviewing it. regards [1] https://www.postgresql.org/message-id/984ff689-adde-9977-affe-cd6029e850be%40enterprisedb.com On 1/15/23 00:39, Tomas Vondra wrote: > Hi, > > here's a slightly updated version - the main change is splitting the > patch into multiple parts, along the lines of the original patch > reverted in 2c7ea57e56ca5f668c32d4266e0a3e45b455bef5: > > - basic sequence decoding infrastructure > - support in test_decoding > - support in built-in logical replication > > The revert mentions a couple additional parts, but those were mostly > fixes / improvements. And those are not merged into the three parts. > > > On 1/11/23 22:46, Tomas Vondra wrote: >> >>> ... >>> >>>> +/* >>>> + * Update the sequence state by modifying the existing sequence data row. >>>> + * >>>> + * This keeps the same relfilenode, so the behavior is non-transactional. >>>> + */ >>>> +static void >>>> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called) >>>> +{ >>>> ... >>>> >>>> +void >>>> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called) >>>> +{ >>>> + if (transactional) >>>> + SetSequence_transactional(seq_relid, last_value, log_cnt, is_called); >>>> + else >>>> + SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called); >>>> +} >>> >>> That's a lot of duplication with existing code. There's no explanation why >>> SetSequence() as well as do_setval() exists. >>> >> >> Thanks, I'll look into this. >> > > I haven't done anything about this yet. The functions are doing similar > things, but there's also a fair amount of differences so I haven't found > a good way to merge them yet. > >>> >>>> /* >>>> * Initialize a sequence's relation with the specified tuple as content >>>> * >>>> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum) >>>> >>>> /* check the comment above nextval_internal()'s equivalent call. */ >>>> if (RelationNeedsWAL(rel)) >>>> + { >>>> GetTopTransactionId(); >>>> >>>> + if (XLogLogicalInfoActive()) >>>> + GetCurrentTransactionId(); >>>> + } >>> >>> Is it actually possible to reach this without an xid already having been >>> assigned for the current xact? >>> >> >> I believe it is. That's probably how I found this change is needed, >> actually. >> > > I've added a comment explaining why this needed. I don't think it's > worth trying to optimize this, because in plausible workloads we'd just > delay the work a little bit. > >>> >>> >>>> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions) >>>> * It's sufficient to ensure the toplevel transaction has an xid, no need >>>> * to assign xids subxacts, that'll already trigger an appropriate wait. >>>> * (Have to do that here, so we're outside the critical section) >>>> + * >>>> + * We have to ensure we have a proper XID, which will be included in >>>> + * the XLOG record by XLogRecordAssemble. Otherwise the first nextval() >>>> + * in a subxact (without any preceding changes) would get XID 0, and it >>>> + * would then be impossible to decide which top xact it belongs to. >>>> + * It'd also trigger assert in DecodeSequence. 
We only do that with >>>> + * wal_level=logical, though. >>>> + * >>>> + * XXX This might seem unnecessary, because if there's no XID the xact >>>> + * couldn't have done anything important yet, e.g. it could not have >>>> + * created a sequence. But that's incorrect, because of subxacts. The >>>> + * current subtransaction might not have done anything yet (thus no XID), >>>> + * but an earlier one might have created the sequence. >>>> */ >>> >>> What about restricting this to the case you're mentioning, >>> i.e. subtransactions? >>> >> >> That might work, but I need to think about it a bit. >> >> I don't think it'd save us much, though. I mean, vast majority of >> transactions (and subtransactions) calling nextval() will then do >> something else which requires a XID. This just moves the XID a bit, >> that's all. >> > > After thinking about this a bit more, I don't think the optimization is > worth it, for the reasons explained above. > >>> >>>> +/* >>>> + * Handle sequence decode >>>> + * >>>> + * Decoding sequences is a bit tricky, because while most sequence actions >>>> + * are non-transactional (not subject to rollback), some need to be handled >>>> + * as transactional. >>>> + * >>>> + * By default, a sequence increment is non-transactional - we must not queue >>>> + * it in a transaction as other changes, because the transaction might get >>>> + * rolled back and we'd discard the increment. The downstream would not be >>>> + * notified about the increment, which is wrong. >>>> + * >>>> + * On the other hand, the sequence may be created in a transaction. In this >>>> + * case we *should* queue the change as other changes in the transaction, >>>> + * because we don't want to send the increments for unknown sequence to the >>>> + * plugin - it might get confused about which sequence it's related to etc. >>>> + */ >>>> +void >>>> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) >>>> +{ >>> >>>> + /* extract the WAL record, with "created" flag */ >>>> + xlrec = (xl_seq_rec *) XLogRecGetData(r); >>>> + >>>> + /* XXX how could we have sequence change without data? */ >>>> + if(!datalen || !tupledata) >>>> + return; >>> >>> Yea, I think we should error out here instead, something has gone quite wrong >>> if this happens. >>> >> >> OK >> > > Done. > >>> >>>> + tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); >>>> + DecodeSeqTuple(tupledata, datalen, tuplebuf); >>>> + >>>> + /* >>>> + * Should we handle the sequence increment as transactional or not? >>>> + * >>>> + * If the sequence was created in a still-running transaction, treat >>>> + * it as transactional and queue the increments. Otherwise it needs >>>> + * to be treated as non-transactional, in which case we send it to >>>> + * the plugin right away. >>>> + */ >>>> + transactional = ReorderBufferSequenceIsTransactional(ctx->reorder, >>>> + target_locator, >>>> + xlrec->created); >>> >>> Why re-create this information during decoding, when we basically already have >>> it available on the primary? I think we already pay the price for that >>> tracking, which we e.g. use for doing a non-transactional truncate: >>> >>> /* >>> * Normally, we need a transaction-safe truncation here. However, if >>> * the table was either created in the current (sub)transaction or has >>> * a new relfilenumber in the current (sub)transaction, then we can >>> * just truncate it in-place, because a rollback would cause the whole >>> * table or the current physical file to be thrown away anyway. 
>>> */ >>> if (rel->rd_createSubid == mySubid || >>> rel->rd_newRelfilelocatorSubid == mySubid) >>> { >>> /* Immediate, non-rollbackable truncation is OK */ >>> heap_truncate_one_rel(rel); >>> } >>> >>> Afaict we could do something similar for sequences, except that I think we >>> would just check if the sequence was created in the current transaction >>> (i.e. any of the fields are set). >>> >> >> Hmm, good point. >> > > But rd_createSubid/rd_newRelfilelocatorSubid fields are available only > in the original transaction, not during decoding. So we'd have to do > this check there and add the result to the WAL record. Is that what you > had in mind? > >>> >>>> +/* >>>> + * A transactional sequence increment is queued to be processed upon commit >>>> + * and a non-transactional increment gets processed immediately. >>>> + * >>>> + * A sequence update may be both transactional and non-transactional. When >>>> + * created in a running transaction, treat it as transactional and queue >>>> + * the change in it. Otherwise treat it as non-transactional, so that we >>>> + * don't forget the increment in case of a rollback. >>>> + */ >>>> +void >>>> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid, >>>> + Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id, >>>> + RelFileLocator rlocator, bool transactional, bool created, >>>> + ReorderBufferTupleBuf *tuplebuf) >>> >>> >>>> + /* >>>> + * Decoding needs access to syscaches et al., which in turn use >>>> + * heavyweight locks and such. Thus we need to have enough state around to >>>> + * keep track of those. The easiest way is to simply use a transaction >>>> + * internally. That also allows us to easily enforce that nothing writes >>>> + * to the database by checking for xid assignments. >>>> + * >>>> + * When we're called via the SQL SRF there's already a transaction >>>> + * started, so start an explicit subtransaction there. >>>> + */ >>>> + using_subtxn = IsTransactionOrTransactionBlock(); >>> >>> This duplicates a lot of the code from ReorderBufferProcessTXN(). But only >>> does so partially. It's hard to tell whether some of the differences are >>> intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()? >>> >>> Maybe something like >>> >>> void >>> ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals); >>> >>> void >>> ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error); >>> >> >> Thanks for the suggestion, I'll definitely consider that in the next >> version of the patch. > > I did look at the code a bit, but I'm not sure there really is a lot of > duplicated code - yes, we start/abort the (sub)transaction, setup and > tear down the snapshot, etc. Or what else would you put into the two new > functions? > > > regards > -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Fix-snapshot-handling-in-logicalmsg_decode-20230116.patch
- 0002-Logical-decoding-of-sequences-20230116.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230116.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230116.patch
- 0005-WIP-workaround-for-issue-in-parallel-apply-20230116.patch
On Mon, 16 Jan 2023 at 04:49, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > cfbot didn't like the rebased / split patch, and after looking at it I > believe it's a bug in parallel apply of large transactions (216a784829), > which seems to have changed interpretation of in_remote_transaction and > in_streamed_transaction. I've reported the issue on that thread [1], but > here's a version with a temporary workaround so that we can continue > reviewing it. > The patch does not apply on top of HEAD as in [1]; please post a rebased patch: === Applying patches on top of PostgreSQL commit ID 17e72ec45d313b98bd90b95bc71b4cc77c2c89c3 === === applying patch ./0001-Fix-snapshot-handling-in-logicalmsg_decode-20230116.patch patching file src/backend/replication/logical/decode.c patching file src/backend/replication/logical/reorderbuffer.c === applying patch ./0002-Logical-decoding-of-sequences-20230116.patch patching file doc/src/sgml/logicaldecoding.sgml Hunk #3 FAILED at 483. Hunk #4 FAILED at 494. Hunk #7 succeeded at 1252 (offset 4 lines). 2 out of 7 hunks FAILED -- saving rejects to file doc/src/sgml/logicaldecoding.sgml.rej [1] - http://cfbot.cputube.org/patch_41_3823.log Regards, Vignesh
Hi, Here's a rebased patch, without the last bit which is now unnecessary thanks to c981d9145dea. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hi, On 2/16/23 10:50 AM, Tomas Vondra wrote: > Hi, > > Here's a rebased patch, without the last bit which is now unnecessary > thanks to c981d9145dea. Thanks for continuing to work on this patch! I tested the latest version and have some feedback/clarifications. I did some testing using a demo-app-based-on-a-real-world app I had conjured up[1]. This uses integer sequences as surrogate keys. In general things seemed to work, but I had a couple of observations/questions. 1. Sequence IDs after a "failover". I believe this is a design decision, but I noticed that after simulating a failover, the IDs were replicating from a higher value, e.g. INSERT INTO room (name) VALUES ('room 1'); INSERT INTO room (name) VALUES ('room 2'); INSERT INTO room (name) VALUES ('room 3'); INSERT INTO room (name) VALUES ('room 4'); The values of room_id_seq on each instance: instance 1: last_value | log_cnt | is_called ------------+---------+----------- 4 | 29 | t instance 2: last_value | log_cnt | is_called ------------+---------+----------- 33 | 0 | t After the switchover on instance 2: INSERT INTO room (name) VALUES ('room 5') RETURNING id; id ---- 34 I don't see this as an issue for most applications, but we should at least document the behavior somewhere. 2. Using origin=none with nonconflicting sequences. I modified the example in [1] to set up two schemas with non-conflicting sequences[2], e.g. on instance 1: CREATE TABLE public.room ( id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) PRIMARY KEY, name text NOT NULL ); and instance 2: CREATE TABLE public.room ( id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) PRIMARY KEY, name text NOT NULL ); I ran the following on instance 1: INSERT INTO public.room (name) VALUES ('room 1-e'); This committed and successfully replicated. However, when I ran the following on instance 2, I received a conflict error: INSERT INTO public.room (name) VALUES ('room 1-w'); The conflict came further down the trigger chain, i.e. from a change in the `public.calendar` table: 2023-02-22 01:49:12.293 UTC [87235] ERROR: duplicate key value violates unique constraint "calendar_pkey" 2023-02-22 01:49:12.293 UTC [87235] DETAIL: Key (id)=(661) already exists. After futzing with the logging and restarting, I was also able to reproduce a similar conflict with the same insert pattern into 'room'. I did notice that the sequence values kept bouncing around between the servers. Without any activity, this is what "SELECT * FROM room_id_seq" would return with queries run ~4s apart: last_value | log_cnt | is_called ------------+---------+----------- 131 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 65 | 0 | t The values varied more on "calendar". Again, this is under no additional write activity, these numbers kept fluctuating: last_value | log_cnt | is_called ------------+---------+----------- 197 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 461 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 263 | 0 | t last_value | log_cnt | is_called ------------+---------+----------- 527 | 0 | t To handle this case for now, I adapted the schema to create sequences that were clearly independently named[3].
I did learn that I had to create sequences on both instances to support this behavior, e.g.: -- instance 1 CREATE SEQUENCE public.room_id_1_seq AS int INCREMENT BY 2 START WITH 1; CREATE SEQUENCE public.room_id_2_seq AS int INCREMENT BY 2 START WITH 2; CREATE TABLE public.room ( id int DEFAULT nextval('room_id_1_seq') PRIMARY KEY, name text NOT NULL ); -- instance 2 CREATE SEQUENCE public.room_id_1_seq AS int INCREMENT BY 2 START WITH 1; CREATE SEQUENCE public.room_id_2_seq AS int INCREMENT BY 2 START WITH 2; CREATE TABLE public.room ( id int DEFAULT nextval('room_id_2_seq') PRIMARY KEY, name text NOT NULL ); After building out [3] this did work, but it was more tedious. Is it possible to support IDENTITY columns (or serial columns) where the values of the sequence are set to different intervals on the publisher/subscriber? Thanks, Jonathan [1] https://github.com/CrunchyData/postgres-realtime-demo/blob/main/examples/demo/demo1.sql [2] https://gist.github.com/jkatz/5c34bf1e401b3376dfe8e627fcd30af3 [3] https://gist.github.com/jkatz/1599e467d55abec88ab487d8ac9dc7c3
Attachment
On 2/22/23 03:28, Jonathan S. Katz wrote: > Hi, > > On 2/16/23 10:50 AM, Tomas Vondra wrote: >> Hi, >> >> Here's a rebased patch, without the last bit which is now unnecessary >> thanks to c981d9145dea. > > Thanks for continuing to work on this patch! I tested the latest version > and have some feedback/clarifications. > Thanks! > I did some testing using a demo-app-based-on-a-real-world app I had > conjured up[1]. This uses integer sequences as surrogate keys. > > In general things seemed to work, but I had a couple of > observations/questions. > > 1. Sequence IDs after a "failover". I believe this is a design decision, > but I noticed that after simulating a failover, the IDs were replicating > from a higher value, e.g. > > INSERT INTO room (name) VALUES ('room 1'); > INSERT INTO room (name) VALUES ('room 2'); > INSERT INTO room (name) VALUES ('room 3'); > INSERT INTO room (name) VALUES ('room 4'); > > The values of room_id_seq on each instance: > > instance 1: > > last_value | log_cnt | is_called > ------------+---------+----------- > 4 | 29 | t > > instance 2: > > last_value | log_cnt | is_called > ------------+---------+----------- > 33 | 0 | t > > After the switchover on instance 2: > > INSERT INTO room (name) VALUES ('room 5') RETURNING id; > > id > ---- > 34 > > I don't see this as an issue for most applications, but we should at > least document the behavior somewhere. > Yes, this is due to how we WAL-log sequences. We don't log individual increments, but every 32nd increment and we log the "future" sequence state so that after a crash/recovery we don't generate duplicates. So you do nextval() and it returns 1. But into WAL we record 32. And there will be no WAL records until nextval reaches 32 and needs to generate another batch. And because logical replication relies on these WAL records, it inherits this batching behavior with a "jump" on recovery/failover. IMHO it's OK, it works for the "logical failover" use case and if you need gapless sequences then regular sequences are not an issue anyway. It's possible to reduce the jump a bit by reducing the batch size (from 32 to 0) so that every increment is logged. But it doesn't eliminate it because of rollbacks. > 2. Using with origin=none with nonconflicting sequences. > > I modified the example in [1] to set up two schemas with non-conflicting > sequences[2], e.g. on instance 1: > > CREATE TABLE public.room ( > id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) > PRIMARY KEY, > name text NOT NULL > ); > > and instance 2: > > CREATE TABLE public.room ( > id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) > PRIMARY KEY, > name text NOT NULL > ); > Well, yeah. We don't support active-active logical replication (at least not with the built-in). You can easily get into similar issues without sequences. Replicating a sequence overwrites the state of the sequence on the other side, which may result in it generating duplicate values with the other node, etc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
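To spell out the batching with a concrete sketch (the exact numbers depend on the sequence's cache setting and on rollbacks, so take it as an illustration only):

-- on the publisher
CREATE SEQUENCE s;
SELECT nextval('s');    -- returns 1; WAL records the prefetched state
                        -- 32 values ahead (last_value = 33 in the test
                        -- above)
-- the following ~31 nextval() calls generate no sequence WAL at all

-- on the subscriber, after a failover
SELECT nextval('s');    -- continues from the logged state (34 in the
                        -- test above), not from 2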
On 2/22/23 5:02 AM, Tomas Vondra wrote: > > On 2/22/23 03:28, Jonathan S. Katz wrote: > >>> Thanks for continuing to work on this patch! I tested the latest version >>> and have some feedback/clarifications. >>> >> >> Thanks! Also I should mention I've been testing with both async/sync logical replication. I didn't have any specific comments on either as it seemed to just work and behaviors aligned with existing expectations. Generally it's been a good experience and it seems to be working. :) At this point I'm trying to understand the limitations and tripwires so we can guide users appropriately. > Yes, this is due to how we WAL-log sequences. We don't log individual > increments, but every 32nd increment and we log the "future" sequence > state so that after a crash/recovery we don't generate duplicates. > > So you do nextval() and it returns 1. But into WAL we record 32. And > there will be no WAL records until nextval reaches 32 and needs to > generate another batch. > > And because logical replication relies on these WAL records, it inherits > this batching behavior with a "jump" on recovery/failover. IMHO it's OK, > it works for the "logical failover" use case and if you need gapless > sequences then regular sequences are not an issue anyway. > > It's possible to reduce the jump a bit by reducing the batch size (from > 32 to 0) so that every increment is logged. But it doesn't eliminate it > because of rollbacks. I generally agree. I think it's mainly something we should capture in the user docs that there can be a jump on the subscriber side, so people are not surprised. Interestingly, in systems that tend to have higher rates of failover (I'm thinking of a few distributed systems), this may cause int4 sequences to exhaust numbers slightly (marginally?) more quickly. Likely not too big of an issue, but something to keep in mind. >> 2. Using with origin=none with nonconflicting sequences. >> >> I modified the example in [1] to set up two schemas with non-conflicting >> sequences[2], e.g. on instance 1: >> >> CREATE TABLE public.room ( >> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) >> PRIMARY KEY, >> name text NOT NULL >> ); >> >> and instance 2: >> >> CREATE TABLE public.room ( >> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) >> PRIMARY KEY, >> name text NOT NULL >> ); >> > > Well, yeah. We don't support active-active logical replication (at least > not with the built-in). You can easily get into similar issues without > sequences. The "origin=none" feature lets you replicate tables bidirectionally. While it's not full "active-active", this is a starting point and a feature for v16. We'll definitely have users replicating data bidirectionally with this. > Replicating a sequence overwrites the state of the sequence on the other > side, which may result in it generating duplicate values with the other > node, etc. I understand that we don't currently support global sequences, but I am concerned there may be a tripwire here in the origin=none case given it's fairly common to use serial/GENERATED BY to set primary keys. And it's fairly trivial to set them to be nonconflicting, or at least give the user the appearance that they are nonconflicting. From my high-level understanding of how sequences work, this sounds like it would be a lift to support the example in [1]. Or maybe the answer is that you can bidirectionally replicate the changes in the tables, but not sequences?
In any case, we should update the restrictions in [2] to state: while sequences can be replicated, there is additional work required if you are bidirectionally replicating tables that use sequences, esp. if used in a PK or a constraint. We can provide alternatives to how a user could set that up, i.e. not replicating the sequences or do something like in [3]. Thanks, Jonathan [1] https://gist.github.com/jkatz/5c34bf1e401b3376dfe8e627fcd30af3 [2] https://www.postgresql.org/docs/devel/logical-replication-restrictions.html [3] https://gist.github.com/jkatz/1599e467d55abec88ab487d8ac9dc7c3
Attachment
On 2/22/23 18:04, Jonathan S. Katz wrote: > On 2/22/23 5:02 AM, Tomas Vondra wrote: >> >> On 2/22/23 03:28, Jonathan S. Katz wrote: > >>> Thanks for continuing to work on this patch! I tested the latest version >>> and have some feedback/clarifications. >>> >> >> Thanks! > > Also I should mention I've been testing with both async/sync logical > replication. I didn't have any specific comments on either as it seemed > to just work and behaviors aligned with existing expectations. > > Generally it's been a good experience and it seems to be working. :) At > this point I'm trying to understand the limitations and tripwires so we > can guide users appropriately. > Good to hear. >> Yes, this is due to how we WAL-log sequences. We don't log individual >> increments, but every 32nd increment and we log the "future" sequence >> state so that after a crash/recovery we don't generate duplicates. >> >> So you do nextval() and it returns 1. But into WAL we record 32. And >> there will be no WAL records until nextval reaches 32 and needs to >> generate another batch. >> >> And because logical replication relies on these WAL records, it inherits >> this batching behavior with a "jump" on recovery/failover. IMHO it's OK, >> it works for the "logical failover" use case and if you need gapless >> sequences then regular sequences are not an issue anyway. >> >> It's possible to reduce the jump a bit by reducing the batch size (from >> 32 to 0) so that every increment is logged. But it doesn't eliminate it >> because of rollbacks. > > I generally agree. I think it's mainly something we should capture in > the user docs that they can be a jump on the subscriber side, so people > are not surprised. > > Interestingly, in systems that tend to have higher rates of failover > (I'm thinking of a few distributed systems), this may cause int4 > sequences to exhaust numbers slightly (marginally?) more quickly. Likely > not too big of an issue, but something to keep in mind. > IMHO the number of systems that would work fine with int4 sequences but where this change results in the sequences being "exhausted" too quickly is indistinguishable from 0. I don't think this is an issue. >>> 2. Using with origin=none with nonconflicting sequences. >>> >>> I modified the example in [1] to set up two schemas with non-conflicting >>> sequences[2], e.g. on instance 1: >>> >>> CREATE TABLE public.room ( >>> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) >>> PRIMARY KEY, >>> name text NOT NULL >>> ); >>> >>> and instance 2: >>> >>> CREATE TABLE public.room ( >>> id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) >>> PRIMARY KEY, >>> name text NOT NULL >>> ); >>> >> >> Well, yeah. We don't support active-active logical replication (at least >> not with the built-in). You can easily get into similar issues without >> sequences. > > The "origin=none" feature lets you replicate tables bidirectionally. > While it's not full "active-active", this is a starting point and a > feature for v16. We'll definitely have users replicating data > bidirectionally with this. > Well, then the users need to use some other way to generate IDs, not local sequences. Either some sort of distributed/global sequence, UUIDs or something like that. >> Replicating a sequence overwrites the state of the sequence on the other >> side, which may result in it generating duplicate values with the other >> node, etc.
> > I understand that we don't currently support global sequences, but I am > concerned there may be a tripwire here in the origin=none case given > it's fairly common to use serial/GENERATED BY to set primary keys. And > it's fairly trivial to set them to be nonconflicting, or at least give > the user the appearance that they are nonconflicting. > > From my high level understand of how sequences work, this sounds like it > would be a lift to support the example in [1]. Or maybe the answer is > that you can bidirectionally replicate the changes in the tables, but > not sequences? > Yes, local sequences don't and can't work in such setups. > In any case, we should update the restrictions in [2] to state: while > sequences can be replicated, there is additional work required if you > are bidirectionally replicating tables that use sequences, esp. if used > in a PK or a constraint. We can provide alternatives to how a user could > set that up, i.e. not replicates the sequences or do something like in [3]. > I agree. I see this as mostly a documentation issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
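For the docs, the simplest alternative to show is probably a key type that needs no cross-node coordination at all, e.g. (just one option, of course):

CREATE TABLE room (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name text NOT NULL
);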
On 2/23/23 7:56 AM, Tomas Vondra wrote: > On 2/22/23 18:04, Jonathan S. Katz wrote: >> On 2/22/23 5:02 AM, Tomas Vondra wrote: >>> >> Interestingly, in systems that tend to have higher rates of failover >> (I'm thinking of a few distributed systems), this may cause int4 >> sequences to exhaust numbers slightly (marginally?) more quickly. Likely >> not too big of an issue, but something to keep in mind. >> > > IMHO the number of systems that would work fine with int4 sequences but > this change results in the sequences being "exhausted" too quickly is > indistinguishable from 0. I don't think this is an issue. I agree it's an edge case. I do think it's a number greater than 0, having seen some incredibly flaky setups, particularly in distributed systems. I would not worry about it, but only mentioned it to try and probe edge cases. >>> Well, yeah. We don't support active-active logical replication (at least >>> not with the built-in). You can easily get into similar issues without >>> sequences. >> >> The "origin=none" feature lets you replicate tables bidirectionally. >> While it's not full "active-active", this is a starting point and a >> feature for v16. We'll definitely have users replicating data >> bidirectionally with this. >> > > Well, then the users need to use some other way to generate IDs, not > local sequences. Either some sort of distributed/global sequence, UUIDs > or something like that. [snip] >> In any case, we should update the restrictions in [2] to state: while >> sequences can be replicated, there is additional work required if you >> are bidirectionally replicating tables that use sequences, esp. if used >> in a PK or a constraint. We can provide alternatives to how a user could >> set that up, i.e. not replicates the sequences or do something like in [3]. >> > > I agree. I see this as mostly a documentation issue. Great. I agree that users need other mechanisms to generate IDs, but we should ensure we document that. If needed, I'm happy to help with the docs here. Thanks, Jonathan
Attachment
Hi, here's a rebased patch to make cfbot happy, dropping the first part that is now unnecessary thanks to 7fe1aa991b. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Wed, Mar 1, 2023 at 1:02 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> here's a rebased patch to make cfbot happy, dropping the first part that
> is now unnecessary thanks to 7fe1aa991b.
Hi Tomas,
I'm looking into doing some "in situ" testing, but for now I'll mention some minor nits I found:
0001
+ * so we simply do a lookup (the sequence is identified by relfilende). If
relfilenode? Or should it be called a relfilelocator, which is the parameter type? I see some other references to relfilenode in comments and commit message, and I'm not sure which need to be updated.
+ /* XXX Maybe check that we're still in the same top-level xact? */
Any ideas on what should happen here?
+ /* XXX how could we have sequence change without data? */
+ if(!datalen || !tupledata)
+ elog(ERROR, "sequence decode missing tuple data");
Since the ERROR is new based on feedback, we can get rid of XXX I think.
More generally, I associate XXX comments to highlight problems or unpleasantness in the code that don't quite rise to the level of FIXME, but are perhaps more serious than "NB:", "Note:", or "Important:"
+ * When we're called via the SQL SRF there's already a transaction
I see this was copied from existing code, but I found it confusing -- does this function have a stable name?
+ /* Only ever called from ReorderBufferApplySequence, so transational. */
Typo: transactional
0002
I see a few SERIAL types in the tests but no GENERATED ... AS IDENTITY -- not sure if it matters, but seems good for completeness.
Reminder for later: Patches 0002 and 0003 still refer to 0da92dc530, which is a reverted commit -- I assume it intends to refer to the content of 0001?
--
John Naylor
EDB: http://www.enterprisedb.com
I tried a couple toy examples with various combinations of use styles.
Three with "automatic" reading from sequences:
create table test(i serial);
create table test(i int GENERATED BY DEFAULT AS IDENTITY);
create table test(i int default nextval('s1'));
...where s1 has some non-default parameters:
CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
...and then two with explicit use of s1, one inserting the 'nextval' into a table with no default, and one with no table at all, just selecting from the sequence.
The last two seem to work similarly to the first three, so it seems like FOR ALL TABLES adds all sequences as well. Is that expected? The documentation for CREATE PUBLICATION mentions sequence options, but doesn't really say how these options should be used.
Here's the script:
# alter system set wal_level='logical';
# restart
# port 7777 is subscriber
echo
echo "PUB:"
psql -c "drop sequence if exists s1;"
psql -c "drop publication if exists pub1;"
echo
echo "SUB:"
psql -p 7777 -c "drop sequence if exists s1;"
psql -p 7777 -c "drop subscription if exists sub1 ;"
echo
echo "PUB:"
psql -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"
echo
echo "SUB:"
psql -p 7777 -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -p 7777 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost dbname=john application_name=sub1 port=5432' PUBLICATION pub1;"
echo
echo "PUB:"
psql -c "select nextval('s1');"
psql -c "select nextval('s1');"
psql -c "select * from s1;"
sleep 1
echo
echo "SUB:"
psql -p 7777 -c "select * from s1;"
psql -p 7777 -c "drop subscription sub1 ;"
psql -p 7777 -c "select nextval('s1');"
psql -p 7777 -c "select * from s1;"
...with the last two queries returning
nextval
---------
67
(1 row)
last_value | log_cnt | is_called
------------+---------+-----------
67 | 32 | t
So, I interpret that the decrement by 32 got logged here.
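(For the record, the numbers are consistent with logging 32 values ahead: the first nextval(), returning 100, logged the state 32 decrements ahead, i.e. last_value = 100 - 32 = 68; the subscriber applied that, so its first local nextval() returned 68 - 1 = 67, and that call in turn logged 32 values ahead again, hence the fresh log_cnt of 32.)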
Also, running
CREATE PUBLICATION pub2 FOR ALL SEQUENCES WITH (publish = 'insert, update, delete, truncate, sequence');
...reports success, but do non-default values of "publish = ..." have an effect (or should they), or are these just ignored? It seems like these cases shouldn't be treated orthogonally.
--
John Naylor
EDB: http://www.enterprisedb.com
On 3/10/23 11:03, John Naylor wrote: > > On Wed, Mar 1, 2023 at 1:02 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> > wrote: >> here's a rebased patch to make cfbot happy, dropping the first part that >> is now unnecessary thanks to 7fe1aa991b. > > Hi Tomas, > > I'm looking into doing some "in situ" testing, but for now I'll mention > some minor nits I found: > > 0001 > > + * so we simply do a lookup (the sequence is identified by relfilende). If > > relfilenode? Or should it be called a relfilelocator, which is the > parameter type? I see some other references to relfilenode in comments > and commit message, and I'm not sure which need to be updated. > Yeah, that's a leftover from the original patch, before the relfilenode was renamed to relfilelocator. > + /* XXX Maybe check that we're still in the same top-level xact? */ > > Any ideas on what should happen here? > I don't recall why I added this comment, but I don't think there's anything we need to do (so drop the comment). > + /* XXX how could we have sequence change without data? */ > + if(!datalen || !tupledata) > + elog(ERROR, "sequence decode missing tuple data"); > > Since the ERROR is new based on feedback, we can get rid of XXX I think. > > More generally, I associate XXX comments to highlight problems or > unpleasantness in the code that don't quite rise to the level of FIXME, > but are perhaps more serious than "NB:", "Note:", or "Important:" > Understood. I keep adding XXX in places where I have some open questions, or something that may need to be improved (so kinda less serious than a FIXME). > + * When we're called via the SQL SRF there's already a transaction > > I see this was copied from existing code, but I found it confusing -- > does this function have a stable name? > What do you mean by "stable name"? It certainly is not exposed as a user-callable SQL function, so I think this comment is misleading and should be removed. > + /* Only ever called from ReorderBufferApplySequence, so transational. */ > > Typo: transactional > > 0002 > > I see a few SERIAL types in the tests but no GENERATED ... AS IDENTITY > -- not sure if it matters, but seems good for completeness. > That's a good point. Adding tests for GENERATED ... AS IDENTITY is a good idea. > Reminder for later: Patches 0002 and 0003 still refer to 0da92dc530, > which is a reverted commit -- I assume it intends to refer to the > content of 0001? > Correct. That needs to be adjusted at commit time. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
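A minimal test case along those lines might look like this (a sketch mirroring the existing SERIAL tests):

CREATE TABLE test_identity (
    id int GENERATED ALWAYS AS IDENTITY,
    val text
);
INSERT INTO test_identity (val) VALUES ('a'), ('b');
-- the implicit sequence test_identity_id_seq should decode and replicate
-- exactly like an explicitly created sequence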
On 3/14/23 08:30, John Naylor wrote:
> I tried a couple toy examples with various combinations of use styles.
>
> Three with "automatic" reading from sequences:
>
> create table test(i serial);
> create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> create table test(i int default nextval('s1'));
>
> ...where s1 has some non-default parameters:
>
> CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
>
> ...and then two with explicit use of s1, one inserting the 'nextval'
> into a table with no default, and one with no table at all, just
> selecting from the sequence.
>
> The last two seem to work similarly to the first three, so it seems like
> FOR ALL TABLES adds all sequences as well. Is that expected?

Yeah, that's a bug - we shouldn't replicate the sequence changes, unless the sequence is actually added to the publication. I tracked this down to a thinko in get_rel_sync_entry() which failed to check the object type when puballtables or puballsequences was set.

Attached is a patch fixing this.

> The documentation for CREATE PUBLICATION mentions sequence options,
> but doesn't really say how these options should be used.

Good point. The idea is that we handle tables and sequences the same way, i.e. if you specify 'sequence' then we'll replicate increments for sequences explicitly added to the publication.

If this is not clear, the docs may need some improvements.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
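To make the intended behavior after this fix concrete, here is a minimal sketch (object names hypothetical) of how the two publication kinds are meant to behave once get_rel_sync_entry() checks the object type:

-- On the publisher:
CREATE SEQUENCE s1;
CREATE TABLE t1 (a int);

-- Publishes changes to t1; after the fix it must NOT publish
-- increments of s1 (that was the reported bug).
CREATE PUBLICATION pub_tables FOR ALL TABLES;

-- Publishes increments of s1 (and any other sequence).
CREATE PUBLICATION pub_seqs FOR ALL SEQUENCES;

-- This increment should be decoded and sent only to subscribers
-- of pub_seqs, not to subscribers of pub_tables.
SELECT nextval('s1');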
Hi,

On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/14/23 08:30, John Naylor wrote:
> > I tried a couple toy examples with various combinations of use styles.
> >
> > ...
> >
> > The last two seem to work similarly to the first three, so it seems like
> > FOR ALL TABLES adds all sequences as well. Is that expected?
>
> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> the sequence is actually added to the publication. I tracked this down
> to a thinko in get_rel_sync_entry() which failed to check the object
> type when puballtables or puballsequences was set.
>
> Attached is a patch fixing this.
>
> > The documentation for CREATE PUBLICATION mentions sequence options,
> > but doesn't really say how these options should be used.
>
> Good point. The idea is that we handle tables and sequences the same
> way, i.e. if you specify 'sequence' then we'll replicate increments for
> sequences explicitly added to the publication.
>
> If this is not clear, the docs may need some improvements.

I'm late to this thread, but I have some questions and review comments.

Regarding sequence logical replication, it seems that changes of a sequence created after CREATE SUBSCRIPTION are applied on the subscriber even without a REFRESH PUBLICATION command on the subscriber, which is a different behavior than tables. For example, I set up both publisher and subscriber as follows:

1. On publisher
create publication test_pub for all sequences;

2. On subscriber
create subscription test_sub connection 'dbname=postgres port=5551' publication test_pub; -- port=5551 is the publisher

3. On publisher
create sequence s1;
select nextval('s1');

I got the error "ERROR: relation "public.s1" does not exist" on the subscriber. Probably we need to do the should_apply_changes_for_rel() check in apply_handle_sequence().

If my understanding is correct, is there any case where the subscriber needs to apply transactional sequence changes? The commit message of the 0001 patch says:

* Changes for sequences created in the same top-level transaction are treated as transactional, i.e. just like any other change from that transaction, and discarded in case of a rollback.

IIUC such sequences are not visible to the subscriber, so it cannot subscribe to them until the commit.

---
I got an assertion failure. The reproducible steps are:

1. On publisher
alter system set logical_replication_mode = 'immediate';
select pg_reload_conf();
create publication test_pub for all sequences;

2. On subscriber
create subscription test_sub connection 'dbname=postgres port=5551' publication test_pub with (streaming='parallel')

3. On publisher
begin;
create table bar (c int, d serial);
insert into bar(c) values (100);
commit;

I got the following assertion failure:

TRAP: failed Assert("(!seq.transactional) || in_remote_transaction"), File: "worker.c", Line: 1458, PID: 508056
postgres: logical replication parallel apply worker for subscription 16388 (ExceptionalCondition+0x9e)[0xb6c0af]
postgres: logical replication parallel apply worker for subscription 16388 [0x92f7fe]
postgres: logical replication parallel apply worker for subscription 16388 (apply_dispatch+0xed)[0x932925]
postgres: logical replication parallel apply worker for subscription 16388 [0x90d927]
postgres: logical replication parallel apply worker for subscription 16388 (ParallelApplyWorkerMain+0x34f)[0x90dd8d]
postgres: logical replication parallel apply worker for subscription 16388 (StartBackgroundWorker+0x1f3)[0x8e7b19]
postgres: logical replication parallel apply worker for subscription 16388 [0x8f1798]
postgres: logical replication parallel apply worker for subscription 16388 [0x8f1b53]
postgres: logical replication parallel apply worker for subscription 16388 [0x8f0bed]
postgres: logical replication parallel apply worker for subscription 16388 [0x8ecca4]
postgres: logical replication parallel apply worker for subscription 16388 (PostmasterMain+0x1246)[0x8ec6d7]
postgres: logical replication parallel apply worker for subscription 16388 [0x7bbe5c]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f69094cbcf3]
postgres: logical replication parallel apply worker for subscription 16388 (_start+0x2e)[0x49d15e]
2023-03-16 12:33:19.471 JST [507974] LOG: background worker "logical replication parallel worker" (PID 508056) was terminated by signal 6: Aborted

seq.transactional is true and in_remote_transaction is false. It might be an issue of the parallel apply feature rather than this patch.

---
There is no documentation about the new 'sequence' value of the publish option in CREATE/ALTER PUBLICATION. It seems to be possible to specify something like "CREATE PUBLICATION ... FOR ALL SEQUENCES WITH (publish = 'truncate')" (i.e., not specifying the 'sequence' value in the publish option). How does logical replication work with this setting? Nothing is replicated?

---
It seems that sequence replication doesn't work well together with the ALTER SUBSCRIPTION ... SKIP command. IIUC these changes are not skipped even if they are transactional changes. The reproducible steps are:

1. On both nodes
create table a (c int primary key);

2. On publisher
create publication hoge_pub for all sequences, tables

3. On subscriber
create subscription hoge_sub connection 'dbname=postgres port=5551' publication hoge_pub;
insert into a values (1);

4. On publisher
begin;
create sequence s2;
insert into a values (nextval('s2'));
commit;

At step 4, applying the INSERT conflicts with the existing row on the subscriber. If I skip this transaction using the ALTER SUBSCRIPTION ... SKIP command, I get:

ERROR: relation "public.s2" does not exist
CONTEXT: processing remote data for replication origin "pg_16390" during message type "BEGIN" in transaction 734, finished at 0/1751698

If I create the sequence s2 in advance on the subscriber, the sequence change is applied on the subscriber. If the subscriber doesn't need to apply transactional sequence changes in the first place, this problem will disappear.

---
There are two typos in the 0001 patch:

In the commit message:

ensure the sequence record has a valid XID - until now the the increment

s/the the/the/

And,

+ /* Only ever called from ReorderBufferApplySequence, so transational. */

s/transational/transactional/

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
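For reference, the skip attempt above would use the stock subscriber-side command, with the LSN taken from the CONTEXT line in the error (subscription name as in the repro); the report is that the relation-does-not-exist error fires before the skip logic can discard the sequence change:

ALTER SUBSCRIPTION hoge_sub SKIP (lsn = '0/1751698');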
On Thu, Mar 16, 2023 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 3/14/23 08:30, John Naylor wrote:
>
> ---
> I got an assertion failure. The reproducible steps are:
>
> 1. On publisher
> alter system set logical_replication_mode = 'immediate';
> select pg_reload_conf();
> create publication test_pub for all sequences;
>
> 2. On subscriber
> create subscription test_sub connection 'dbname=postgres port=5551'
> publication test_pub with (streaming='parallel')
>
> 3. On publisher
> begin;
> create table bar (c int, d serial);
> insert into bar(c) values (100);
> commit;
>
> I got the following assertion failure:
>
> TRAP: failed Assert("(!seq.transactional) || in_remote_transaction"), ...
>
> seq.transactional is true and in_remote_transaction is false. It might
> be an issue of the parallel apply feature rather than this patch.

During parallel apply we didn't need to rely on in_remote_transaction, so it was not set. I haven't checked the patch in detail but am wondering, isn't it sufficient to instead check IsTransactionState() and/or IsTransactionOrTransactionBlock()?

--
With Regards,
Amit Kapila.
Hi!

On 3/16/23 08:38, Masahiko Sawada wrote:
> Hi,
>
> On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/14/23 08:30, John Naylor wrote:
>>> ...
>>>
>>> The last two seem to work similarly to the first three, so it seems like
>>> FOR ALL TABLES adds all sequences as well. Is that expected?
>>
>> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
>> the sequence is actually added to the publication. ...
>
> I'm late to this thread, but I have some questions and review comments.
>
> Regarding sequence logical replication, it seems that changes of
> sequence created after CREATE SUBSCRIPTION are applied on the
> subscriber even without REFRESH PUBLICATION command on the subscriber.
> Which is a different behavior than tables. For example, I set both
> publisher and subscriber as follows:
>
> 1. On publisher
> create publication test_pub for all sequences;
>
> 2. On subscriber
> create subscription test_sub connection 'dbname=postgres port=5551'
> publication test_pub; -- port=5551 is the publisher
>
> 3. On publisher
> create sequence s1;
> select nextval('s1');
>
> I got the error "ERROR: relation "public.s1" does not exist" on the
> subscriber. Probably we need to do should_apply_changes_for_rel()
> check in apply_handle_sequence().

Yes, you're right - the sequence handling should have been calling the should_apply_changes_for_rel() etc.

The attached 0005 patch should fix that - I still need to test it a bit more and maybe clean it up a bit, but hopefully it'll allow you to continue the review.

I had to tweak the protocol a bit, so that this uses the same cache as tables. I wonder if maybe we should make it even more similar, by essentially treating sequences as tables with (last_value, log_cnt, called) columns.

> If my understanding is correct, is there any case where the subscriber
> needs to apply transactional sequence changes? The commit message of
> 0001 patch says:
>
> * Changes for sequences created in the same top-level transaction are
> treated as transactional, i.e. just like any other change from that
> transaction, and discarded in case of a rollback.
>
> IIUC such sequences are not visible to the subscriber, so it cannot
> subscribe to them until the commit.

The comment is slightly misleading, as it talks about creation of sequences, but it should be talking about relfilenodes. For example, if you create a sequence, add it to publication, and then in a later transaction you do

ALTER SEQUENCE x RESTART

or something else that creates a new relfilenode, then the subsequent increments are visible only in that transaction. But we still need to apply those on the subscriber, but only as part of the transaction, because it might roll back.

> ---
> I got an assertion failure. The reproducible steps are:

I do believe this was due to a thinko in apply_handle_sequence, which sometimes started a transaction and didn't terminate it correctly. I've changed it to use the begin_replication_step() etc. and it seems to be working fine now.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230316.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230316.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230316.patch
- 0004-puballtables-fixup-20230316.patch
- 0005-fixup-syncing-refresh-sequences-20230316.patch
- 0006-john-s-review-20230316.patch
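A small sketch of the transactional case described above (sequence name hypothetical): increments made after an in-transaction ALTER land in the new relfilenode, so they exist only inside that transaction and must be applied, or discarded, together with it:

CREATE SEQUENCE s;
SELECT nextval('s');       -- non-transactional increment, replicated as such

BEGIN;
ALTER SEQUENCE s RESTART;  -- creates a new relfilenode, visible only here
SELECT nextval('s');       -- this increment hits the new relfilenode
ROLLBACK;                  -- the relfilenode and its increments are discarded

-- After ROLLBACK the sequence continues from its pre-transaction state,
-- so the subscriber must not keep the increments from the aborted xact.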
On Thu, 16 Mar 2023 at 21:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi!
>
> On 3/16/23 08:38, Masahiko Sawada wrote:
> > ...
>
> Yes, you're right - the sequence handling should have been calling the
> should_apply_changes_for_rel() etc.
>
> The attached 0005 patch should fix that - I still need to test it a bit
> more and maybe clean it up a bit, but hopefully it'll allow you to
> continue the review.
>
> ...
>
> I do believe this was due to a thinko in apply_handle_sequence, which
> sometimes started a transaction and didn't terminate it correctly. I've
> changed it to use the begin_replication_step() etc. and it seems to be
> working fine now.

One of the patches does not apply on HEAD because of a recent commit; we might have to rebase the patch:

git am 0005-fixup-syncing-refresh-sequences-20230316.patch
Applying: fixup syncing/refresh sequences
error: patch failed: src/backend/replication/pgoutput/pgoutput.c:711
error: src/backend/replication/pgoutput/pgoutput.c: patch does not apply
Patch failed at 0001 fixup syncing/refresh sequences

Regards,
Vignesh
On Wed, Mar 15, 2023 at 7:51 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 3/14/23 08:30, John Naylor wrote:
> > I tried a couple toy examples with various combinations of use styles.
> >
> > Three with "automatic" reading from sequences:
> >
> > create table test(i serial);
> > create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> > create table test(i int default nextval('s1'));
> >
> > ...where s1 has some non-default parameters:
> >
> > CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
> >
> > ...and then two with explicit use of s1, one inserting the 'nextval'
> > into a table with no default, and one with no table at all, just
> > selecting from the sequence.
> >
> > The last two seem to work similarly to the first three, so it seems like
> > FOR ALL TABLES adds all sequences as well. Is that expected?
>
> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> the sequence is actually added to the publication. I tracked this down
> to a thinko in get_rel_sync_entry() which failed to check the object
> type when puballtables or puballsequences was set.
>
> Attached is a patch fixing this.
Okay, I can verify that with 0001-0006, sequences don't replicate unless specified. I do see an additional change that doesn't make sense: On the subscriber I no longer see a jump to the logged 32 increment, I see the very next value:
# alter system set wal_level='logical';
# port 7777 is subscriber
echo
echo "PUB:"
psql -c "drop table if exists test;"
psql -c "drop publication if exists pub1;"
echo
echo "SUB:"
psql -p 7777 -c "drop table if exists test;"
psql -p 7777 -c "drop subscription if exists sub1 ;"
echo
echo "PUB:"
psql -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"
psql -c "CREATE PUBLICATION pub2 FOR ALL SEQUENCES;"
echo
echo "SUB:"
psql -p 7777 -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
psql -p 7777 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost dbname=postgres application_name=sub1 port=5432' PUBLICATION pub1;"
psql -p 7777 -c "CREATE SUBSCRIPTION sub2 CONNECTION 'host=localhost dbname=postgres application_name=sub2 port=5432' PUBLICATION pub2;"
echo
echo "PUB:"
psql -c "insert into test default values;"
psql -c "insert into test default values;"
psql -c "select * from test;"
psql -c "select * from test_i_seq;"
sleep 1
echo
echo "SUB:"
psql -p 7777 -c "select * from test;"
psql -p 7777 -c "select * from test_i_seq;"
psql -p 7777 -c "drop subscription sub1 ;"
psql -p 7777 -c "drop subscription sub2 ;"
psql -p 7777 -c "insert into test default values;"
psql -p 7777 -c "select * from test;"
psql -p 7777 -c "select * from test_i_seq;"
The last two queries on the subscriber show:
i
---
1
2
3
(3 rows)
last_value | log_cnt | is_called
------------+---------+-----------
3 | 30 | t
(1 row)
...whereas before with 0001-0003 I saw:
i
----
1
2
34
(3 rows)
last_value | log_cnt | is_called
------------+---------+-----------
34 | 32 | t
> > The documentation for CREATE PUBLICATION mentions sequence options,
> > but doesn't really say how these options should be used.
> Good point. The idea is that we handle tables and sequences the same
> way, i.e. if you specify 'sequence' then we'll replicate increments for
> sequences explicitly added to the publication.
>
> If this is not clear, the docs may need some improvements.
Aside from docs, I'm not clear what some of the tests are doing:
+CREATE PUBLICATION testpub_forallsequences FOR ALL SEQUENCES WITH (publish = 'sequence');
+RESET client_min_messages;
+ALTER PUBLICATION testpub_forallsequences SET (publish = 'insert, sequence');
What does it mean to add 'insert' to a sequence publication?
Likewise, from a brief change in my test above, 'sequence' seems to be a noise word for table publications. I'm not fully read up on the background of this topic, but wanted to make sure I understood the design of the syntax.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 15, 2023 at 7:00 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/10/23 11:03, John Naylor wrote:
> > + * When we're called via the SQL SRF there's already a transaction
> >
> > I see this was copied from existing code, but I found it confusing --
> > does this function have a stable name?
>
> What do you mean by "stable name"? It certainly is not exposed as a
> user-callable SQL function, so I think this comment is misleading and
> should be removed.
Okay, I was just trying to think of why it was phrased this way...
On Thu, 16 Mar 2023 at 21:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi!
>
> On 3/16/23 08:38, Masahiko Sawada wrote:
> > ...
>
> ...
>
> I do believe this was due to a thinko in apply_handle_sequence, which
> sometimes started a transaction and didn't terminate it correctly. I've
> changed it to use the begin_replication_step() etc. and it seems to be
> working fine now.

Few comments:

1) One of the tests is failing for me; I had also seen the same failure in CFBOT at [1]:

# Failed test 'create sequence, advance it in rolled-back transaction, but commit the create'
# at t/030_sequences.pl line 152.
# got: '1|0|f'
# expected: '132|0|t'
t/030_sequences.pl ................. 5/? ?
# Failed test 'advance the new sequence in a transaction and roll it back'
# at t/030_sequences.pl line 175.
# got: '1|0|f'
# expected: '231|0|t'
# Failed test 'advance sequence in a subtransaction'
# at t/030_sequences.pl line 198.
# got: '1|0|f'
# expected: '330|0|t'
# Looks like you failed 3 tests of 6.

2) We could replace the below:

$node_publisher->wait_for_catchup('seq_sub');

# Wait for initial sync to finish as well
my $synced_query = "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('s', 'r');";
$node_subscriber->poll_query_until('postgres', $synced_query)
  or die "Timed out while waiting for subscriber to synchronize data";

with:

$node_subscriber->wait_for_subscription_sync;

3) We could change 030_sequences.pl to 033_sequences.pl as 030 is already used:

diff --git a/src/test/subscription/t/030_sequences.pl b/src/test/subscription/t/030_sequences.pl
new file mode 100644
index 00000000000..9ae3c03d7d1
--- /dev/null
+++ b/src/test/subscription/t/030_sequences.pl

4) The copyright year should be changed to 2023:

@@ -0,0 +1,202 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# This tests that sequences are replicated correctly by logical replication
+use strict;
+use warnings;

[1] - https://cirrus-ci.com/task/5032679352041472

Regards,
Vignesh
On 3/17/23 06:53, John Naylor wrote:
> On Wed, Mar 15, 2023 at 7:51 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> Attached is a patch fixing this.
>
> Okay, I can verify that with 0001-0006, sequences don't replicate unless
> specified. I do see an additional change that doesn't make sense: On the
> subscriber I no longer see a jump to the logged 32 increment, I see the
> very next value:
>
> ...
>
> The last two queries on the subscriber show:
>
> i
> ---
> 1
> 2
> 3
> (3 rows)
>
> last_value | log_cnt | is_called
> ------------+---------+-----------
> 3 | 30 | t
> (1 row)
>
> ...whereas before with 0001-0003 I saw:
>
> i
> ----
> 1
> 2
> 34
> (3 rows)
>
> last_value | log_cnt | is_called
> ------------+---------+-----------
> 34 | 32 | t

Oh, this is a silly thinko in how sequences are synced at the beginning (or maybe a combination of two issues). fetch_sequence_data() simply runs a select from the sequence

SELECT last_value, log_cnt, is_called

but that's wrong, because that's the *current* state of the sequence, at the moment it's initially synced. To make this "correct" with respect to the decoding, we'd need to deduce what the last WAL record was, so something like

last_value += log_cnt + 1

That should produce 34 again. FWIW the older patch has this issue too, I believe the difference is merely due to a slightly different timing between the sync and decoding the first insert. If you insert a sleep after the CREATE SUBSCRIPTION commands, it should disappear.

This however made me realize the initial sync of sequences may not be correct. I mean, the idea of tablesync is syncing the data in a REPEATABLE READ transaction, and then applying decoded changes. But sequences are not transactional in this way - if you select from a sequence, you'll always see the latest data, even in REPEATABLE READ.

I wonder if this might result in losing some of the sequence increments, and/or applying them in the wrong order (so that the sequence goes backward for a while).

>> > The documentation for CREATE PUBLICATION mentions sequence options,
>> > but doesn't really say how these options should be used.
>> Good point. The idea is that we handle tables and sequences the same
>> way, i.e. if you specify 'sequence' then we'll replicate increments for
>> sequences explicitly added to the publication.
>>
>> If this is not clear, the docs may need some improvements.
>
> Aside from docs, I'm not clear what some of the tests are doing:
>
> +CREATE PUBLICATION testpub_forallsequences FOR ALL SEQUENCES WITH
> (publish = 'sequence');
> +RESET client_min_messages;
> +ALTER PUBLICATION testpub_forallsequences SET (publish = 'insert,
> sequence');
>
> What does it mean to add 'insert' to a sequence publication?

I don't recall why this particular test exists, but you can still add tables to a "for all sequences" publication. IMO it's fine to allow adding actions that are irrelevant for currently published objects, we don't have a cross-check to prevent that (how would you even do that e.g. for FOR ALL TABLES publications?).

> Likewise, from a brief change in my test above, 'sequence' seems to be a
> noise word for table publications. I'm not fully read up on the
> background of this topic, but wanted to make sure I understood the
> design of the syntax.

I think it's fine, for the same reason as above.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
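Per the observation above, the non-MVCC read behavior is easy to see directly (two sessions, sequence name hypothetical); a table read would be isolated here, but the sequence read is not:

-- Session 1
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT last_value FROM s1;     -- say this returns 10

-- Session 2, meanwhile
SELECT nextval('s1') FROM generate_series(1, 100);

-- Session 1, same snapshot
SELECT last_value FROM s1;     -- now ~110: the read bypasses the snapshot
COMMIT;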
On 3/17/23 18:55, Tomas Vondra wrote:
>
> ...
>
> This however made me realize the initial sync of sequences may not be
> correct. I mean, the idea of tablesync is syncing the data in REPEATABLE
> READ transaction, and then applying decoded changes. But sequences are
> not transactional in this way - if you select from a sequence, you'll
> always see the latest data, even in REPEATABLE READ.
>
> I wonder if this might result in losing some of the sequence increments,
> and/or applying them in the wrong order (so that the sequence goes
> backward for a while).

Yeah, I think my suspicion was warranted - it's pretty easy to make the sequence go backwards for a while by adding a sleep between the slot creation and the copy_sequence() call, and incrementing the sequence in between (enough to do some WAL logging).

The copy_sequence() then reads the current on-disk state (because of the non-transactional nature w.r.t. REPEATABLE READ), applies it, and then we start processing the WAL added since the slot creation. But those records are older, so stuff like this happens:

21:52:54.147 CET [35404] WARNING: copy_sequence 1222 0 1
21:52:54.163 CET [35404] WARNING: apply_handle_sequence 990 0 1
21:52:54.163 CET [35404] WARNING: apply_handle_sequence 1023 0 1
21:52:54.163 CET [35404] WARNING: apply_handle_sequence 1056 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1089 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1122 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1155 0 1
21:52:54.174 CET [35404] WARNING: apply_handle_sequence 1188 0 1
21:52:54.175 CET [35404] WARNING: apply_handle_sequence 1221 0 1
21:52:54.898 CET [35402] WARNING: apply_handle_sequence 1254 0 1

Clearly, for sequences we can't quite rely on snapshots/slots, we need to get the LSN to decide what changes to apply/skip from somewhere else. I wonder if we can just ignore the queued changes in tablesync, but I guess not - there can be queued increments after reading the sequence state, and we need to apply those. But maybe we could use the page LSN from the relfilenode - that should be the LSN of the last WAL record.

Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we use to read the sequence state ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
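A sketch of that last idea (hypothetical query; whether the insert LSN is a safe cutoff is exactly what the following messages debate): fetch the state and a cutoff LSN in one round trip, then have the apply side discard older decoded increments:

-- Run by the sync worker on the publisher:
SELECT last_value, log_cnt, is_called,
       pg_current_wal_insert_lsn() AS cutoff_lsn
FROM s1;

-- Apply side: discard decoded changes for s1 with lsn <= cutoff_lsn,
-- apply anything newer.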
On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/17/23 18:55, Tomas Vondra wrote:
> >
> > ...
> >
> > I wonder if this might result in losing some of the sequence increments,
> > and/or applying them in the wrong order (so that the sequence goes
> > backward for a while).
>
> Yeah, I think my suspicion was warranted - it's pretty easy to make the
> sequence go backwards for a while by adding a sleep between the slot
> creation and the copy_sequence() call, and incrementing the sequence in
> between (enough to do some WAL logging).
>
> The copy_sequence() then reads the current on-disk state (because of the
> non-transactional nature w.r.t. REPEATABLE READ), applies it, and then
> we start processing the WAL added since the slot creation. But those
> records are older, so stuff like this happens:
>
> ...
>
> Clearly, for sequences we can't quite rely on snapshots/slots, we need
> to get the LSN to decide what changes to apply/skip from somewhere else.
> I wonder if we can just ignore the queued changes in tablesync, but I
> guess not - there can be queued increments after reading the sequence
> state, and we need to apply those. But maybe we could use the page LSN
> from the relfilenode - that should be the LSN of the last WAL record.
>
> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
> use to read the sequence state ...

What if some Alter Sequence is performed before the copy starts, and the containing transaction is rolled back after the copy is finished? Won't it copy something which shouldn't have been copied?

--
With Regards,
Amit Kapila.
On 3/18/23 06:35, Amit Kapila wrote:
> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> Clearly, for sequences we can't quite rely on snapshots/slots, we need
>> to get the LSN to decide what changes to apply/skip from somewhere else.
>> I wonder if we can just ignore the queued changes in tablesync, but I
>> guess not - there can be queued increments after reading the sequence
>> state, and we need to apply those. But maybe we could use the page LSN
>> from the relfilenode - that should be the LSN of the last WAL record.
>>
>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
>> use to read the sequence state ...
>
> What if some Alter Sequence is performed before the copy starts, and the
> containing transaction is rolled back after the copy is finished? Won't
> it copy something which shouldn't have been copied?

That shouldn't be possible - the alter creates a new relfilenode and it's invisible until commit. So either it gets committed (and then replicated), or it remains invisible to the SELECT during sync.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/18/23 06:35, Amit Kapila wrote:
> > On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> ...
> >>
> >> Clearly, for sequences we can't quite rely on snapshots/slots, we need
> >> to get the LSN to decide what changes to apply/skip from somewhere else.
> >> I wonder if we can just ignore the queued changes in tablesync, but I
> >> guess not - there can be queued increments after reading the sequence
> >> state, and we need to apply those. But maybe we could use the page LSN
> >> from the relfilenode - that should be the LSN of the last WAL record.
> >>
> >> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
> >> use to read the sequence state ...
> >
> > What if some Alter Sequence is performed before the copy starts, and the
> > containing transaction is rolled back after the copy is finished? Won't
> > it copy something which shouldn't have been copied?
>
> That shouldn't be possible - the alter creates a new relfilenode and
> it's invisible until commit. So either it gets committed (and then
> replicated), or it remains invisible to the SELECT during sync.

Okay, however, we need to ensure that such a change will later be replicated, and also need to ensure that the required WAL doesn't get removed.

Say, if we use your first idea of the page LSN from the relfilenode, then how do we ensure that the corresponding WAL doesn't get removed when the sync worker later tries to start replication from that LSN? I am imagining here that the sync_sequence_slot will be created before copy_sequence, but even then it is possible that the sequence has not been updated for a long time and the LSN location will be in the past (as compared to the slot's LSN), which means the corresponding WAL could be removed. Now, here we can't directly start using the slot's LSN to stream changes because there is no correlation of it with the LSN (page LSN of the sequence's relfilenode) where we want to start streaming.

Now, for the second idea, which is to directly use pg_current_wal_insert_lsn(), I think we won't be able to ensure that the changes covered by in-progress transactions, like the one with the Alter Sequence example I have given, would be streamed later after the initial copy. Because the LSN returned by pg_current_wal_insert_lsn() could be an LSN after the LSN associated with the Alter Sequence but before the corresponding xact's commit.

--
With Regards,
Amit Kapila.
On 3/20/23 04:42, Amit Kapila wrote:
> On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> That shouldn't be possible - the alter creates a new relfilenode and
>> it's invisible until commit. So either it gets committed (and then
>> replicated), or it remains invisible to the SELECT during sync.
>
> Okay, however, we need to ensure that such a change will later be
> replicated, and also need to ensure that the required WAL doesn't get
> removed.
>
> Say, if we use your first idea of the page LSN from the relfilenode, then
> how do we ensure that the corresponding WAL doesn't get removed when
> the sync worker later tries to start replication from that LSN? ...

I don't understand why we'd need WAL from before the slot is created, which happens before copy_sequence, so the sync will see a more recent state (reflecting all changes up to the slot LSN).

I think the only "issue" are the WAL records after the slot LSN, or more precisely deciding which of the decoded changes to apply.

> Now, for the second idea, which is to directly use
> pg_current_wal_insert_lsn(), I think we won't be able to ensure that
> the changes covered by in-progress transactions, like the one with the
> Alter Sequence example I have given, would be streamed later after the
> initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
> could be an LSN after the LSN associated with the Alter Sequence but
> before the corresponding xact's commit.

Yeah, I think you're right - the locking itself is not sufficient to prevent this ordering of operations. copy_sequence would have to lock the sequence exclusively, which seems a bit disruptive.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
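For reference, both pieces of state discussed here are easy to inspect from SQL (sequence name hypothetical; the page LSN check assumes the pageinspect extension is available):

-- ALTER SEQUENCE creates a new relfilenode:
SELECT pg_relation_filenode('s1');
ALTER SEQUENCE s1 RESTART;
SELECT pg_relation_filenode('s1');   -- different value after the ALTER

-- Page LSN of the (single) sequence page, i.e. the LSN of the last
-- WAL record that touched it:
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lsn FROM page_header(get_raw_page('s1', 0));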
On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/20/23 04:42, Amit Kapila wrote:
> > ...
> >
> > Say, if we use your first idea of the page LSN from the relfilenode, then
> > how do we ensure that the corresponding WAL doesn't get removed when
> > the sync worker later tries to start replication from that LSN? ...
>
> I don't understand why we'd need WAL from before the slot is created,
> which happens before copy_sequence, so the sync will see a more recent
> state (reflecting all changes up to the slot LSN).

Imagine the following sequence of events:

1. An operation on a sequence seq-1 which requires WAL. Say, this is done at LSN 1000.
2. Some other random operations on unrelated objects. This would increase the LSN to 2000.
3. Create a slot that uses the current LSN 2000.
4. Copy sequence seq-1, where you will get the LSN value as 1000. Then you will use LSN 1000 as a starting point to start replication in the sequence sync worker.

It is quite possible that WAL from LSN 1000 may not be present. Now, it may be possible that we use the slot's LSN in this case but currently, it may not be possible without some changes in the slot machinery. Even if we somehow solve this, we have the below problem where we can miss some concurrent activity.

> I think the only "issue" are the WAL records after the slot LSN, or more
> precisely deciding which of the decoded changes to apply.
>
> > Now, for the second idea, which is to directly use
> > pg_current_wal_insert_lsn(), I think we won't be able to ensure that
> > the changes covered by in-progress transactions, like the one with the
> > Alter Sequence example I have given, would be streamed later after the
> > initial copy. ...
>
> Yeah, I think you're right - the locking itself is not sufficient to
> prevent this ordering of operations. copy_sequence would have to lock
> the sequence exclusively, which seems a bit disruptive.

Right, that doesn't sound like a good idea.

--
With Regards,
Amit Kapila.
On 3/20/23 12:00, Amit Kapila wrote: > On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> >> On 3/20/23 04:42, Amit Kapila wrote: >>> On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> On 3/18/23 06:35, Amit Kapila wrote: >>>>> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra >>>>> <tomas.vondra@enterprisedb.com> wrote: >>>>>> >>>>>> ... >>>>>> >>>>>> Clearly, for sequences we can't quite rely on snapshots/slots, we need >>>>>> to get the LSN to decide what changes to apply/skip from somewhere else. >>>>>> I wonder if we can just ignore the queued changes in tablesync, but I >>>>>> guess not - there can be queued increments after reading the sequence >>>>>> state, and we need to apply those. But maybe we could use the page LSN >>>>>> from the relfilenode - that should be the LSN of the last WAL record. >>>>>> >>>>>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we >>>>>> use to read the sequence state ... >>>>>> >>>>> >>>>> What if some Alter Sequence is performed before the copy starts and >>>>> after the copy is finished, the containing transaction rolled back? >>>>> Won't it copy something which shouldn't have been copied? >>>>> >>>> >>>> That shouldn't be possible - the alter creates a new relfilenode and >>>> it's invisible until commit. So either it gets committed (and then >>>> replicated), or it remains invisible to the SELECT during sync. >>>> >>> >>> Okay, however, we need to ensure that such a change will later be >>> replicated and also need to ensure that the required WAL doesn't get >>> removed. >>> >>> Say, if we use your first idea of page LSN from the relfilenode, then >>> how do we ensure that the corresponding WAL doesn't get removed when >>> later the sync worker tries to start replication from that LSN? I am >>> imagining here the sync_sequence_slot will be created before >>> copy_sequence but even then it is possible that the sequence has not >>> been updated for a long time and the LSN location will be in the past >>> (as compared to the slot's LSN) which means the corresponding WAL >>> could be removed. Now, here we can't directly start using the slot's >>> LSN to stream changes because there is no correlation of it with the >>> LSN (page LSN of sequence's relfilnode) where we want to start >>> streaming. >>> >> >> I don't understand why we'd need WAL from before the slot is created, >> which happens before copy_sequence so the sync will see a more recent >> state (reflecting all changes up to the slot LSN). >> > > Imagine the following sequence of events: > 1. Operation on a sequence seq-1 which requires WAL. Say, this is done > at LSN 1000. > 2. Some other random operations on unrelated objects. This would > increase LSN to 2000. > 3. Create a slot that uses current LSN 2000. > 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then > you will use LSN 1000 as a starting point to start replication in > sequence sync worker. > > It is quite possible that WAL from LSN 1000 may not be present. Now, > it may be possible that we use the slot's LSN in this case but > currently, it may not be possible without some changes in the slot > machinery. Even, if we somehow solve this, we have the below problem > where we can miss some concurrent activity. > I think the question is what would be the WAL-requiring operation at LSN 1000. 
If it's just regular nextval(), then we *will* see it during copy_sequence - sequences are not transactional in the MVCC sense. If it's an ALTER SEQUENCE, I guess it might create a new relfilenode, and then we might fail to apply this - that'd be bad. I wonder whether we'd actually allow the WAL to be discarded while building the consistent snapshot, though. You're right, however, that we can't just decide this based on LSN - we'd probably need to compare the relfilenodes too, or something like that ... >> I think the only "issue" are the WAL records after the slot LSN, or more >> precisely deciding which of the decoded changes to apply. >> >> >>> Now, for the second idea which is to directly use >>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that >>> the changes covered by in-progress transactions like the one with >>> Alter Sequence I have given example would be streamed later after the >>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn() >>> could be an LSN after the LSN associated with Alter Sequence but >>> before the corresponding xact's commit. >> >> Yeah, I think you're right - the locking itself is not sufficient to >> prevent this ordering of operations. copy_sequence would have to lock >> the sequence exclusively, which seems bit disruptive. >> > > Right, that doesn't sound like a good idea. > Although, maybe we could use a less strict lock level? I mean, one that allows nextval() to continue, but would conflict with ALTER SEQUENCE. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
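To make the second idea concrete, copy_sequence could read the sequence state and the insert LSN in a single statement, roughly like this (just a sketch - 's' is a placeholder sequence name and the actual query in the patch may differ):

    SELECT last_value, log_cnt, is_called, pg_current_wal_insert_lsn()
      FROM s;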
On Mon, Mar 20, 2023 at 5:13 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/20/23 12:00, Amit Kapila wrote: > > On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> > >> I don't understand why we'd need WAL from before the slot is created, > >> which happens before copy_sequence so the sync will see a more recent > >> state (reflecting all changes up to the slot LSN). > >> > > > > Imagine the following sequence of events: > > 1. Operation on a sequence seq-1 which requires WAL. Say, this is done > > at LSN 1000. > > 2. Some other random operations on unrelated objects. This would > > increase LSN to 2000. > > 3. Create a slot that uses current LSN 2000. > > 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then > > you will use LSN 1000 as a starting point to start replication in > > sequence sync worker. > > > > It is quite possible that WAL from LSN 1000 may not be present. Now, > > it may be possible that we use the slot's LSN in this case but > > currently, it may not be possible without some changes in the slot > > machinery. Even, if we somehow solve this, we have the below problem > > where we can miss some concurrent activity. > > > > I think the question is what would be the WAL-requiring operation at LSN > 1000. If it's just regular nextval(), then we *will* see it during > copy_sequence - sequences are not transactional in the MVCC sense. > > If it's an ALTER SEQUENCE, I guess it might create a new relfilenode, > and then we might fail to apply this - that'd be bad. > > I wonder if we'd allow actually discarding the WAL while building the > consistent snapshot, though. > No, as soon as we reserve the WAL location, we update the slot's minLSN (replicationSlotMinLSN) which would prevent the required WAL from being removed. > You're however right we can't just decide > this based on LSN, we'd probably need to compare the relfilenodes too or > something like that ... > > >> I think the only "issue" are the WAL records after the slot LSN, or more > >> precisely deciding which of the decoded changes to apply. > >> > >> > >>> Now, for the second idea which is to directly use > >>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that > >>> the changes covered by in-progress transactions like the one with > >>> Alter Sequence I have given example would be streamed later after the > >>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn() > >>> could be an LSN after the LSN associated with Alter Sequence but > >>> before the corresponding xact's commit. > >> > >> Yeah, I think you're right - the locking itself is not sufficient to > >> prevent this ordering of operations. copy_sequence would have to lock > >> the sequence exclusively, which seems bit disruptive. > >> > > > > Right, that doesn't sound like a good idea. > > > > Although, maybe we could use a less strict lock level? I mean, one that > allows nextval() to continue, but would conflict with ALTER SEQUENCE. > I don't know if that is a good idea but are you imagining a special interface/mechanism just for logical replication because as far as I can see you have used SELECT to fetch the sequence values? -- With Regards, Amit Kapila.
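For reference, this slot behavior is visible at the SQL level - once a slot exists, its restart_lsn marks the position from which the server has to retain WAL:

    -- WAL at or after restart_lsn is kept around for the slot
    SELECT slot_name, restart_lsn, confirmed_flush_lsn
      FROM pg_replication_slots;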
On 3/20/23 13:26, Amit Kapila wrote: > On Mon, Mar 20, 2023 at 5:13 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 3/20/23 12:00, Amit Kapila wrote: >>> On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> >>>> I don't understand why we'd need WAL from before the slot is created, >>>> which happens before copy_sequence so the sync will see a more recent >>>> state (reflecting all changes up to the slot LSN). >>>> >>> >>> Imagine the following sequence of events: >>> 1. Operation on a sequence seq-1 which requires WAL. Say, this is done >>> at LSN 1000. >>> 2. Some other random operations on unrelated objects. This would >>> increase LSN to 2000. >>> 3. Create a slot that uses current LSN 2000. >>> 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then >>> you will use LSN 1000 as a starting point to start replication in >>> sequence sync worker. >>> >>> It is quite possible that WAL from LSN 1000 may not be present. Now, >>> it may be possible that we use the slot's LSN in this case but >>> currently, it may not be possible without some changes in the slot >>> machinery. Even, if we somehow solve this, we have the below problem >>> where we can miss some concurrent activity. >>> >> >> I think the question is what would be the WAL-requiring operation at LSN >> 1000. If it's just regular nextval(), then we *will* see it during >> copy_sequence - sequences are not transactional in the MVCC sense. >> >> If it's an ALTER SEQUENCE, I guess it might create a new relfilenode, >> and then we might fail to apply this - that'd be bad. >> >> I wonder if we'd allow actually discarding the WAL while building the >> consistent snapshot, though. >> > > No, as soon as we reserve the WAL location, we update the slot's > minLSN (replicationSlotMinLSN) which would prevent the required WAL > from being removed. > >> You're however right we can't just decide >> this based on LSN, we'd probably need to compare the relfilenodes too or >> something like that ... >> >>>> I think the only "issue" are the WAL records after the slot LSN, or more >>>> precisely deciding which of the decoded changes to apply. >>>> >>>> >>>>> Now, for the second idea which is to directly use >>>>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that >>>>> the changes covered by in-progress transactions like the one with >>>>> Alter Sequence I have given example would be streamed later after the >>>>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn() >>>>> could be an LSN after the LSN associated with Alter Sequence but >>>>> before the corresponding xact's commit. >>>> >>>> Yeah, I think you're right - the locking itself is not sufficient to >>>> prevent this ordering of operations. copy_sequence would have to lock >>>> the sequence exclusively, which seems bit disruptive. >>>> >>> >>> Right, that doesn't sound like a good idea. >>> >> >> Although, maybe we could use a less strict lock level? I mean, one that >> allows nextval() to continue, but would conflict with ALTER SEQUENCE. >> > > I don't know if that is a good idea but are you imagining a special > interface/mechanism just for logical replication because as far as I > can see you have used SELECT to fetch the sequence values? > Not sure what the special mechanism would be. I don't think it could read the sequence from somewhere else, and due to the lack of MVCC we'd just read the same sequence data from the current relfilenode. Or what else would it do?
The one thing we can't quite do at the moment is locking the sequence, because LOCK is only supported for tables. So we could either provide a function that locks a sequence, or one that locks it and then returns the current state (as if we did a SELECT). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
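To make that concrete: the first statement below fails on current releases, and the second shows the rough shape such a function could have (the name matches the function introduced later in this thread, but the exact signature is an assumption):

    LOCK TABLE s;                                     -- fails today, LOCK only accepts tables (and views)
    SELECT pg_sequence_lock_for_sync('s'::regclass);  -- hypothetical: lock the sequence, possibly
                                                      -- returning its current state as well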
On 3/20/23 18:03, Tomas Vondra wrote: > > ... >> >> I don't know if that is a good idea but are you imagining a special >> interface/mechanism just for logical replication because as far as I >> can see you have used SELECT to fetch the sequence values? >> > > Not sure what would the special mechanism be? I don't think it could > read the sequence from somewhere else, and due the lack of MVCC we'd > just read same sequence data from the current relfilenode. Or what else > would it do? > I was thinking about alternative ways to do this, but I couldn't think of anything. The non-MVCC behavior of sequences means it's not really possible to do this based on snapshots / slots or stuff like that ... > The one thing we can't quite do at the moment is locking the sequence, > because LOCK is only supported for tables. So we could either provide a > function to lock a sequence, or locks it and then returns the current > state (as if we did a SELECT). > ... so I took a stab at doing it like this. I didn't feel relaxing LOCK restrictions to also allow locking sequences would be the right choice, so I added a new function pg_sequence_lock_for_sync(). I wonder if we could/should restrict this to logical replication use, somehow. The interlock happens right after creating the slot - I was thinking about doing it even before the slot gets created, but that's not possible, because slot creation installs a snapshot (so it has to be the first command in the transaction). It acquires RowExclusiveLock, which is enough to conflict with ALTER SEQUENCE, but allows nextval(). AFAICS this does the trick - if there's an ALTER SEQUENCE, we'll wait for it to complete. And copy_sequence() will read the resulting state, even though this is REPEATABLE READ - remember, sequences are not subject to that consistency. The one anomaly I can think of is that the sequence might seem to go "backwards" for a little bit during the sync. Imagine this sequence of operations: 1) tablesync creates slot 2) S1 does ALTER SEQUENCE ... RESTART WITH 20 (gets lock) 3) S2 tries ALTER SEQUENCE ... RESTART WITH 100 (waits for lock) 4) tablesync requests lock 5) S1 does the thing, commits 6) S2 acquires lock, does the thing, commits 7) tablesync gets lock, reads current sequence state 8) tablesync decodes changes from S1 and S2, applies them But I think this is fine - it's part of the catchup, and until that's done the sync is not considered completed. I merged the earlier "fixup" patches into the relevant parts, and left two patches with new tweaks (deducing the correct "WAL" state from the current state read by copy_sequence), and the interlock discussed here. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
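Put together, the interlock in the sync worker would look roughly like this (a sketch - the slot creation happens over the replication protocol and is only hinted at here, and the locking function's signature is an assumption):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    -- the replication slot is created here, as the first command,
    -- because creating it installs the transaction snapshot
    SELECT pg_sequence_lock_for_sync('s'::regclass);  -- RowExclusiveLock: waits for
                                                      -- ALTER SEQUENCE, allows nextval()
    SELECT last_value, log_cnt, is_called FROM s;     -- copy_sequence reads this state
    COMMIT;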
Attachment
Hi, On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I merged the earlier "fixup" patches into the relevant parts, and left > two patches with new tweaks (deducing the corrent "WAL" state from the > current state read by copy_sequence), and the interlock discussed here. > Apart from that, how does a publication having sequences work with subscribers that are not able to handle sequence changes, e.g. in a case where the publisher's PostgreSQL version is newer than the subscriber's? As far as I tested the latest patches, the subscriber (v15) errors out with the error 'invalid logical replication message type "Q"' when receiving a sequence change. I'm not sure that's sensible behavior. I think we should instead either (1) deny starting the replication if the subscriber isn't able to handle sequence changes and the publication includes them, or (2) not send sequence changes to such subscribers. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 3/27/23 03:32, Masahiko Sawada wrote: > Hi, > > On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I merged the earlier "fixup" patches into the relevant parts, and left >> two patches with new tweaks (deducing the corrent "WAL" state from the >> current state read by copy_sequence), and the interlock discussed here. >> > > Apart from that, how does the publication having sequences work with > subscribers who are not able to handle sequence changes, e.g. in a > case where PostgreSQL version of publication is newer than the > subscriber? As far as I tested the latest patches, the subscriber > (v15) errors out with the error 'invalid logical replication message > type "Q"' when receiving a sequence change. I'm not sure it's sensible > behavior. I think we should instead either (1) deny starting the > replication if the subscriber isn't able to handle sequence changes > and the publication includes that, or (2) not send sequence changes to > such subscribers. > I agree the "invalid message" error is not great, but it's not clear to me how to do (1) - the trouble is we don't really know if the publication contains (or will contain) sequences. I mean, what would happen if the replication starts and then someone adds a sequence? For (2), I think that's not something we should do - silently discarding some messages seems error-prone. If the publication includes sequences, presumably the user wanted to replicate those. If they want to replicate to an older subscriber, they can create a publication without sequences. Perhaps the right solution would be to check whether the subscriber supports replication of sequences in the output plugin, while attempting to write the "Q" message, and error out if the subscriber does not support it. What do you think? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 3/27/23 03:32, Masahiko Sawada wrote: > > Hi, > > > > On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> I merged the earlier "fixup" patches into the relevant parts, and left > >> two patches with new tweaks (deducing the corrent "WAL" state from the > >> current state read by copy_sequence), and the interlock discussed here. > >> > > > > Apart from that, how does the publication having sequences work with > > subscribers who are not able to handle sequence changes, e.g. in a > > case where PostgreSQL version of publication is newer than the > > subscriber? As far as I tested the latest patches, the subscriber > > (v15) errors out with the error 'invalid logical replication message > > type "Q"' when receiving a sequence change. I'm not sure it's sensible > > behavior. I think we should instead either (1) deny starting the > > replication if the subscriber isn't able to handle sequence changes > > and the publication includes that, or (2) not send sequence changes to > > such subscribers. > > > > I agree the "invalid message" error is not great, but it's not clear to > me how to do either (1). The trouble is we don't really know if the > publication contains (or will contain) sequences. I mean, what would > happen if the replication starts and then someone adds a sequence? > > For (2), I think that's not something we should do - silently discarding > some messages seems error-prone. If the publication includes sequences, > presumably the user wanted to replicate those. If they want to replicate > to an older subscriber, create a publication without sequences. > > Perhaps the right solution would be to check if the subscriber supports > replication of sequences in the output plugin, while attempting to write > the "Q" message. And error-out if the subscriber does not support it. It might be related to this topic; do we need to bump the protocol version? The commit 64824323e57d introduced new streaming callbacks and bumped the protocol version. I think the same seems to be true for this change as it adds sequence_cb callback. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 3/28/23 18:34, Masahiko Sawada wrote: > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> >> >> On 3/27/23 03:32, Masahiko Sawada wrote: >>> Hi, >>> >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> I merged the earlier "fixup" patches into the relevant parts, and left >>>> two patches with new tweaks (deducing the corrent "WAL" state from the >>>> current state read by copy_sequence), and the interlock discussed here. >>>> >>> >>> Apart from that, how does the publication having sequences work with >>> subscribers who are not able to handle sequence changes, e.g. in a >>> case where PostgreSQL version of publication is newer than the >>> subscriber? As far as I tested the latest patches, the subscriber >>> (v15) errors out with the error 'invalid logical replication message >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible >>> behavior. I think we should instead either (1) deny starting the >>> replication if the subscriber isn't able to handle sequence changes >>> and the publication includes that, or (2) not send sequence changes to >>> such subscribers. >>> >> >> I agree the "invalid message" error is not great, but it's not clear to >> me how to do either (1). The trouble is we don't really know if the >> publication contains (or will contain) sequences. I mean, what would >> happen if the replication starts and then someone adds a sequence? >> >> For (2), I think that's not something we should do - silently discarding >> some messages seems error-prone. If the publication includes sequences, >> presumably the user wanted to replicate those. If they want to replicate >> to an older subscriber, create a publication without sequences. >> >> Perhaps the right solution would be to check if the subscriber supports >> replication of sequences in the output plugin, while attempting to write >> the "Q" message. And error-out if the subscriber does not support it. > > It might be related to this topic; do we need to bump the protocol > version? The commit 64824323e57d introduced new streaming callbacks > and bumped the protocol version. I think the same seems to be true for > this change as it adds sequence_cb callback. > It's not clear to me what the exact behavior should be. I mean, imagine we're opening a connection for logical replication, and the subscriber does not handle sequences. What should the publisher do? (Note: The correct commit hash is 464824323e57d.) I don't think streaming is a good match for sequences, because of a couple of important differences ... Firstly, streaming determines *how* the changes are replicated, not what gets replicated. It doesn't (silently) filter out "bad" events that the subscriber doesn't know how to apply. If the subscriber does not know how to deal with streamed xacts, it'll still get the same changes exactly per the publication definition. Secondly, the default value is "streaming=off", i.e. the subscriber has to explicitly request streaming when opening the connection. And we simply check it against the negotiated protocol version, i.e. the check in pgoutput_startup() protects against a subscriber requesting protocol v1 but also streaming=on. I don't think we can/should do more checks at this point - we don't know what's included in the requested publications at that point, and I doubt it's worth adding because we certainly can't predict if the publication will be altered to include/decode sequences in the future.
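For comparison, this is what the streaming negotiation looks like on the wire today - the subscriber opts in explicitly when starting the stream, and pgoutput validates the option against the negotiated protocol version (slot and publication names are placeholders):

    START_REPLICATION SLOT "sub" LOGICAL 0/0
        (proto_version '2', streaming 'on', publication_names '"pub"')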
Speaking of precedents, TRUNCATE is probably a better one, because it's a new action and it determines *what* the subscriber can handle. But that does exactly the thing we do for sequences - if you open a connection from a PG10 subscriber (truncate was added in PG11), and the publisher decodes a truncate, the subscriber will do: 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical replication message type "T" 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical replication worker for subscription 16390 (PID 2357609) exited with exit code 1 I don't see why sequences should do anything else. If you need to replicate to such a subscriber, create a publication that does not have 'sequence' in the publish option ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
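With the patch, replicating to such an older subscriber would then simply mean leaving 'sequence' out of the publish option when creating the publication (a sketch, assuming the option name used by the patch series):

    CREATE PUBLICATION pub FOR ALL TABLES
        WITH (publish = 'insert, update, delete, truncate');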
On Wed, Mar 29, 2023 at 3:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/28/23 18:34, Masahiko Sawada wrote: > > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> > >> > >> On 3/27/23 03:32, Masahiko Sawada wrote: > >>> Hi, > >>> > >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > >>> <tomas.vondra@enterprisedb.com> wrote: > >>>> > >>>> I merged the earlier "fixup" patches into the relevant parts, and left > >>>> two patches with new tweaks (deducing the corrent "WAL" state from the > >>>> current state read by copy_sequence), and the interlock discussed here. > >>>> > >>> > >>> Apart from that, how does the publication having sequences work with > >>> subscribers who are not able to handle sequence changes, e.g. in a > >>> case where PostgreSQL version of publication is newer than the > >>> subscriber? As far as I tested the latest patches, the subscriber > >>> (v15) errors out with the error 'invalid logical replication message > >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible > >>> behavior. I think we should instead either (1) deny starting the > >>> replication if the subscriber isn't able to handle sequence changes > >>> and the publication includes that, or (2) not send sequence changes to > >>> such subscribers. > >>> > >> > >> I agree the "invalid message" error is not great, but it's not clear to > >> me how to do either (1). The trouble is we don't really know if the > >> publication contains (or will contain) sequences. I mean, what would > >> happen if the replication starts and then someone adds a sequence? > >> > >> For (2), I think that's not something we should do - silently discarding > >> some messages seems error-prone. If the publication includes sequences, > >> presumably the user wanted to replicate those. If they want to replicate > >> to an older subscriber, create a publication without sequences. > >> > >> Perhaps the right solution would be to check if the subscriber supports > >> replication of sequences in the output plugin, while attempting to write > >> the "Q" message. And error-out if the subscriber does not support it. > > > > It might be related to this topic; do we need to bump the protocol > > version? The commit 64824323e57d introduced new streaming callbacks > > and bumped the protocol version. I think the same seems to be true for > > this change as it adds sequence_cb callback. > > > > It's not clear to me what should be the exact behavior? > > I mean, imagine we're opening a connection for logical replication, and > the subscriber does not handle sequences. What should the publisher do? > > (Note: The correct commit hash is 464824323e57d.) Thanks. > > I don't think the streaming is a good match for sequences, because of a > couple important differences ... > > Firstly, streaming determines *how* the changes are replicated, not what > gets replicated. It doesn't (silently) filter out "bad" events that the > subscriber doesn't know how to apply. If the subscriber does not know > how to deal with streamed xacts, it'll still get the same changes > exactly per the publication definition. > > Secondly, the default value is "streming=off", i.e. the subscriber has > to explicitly request streaming when opening the connection. And we > simply check it against the negotiated protocol version, i.e. the check > in pgoutput_startup() protects against subscriber requesting a protocol > v1 but also streaming=on. 
> > I don't think we can/should do more check at this point - we don't know > what's included in the requested publications at that point, and I doubt > it's worth adding because we certainly can't predict if the publication > will be altered to include/decode sequences in the future. True. That's a valid argument. > > Speaking of precedents, TRUNCATE is probably a better one, because it's > a new action and it determines *what* the subscriber can handle. But > that does exactly the thing we do for sequences - if you open a > connection from PG10 subscriber (truncate was added in PG11), and the > publisher decodes a truncate, subscriber will do: > > 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > replication message type "T" > 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > replication worker for subscription 16390 (PID 2357609) exited with > exit code 1 > > I don't see why sequences should do anything else. If you need to > replicate to such subscriber, create a publication that does not have > 'sequence' in the publish option ... > I didn't check the TRUNCATE case - yes, that's a good match for sequence replication. So it seems we don't need to do anything. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 29, 2023 at 12:04 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/28/23 18:34, Masahiko Sawada wrote: > > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >>> > >>> Apart from that, how does the publication having sequences work with > >>> subscribers who are not able to handle sequence changes, e.g. in a > >>> case where PostgreSQL version of publication is newer than the > >>> subscriber? As far as I tested the latest patches, the subscriber > >>> (v15) errors out with the error 'invalid logical replication message > >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible > >>> behavior. I think we should instead either (1) deny starting the > >>> replication if the subscriber isn't able to handle sequence changes > >>> and the publication includes that, or (2) not send sequence changes to > >>> such subscribers. > >>> > >> > >> I agree the "invalid message" error is not great, but it's not clear to > >> me how to do either (1). The trouble is we don't really know if the > >> publication contains (or will contain) sequences. I mean, what would > >> happen if the replication starts and then someone adds a sequence? > >> > >> For (2), I think that's not something we should do - silently discarding > >> some messages seems error-prone. If the publication includes sequences, > >> presumably the user wanted to replicate those. If they want to replicate > >> to an older subscriber, create a publication without sequences. > >> > >> Perhaps the right solution would be to check if the subscriber supports > >> replication of sequences in the output plugin, while attempting to write > >> the "Q" message. And error-out if the subscriber does not support it. > > > > It might be related to this topic; do we need to bump the protocol > > version? The commit 64824323e57d introduced new streaming callbacks > > and bumped the protocol version. I think the same seems to be true for > > this change as it adds sequence_cb callback. > > > > It's not clear to me what should be the exact behavior? > > I mean, imagine we're opening a connection for logical replication, and > the subscriber does not handle sequences. What should the publisher do? > I think deciding anything at the publisher would be tricky. But wouldn't it be better if, by default, we disallowed connections from a subscriber when the publisher's version is higher, and then allowed them only based on some subscription option? Or maybe allow the connection to a higher version by default, but disallow it based on an option. > > Speaking of precedents, TRUNCATE is probably a better one, because it's > a new action and it determines *what* the subscriber can handle. But > that does exactly the thing we do for sequences - if you open a > connection from PG10 subscriber (truncate was added in PG11), and the > publisher decodes a truncate, subscriber will do: > > 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > replication message type "T" > 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > replication worker for subscription 16390 (PID 2357609) exited with > exit code 1 > > I don't see why sequences should do anything else. > Is this behavior of TRUNCATE known or discussed previously? I can't see any mention of it in the docs or commit message. I guess if we want to follow such behavior, it should be well documented so that it won't be a surprise for users. I think we would face such cases in the future as well.
One of the similar cases we are discussing is DDL replication, where a higher-version publisher could send some DDL syntax that lower-version subscribers won't support, which will lead to an error [1]. [1] - https://www.postgresql.org/message-id/OS0PR01MB5716088E497BDCBCED7FC3DA94849%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On 3/29/23 11:51, Amit Kapila wrote: > On Wed, Mar 29, 2023 at 12:04 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 3/28/23 18:34, Masahiko Sawada wrote: >>> On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>>> >>>>> Apart from that, how does the publication having sequences work with >>>>> subscribers who are not able to handle sequence changes, e.g. in a >>>>> case where PostgreSQL version of publication is newer than the >>>>> subscriber? As far as I tested the latest patches, the subscriber >>>>> (v15) errors out with the error 'invalid logical replication message >>>>> type "Q"' when receiving a sequence change. I'm not sure it's sensible >>>>> behavior. I think we should instead either (1) deny starting the >>>>> replication if the subscriber isn't able to handle sequence changes >>>>> and the publication includes that, or (2) not send sequence changes to >>>>> such subscribers. >>>>> >>>> >>>> I agree the "invalid message" error is not great, but it's not clear to >>>> me how to do either (1). The trouble is we don't really know if the >>>> publication contains (or will contain) sequences. I mean, what would >>>> happen if the replication starts and then someone adds a sequence? >>>> >>>> For (2), I think that's not something we should do - silently discarding >>>> some messages seems error-prone. If the publication includes sequences, >>>> presumably the user wanted to replicate those. If they want to replicate >>>> to an older subscriber, create a publication without sequences. >>>> >>>> Perhaps the right solution would be to check if the subscriber supports >>>> replication of sequences in the output plugin, while attempting to write >>>> the "Q" message. And error-out if the subscriber does not support it. >>> >>> It might be related to this topic; do we need to bump the protocol >>> version? The commit 64824323e57d introduced new streaming callbacks >>> and bumped the protocol version. I think the same seems to be true for >>> this change as it adds sequence_cb callback. >>> >> >> It's not clear to me what should be the exact behavior? >> >> I mean, imagine we're opening a connection for logical replication, and >> the subscriber does not handle sequences. What should the publisher do? >> > > I think deciding anything at the publisher would be tricky but won't > it be better if by default we disallow connection from subscriber to > the publisher when the publisher's version is higher? And then allow > it only based on some subscription option or maybe by default allow > the connection to a higher version but based on option disallows the > connection. > >> >> Speaking of precedents, TRUNCATE is probably a better one, because it's >> a new action and it determines *what* the subscriber can handle. But >> that does exactly the thing we do for sequences - if you open a >> connection from PG10 subscriber (truncate was added in PG11), and the >> publisher decodes a truncate, subscriber will do: >> >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical >> replication message type "T" >> 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical >> replication worker for subscription 16390 (PID 2357609) exited with >> exit code 1 >> >> I don't see why sequences should do anything else. >> > > Is this behavior of TRUNCATE known or discussed previously? I can't > see any mention of this in the docs or commit message. 
I guess if we > want to follow such behavior it should be well documented so that it > won't be a surprise for users. I think we would face such cases in the > future as well. One of the similar cases we are discussing for DDL > replication where a higher version publisher could send some DDL > syntax that lower version subscribers won't support and will lead to > an error [1]. > I don't know where/how it's documented, TBH. FWIW I agree the TRUNCATE-like behavior (failing on the subscriber after receiving an unknown message type) is a bit annoying. Perhaps it'd be reasonable to tie the "protocol version" to subscriber capabilities, so that a protocol version guarantees what message types the subscriber understands. So we could increment the protocol version, check it in pgoutput_startup and then error out in the sequence callback if the subscriber version is too old. That'd be nicer in the sense that we'd generate a nicer error message on the publisher, not an "unknown message type" error on the subscriber. That's doable, the main problem being it'd be inconsistent with the TRUNCATE behavior. OTOH that was introduced in PG11, which is the oldest version still under support ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 29.03.23 16:28, Tomas Vondra wrote: > Perhaps it'd be reasonable to tie the "protocol version" to subscriber > capabilities, so that a protocol version guarantees what message types > the subscriber understands. So we could increment the protocol version, > check it in pgoutput_startup and then error-out in the sequence callback > if the subscriber version is too old. That would make sense. > That'd be nicer in the sense that we'd generate nicer error message on > the publisher, not an "unknown message type" on the subscriber. That's > doable, the main problem being it'd be inconsistent with the TRUNCATE > behavior. OTOH that was introduced in PG11, which is the oldest version > still under support ... I think at the time TRUNCATE support was added, we didn't have a strong sense of how the protocol versioning would work or whether it would work at all, so doing nothing was the easiest way out.
On Wed, Mar 29, 2023 at 7:58 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/29/23 11:51, Amit Kapila wrote: > >> > >> It's not clear to me what should be the exact behavior? > >> > >> I mean, imagine we're opening a connection for logical replication, and > >> the subscriber does not handle sequences. What should the publisher do? > >> > > > > I think deciding anything at the publisher would be tricky but won't > > it be better if by default we disallow connection from subscriber to > > the publisher when the publisher's version is higher? And then allow > > it only based on some subscription option or maybe by default allow > > the connection to a higher version but based on option disallows the > > connection. > > > >> > >> Speaking of precedents, TRUNCATE is probably a better one, because it's > >> a new action and it determines *what* the subscriber can handle. But > >> that does exactly the thing we do for sequences - if you open a > >> connection from PG10 subscriber (truncate was added in PG11), and the > >> publisher decodes a truncate, subscriber will do: > >> > >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > >> replication message type "T" > >> 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > >> replication worker for subscription 16390 (PID 2357609) exited with > >> exit code 1 > >> > >> I don't see why sequences should do anything else. > >> > > > > Is this behavior of TRUNCATE known or discussed previously? I can't > > see any mention of this in the docs or commit message. I guess if we > > want to follow such behavior it should be well documented so that it > > won't be a surprise for users. I think we would face such cases in the > > future as well. One of the similar cases we are discussing for DDL > > replication where a higher version publisher could send some DDL > > syntax that lower version subscribers won't support and will lead to > > an error [1]. > > > > I don't know where/how it's documented, TBH. > > FWIW I agree the TRUNCATE-like behavior (failing on subscriber after > receiving unknown message type) is a bit annoying. > > Perhaps it'd be reasonable to tie the "protocol version" to subscriber > capabilities, so that a protocol version guarantees what message types > the subscriber understands. So we could increment the protocol version, > check it in pgoutput_startup and then error-out in the sequence callback > if the subscriber version is too old. > > That'd be nicer in the sense that we'd generate nicer error message on > the publisher, not an "unknown message type" on the subscriber. > Agreed. So, we can probably formalize this rule such that whenever in a newer version publisher we want to send additional information which the old version subscriber won't be able to handle, the error should be raised at the publisher by using protocol version number. -- With Regards, Amit Kapila.
On Thu, Mar 30, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 29, 2023 at 7:58 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 3/29/23 11:51, Amit Kapila wrote: > > >> > > >> It's not clear to me what should be the exact behavior? > > >> > > >> I mean, imagine we're opening a connection for logical replication, and > > >> the subscriber does not handle sequences. What should the publisher do? > > >> > > > > > > I think deciding anything at the publisher would be tricky but won't > > > it be better if by default we disallow connection from subscriber to > > > the publisher when the publisher's version is higher? And then allow > > > it only based on some subscription option or maybe by default allow > > > the connection to a higher version but based on option disallows the > > > connection. > > > > > >> > > >> Speaking of precedents, TRUNCATE is probably a better one, because it's > > >> a new action and it determines *what* the subscriber can handle. But > > >> that does exactly the thing we do for sequences - if you open a > > >> connection from PG10 subscriber (truncate was added in PG11), and the > > >> publisher decodes a truncate, subscriber will do: > > >> > > >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > > >> replication message type "T" > > >> 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > > >> replication worker for subscription 16390 (PID 2357609) exited with > > >> exit code 1 > > >> > > >> I don't see why sequences should do anything else. > > >> > > > > > > Is this behavior of TRUNCATE known or discussed previously? I can't > > > see any mention of this in the docs or commit message. I guess if we > > > want to follow such behavior it should be well documented so that it > > > won't be a surprise for users. I think we would face such cases in the > > > future as well. One of the similar cases we are discussing for DDL > > > replication where a higher version publisher could send some DDL > > > syntax that lower version subscribers won't support and will lead to > > > an error [1]. > > > > > > > I don't know where/how it's documented, TBH. > > > > FWIW I agree the TRUNCATE-like behavior (failing on subscriber after > > receiving unknown message type) is a bit annoying. > > > > Perhaps it'd be reasonable to tie the "protocol version" to subscriber > > capabilities, so that a protocol version guarantees what message types > > the subscriber understands. So we could increment the protocol version, > > check it in pgoutput_startup and then error-out in the sequence callback > > if the subscriber version is too old. > > > > That'd be nicer in the sense that we'd generate nicer error message on > > the publisher, not an "unknown message type" on the subscriber. > > > > Agreed. So, we can probably formalize this rule such that whenever in > a newer version publisher we want to send additional information which > the old version subscriber won't be able to handle, the error should > be raised at the publisher by using protocol version number. +1 Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 3/30/23 05:15, Masahiko Sawada wrote: > > ... > >>> >>> Perhaps it'd be reasonable to tie the "protocol version" to subscriber >>> capabilities, so that a protocol version guarantees what message types >>> the subscriber understands. So we could increment the protocol version, >>> check it in pgoutput_startup and then error-out in the sequence callback >>> if the subscriber version is too old. >>> >>> That'd be nicer in the sense that we'd generate nicer error message on >>> the publisher, not an "unknown message type" on the subscriber. >>> >> >> Agreed. So, we can probably formalize this rule such that whenever in >> a newer version publisher we want to send additional information which >> the old version subscriber won't be able to handle, the error should >> be raised at the publisher by using protocol version number. > > +1 > OK, I took a stab at this, see the attached 0007 patch which bumps the protocol version, and allows the subscriber to specify "sequences" when starting the replication, similar to what we do for the two-phase stuff. The patch essentially adds 'sequences' to the replication start command, depending on the server version, but it can be overridden by the "sequences" subscription option. The patch is pretty small, but I wonder how much smarter this should be ... I think there are about 4 cases that we need to consider: 1) there are no sequences in the publication -> OK 2) publication with sequences, subscriber knows how to apply (and specifies "sequences on" either automatically or explicitly) -> OK 3) publication with sequences, subscriber explicitly disabled them by specifying "sequences off" in startup -> OK 4) publication with sequences, subscriber without sequence support (e.g. older Postgres release) -> PROBLEM (?) The reason I think (4) may be a problem is that, in my opinion, we shouldn't silently drop stuff that is meant to be part of the publication. That is, if someone creates a publication and adds a sequence to it, they want to replicate the sequence. But with the current behavior, an old subscriber connects without specifying 'sequences on', so the publisher disables sequences and then simply ignores sequence increments during decoding. I think we might want to detect this and error out instead of just skipping the change, but that needs to happen later, only when the publication actually has any sequences ... I don't want to over-think / over-engineer this, though, so I wonder what your opinions on this are. There are a couple of XXX comments in the code, mostly about stuff I left out when copying the two-phase stuff. For example, we store two-phase stuff in the replication slot itself - I don't think we need to do that for sequences, though. Another question is what to do about ALTER SUBSCRIPTION - at the moment it's not possible to change the "sequences" option, but maybe we should allow that? But then we'd need to re-sync all the sequences, somehow ... Aside from that, I've also added 0005, which does the sync interlock in a slightly different way - instead of a custom function for locking a sequence, it allows LOCK on sequences. Peter Eisentraut suggested doing it like this - it's simpler, and I can't see what issues it might cause. The patch should update the LOCK documentation, I haven't done that yet. Ultimately it should all be merged into 0003, of course. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230402.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230402.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230402.patch
- 0004-add-interlock-with-ALTER-SEQUENCE-20230402.patch
- 0005-Support-LOCK-for-sequences-instead-of-funct-20230402.patch
- 0006-Reconstruct-the-right-state-from-the-on-dis-20230402.patch
- 0007-protocol-changes-20230402.patch
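For reference, with the 0007 patch the start of replication would look roughly like this on the wire (a sketch - the 'sequences' option name and the bumped version number follow the thread, the slot and publication names are placeholders):

    START_REPLICATION SLOT "sub" LOGICAL 0/0
        (proto_version '5', sequences 'on', publication_names '"pub"')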
Fwiw the cfbot seems to have some failing tests with this patch: [19:05:11.398] # Failed test 'initial test data replicated' [19:05:11.398] # at t/030_sequences.pl line 75. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '132|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'advance sequence in rolled-back transaction' [19:05:11.398] # at t/030_sequences.pl line 98. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '231|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'create sequence, advance it in rolled-back transaction, but commit the create' [19:05:11.398] # at t/030_sequences.pl line 152. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '132|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'advance the new sequence in a transaction and roll it back' [19:05:11.398] # at t/030_sequences.pl line 175. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '231|0|t' [19:05:11.398] [19:05:11.398] # Failed test 'advance sequence in a subtransaction' [19:05:11.398] # at t/030_sequences.pl line 198. [19:05:11.398] # got: '1|0|f' [19:05:11.398] # expected: '330|0|t' [19:05:11.398] # Looks like you failed 5 tests of 6. -- Gregory Stark As Commitfest Manager
Patch 0002 is very annoying to scroll, and I realized that it's because psql is writing 200kB of dashes in one of the test_decoding test cases. I propose to set psql's printing format to 'unaligned' to avoid that, which should cut the size of that patch to a tenth. I wonder if there's a similar issue in 0003, but I didn't check. It's annoying that git doesn't seem to have a way of reporting the length of the longest lines. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
Attachment
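The psql setting in question - presumably what the attached patch adds to the test_decoding scripts:

    \pset format unaligned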
On 4/5/23 12:39, Alvaro Herrera wrote: > Patch 0002 is very annoying to scroll, and I realized that it's because > psql is writing 200kB of dashes in one of the test_decoding test cases. > I propose to set psql's printing format to 'unaligned' to avoid that, > which should cut the size of that patch to a tenth. > Yeah, that's a good idea, I think. It shrunk the diff to ~90kB, which is much better. > I wonder if there's a similar issue in 0003, but I didn't check. > I don't think so - there are just enough code changes to generate a ~260kB diff with all the context. As for the cfbot failures reported by Greg, those turned out to be a minor thinko in the protocol version negotiation, introduced by part 0008 (current part, after adding Alvaro's patch tweaking test output). The subscriber failed to send 'sequences on' when starting the stream. It also forgot to refresh the subscription after a sequence was added. The attached patch version fixes all of this, but I think at this point it's better to just postpone this for PG17 - if it was something we could fix within a single release, maybe. But the replication protocol is something we can't easily change after release, so if we find out the versioning (and sequence negotiation) should work differently, we can't change it. In fact, we'd probably be stuck with it until PG16 gets out of support, not just until PG17 ... I've thought about pushing at least the first two parts (adding the sequence decoding infrastructure and test_decoding support), but I'm not sure that's quite worth it without the built-in replication stuff. Or we could push it and then tweak it after feature freeze, if we conclude the protocol versioning should work differently. I recall we made changes to the column and row filtering in PG15. But that seems quite wrong, obviously. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230405.patch
- 0002-make-test_decoding-ddl.out-shorter-20230405.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230405.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230405.patch
- 0005-add-interlock-with-ALTER-SEQUENCE-20230405.patch
- 0006-Support-LOCK-for-sequences-instead-of-funct-20230405.patch
- 0007-Reconstruct-the-right-state-from-the-on-dis-20230405.patch
- 0008-protocol-changes-20230405.patch
On 02.04.23 19:46, Tomas Vondra wrote: > OK, I took a stab at this, see the attached 0007 patch which bumps the > protocol version, and allows the subscriber to specify "sequences" when > starting the replication, similar to what we do for the two-phase stuff. > > The patch essentially adds 'sequences' to the replication start command, > depending on the server version, but it can be overridden by "sequences" > subscription option. The patch is pretty small, but I wonder how much > smarter this should be ... I think this should actually be much simpler. All the code needs to do is: - Raise protocol version (4->5) (Your patch does that.) - pgoutput_sequence() checks whether the protocol version is >=5 and if not it raises an error. - Subscriber uses old protocol if the remote end is an older PG version. (Your patch does that.) I don't see the need for the subscriber to toggle sequences explicitly or anything like that.
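Under that scheme the wire negotiation stays minimal: a sequence-aware subscriber simply requests the newer protocol version, with no separate toggle (a sketch with placeholder slot and publication names):

    START_REPLICATION SLOT "sub" LOGICAL 0/0
        (proto_version '5', publication_names '"pub"')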
Hi, Sorry for jumping late in this thread. I started experimenting with the functionality. Some of this may already have been discussed earlier. Given that the thread has been going for so long and has gone through several changes, revalidating the functionality is useful. I considered the following aspects: Changes to the sequence on subscriber ----------------------------------------------------- 1. Since this is logical decoding, the logical replica is writable. So the logically replicated sequence can be manipulated on the subscriber as well. This implementation consolidates the changes on subscriber and publisher rather than replicating the publisher state as is. That's good. See the example command sequence below: a. publisher calls nextval() - this sets the sequence state on publisher as (1, 32, t) which is replicated to the subscriber. b. subscriber calls nextval() once - this sets the sequence state on subscriber as (34, 32, t) c. subscriber calls nextval() 32 times - on-disk state of sequence doesn't change on subscriber d. subscriber calls nextval() 33 times - this sets the sequence state on subscriber as (99, 0, t) e. publisher calls nextval() 32 times - this sets the sequence state on publisher as (33, 0, t) The on-disk state on the publisher at the end of e. is replicated to the subscriber, but the subscriber doesn't apply it. The state there is still (99, 0, t). I think this is closer to how logical replication of sequences should look. This is also good enough as long as we expect the replication of sequences to be used for failover and switchover. But it might not help if we want to consolidate the INSERTs that use nextval(). If we were to treat sequences as accumulating the increments, we might be able to resolve the conflicts by adjusting the column values considering the increments made on the subscriber. IIUC, conflict resolution is not part of built-in logical replication. So we may not want to go this route. But worth considering. Implementation agnostic decoded change -------------------------------------------------------- The current method of decoding and replicating the sequences is tied to the implementation - it replicates the sequence row as is. If the implementation changes in the future, we might need to revise the decoded presentation of sequences. I think only nextval() matters for a sequence. So as long as we are replicating enough information to calculate the nextval we should be good. The current implementation does that by replicating the log_value and is_called. is_called can be consolidated into log_value itself. The implemented protocol thus requires two extra values to be replicated. Those can be ignored right now. But they might pose a problem in the future, if some downstream starts using them. We will be forced to provide fake but sane values even if a future upstream implementation does not produce those values. Of course, we can't predict the future implementation enough to decide what an implementation-independent format would be. E.g. if pluggable storage were used to implement sequences, or if we got around to implementing distributed sequences, their shape can't be predicted right now. So a change in protocol seems to be unavoidable whatever we do. But starting with the bare minimum might save us from larger troubles. I think it's better to just replicate the nextval() and craft the representation on the subscriber so that it produces that nextval(). 3. Primary key sequences ----------------------------------- I have not experimented with this.
But I think we will need to add the sequences associated with primary keys to the publications publishing the owner tables. Otherwise, we will have problems with failover. And it needs to be done automatically, since a. the names of these sequences are generated automatically, and b. publications with FOR ALL TABLES will add tables automatically and start replicating the changes. Users may not be able to intercept the replication activity to make sure the associated sequences are also added to the publication. -- Best Wishes, Ashutosh Bapat
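As an illustration of the kind of implicitly-created sequence this is about (table and column names are placeholders, the sequence name is the default one PostgreSQL generates):

    CREATE TABLE t (id serial PRIMARY KEY, payload text);
    SELECT pg_get_serial_sequence('t', 'id');  -- returns public.t_id_seq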
Patch set needs a rebase, PFA rebased patch-set. The conflict was in commit "Add decoding of sequences to built-in replication", in files tablesync.c and 002_pg_dump.pl.

On Thu, May 18, 2023 at 7:53 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> Hi, Sorry for jumping late in this thread.
>
> I started experimenting with the functionality. Maybe something that was already discussed earlier. Given that the thread has been discussed for so long and has gone through several changes, revalidating the functionality is useful.
>
> I considered the following aspects:
>
> Changes to the sequence on subscriber
> -----------------------------------------------------
> 1. Since this is logical decoding, a logical replica is writable. So the logically replicated sequence can be manipulated on the subscriber as well. This implementation consolidates the changes on subscriber and publisher rather than replicating the publisher state as is. That's good. See example command sequence below:
> a. publisher calls nextval() - this sets the sequence state on publisher as (1, 32, t), which is replicated to the subscriber.
> b. subscriber calls nextval() once - this sets the sequence state on subscriber as (34, 32, t)
> c. subscriber calls nextval() 32 times - on-disk state of sequence doesn't change on subscriber
> d. subscriber calls nextval() 33 times - this sets the sequence state on subscriber as (99, 0, t)
> e. publisher calls nextval() 32 times - this sets the sequence state on publisher as (33, 0, t)
>
> The on-disk state on publisher at the end of e. is replicated to the subscriber, but the subscriber doesn't apply it. The state there is still (99, 0, t). I think this is closer to how logical replication of a sequence should look like. This is also good enough as long as we expect the replication of sequences to be used for failover and switchover.
>
> But it might not help if we want to consolidate the INSERTs that use nextval(). If we were to treat sequences as accumulating the increments, we might be able to resolve the conflicts by adjusting the column values considering the increments made on the subscriber. IIUC, conflict resolution is not part of built-in logical replication. So we may not want to go this route. But worth considering.
>
> Implementation agnostic decoded change
> --------------------------------------------------------
> The current method of decoding and replicating the sequences is tied to the implementation - it replicates the sequence row as is. If the implementation changes in future, we might need to revise the decoded presentation of the sequence. I think only nextval() matters for a sequence. So as long as we are replicating enough information to calculate the nextval, we should be good. The current implementation does that by replicating the last_value and is_called. is_called can be consolidated into last_value itself. The implemented protocol thus requires two extra values to be replicated. Those can be ignored right now. But they might pose a problem in future, if some downstream starts using them. We will be forced to provide fake but sane values even if a future upstream implementation does not produce those values. Of course we can't predict the future implementation enough to decide what would be an implementation-independent format. E.g. if a pluggable storage were to be used to implement sequences, or if we come around to implementing distributed sequences, their shape can't be predicted right now. So a change in protocol seems to be unavoidable whatever we do. But starting with the bare minimum might save us from larger troubles. I think it's better to just replicate the nextval() and craft the representation on the subscriber so that it produces that nextval().
>
> 3. Primary key sequences
> -----------------------------------
> I have not experimented with this. But I think we will need to add the sequences associated with the primary keys to the publications publishing the owner tables. Otherwise, we will have problems with the failover. And it needs to be done automatically, since a. the names of these sequences are generated automatically, b. publications with FOR ALL TABLES will add tables automatically and start replicating the changes. Users may not be able to intercept the replication activity to make sure the associated sequences are also added to the publication.
>
> -- Best Wishes, Ashutosh Bapat

-- Best Wishes, Ashutosh Bapat
Attachment
- 0001-Logical-decoding-of-sequences-20230613.patch
- 0002-make-test_decoding-ddl.out-shorter-20230613.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230613.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230613.patch
- 0005-add-interlock-with-ALTER-SEQUENCE-20230613.patch
- 0006-Support-LOCK-for-sequences-instead-of-funct-20230613.patch
- 0007-Reconstruct-the-right-state-from-the-on-dis-20230613.patch
- 0008-protocol-changes-20230613.patch
On 5/18/23 16:23, Ashutosh Bapat wrote:
> Hi, Sorry for jumping late in this thread.
>
> I started experimenting with the functionality. Maybe something that was already discussed earlier. Given that the thread has been discussed for so long and has gone through several changes, revalidating the functionality is useful.
>
> I considered the following aspects:
>
> Changes to the sequence on subscriber
> -----------------------------------------------------
> 1. Since this is logical decoding, a logical replica is writable. So the logically replicated sequence can be manipulated on the subscriber as well. This implementation consolidates the changes on subscriber and publisher rather than replicating the publisher state as is. That's good. See example command sequence below:
> a. publisher calls nextval() - this sets the sequence state on publisher as (1, 32, t), which is replicated to the subscriber.
> b. subscriber calls nextval() once - this sets the sequence state on subscriber as (34, 32, t)
> c. subscriber calls nextval() 32 times - on-disk state of sequence doesn't change on subscriber
> d. subscriber calls nextval() 33 times - this sets the sequence state on subscriber as (99, 0, t)
> e. publisher calls nextval() 32 times - this sets the sequence state on publisher as (33, 0, t)
>
> The on-disk state on publisher at the end of e. is replicated to the subscriber, but the subscriber doesn't apply it. The state there is still (99, 0, t). I think this is closer to how logical replication of a sequence should look like. This is also good enough as long as we expect the replication of sequences to be used for failover and switchover.

I'm really confused - are you describing what the patch is doing, or what you think it should be doing? Because right now there's nothing that'd "consolidate" the changes (in the sense of reconciling write conflicts), and there's absolutely no way to do that.

So if the subscriber advances the sequence (which it technically can), the subscriber state will eventually be discarded and overwritten when the next increment gets decoded from WAL on the publisher.

There's no way to fix this with this type of sequence - it requires some sort of global consensus (consensus on range assignment, locking or whatever), which we don't have.

If the sequence is the only thing replicated, this may go unnoticed. But chances are the user is also replicating the table with a PK populated by the sequence, at which point it'll lead to a constraint violation.

> But it might not help if we want to consolidate the INSERTs that use nextval(). If we were to treat sequences as accumulating the increments, we might be able to resolve the conflicts by adjusting the column values considering the increments made on the subscriber. IIUC, conflict resolution is not part of built-in logical replication. So we may not want to go this route. But worth considering.

We can't just adjust values in columns that may be used externally.

> Implementation agnostic decoded change
> --------------------------------------------------------
> The current method of decoding and replicating the sequences is tied to the implementation - it replicates the sequence row as is. If the implementation changes in future, we might need to revise the decoded presentation of the sequence. I think only nextval() matters for a sequence. So as long as we are replicating enough information to calculate the nextval, we should be good. The current implementation does that by replicating the last_value and is_called. is_called can be consolidated into last_value itself. The implemented protocol thus requires two extra values to be replicated. Those can be ignored right now. But they might pose a problem in future, if some downstream starts using them. We will be forced to provide fake but sane values even if a future upstream implementation does not produce those values. Of course we can't predict the future implementation enough to decide what would be an implementation-independent format. E.g. if a pluggable storage were to be used to implement sequences, or if we come around to implementing distributed sequences, their shape can't be predicted right now. So a change in protocol seems to be unavoidable whatever we do. But starting with the bare minimum might save us from larger troubles. I think it's better to just replicate the nextval() and craft the representation on the subscriber so that it produces that nextval().

Yes, I agree with this. It's probably better to replicate just the next value, without the log_cnt / is_called fields (which are implementation specific).

> 3. Primary key sequences
> -----------------------------------
> I have not experimented with this. But I think we will need to add the sequences associated with the primary keys to the publications publishing the owner tables. Otherwise, we will have problems with the failover. And it needs to be done automatically, since a. the names of these sequences are generated automatically, b. publications with FOR ALL TABLES will add tables automatically and start replicating the changes. Users may not be able to intercept the replication activity to make sure the associated sequences are also added to the publication.

Right, this idea was mentioned before, and I agree maybe we should consider adding some of those "automatic" sequences automatically.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
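To make the "just the next value" idea concrete, here is a minimal sketch (an illustration based on this discussion, not code from the patch) of how the on-disk (last_value, log_cnt, is_called) triple could be collapsed into a single replicated value, roughly (last_value + log_cnt):

    /*
     * Sketch only: collapse the sequence state into one replicated value.
     * If is_called is false, last_value itself has not been handed out yet;
     * otherwise values up to last_value + log_cnt may have been handed out
     * by the publisher without further WAL records, so the replicated value
     * must cover all of them.
     */
    static int64
    sequence_next_safe_value(int64 last_value, int64 log_cnt, bool is_called)
    {
        if (!is_called)
            return last_value;
        return last_value + log_cnt;
    }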
On Tue, Jun 13, 2023 at 11:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 5/18/23 16:23, Ashutosh Bapat wrote:
> > [...]
>
> I'm really confused - are you describing what the patch is doing, or what you think it should be doing? Because right now there's nothing that'd "consolidate" the changes (in the sense of reconciling write conflicts), and there's absolutely no way to do that.
>
> So if the subscriber advances the sequence (which it technically can), the subscriber state will eventually be discarded and overwritten when the next increment gets decoded from WAL on the publisher.

I described what I observed in my experiments. My observation doesn't agree with your description. I will revisit this when I review the output plugin changes and the WAL receiver changes.

> Yes, I agree with this. It's probably better to replicate just the next value, without the log_cnt / is_called fields (which are implementation specific).

Ok. I will review the logic once you revise the patches.

> > 3. Primary key sequences
> > [...]
>
> Right, this idea was mentioned before, and I agree maybe we should consider adding some of those "automatic" sequences automatically.
Are you planning to add this in the same patch set or separately?

I reviewed 0001 and related parts of 0004 and 0008 in detail.

I have only one major change request, about

typedef struct xl_seq_rec
{
    RelFileLocator locator;
+   bool created; /* creates a new relfilenode (CREATE/ALTER) */

I am not sure what the repercussions of adding a member to an existing WAL record are. I didn't see any code which handles the old WAL format which doesn't contain the "created" flag. IIUC, the logical decoding may come across a WAL record written in the old format after upgrade and restart. Is that not possible?

But I don't think it's necessary. We can add a decoding routine for RM_SMGR_ID. The decoding routine will add the relfilelocator in the XLOG_SMGR_CREATE record to the txn->sequences hash. The rest of the logic will work as is. Of course we will add non-sequence relfilelocators as well, but that should be fine. Creating a new relfilelocator shouldn't be a frequent operation. If at all we are worried about that, we can add only the relfilenodes associated with sequences to the hash table.

If this idea has been discussed earlier, please point me to the relevant discussion.

Some other minor comments and nitpicks.

  <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
  <function>stream_commit_cb</function>, and <function>stream_change_cb</function>
- are required, while <function>stream_message_cb</function> and
+ are required, while <function>stream_message_cb</function>,
+ <function>stream_sequence_cb</function> and

Like the non-streaming counterpart, should we also mention what happens if those callbacks are not defined? That applies to stream_message_cb and stream_truncate_cb too.

+ /*
+  * Make sure the subtransaction has a XID assigned, so that the sequence
+  * increment WAL record is properly associated with it. This matters for
+  * increments of sequences created/altered in the transaction, which are
+  * handled as transactional.
+  */
+ if (XLogLogicalInfoActive())
+     GetCurrentTransactionId();

GetCurrentTransactionId() will also assign xids to all the parents, so it doesn't seem necessary to call both GetTopTransactionId() and GetCurrentTransactionId(). Calling only the latter should suffice. Applies to all the calls to GetCurrentTransactionId().

+ memcpy(((char *) tuple->tuple.t_data),
+        data + sizeof(xl_seq_rec),
+        SizeofHeapTupleHeader);
+
+ memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
+        data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
+        datalen);

The memory chunks being copied in these memcpy calls are contiguous. Why don't we use a single memcpy? For readability?

+ * If we don't have snapshot or we are just fast-forwarding, there is no
+ * point in decoding messages.

s/decoding messages/decoding sequence changes/

+ tupledata = XLogRecGetData(r);
+ datalen = XLogRecGetDataLen(r);
+ tuplelen = datalen - SizeOfHeapHeader - sizeof(xl_seq_rec);
+
+ /* extract the WAL record, with "created" flag */
+ xlrec = (xl_seq_rec *) XLogRecGetData(r);

I think we should set tupledata = xlrec + sizeof(xl_seq_rec) so that it points to the actual tuple data. This will also simplify the calculations in DecodeSeqTuple().

+/* entry for hash table we use to track sequences created in running xacts */

s/running/transaction being decoded/ ?

+ /* search the lookup table (we ignore the return value, found is enough) */
+ ent = hash_search(rb->sequences,
+                   (void *) &rlocator,
+                   created ? HASH_ENTER : HASH_FIND,
+                   &found);

Misleading comment. We seem to be using the return value later.

+ /*
+  * When creating the sequence, remember the XID of the transaction
+  * that created id.
+  */
+ if (created)
+     ent->xid = xid;

Should we set ent->locator as well? The sequence won't get cleaned up otherwise.

+ TeardownHistoricSnapshot(false);
+
+ AbortCurrentTransaction();

This call to AbortCurrentTransaction() in PG_TRY should be made only if this block started the transaction?

+ PG_CATCH();
+ {
+     TeardownHistoricSnapshot(true);
+
+     AbortCurrentTransaction();

Shouldn't we do this only if this block started the transaction? And in that case, wouldn't PG_RE_THROW take care of it?

+/*
+ * Helper function for ReorderBufferProcessTXN for applying sequences.
+ */
+static inline void
+ReorderBufferApplySequence(ReorderBuffer *rb, ReorderBufferTXN *txn,
+                           Relation relation, ReorderBufferChange *change,
+                           bool streaming)

Possibly we should find a way to call this function from ReorderBufferQueueSequence() when processing a non-transactional sequence change. It should probably absorb logic common to both the cases.

+ if (RelationIsLogicallyLogged(relation))
+     ReorderBufferApplySequence(rb, txn, relation, change, streaming);

This condition is not used in ReorderBufferQueueSequence() when processing a non-transactional change there. Why?

+ if (len)
+ {
+     memcpy(data, &tup->tuple, sizeof(HeapTupleData));
+     data += sizeof(HeapTupleData);
+
+     memcpy(data, tup->tuple.t_data, len);
+     data += len;
+ }

We are just copying the sequence data. Shouldn't we copy the file locator as well, or is that not needed once the change has been queued? Similarly for ReorderBufferChangeSize().

+ /*
+  * relfilenode => XID lookup table for sequences created in a transaction
+  * (also includes altered sequences, which assigns new relfilenode)
+  */
+ HTAB *sequences;

Better renamed as seq_rel_locator or some such. Shouldn't this be part of ReorderBufferTXN, which has similar transaction-specific hashes?

I will continue reviewing the remaining patches.

-- Best Wishes, Ashutosh Bapat
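For illustration, the single-memcpy form suggested in the review might look like this (a sketch assuming the layout in the quoted hunk, where header and data are contiguous in both the source and the destination):

    memcpy((char *) tuple->tuple.t_data,
           data + sizeof(xl_seq_rec),
           SizeofHeapTupleHeader + datalen);    /* one copy instead of two */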
Regarding the patchsets, I think we will need to rearrange the commits. Right now 0004 has some parts that should have been in 0001. Also, the logic to assign an XID to a subtransaction would be better as a separate commit. That piece is independent of logical decoding of sequences.

On Fri, Jun 23, 2023 at 6:48 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> [...]

-- Best Wishes, Ashutosh Bapat
This is a review of the 0003 patch. Overall the patch looks good and helps understand the decoding logic better.

+ data
+----------------------------------------------------------------------------------------
+ BEGIN
+ sequence public.test_sequence: transactional:1 last_value: 1 log_cnt: 0 is_called:0
+ COMMIT

Looking at this output, I am wondering how this patch would work with DDL replication. I should have noticed this earlier, sorry. A sequence DDL has two parts, changes to the catalogs and changes to the data file. Support for replicating the data file changes is added by these patches. The catalog changes will need to be supported by the DDL replication patch. When applying the DDL changes, there are two ways: 1. just apply the catalog changes and let the support added here apply the data changes, 2. apply both the changes. If the second route is chosen, all the "transactional" decoding and application support added by this patch will need to be ripped out. That will make the "transactional" field in the protocol useless. It has the potential to waste bandwidth in future.

OTOH, I feel that waiting for the DDL replication patch set to be committed will cause this patchset to be delayed for an unknown duration. That's undesirable too.

One solution I see is to use the Storage RMID WAL again. While decoding it we send a message to the subscriber telling it that a new relfilenode is being allocated to a sequence. The subscriber too then allocates a new relfilenode to the sequence. The sequence data changes are decoded without the "transactional" flag; but they are decoded as transactional or non-transactional using the same logic as the current patch-set. The subscriber will always apply these changes to the relfilenode associated with the sequence at that point in time. This would have the same effect as the current patch-set. But then there is potential that the DDL replication patchset will render the Storage decoding useless. So not an option. But anyway, I will leave this as a comment as an alternative thought and discarded. Also this might trigger a better idea.

What do you think?

+-- savepoint test on table with serial column
+BEGIN;
+CREATE TABLE test_table (a SERIAL, b INT);
+INSERT INTO test_table (b) VALUES (100);
+INSERT INTO test_table (b) VALUES (200);
+SAVEPOINT a;
+INSERT INTO test_table (b) VALUES (300);
+ROLLBACK TO SAVEPOINT a;

The third implicit nextval won't be logged, so whether the subtransaction is rolled back or committed, it won't have much effect on the decoding. Adding a subtransaction around the first INSERT itself might be useful to test that the subtransaction rollback does not roll back the sequence changes.

After adding {'include_sequences', false} to the calls to pg_logical_slot_get_changes() in other tests, the SQL statement has grown beyond 80 characters. Need to split it into multiple lines.

  }
+ else if (strcmp(elem->defname, "include-sequences") == 0)
+ {
+
+     if (elem->arg == NULL)
+         data->include_sequences = false;

By default include_sequences = true. Shouldn't it then be set to true here?

After looking at the option processing code in pg_logical_slot_get_changes_guts(), it looks like an argument can never be NULL. But I see we have checks for NULL values of other arguments, so it's ok to keep a NULL check here.

I will look at 0004 next.

-- Best Wishes, Ashutosh Bapat
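For reference, the other boolean options in test_decoding's pg_decode_startup() treat a bare option (elem->arg == NULL) as true; a sketch of that convention applied to this branch (illustrative, not the patch's actual code):

    else if (strcmp(elem->defname, "include-sequences") == 0)
    {
        if (elem->arg == NULL)
            data->include_sequences = true;    /* bare option means "on" */
        else if (!parse_bool(strVal(elem->arg), &data->include_sequences))
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("could not parse value \"%s\" for parameter \"%s\"",
                            strVal(elem->arg), elem->defname)));
    }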
On 6/26/23 15:18, Ashutosh Bapat wrote:
> This is a review of the 0003 patch. Overall the patch looks good and helps understand the decoding logic better.
>
> + data
> +----------------------------------------------------------------------------------------
> + BEGIN
> + sequence public.test_sequence: transactional:1 last_value: 1 log_cnt: 0 is_called:0
> + COMMIT
>
> Looking at this output, I am wondering how this patch would work with DDL replication. I should have noticed this earlier, sorry. A sequence DDL has two parts, changes to the catalogs and changes to the data file. Support for replicating the data file changes is added by these patches. The catalog changes will need to be supported by the DDL replication patch. When applying the DDL changes, there are two ways: 1. just apply the catalog changes and let the support added here apply the data changes, 2. apply both the changes. If the second route is chosen, all the "transactional" decoding and application support added by this patch will need to be ripped out. That will make the "transactional" field in the protocol useless. It has the potential to waste bandwidth in future.

I don't understand why it would need to be ripped out. Why would it make the transactional behavior useless? Can you explain?

IMHO we replicate either changes (and then DDL replication does not interfere with that), or DDL (and then this patch should not interfere).

> OTOH, I feel that waiting for the DDL replication patch set to be committed will cause this patchset to be delayed for an unknown duration. That's undesirable too.
>
> One solution I see is to use the Storage RMID WAL again. While decoding it we send a message to the subscriber telling it that a new relfilenode is being allocated to a sequence. The subscriber too then allocates a new relfilenode to the sequence. The sequence data changes are decoded without the "transactional" flag; but they are decoded as transactional or non-transactional using the same logic as the current patch-set. The subscriber will always apply these changes to the relfilenode associated with the sequence at that point in time. This would have the same effect as the current patch-set. But then there is potential that the DDL replication patchset will render the Storage decoding useless. So not an option. But anyway, I will leave this as a comment as an alternative thought and discarded. Also this might trigger a better idea.
>
> What do you think?

I don't understand what the problem with DDL is, so I can't judge how this is supposed to solve it.

> +-- savepoint test on table with serial column
> +BEGIN;
> +CREATE TABLE test_table (a SERIAL, b INT);
> +INSERT INTO test_table (b) VALUES (100);
> +INSERT INTO test_table (b) VALUES (200);
> +SAVEPOINT a;
> +INSERT INTO test_table (b) VALUES (300);
> +ROLLBACK TO SAVEPOINT a;
>
> The third implicit nextval won't be logged, so whether the subtransaction is rolled back or committed, it won't have much effect on the decoding. Adding a subtransaction around the first INSERT itself might be useful to test that the subtransaction rollback does not roll back the sequence changes.
>
> After adding {'include_sequences', false} to the calls to pg_logical_slot_get_changes() in other tests, the SQL statement has grown beyond 80 characters. Need to split it into multiple lines.

>   }
> + else if (strcmp(elem->defname, "include-sequences") == 0)
> + {
> +
> +     if (elem->arg == NULL)
> +         data->include_sequences = false;
>
> By default include_sequences = true. Shouldn't it then be set to true here?

I don't follow. Is this still related to the DDL replication, or are you describing some new issue with savepoints?

> After looking at the option processing code in pg_logical_slot_get_changes_guts(), it looks like an argument can never be NULL. But I see we have checks for NULL values of other arguments, so it's ok to keep a NULL check here.
>
> I will look at 0004 next.

OK

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 26, 2023 at 8:35 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 6/26/23 15:18, Ashutosh Bapat wrote:
> > [...]
>
> I don't understand why it would need to be ripped out. Why would it make the transactional behavior useless? Can you explain?
>
> IMHO we replicate either changes (and then DDL replication does not interfere with that), or DDL (and then this patch should not interfere).
>
> > [...]
>
> I don't understand what the problem with DDL is, so I can't judge how this is supposed to solve it.

I have not looked at the DDL replication patch in detail, so I may be missing something. IIUC, that patch replicates the DDL statement in some form: parse tree or statement. But it doesn't replicate some or all of the WAL records that the DDL execution generates.

Consider DDL "ALTER SEQUENCE test_sequence RESTART WITH 4000;". It updates the catalogs with a new relfilenode and also the START VALUE. It also writes to the new relfilenode. When the publisher replicates the DDL and the subscriber applies it, it will do the same - update the catalogs and write to the new relfilenode. We don't want the sequence data to be replicated again when it's changed by a DDL. All the transactional changes are associated with a DDL. Other changes to the sequence data are non-transactional. So when replicating the sequence data changes, the "transactional" field becomes useless.

What I am pointing to is: if we add a "transactional" field in the protocol today and in future DDL replication is implemented in a way that makes the "transactional" field redundant, we have introduced a redundant field which will eat a byte on the wire. Of course we can remove it by bumping the protocol version, but that's some work. Please note we will still need the code to determine whether a change in sequence data is transactional or not, IOW whether it's associated with a DDL or not. So that code remains.

> > [...]
>
> I don't follow. Is this still related to the DDL replication, or are you describing some new issue with savepoints?

Not related to DDL replication. Not an issue with savepoints either. Just a comment about that particular change. Sorry for not being clear.

-- Best Wishes, Ashutosh Bapat
On Mon, Jun 26, 2023 at 8:35 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 6/26/23 15:18, Ashutosh Bapat wrote:
> > I will look at 0004 next.
>
> OK

0004 is quite large. I think if we split this into two or even three - 1. publication and subscription catalog handling, 2. built-in replication protocol changes - it might be easier to review. But anyway, I have given it one read. I have reviewed the parts which deal with the replication proper in detail. I have *not* thoroughly reviewed the parts which deal with the catalogs, pg_dump, describe and tab completion. Similarly tests. If those parts need a thorough review, please let me know.

But before jumping into the comments, a weird scenario I tried. On the publisher I created a table t1(a int, b int) and a sequence s, and added both to a publication. On the subscriber I swapped their names, i.e. created a table s(a int, b int) and a sequence t1, and subscribed to the publication. The subscription was created, and during replication it threw the errors "logical replication target relation "public.t1" is missing replicated columns: "a", "b"" and "logical replication target relation "public.s" is missing replicated columns: "last_value", "log_cnt", "is_called"". I think it's good that it at least threw an error. But it would be good if it detected that the reltypes themselves are different and mentioned that in the error. Something like "logical replication target public.s is not a sequence, unlike source public.s".

Comments on the patch itself.

I didn't find any mention of 'sequence' in the documentation of the publish option in CREATE or ALTER PUBLICATION. Something missing in the documentation? But do we really need to record "sequence" as an operation? Just adding the sequences to the publication should be fine, right? There's only one operation on sequences: updating the sequence row.

+CREATE VIEW pg_publication_sequences AS
+    SELECT
+        P.pubname AS pubname,
+        N.nspname AS schemaname,
+        C.relname AS sequencename

If we report oid or regclass for sequences, it might be easier to join the view further. We don't have reg* for publication, so we report both oid and name of the publication.

+/*
+ * Update the sequence state by modifying the existing sequence data row.
+ *
+ * This keeps the same relfilenode, so the behavior is non-transactional.
+ */
+static void
+SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called)

This function has some code similar to nextval, but with the sequence of operations (viz. changes to buffer, WAL insert and cache update) changed. Given the comments in nextval_internal(), the difference in the sequence of operations should not make a difference in the end result. But I think it will be good to deduplicate the code to avoid confusion and also for ease of maintenance.

+/*
+ * Update the sequence state by creating a new relfilenode.
+ *
+ * This creates a new relfilenode, to allow transactional behavior.
+ */
+static void
+SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called)

Need some deduplication here as well. But the similarities with AlterSequence, ResetSequence or DefineSequence are less.

@@ -730,9 +731,9 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 {
     /*
-     * Get the table list from publisher and build local table status
-     * info.
+     * Get the table and sequence list from publisher and build
+     * local relation sync status info.
      */
-    tables = fetch_table_list(wrconn, publications);
-    foreach(lc, tables)
+    relations = fetch_table_list(wrconn, publications);

Is it allowed to connect a newer subscriber to an old publisher? If yes, the query to fetch sequences will throw an error since it won't find the catalog.

@@ -882,8 +886,10 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
-    /* Get the table list from publisher. */
+    /* Get the list of relations from publisher. */
     pubrel_names = fetch_table_list(wrconn, sub->publications);
+    pubrel_names = list_concat(pubrel_names,
+                               fetch_sequence_list(wrconn, sub->publications));

Similarly here.

+void
+logicalrep_write_sequence(StringInfo out, Relation rel, TransactionId xid,
... snip ...
+    pq_sendint8(out, flags);
+    pq_sendint64(out, lsn);
... snip ...
+LogicalRepRelId
+logicalrep_read_sequence(StringInfo in, LogicalRepSequence *seqdata)
+{
... snip ...
+    /* XXX skipping flags and lsn */
+    pq_getmsgint(in, 1);
+    pq_getmsgint64(in);

We are ignoring these two fields on the WAL receiver side. I don't see such fields being part of INSERT, UPDATE or DELETE messages. Should we just drop those, or do they have some future use? Two LSNs are written by OutputPrepareWrite() as a prologue to the logical message. If this LSN is one of them, it could be dropped anyway.

+static void
+fetch_sequence_data(char *nspname, char *relname,
... snip ...
+    appendStringInfo(&cmd, "SELECT last_value, log_cnt, is_called\n"
+                     "  FROM %s", quote_qualified_identifier(nspname, relname));

We are using an undocumented interface here. SELECT ... FROM <sequence> is not documented. This code will break if we change the way a sequence is stored. That is quite unlikely but not impossible. Ideally we should use one of the methods documented at [1]. But none of them provide us what is needed per your comment in copy_sequence(), i.e. the state of the sequence as of the last WAL record on that sequence. So I don't have any better ideas than what's done in the patch. Maybe we can use "nextval() + 32" as an approximation.

Some minor comments and nitpicks:

@@ -1958,12 +1958,14 @@ get_object_address_publication_schema(List *object, bool missing_ok)

Need an update to the function prologue with the description of the third element. Also, the error message at the end of the function needs to mention the object type.

-    appendStringInfo(&buffer, _("publication of schema %s in publication %s"),
-                     nspname, pubname);
+    appendStringInfo(&buffer, _("publication of schema %s in publication %s type %s"),
+                     nspname, pubname, objtype);

s/type/for object type/ ?

@@ -5826,18 +5842,24 @@ getObjectIdentityParts(const ObjectAddress *object,
     break;
-    appendStringInfo(&buffer, "%s in publication %s",
-                     nspname, pubname);
+    appendStringInfo(&buffer, "%s in publication %s type %s",
+                     nspname, pubname, objtype);

s/type/object type/? ... in some other places as well?

+/*
+ * Check the character is a valid object type for schema publication.
+ *
+ * This recognizes either 't' for tables or 's' for sequences. Places that
+ * need to handle 'u' for unsupported relkinds need to do that explicitlyl

s/explicitlyl/explicitly/

+Datum
+pg_get_publication_sequences(PG_FUNCTION_ARGS)
+{
... snip ...
+    /*
+     * Publications support partitioned tables, although all changes are
+     * replicated using leaf partition identity and schema, so we only
+     * need those.
+     */

Not relevant here.

+    if (publication->allsequences)
+        sequences = GetAllSequencesPublicationRelations();
+    else
+    {
+        List *relids,
+             *schemarelids;
+
+        relids = GetPublicationRelations(publication->oid,
+                                         PUB_OBJTYPE_SEQUENCE,
+                                         publication->pubviaroot ?
+                                         PUBLICATION_PART_ROOT :
+                                         PUBLICATION_PART_LEAF);
+        schemarelids = GetAllSchemaPublicationRelations(publication->oid,
+                                                        PUB_OBJTYPE_SEQUENCE,
+                                                        publication->pubviaroot ?
+                                                        PUBLICATION_PART_ROOT :
+                                                        PUBLICATION_PART_LEAF);

I think we should just pass PUBLICATION_PART_ALL, since that parameter is irrelevant to sequences anyway. Otherwise this code would be confusing.

I think we should rename the PublicationTable structure to PublicationRelation, since it can now contain information about a table or a sequence, both of which are relations.

+/*
+ * Add or remove table to/from publication.

s/table/sequence/. Generally this applies to all the code, working for tables, copied and modified for sequences.

@@ -18826,6 +18867,30 @@ preprocess_pubobj_list(List *pubobjspec_list, core_yyscan_t yyscanner)
             errmsg("invalid schema name"),
             parser_errposition(pubobj->location));
     }
+    else if (pubobj->pubobjtype == PUBLICATIONOBJ_SEQUENCES_IN_SCHEMA ||
+             pubobj->pubobjtype == PUBLICATIONOBJ_SEQUENCES_IN_CUR_SCHEMA)
+    {
+        /* WHERE clause is not allowed on a schema object */
+        if (pubobj->pubtable && pubobj->pubtable->whereClause)
+            ereport(ERROR,
+                    errcode(ERRCODE_SYNTAX_ERROR),
+                    errmsg("WHERE clause not allowed for schema"),
+                    parser_errposition(pubobj->location));

The grammar doesn't allow specifying a whereClause with the ALL TABLES IN SCHEMA specification, but we have code to throw an error if that happens. We also have similar code for ALL SEQUENCES IN SCHEMA. Should we add it for the SEQUENCE specification as well?

+static void
+fetch_sequence_data(char *nspname, char *relname,
... snip ...
+    /* tablesync sets the sequences in non-transactional way */
+    SetSequence(RelationGetRelid(rel), false, last_value, log_cnt, is_called);

Why? In case of a regular table, if the sync fails, the table will retain its state before the sync. Similarly it would be expected that the sequence retains its state before the sync, no?

@@ -1467,10 +1557,21 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)

Now that it syncs sequences as well, should we rename this as LogicalRepSyncRelationStart?

+static void
+apply_handle_sequence(StringInfo s)
... snip ...
+    /*
+     * Commit the per-stream transaction (we only do this when not in
+     * remote transaction, i.e. for non-transactional sequence updates.)
+     */
+    if (!in_remote_transaction)
+        CommitTransactionCommand();

I understand the purpose of the if block. It commits the transaction that was started when applying a non-transactional sequence change. But I didn't understand the term "per-stream transaction".

@@ -5683,8 +5686,15 @@ RelationBuildPublicationDesc(Relation relation, PublicationDesc *pubdesc)

Thanks for the additional comments. Those are useful.

@@ -1716,28 +1716,19 @@ describeOneTableDetails(const char *schemaname,

I think these changes make it easy to print the publication description per the code changes later. But maybe we should commit the refactoring patch separately.
-DECLARE_UNIQUE_INDEX(pg_publication_namespace_pnnspid_pnpubid_index, 6239, PublicationNamespacePnnspidPnpubidIndexId, on pg_publication_namespace using btree(pnnspid oid_ops, pnpubid oid_ops)); +DECLARE_UNIQUE_INDEX(pg_publication_namespace_pnnspid_pnpubid_pntype_index, 8903, PublicationNamespacePnnspidPnpubidPntypeIndexId, on pg_publication_namespace using btree(pnnspid oid_ops, pnpubid oid_ops, pntype char_ops)); Why do we need a new OID? The old index should not be there in a cluster created using this version and hence this OID will not be used. [1] https://www.postgresql.org/docs/current/functions-sequence.html Next I will review 0005. -- Best Wishes, Ashutosh Bapat
0005, 0006 and 0007 are all related to the initial sequence sync. [3] resulted in 0007 and I think we need it. That leaves 0005 and 0006 to be reviewed in this response.

I followed the discussion starting at [1] till [2]. The second one mentions the interlock mechanism which has been implemented in 0005 and 0006. While I don't have an objection to allowing LOCKing a sequence using the LOCK command, I am not sure whether it will actually work or is even needed. The problem described in [1] seems to be the same as the problem described in [2]. In both cases we see the sequence moving backwards during CATCHUP. At the end of catchup the sequence is in the right state in both the cases. [2] actually deems this behaviour OK. I also agree that the behaviour is ok. I am confused about whether we have solved anything using the interlocking, and whether it's really needed.

I see that the idea of using an LSN to decide whether or not to apply a change to a sequence started in [4]. In [5] Tomas proposed to use the page LSN. Looking at [6], it actually seems like a good idea. In [7] Tomas agreed that the LSN won't be sufficient. But I don't understand why. There are three LSNs in the picture - the restart LSN of the sync slot, the confirmed_flush LSN of the sync slot, and the page LSN of the sequence page from where we read the initial state of the sequence. I think they can be used with the following rules:

1. The publisher will not send any changes with LSN less than confirmed_flush, so we are good there.
2. Any non-transactional changes that happened between confirmed_flush and the page LSN should be discarded while syncing. They are already visible to SELECT.
3. Any transactional changes with commit LSN between confirmed_flush and the page LSN should be discarded while syncing. They are already visible to SELECT.
4. A DDL acquires a lock on the sequence. Thus no other change to that sequence can have an LSN between the LSN of the change made by the DDL and the commit LSN of that transaction. Only DDL changes to a sequence are transactional. Hence any transactional changes with commit LSN beyond the page LSN would not have been seen by the SELECT, otherwise SELECT would see the page LSN committed by that transaction. So they need to be applied while syncing.
5. Any non-transactional changes beyond the page LSN should be applied. They are not seen by SELECT.

Am I missing something?

I don't have an idea how to get the page LSN via a SQL query (while also fetching the data on that page). That may or may not be a challenge.

[1] https://www.postgresql.org/message-id/c2799362-9098-c7bf-c315-4d7975acafa3%40enterprisedb.com
[2] https://www.postgresql.org/message-id/2d4bee7b-31be-8b36-2847-a21a5d56e04f%40enterprisedb.com
[3] https://www.postgresql.org/message-id/f5a9d63d-a6fe-59a9-d1ed-38f6a5582c13%40enterprisedb.com
[4] https://www.postgresql.org/message-id/CAA4eK1KUYrXFq25xyjBKU1UDh7Dkzw74RXN1d3UAYhd4NzDcsg%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAA4eK1LiA8nV_ZT7gNHShgtFVpoiOvwoxNsmP_fryP%3DPsYPvmA%40mail.gmail.com
[6] https://www.postgresql.org/docs/current/storage-page-layout.html

-- Best Wishes, Ashutosh Bapat
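As a sketch of how the page LSN might be obtained together with the data (hypothetical, reading the page at the C level with existing buffer-manager APIs rather than a SQL query, and assuming seqrel is the already-opened sequence relation; such a helper could then be exposed as a SQL function):

    Buffer      buf;
    Page        page;
    XLogRecPtr  page_lsn;

    buf = ReadBuffer(seqrel, 0);            /* a sequence has a single block */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);
    page_lsn = PageGetLSN(page);
    /* ... read the sequence tuple from the same page while holding the lock,
     * so the value and the LSN are guaranteed to be consistent ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);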
And the last patch, 0008.

@@ -1180,6 +1194,13 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
... snip ...
+    if (IsSet(opts.specified_opts, SUBOPT_SEQUENCES))
+    {
+        values[Anum_pg_subscription_subsequences - 1] =
+            BoolGetDatum(opts.sequences);
+        replaces[Anum_pg_subscription_subsequences - 1] = true;
+    }
+

The list of allowed options set a few lines above this code does not contain "sequences". Is this option missing there, or is this code unnecessary? If we intend to add "sequences" at a later time after a subscription is created, will the sequences be synced after ALTER SUBSCRIPTION?

+    /*
+     * ignore sequences when not requested
+     *
+     * XXX Maybe we should differentiate between "callbacks not defined" or
+     * "subscriber disabled sequence replication" and "subscriber does not
+     * know about sequence replication" (e.g. old subscriber version).
+     *
+     * For the first two it'd be fine to bail out here, but for the last it

It's not clear which two you are talking about. Maybe that's because the paragraph above is ambiguous. It is in the form of A or B and C, so it's not clear which cases we are differentiating between: (A, B, C), ((A or B) and C), (A or (B and C)), or something else.

+     * might be better to continue and error out only when the sequence
+     * would be replicated (e.g. as part of the publication). We don't know
+     * that here, unfortunately.

Please see comments on the changes to pgoutput_startup() below. We may want to change the paragraph accordingly.

@@ -298,6 +298,20 @@ StartupDecodingContext(List *output_plugin_options,
     */
     ctx->reorder->update_progress_txn = update_progress_txn_cb_wrapper;
+
+    /*
+     * To support logical decoding of sequences, we require the sequence
+     * callback. We decide it here, but only check it later in the wrappers.
+     *
+     * XXX Isn't it wrong to define only one of those callbacks? Say we
+     * only define the stream_sequence_cb() - that may get strange results
+     * depending on what gets streamed. Either none or both?

I don't think the current condition is correct; it will consider sequence changes to be streamable even when sequence_cb is not defined, and then actually not send those. sequence_cb is needed to send sequence changes irrespective of whether transaction streaming is supported. But stream_sequence_cb is required only if the other stream callbacks are available. Something like:

if (ctx->callbacks.sequence_cb)
{
    if (ctx->streaming)
    {
        if (ctx->callbacks.stream_sequence_cb == NULL)
            ctx->sequences = false;
        else
            ctx->sequences = true;
    }
    else
        ctx->sequences = true;
}
else
    ctx->sequences = false;

+     *
+     * XXX Shouldn't sequence be defined at slot creation time, similar
+     * to two_phase?

Probably not. I don't know why two_phase is defined at slot creation time, so I can't comment on this. But this looks like something we need to answer before committing the patches.

+    /*
+     * We allow decoding of sequences when the option is given at the streaming
+     * start, provided the plugin supports all the callbacks for two-phase.

s/two-phase/sequences/

+     *
+     * XXX Similar behavior to the two-phase block below.

I think we need to describe the sequence-specific behaviour instead of pointing to the two-phase one. two-phase is part of the replication slot's on-disk specification, but sequence is not. Given that it's an XXX, I think you are planning to do that.

+     *
+     * XXX Shouldn't this error out if the callbacks are not defined?

Isn't this already being done in pgoutput_startup()? Should we remove this XXX?

+    /*
+     * Here, we just check whether the sequences decoding option is passed
+     * by plugin and decide whether to enable it at later point of time. It
+     * remains enabled if the previous start-up has done so. But we only
+     * allow the option to be passed in with sufficient version of the
+     * protocol, and when the output plugin supports it.
+     */
+    if (!data->sequences)
+        ctx->sequences_opt_given = false;
+    else if (data->protocol_version < LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("requested proto_version=%d does not support sequences, need %d or higher",
+                        data->protocol_version, LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)));
+    else if (!ctx->sequences)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("sequences requested, but not supported by output plugin")));

If a given output plugin doesn't implement the callbacks but the subscription specifies sequences, the code will throw an error whether or not the publication is publishing sequences. Instead, I think the behaviour should be the same as the case when the publication doesn't include sequences even though the publisher node has sequences. In either case the publisher (the plugin or the publication) doesn't want to publish sequence data, so the subscriber's request can be ignored. What might be good is to throw an error if the publication publishes the sequences but there are no callbacks - both the output plugin and the publication are part of the publisher node, thus it's easy for users to set them up consistently. GetPublicationRelations can be tweaked a bit to return just tables or sequences. That, along with the publication's all-sequences flag, should tell us whether the publication publishes any sequences or not.

That ends my first round of reviews.

-- Best Wishes, Ashutosh Bapat
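The proposed logic above can also be condensed into a single expression (a sketch, equivalent to the if/else form):

    ctx->sequences = (ctx->callbacks.sequence_cb != NULL) &&
        (!ctx->streaming || ctx->callbacks.stream_sequence_cb != NULL);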
On Tue, Jun 27, 2023 at 11:30 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I have not looked at the DDL replication patch in detail so I may be
> missing something. IIUC, that patch replicates the DDL statement in
> some form: parse tree or statement. But it doesn't replicate some or
> all of the WAL records that the DDL execution generates.
>

Yes, the DDL replication patch uses the parse tree and catalog
information to generate a deparsed form of the DDL statement, which is
WAL-logged and used to replicate DDLs.

--
With Regards,
Amit Kapila.
Hi,

here's a rebased and significantly reworked version of this patch
series, based on the recent reviews and discussion. Let me go through
the main differences:

1) reorder the patches to have the "shortening" of test output first

2) merge the various "fix" patches into the three main patches

0002 - introduce sequence decoding infrastructure
0003 - add sequences to test_decoding
0004 - add sequences to built-in replication

I've kept those patches separate to make the evolution easier to follow
and discuss, but it was necessary to clean up the patch series and make
it clearer what the current state is.

3) simplify the replicated state

As suggested by Ashutosh, it may not be a good idea to replicate the
(last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to
our internal implementation, which may not be the right thing for other
plugins. So this new patch replicates just "value", which is pretty much
(last_value + log_cnt), representing the next value that should be safe
to generate on the subscriber (in case of a failover). A sketch of this
is included below.

4) simplify test_decoding code & tests

I realized I can ditch some of the test_decoding changes, because at
some point we chose to only include sequences in test_decoding when
explicitly requested. So the tests don't need to disable that; it's the
other way around - one test needs to enable it. This now also prints the
single value, instead of the three values.

5) minor tweaks in the built-in replication

This adopts the relaxed LOCK code to allow locking sequences during the
initial sync, and also adopts the replication of a single value (this
affects the "apply" side of that change too).

6) simplified protocol versioning

The main open question I had was what to do about protocol versioning
for the built-in replication - how to decide whether the subscriber can
apply sequences, and what should happen if we decode a sequence but the
subscriber does not support that.

I was not entirely sure we want to handle this by a simple version
check, because that maps capabilities to a linear scale, which seems
pretty limiting. That is, each protocol version just grows, and a new
version number means support for a new capability - like replication of
two-phase commits, or sequences. Which is nice, but it does not allow
supporting just the latter feature, for example - you can't skip one.
Which is why 2PC decoding has both a version and a subscription flag,
which allows exactly that ...

When discussing this off-list with Peter Eisentraut, he reminded me of
his old message in the thread:

https://www.postgresql.org/message-id/8046273f-ea88-5c97-5540-0ccd5d244fd4@enterprisedb.com

where he advocates for exactly this simplified behavior. So I took a
stab at it, and 0005 should be doing that. I keep it as a separate patch
for now, to make the changes clearer, but ultimately it should be merged
into the 0003 and 0004 parts.

It's not a particularly complex change. It mostly ditches the
subscription option (which also means columns in the pg_subscription
catalog) and a flag in the decoding context. But the main change is in
pgoutput_sequence(), where we check protocol_version and error out if
it's not the right version (instead of just ignoring the sequence).

AFAICS this behaves as expected - with a PG15 subscriber, I get an ERROR
on the publisher side from the sequence callback.

But it now occurred to me we could do the same thing with the original
approach - allow the per-subscription "sequences" flag, but error out
when the subscriber did not enable that capability ...
Hopefully, I haven't forgotten to address any important point from the reviews ... The one thing I'm not really sure about is how it interferes with the replication of DDL. But in principle, if it decodes DDL for ALTER SEQUENCE, I don't see why it would be a problem that we then decode and replicate the WAL for the sequence state. But if it is a problem, we should be able to skip this WAL record with the initial sequence state (which I think should be possible thanks to the "created" flag this patch adds to the WAL record). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
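Regarding (3) above, here is roughly what the replicated state boils
down to - a minimal sketch, assuming the fields of Form_pg_sequence_data
(commands/sequence.h); the helper name is made up, and the is_called
corner case is glossed over:

    #include "commands/sequence.h"      /* Form_pg_sequence_data */

    /*
     * Sketch only: the single "value" replicated for a sequence, i.e.
     * the next value that should be safe to generate on the subscriber
     * after a failover.  last_value is the last value written to disk,
     * log_cnt the number of values still covered by the last WAL record;
     * their sum is the first value guaranteed not to have been handed
     * out yet.
     */
    static int64
    sequence_replicated_value(Form_pg_sequence_data seq)
    {
        return seq->last_value + seq->log_cnt;
    }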
Attachment
Thanks for the updated patches. I haven't looked at the patches yet but have some responses below. On Thu, Jul 13, 2023 at 12:35 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > 3) simplify the replicated state > > As suggested by Ashutosh, it may not be a good idea to replicate the > (last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to > our internal implementation. Which may not be the right thing for other > plugins. So this new patch replicates just "value" which is pretty much > (last_value + log_cnt), representing the next value that should be safe > to generate on the subscriber (in case of a failover). > Thanks. That will help. > 5) minor tweaks in the built-in replication > > This adopts the relaxed LOCK code to allow locking sequences during the > initial sync, and also adopts the replication of a single value (this > affects the "apply" side of that change too). > I think the problem we are trying to solve with LOCK is not actually getting solved. See [2]. Instead your earlier idea of using page LSN looks better. > > 6) simplified protocol versioning I had tested the cross-version logical replication with older set of patches. Didn't see any unexpected behaviour then. I will test again. > > The one thing I'm not really sure about is how it interferes with the > replication of DDL. But in principle, if it decodes DDL for ALTER > SEQUENCE, I don't see why it would be a problem that we then decode and > replicate the WAL for the sequence state. But if it is a problem, we > should be able to skip this WAL record with the initial sequence state > (which I think should be possible thanks to the "created" flag this > patch adds to the WAL record). I had suggested a solution in [1] to avoid adding a flag to the WAL record. Did you consider it? If you considered it and rejected, I would be interested in knowing reasons behind rejecting it. Let me repeat here again: ``` We can add a decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work as is. Of course we will add non-sequence relfilelocators as well but that should be fine. Creating a new relfilelocator shouldn't be a frequent operation. If at all we are worried about that, we can add only the relfilenodes associated with sequences to the hash table. ``` If the DDL replication takes care of replicating and applying sequence changes, I think we don't need the changes tracking "transactional" sequence changes in this patch-set. That also makes a case for not adding a new field to WAL which may not be used. [1] https://www.postgresql.org/message-id/CAExHW5v_vVqkhF4ehST9EzpX1L3bemD1S%2BkTk_-ZVu_ir-nKDw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAExHW5vHRgjWzi6zZbgCs97eW9U7xMtzXEQK%2BaepuzoGDsDNtg%40mail.gmail.com -- Best Wishes, Ashutosh Bapat
On 6/23/23 15:18, Ashutosh Bapat wrote:
> ...
>
> I reviewed 0001 and related parts of 0004 and 0008 in detail.
>
> I have only one major change request, about
> typedef struct xl_seq_rec
> {
>     RelFileLocator locator;
> +   bool created; /* creates a new relfilenode (CREATE/ALTER) */
>
> I am not sure what are the repercussions of adding a member to an existing WAL
> record. I didn't see any code which handles the old WAL format which doesn't
> contain the "created" flag. IIUC, the logical decoding may come across
> a WAL record written in the old format after upgrade and restart. Is
> that not possible?
>

I don't understand why adding a new field to xl_seq_rec would be an
issue, considering it's done in a new major version. Sure, if you
generate WAL with an old build, and start with a patched version, that
would break things. But that's true for many other patches, and it's
irrelevant for releases.

> But I don't think it's necessary. We can add a
> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> as is. Of course we will add non-sequence relfilelocators as well but that
> should be fine. Creating a new relfilelocator shouldn't be a frequent
> operation. If at all we are worried about that, we can add only the
> relfilenodes associated with sequences to the hash table.
>

Hmmmm, that might work. I feel a bit uneasy about having to keep all
relfilenodes, not just sequences ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/5/23 16:51, Ashutosh Bapat wrote: > 0005, 0006 and 0007 are all related to the initial sequence sync. [3] > resulted in 0007 and I think we need it. That leaves 0005 and 0006 to > be reviewed in this response. > > I followed the discussion starting [1] till [2]. The second one > mentions the interlock mechanism which has been implemented in 0005 > and 0006. While I don't have an objection to allowing LOCKing a > sequence using the LOCK command, I am not sure whether it will > actually work or is even needed. > > The problem described in [1] seems to be the same as the problem > described in [2]. In both cases we see the sequence moving backwards > during CATCHUP. At the end of catchup the sequence is in the right > state in both the cases. [2] actually deems this behaviour OK. I also > agree that the behaviour is ok. I am confused whether we have solved > anything using interlocking and it's really needed. > > I see that the idea of using an LSN to decide whether or not to apply > a change to sequence started in [4]. In [5] Tomas proposed to use page > LSN. Looking at [6], it actually seems like a good idea. In [7] Tomas > agreed that LSN won't be sufficient. But I don't understand why. There > are three LSNs in the picture - restart LSN of sync slot, > confirmed_flush LSN of sync slot and page LSN of the sequence page > from where we read the initial state of the sequence. I think they can > be used with the following rules: > 1. The publisher will not send any changes with LSN less than > confirmed_flush so we are good there. > 2. Any non-transactional changes that happened between confirmed_flush > and page LSN should be discarded while syncing. They are already > visible to SELECT. > 3. Any transactional changes with commit LSN between confirmed_flush > and page LSN should be discarded while syncing. They are already > visible to SELECT. > 4. A DDL acquires a lock on sequence. Thus no other change to that > sequence can have an LSN between the LSN of the change made by DDL and > the commit LSN of that transaction. Only DDL changes to sequence are > transactional. Hence any transactional changes with commit LSN beyond > page LSN would not have been seen by the SELECT otherwise SELECT would > see the page LSN committed by that transaction. so they need to be > applied while syncing. > 5. Any non-transactional changes beyond page LSN should be applied. > They are not seen by SELECT. > > Am I missing something? > Hmmm, I think you're onto something and the interlock may not be actually necessary ... IIRC there were two examples of the non-MVCC sequence behavior, leading me to add the interlock. 1) going "backwards" during catchup Sequences are not MVCC, and if there are increments between the slot creation and the SELECT, the sequence will go backwards. But it will ultimately end with the correct value. The LSN checks were an attempt to prevent this. I don't recall why I concluded this would not be sufficient (there's no link for [7] in your message), but maybe it was related to the sequence increments not being WAL-logged and thus not guaranteed to update the page LSN, or something like that. But if we agree we only guarantee consistency at the end of the catchup, this does not matter - it's OK to go backwards as long as the sequence ends with the correct value. 
2) missing an increment because of ALTER SEQUENCE My concern here was that we might have a transaction that does ALTER SEQUENCE before the tablesync slot gets created, and the SELECT still sees the old sequence state because we start decoding after the ALTER. But now that I think about it again, this probably can't happen, because the slot won't be created until the ALTER commits. So we shouldn't miss anything. I suspect I got confused by some other bug in the patch at that time, leading me to a faulty conclusion. I'll try removing the interlock, and make sure it actually works OK. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
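If the interlock indeed goes away, the catchup decision could reduce to
a page-LSN comparison, along the lines of this sketch (both parameter
names are made up; the rules are the ones listed above):

    #include "access/xlogdefs.h"        /* XLogRecPtr */

    /*
     * Sketch only: decide whether to apply a decoded sequence change
     * during catchup.  page_lsn is the LSN of the sequence page read by
     * the initial sync's SELECT, change_lsn the LSN of the decoded
     * change.  Changes up to page_lsn were already visible to that
     * SELECT, so they can be discarded; anything beyond it must be
     * applied.
     */
    static bool
    should_apply_sequence_change(XLogRecPtr page_lsn, XLogRecPtr change_lsn)
    {
        return (change_lsn > page_lsn);
    }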
On 7/13/23 16:24, Ashutosh Bapat wrote: > Thanks for the updated patches. I haven't looked at the patches yet > but have some responses below. > > On Thu, Jul 13, 2023 at 12:35 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >> >> 3) simplify the replicated state >> >> As suggested by Ashutosh, it may not be a good idea to replicate the >> (last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to >> our internal implementation. Which may not be the right thing for other >> plugins. So this new patch replicates just "value" which is pretty much >> (last_value + log_cnt), representing the next value that should be safe >> to generate on the subscriber (in case of a failover). >> > > Thanks. That will help. > > >> 5) minor tweaks in the built-in replication >> >> This adopts the relaxed LOCK code to allow locking sequences during the >> initial sync, and also adopts the replication of a single value (this >> affects the "apply" side of that change too). >> > > I think the problem we are trying to solve with LOCK is not actually > getting solved. See [2]. Instead your earlier idea of using page LSN > looks better. > Thanks. I think you may be right, and the interlock may not be necessary. I've responded to the linked threads, that's probably easier to follow as it keeps the context. >> >> 6) simplified protocol versioning > > I had tested the cross-version logical replication with older set of > patches. Didn't see any unexpected behaviour then. I will test again. >> I think the question is what's the expected behavior. What behavior did you expect/observe? IIRC with the previous version of the patch, if you connected an old subscriber (without sequence replication), it just ignored/skipped the sequence increments and replicated the other changes. The new patch detects that, and triggers ERROR on the publisher. And I think that's the correct thing to do. There was a lengthy discussion about making this more flexible (by not tying this to "linear" protocol version) and/or permissive. I tried doing that by doing similar thing to decoding of 2PC, which allows choosing when creating a subscription. But ultimately that just chooses where to throw an error - whether on the publisher (in the output plugin callback) or on apply side (when trying to apply change to non-existent sequence). I still think it might be useful to have these "capabilities" orthogonal to the protocol version, but it's a matter for a separate patch. It's enough not to fail with "unknown message" on the subscriber. >> The one thing I'm not really sure about is how it interferes with the >> replication of DDL. But in principle, if it decodes DDL for ALTER >> SEQUENCE, I don't see why it would be a problem that we then decode and >> replicate the WAL for the sequence state. But if it is a problem, we >> should be able to skip this WAL record with the initial sequence state >> (which I think should be possible thanks to the "created" flag this >> patch adds to the WAL record). > > I had suggested a solution in [1] to avoid adding a flag to the WAL > record. Did you consider it? If you considered it and rejected, I > would be interested in knowing reasons behind rejecting it. Let me > repeat here again: > > ``` > We can add a > decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator > in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work > as is. Of course we will add non-sequence relfilelocators as well but that > should be fine. 
Creating a new relfilelocator shouldn't be a frequent > operation. If at all we are worried about that, we can add only the > relfilenodes associated with sequences to the hash table. > ``` > Thanks for reminding me. In principle I'm not against using the proposed approach - tracking all relfilenodes created by a transaction, although I don't think the new flag in xl_seq_rec is a problem, and it's probably cheaper than having to decode all relfilenode creations. > If the DDL replication takes care of replicating and applying sequence > changes, I think we don't need the changes tracking "transactional" > sequence changes in this patch-set. That also makes a case for not > adding a new field to WAL which may not be used. > Maybe, but the DDL replication patch is not there yet, and I'm not sure it's a good idea to make this patch wait for a much larger/complex patch. If the DDL replication patch gets committed, it may ditch this part (assuming it happens in the same development cycle). However, my impression was DDL replication would be optional. In which case we still need to handle the transactional case, to support sequence replication without DDL replication enabled. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 13, 2023 at 8:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 6/23/23 15:18, Ashutosh Bapat wrote:
> > ...
> >
> > I reviewed 0001 and related parts of 0004 and 0008 in detail.
> >
> > I have only one major change request, about
> > typedef struct xl_seq_rec
> > {
> >     RelFileLocator locator;
> > +   bool created; /* creates a new relfilenode (CREATE/ALTER) */
> >
> > I am not sure what are the repercussions of adding a member to an existing WAL
> > record. I didn't see any code which handles the old WAL format which doesn't
> > contain the "created" flag. IIUC, the logical decoding may come across
> > a WAL record written in the old format after upgrade and restart. Is
> > that not possible?
> >
>
> I don't understand why adding a new field to xl_seq_rec would be an
> issue, considering it's done in a new major version. Sure, if you
> generate WAL with an old build, and start with a patched version, that
> would break things. But that's true for many other patches, and it's
> irrelevant for releases.

There are two issues:

1. The name of the field "created" - what does "created" mean in a
"sequence status" WAL record? Consider the following sequence of events:

    Begin;
    Create sequence ('s');
    select nextval('s') from generate_series(1, 1000);
    ...
    commit

This is going to create 1000/32 WAL records with "created" = true. But
only the first one created the relfilenode. We might fix this little
annoyance by changing the name to "transactional".

2. Consider the following scenario:

- v15 running logical decoding has restart_lsn before a "sequence
  change" WAL record written in the old format
- stop the server
- upgrade to v16
- logical decoding will start from restart_lsn pointing to a WAL record
  written by v15

When it tries to read the "sequence change" WAL record it won't be able
to get the "created" flag. Am I missing something here?

> > But I don't think it's necessary. We can add a
> > decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> > in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> > as is. Of course we will add non-sequence relfilelocators as well but that
> > should be fine. Creating a new relfilelocator shouldn't be a frequent
> > operation. If at all we are worried about that, we can add only the
> > relfilenodes associated with sequences to the hash table.
> >
>
> Hmmmm, that might work. I feel a bit uneasy about having to keep all
> relfilenodes, not just sequences ...

From the relfilenode it should be easy to get to the rel and then see if
it's a sequence. Only add relfilenodes for sequences.

--
Best Wishes,
Ashutosh Bapat
On Thu, Jul 13, 2023 at 9:47 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> >>
> >> 6) simplified protocol versioning
> >
> > I had tested the cross-version logical replication with older set of
> > patches. Didn't see any unexpected behaviour then. I will test again.
> >>
>
> I think the question is what's the expected behavior. What behavior did
> you expect/observe?

Let me run my test again and respond.

>
> IIRC with the previous version of the patch, if you connected an old
> subscriber (without sequence replication), it just ignored/skipped the
> sequence increments and replicated the other changes.

I liked that.

>
> The new patch detects that, and triggers ERROR on the publisher. And I
> think that's the correct thing to do.

With this behaviour users will never be able to setup logical
replication between old and new servers, considering almost every setup
has sequences.

>
> There was a lengthy discussion about making this more flexible (by not
> tying this to "linear" protocol version) and/or permissive. I tried
> doing that by doing similar thing to decoding of 2PC, which allows
> choosing when creating a subscription.
>
> But ultimately that just chooses where to throw an error - whether on
> the publisher (in the output plugin callback) or on apply side (when
> trying to apply change to non-existent sequence).

I had some comments on throwing an error in [1], esp. towards the end.

>
> I still think it might be useful to have these "capabilities" orthogonal
> to the protocol version, but it's a matter for a separate patch. It's
> enough not to fail with "unknown message" on the subscriber.

Yes, we should avoid breaking replication with "unknown message". I also
agree that improving things in this area can be done in a separate
patch, but as far as possible in this release itself.

> > If the DDL replication takes care of replicating and applying sequence
> > changes, I think we don't need the changes tracking "transactional"
> > sequence changes in this patch-set. That also makes a case for not
> > adding a new field to WAL which may not be used.
> >
>
> Maybe, but the DDL replication patch is not there yet, and I'm not sure
> it's a good idea to make this patch wait for a much larger/complex
> patch. If the DDL replication patch gets committed, it may ditch this
> part (assuming it happens in the same development cycle).
>
> However, my impression was DDL replication would be optional. In which
> case we still need to handle the transactional case, to support sequence
> replication without DDL replication enabled.

As I said before, I don't think this patchset needs to wait for the DDL
replication patch. Let's hope that the latter lands in the same release
and straightens the protocol, instead of carrying it forever.

[1] https://www.postgresql.org/message-id/CAExHW5vScYKKb0RZoiNEPfbaQ60hihfuWeLuZF4JKrwPJXPcUw%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
On 7/14/23 09:34, Ashutosh Bapat wrote: > On Thu, Jul 13, 2023 at 9:47 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >>>> >>>> 6) simplified protocol versioning >>> >>> I had tested the cross-version logical replication with older set of >>> patches. Didn't see any unexpected behaviour then. I will test again. >>>> >> >> I think the question is what's the expected behavior. What behavior did >> you expect/observe? > > Let me run my test again and respond. > >> >> IIRC with the previous version of the patch, if you connected an old >> subscriber (without sequence replication), it just ignored/skipped the >> sequence increments and replicated the other changes. > > I liked that. > I liked that too, initially (which is why I did it that way). But I changed my mind, because it's likely to cause more harm than good. >> >> The new patch detects that, and triggers ERROR on the publisher. And I >> think that's the correct thing to do. > > With this behaviour users will never be able to setup logical > replication between old and new servers considering almost every setup > has sequences. > That's not true. Replication to older versions works fine as long as the publication does not include sequences (which need to be added explicitly). If you have a publication with sequences, you clearly want to replicate them, ignoring it is just confusing "magic". If you have a publication with sequences and still want to replicate to an older server, create a new publication without sequences. >> >> There was a lengthy discussion about making this more flexible (by not >> tying this to "linear" protocol version) and/or permissive. I tried >> doing that by doing similar thing to decoding of 2PC, which allows >> choosing when creating a subscription. >> >> But ultimately that just chooses where to throw an error - whether on >> the publisher (in the output plugin callback) or on apply side (when >> trying to apply change to non-existent sequence). > > I had some comments on throwing error in [1], esp. towards the end. > Yes. You said: If a given output plugin doesn't implement the callbacks but subscription specifies sequences, the code will throw an error whether or not publication is publishing sequences. This refers to situation when the subscriber says "sequences" when opening the connection. And this happens *in the plugin* which also defines the callbacks, so I don't see how we could not have the callbacks defined ... Furthermore, the simplified protocol versioning does away with the "sequences" option, so in that case this can't even happen. >> >> I still think it might be useful to have these "capabilities" orthogonal >> to the protocol version, but it's a matter for a separate patch. It's >> enough not to fail with "unknown message" on the subscriber. > > Yes, We should avoid breaking replication with "unknown message". > > I also agree that improving things in this area can be done in a > separate patch, but as far as possible in this release itself. > >>> If the DDL replication takes care of replicating and applying sequence >>> changes, I think we don't need the changes tracking "transactional" >>> sequence changes in this patch-set. That also makes a case for not >>> adding a new field to WAL which may not be used. >>> >> >> Maybe, but the DDL replication patch is not there yet, and I'm not sure >> it's a good idea to make this patch wait for a much larger/complex >> patch. 
If the DDL replication patch gets committed, it may ditch this >> part (assuming it happens in the same development cycle). >> >> However, my impression was DDL replication would be optional. In which >> case we still need to handle the transactional case, to support sequence >> replication without DDL replication enabled. > > As I said before, I don't think this patchset needs to wait for DDL > replication patch. Let's hope that the later lands in the same release > and straightens protocol instead of carrying it forever. > OK, I agree with that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/14/23 08:51, Ashutosh Bapat wrote: > On Thu, Jul 13, 2023 at 8:29 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 6/23/23 15:18, Ashutosh Bapat wrote: >>> ... >>> >>> I reviewed 0001 and related parts of 0004 and 0008 in detail. >>> >>> I have only one major change request, about >>> typedef struct xl_seq_rec >>> { >>> RelFileLocator locator; >>> + bool created; /* creates a new relfilenode (CREATE/ALTER) */ >>> >>> I am not sure what are the repercussions of adding a member to an existing WAL >>> record. I didn't see any code which handles the old WAL format which doesn't >>> contain the "created" flag. IIUC, the logical decoding may come across >>> a WAL record written in the old format after upgrade and restart. Is >>> that not possible? >>> >> >> I don't understand why would adding a new field to xl_seq_rec be an >> issue, considering it's done in a new major version. Sure, if you >> generate WAL with old build, and start with a patched version, that >> would break things. But that's true for many other patches, and it's >> irrelevant for releases. > > There are two issues > 1. the name of the field "created" - what does created mean in a > "sequence status" WAL record? Consider following sequence of events > Begin; > Create sequence ('s'); > select nextval('s') from generate_series(1, 1000); > > ... > commit > > This is going to create 1000/32 WAL records with "created" = true. But > only the first one created the relfilenode. We might fix this little > annoyance by changing the name to "transactional". > I don't think that's true - this will create 1 record with "created=true" (the one right after the CREATE SEQUENCE) and the rest will have "created=false". I realized I haven't modified seq_desc to show this flag, so I did that in the updated patch version, which makes this easy to see. And all of them need to be handled in a transactional way, because they modify relfilenode visible only to that transaction. So calling the flag "transactional" would be misleading, because the increments can be transactional even with "created=false". > 2. Consider following scenario > v15 running logical decoding has restart_lsn before a "sequence > change" WAL record written in old format > stop the server > upgrade to v16 > logical decoding will stat from restart_lsn pointing to a WAL record > written by v15. When it tries to read "sequence change" WAL record it > won't be able to get "created" flag. > > Am I missing something here? > You're missing the fact that pg_upgrade does not copy replication slots, so the restart_lsn does not matter. (Yes, this is pretty annoying consequence of using pg_upgrade. And maybe we'll improve that in the future - but I'm pretty sure we won't allow decoding old WAL.) >> >>> But I don't think it's necessary. We can add a >>> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator >>> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work >>> as is. Of course we will add non-sequence relfilelocators as well but that >>> should be fine. Creating a new relfilelocator shouldn't be a frequent >>> operation. If at all we are worried about that, we can add only the >>> relfilenodes associated with sequences to the hash table. >>> >> >> Hmmmm, that might work. I feel a bit uneasy about having to keep all >> relfilenodes, not just sequences ... > > From relfilenode it should be easy to get to rel and then see if it's > a sequence. Only add relfilenodes for the sequence. > Will try. 
Attached is an updated version with pg_waldump printing the "created" flag in seq_desc, and removing the unnecessary interlock. I've kept the protocol changes in a separate commit for now. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Fri, Jul 14, 2023 at 3:59 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> >>
> >> The new patch detects that, and triggers ERROR on the publisher. And I
> >> think that's the correct thing to do.
> >
> > With this behaviour users will never be able to setup logical
> > replication between old and new servers considering almost every setup
> > has sequences.
> >
>
> That's not true.
>
> Replication to older versions works fine as long as the publication does
> not include sequences (which need to be added explicitly). If you have a
> publication with sequences, you clearly want to replicate them, ignoring
> it is just confusing "magic".

I was looking at it from a different angle. Publishers publish what
they want, subscribers choose what they want, and what gets replicated
is the intersection of these two sets. Both live happily.

But I am fine with that too. It's just that users need to create more
publications.

>
> If you have a publication with sequences and still want to replicate to
> an older server, create a new publication without sequences.
>

I tested the current patches with the subscriber at PG 14 and the
publisher at master + these patches. I created one table and a sequence
on both publisher and subscriber. I created two publications, one with
the sequence and the other without it. Both have the table in it. When
the subscriber subscribes to the publication with the sequence, the
following ERROR is repeated in the subscriber logs and nothing gets
replicated:

```
[2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOG: 00000: logical replication apply worker for subscription "sub5433" has started
[2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOCATION: ApplyWorkerMain, worker.c:3169
[2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] ERROR: 08P01: could not receive data from WAL stream: ERROR: protocol version does not support sequence replication
        CONTEXT: slot "sub5433", output plugin "pgoutput", in the sequence callback, associated LSN 0/1513718
[2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] LOCATION: libpqrcv_receive, libpqwalreceiver.c:818
[2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOG: 00000: background worker "logical replication worker" (PID 916293) exited with exit code 1
[2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOCATION: LogChildExit, postmaster.c:3737
```

When the subscriber subscribes to the publication without the sequence,
things work normally.

The cross-version replication is working as expected then.

--
Best Wishes,
Ashutosh Bapat
On Fri, Jul 14, 2023 at 4:10 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I don't think that's true - this will create 1 record with > "created=true" (the one right after the CREATE SEQUENCE) and the rest > will have "created=false". I may have misread the code. > > I realized I haven't modified seq_desc to show this flag, so I did that > in the updated patch version, which makes this easy to see. Now I see it. Thanks for the clarification. > > > > Am I missing something here? > > > > You're missing the fact that pg_upgrade does not copy replication slots, > so the restart_lsn does not matter. > > (Yes, this is pretty annoying consequence of using pg_upgrade. And maybe > we'll improve that in the future - but I'm pretty sure we won't allow > decoding old WAL.) Ah, I see. Thanks for correcting me. > >>> > >> > >> Hmmmm, that might work. I feel a bit uneasy about having to keep all > >> relfilenodes, not just sequences ... > > > > From relfilenode it should be easy to get to rel and then see if it's > > a sequence. Only add relfilenodes for the sequence. > > > > Will try. > Actually, adding all relfilenodes to hash may not be that bad. There shouldn't be many of those. So the extra step to lookup reltype may not be necessary. What's your reason for uneasiness? But yeah, there's a way to avoid that as well. Should I wait for this before the second round of review? -- Best Wishes, Ashutosh Bapat
On 7/14/23 15:50, Ashutosh Bapat wrote: > On Fri, Jul 14, 2023 at 3:59 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >>>> >>>> The new patch detects that, and triggers ERROR on the publisher. And I >>>> think that's the correct thing to do. >>> >>> With this behaviour users will never be able to setup logical >>> replication between old and new servers considering almost every setup >>> has sequences. >>> >> >> That's not true. >> >> Replication to older versions works fine as long as the publication does >> not include sequences (which need to be added explicitly). If you have a >> publication with sequences, you clearly want to replicate them, ignoring >> it is just confusing "magic". > > I was looking at it from a different angle. Publishers publish what > they want, subscribers choose what they want and what gets replicated > is intersection of these two sets. Both live happily. > > But I am fine with that too. It's just that users need to create more > publications. > I think you might make essentially the same argument about replicating just some of the tables in the publication. That is, the publication has tables t1 and t2, but subscriber only has t1. That will fail too, we don't allow the subscriber to ignore changes for t2. I think it'd be rather weird (and confusing) to do this differently for different types of replicated objects. >> >> If you have a publication with sequences and still want to replicate to >> an older server, create a new publication without sequences. >> > > I tested the current patches with subscriber at PG 14 and publisher at > master + these patches. I created one table and a sequence on both > publisher and subscriber. I created two publications, one with > sequence and other without it. Both have the table in it. When the > subscriber subscribes to the publication with sequence, following > ERROR is repeated in the subscriber logs and nothing gets replicated > ``` > [2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOG: 00000: > logical replication apply worker for subscription "sub5433" has > started > [2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOCATION: > ApplyWorkerMain, worker.c:3169 > [2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] ERROR: 08P01: > could not receive data from WAL stream: ERROR: protocol version does > not support sequence replication > CONTEXT: slot "sub5433", output plugin "pgoutput", in the > sequence callback, associated LSN 0/1513718 > [2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] LOCATION: > libpqrcv_receive, libpqwalreceiver.c:818 > [2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOG: 00000: > background worker "logical replication worker" (PID 916293) exited > with exit code 1 > [2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOCATION: > LogChildExit, postmaster.c:3737 > ``` > > When the subscriber subscribes to the publication without sequence, > things work normally. > > The cross-version replication is working as expected then. > Thanks for testing / confirming this! So, do we agree this behavior is reasonable? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/14/23 16:02, Ashutosh Bapat wrote: > ... >>>>> >>>> >>>> Hmmmm, that might work. I feel a bit uneasy about having to keep all >>>> relfilenodes, not just sequences ... >>> >>> From relfilenode it should be easy to get to rel and then see if it's >>> a sequence. Only add relfilenodes for the sequence. >>> >> >> Will try. >> > > Actually, adding all relfilenodes to hash may not be that bad. There > shouldn't be many of those. So the extra step to lookup reltype may > not be necessary. What's your reason for uneasiness? But yeah, there's > a way to avoid that as well. > > Should I wait for this before the second round of review? > I don't think you have to wait - just ignore the part that changes the WAL record, which is a pretty tiny bit of the patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Here's a slightly improved version of the patch, fixing two minor issues
reported by cfbot:

- compiler warning about fetch_sequence_data maybe not initializing a
  variable (not true, but silence the warning)

- missing "id" for an element in the SGML docs

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Fri, Jul 14, 2023 at 7:33 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Thanks for testing / confirming this! So, do we agree this behavior is
> reasonable?
>

This behaviour doesn't need any on-disk changes or has nothing in it
which prohibits us from changing it in future. So I think it's good as
a v0. If required we can add the protocol option to provide more
flexible behaviour.

One thing I am worried about is that the subscriber will get an error
only when a sequence change is decoded. All the prior changes will be
replicated and applied on the subscriber. Thus by the time the user
realises this mistake, they may have replicated data. At this point if
they want to subscribe to a publication without sequences they will
need to clean the already replicated data. But they may not be in a
position to know which is which esp when the subscriber has its own
data in those tables. Example:

publisher: create publication pub with sequences and tables
subscriber: subscribe to pub
publisher: modify data in tables and sequences
subscriber: replicates some data and errors out
publisher: delete some data from tables
publisher: create a publication pub_tab without sequences
subscriber: subscribe to pub_tab
subscriber: replicates the data but rows which were deleted on
publisher remain on the subscriber

--
Best Wishes,
Ashutosh Bapat
On 7/18/23 15:52, Ashutosh Bapat wrote: > On Fri, Jul 14, 2023 at 7:33 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> >> Thanks for testing / confirming this! So, do we agree this behavior is >> reasonable? >> > > This behaviour doesn't need any on-disk changes or has nothing in it > which prohibits us from changing it in future. So I think it's good as > a v0. If required we can add the protocol option to provide more > flexible behaviour. > True, although "no on-disk changes" does not exactly mean we can just change it at will. Essentially, once it gets released, the behavior is somewhat fixed for the next ~5 years, until that release gets EOL. And likely longer, because more features are likely to do the same thing. That's essentially why the patch was reverted from PG16 - I was worried the elaborate protocol versioning/negotiation was not the right thing. > One thing I am worried about is that the subscriber will get an error > only when a sequence change is decoded. All the prior changes will be > replicated and applied on the subscriber. Thus by the time the user > realises this mistake, they may have replicated data. At this point if > they want to subscribe to a publication without sequences they will > need to clean the already replicated data. But they may not be in a > position to know which is which esp when the subscriber has its own > data in those tables. Example, > > publisher: create publication pub with sequences and tables > subscriber: subscribe to pub > publisher: modify data in tables and sequences > subscriber: replicates some data and errors out > publisher: delete some data from tables > publisher: create a publication pub_tab without sequences > subscriber: subscribe to pub_tab > subscriber: replicates the data but rows which were deleted on > publisher remain on the subscriber > Sure, but I'd argue that's correct. If the replication stream has something the subscriber can't apply, what else would you do? We had exactly the same thing with TRUNCATE, for example (except that it failed with "unknown message" on the subscriber). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 19, 2023 at 1:20 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> >
> > This behaviour doesn't need any on-disk changes or has nothing in it
> > which prohibits us from changing it in future. So I think it's good as
> > a v0. If required we can add the protocol option to provide more
> > flexible behaviour.
> >
>
> True, although "no on-disk changes" does not exactly mean we can just
> change it at will. Essentially, once it gets released, the behavior is
> somewhat fixed for the next ~5 years, until that release gets EOL. And
> likely longer, because more features are likely to do the same thing.
>
> That's essentially why the patch was reverted from PG16 - I was worried
> the elaborate protocol versioning/negotiation was not the right thing.

I agree that an elaborate protocol would pose roadblocks in the future.
It's better not to add that burden right now, esp. when the usage is
not clear.

Here's the behaviour and extension matrix as I understand it, as of the
last set of patches:

Publisher PG 17, Subscriber PG 17 - changes to sequences are replicated,
downstream is capable of applying them

Publisher PG 16-, Subscriber PG 17 - changes to sequences are never
replicated

Publisher PG 18+, Subscriber PG 17 - same as the 17/17 case. Any changes
in PG 18+ need to make sure that a PG 17 subscriber receives sequence
changes irrespective of changes in the protocol. That may pose some
maintenance burden, but doesn't seem to be any harder than the usual
backward compatibility burden.

Moreover, users can control whether changes to sequences get replicated
or not by controlling the objects contained in the publication.

I don't see any downside to this. Looks all good. Please correct me if
wrong.

>
> > One thing I am worried about is that the subscriber will get an error
> > only when a sequence change is decoded. All the prior changes will be
> > replicated and applied on the subscriber. Thus by the time the user
> > realises this mistake, they may have replicated data. At this point if
> > they want to subscribe to a publication without sequences they will
> > need to clean the already replicated data. But they may not be in a
> > position to know which is which esp when the subscriber has its own
> > data in those tables. Example,
> >
> > publisher: create publication pub with sequences and tables
> > subscriber: subscribe to pub
> > publisher: modify data in tables and sequences
> > subscriber: replicates some data and errors out
> > publisher: delete some data from tables
> > publisher: create a publication pub_tab without sequences
> > subscriber: subscribe to pub_tab
> > subscriber: replicates the data but rows which were deleted on
> > publisher remain on the subscriber
> >
>
> Sure, but I'd argue that's correct. If the replication stream has
> something the subscriber can't apply, what else would you do? We had
> exactly the same thing with TRUNCATE, for example (except that it failed
> with "unknown message" on the subscriber).

When the replication starts, the publisher knows what publication is
being used, and it also knows what protocol is being used. From the
publication it knows what objects will be replicated.
So we should fail the START_REPLICATION command before sending any change rather than when a change is being replicated. That's more deterministic and easy to handle. Of course any changes that were sent before ALTER PUBLICATION can not be reverted, but that's expected. Coming back to TRUNCATE, I don't think it's possible to know whether a publication will send a truncate downstream or not. So we can't throw an error before TRUNCATE change is decoded. Anyway, I think this behaviour should be documented. I didn't see this mentioned in PUBLICATION or SUBSCRIPTION documentation. [1] https://www.postgresql.org/docs/current/sql-alterpublication.html -- Best Wishes, Ashutosh Bapat
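For illustration, a failure at START_REPLICATION time could look roughly
like this sketch in pgoutput's startup path
(publication_publishes_sequences() is the hypothetical helper sketched
earlier in the thread; the list handling assumes publication names
parsed from the subscription options):

    /*
     * Sketch only: error out at startup if a requested publication
     * publishes sequences but the subscriber's protocol version cannot
     * handle them.  publication_publishes_sequences() is a hypothetical
     * helper; data->publication_names is assumed to be a list of
     * publication name strings.
     */
    ListCell   *lc;

    foreach(lc, data->publication_names)
    {
        char       *pubname = (char *) lfirst(lc);
        Publication *pub = GetPublicationByName(pubname, false);

        if (publication_publishes_sequences(pub) &&
            data->protocol_version < LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("publication \"%s\" publishes sequences, which the subscription does not support",
                            pub->name)));
    }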
On 7/19/23 07:42, Ashutosh Bapat wrote: > On Wed, Jul 19, 2023 at 1:20 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >>>> >>> >>> This behaviour doesn't need any on-disk changes or has nothing in it >>> which prohibits us from changing it in future. So I think it's good as >>> a v0. If required we can add the protocol option to provide more >>> flexible behaviour. >>> >> >> True, although "no on-disk changes" does not exactly mean we can just >> change it at will. Essentially, once it gets released, the behavior is >> somewhat fixed for the next ~5 years, until that release gets EOL. And >> likely longer, because more features are likely to do the same thing. >> >> That's essentially why the patch was reverted from PG16 - I was worried >> the elaborate protocol versioning/negotiation was not the right thing. > > I agree that elaborate protocol would pose roadblocks in future. It's > better not to add that burden right now, esp. when usage is not clear. > > Here's behavriour and extension matrix as I understand it and as of > the last set of patches. > > Publisher PG 17, Subscriber PG 17 - changes to sequences are > replicated, downstream is capable of applying them > > Publisher PG 16-, Subscriber PG 17 changes to sequences are never replicated > > Publisher PG 18+, Subscriber PG 17 - same as 17, 17 case. Any changes > in PG 18+ need to make sure that PG 17 subscriber receives sequence > changes irrespective of changes in protocol. That may pose some > maintenance burden but doesn't seem to be any harder than usual > backward compatibility burden. > > Moreover users can control whether changes to sequences get replicated > or not by controlling the objects contained in publication. > > I don't see any downside to this. Looks all good. Please correct me if wrong. > I think this is an accurate description of what the current patch does. And I think it's a reasonable behavior. My point is that if this gets released in PG17, it'll be difficult to change, even if it does not change on-disk format. >> >>> One thing I am worried about is that the subscriber will get an error >>> only when a sequence change is decoded. All the prior changes will be >>> replicated and applied on the subscriber. Thus by the time the user >>> realises this mistake, they may have replicated data. At this point if >>> they want to subscribe to a publication without sequences they will >>> need to clean the already replicated data. But they may not be in a >>> position to know which is which esp when the subscriber has its own >>> data in those tables. Example, >>> >>> publisher: create publication pub with sequences and tables >>> subscriber: subscribe to pub >>> publisher: modify data in tables and sequences >>> subscriber: replicates some data and errors out >>> publisher: delete some data from tables >>> publisher: create a publication pub_tab without sequences >>> subscriber: subscribe to pub_tab >>> subscriber: replicates the data but rows which were deleted on >>> publisher remain on the subscriber >>> >> >> Sure, but I'd argue that's correct. If the replication stream has >> something the subscriber can't apply, what else would you do? We had >> exactly the same thing with TRUNCATE, for example (except that it failed >> with "unknown message" on the subscriber). > > When the replication starts, the publisher knows what publication is > being used, it also knows what protocol is being used. From > publication it knows what objects will be replicated. 
> So we could fail
> before any changes are replicated when executing START_REPLICATION
> command. According to [1], if an object is added or removed from
> publication the subscriber is required to REFRESH SUBSCRIPTION in
> which case there will be fresh START_REPLICATION command sent. So we
> should fail the START_REPLICATION command before sending any change
> rather than when a change is being replicated. That's more
> deterministic and easy to handle. Of course any changes that were sent
> before ALTER PUBLICATION can not be reverted, but that's expected.
>
> Coming back to TRUNCATE, I don't think it's possible to know whether a
> publication will send a truncate downstream or not. So we can't throw
> an error before TRUNCATE change is decoded.
>
> Anyway, I think this behaviour should be documented. I didn't see this
> mentioned in PUBLICATION or SUBSCRIPTION documentation.
>

I need to think about this behavior a bit more, and maybe check how
difficult it would be to implement.

I did however look at the proposed alternative to the "created" flag.
The attached 0006 part ditches the flag in favor of XLOG_SMGR_CREATE
decoding. The smgr_decode code needs a review (I'm not sure the
skipping/fast-forwarding part is correct), but it seems to be working
fine overall, although we need to ensure the WAL record has the correct
XID. The gist of the approach is sketched below.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
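For context, the gist of the 0006 approach is roughly the following
(simplified sketch; the real smgr_decode has the snapshot and
fast-forward checks mentioned above, and ReorderBufferAddRelFileLocator
is the new function the patch adds):

    #include "catalog/storage_xlog.h"   /* xl_smgr_create */
    #include "replication/decode.h"

    /*
     * Simplified sketch: on XLOG_SMGR_CREATE, remember the relfilenode
     * in the transaction, so that later sequence changes touching it are
     * treated as transactional.
     */
    static void
    smgr_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
    {
        XLogReaderState *r = buf->record;
        uint8       info = XLogRecGetInfo(r) & ~XLR_INFO_MASK;
        xl_smgr_create *xlrec;

        if (info != XLOG_SMGR_CREATE)
            return;

        xlrec = (xl_smgr_create *) XLogRecGetData(r);

        /* only the main fork is interesting for sequences */
        if (xlrec->forkNum != MAIN_FORKNUM)
            return;

        ReorderBufferAddRelFileLocator(ctx->reorder, XLogRecGetXid(r),
                                       xlrec->rlocator);
    }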
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230719.patch
- 0002-Logical-decoding-of-sequences-20230719.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230719.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230719.patch
- 0005-Simplify-protocol-versioning-20230719.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230719.patch
On 7/19/23 12:53, Tomas Vondra wrote:
> ...
>
> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag in favor of XLOG_SMGR_CREATE
> decoding. The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.
>

cfbot reported two issues in the patch - a compilation warning, due to
an unused variable in sequence_decode, and a failing test in
test_decoding.

The second thing happens because the relfilenode may be created before
we know the XID. The patch already ensures the WAL with the sequence
data has an XID, but that happens later. And when the CREATE record did
not have the correct XID, that broke the logic deciding which increments
should be "transactional".

This forces us to assign the XID a bit earlier (it'd happen anyway, when
logging the increment). There's a bit of a drawback, because we don't
have the relation yet, so we can't do RelationNeedsWAL ... (see the
sketch below).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
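For reference, the earlier XID assignment amounts to a hunk like this in
DefineSequence(), before the relation (and its relfilenode) is created -
a sketch of the idea:

    /*
     * Make sure the transaction has an XID, so that the XLOG_SMGR_CREATE
     * record for the new relfilenode is associated with it.  We cannot
     * consult RelationNeedsWAL() here - the relation does not exist yet -
     * so the check is only on wal_level being "logical".
     */
    if (XLogLogicalInfoActive())
        (void) GetCurrentTransactionId();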
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230719b.patch
- 0002-Logical-decoding-of-sequences-20230719b.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230719b.patch
- 0004-Add-decoding-of-sequences-to-built-in-repl-20230719b.patch
- 0005-Simplify-protocol-versioning-20230719b.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230719b.patch
Thanks Tomas for the updated patches.

Here are my comments on the 0006 patch as well as the 0002 patch.

On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I think this is an accurate description of what the current patch does.
> And I think it's a reasonable behavior.
>
> My point is that if this gets released in PG17, it'll be difficult to
> change, even if it does not change on-disk format.
>

Yes. I agree. And I don't see any problem even if we are not able to
change it.

>
> I need to think about this behavior a bit more, and maybe check how
> difficult it would be to implement.

Ok.

In most of the comments and in the documentation, there are some phrases
which do not look accurate.

A change to a sequence is being referred to as a "sequence increment".
While ascending sequences are common, PostgreSQL supports descending
sequences as well. The changes there will be decrements. But that's not
the only case. A sequence may be restarted with an older value, in which
case the change could be an increment or a decrement. I think the
correct usage is "changes to a sequence" or "sequence changes".

A sequence being assigned a new relfilenode is referred to as a sequence
being created. This is confusing. When an existing sequence is ALTERed,
we will not "create" a new sequence, but we will "create" a new
relfilenode and "assign" it to that sequence.

PFA such edits in the 0002 and 0006 patches. Let me know if those look
correct. I think we need similar changes to the documentation and
comments in other places.

>
> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag in favor of XLOG_SMGR_CREATE
> decoding. The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.
>

Briefly describing the patch: when decoding an XLOG_SMGR_CREATE WAL
record, it adds the relfilenode mentioned in it to the sequences hash.
When decoding a sequence change record, it checks whether the
relfilenode in the WAL record is in the hash table. If it is, the
sequence change is deemed transactional, otherwise non-transactional.
The change looks good to me. It simplifies the logic to decide whether a
sequence change is transactional or not.

In sequence_decode() we skip sequence changes when fast-forwarding.
Given that smgr_decode() is only there to supplement sequence_decode(),
I think it's correct to do the same in smgr_decode() as well. Similarly
for skipping when we don't have a full snapshot.

Some minor comments on the 0006 patch:

+    /* make sure the relfilenode creation is associated with the XID */
+    if (XLogLogicalInfoActive())
+        GetCurrentTransactionId();

I think this change is correct and is in line with similar changes in
0002. But I looked at other places from where DefineRelation() is
called. For regular tables it is called from ProcessUtilitySlow(), which
in turn does not call GetCurrentTransactionId(). I am wondering whether
we are just discovering a class of bugs caused by not associating an xid
with a newly created relfilenode.

+    /*
+     * If we don't have snapshot or we are just fast-forwarding, there is no
+     * point in decoding changes.
+     */
+    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+        ctx->fast_forward)
+        return;

This code block is repeated.

+void
+ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
+                               RelFileLocator rlocator)
+{

... snip ...
+
+   /* sequence changes require a transaction */
+   if (xid == InvalidTransactionId)
+       return;

IIUC, with your changes in DefineSequence() in this patch, this should not happen, so this condition will never be true. But in case it happens, this code will not add the relfilelocator to the hash table and we will deem the sequence change as non-transactional. Isn't it better to just throw an error and stop replication if that (ever) happens?

Also some comments on the 0002 patch:

@@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)

    /* check the comment above nextval_internal()'s equivalent call. */
    if (RelationNeedsWAL(rel))
+   {
        GetTopTransactionId();

+       /*
+        * Make sure the subtransaction has a XID assigned, so that the
+        * sequence increment WAL record is properly associated with it.
+        * This matters for increments of sequences created/altered in the
+        * transaction, which are handled as transactional.
+        */
+       if (XLogLogicalInfoActive())
+           GetCurrentTransactionId();
+   }
+

I think we should separately commit the changes which add a call to GetCurrentTransactionId(). That looks like an existing bug/anomaly which can stay irrespective of this patch.

+   /*
+    * To support logical decoding of sequences, we require the sequence
+    * callback. We decide it here, but only check it later in the wrappers.
+    *
+    * XXX Isn't it wrong to define only one of those callbacks? Say we
+    * only define the stream_sequence_cb() - that may get strange results
+    * depending on what gets streamed. Either none or both?
+    *
+    * XXX Shouldn't sequence be defined at slot creation time, similar
+    * to two_phase? Probably not.
+    */

Do you intend to keep these XXX's as is? My previous comments on this comment block are in [1].

In fact, given that whether or not sequences are replicated is decided by the protocol version, do we really need LogicalDecodingContext::sequences? Drawing a parallel with WAL messages, I don't think it's needed.

[1] https://www.postgresql.org/message-id/CAExHW5vScYKKb0RZoiNEPfbaQ60hihfuWeLuZF4JKrwPJXPcUw%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
Attachment
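[Editorial note: to make the hash-based check reviewed above easier to follow, here is a minimal sketch of the decode-time decision, assuming a plain dynahash keyed by relfilenode. The entry layout and function names are hypothetical simplifications for illustration, not the actual patch code:]

#include "postgres.h"
#include "storage/relfilelocator.h"
#include "utils/hsearch.h"

/* hypothetical entry: one per relfilenode created by an in-progress xact */
typedef struct SeqCreateEntry
{
    RelFileLocator rlocator;    /* hash key */
    TransactionId xid;          /* transaction that created the relfilenode */
} SeqCreateEntry;

/* called while decoding XLOG_SMGR_CREATE */
static void
remember_created_relfilenode(HTAB *seqhash, RelFileLocator rlocator,
                             TransactionId xid)
{
    bool        found;
    SeqCreateEntry *entry;

    entry = (SeqCreateEntry *) hash_search(seqhash, &rlocator,
                                           HASH_ENTER, &found);
    if (!found)
        entry->xid = xid;
}

/* called while decoding a sequence change record */
static bool
sequence_change_is_transactional(HTAB *seqhash, RelFileLocator rlocator)
{
    /* created by an in-progress transaction => decode it transactionally */
    return hash_search(seqhash, &rlocator, HASH_FIND, NULL) != NULL;
}

The point of the scheme is that this lookup replaces the WAL-logged "created" flag: the same information is reconstructed on the decoding side from XLOG_SMGR_CREATE records.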
On 7/20/23 09:24, Ashutosh Bapat wrote: > Thanks Tomas for the updated patches. > > Here are my comments on 0006 patch as well as 0002 patch. > > On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I think this is an accurate description of what the current patch does. >> And I think it's a reasonable behavior. >> >> My point is that if this gets released in PG17, it'll be difficult to >> change, even if it does not change on-disk format. >> > > Yes. I agree. And I don't see any problem even if we are not able to change it. > >> >> I need to think behavior about this a bit more, and maybe check how >> difficult would be implementing it. > > Ok. > > In most of the comments and in documentation, there are some phrases > which do not look accurate. > > Change to a sequence is being refered to as "sequence increment". While > ascending sequences are common, PostgreSQL supports descending sequences as > well. The changes there will be decrements. But that's not the only case. A > sequence may be restarted with an older value, in which case the change could > increment or a decrement. I think correct usage is 'changes to sequence' or > 'sequence changes'. > > Sequence being assigned a new relfilenode is referred to as sequence > being created. This is confusing. When an existing sequence is ALTERed, we > will not "create" a new sequence but we will "create" a new relfilenode and > "assign" it to that sequence. > > PFA such edits in 0002 and 0006 patches. Let me know if those look > correct. I think we > need similar changes to the documentation and comments in other places. > OK, I merged the changes into the patches, with some minor changes to the wording etc. >> >> I did however look at the proposed alternative to the "created" flag. >> The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding. >> The smgr_decode code needs a review (I'm not sure the >> skipping/fast-forwarding part is correct), but it seems to be working >> fine overall, although we need to ensure the WAL record has the correct XID. >> > > Briefly describing the patch. When decoding a XLOG_SMGR_CREATE WAL > record, it adds the relfilenode mentioned in it to the sequences hash. > When decoding a sequence change record, it checks whether the > relfilenode in the WAL record is in hash table. If it is the sequence > changes is deemed transactional otherwise non-transactional. The > change looks good to me. It simplifies the logic to decide whether a > sequence change is transactional or not. > Right. > In sequence_decode() we skip sequence changes when fast forwarding. > Given that smgr_decode() is only to supplement sequence_decode(), I > think it's correct to do the same in smgr_decode() as well. Simillarly > skipping when we don't have full snapshot. > I don't follow, smgr_decode already checks ctx->fast_forward. > Some minor comments on 0006 patch > > + /* make sure the relfilenode creation is associated with the XID */ > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > > I think this change is correct and is inline with similar changes in 0002. But > I looked at other places from where DefineRelation() is called. For regular > tables it is called from ProcessUtilitySlow() which in turn does not call > GetCurrentTransactionId(). I am wondering whether we are just discovering a > class of bugs caused by not associating an xid with a newly created > relfilenode. > Not sure. Why would it be a bug? 
> + /* > + * If we don't have snapshot or we are just fast-forwarding, there is no > + * point in decoding changes. > + */ > + if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT || > + ctx->fast_forward) > + return; > > This code block is repeated. > Fixed. > +void > +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid, > + RelFileLocator rlocator) > +{ > ... snip ... > + > + /* sequence changes require a transaction */ > + if (xid == InvalidTransactionId) > + return; > > IIUC, with your changes in DefineSequence() in this patch, this should not > happen. So this condition will never be true. But in case it happens, this code > will not add the relfilelocation to the hash table and we will deem the > sequence change as non-transactional. Isn't it better to just throw an error > and stop replication if that (ever) happens? > It can't happen for sequence, but it may happen when creating a non-sequence relfilenode. In a way, it's a way to skip (some) unnecessary relfilenodes. > Also some comments on 0002 patch > > @@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple > tuple, ForkNumber forkNum) > > /* check the comment above nextval_internal()'s equivalent call. */ > if (RelationNeedsWAL(rel)) > + { > GetTopTransactionId(); > > + /* > + * Make sure the subtransaction has a XID assigned, so that > the sequence > + * increment WAL record is properly associated with it. This > matters for > + * increments of sequences created/altered in the > transaction, which are > + * handled as transactional. > + */ > + if (XLogLogicalInfoActive()) > + GetCurrentTransactionId(); > + } > + > > I think we should separately commit the changes which add a call to > GetCurrentTransactionId(). That looks like an existing bug/anomaly > which can stay irrespective of this patch. > Not sure, but I don't see this as a bug. > + /* > + * To support logical decoding of sequences, we require the sequence > + * callback. We decide it here, but only check it later in the wrappers. > + * > + * XXX Isn't it wrong to define only one of those callbacks? Say we > + * only define the stream_sequence_cb() - that may get strange results > + * depending on what gets streamed. Either none or both? > + * > + * XXX Shouldn't sequence be defined at slot creation time, similar > + * to two_phase? Probably not. > + */ > > Do you intend to keep these XXX's as is? My previous comments on this comment > block are in [1]. > > In fact, given that whether or not sequences are replicated is decided by the > protocol version, do we really need LogicalDecodingContext::sequences? Drawing > parallel with WAL messages, I don't think it's needed. > Right. We do that for two_phase because you can override that when creating the subscription - sequences allowed that too initially, but then we ditched that. So I don't think we need this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230720.patch
- 0002-Logical-decoding-of-sequences-20230720.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230720.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230720.patch
- 0005-Simplify-protocol-versioning-20230720.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230720.patch
FWIW there are two questions related to the switch to XLOG_SMGR_CREATE.

1) Does smgr_decode() need to do the same block as sequence_decode()?

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

I don't think it does. Also, we don't have any transactional flag here. Or rather, everything is transactional ...

2) Currently, the sequences hash table is in reorderbuffer, i.e. global. I was thinking maybe we should have it in the transaction (because we need to do cleanup at the end). It seems a bit inconvenient, because then we'd need to either search htabs in all subxacts, or transfer the entries to the top-level xact (otoh, we already do that with snapshots), and clean up on abort.

What do you think?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> OK, I merged the changes into the patches, with some minor changes to
> the wording etc.
>

I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720 even without the rest of the patches. Isn't it a separate improvement?

I see that origin filtering (origin=none) doesn't work with this patch. You can see this by using the following statements:

Node-1:
postgres=# create sequence s;
CREATE SEQUENCE
postgres=# create publication mypub for all sequences;
CREATE PUBLICATION

Node-2:
postgres=# create sequence s;
CREATE SEQUENCE
postgres=# create subscription mysub_sub connection '....' publication mypub with (origin=none);
NOTICE: created replication slot "mysub_sub" on publisher
CREATE SUBSCRIPTION
postgres=# create publication mypub_sub for all sequences;
CREATE PUBLICATION

Node-1:
create subscription mysub_pub connection '...' publication mypub_sub with (origin=none);
NOTICE: created replication slot "mysub_pub" on publisher
CREATE SUBSCRIPTION

SELECT nextval('s') FROM generate_series(1,100);

After that, you can check on the subscriber that the sequence values are overridden with older values:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         67 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        100 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        133 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         67 |       0 | t
(1 row)

I haven't verified all the details but I think that is because we don't set XLOG_INCLUDE_ORIGIN while logging sequence values.

--
With Regards,
Amit Kapila.
On Mon, Jul 24, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > OK, I merged the changes into the patches, with some minor changes to
> > the wording etc.
> >
>
> I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720
> even without the rest of the patches. Isn't it a separate improvement?

+1. Yes, it can go separately. It would be even better if the test could be modified to capture the toasted data into a psql variable before inserting it into the table, and then compare it with the output of pg_logical_slot_get_changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
> be reviewed in this response.
>
> I followed the discussion starting [1] till [2]. The second one
> mentions the interlock mechanism which has been implemented in 0005
> and 0006. While I don't have an objection to allowing LOCKing a
> sequence using the LOCK command, I am not sure whether it will
> actually work or is even needed.
>
> The problem described in [1] seems to be the same as the problem
> described in [2]. In both cases we see the sequence moving backwards
> during CATCHUP. At the end of catchup the sequence is in the right
> state in both the cases.
>

I think we could see a backward sequence value even after the catchup phase (after the sync worker has exited and/or the state of the rel is marked as 'ready' in pg_subscription_rel). The point is that there is no guarantee that we will process all the pending WAL before considering the sequence state 'SYNCDONE' and/or 'READY'. For example, after copy_sequence, I see values like:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        165 |       0 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
     166
(1 row)
postgres=# select nextval('s');
 nextval
---------
     167
(1 row)
postgres=# select currval('s');
 currval
---------
     167
(1 row)

Then during the catchup phase:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         33 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16394 |   16390 | r          | 0/16374E8
   16394 |   16393 | s          | 0/1637700
(2 rows)

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16394 |   16390 | r          | 0/16374E8
   16394 |   16393 | r          | 0/1637700
(2 rows)

Here the sequence relid is 16393. You can see the sequence state is marked as ready.

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)

Even after that, as seen above, the value of the sequence is still not caught up. Later, when the apply worker processes all the WAL, the sequence state will be caught up.

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        165 |       0 | t
(1 row)

So, there will be a window where the sequence won't be caught up for a certain period of time, and any usage of it (even after the sync is finished) during that time could result in inconsistent behaviour.

The other question is whether it is okay to allow the sequence to go backwards even during the initial sync phase. The reason I am asking this question is that for the time the sequence value moves backwards, one is allowed to use it on the subscriber, which will result in using out-of-sequence values. For example, immediately after copy_sequence, the values look like this:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        133 |      32 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
     134
(1 row)
postgres=# select currval('s');
 currval
---------
     134
(1 row)

But then during the sync phase, it can go backwards and one is allowed to use it on the subscriber:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
      67
(1 row)

--
With Regards,
Amit Kapila.
On 7/24/23 12:40, Amit Kapila wrote: > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: >> >> 0005, 0006 and 0007 are all related to the initial sequence sync. [3] >> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to >> be reviewed in this response. >> >> I followed the discussion starting [1] till [2]. The second one >> mentions the interlock mechanism which has been implemented in 0005 >> and 0006. While I don't have an objection to allowing LOCKing a >> sequence using the LOCK command, I am not sure whether it will >> actually work or is even needed. >> >> The problem described in [1] seems to be the same as the problem >> described in [2]. In both cases we see the sequence moving backwards >> during CATCHUP. At the end of catchup the sequence is in the right >> state in both the cases. >> > > I think we could see backward sequence value even after the catchup > phase (after the sync worker is exited and or the state of rel is > marked as 'ready' in pg_subscription_rel). The point is that there is > no guarantee that we will process all the pending WAL before > considering the sequence state is 'SYNCDONE' and or 'READY'. For > example, after copy_sequence, I see values like: > > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 165 | 0 | t > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 166 > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 167 > (1 row) > postgres=# select currval('s'); > currval > --------- > 167 > (1 row) > > Then during the catchup phase: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 33 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 66 | 0 | t > (1 row) > > postgres=# select * from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn > ---------+---------+------------+----------- > 16394 | 16390 | r | 0/16374E8 > 16394 | 16393 | s | 0/1637700 > (2 rows) > > postgres=# select * from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn > ---------+---------+------------+----------- > 16394 | 16390 | r | 0/16374E8 > 16394 | 16393 | r | 0/1637700 > (2 rows) > > Here Sequence relid id 16393. You can see sequence state is marked as ready. > > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 66 | 0 | t > (1 row) > > Even after that, see below the value of the sequence is still not > caught up. Later, when the apply worker processes all the WAL, the > sequence state will be caught up. > > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 165 | 0 | t > (1 row) > > So, there will be a window where the sequence won't be caught up for a > certain period of time and any usage of it (even after the sync is > finished) during that time could result in inconsistent behaviour. > I'm rather confused about which node these queries are executed on. Presumably some of it is on publisher, some on subscriber? Can you create a reproducer (TAP test demonstrating this?) I guess it might require adding some sleeps to hit the right timing ... > The other question is whether it is okay to allow the sequence to go > backwards even during the initial sync phase? 
The reason I am asking > this question is that for the time sequence value moves backwards, one > is allowed to use it on the subscriber which will result in using > out-of-sequence values. For example, immediately, after copy_sequence > the values look like this: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 133 | 32 | t > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 134 > (1 row) > postgres=# select currval('s'); > currval > --------- > 134 > (1 row) > > But then during the sync phase, it can go backwards and one is allowed > to use it on the subscriber: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 66 | 0 | t > (1 row) > postgres=# select nextval('s'); > nextval > --------- > 67 > (1 row) > Well, as for going back during the sync phase, I think the agreement was that's acceptable, as we don't make guarantees about that. The question is what's the state at the end of the sync (which I think leads to the first part of your message). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/24/23 08:31, Amit Kapila wrote: > On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> OK, I merged the changes into the patches, with some minor changes to >> the wording etc. >> > > I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720 > even without the rest of the patches. Isn't it a separate improvement? > True. > I see that origin filtering (origin=none) doesn't work with this > patch. You can see this by using the following statements: > Node-1: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create publication mypub for all sequences; > CREATE PUBLICATION > > Node-2: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create subscription mysub_sub connection '....' publication > mypub with (origin=none); > NOTICE: created replication slot "mysub_sub" on publisher > CREATE SUBSCRIPTION > postgres=# create publication mypub_sub for all sequences; > CREATE PUBLICATION > > Node-1: > create subscription mysub_pub connection '...' publication mypub_sub > with (origin=none); > NOTICE: created replication slot "mysub_pub" on publisher > CREATE SUBSCRIPTION > > SELECT nextval('s') FROM generate_series(1,100); > > After that, you can check on the subscriber that sequences values are > overridden with older values: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 100 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 133 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > > I haven't verified all the details but I think that is because we > don't set XLOG_INCLUDE_ORIGIN while logging sequence values. > Hmmm, yeah. I guess we'll need to set XLOG_INCLUDE_ORIGIN with wal_level=logical. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
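[Editorial note: for context on what such a fix involves, making origin filtering work is essentially a matter of stamping the sequence WAL records with the session's replication origin. Below is a rough sketch modeled on the existing WAL-logging code in commands/sequence.c; the wrapper function is hypothetical and error handling is elided, but XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN) is the same mechanism other record types already use:]

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xloginsert.h"
#include "commands/sequence.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch: WAL-log a sequence page the way nextval_internal() does,
 * but include the replication origin in the record, so a subscriber
 * created with origin=none can filter out already-replicated changes.
 */
static XLogRecPtr
log_sequence_change(Relation rel, Buffer buf, HeapTuple seqdatatuple)
{
    xl_seq_rec  xlrec;

    XLogBeginInsert();
    XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);

    xlrec.locator = rel->rd_locator;
    XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
    XLogRegisterData((char *) seqdatatuple->t_data, seqdatatuple->t_len);

    /* the actual fix: stamp the record with the replication origin */
    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);

    return XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
}

Without the flag the origin is not recorded at all, so decoding reports InvalidRepOriginId and origin filtering has nothing to match against, which is consistent with the behavior Amit observed.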
On 2023-Jul-20, Tomas Vondra wrote: > From 809d60be7e636b8505027ad87bcb9fc65224c47b Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <tomas.vondra@postgresql.org> > Date: Wed, 5 Apr 2023 22:49:41 +0200 > Subject: [PATCH 1/6] Make test_decoding ddl.out shorter > > Some of the test_decoding test output was extremely wide, because it > deals with toasted values, and the aligned mode causes psql to produce > 200kB of dashes. Turn that off temporarily using \pset to avoid it. Do you mind if I get this one pushed later today? Or feel free to push it yourself, if you want. It's an annoying patch to keep seeing posted over and over, with no further value. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> >
> > PFA such edits in 0002 and 0006 patches. Let me know if those look
> > correct. I think we
> > need similar changes to the documentation and comments in other places.
> >
>
> OK, I merged the changes into the patches, with some minor changes to
> the wording etc.

Thanks.

>
> > In sequence_decode() we skip sequence changes when fast forwarding.
> > Given that smgr_decode() is only to supplement sequence_decode(), I
> > think it's correct to do the same in smgr_decode() as well. Similarly
> > skipping when we don't have full snapshot.
> >
>
> I don't follow, smgr_decode already checks ctx->fast_forward.

In your earlier email you seemed to express some doubts about the change skipping code in smgr_decode(). To that, I gave my own perspective of why the change skipping code in smgr_decode() is correct. smgr_decode() is doing the right thing, IMO. No change required there.

>
> > Some minor comments on 0006 patch
> >
> > + /* make sure the relfilenode creation is associated with the XID */
> > + if (XLogLogicalInfoActive())
> > + GetCurrentTransactionId();
> >
> > I think this change is correct and is in line with similar changes in 0002. But
> > I looked at other places from where DefineRelation() is called. For regular
> > tables it is called from ProcessUtilitySlow() which in turn does not call
> > GetCurrentTransactionId(). I am wondering whether we are just discovering a
> > class of bugs caused by not associating an xid with a newly created
> > relfilenode.
> >
>
> Not sure. Why would it be a bug?

This discussion is unrelated to sequence decoding but let me add it here. If we don't know the transaction ID that created a relfilenode, we wouldn't know whether to roll back that creation if the transaction gets rolled back during recovery. But maybe that doesn't matter since the relfilenode is not visible in any of the catalogs, so it just lies there unused.

>
> > +void
> > +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
> > + RelFileLocator rlocator)
> > +{
> > ... snip ...
> > +
> > + /* sequence changes require a transaction */
> > + if (xid == InvalidTransactionId)
> > + return;
> >
> > IIUC, with your changes in DefineSequence() in this patch, this should not
> > happen. So this condition will never be true. But in case it happens, this code
> > will not add the relfilelocator to the hash table and we will deem the
> > sequence change as non-transactional. Isn't it better to just throw an error
> > and stop replication if that (ever) happens?
> >
>
> It can't happen for sequence, but it may happen when creating a
> non-sequence relfilenode. In a way, it's a way to skip (some)
> unnecessary relfilenodes.

Ah! The comment is correct but cryptic. I didn't read it to mean this.

> > + /*
> > + * To support logical decoding of sequences, we require the sequence
> > + * callback. We decide it here, but only check it later in the wrappers.
> > + *
> > + * XXX Isn't it wrong to define only one of those callbacks? Say we
> > + * only define the stream_sequence_cb() - that may get strange results
> > + * depending on what gets streamed. Either none or both?
> > + *
> > + * XXX Shouldn't sequence be defined at slot creation time, similar
> > + * to two_phase? Probably not.
> > + */
> >
> > Do you intend to keep these XXX's as is? My previous comments on this comment
> > block are in [1].

This comment remains unanswered.
>
> > In fact, given that whether or not sequences are replicated is decided by the
> > protocol version, do we really need LogicalDecodingContext::sequences? Drawing
> > a parallel with WAL messages, I don't think it's needed.
> >
>
> Right. We do that for two_phase because you can override that when
> creating the subscription - sequences allowed that too initially, but
> then we ditched that. So I don't think we need this.

Then we should just remove that member and its references.

--
Best Wishes,
Ashutosh Bapat
On 7/24/23 13:14, Alvaro Herrera wrote: > On 2023-Jul-20, Tomas Vondra wrote: > >> From 809d60be7e636b8505027ad87bcb9fc65224c47b Mon Sep 17 00:00:00 2001 >> From: Tomas Vondra <tomas.vondra@postgresql.org> >> Date: Wed, 5 Apr 2023 22:49:41 +0200 >> Subject: [PATCH 1/6] Make test_decoding ddl.out shorter >> >> Some of the test_decoding test output was extremely wide, because it >> deals with toasted values, and the aligned mode causes psql to produce >> 200kB of dashes. Turn that off temporarily using \pset to avoid it. > > Do you mind if I get this one pushed later today? Or feel free to push > it yourself, if you want. It's an annoying patch to keep seeing posted > over and over, with no further value. > Feel free to push. It's your patch, after all. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 20, 2023 at 10:19 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> FWIW there are two questions related to the switch to XLOG_SMGR_CREATE.
>
> 1) Does smgr_decode() need to do the same block as sequence_decode()?
>
>     /* Skip the change if already processed (per the snapshot). */
>     if (transactional &&
>         !SnapBuildProcessChange(builder, xid, buf->origptr))
>         return;
>     else if (!transactional &&
>              (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
>               SnapBuildXactNeedsSkip(builder, buf->origptr)))
>         return;
>
> I don't think it does. Also, we don't have any transactional flag here.
> Or rather, everything is transactional ...

Right.

>
> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> I was thinking maybe we should have it in the transaction (because we
> need to do cleanup at the end). It seems a bit inconvenient, because then
> we'd need to either search htabs in all subxacts, or transfer the
> entries to the top-level xact (otoh, we already do that with snapshots),
> and clean up on abort.
>
> What do you think?

A hash table per transaction seems a saner design. Adding it to the top-level transaction should be fine. The entry will contain an XID anyway. If we add it to every subtransaction, we will need to search the hash table in each of the subtransactions when deciding whether a sequence change is transactional or not. The top transaction is a reasonable trade-off.

--
Best Wishes,
Ashutosh Bapat
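[Editorial note: a sketch of what the per-transaction variant could look like, for concreteness. The sequences_hash field and the helper are hypothetical (ReorderBufferTXN has no such member today); rbtxn_get_toptxn() is the existing macro in reorderbuffer.h for reaching the top-level transaction:]

#include "postgres.h"
#include "replication/reorderbuffer.h"
#include "utils/hsearch.h"

/*
 * Sketch: decide whether a sequence change is transactional by looking
 * up the relfilenode in a hash kept on the *top-level* transaction, so
 * the hash is freed together with the transaction (including on abort)
 * and subtransactions don't each need their own copy.
 */
static bool
sequence_created_in_xact(ReorderBufferTXN *txn, RelFileLocator rlocator)
{
    ReorderBufferTXN *toptxn = rbtxn_get_toptxn(txn);

    /* hypothetical field: hash of relfilenodes created by this xact */
    if (toptxn->sequences_hash == NULL)
        return false;

    return hash_search(toptxn->sequences_hash, &rlocator,
                       HASH_FIND, NULL) != NULL;
}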
On 7/24/23 08:31, Amit Kapila wrote: > On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> OK, I merged the changes into the patches, with some minor changes to >> the wording etc. >> > > I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720 > even without the rest of the patches. Isn't it a separate improvement? > > I see that origin filtering (origin=none) doesn't work with this > patch. You can see this by using the following statements: > Node-1: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create publication mypub for all sequences; > CREATE PUBLICATION > > Node-2: > postgres=# create sequence s; > CREATE SEQUENCE > postgres=# create subscription mysub_sub connection '....' publication > mypub with (origin=none); > NOTICE: created replication slot "mysub_sub" on publisher > CREATE SUBSCRIPTION > postgres=# create publication mypub_sub for all sequences; > CREATE PUBLICATION > > Node-1: > create subscription mysub_pub connection '...' publication mypub_sub > with (origin=none); > NOTICE: created replication slot "mysub_pub" on publisher > CREATE SUBSCRIPTION > > SELECT nextval('s') FROM generate_series(1,100); > > After that, you can check on the subscriber that sequences values are > overridden with older values: > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 100 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 133 | 0 | t > (1 row) > postgres=# select * from s; > last_value | log_cnt | is_called > ------------+---------+----------- > 67 | 0 | t > (1 row) > > I haven't verified all the details but I think that is because we > don't set XLOG_INCLUDE_ORIGIN while logging sequence values. > Good point. Attached is a patch that adds XLOG_INCLUDE_ORIGIN to sequence changes. I considered doing that only for wal_level=logical, but we don't do that elsewhere. Also, I didn't do that for smgr_create, because we don't actually replicate that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0001-Make-test_decoding-ddl.out-shorter-20230724.patch
- 0002-Logical-decoding-of-sequences-20230724.patch
- 0003-Add-decoding-of-sequences-to-test_decoding-20230724.patch
- 0004-Add-decoding-of-sequences-to-built-in-repli-20230724.patch
- 0005-Simplify-protocol-versioning-20230724.patch
- 0006-replace-created-flag-with-XLOG_SMGR_CREATE-20230724.patch
- 0007-add-XLOG_INCLUDE_ORIGIN-for-sequences-20230724.patch
On 7/24/23 14:53, Ashutosh Bapat wrote:
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>>>
>>> PFA such edits in 0002 and 0006 patches. Let me know if those look
>>> correct. I think we
>>> need similar changes to the documentation and comments in other places.
>>>
>>
>> OK, I merged the changes into the patches, with some minor changes to
>> the wording etc.
>
> Thanks.
>
>>
>>> In sequence_decode() we skip sequence changes when fast forwarding.
>>> Given that smgr_decode() is only to supplement sequence_decode(), I
>>> think it's correct to do the same in smgr_decode() as well. Similarly
>>> skipping when we don't have full snapshot.
>>>
>>
>> I don't follow, smgr_decode already checks ctx->fast_forward.
>
> In your earlier email you seemed to express some doubts about the
> change skipping code in smgr_decode(). To that, I gave my own
> perspective of why the change skipping code in smgr_decode() is
> correct. smgr_decode() is doing the right thing, IMO. No change
> required there.
>

I think that was referring to the skipping we do for logical messages:

    if (message->transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!message->transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

I concluded we don't need to do that here.

>>
>>> Some minor comments on 0006 patch
>>>
>>> + /* make sure the relfilenode creation is associated with the XID */
>>> + if (XLogLogicalInfoActive())
>>> + GetCurrentTransactionId();
>>>
>>> I think this change is correct and is in line with similar changes in 0002. But
>>> I looked at other places from where DefineRelation() is called. For regular
>>> tables it is called from ProcessUtilitySlow() which in turn does not call
>>> GetCurrentTransactionId(). I am wondering whether we are just discovering a
>>> class of bugs caused by not associating an xid with a newly created
>>> relfilenode.
>>>
>>
>> Not sure. Why would it be a bug?
>
> This discussion is unrelated to sequence decoding but let me add it
> here. If we don't know the transaction ID that created a relfilenode,
> we wouldn't know whether to roll back that creation if the transaction
> gets rolled back during recovery. But maybe that doesn't matter since
> the relfilenode is not visible in any of the catalogs, so it just lies
> there unused.
>

I think that's unrelated to this patch.

>
>>
>>> +void
>>> +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
>>> + RelFileLocator rlocator)
>>> +{
>>> ... snip ...
>>> +
>>> + /* sequence changes require a transaction */
>>> + if (xid == InvalidTransactionId)
>>> + return;
>>>
>>> IIUC, with your changes in DefineSequence() in this patch, this should not
>>> happen. So this condition will never be true. But in case it happens, this code
>>> will not add the relfilelocator to the hash table and we will deem the
>>> sequence change as non-transactional. Isn't it better to just throw an error
>>> and stop replication if that (ever) happens?
>>>
>>
>> It can't happen for sequence, but it may happen when creating a
>> non-sequence relfilenode. In a way, it's a way to skip (some)
>> unnecessary relfilenodes.
>
> Ah! The comment is correct but cryptic. I didn't read it to mean this.
>

OK, I'll improve the comment.

>>> + /*
>>> + * To support logical decoding of sequences, we require the sequence
>>> + * callback. We decide it here, but only check it later in the wrappers.
>>> + *
>>> + * XXX Isn't it wrong to define only one of those callbacks? Say we
>>> + * only define the stream_sequence_cb() - that may get strange results
>>> + * depending on what gets streamed. Either none or both?
>>> + *
>>> + * XXX Shouldn't sequence be defined at slot creation time, similar
>>> + * to two_phase? Probably not.
>>> + */
>>>
>>> Do you intend to keep these XXX's as is? My previous comments on this comment
>>> block are in [1].
>
> This comment remains unanswered.
>

I think the conclusion was we don't need to do that. I forgot to remove the comment, though.

>>>
>>> In fact, given that whether or not sequences are replicated is decided by the
>>> protocol version, do we really need LogicalDecodingContext::sequences? Drawing
>>> a parallel with WAL messages, I don't think it's needed.
>>>
>>
>> Right. We do that for two_phase because you can override that when
>> creating the subscription - sequences allowed that too initially, but
>> then we ditched that. So I don't think we need this.
>
> Then we should just remove that member and its references.
>

The member is still needed - it says whether the plugin has callbacks for sequence decoding or not (just like we have a flag for streaming, for example). I see the XXX comment in sequence_decode() is no longer needed; we rely on protocol versioning.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
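[Editorial note: for context on the "flag for streaming" parallel, in logical.c the decoding context records whether the output plugin supplied the streaming callbacks, and the sequence flag would presumably be derived the same way. A sketch in that style, where sequence_cb and stream_sequence_cb are the callbacks added by the patch (not in core):]

/*
 * Sketch, mirroring how ctx->streaming is derived from the presence of
 * the stream_*_cb callbacks: remember whether the output plugin
 * implements sequence decoding at all.
 */
ctx->sequences = (ctx->callbacks.sequence_cb != NULL) ||
                 (ctx->callbacks.stream_sequence_cb != NULL);

/* later, in the decode wrappers, sequence changes are simply skipped */
if (!ctx->sequences)
    return;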
On 2023-Jul-24, Tomas Vondra wrote: > On 7/24/23 13:14, Alvaro Herrera wrote: > > Do you mind if I get this one pushed later today? Or feel free to push > > it yourself, if you want. It's an annoying patch to keep seeing posted > > over and over, with no further value. > > Feel free to push. It's your patch, after all. Thanks, done. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Learn about compilers. Then everything looks like either a compiler or a database, and now you have two problems but one of them is fun." https://twitter.com/thingskatedid/status/1456027786158776329
On 7/24/23 12:40, Amit Kapila wrote:
> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
>> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
>> be reviewed in this response.
>>
>> I followed the discussion starting [1] till [2]. The second one
>> mentions the interlock mechanism which has been implemented in 0005
>> and 0006. While I don't have an objection to allowing LOCKing a
>> sequence using the LOCK command, I am not sure whether it will
>> actually work or is even needed.
>>
>> The problem described in [1] seems to be the same as the problem
>> described in [2]. In both cases we see the sequence moving backwards
>> during CATCHUP. At the end of catchup the sequence is in the right
>> state in both the cases.
>>
>
> I think we could see a backward sequence value even after the catchup
> phase (after the sync worker has exited and/or the state of the rel is
> marked as 'ready' in pg_subscription_rel). The point is that there is
> no guarantee that we will process all the pending WAL before
> considering the sequence state 'SYNCDONE' and/or 'READY'. For
> example, after copy_sequence, I see values like:
>
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      166
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      167
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      167
> (1 row)
>
> Then during the catchup phase:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          33 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
>
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | s          | 0/1637700
> (2 rows)
>
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | r          | 0/1637700
> (2 rows)
>
> Here the sequence relid is 16393. You can see the sequence state is marked as ready.
>

Right, but "READY" just means the apply caught up to the LSN where the sync finished ...

> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
>
> Even after that, see below, the value of the sequence is still not
> caught up. Later, when the apply worker processes all the WAL, the
> sequence state will be caught up.
>
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
>
> So, there will be a window where the sequence won't be caught up for a
> certain period of time and any usage of it (even after the sync is
> finished) during that time could result in inconsistent behaviour.
>

And how is this different from what tablesync does for tables? For that 'r' also does not mean it's fully caught up, IIRC. What matters is whether the sequence can go back from this moment on. And I don't think it can, because that would require replaying changes from before we did copy_sequence ...

> The other question is whether it is okay to allow the sequence to go
> backwards even during the initial sync phase? The reason I am asking
> this question is that for the time the sequence value moves backwards, one
> is allowed to use it on the subscriber which will result in using
> out-of-sequence values. For example, immediately after copy_sequence
> the values look like this:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         133 |      32 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      134
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      134
> (1 row)
>
> But then during the sync phase, it can go backwards and one is allowed
> to use it on the subscriber:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>       67
> (1 row)
>

As I wrote earlier, I think the agreement was we make no guarantees about what happens during the sync. Also, not sure what you mean by "no one is allowed to use it on subscriber" - that is only allowed after a failover/switchover, after sequence sync completes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/24/23 12:40, Amit Kapila wrote:
> > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Even after that, see below the value of the sequence is still not
> > caught up. Later, when the apply worker processes all the WAL, the
> > sequence state will be caught up.
> >
>
> And how is this different from what tablesync does for tables? For that
> 'r' also does not mean it's fully caught up, IIRC. What matters is
> whether the sequence since this moment can go back. And I don't think it
> can, because that would require replaying changes from before we did
> copy_sequence ...
>

For sequences, it is quite possible that we replay WAL from before the copy_sequence(), whereas the same is not true for tables (w.r.t. copy_table()). This is because for tables we have a kind of interlock w.r.t. the LSN returned via create_slot (say this value is LSN1): the walsender corresponding to the tablesync worker on the publisher won't send any WAL before that LSN, whereas the same is not true for sequences. Also, even if the apply worker can receive WAL from before copy_table(), it won't apply it, as that would be behind LSN1; again, the same is not true for sequences. So, for tables, we will never go back to a state before the copy_table(), but for sequences, we can go back to a state before copy_sequence().

--
With Regards,
Amit Kapila.
On Mon, Jul 24, 2023 at 4:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/24/23 12:40, Amit Kapila wrote: > > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat > > <ashutosh.bapat.oss@gmail.com> wrote: > >> > >> 0005, 0006 and 0007 are all related to the initial sequence sync. [3] > >> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to > >> be reviewed in this response. > >> > >> I followed the discussion starting [1] till [2]. The second one > >> mentions the interlock mechanism which has been implemented in 0005 > >> and 0006. While I don't have an objection to allowing LOCKing a > >> sequence using the LOCK command, I am not sure whether it will > >> actually work or is even needed. > >> > >> The problem described in [1] seems to be the same as the problem > >> described in [2]. In both cases we see the sequence moving backwards > >> during CATCHUP. At the end of catchup the sequence is in the right > >> state in both the cases. > >> > > > > I think we could see backward sequence value even after the catchup > > phase (after the sync worker is exited and or the state of rel is > > marked as 'ready' in pg_subscription_rel). The point is that there is > > no guarantee that we will process all the pending WAL before > > considering the sequence state is 'SYNCDONE' and or 'READY'. For > > example, after copy_sequence, I see values like: > > > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 165 | 0 | t > > (1 row) > > postgres=# select nextval('s'); > > nextval > > --------- > > 166 > > (1 row) > > postgres=# select nextval('s'); > > nextval > > --------- > > 167 > > (1 row) > > postgres=# select currval('s'); > > currval > > --------- > > 167 > > (1 row) > > > > Then during the catchup phase: > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 33 | 0 | t > > (1 row) > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 66 | 0 | t > > (1 row) > > > > postgres=# select * from pg_subscription_rel; > > srsubid | srrelid | srsubstate | srsublsn > > ---------+---------+------------+----------- > > 16394 | 16390 | r | 0/16374E8 > > 16394 | 16393 | s | 0/1637700 > > (2 rows) > > > > postgres=# select * from pg_subscription_rel; > > srsubid | srrelid | srsubstate | srsublsn > > ---------+---------+------------+----------- > > 16394 | 16390 | r | 0/16374E8 > > 16394 | 16393 | r | 0/1637700 > > (2 rows) > > > > Here Sequence relid id 16393. You can see sequence state is marked as ready. > > > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 66 | 0 | t > > (1 row) > > > > Even after that, see below the value of the sequence is still not > > caught up. Later, when the apply worker processes all the WAL, the > > sequence state will be caught up. > > > > postgres=# select * from s; > > last_value | log_cnt | is_called > > ------------+---------+----------- > > 165 | 0 | t > > (1 row) > > > > So, there will be a window where the sequence won't be caught up for a > > certain period of time and any usage of it (even after the sync is > > finished) during that time could result in inconsistent behaviour. > > > > I'm rather confused about which node these queries are executed on. > Presumably some of it is on publisher, some on subscriber? > These are all on the subscriber. > Can you create a reproducer (TAP test demonstrating this?) 
> I guess it might require adding some sleeps to hit the right timing ...
>

I have used the debugger to reproduce this, as it needs quite some coordination. I just wanted to see whether the sequence can go backward and not catch up completely before the sequence state is marked 'ready'.

On the publisher side, I created a publication with a table and a sequence. Then I did the following steps:

SELECT nextval('s') FROM generate_series(1,50);
insert into t1 values(1);
SELECT nextval('s') FROM generate_series(51,150);

Then on the subscriber side, with some debugging aid, I could observe the sequence values shown in the previous email. Sorry, I haven't recorded each and every step, but if you think it helps, I can try to reproduce it again and share the steps.

--
With Regards,
Amit Kapila.
On 7/25/23 08:28, Amit Kapila wrote:
> On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/24/23 12:40, Amit Kapila wrote:
>>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
>>> <ashutosh.bapat.oss@gmail.com> wrote:
>>>
>>> Even after that, see below the value of the sequence is still not
>>> caught up. Later, when the apply worker processes all the WAL, the
>>> sequence state will be caught up.
>>>
>>
>> And how is this different from what tablesync does for tables? For that
>> 'r' also does not mean it's fully caught up, IIRC. What matters is
>> whether the sequence since this moment can go back. And I don't think it
>> can, because that would require replaying changes from before we did
>> copy_sequence ...
>>
>
> For sequences, it is quite possible that we replay WAL from before the
> copy_sequence(), whereas the same is not true for tables (w.r.t.
> copy_table()). This is because for tables we have a kind of interlock
> w.r.t. the LSN returned via create_slot (say this value is LSN1): the
> walsender corresponding to the tablesync worker on the publisher won't
> send any WAL before that LSN, whereas the same is not true for
> sequences. Also, even if the apply worker can receive WAL from before
> copy_table(), it won't apply it, as that would be behind LSN1; again,
> the same is not true for sequences. So, for tables, we will never go
> back to a state before the copy_table(), but for sequences, we can go
> back to a state before copy_sequence().
>

Right. I think the important detail is that during sync we have three important LSNs:

- LSN1 where the slot is created
- LSN2 where the copy happens
- LSN3 where we consider the sync completed

For tables, LSN1 == LSN2, because the copy is done using the snapshot from the temporary slot. And (LSN1 <= LSN3).

But for sequences, the copy happens after the slot creation, possibly with (LSN1 < LSN2). And because LSN3 comes from the main subscription (which may be a bit behind, for whatever reason), it may happen that

(LSN1 < LSN3 < LSN2)

The sync then ends at LSN3, but that means all sequence changes between LSN3 and LSN2 will be applied "again", making the sequence go backwards.

IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
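[Editorial note: in code terms, the condition Tomas proposes could look roughly like the sketch below. finish_sequence_sync() is a hypothetical helper (the real patch wires this into the tablesync machinery); UpdateSubscriptionRelState() and SUBREL_STATE_SYNCDONE are the existing catalog interface:]

#include "postgres.h"
#include "access/xlogdefs.h"
#include "catalog/pg_subscription_rel.h"

/*
 * Sketch: only mark the sequence as SYNCDONE once the main apply
 * position (LSN3) has reached the publisher insert LSN taken right
 * after copy_sequence() (LSN2). That guarantees no change from before
 * the copy can be replayed afterwards, so the sequence cannot move
 * backwards once the sync is considered done.
 */
static bool
finish_sequence_sync(Oid subid, Oid relid,
                     XLogRecPtr copy_lsn,   /* LSN2 */
                     XLogRecPtr apply_lsn)  /* LSN3 */
{
    if (apply_lsn < copy_lsn)
        return false;           /* keep catching up */

    UpdateSubscriptionRelState(subid, relid,
                               SUBREL_STATE_SYNCDONE, copy_lsn);
    return true;
}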
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Right. I think the important detail is that during sync we have three
> important LSNs:
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the copy is done using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
> (LSN1 < LSN3 < LSN2)
>
> The sync then ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again", making the sequence go backwards.
>
> IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

Back in this thread, an approach was proposed to use the page LSN (LSN2 above) to make sure that no change before LSN2 is applied on the subscriber. The approach was discussed in emails around [1] and discarded later for no reason. I think that approach has some merit.

[1] https://www.postgresql.org/message-id/flat/21c87ea8-86c9-80d6-bc78-9b95033ca00b%40enterprisedb.com#36bb9c7968b7af577dc080950761290d

--
Best Wishes,
Ashutosh Bapat
On 7/25/23 15:18, Ashutosh Bapat wrote:
>
> ...
>
>> But for sequences, the copy happens after the slot creation, possibly
>> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
>> (which may be a bit behind, for whatever reason), it may happen that
>>
>> (LSN1 < LSN3 < LSN2)
>>
>> The sync then ends at LSN3, but that means all sequence changes between
>> LSN3 and LSN2 will be applied "again", making the sequence go backwards.
>>
>> IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

Do you agree this scheme would be correct?

> Back in this thread, an approach was proposed to use the page LSN (LSN2
> above) to make sure that no change before LSN2 is applied on the
> subscriber. The approach was discussed in emails around [1] and
> discarded later for no reason. I think that approach has some merit.
>
> [1] https://www.postgresql.org/message-id/flat/21c87ea8-86c9-80d6-bc78-9b95033ca00b%40enterprisedb.com#36bb9c7968b7af577dc080950761290d
>

That doesn't seem to be the correct link ...

IIRC the page LSN was discussed as a way to skip changes up to the point when the COPY was done. I believe it might work with the scheme I described above too. The trouble is we don't have an interface to select both the sequence state and the page LSN. It's probably not hard to add one (extend read_seq_tuple() to also return the LSN, and add a SQL function), but I don't think it'd add much value compared to just getting the current insert LSN after the COPY. Yes, the current LSN may be a bit higher, so we may need to apply a couple of changes to get into the "ready" state. But we read it right after copy_sequence(), so how much can happen in between? Also, we can get into a similar state anyway - the main subscription can get ahead, at which point the sync has to catch up to it.

The attached patch (part 0007) does it this way. Can you check whether you can still reproduce the "backwards" movement with this version?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230725.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230725.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230725.patch
- 0004-Simplify-protocol-versioning-20230725.patch
- 0005-replace-created-flag-with-XLOG_SMGR_CREATE-20230725.patch
- 0006-add-XLOG_INCLUDE_ORIGIN-for-sequences-20230725.patch
- 0007-Catchup-up-to-a-LSN-after-copy-of-the-seque-20230725.patch
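[Editorial note: for concreteness, the interface extension mentioned in the email above (returning the page LSN together with the sequence state) could look roughly like this, as a new function next to the existing static helper in commands/sequence.c. The wrapper name and the extra output parameter are hypothetical; read_seq_tuple() and PageGetLSN() are the existing pieces:]

#include "postgres.h"
#include "commands/sequence.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Sketch: return the sequence tuple together with the LSN of its page,
 * i.e. the LSN of the last WAL record that changed the sequence. A
 * SQL-callable wrapper could then expose both values, allowing the
 * subscriber to discard any decoded change with a lower LSN.
 */
static Form_pg_sequence_data
read_seq_tuple_with_lsn(Relation rel, Buffer *buf,
                        HeapTuple seqdatatuple, XLogRecPtr *lsn)
{
    Form_pg_sequence_data seq;

    seq = read_seq_tuple(rel, buf, seqdatatuple);

    /* the page LSN reflects the last WAL-logged change of this sequence */
    *lsn = PageGetLSN(BufferGetPage(*buf));

    return seq;
}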
On 7/24/23 14:57, Ashutosh Bapat wrote:
> ...
>
>>
>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
>> I was thinking maybe we should have it in the transaction (because we
>> need to do cleanup at the end). It seems a bit inconvenient, because then
>> we'd need to either search htabs in all subxacts, or transfer the
>> entries to the top-level xact (otoh, we already do that with snapshots),
>> and clean up on abort.
>>
>> What do you think?
>
> A hash table per transaction seems a saner design. Adding it to the top-level
> transaction should be fine. The entry will contain an XID
> anyway. If we add it to every subtransaction, we will need to search
> the hash table in each of the subtransactions when deciding whether a
> sequence change is transactional or not. The top transaction is a
> reasonable trade-off.
>

It's not clear to me what design you're proposing, exactly. If we track it in top-level transactions, then we'd need to copy the data whenever a transaction is assigned as a child, and perhaps also remove it when there's a subxact abort. And we'd still need to search the hashes in all toplevel transactions on every sequence increment - in principle we can't have an increment for a sequence created in another in-progress transaction, but maybe it's just not assigned yet.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Here's a somewhat cleaned-up version of the patch series, with some of the smaller "rework" patches (protocol versioning, origins, smgr_create, ...) merged into the appropriate part. I've kept the bit adding a separate tablesync LSN.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/25/23 08:28, Amit Kapila wrote:
> > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 7/24/23 12:40, Amit Kapila wrote:
> >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> >>> <ashutosh.bapat.oss@gmail.com> wrote:
> >>>
> >>> Even after that, see below the value of the sequence is still not
> >>> caught up. Later, when the apply worker processes all the WAL, the
> >>> sequence state will be caught up.
> >>>
> >>
> >> And how is this different from what tablesync does for tables? For that,
> >> 'r' also does not mean it's fully caught up, IIRC. What matters is
> >> whether the sequence can go back from this moment on. And I don't think
> >> it can, because that would require replaying changes from before we did
> >> copy_sequence ...
> >>
> >
> > For sequences, it is quite possible that we replay WAL from before the
> > copy_sequence, whereas the same is not true for tables (w.r.t.
> > copy_table()). This is because for tables we have a kind of interlock
> > w.r.t. the LSN returned via create_slot (say this value of LSN is LSN1);
> > basically, the walsender corresponding to the tablesync worker on the
> > publisher won't send any WAL before that LSN, whereas the same is not
> > true for sequences. Also, even if the apply worker can receive WAL before
> > copy_table, it won't apply that, as it would be behind LSN1, and
> > the same is not true for sequences. So, for tables, we will never go
> > back to a state before the copy_table(), but for sequences, we can go
> > back to a state before copy_sequence().
> >
>
> Right. I think the important detail is that during sync we have three
> important LSNs:
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the copy is done using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
> (LSN1 < LSN3 < LSN2)
>
> Then the sync ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again", making the sequence go backwards.
>

Yeah, the problem is essentially as you explained, but an additional minor point is that for sequences we also end up applying the WAL between LSN1 and LSN3, which makes the sequence go backwards. Ideally, sequences on subscribers should never go backward in a way that is visible to users. I will study your proposal and share my thoughts in a later email.

--
With Regards,
Amit Kapila.
On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 7/25/23 08:28, Amit Kapila wrote: > > > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra > > > <tomas.vondra@enterprisedb.com> wrote: > > >> > > >> On 7/24/23 12:40, Amit Kapila wrote: > > >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat > > >>> <ashutosh.bapat.oss@gmail.com> wrote: > > >>> > > >>> Even after that, see below the value of the sequence is still not > > >>> caught up. Later, when the apply worker processes all the WAL, the > > >>> sequence state will be caught up. > > >>> > > >> > > >> And how is this different from what tablesync does for tables? For that > > >> 'r' also does not mean it's fully caught up, IIRC. What matters is > > >> whether the sequence since this moment can go back. And I don't think it > > >> can, because that would require replaying changes from before we did > > >> copy_sequence ... > > >> > > > > > > For sequences, it is quite possible that we replay WAL from before the > > > copy_sequence whereas the same is not true for tables (w.r.t > > > copy_table()). This is because for tables we have a kind of interlock > > > w.r.t LSN returned via create_slot (say this value of LSN is LSN1), > > > basically, the walsender corresponding to tablesync worker in > > > publisher won't send any WAL before that LSN whereas the same is not > > > true for sequences. Also, even if apply worker can receive WAL before > > > copy_table, it won't apply that as that would be behind the LSN1 and > > > the same is not true for sequences. So, for tables, we will never go > > > back to a state before the copy_table() but for sequences, we can go > > > back to a state before copy_sequence(). > > > > > > > Right. I think the important detail is that during sync we have three > > important LSNs > > > > - LSN1 where the slot is created > > - LSN2 where the copy happens > > - LSN3 where we consider the sync completed > > > > For tables, LSN1 == LSN2, because the data is completed using the > > snapshot from the temporary slot. And (LSN1 <= LSN3). > > > > But for sequences, the copy happens after the slot creation, possibly > > with (LSN1 < LSN2). And because LSN3 comes from the main subscription > > (which may be a bit behind, for whatever reason), it may happen that > > > > (LSN1 < LSN3 < LSN2) > > > > The the sync ends at LSN3, but that means all sequence changes between > > LSN3 and LSN2 will be applied "again" making the sequence go away. > > > > Yeah, the problem is something as you explained but an additional > minor point is that for sequences we also do end up applying the WAL > between LSN1 and LSN3 which makes it go backwards. > I was reading this email thread and found the email by Andres [1] which seems to me to say the same thing: "I assume that part of the initial sync would have to be a new sequence synchronization step that reads all the sequence states on the publisher and ensures that the subscriber sequences are at the same point. There's a bit of trickiness there, but it seems entirely doable. The logical replication replay support for sequences will have to be a bit careful about not decreasing the subscriber's sequence values - the standby initially will be ahead of the increments we'll see in the WAL.". Now, IIUC this means that even before the sequence is marked as SYNCDONE, it shouldn't go backward. 
[1]: "https://www.postgresql.org/message-id/20221117024357.ljjme6v75mny2j6u%40awork3.anarazel.de With Regards, Amit Kapila.
On 7/26/23 09:27, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> ...
>>
>
> I was reading this email thread and found the email by Andres [1]
> which seems to me to say the same thing: "I assume that part of the
> initial sync would have to be a new sequence synchronization step that
> reads all the sequence states on the publisher and ensures that the
> subscriber sequences are at the same point. There's a bit of
> trickiness there, but it seems entirely doable. The logical
> replication replay support for sequences will have to be a bit careful
> about not decreasing the subscriber's sequence values - the standby
> initially will be ahead of the increments we'll see in the WAL.".
> Now, IIUC this means that even before the sequence is marked as
> SYNCDONE, it shouldn't go backward.
>

Well, I could argue that's more an opinion, and I'm not sure it really contradicts the idea that the sequence only needs to not go backwards after the sync completes.

Anyway, I was thinking about this a bit more, and it seems it's not as difficult to use the page LSN to ensure sequences don't go backwards. The 0005 change does that, by:

1) adding pg_sequence_state, which returns both the sequence state and the page LSN

2) making copy_sequence return the page LSN

3) making tablesync set this LSN as origin_startpos (which for tables is just the LSN of the replication slot)

AFAICS this makes it work - we start decoding at the page LSN, so that we skip the increments that could lead to the sequence going backwards.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
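For illustration, here is a hypothetical example of what using the new function from 0005 might look like; the exact signature and output columns are assumptions based on the description above, not taken from the patch:

----
-- assumed to return the sequence state plus the LSN of its page
SELECT * FROM pg_sequence_state('s'::regclass);

-- copy_sequence() would return that page LSN, and tablesync would use
-- it as origin_startpos, so decoding starts exactly at the page LSN
----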
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/26/23 09:27, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
>

While studying the changes for this proposal and related areas, I have a few comments:

1. I think you need to advance the origin if it is changed due to copy_sequence(), otherwise, if the sync worker restarts after SUBREL_STATE_FINISHEDCOPY, it will restart from the slot's LSN value.

2. Between the time of the SYNCDONE and READY states, the patch can skip applying non-transactional sequence changes even when it should apply them. The reason is that during that state change should_apply_changes_for_rel() decides whether to apply a change based on the value of remote_final_lsn, which won't be set for a non-transactional change. I think we need to send the start LSN of a non-transactional record and then use that as remote_final_lsn for such a change.

3. For non-transactional sequence change apply, we don't set replorigin_session_origin_lsn/replorigin_session_origin_timestamp as we are doing in apply_handle_commit_internal() before calling CommitTransactionCommand(). So, that can lead to the origin moving backwards after a restart, which will lead to requesting and applying the same changes again, and for that period of time the sequence can go backwards. This needs some more thought as to what the correct behaviour/solution is.

4. BTW, while checking this behaviour, I noticed that the initial sync worker for a sequence mentions the table in the LOG message: "LOG: logical replication table synchronization worker for subscription "mysub", table "s" has finished". Won't it be better here to refer to it as a sequence?

--
With Regards,
Amit Kapila.
On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/24/23 14:57, Ashutosh Bapat wrote:
> > ...
> >
> >> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> >> I was thinking maybe we should have it in the transaction (because we
> >> need to do cleanup at the end). It seems a bit inconvenient, because then
> >> we'd need to either search htabs in all subxacts, or transfer the
> >> entries to the top-level xact (otoh, we already do that with snapshots),
> >> and clean up on abort.
> >>
> >> What do you think?
> >
> > A hash table per transaction seems like the saner design. Adding it to the
> > top-level transaction should be fine. The entry will contain an XID
> > anyway. If we add it to every subtransaction we will need to search the
> > hash table in each of the subtransactions when deciding whether a
> > sequence change is transactional or not. The top transaction is a
> > reasonable trade-off.
> >
>
> It's not clear to me what design you're proposing, exactly.
>
> If we track it in top-level transactions, then we'd need to copy the data
> whenever a transaction is assigned as a child, and perhaps also remove
> it when there's a subxact abort.

I thought, esp. with your changes to assign xid, we will always know the top-level transaction when a sequence is assigned a relfilenode. So the relfilenodes will always get added to the correct hash directly. I didn't imagine a case where we would need to copy the hash table from a sub-transaction to the top transaction. If that's true, yes, it's inconvenient.

As to the abort, don't we already remove entries on subtxn abort? Having a per-transaction hash table doesn't seem to change anything much.

> And we'd still need to search the hashes in all top-level transactions on
> every sequence increment - in principle we can't have an increment for a
> sequence created in another in-progress transaction, but maybe it's just
> not assigned yet.

We hold a strong lock on the sequence when changing its relfilenode. The sequence whose relfilenode is being changed can not be accessed by any concurrent transaction. So I am not able to understand what you are trying to say.

I think a per (top-level) transaction hash table is the cleaner design. It puts the hash table where it should be. But if that makes the code difficult, the current design works too.

--
Best Wishes,
Ashutosh Bapat
On 7/28/23 11:42, Amit Kapila wrote: > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/26/23 09:27, Amit Kapila wrote: >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> Anyway, I was thinking about this a bit more, and it seems it's not as >> difficult to use the page LSN to ensure sequences don't go backwards. >> > > While studying the changes for this proposal and related areas, I have > a few comments: > 1. I think you need to advance the origin if it is changed due to > copy_sequence(), otherwise, if the sync worker restarts after > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN > value. > True, we want to restart at the new origin_startpos. > 2. Between the time of SYNCDONE and READY state, the patch can skip > applying non-transactional sequence changes even if it should apply > it. The reason is that during that state change > should_apply_changes_for_rel() decides whether to apply change based > on the value of remote_final_lsn which won't be set for > non-transactional change. I think we need to send the start LSN of a > non-transactional record and then use that as remote_final_lsn for > such a change. Good catch. remote_final_lsn is set in apply_handle_begin, but that won't happen for sequences. We're already sending the LSN, but logicalrep_read_sequence ignores it - it should be enough to add it to LogicalRepSequence and then set it in apply_handle_sequence(). > > 3. For non-transactional sequence change apply, we don't set > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as > we are doing in apply_handle_commit_internal() before calling > CommitTransactionCommand(). So, that can lead to the origin moving > backwards after restart which will lead to requesting and applying the > same changes again and for that period of time sequence can go > backwards. This needs some more thought as to what is the correct > behaviour/solution for this. > I think saying "origin moves backwards" is a bit misleading. AFAICS the origin position is not actually moving backwards, it's more that we don't (and can't) move it forwards for each non-transactional change. So yeah, we may re-apply those, and IMHO that's expected - the sequence is allowed to be "ahead" on the subscriber. I don't see a way to improve this, except maybe having a separate LSN for non-transactional changes (for each origin). > 4. BTW, while checking this behaviour, I noticed that the initial sync > worker for sequence mentions the table in the LOG message: "LOG: > logical replication table synchronization worker for subscription > "mysub", table "s" has finished". Won't it be better here to refer to > it as a sequence? > Thanks, I'll fix that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
> The 0005 change does that, by:
>
> 1) adding pg_sequence_state, which returns both the sequence state and
> the page LSN
>
> 2) making copy_sequence return the page LSN
>
> 3) making tablesync set this LSN as origin_startpos (which for tables is
> just the LSN of the replication slot)
>
> AFAICS this makes it work - we start decoding at the page LSN, so that
> we skip the increments that could lead to the sequence going backwards.
>

I like this design very much. It makes things simpler rather than more complex. Thanks for doing this.

I am wondering whether we could reuse pg_sequence_last_value() instead of adding a new function. But the name of that function doesn't leave much space for expanding its functionality, so we are good with a new one. Probably with some code deduplication.

--
Best Wishes,
Ashutosh Bapat
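For comparison, the existing function only returns the last value (it backs the pg_sequences view) and has no room for the page LSN:

----
-- existing function: last value only (NULL if nextval was never called)
SELECT pg_sequence_last_value('s'::regclass);
----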
On 7/28/23 14:35, Ashutosh Bapat wrote: > On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/24/23 14:57, Ashutosh Bapat wrote: >>> ... >>> >>>> >>>> >>>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global. >>>> I was thinking maybe we should have it in the transaction (because we >>>> need to do cleanup at the end). It seem a bit inconvenient, because then >>>> we'd need to either search htabs in all subxacts, or transfer the >>>> entries to the top-level xact (otoh, we already do that with snapshots), >>>> and cleanup on abort. >>>> >>>> What do you think? >>> >>> Hash table per transaction seems saner design. Adding it to the top >>> level transaction should be fine. The entry will contain an XID >>> anyway. If we add it to every subtransaction we will need to search >>> hash table in each of the subtransactions when deciding whether a >>> sequence change is transactional or not. Top transaction is a >>> reasonable trade off. >>> >> >> It's not clear to me what design you're proposing, exactly. >> >> If we track it in top-level transactions, then we'd need copy the data >> whenever a transaction is assigned as a child, and perhaps also remove >> it when there's a subxact abort. > > I thought, esp. with your changes to assign xid, we will always know > the top level transaction when a sequence is assigned a relfilenode. > So the refilenodes will always get added to the correct hash directly. > I didn't imagine a case where we will need to copy the hash table from > sub-transaction to top transaction. If that's true, yes it's > inconvenient. > Well, it's a matter of efficiency. To check if a sequence change is transactional, we need to check if it's for a relfilenode created in the current transaction (it can't be for relfilenode created in a concurrent top-level transaction, due to MVCC). If you don't copy the entries into the top-level xact, you have to walk all subxacts and search all of those, for each sequence change. And there may be quite a few of both subxacts and sequence changes ... I wonder if we need to search the other top-level xacts, but we probably need to do that. Because it might be a subxact without an assignment, or something like that. > As to the abort, don't we already remove entries on subtxn abort? > Having per transaction hash table doesn't seem to change anything > much. > What entries are we removing? My point is that if we copy the entries to the top-level xact, we probably need to remove them on abort. Or we could leave them in the top-level xact hash. >> >> And we'd need to still search the hashes in all toplevel transactions on >> every sequence increment - in principle we can't have increment for a >> sequence created in another in-progress transaction, but maybe it's just >> not assigned yet. > > We hold a strong lock on sequence when changing its relfilenode. The > sequence whose relfilenode is being changed can not be accessed by any > concurrent transaction. So I am not able to understand what you are > trying to say. > How do you know the subxact has already been recognized as such? It may be treated as top-level transaction for a while, until the assignment. > I think per (top level) transaction hash table is cleaner design. It > puts the hash table where it should be. But if that makes code > difficult, current design works too. > regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/28/23 11:42, Amit Kapila wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 7/26/23 09:27, Amit Kapila wrote: > >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> > > > > While studying the changes for this proposal and related areas, I have > > a few comments: > > 1. I think you need to advance the origin if it is changed due to > > copy_sequence(), otherwise, if the sync worker restarts after > > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN > > value. > > > > True, we want to restart at the new origin_startpos. > > > 2. Between the time of SYNCDONE and READY state, the patch can skip > > applying non-transactional sequence changes even if it should apply > > it. The reason is that during that state change > > should_apply_changes_for_rel() decides whether to apply change based > > on the value of remote_final_lsn which won't be set for > > non-transactional change. I think we need to send the start LSN of a > > non-transactional record and then use that as remote_final_lsn for > > such a change. > > Good catch. remote_final_lsn is set in apply_handle_begin, but that > won't happen for sequences. We're already sending the LSN, but > logicalrep_read_sequence ignores it - it should be enough to add it to > LogicalRepSequence and then set it in apply_handle_sequence(). > As per my understanding, the LSN sent is EndRecPtr of record which is the beginning of the next record (means current_record_end + 1). For comparing the current record, we use the start_position of the record as we do when we use the remote_final_lsn via apply_handle_begin(). > > > > 3. For non-transactional sequence change apply, we don't set > > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as > > we are doing in apply_handle_commit_internal() before calling > > CommitTransactionCommand(). So, that can lead to the origin moving > > backwards after restart which will lead to requesting and applying the > > same changes again and for that period of time sequence can go > > backwards. This needs some more thought as to what is the correct > > behaviour/solution for this. > > > > I think saying "origin moves backwards" is a bit misleading. AFAICS the > origin position is not actually moving backwards, it's more that we > don't (and can't) move it forwards for each non-transactional change. So > yeah, we may re-apply those, and IMHO that's expected - the sequence is > allowed to be "ahead" on the subscriber. > But, if this happens then for a period of time the sequence will go backwards relative to what one would have observed before restart. -- With Regards, Amit Kapila.
On 7/28/23 14:44, Ashutosh Bapat wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>> The 0005 change does that, by:
>>
>> 1) adding pg_sequence_state, which returns both the sequence state and
>> the page LSN
>>
>> 2) making copy_sequence return the page LSN
>>
>> 3) making tablesync set this LSN as origin_startpos (which for tables is
>> just the LSN of the replication slot)
>>
>> AFAICS this makes it work - we start decoding at the page LSN, so that
>> we skip the increments that could lead to the sequence going backwards.
>>
>
> I like this design very much. It makes things simpler rather than more
> complex. Thanks for doing this.
>

I agree it seems simpler. It'd be good to test / review it a bit more, to make sure it doesn't misbehave in some way.

> I am wondering whether we could reuse pg_sequence_last_value() instead
> of adding a new function. But the name of that function doesn't leave
> much space for expanding its functionality, so we are good with a new
> one. Probably with some code deduplication.
>

I don't think we should do that - pg_sequence_last_value() is meant to do something different, and I don't think making it also do what pg_sequence_state() does would make anything simpler.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/29/23 06:54, Amit Kapila wrote: > On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/28/23 11:42, Amit Kapila wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> On 7/26/23 09:27, Amit Kapila wrote: >>>>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> Anyway, I was thinking about this a bit more, and it seems it's not as >>>> difficult to use the page LSN to ensure sequences don't go backwards. >>>> >>> >>> While studying the changes for this proposal and related areas, I have >>> a few comments: >>> 1. I think you need to advance the origin if it is changed due to >>> copy_sequence(), otherwise, if the sync worker restarts after >>> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN >>> value. >>> >> >> True, we want to restart at the new origin_startpos. >> >>> 2. Between the time of SYNCDONE and READY state, the patch can skip >>> applying non-transactional sequence changes even if it should apply >>> it. The reason is that during that state change >>> should_apply_changes_for_rel() decides whether to apply change based >>> on the value of remote_final_lsn which won't be set for >>> non-transactional change. I think we need to send the start LSN of a >>> non-transactional record and then use that as remote_final_lsn for >>> such a change. >> >> Good catch. remote_final_lsn is set in apply_handle_begin, but that >> won't happen for sequences. We're already sending the LSN, but >> logicalrep_read_sequence ignores it - it should be enough to add it to >> LogicalRepSequence and then set it in apply_handle_sequence(). >> > > As per my understanding, the LSN sent is EndRecPtr of record which is > the beginning of the next record (means current_record_end + 1). For > comparing the current record, we use the start_position of the record > as we do when we use the remote_final_lsn via apply_handle_begin(). > >>> >>> 3. For non-transactional sequence change apply, we don't set >>> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as >>> we are doing in apply_handle_commit_internal() before calling >>> CommitTransactionCommand(). So, that can lead to the origin moving >>> backwards after restart which will lead to requesting and applying the >>> same changes again and for that period of time sequence can go >>> backwards. This needs some more thought as to what is the correct >>> behaviour/solution for this. >>> >> >> I think saying "origin moves backwards" is a bit misleading. AFAICS the >> origin position is not actually moving backwards, it's more that we >> don't (and can't) move it forwards for each non-transactional change. So >> yeah, we may re-apply those, and IMHO that's expected - the sequence is >> allowed to be "ahead" on the subscriber. >> > > But, if this happens then for a period of time the sequence will go > backwards relative to what one would have observed before restart. > That is true, but is it really a problem? This whole sequence decoding thing was meant to allow logical failover - make sure that after switch to the subscriber, the sequences don't generate duplicate values. From this POV, the sequence going backwards (back to the confirmed origin position) is not an issue - it's still far enough (ahead of publisher). Is that great / ideal? No, I agree with that. But it was considered acceptable and good enough for the failover use case ... 
The only idea I have for improving that is that we could keep the non-transactional changes (instead of applying them immediately), and then apply them at the nearest "commit". That'd mean they're subject to the position tracking, and the sequence would not go backwards, I think.

So every time we decode a commit, we'd check if we decoded any sequence changes since the last commit, and merge them (a bit like a subxact).

This would however also mean sequence changes from rolled-back xacts may not be replicated. I think that'd be fine, but IIRC Andres suggested it's a valid use case.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/28/23 14:35, Ashutosh Bapat wrote:
>
> ...
>
> We hold a strong lock on the sequence when changing its relfilenode. The
> sequence whose relfilenode is being changed can not be accessed by any
> concurrent transaction. So I am not able to understand what you are
> trying to say.
>
> I think a per (top-level) transaction hash table is the cleaner design. It
> puts the hash table where it should be. But if that makes the code
> difficult, the current design works too.
>

I was thinking about switching to the per-txn hash, so here's a patch adopting that approach (in part 0006). I can't say it's much simpler, but maybe it can be simplified a bit. Most of the complexity comes from assignments possibly happening with a delay, which makes it hard to say what the top-level xact is.

The patch essentially does this:

1) the HTAB is moved to ReorderBufferTXN

2) after decoding SMGR_CREATE, we add an entry to the current TXN and (for subtransactions) to the parent TXN (even the copy references the subxact)

3) when processing an assignment, we copy the HTAB entries from the subxact to the parent

4) after a subxact abort, we remove the HTAB entries from the parent

5) while searching for a relfilenode, we only scan the HTAB in the top-level xacts (this is possible due to the copying)

This could work without the copy in the parent HTAB, but then we'd have to scan all the transactions for every increment. And there may be many lookups and many (sub)transactions, but only a small number of new relfilenodes. So it seems like a good tradeoff.

If we could convince ourselves that the subxact has to be already assigned while decoding the sequence change, then we could simply search only the current transaction (and its parent). But I've been unable to convince myself that's guaranteed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230729.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230729.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230729.patch
- 0004-Catchup-up-to-a-LSN-after-copy-of-the-seque-20230729.patch
- 0005-use-page-LSN-for-sequences-20230729.patch
- 0006-per-transaction-hash-of-sequences-20230729.patch
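As a reminder of what the hash is for: decoding has to classify each sequence change as transactional (the relfilenode was created by a transaction that is still in progress) or non-transactional (everything else). A minimal SQL illustration of the two cases; the sequence name is arbitrary:

----
-- transactional: the sequence gets a new relfilenode in this xact, so
-- its increment is decoded as part of the transaction, and is discarded
-- if the transaction aborts
BEGIN;
CREATE SEQUENCE s2;
SELECT nextval('s2');
COMMIT;

-- non-transactional: 's2' already has a pre-existing relfilenode, so
-- this increment is decoded and applied immediately, even though the
-- surrounding transaction rolls back
BEGIN;
SELECT nextval('s2');
ROLLBACK;
----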
On 7/29/23 14:38, Tomas Vondra wrote:
>
> ...
>
> The only idea I have for improving that is that we could keep the
> non-transactional changes (instead of applying them immediately), and
> then apply them at the nearest "commit". That'd mean they're subject to
> the position tracking, and the sequence would not go backwards, I think.
>
> So every time we decode a commit, we'd check if we decoded any sequence
> changes since the last commit, and merge them (a bit like a subxact).
>
> This would however also mean sequence changes from rolled-back xacts may
> not be replicated. I think that'd be fine, but IIRC Andres suggested
> it's a valid use case.
>

I wasn't sure how difficult this approach would be, so I experimented with it today, and it's waaaay more complicated than I thought. In fact, I'm not even sure how to do it at all ...

Part 0008 is a WIP patch where ReorderBufferQueueSequence does not apply the non-transactional changes immediately, and instead adds the changes to a top-level list. ReorderBufferCommit then adds a fake subxact with all sequence changes up to the commit LSN.

The challenging part is snapshot management - when applying the changes immediately, we can simply build and use the current snapshot. But with 0008 it's not that simple - we don't even know into which transaction the sequence change will get "injected". In fact, we don't even know if the parent transaction will have a snapshot (if it only does nextval() it may seem empty).

I was thinking maybe we could "keep" the snapshots for non-transactional changes, but I suspect that might confuse the main transaction in some way.

I'm still not convinced this behavior would actually be desirable ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Logical-decoding-of-sequences-20230730.patch
- 0002-Add-decoding-of-sequences-to-test_decoding-20230730.patch
- 0003-Add-decoding-of-sequences-to-built-in-repli-20230730.patch
- 0004-Catchup-up-to-a-LSN-after-copy-of-the-seque-20230730.patch
- 0005-use-page-LSN-for-sequences-20230730.patch
- 0006-per-transaction-hash-of-sequences-20230730.patch
- 0007-assert-checking-sequence-hash-20230730.patch
- 0008-try-adding-fake-transaction-with-sequence-c-20230730.patch
On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/28/23 14:44, Ashutosh Bapat wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> The 0005 change does that, by: > >> > >> 1) adding pg_sequence_state, that returns both the sequence state and > >> the page LSN > >> > >> 2) copy_sequence returns the page LSN > >> > >> 3) tablesync then sets this LSN as origin_startpos (which for tables is > >> just the LSN of the replication slot) > >> > >> AFAICS this makes it work - we start decoding at the page LSN, so that > >> we skip the increments that could lead to the sequence going backwards. > >> > > > > I like this design very much. It makes things simpler than complex. > > Thanks for doing this. > > > > I agree it seems simpler. It'd be good to try testing / reviewing it a > bit more, so that it doesn't misbehave in some way. > Yeah, I also think this needs a review. This is a sort of new concept where we don't use the LSN of the slot (for cases where copy returned a larger value of LSN) or a full_snapshot created corresponding to the sync slot by Walsender. For the case of the table, we build a full snapshot because we use that for copying the table but why do we need to build that for copying the sequence especially when we directly copy it from the sequence relation without caring for any snapshot? -- With Regards, Amit Kapila.
On 7/31/23 11:25, Amit Kapila wrote: > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/28/23 14:44, Ashutosh Bapat wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> Anyway, I was thinking about this a bit more, and it seems it's not as >>>> difficult to use the page LSN to ensure sequences don't go backwards. >>>> The 0005 change does that, by: >>>> >>>> 1) adding pg_sequence_state, that returns both the sequence state and >>>> the page LSN >>>> >>>> 2) copy_sequence returns the page LSN >>>> >>>> 3) tablesync then sets this LSN as origin_startpos (which for tables is >>>> just the LSN of the replication slot) >>>> >>>> AFAICS this makes it work - we start decoding at the page LSN, so that >>>> we skip the increments that could lead to the sequence going backwards. >>>> >>> >>> I like this design very much. It makes things simpler than complex. >>> Thanks for doing this. >>> >> >> I agree it seems simpler. It'd be good to try testing / reviewing it a >> bit more, so that it doesn't misbehave in some way. >> > > Yeah, I also think this needs a review. This is a sort of new concept > where we don't use the LSN of the slot (for cases where copy returned > a larger value of LSN) or a full_snapshot created corresponding to the > sync slot by Walsender. For the case of the table, we build a full > snapshot because we use that for copying the table but why do we need > to build that for copying the sequence especially when we directly > copy it from the sequence relation without caring for any snapshot? > We need the slot to decode/apply changes during catchup. The main subscription may get ahead, and we need to ensure the WAL is not discarded or something like that. This applies even if the initial sync step does not use the slot/snapshot directly. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 7/31/23 11:25, Amit Kapila wrote: > > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 7/28/23 14:44, Ashutosh Bapat wrote: > >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > >>> <tomas.vondra@enterprisedb.com> wrote: > >>>> > >>>> Anyway, I was thinking about this a bit more, and it seems it's not as > >>>> difficult to use the page LSN to ensure sequences don't go backwards. > >>>> The 0005 change does that, by: > >>>> > >>>> 1) adding pg_sequence_state, that returns both the sequence state and > >>>> the page LSN > >>>> > >>>> 2) copy_sequence returns the page LSN > >>>> > >>>> 3) tablesync then sets this LSN as origin_startpos (which for tables is > >>>> just the LSN of the replication slot) > >>>> > >>>> AFAICS this makes it work - we start decoding at the page LSN, so that > >>>> we skip the increments that could lead to the sequence going backwards. > >>>> > >>> > >>> I like this design very much. It makes things simpler than complex. > >>> Thanks for doing this. > >>> > >> > >> I agree it seems simpler. It'd be good to try testing / reviewing it a > >> bit more, so that it doesn't misbehave in some way. > >> > > > > Yeah, I also think this needs a review. This is a sort of new concept > > where we don't use the LSN of the slot (for cases where copy returned > > a larger value of LSN) or a full_snapshot created corresponding to the > > sync slot by Walsender. For the case of the table, we build a full > > snapshot because we use that for copying the table but why do we need > > to build that for copying the sequence especially when we directly > > copy it from the sequence relation without caring for any snapshot? > > > > We need the slot to decode/apply changes during catchup. The main > subscription may get ahead, and we need to ensure the WAL is not > discarded or something like that. This applies even if the initial sync > step does not use the slot/snapshot directly. > AFAIK, none of these needs a full_snapshot (see usage of SnapBuild->building_full_snapshot). The full_snapshot tracks both catalog and non-catalog xacts in the snapshot where we require to track non-catalog ones because we want to copy the table using that snapshot. It is relatively expensive to build a full snapshot and we don't do that unless it is required. For the current usage of this patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient. -- With Regards, Amit Kapila.
On 8/1/23 04:59, Amit Kapila wrote: > On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 7/31/23 11:25, Amit Kapila wrote: >>> ... >>> >>> Yeah, I also think this needs a review. This is a sort of new concept >>> where we don't use the LSN of the slot (for cases where copy returned >>> a larger value of LSN) or a full_snapshot created corresponding to the >>> sync slot by Walsender. For the case of the table, we build a full >>> snapshot because we use that for copying the table but why do we need >>> to build that for copying the sequence especially when we directly >>> copy it from the sequence relation without caring for any snapshot? >>> >> >> We need the slot to decode/apply changes during catchup. The main >> subscription may get ahead, and we need to ensure the WAL is not >> discarded or something like that. This applies even if the initial sync >> step does not use the slot/snapshot directly. >> > > AFAIK, none of these needs a full_snapshot (see usage of > SnapBuild->building_full_snapshot). The full_snapshot tracks both > catalog and non-catalog xacts in the snapshot where we require to > track non-catalog ones because we want to copy the table using that > snapshot. It is relatively expensive to build a full snapshot and we > don't do that unless it is required. For the current usage of this > patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient. > Yeah, you may be right we don't need a full snapshot, because we don't need to export it. We however still need a snapshot, and it wasn't clear to me whether you suggest we don't need the slot / snapshot at all. Anyway, I think this is "just" a matter of efficiency, not correctness. IMHO there are bigger questions regarding the "going back" behavior after apply restart. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
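The distinction maps to the snapshot option used when creating the slot over a replication connection. A sketch using the walsender grammar; the slot and plugin names are examples, and the exact option spelling depends on the server version (older releases use bare keywords like USE_SNAPSHOT):

----
-- what tablesync uses for tables: a snapshot the copy can use
CREATE_REPLICATION_SLOT "sync_slot" TEMPORARY LOGICAL "pgoutput" (SNAPSHOT 'use');

-- what would suffice for sequences per the above: no snapshot export/use
CREATE_REPLICATION_SLOT "sync_slot" TEMPORARY LOGICAL "pgoutput" (SNAPSHOT 'nothing');
----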
On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Anyway, I think this is "just" a matter of efficiency, not correctness.
> IMHO there are bigger questions regarding the "going back" behavior
> after apply restart.

sequence_decode() has the following code:

/* Skip the change if already processed (per the snapshot). */
if (transactional &&
    !SnapBuildProcessChange(builder, xid, buf->origptr))
    return;
else if (!transactional &&
         (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
          SnapBuildXactNeedsSkip(builder, buf->origptr)))
    return;

This means that if the subscription restarts, the upstream will *not* send any non-transactional sequence changes with an LSN prior to the LSN specified by the START_REPLICATION command. That should avoid replicating all the non-transactional sequence changes since ReplicationSlot::restart_lsn if the subscription restarts.

But in apply_handle_sequence(), we do not update replorigin_session_origin_lsn with the LSN of a non-transactional sequence change when it's applied. This means that if a subscription restarts while it is halfway through applying a transaction, those changes will be replicated again. This will move the sequence backward. If the subscription keeps restarting again and again while applying that transaction, we will see the sequence "rubber banding" [1] on the subscriber. So until the transaction is completely applied, the other users of the sequence may see duplicate values during this time. I think this is undesirable.

But I am not able to find a case where this can lead to conflicting values after failover. If there's only one transaction which is repeatedly being applied, the rows which use sequence values were never committed, so there's no conflicting value present on the subscriber. The same reasoning can be extended to multiple in-flight transactions. If another transaction (T2) uses the sequence values changed by in-flight transaction T1, and T2 commits before T1, the sequence changes used by T2 must have LSNs before the commit of T2 and thus they will never be replicated. (See example below.)

T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q1
T2
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q2
COMMIT;
T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q3
COMMIT;

So I am not able to imagine a case when a sequence going backward can cause conflicting values.

But whether or not that's the case, the downstream should not request (and hence receive) any changes that have been already applied (and committed) downstream, as a principle. I think a way to achieve this is to update replorigin_session_origin_lsn so that a sequence change applied once is not requested (and hence sent) again.

[1] https://en.wikipedia.org/wiki/Rubber_banding

--
Best Wishes,
Ashutosh Bapat
On 8/11/23 08:32, Ashutosh Bapat wrote:
> On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Anyway, I think this is "just" a matter of efficiency, not correctness.
>> IMHO there are bigger questions regarding the "going back" behavior
>> after apply restart.
>
> sequence_decode() has the following code:
>
> /* Skip the change if already processed (per the snapshot). */
> if (transactional &&
>     !SnapBuildProcessChange(builder, xid, buf->origptr))
>     return;
> else if (!transactional &&
>          (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
>           SnapBuildXactNeedsSkip(builder, buf->origptr)))
>     return;
>
> This means that if the subscription restarts, the upstream will *not*
> send any non-transactional sequence changes with an LSN prior to the LSN
> specified by the START_REPLICATION command. That should avoid replicating
> all the non-transactional sequence changes since
> ReplicationSlot::restart_lsn if the subscription restarts.
>

Ah, right, I got confused and mixed up restart_lsn and the LSN passed in the START_REPLICATION command. Thanks for the details, I think this works fine.

> But in apply_handle_sequence(), we do not update
> replorigin_session_origin_lsn with the LSN of a non-transactional
> sequence change when it's applied. This means that if a subscription
> restarts while it is halfway through applying a transaction, those
> changes will be replicated again. This will move the sequence
> backward. If the subscription keeps restarting again and again while
> applying that transaction, we will see the sequence "rubber banding"
> [1] on the subscriber. So until the transaction is completely applied,
> the other users of the sequence may see duplicate values during this
> time. I think this is undesirable.
>

Well, but as I said earlier, this is not expected to support using the sequence on the subscriber until after the failover, so there's no real risk of "duplicate values". Yes, you might select the data from the sequence directly, but that would have all sorts of issues even without replication - users are required to use nextval/currval and so on.

> But I am not able to find a case where this can lead to conflicting
> values after failover. If there's only one transaction which is
> repeatedly being applied, the rows which use sequence values were
> never committed, so there's no conflicting value present on the
> subscriber. The same reasoning can be extended to multiple in-flight
> transactions. If another transaction (T2) uses the sequence values
> changed by in-flight transaction T1, and T2 commits before T1, the
> sequence changes used by T2 must have LSNs before the commit of T2 and
> thus they will never be replicated. (See example below.)
>
> T1
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q1
> T2
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q2
> COMMIT;
> T1
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q3
> COMMIT;
>
> So I am not able to imagine a case when a sequence going backward can
> cause conflicting values.

Right, I agree this "rubber banding" can happen. But as long as we don't go back too far (before the last applied commit) I think that'd be fine. We only need to make guarantees about committed transactions, and I don't think we need to worry about this too much ...

> But whether or not that's the case, the downstream should not request (and
> hence receive) any changes that have been already applied (and
> committed) downstream, as a principle. I think a way to achieve this is
> to update replorigin_session_origin_lsn so that a sequence change
> applied once is not requested (and hence sent) again.
>

I guess we could update the origin, per attached 0004. We don't have a timestamp to set replorigin_session_origin_timestamp, but it seems we don't need that.

The attached patch merges the earlier improvements, except for the part that experimented with adding a "fake" transaction (which turned out to have a number of difficult issues).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > > But whether or not that's the case, downstream should not request (and > > hence receive) any changes that have been already applied (and > > committed) downstream as a principle. I think a way to achieve this is > > to update the replorigin_session_origin_lsn so that a sequence change > > applied once is not requested (and hence sent) again. > > > > I guess we could update the origin, per attached 0004. We don't have > timestamp to set replorigin_session_origin_timestamp, but it seems we > don't need that. > > The attached patch merges the earlier improvements, except for the part > that experimented with adding a "fake" transaction (which turned out to > have a number of difficult issues). 0004 looks good to me. But I need to review the impact of not setting replorigin_session_origin_timestamp. What fake transaction experiment are you talking about? -- Best Wishes, Ashutosh Bapat
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > > But whether or not that's the case, the downstream should not request (and
> > > hence receive) any changes that have been already applied (and
> > > committed) downstream, as a principle. I think a way to achieve this is
> > > to update replorigin_session_origin_lsn so that a sequence change
> > > applied once is not requested (and hence sent) again.
> > >
> >
> > I guess we could update the origin, per attached 0004. We don't have a
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me.
>

+ {
  CommitTransactionCommand();
+
+ /*
+  * Update origin state so we don't try applying this sequence
+  * change in case of crash.
+  *
+  * XXX We don't have replorigin_session_origin_timestamp, but we
+  * can just leave that set to 0.
+  */
+ replorigin_session_origin_lsn = seq.lsn;

IIUC, your proposal is to update replorigin_session_origin_lsn, so that after a restart, it doesn't use some prior origin LSN to start with, which can in turn lead the sequence to go backward. If so, it should be updated before calling CommitTransactionCommand() as we are doing in apply_handle_commit_internal(). If that is not the intention then it is not clear to me how updating replorigin_session_origin_lsn after the commit is helpful.

> But I need to review the impact of not setting
> replorigin_session_origin_timestamp.
>

This may not have a direct impact on built-in replication as I think we don't rely on it yet, but we need to think of out-of-core solutions. I am not sure if I understood your proposal as per my previous comment, but once you clarify it, I'll also try to think about the same.

--
With Regards,
Amit Kapila.
On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > > But whether or not that's the case, the downstream should not request (and
> > > > hence receive) any changes that have been already applied (and
> > > > committed) downstream, as a principle. I think a way to achieve this is
> > > > to update replorigin_session_origin_lsn so that a sequence change
> > > > applied once is not requested (and hence sent) again.
> > > >
> > >
> > > I guess we could update the origin, per attached 0004. We don't have a
> > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > don't need that.
> > >
> > > The attached patch merges the earlier improvements, except for the part
> > > that experimented with adding a "fake" transaction (which turned out to
> > > have a number of difficult issues).
> >
> > 0004 looks good to me.
> >
>
> + {
>   CommitTransactionCommand();
> +
> + /*
> +  * Update origin state so we don't try applying this sequence
> +  * change in case of crash.
> +  *
> +  * XXX We don't have replorigin_session_origin_timestamp, but we
> +  * can just leave that set to 0.
> +  */
> + replorigin_session_origin_lsn = seq.lsn;
>
> IIUC, your proposal is to update replorigin_session_origin_lsn, so
> that after a restart, it doesn't use some prior origin LSN to start with,
> which can in turn lead the sequence to go backward. If so, it should
> be updated before calling CommitTransactionCommand() as we are doing
> in apply_handle_commit_internal(). If that is not the intention then
> it is not clear to me how updating replorigin_session_origin_lsn after
> the commit is helpful.
>

typedef struct ReplicationState
{
...
	/*
	 * Location of the latest commit from the remote side.
	 */
	XLogRecPtr	remote_lsn;

This is the variable that will be updated with the value of replorigin_session_origin_lsn. This means we will now track some arbitrary LSN location of the remote side in this variable. The above comment makes me wonder if there is anything we are missing, or if it is just a matter of updating this comment, because before the patch we always adhered to what is written in the comment.

--
With Regards,
Amit Kapila.
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me. But I need to review the impact of not setting
> replorigin_session_origin_timestamp.

I think it will be good to set replorigin_session_origin_timestamp = 0 explicitly, so as not to pick up a garbage value. The timestamp is written to the commit record; beyond that I don't see any use of it. It is further passed downstream if there is a cascaded logical replication setup, but I don't see it being used there either. So it should be fine to leave it 0.

I don't think we can use logically replicated sequences in a multi-master environment where the timestamp may be used to resolve conflicts. Such a setup would require distributed sequence management, which can not be achieved by logical replication alone.

In short, I didn't find any hazard in leaving replorigin_session_origin_timestamp as 0.

--
Best Wishes,
Ashutosh Bapat
On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > > >
> > > > > But whether or not that's the case, the downstream should not request (and
> > > > > hence receive) any changes that have been already applied (and
> > > > > committed) downstream, as a principle. I think a way to achieve this is
> > > > > to update replorigin_session_origin_lsn so that a sequence change
> > > > > applied once is not requested (and hence sent) again.
> > > > >
> > > >
> > > > I guess we could update the origin, per attached 0004. We don't have a
> > > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > > don't need that.
> > > >
> > > > The attached patch merges the earlier improvements, except for the part
> > > > that experimented with adding a "fake" transaction (which turned out to
> > > > have a number of difficult issues).
> > >
> > > 0004 looks good to me.
> >
> > + {
> >   CommitTransactionCommand();
> > +
> > + /*
> > +  * Update origin state so we don't try applying this sequence
> > +  * change in case of crash.
> > +  *
> > +  * XXX We don't have replorigin_session_origin_timestamp, but we
> > +  * can just leave that set to 0.
> > +  */
> > + replorigin_session_origin_lsn = seq.lsn;
> >
> > IIUC, your proposal is to update replorigin_session_origin_lsn, so
> > that after a restart, it doesn't use some prior origin LSN to start with,
> > which can in turn lead the sequence to go backward. If so, it should
> > be updated before calling CommitTransactionCommand() as we are doing
> > in apply_handle_commit_internal(). If that is not the intention then
> > it is not clear to me how updating replorigin_session_origin_lsn after
> > the commit is helpful.
>
> typedef struct ReplicationState
> {
> ...
> 	/*
> 	 * Location of the latest commit from the remote side.
> 	 */
> 	XLogRecPtr	remote_lsn;
>
> This is the variable that will be updated with the value of
> replorigin_session_origin_lsn. This means we will now track some
> arbitrary LSN location of the remote side in this variable. The above
> comment makes me wonder if there is anything we are missing, or if it
> is just a matter of updating this comment, because before the patch we
> always adhered to what is written in the comment.

I don't think we are missing anything. This value is used to track the remote LSN up to which all the commits from upstream have been applied locally. Since a non-transactional sequence change is like a single-WAL-record transaction, its LSN acts as the LSN of the mini-commit. So it should be fine to update remote_lsn with the sequence WAL record's end LSN. That's what the patches do. I don't see any hazard. But you are right, we need to update the comments - here and also at other places like replorigin_session_advance(), which uses remote_commit as the name of the argument that gets assigned to ReplicationState::remote_lsn.

--
Best Wishes,
Ashutosh Bapat
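To see what ReplicationState actually tracks, the subscriber's existing functions and views can be used; the origin name below is an example (origins created for subscriptions are named pg_<suboid>):

----
-- remote_lsn per origin, from the in-memory ReplicationState
SELECT external_id, remote_lsn, local_lsn FROM pg_replication_origin_status;

-- the same value for a single origin
SELECT pg_replication_origin_progress('pg_16403', false);
----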
On Wednesday, August 16, 2023 10:27 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

>
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we don't
> need that.
>
> The attached patch merges the earlier improvements, except for the part that
> experimented with adding a "fake" transaction (which turned out to have a
> number of difficult issues).

I tried to test the patch and found a crash when calling
pg_logical_slot_get_changes() to consume sequence changes.

Steps:
----
create table t1_seq(a int);
create sequence seq1;
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'include-xids', 'false', 'skip-empty-xacts', '1');
----

The backtrace is attached in bt.txt.

Best Regards,
Hou zj
Attachment
On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>

I was reading through 0001, and I noticed this comment in the
ReorderBufferSequenceIsTransactional() function

+ * To decide if a sequence change should be handled as transactional or applied
+ * immediately, we track (sequence) relfilenodes created by each transaction.
+ * We don't know if the current sub-transaction was already assigned to the
+ * top-level transaction, so we need to check all transactions.

It says "We don't know if the current sub-transaction was already
assigned to the top-level transaction, so we need to check all
transactions". But IIRC as part of the streaming of in-progress
transactions we have ensured that whenever we are logging the first
change by any subtransaction we include the top transaction ID in it.

Refer to this code:

LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
{
...
    /*
     * If the top-level xid is valid, we need to assign the subxact to the
     * top-level xact. We need to do this for all records, hence we do it
     * before the switch.
     */
    if (TransactionIdIsValid(txid))
    {
        ReorderBufferAssignChild(ctx->reorder,
                                 txid,
                                 XLogRecGetXid(record),
                                 buf.origptr);
    }
}

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
>
> I was reading through 0001, and I noticed this comment in the
> ReorderBufferSequenceIsTransactional() function
>
> + * To decide if a sequence change should be handled as transactional or applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
>
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
>
> Refer to this code:
>
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
> {
> ...
>     /*
>      * If the top-level xid is valid, we need to assign the subxact to the
>      * top-level xact. We need to do this for all records, hence we do it
>      * before the switch.
>      */
>     if (TransactionIdIsValid(txid))
>     {
>         ReorderBufferAssignChild(ctx->reorder,
>                                  txid,
>                                  XLogRecGetXid(record),
>                                  buf.origptr);
>     }
> }

Some more comments:

1.
ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid
are duplicated, except that the first one just confirms whether the
relfilelocator was created in the transaction and the other returns the
XID as well. I think these two could easily be merged so that we can
avoid the duplicate code (see the sketch below).

2.
/*
+ * ReorderBufferTransferSequencesToParent
+ *      Copy the relfilenode entries to the parent after assignment.
+ */
+static void
+ReorderBufferTransferSequencesToParent(ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       ReorderBufferTXN *subtxn)

If we agree with my comment in the previous email (i.e. the first WAL
record by a subxid will always include the topxid) then we do not need
this function at all - we can always add the relfilelocator directly to
the top transaction and never need to transfer.

That is all I have for now from the first pass of 0001; later I will do
a more detailed review and will look into the other patches also.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
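For illustration, a minimal sketch of what the merged helper from comment
(1) might look like - the signature is hypothetical; sequences_hash,
ReorderBufferSequenceEnt and the scan over toplevel_by_lsn follow the
structure of the patch under review, and this sketch assumes every
top-level transaction has its hash table allocated:

    /*
     * Hypothetical merged helper: do the relfilenode lookup once, report
     * whether the change is transactional, and return the XID of the
     * (sub)transaction that created the relfilenode via an out parameter.
     */
    static bool
    ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
                                         RelFileLocator rlocator,
                                         TransactionId *xid)
    {
        bool        found = false;
        dlist_iter  iter;

        *xid = InvalidTransactionId;

        /* entries from subxacts are copied to the top-level transaction */
        dlist_foreach(iter, &rb->toplevel_by_lsn)
        {
            ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node, iter.cur);
            ReorderBufferSequenceEnt *ent;

            ent = (ReorderBufferSequenceEnt *) hash_search(txn->sequences_hash,
                                                           &rlocator,
                                                           HASH_FIND, &found);
            if (found)
            {
                *xid = ent->xid;    /* XID of the (sub)xact that created it */
                break;
            }
        }

        return found;
    }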
On Friday, September 15, 2023 11:11 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, August 16, 2023 10:27 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> >
> > I guess we could update the origin, per attached 0004. We don't have
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the
> > part that experimented with adding a "fake" transaction (which turned
> > out to have a number of difficult issues).
>
> I tried to test the patch and found a crash when calling
> pg_logical_slot_get_changes() to consume sequence changes.

Oh, after checking again, I realize it's my fault - my build environment
was not clean. This case passed after rebuilding. Sorry for the noise.

Best Regards,
Hou zj
On 9/22/23 13:24, Dilip Kumar wrote:
> On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>
>> I was reading through 0001, and I noticed this comment in the
>> ReorderBufferSequenceIsTransactional() function
>>
>> + * To decide if a sequence change should be handled as transactional or applied
>> + * immediately, we track (sequence) relfilenodes created by each transaction.
>> + * We don't know if the current sub-transaction was already assigned to the
>> + * top-level transaction, so we need to check all transactions.
>>
>> It says "We don't know if the current sub-transaction was already
>> assigned to the top-level transaction, so we need to check all
>> transactions". But IIRC as part of the streaming of in-progress
>> transactions we have ensured that whenever we are logging the first
>> change by any subtransaction we include the top transaction ID in it.
>>
>> Refer to this code:
>>
>> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
>> {
>> ...
>>     /*
>>      * If the top-level xid is valid, we need to assign the subxact to the
>>      * top-level xact. We need to do this for all records, hence we do it
>>      * before the switch.
>>      */
>>     if (TransactionIdIsValid(txid))
>>     {
>>         ReorderBufferAssignChild(ctx->reorder,
>>                                  txid,
>>                                  XLogRecGetXid(record),
>>                                  buf.origptr);
>>     }
>> }
>
> Some more comments:
>
> 1.
> ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid
> are duplicated, except that the first one just confirms whether the
> relfilelocator was created in the transaction and the other returns
> the XID as well. I think these two could easily be merged so that we
> can avoid the duplicate code.
>

Right. The attached patch modifies the IsTransactional function to also
return the XID, and removes the GetXid one. It feels a bit weird because
now the IsTransactional function is called even in places where we know
the change is transactional. But it's true the two separate functions
duplicated a bit of code, ofc.

> 2.
> /*
> + * ReorderBufferTransferSequencesToParent
> + *      Copy the relfilenode entries to the parent after assignment.
> + */
> +static void
> +ReorderBufferTransferSequencesToParent(ReorderBuffer *rb,
> +                                       ReorderBufferTXN *txn,
> +                                       ReorderBufferTXN *subtxn)
>
> If we agree with my comment in the previous email (i.e. the first WAL
> record by a subxid will always include the topxid) then we do not need
> this function at all - we can always add the relfilelocator directly to
> the top transaction and never need to transfer.
>

Good point! I don't recall why I thought this was necessary. I suspect it
was before I added the GetCurrentTransactionId() calls to ensure the
subxact has a XID. I replaced the ReorderBufferTransferSequencesToParent
call with an assert that the relfilenode hash table is empty, and I've
been unable to trigger any failures.

> That is all I have for now from the first pass of 0001; later I will
> do a more detailed review and will look into the other patches also.
>

Thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 9/20/23 11:53, Dilip Kumar wrote:
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>
> I was reading through 0001, and I noticed this comment in the
> ReorderBufferSequenceIsTransactional() function
>
> + * To decide if a sequence change should be handled as transactional or applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
>
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
>

Yeah, that's a stale comment - the actual code only searches through the
top-level transactions (and thus relies on the immediate assignment). As
I wrote in the earlier response, I suspect this code originates from
before I added the GetCurrentTransactionId() calls.

That being said, I do wonder why, with the immediate assignments, we
still need the bit in ReorderBufferAssignChild that says:

    /*
     * We already saw this transaction, but initially added it to the
     * list of top-level txns. Now that we know it's not top-level,
     * remove it from there.
     */
    dlist_delete(&subtxn->node);

I don't think that affects this patch, but it's a bit confusing.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9/13/23 15:18, Ashutosh Bapat wrote:
> On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
>>> <ashutosh.bapat.oss@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>>>
>>>>>> But whether or not that's the case, downstream should not request (and
>>>>>> hence receive) any changes that have been already applied (and
>>>>>> committed) downstream as a principle. I think a way to achieve this is
>>>>>> to update the replorigin_session_origin_lsn so that a sequence change
>>>>>> applied once is not requested (and hence sent) again.
>>>>>>
>>>>>
>>>>> I guess we could update the origin, per attached 0004. We don't have
>>>>> timestamp to set replorigin_session_origin_timestamp, but it seems we
>>>>> don't need that.
>>>>>
>>>>> The attached patch merges the earlier improvements, except for the part
>>>>> that experimented with adding a "fake" transaction (which turned out to
>>>>> have a number of difficult issues).
>>>>
>>>> 0004 looks good to me.
>>>
>>> +    {
>>>          CommitTransactionCommand();
>>> +
>>> +        /*
>>> +         * Update origin state so we don't try applying this sequence
>>> +         * change in case of crash.
>>> +         *
>>> +         * XXX We don't have replorigin_session_origin_timestamp, but we
>>> +         * can just leave that set to 0.
>>> +         */
>>> +        replorigin_session_origin_lsn = seq.lsn;
>>>
>>> IIUC, your proposal is to update the replorigin_session_origin_lsn, so
>>> that after restart, it doesn't use some prior origin LSN to start with
>>> which can in turn lead the sequence to go backward. If so, it should
>>> be updated before calling CommitTransactionCommand() as we are doing
>>> in apply_handle_commit_internal(). If that is not the intention then
>>> it is not clear to me how updating replorigin_session_origin_lsn after
>>> commit is helpful.
>>>
>>
>> typedef struct ReplicationState
>> {
>> ...
>> /*
>>  * Location of the latest commit from the remote side.
>>  */
>> XLogRecPtr remote_lsn;
>>
>> This is the variable that will be updated with the value of
>> replorigin_session_origin_lsn. This means we will now track some
>> arbitrary LSN location of the remote side in this variable. The above
>> comment makes me wonder if there is anything we are missing or if it
>> is just a matter of updating this comment because before the patch we
>> always adhere to what is written in the comment.
>
> I don't think we are missing anything. This value is used to track the
> remote LSN up to which all the commits from upstream have been applied
> locally. Since a non-transactional sequence change is like a
> single-WAL-record transaction, its LSN acts as the LSN of the
> mini-commit. So it should be fine to update remote_lsn with the
> sequence WAL record's end LSN. That's what the patches do. I don't see
> any hazard. But you are right, we need to update the comments - here
> and also at other places like replorigin_session_advance(), which uses
> remote_commit as the name of the argument that gets assigned to
> ReplicationState::remote_lsn.
>

I agree - updating the replorigin_session_origin_lsn shouldn't break
anything. As you write, it's essentially a "mini-commit" and the commit
order remains the same.

I'm not sure about resetting replorigin_session_origin_timestamp to 0,
though. It's not something we rely on very much (it may not correlate
with the commit order etc.). But why should we set it to 0? We don't do
that for regular commits, right? And IMO it makes sense to just use the
timestamp of the last commit before the sequence change.

FWIW I've left this in a separate commit, but I'll merge that into 0002
in the next patch version.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/25/23 12:20, Amit Kapila wrote:
> ...
>
> I have used the debugger to reproduce this as it needs quite some
> coordination. I just wanted to see if the sequence can go backward and
> didn't catch up completely before the sequence state is marked
> 'ready'. On the publisher side, I created a publication with a table
> and a sequence. Then did the following steps:
>
> SELECT nextval('s') FROM generate_series(1,50);
> insert into t1 values(1);
> SELECT nextval('s') FROM generate_series(51,150);
>
> Then on the subscriber side with some debugging aid, I could find the
> values in the sequence shown in the previous email. Sorry, I haven't
> recorded each and every step but, if you think it helps, I can again
> try to reproduce it and share the steps.
>

Amit, can you try to reproduce this backwards movement with the latest
version of the patch? I have tried triggering that (mis)behavior, but I
haven't been successful so far. I'm hesitant to declare it resolved, as
it's dependent on timing etc. and you mentioned it required quite some
coordination.

Thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 12, 2023 at 9:03 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/25/23 12:20, Amit Kapila wrote:
> > ...
> >
> > I have used the debugger to reproduce this as it needs quite some
> > coordination. I just wanted to see if the sequence can go backward and
> > didn't catch up completely before the sequence state is marked
> > 'ready'. On the publisher side, I created a publication with a table
> > and a sequence. Then did the following steps:
> >
> > SELECT nextval('s') FROM generate_series(1,50);
> > insert into t1 values(1);
> > SELECT nextval('s') FROM generate_series(51,150);
> >
> > Then on the subscriber side with some debugging aid, I could find the
> > values in the sequence shown in the previous email. Sorry, I haven't
> > recorded each and every step but, if you think it helps, I can again
> > try to reproduce it and share the steps.
> >
>
> Amit, can you try to reproduce this backwards movement with the latest
> version of the patch?
>

I lost touch with this patch, but IIRC the quoted problem per se
shouldn't occur after the idea to use the page LSN instead of the slot's
LSN for synchronization between the sync and apply workers.

--
With Regards,
Amit Kapila.
On Thursday, October 12, 2023 11:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>

Hi,

I have been reviewing the patch set, and here are some initial comments.

1.

I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for a
transactional sequence change in ReorderBufferQueueChange().

2.

ReorderBufferSequenceIsTransactional

It seems we call the above function once in sequence_decode() and call
it again in ReorderBufferQueueSequence(); would it be better to avoid
the second call, as the hashtable search looks not cheap?

3.

The patch cleans up the sequence hash table when a transaction COMMITs
or ABORTs (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while
it doesn't seem to destroy the hash table when the transaction is
PREPAREd. It's not a big problem, but would it be better to release the
memory earlier by destroying the table on prepare?

4.

+pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
...
+   /* output BEGIN if we haven't yet, but only for the transactional case */
+   if (transactional)
+   {
+       if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+       {
+           pg_output_begin(ctx, data, txn, false);
+       }
+       txndata->xact_wrote_changes = true;
+   }

I think we should call pg_output_stream_start() instead of
pg_output_begin() for streaming sequence changes (see the sketch below).

5.

+   /*
+    * Schema should be sent using the original relation because it
+    * also sends the ancestor's relation.
+    */
+   maybe_send_schema(ctx, txn, relation, relentry);

The comment seems a bit misleading here; I think it was meant for the
partition logic in pgoutput_change().

Best Regards,
Hou zj
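For comment (4), the suggested fix is a one-line change; a minimal sketch
using the names from the quoted hunk (pg_output_stream_start() is the
existing helper the other stream callbacks in test_decoding use):

    /* In pg_decode_stream_sequence(), for the transactional case: */
    if (transactional)
    {
        if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
            pg_output_stream_start(ctx, data, txn, false);

        txndata->xact_wrote_changes = true;
    }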
Hi!

On 10/24/23 13:31, Zhijie Hou (Fujitsu) wrote:
> On Thursday, October 12, 2023 11:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>
> Hi,
>
> I have been reviewing the patch set, and here are some initial comments.
>
> 1.
>
> I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for a
> transactional sequence change in ReorderBufferQueueChange().
>

True. It's unlikely for a transaction to only have sequence increments
and be large enough to get streamed, and any other changes would set
this flag anyway. But it's certainly more correct to set the flag even
for sequence changes. The updated patch modifies
ReorderBufferQueueChange to do this.

> 2.
>
> ReorderBufferSequenceIsTransactional
>
> It seems we call the above function once in sequence_decode() and call
> it again in ReorderBufferQueueSequence(); would it be better to avoid
> the second call, as the hashtable search looks not cheap?
>

In principle yes, but I don't think it's worth it - I doubt the overhead
is going to be measurable. Based on earlier reviews I tried to reduce
the code duplication (there used to be two separate functions doing the
lookup), and I did consider doing just one call in sequence_decode() and
passing the XID to ReorderBufferQueueSequence() - determining the XID is
the only purpose of the call there. But it didn't seem nice/worth it.

> 3.
>
> The patch cleans up the sequence hash table when a transaction COMMITs
> or ABORTs (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while
> it doesn't seem to destroy the hash table when the transaction is
> PREPAREd. It's not a big problem, but would it be better to release the
> memory earlier by destroying the table on prepare?
>

I think you're right. I added the sequence cleanup to a couple of
places, right before the cleanup of the transaction. I wonder if we
should simply call ReorderBufferSequenceCleanup() from
ReorderBufferCleanupTXN().

> 4.
>
> +pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> ...
> +   /* output BEGIN if we haven't yet, but only for the transactional case */
> +   if (transactional)
> +   {
> +       if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
> +       {
> +           pg_output_begin(ctx, data, txn, false);
> +       }
> +       txndata->xact_wrote_changes = true;
> +   }
>
> I think we should call pg_output_stream_start() instead of
> pg_output_begin() for streaming sequence changes.
>

Good catch! Fixed.

> 5.
> +   /*
> +    * Schema should be sent using the original relation because it
> +    * also sends the ancestor's relation.
> +    */
> +   maybe_send_schema(ctx, txn, relation, relentry);
>
> The comment seems a bit misleading here; I think it was meant for the
> partition logic in pgoutput_change().
>

True. I've removed the comment.

Attached is an updated patch, with all those tweaks/fixes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

I've been cleaning up the first two patches to get them committed soon
(adding the decoding infrastructure + test_decoding), cleaning up stale
comments, updating commit messages etc. And I think it's ready to go,
but it's too late here, so I plan to go over it once more tomorrow and
then likely push. But if someone wants to take a look, I'd welcome that.

The one issue I found during this cleanup is that the patch was missing
the changes introduced by 29d0a77fa660 for decoding of other stuff.

commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
Author: Amit Kapila <akapila@postgresql.org>
Date:   Thu Oct 26 06:54:16 2023 +0530

    Migrate logical slots to the new node during an upgrade.
    ...

I fixed that, but perhaps someone might want to double check ...

0003 is here just for completeness - that's the part adding sequences to
built-in replication. I haven't done much with it, it needs some cleanup
too to get it committable. I don't intend to push that right after
0001+0002, though.

While going over 0001, I realized there might be an optimization for
ReorderBufferSequenceIsTransactional. As coded in 0001, it always
searches through all top-level transactions, and if there are many of
them that might be expensive, even if very few of them have any
relfilenodes in the hash table. It's still a linear search, and it
needs to happen for each sequence change.

But can the relfilenode even be in some other top-level transaction? How
could it be - our transaction would not see it, and wouldn't be able to
generate the sequence change. So we should be able to simply check *our*
transaction (or if it's a subxact, the top-level transaction). Either
it's there (and it's a transactional change), or not (and then it's a
non-transactional change).

The 0004 does this.

This of course hinges on when exactly the transactions get created, and
assignments processed. For example, if this fired before the txn gets
assigned to the top-level one, this would break. I don't think this can
happen thanks to the immediate logging of assignments, but I'm too
tired to think about it now.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
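The optimization sketched above might look roughly like this - a
hypothetical restatement of the 0004 idea, reusing names from the patch,
and assuming every top-level transaction has its sequences_hash allocated:

    /*
     * Sketch: instead of scanning all top-level transactions, look only
     * at the hash table of the transaction the change belongs to (or its
     * top-level parent, since subxact entries are copied there).
     */
    static bool
    ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
                                         TransactionId xid,
                                         RelFileLocator rlocator)
    {
        bool        found = false;
        ReorderBufferTXN *txn;

        txn = ReorderBufferTXNByXid(rb, xid, false, NULL,
                                    InvalidXLogRecPtr, false);

        /* unknown transaction => it cannot have created the relfilenode */
        if (txn == NULL)
            return false;

        /* for subxacts, the entries live in the top-level transaction */
        if (txn->toptxn)
            txn = txn->toptxn;

        (void) hash_search(txn->sequences_hash, &rlocator, HASH_FIND, &found);

        return found;
    }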
On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I've been cleaning up the first two patches to get them committed soon
> (adding the decoding infrastructure + test_decoding), cleaning up stale
> comments, updating commit messages etc. And I think it's ready to go,
> but it's too late here, so I plan to go over it once more tomorrow and
> then likely push. But if someone wants to take a look, I'd welcome that.
>
> The one issue I found during this cleanup is that the patch was missing
> the changes introduced by 29d0a77fa660 for decoding of other stuff.
>
> commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
> Author: Amit Kapila <akapila@postgresql.org>
> Date:   Thu Oct 26 06:54:16 2023 +0530
>
>     Migrate logical slots to the new node during an upgrade.
>     ...
>
> I fixed that, but perhaps someone might want to double check ...
>
> 0003 is here just for completeness - that's the part adding sequences to
> built-in replication. I haven't done much with it, it needs some cleanup
> too to get it committable. I don't intend to push that right after
> 0001+0002, though.
>
> While going over 0001, I realized there might be an optimization for
> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> searches through all top-level transactions, and if there are many of
> them that might be expensive, even if very few of them have any
> relfilenodes in the hash table. It's still a linear search, and it
> needs to happen for each sequence change.
>
> But can the relfilenode even be in some other top-level transaction? How
> could it be - our transaction would not see it, and wouldn't be able to
> generate the sequence change. So we should be able to simply check *our*
> transaction (or if it's a subxact, the top-level transaction). Either
> it's there (and it's a transactional change), or not (and then it's a
> non-transactional change).
>

I also think the relfilenode should be part of either the current
top-level xact or one of its subxacts, so looking at all the top-level
transactions for each change doesn't seem advisable.

> The 0004 does this.
>
> This of course hinges on when exactly the transactions get created, and
> assignments processed. For example, if this fired before the txn gets
> assigned to the top-level one, this would break. I don't think this can
> happen thanks to the immediate logging of assignments, but I'm too
> tired to think about it now.
>

This needs some thought because I think we can't guarantee the
association till we reach the point where we can actually decode the
xact. See the comments in AssertTXNLsnOrder() [1].

I noticed a few minor comments while reading the patch:

1.
+ * turned on here because the non-transactional logical message is
+ * decoded without waiting for these records.

Instead of '.. logical message', shouldn't we say sequence change message?

2.
+       /*
+        * If we found an entry with matchine relfilenode,

typo (matchine)

3.
+   Note that this may not the value obtained by the process updating the
+   process, but the future sequence value written to WAL (typically about
+   32 values ahead).

/may not the value/may not be the value

[1] -
/*
 * Skip the verification if we don't reach the LSN at which we start
 * decoding the contents of transactions yet because until we reach the
 * LSN, we could have transactions that don't have the association between
 * the top-level transaction and subtransaction yet and consequently have
 * the same LSN. We don't guarantee this association until we try to
 * decode the actual contents of transaction.

--
With Regards,
Amit Kapila.
On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > While going over 0001, I realized there might be an optimization for
> > ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> > searches through all top-level transactions, and if there are many of
> > them that might be expensive, even if very few of them have any
> > relfilenodes in the hash table. It's still a linear search, and it
> > needs to happen for each sequence change.
> >
> > But can the relfilenode even be in some other top-level transaction? How
> > could it be - our transaction would not see it, and wouldn't be able to
> > generate the sequence change. So we should be able to simply check *our*
> > transaction (or if it's a subxact, the top-level transaction). Either
> > it's there (and it's a transactional change), or not (and then it's a
> > non-transactional change).
> >
>
> I also think the relfilenode should be part of either the current
> top-level xact or one of its subxacts, so looking at all the top-level
> transactions for each change doesn't seem advisable.
>
> > The 0004 does this.
> >
> > This of course hinges on when exactly the transactions get created, and
> > assignments processed. For example, if this fired before the txn gets
> > assigned to the top-level one, this would break. I don't think this can
> > happen thanks to the immediate logging of assignments, but I'm too
> > tired to think about it now.
> >
>
> This needs some thought because I think we can't guarantee the
> association till we reach the point where we can actually decode the
> xact. See the comments in AssertTXNLsnOrder() [1].

Instead of building the infrastructure to know whether a particular
change is transactional on the decoding side, can't we have some flag in
the WAL record noting whether the change is transactional or not? I have
discussed this point with my colleague Kuroda-San and we thought that it
may be worth exploring whether we can use
rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine if
the sequence was created/changed in the current (sub)transaction and
then record that in the WAL record. This means we'd need additional
information in WAL records like XLOG_SEQ_LOG, but we can probably do
that only with wal_level=logical.

One minor point:

It'd also
+ * trigger assert in DecodeSequence.

I don't see DecodeSequence() in the patch. Which exact assert/function
are you referring to here?

--
With Regards,
Amit Kapila.
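To make the write-side idea concrete, a minimal sketch of the check being
proposed (seqrel stands for the sequence relation open at WAL-insert time
in sequence.c; the is_transactional field on xl_seq_rec is hypothetical):

    /*
     * Hypothetical write-side check, as floated above: the change is
     * transactional if the sequence (or its current relfilenumber) was
     * created in the running (sub)transaction - which is exactly what
     * the relcache already tracks for us.
     */
    xlrec.is_transactional =                    /* hypothetical new field */
        (seqrel->rd_createSubid != InvalidSubTransactionId ||
         seqrel->rd_newRelfilelocatorSubid != InvalidSubTransactionId);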
On 11/27/23 11:13, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> While going over 0001, I realized there might be an optimization for
>>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
>>> searches through all top-level transactions, and if there are many of
>>> them that might be expensive, even if very few of them have any
>>> relfilenodes in the hash table. It's still a linear search, and it
>>> needs to happen for each sequence change.
>>>
>>> But can the relfilenode even be in some other top-level transaction? How
>>> could it be - our transaction would not see it, and wouldn't be able to
>>> generate the sequence change. So we should be able to simply check *our*
>>> transaction (or if it's a subxact, the top-level transaction). Either
>>> it's there (and it's a transactional change), or not (and then it's a
>>> non-transactional change).
>>>
>>
>> I also think the relfilenode should be part of either the current
>> top-level xact or one of its subxacts, so looking at all the top-level
>> transactions for each change doesn't seem advisable.
>>
>>> The 0004 does this.
>>>
>>> This of course hinges on when exactly the transactions get created, and
>>> assignments processed. For example, if this fired before the txn gets
>>> assigned to the top-level one, this would break. I don't think this can
>>> happen thanks to the immediate logging of assignments, but I'm too
>>> tired to think about it now.
>>>
>>
>> This needs some thought because I think we can't guarantee the
>> association till we reach the point where we can actually decode the
>> xact. See the comments in AssertTXNLsnOrder() [1].
>>

I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
which says:

    /*
     * Skip the verification if we don't reach the LSN at which we start
     * decoding the contents of transactions yet because until we reach
     * the LSN, we could have transactions that don't have the association
     * between the top-level transaction and subtransaction yet and
     * consequently have the same LSN. We don't guarantee this
     * association until we try to decode the actual contents of
     * transaction. The ordering of the records prior to the
     * start_decoding_at LSN should have been checked before the restart.
     */

But doesn't this say that after we actually start decoding / stop
skipping, we should have seen the assignment? We're already decoding
transaction contents (because the sequence change *is* part of the xact,
even if we decide to replay it in the non-transactional way).

>
> Instead of building the infrastructure to know whether a particular
> change is transactional on the decoding side, can't we have some flag
> in the WAL record noting whether the change is transactional or not?
> I have discussed this point with my colleague Kuroda-San and we
> thought that it may be worth exploring whether we can use
> rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
> if the sequence was created/changed in the current (sub)transaction
> and then record that in the WAL record. This means we'd need
> additional information in WAL records like XLOG_SEQ_LOG, but we can
> probably do that only with wal_level=logical.
>

I may not understand the proposal exactly, but it's not enough to know
if it was created in the same subxact. It might have been created in
some earlier subxact in the same top-level xact.

FWIW I think one of the earlier patch versions did something like this,
by adding a "created" flag in the xlog record. And we concluded doing
this on the decoding side is a better solution.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/27/23 11:13, Amit Kapila wrote:
> > On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
> >> <tomas.vondra@enterprisedb.com> wrote:
> >>>
> >>> While going over 0001, I realized there might be an optimization for
> >>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> >>> searches through all top-level transactions, and if there are many of
> >>> them that might be expensive, even if very few of them have any
> >>> relfilenodes in the hash table. It's still a linear search, and it
> >>> needs to happen for each sequence change.
> >>>
> >>> But can the relfilenode even be in some other top-level transaction? How
> >>> could it be - our transaction would not see it, and wouldn't be able to
> >>> generate the sequence change. So we should be able to simply check *our*
> >>> transaction (or if it's a subxact, the top-level transaction). Either
> >>> it's there (and it's a transactional change), or not (and then it's a
> >>> non-transactional change).
> >>>
> >>
> >> I also think the relfilenode should be part of either the current
> >> top-level xact or one of its subxacts, so looking at all the top-level
> >> transactions for each change doesn't seem advisable.
> >>
> >>> The 0004 does this.
> >>>
> >>> This of course hinges on when exactly the transactions get created, and
> >>> assignments processed. For example, if this fired before the txn gets
> >>> assigned to the top-level one, this would break. I don't think this can
> >>> happen thanks to the immediate logging of assignments, but I'm too
> >>> tired to think about it now.
> >>>
> >>
> >> This needs some thought because I think we can't guarantee the
> >> association till we reach the point where we can actually decode the
> >> xact. See the comments in AssertTXNLsnOrder() [1].
> >>
>
> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
> which says:
>
>     /*
>      * Skip the verification if we don't reach the LSN at which we start
>      * decoding the contents of transactions yet because until we reach
>      * the LSN, we could have transactions that don't have the association
>      * between the top-level transaction and subtransaction yet and
>      * consequently have the same LSN. We don't guarantee this
>      * association until we try to decode the actual contents of
>      * transaction. The ordering of the records prior to the
>      * start_decoding_at LSN should have been checked before the restart.
>      */
>
> But doesn't this say that after we actually start decoding / stop
> skipping, we should have seen the assignment? We're already decoding
> transaction contents (because the sequence change *is* part of the xact,
> even if we decide to replay it in the non-transactional way).
>

It means to say that the assignment is decided after the
start_decoding_at point. We haven't decided that we are past
start_decoding_at by the time the patch is computing the transactional
flag.

> >
> > Instead of building the infrastructure to know whether a particular
> > change is transactional on the decoding side, can't we have some flag
> > in the WAL record noting whether the change is transactional or not?
> > I have discussed this point with my colleague Kuroda-San and we
> > thought that it may be worth exploring whether we can use
> > rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
> > if the sequence was created/changed in the current (sub)transaction
> > and then record that in the WAL record. This means we'd need
> > additional information in WAL records like XLOG_SEQ_LOG, but we can
> > probably do that only with wal_level=logical.
> >
>
> I may not understand the proposal exactly, but it's not enough to know
> if it was created in the same subxact. It might have been created in
> some earlier subxact in the same top-level xact.
>

We should be able to detect even some earlier subxact or top-level xact
based on rd_createSubid/rd_newRelfilelocatorSubid.

> FWIW I think one of the earlier patch versions did something like this,
> by adding a "created" flag in the xlog record. And we concluded doing
> this on the decoding side is a better solution.
>

Oh, I thought it would be much simpler than what we are doing on the
decoding side. Can you please point me to the email discussion where
this was concluded, or share the reason?

--
With Regards,
Amit Kapila.
On Mon, Nov 27, 2023 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > FWIW I think one of the earlier patch versions did something like this,
> > by adding a "created" flag in the xlog record. And we concluded doing
> > this on the decoding side is a better solution.
> >
>
> Oh, I thought it would be much simpler than what we are doing on the
> decoding side. Can you please point me to the email discussion where
> this was concluded, or share the reason?
>

I'll check the thread about this point by myself as well, but if by
chance you remember it then kindly share it.

--
With Regards,
Amit Kapila.
Dear Amit, Tomas,

> > >
> > > Instead of building the infrastructure to know whether a particular
> > > change is transactional on the decoding side, can't we have some flag
> > > in the WAL record noting whether the change is transactional or not?
> > > I have discussed this point with my colleague Kuroda-San and we
> > > thought that it may be worth exploring whether we can use
> > > rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
> > > if the sequence was created/changed in the current (sub)transaction
> > > and then record that in the WAL record. This means we'd need
> > > additional information in WAL records like XLOG_SEQ_LOG, but we can
> > > probably do that only with wal_level=logical.
> > >
> >
> > I may not understand the proposal exactly, but it's not enough to know
> > if it was created in the same subxact. It might have been created in
> > some earlier subxact in the same top-level xact.
> >
>
> We should be able to detect even some earlier subxact or top-level
> xact based on rd_createSubid/rd_newRelfilelocatorSubid.

Here is a small PoC patchset to help your understanding. Please see the
attached files.

0001, 0002 were not changed, and 0004 was renumbered to 0003. (For now,
I focused only on test_decoding, because it is only for evaluation
purposes.)

0004 is what we really wanted to say. An is_transactional flag is added
to the WAL record, and it stores whether the operation is transactional.
In order to distinguish the status, rd_createSubid and
rd_newRelfilelocatorSubid are used. According to the comments, they have
a valid value only when the relation was changed within the current
transaction. Also, sequences_hash is not needed anymore, so it and the
related functions were removed.

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On 11/27/23 12:11, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/27/23 11:13, Amit Kapila wrote:
>>> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> While going over 0001, I realized there might be an optimization for
>>>>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
>>>>> searches through all top-level transactions, and if there are many of
>>>>> them that might be expensive, even if very few of them have any
>>>>> relfilenodes in the hash table. It's still a linear search, and it
>>>>> needs to happen for each sequence change.
>>>>>
>>>>> But can the relfilenode even be in some other top-level transaction? How
>>>>> could it be - our transaction would not see it, and wouldn't be able to
>>>>> generate the sequence change. So we should be able to simply check *our*
>>>>> transaction (or if it's a subxact, the top-level transaction). Either
>>>>> it's there (and it's a transactional change), or not (and then it's a
>>>>> non-transactional change).
>>>>>
>>>>
>>>> I also think the relfilenode should be part of either the current
>>>> top-level xact or one of its subxacts, so looking at all the top-level
>>>> transactions for each change doesn't seem advisable.
>>>>
>>>>> The 0004 does this.
>>>>>
>>>>> This of course hinges on when exactly the transactions get created, and
>>>>> assignments processed. For example, if this fired before the txn gets
>>>>> assigned to the top-level one, this would break. I don't think this can
>>>>> happen thanks to the immediate logging of assignments, but I'm too
>>>>> tired to think about it now.
>>>>>
>>>>
>>>> This needs some thought because I think we can't guarantee the
>>>> association till we reach the point where we can actually decode the
>>>> xact. See the comments in AssertTXNLsnOrder() [1].
>>>>
>>
>> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
>> which says:
>>
>>     /*
>>      * Skip the verification if we don't reach the LSN at which we start
>>      * decoding the contents of transactions yet because until we reach
>>      * the LSN, we could have transactions that don't have the association
>>      * between the top-level transaction and subtransaction yet and
>>      * consequently have the same LSN. We don't guarantee this
>>      * association until we try to decode the actual contents of
>>      * transaction. The ordering of the records prior to the
>>      * start_decoding_at LSN should have been checked before the restart.
>>      */
>>
>> But doesn't this say that after we actually start decoding / stop
>> skipping, we should have seen the assignment? We're already decoding
>> transaction contents (because the sequence change *is* part of the xact,
>> even if we decide to replay it in the non-transactional way).
>>
>
> It means to say that the assignment is decided after the
> start_decoding_at point. We haven't decided that we are past
> start_decoding_at by the time the patch is computing the transactional
> flag.
>

Ah, I see. We're deciding if the change is transactional before calling
SnapBuildXactNeedsSkip. That's a bit unfortunate.

>>>
>>> Instead of building the infrastructure to know whether a particular
>>> change is transactional on the decoding side, can't we have some flag
>>> in the WAL record noting whether the change is transactional or not?
>>> I have discussed this point with my colleague Kuroda-San and we
>>> thought that it may be worth exploring whether we can use
>>> rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
>>> if the sequence was created/changed in the current (sub)transaction
>>> and then record that in the WAL record. This means we'd need
>>> additional information in WAL records like XLOG_SEQ_LOG, but we can
>>> probably do that only with wal_level=logical.
>>>
>>
>> I may not understand the proposal exactly, but it's not enough to know
>> if it was created in the same subxact. It might have been created in
>> some earlier subxact in the same top-level xact.
>>
>
> We should be able to detect even some earlier subxact or top-level
> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
>

Interesting. I admit I haven't considered using these fields before, so
I need to familiarize myself with them a bit, and try whether it'd work.

>> FWIW I think one of the earlier patch versions did something like this,
>> by adding a "created" flag in the xlog record. And we concluded doing
>> this on the decoding side is a better solution.
>>
>
> Oh, I thought it would be much simpler than what we are doing on the
> decoding side. Can you please point me to the email discussion where
> this was concluded, or share the reason?
>

I think the discussion started around [1], and then continued in a bunch
of following messages (search for "relfilenode").

regards

[1] https://www.postgresql.org/message-id/CAExHW5v_vVqkhF4ehST9EzpX1L3bemD1S%2BkTk_-ZVu_ir-nKDw%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/27/23 13:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Amit, Tomas,
>
>>>>
>>>> Instead of building the infrastructure to know whether a particular
>>>> change is transactional on the decoding side, can't we have some flag
>>>> in the WAL record noting whether the change is transactional or not?
>>>> I have discussed this point with my colleague Kuroda-San and we
>>>> thought that it may be worth exploring whether we can use
>>>> rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine
>>>> if the sequence was created/changed in the current (sub)transaction
>>>> and then record that in the WAL record. This means we'd need
>>>> additional information in WAL records like XLOG_SEQ_LOG, but we can
>>>> probably do that only with wal_level=logical.
>>>>
>>>
>>> I may not understand the proposal exactly, but it's not enough to know
>>> if it was created in the same subxact. It might have been created in
>>> some earlier subxact in the same top-level xact.
>>>
>>
>> We should be able to detect even some earlier subxact or top-level
>> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
>
> Here is a small PoC patchset to help your understanding. Please see the
> attached files.
>
> 0001, 0002 were not changed, and 0004 was renumbered to 0003. (For now,
> I focused only on test_decoding, because it is only for evaluation
> purposes.)
>
> 0004 is what we really wanted to say. An is_transactional flag is added
> to the WAL record, and it stores whether the operation is transactional.
> In order to distinguish the status, rd_createSubid and
> rd_newRelfilelocatorSubid are used. According to the comments, they have
> a valid value only when the relation was changed within the current
> transaction. Also, sequences_hash is not needed anymore, so it and the
> related functions were removed.
>
> What do you think?
>

I think it's a very nice idea, assuming it maintains the current
behavior. It makes a lot of code unnecessary, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I spent a bit of time looking at the proposed change, and unfortunately
logging just the boolean flag does not work. A good example is this bit
from a TAP test added by the patch for built-in replication (which was
not included with the WIP patch):

BEGIN;
ALTER SEQUENCE s RESTART WITH 1000;
SAVEPOINT sp1;
INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
ROLLBACK TO sp1;
COMMIT;

This is expected to produce:

1131|0|t

but produces

1000|0|f

instead. The reason is very simple - as implemented, the patch simply
checks if the relfilenode is from the same top-level transaction, which
it is, and sets the flag to "true". So we know the sequence changes need
to be queued and replayed as part of this transaction.

But then during decoding, we still queue the changes into the subxact,
which then aborts, and the changes are discarded. That is not how it's
supposed to work, because the new relfilenode is still valid, someone
might do nextval() and commit. And the nextval() may not get WAL-logged,
so we'd lose this.

What I guess we might do is log not just a boolean flag, but the XID of
the subtransaction that created the relfilenode. And then during
decoding we'd queue the changes into this subtransaction ...

0006 in the attached patch series does this, and it seems to fix the TAP
test failure. I left it at the end, to make it easier to run tests
without the patch applied.

There's a couple of open questions, though:

- I'm not sure it's a good idea to log XIDs of subxacts into WAL like
this. I think it'd be OK, and there are other records that do that (like
RunningXacts or the commit record), but maybe I'm missing something.

- We need the actual XID, not just the SubTransactionId. I wrote
SubTransactionGetXid() to do this, but I have not worked with subxacts
much, so it'd be better if someone checked it's dealing with XID and
FullTransactionId correctly.

- I'm a bit concerned how this will perform with deeply nested
subtransactions. SubTransactionGetXid() does pretty much a linear
search, which might be somewhat expensive. And it's a cost put on
everyone who writes WAL, not just the decoding process. Maybe we should
at least limit this to wal_level=logical?

- seq_decode() then uses this XID (for transactional changes) instead of
the XID logged in the record itself. I think that's fine - it's the TXN
where we want to queue the change, after all, right?

- (unrelated) I also noticed that maybe ReorderBufferQueueSequence()
should always expect a valid XID. The code seems to suggest people can
pass InvalidTransactionId in the non-transactional case, but that's not
true because the rb->sequence() callback then fails.

The attached patches should also fix all the typos reported by Amit
earlier today.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- v20231127-3-0001-Logical-decoding-of-sequences.patch
- v20231127-3-0002-tweak-ReorderBufferSequenceIsTransaction.patch
- v20231127-3-0003-WIP-add-is_transactional-attribute-in-xl.patch
- v20231127-3-0004-Add-decoding-of-sequences-to-test_decodi.patch
- v20231127-3-0005-Add-decoding-of-sequences-to-built-in-re.patch
- v20231127-3-0006-log-XID-instead-of-a-boolean-flag.patch
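For readers following along, this is roughly what a SubTransactionGetXid()
as described above has to do - a hypothetical reconstruction (not the code
from the 0006 patch), written as if it lived in xact.c where
CurrentTransactionState and TransactionStateData are visible:

    /*
     * Hypothetical sketch: walk the stack of active (sub)transaction
     * states looking for the given SubTransactionId, and return the
     * TransactionId assigned to it. With deeply nested subxacts this
     * walk is linear, which is the cost concern raised above.
     */
    TransactionId
    SubTransactionGetXid(SubTransactionId subxid)
    {
        TransactionState s;

        for (s = CurrentTransactionState; s != NULL; s = s->parent)
        {
            if (s->subTransactionId == subxid)
                return XidFromFullTransactionId(s->fullTransactionId);
        }

        elog(ERROR, "subtransaction %u is not active", subxid);
        return InvalidTransactionId;    /* keep compiler quiet */
    }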
FWIW, here are some more minor review comments for v20231127-3-0001.

======
doc/src/sgml/logicaldecoding.sgml

1.
+     The <parameter>txn</parameter> parameter contains meta information about
+     the transaction the sequence change is part of. Note however that for
+     non-transactional updates, the transaction may be NULL, depending on
+     if the transaction already has an XID assigned.
+     The <parameter>sequence_lsn</parameter> has the WAL location of the
+     sequence update. <parameter>transactional</parameter> says if the
+     sequence has to be replayed as part of the transaction or directly.

/says if/specifies whether/

======
src/backend/commands/sequence.c

2. DecodeSeqTuple

+   memcpy(((char *) tuple->tuple.t_data),
+          data + sizeof(xl_seq_rec),
+          SizeofHeapTupleHeader);
+
+   memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
+          data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
+          datalen);

Maybe I am misreading, but isn't this just copying 2 contiguous pieces
of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen)
achieve the same?

======
.../replication/logical/reorderbuffer.c

3.
+ * To decide if a sequence change is transactional, we maintain a hash
+ * table of relfilenodes created in each (sub)transactions, along with
+ * the XID of the (sub)transaction that created the relfilenode. The
+ * entries from substransactions are copied to the top-level transaction
+ * to make checks cheaper. The hash table gets cleaned up when the
+ * transaction completes (commit/abort).

/substransactions/subtransactions/

~~~

4.
+ * A naive approach would be to just loop through all transactions and check
+ * each of them, but there may be (easily thousands) of subtransactions, and
+ * the check happens for each sequence change. So this could be very costly.

/may be (easily thousands) of/may be (easily thousands of)/

~~~

5. ReorderBufferSequenceCleanup

+   while ((ent = (ReorderBufferSequenceEnt *) hash_seq_search(&scan_status)) != NULL)
+   {
+       (void) hash_search(txn->toptxn->sequences_hash,
+                          (void *) &ent->rlocator,
+                          HASH_REMOVE, NULL);
+   }

Typically, other HASH_REMOVE code I have seen checks the result for NULL
to give elog(ERROR, "hash table corrupted");

~~~

6. ReorderBufferQueueSequence

+   if (xid != InvalidTransactionId)
+       txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

How about using the macro: TransactionIdIsValid

~~~

7. ReorderBufferQueueSequence

+   if (reloid == InvalidOid)
+       elog(ERROR, "could not map filenode \"%s\" to relation OID",
+            relpathperm(rlocator,
+                        MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

8.
+   /*
+    * Calculate the first value of the next batch (at which point we
+    * generate and decode another WAL record.
+    */

Missing ')'

~~~

9. ReorderBufferAddRelFileLocator

+   /*
+    * We only care about sequence relfilenodes for now, and those always have
+    * a XID. So if there's no XID, don't bother adding them to the hash.
+    */
+   if (xid == InvalidTransactionId)
+       return;

How about using the macro: TransactionIdIsValid

~~~

10. ReorderBufferProcessTXN

+   if (reloid == InvalidOid)
+       elog(ERROR, "could not map filenode \"%s\" to relation OID",
+            relpathperm(change->data.sequence.locator,
+                        MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

11. ReorderBufferChangeSize

+   if (tup)
+   {
+       sz += sizeof(HeapTupleData);
+       len = tup->tuple.t_len;
+       sz += len;
+   }

Why is the 'sz' increment split into 2 parts?

======
Kind Regards,
Peter Smith.
Fujitsu Australia
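Regarding review comment 2, the suggested simplification would look
roughly like this (a sketch using the names from the quoted hunk - the
header and payload are contiguous on both the source and destination
side, so one copy suffices):

    /* copy the tuple header and data in one go - they are adjacent */
    memcpy((char *) tuple->tuple.t_data,
           data + sizeof(xl_seq_rec),
           SizeofHeapTupleHeader + datalen);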
On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I spent a bit of time looking at the proposed change, and unfortunately
> logging just the boolean flag does not work. A good example is this bit
> from a TAP test added by the patch for built-in replication (which was
> not included with the WIP patch):
>
> BEGIN;
> ALTER SEQUENCE s RESTART WITH 1000;
> SAVEPOINT sp1;
> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
> ROLLBACK TO sp1;
> COMMIT;
>
> This is expected to produce:
>
> 1131|0|t
>
> but produces
>
> 1000|0|f
>
> instead. The reason is very simple - as implemented, the patch simply
> checks if the relfilenode is from the same top-level transaction, which
> it is, and sets the flag to "true". So we know the sequence changes need
> to be queued and replayed as part of this transaction.
>
> But then during decoding, we still queue the changes into the subxact,
> which then aborts, and the changes are discarded. That is not how it's
> supposed to work, because the new relfilenode is still valid, someone
> might do nextval() and commit. And the nextval() may not get WAL-logged,
> so we'd lose this.
>
> What I guess we might do is log not just a boolean flag, but the XID of
> the subtransaction that created the relfilenode. And then during
> decoding we'd queue the changes into this subtransaction ...
>
> 0006 in the attached patch series does this, and it seems to fix the TAP
> test failure. I left it at the end, to make it easier to run tests
> without the patch applied.
>

Offhand, I don't have any better idea than what you have suggested for
the problem, but this needs some thought, including the questions asked
by you. I'll spend some time on it and respond back.

--
With Regards,
Amit Kapila.
On 11/28/23 12:32, Amit Kapila wrote: > On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I spent a bit of time looking at the proposed change, and unfortunately >> logging just the boolean flag does not work. A good example is this bit >> from a TAP test added by the patch for built-in replication (which was >> not included with the WIP patch): >> >> BEGIN; >> ALTER SEQUENCE s RESTART WITH 1000; >> SAVEPOINT sp1; >> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100); >> ROLLBACK TO sp1; >> COMMIT; >> >> This is expected to produce: >> >> 1131|0|t >> >> but produces >> >> 1000|0|f >> >> instead. The reason is very simple - as implemented, the patch simply >> checks if the relfilenode is from the same top-level transaction, which >> it is, and sets the flag to "true". So we know the sequence changes need >> to be queued and replayed as part of this transaction. >> >> But then during decoding, we still queue the changes into the subxact, >> which then aborts, and the changes are discarded. That is not how it's >> supposed to work, because the new relfilenode is still valid, someone >> might do nextval() and commit. And the nextval() may not get WAL-logged, >> so we'd lose this. >> >> What I guess we might do is log not just a boolean flag, but the XID of >> the subtransaction that created the relfilenode. And then during >> decoding we'd queue the changes into this subtransaction ... >> >> 0006 in the attached patch series does this, and it seems to fix the TAP >> test failure. I left it at the end, to make it easier to run tests >> without the patch applied. >> > > Offhand, I don't have any better idea than what you have suggested for > the problem but this needs some thoughts including the questions asked > by you. I'll spend some time on it and respond back. > I've been experimenting with the idea to log the XID, and for a moment I was worried it actually can't work, because subtransactions may not actually be just nested in simple way, but form a tree. And what if the sequence was altered in a different branch (sibling subxact), not in the immediate parent. In which case the new SubTransactionGetXid() would fail, because it just walks the current chain of subtransactions. I've been thinking about cases like this: BEGIN; CREATE SEQUENCE s; # XID 1000 SELECT alter_sequence(); # XID 1001 SAVEPOINT s1; SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000 ROLLBACK TO s1; SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000 COMMIT; The XID values are what the sequence wal record will reference, assuming that the main transaction XID is 1000. Initially, I thought it's wrong that the nextval() calls reference XID of the main transaction, because the last relfilenode comes from 1001, which is the subxact created by alter_sequence() thanks to the exception handling block. And that's where the approach in reorderbuffer would queue the changes. But I think this is actually correct too. When a subtransaction commits (e.g. when alter_sequence() completes), it essentially becomes part of the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid accordingly, setting it to parentSubid. This also means that SubTransactionGetXid() can't actually fail, because the ID has to reference an active subtransaction in the current stack. 
I'm still concerned about the cost of the lookup, because the list may be long and the subxact we're looking for may be quite deep in it, but I guess we might add another field, caching the XID. It'd need to be updated only in AtEOSubXact_cleanup, and at that point we know it's the immediate parent, so it'd be pretty cheap, I think. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
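To make the lookup discussed above concrete, here is a rough sketch of what such a SubTransactionGetXid() might look like. This is an illustration of the idea only, not the actual patch code; it assumes it lives in xact.c, where the transaction stack (CurrentTransactionState) is visible:

/*
 * Illustration only: resolve a SubTransactionId to the XID of the
 * (sub)transaction that owns it, by walking the active transaction
 * stack. Because aborted subxacts are popped off the stack, and
 * committed ones propagate rd_newRelfilelocatorSubid to their parent
 * in AtEOSubXact_cleanup, the subid always references an entry that
 * is still on the stack.
 */
static TransactionId
SubTransactionGetXid(SubTransactionId subid)
{
	TransactionState s;

	for (s = CurrentTransactionState; s != NULL; s = s->parent)
	{
		if (s->subTransactionId == subid)
			return XidFromFullTransactionId(s->fullTransactionId);
	}

	elog(ERROR, "subtransaction %u is not on the current stack", subid);

	return InvalidTransactionId;	/* keep compiler quiet */
}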
Hi, I have been hacking on implementing the improvements outlined in my preceding e-mail, but I have some bad news - I ran into an issue that I don't know how to solve :-( Consider this transaction: BEGIN; ALTER SEQUENCE s RESTART 1000; SAVEPOINT s1; ALTER SEQUENCE s RESTART 2000; ROLLBACK TO s1; INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); COMMIT; If you try this with the approach relying on rd_newRelfilelocatorSubid and rd_createSubid, it fails like this on the subscriber: ERROR: could not map filenode "base/5/16394" to relation OID This happens because ReorderBufferQueueSequence tries to do this in the non-transactional branch: reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); and the relfilenode is the one created by the first ALTER. But this is obviously wrong - the changes should have been treated as transactional, because they are tied to the first ALTER. So how did we get there? Well, the whole problem is that in case of abort, AtEOSubXact_cleanup resets the two fields to InvalidSubTransactionId. Which means the rollback in the above transaction also forgets about the first ALTER. Now that I look at the RelationData comments, they actually describe exactly this situation: * * rd_newRelfilelocatorSubid is the ID of the highest subtransaction * the most-recent relfilenumber change has survived into or zero if * not changed in the current transaction (or we have forgotten * changing it). This field is accurate when non-zero, but it can be * zero when a relation has multiple new relfilenumbers within a * single transaction, with one of them occurring in a subsequently * aborted subtransaction, e.g. * BEGIN; * TRUNCATE t; * SAVEPOINT save; * TRUNCATE t; * ROLLBACK TO save; * -- rd_newRelfilelocatorSubid is now forgotten * The root of this problem is that we'd need some sort of "history" for the field, so that when a subxact aborts, we can restore the previous value. But we obviously don't have that, and I doubt we want to add that to relcache - for example, it'd either need to impose some limit on the history (and thus a failure when we reach the limit), or it'd need to handle histories of arbitrary length. At this point I don't see a solution for this, which means the best way forward with the sequence decoding patch seems to be the original approach, on the decoding side. I'm attaching the patch with 0005 and 0006, adding two simple tests (no other changes compared to yesterday's version). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v20231128-0001-Logical-decoding-of-sequences.patch
- v20231128-0002-tweak-ReorderBufferSequenceIsTransactional.patch
- v20231128-0003-Add-decoding-of-sequences-to-test_decoding.patch
- v20231128-0004-Add-decoding-of-sequences-to-built-in-repl.patch
- v20231128-0005-subxact-alter-rollback-test.patch
- v20231128-0006-subxact-test.patch
- v20231128-0007-WIP-add-is_transactional-attribute-in-xl_s.patch
- v20231128-0008-log-XID-instead-of-a-boolean-flag.patch
On 11/27/23 23:06, Peter Smith wrote: > FWIW, here are some more minor review comments for v20231127-3-0001 > > ====== > doc/src/sgml/logicaldecoding.sgml > > 1. > + The <parameter>txn</parameter> parameter contains meta information about > + the transaction the sequence change is part of. Note however that for > + non-transactional updates, the transaction may be NULL, depending on > + if the transaction already has an XID assigned. > + The <parameter>sequence_lsn</parameter> has the WAL location of the > + sequence update. <parameter>transactional</parameter> says if the > + sequence has to be replayed as part of the transaction or directly. > > /says if/specifies whether/ > Will fix. > ====== > src/backend/commands/sequence.c > > 2. DecodeSeqTuple > > + memcpy(((char *) tuple->tuple.t_data), > + data + sizeof(xl_seq_rec), > + SizeofHeapTupleHeader); > + > + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader, > + data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader, > + datalen); > > Maybe I am misreading but isn't this just copying 2 contiguous pieces > of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen) > achieve the same? > You're right, will fix. I think the code looked different before, got simplified, and I didn't notice this can be a single memcpy(). > ====== > .../replication/logical/reorderbuffer.c > > 3. > + * To decide if a sequence change is transactional, we maintain a hash > + * table of relfilenodes created in each (sub)transactions, along with > + * the XID of the (sub)transaction that created the relfilenode. The > + * entries from substransactions are copied to the top-level transaction > + * to make checks cheaper. The hash table gets cleaned up when the > + * transaction completes (commit/abort). > > /substransactions/subtransactions/ > Will fix. > ~~~ > > 4. > + * A naive approach would be to just loop through all transactions and check > + * each of them, but there may be (easily thousands) of subtransactions, and > + * the check happens for each sequence change. So this could be very costly. > > /may be (easily thousands) of/may be (easily thousands of)/ > > ~~~ Thanks. I've reworded this to ... may be many (easily thousands of) subtransactions ... > > 5. ReorderBufferSequenceCleanup > > + while ((ent = (ReorderBufferSequenceEnt *) hash_seq_search(&scan_status)) != NULL) > + { > + (void) hash_search(txn->toptxn->sequences_hash, > + (void *) &ent->rlocator, > + HASH_REMOVE, NULL); > + } > > Typically, other HASH_REMOVE code I saw would check result for NULL to > give elog(ERROR, "hash table corrupted"); > Good point, I'll add the error check. > ~~~ > > 6. ReorderBufferQueueSequence > > + if (xid != InvalidTransactionId) > + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > How about using the macro: TransactionIdIsValid > Actually, as I wrote in some other message, I think the check is not necessary. Or rather, it should be an assert that the XID is valid. And yeah, the macro is a good idea. > ~~~ > > 7. ReorderBufferQueueSequence > > + if (reloid == InvalidOid) > + elog(ERROR, "could not map filenode \"%s\" to relation OID", > + relpathperm(rlocator, > + MAIN_FORKNUM)); > > How about using the macro: OidIsValid > I chose to keep this consistent with other places in reorderbuffer, and all of them use the equality check. > ~~~ > > 8. > + /* > + * Calculate the first value of the next batch (at which point we > + * generate and decode another WAL record. > + */ > > Missing ')' > Will fix. > ~~~ > > 9. 
ReorderBufferAddRelFileLocator > > + /* > + * We only care about sequence relfilenodes for now, and those always have > + * a XID. So if there's no XID, don't bother adding them to the hash. > + */ > + if (xid == InvalidTransactionId) > + return; > > How about using the macro: TransactionIdIsValid > Will change. > ~~~ > > 10. ReorderBufferProcessTXN > > + if (reloid == InvalidOid) > + elog(ERROR, "could not map filenode \"%s\" to relation OID", > + relpathperm(change->data.sequence.locator, > + MAIN_FORKNUM)); > > How about using the macro: OidIsValid > Same as the other Oid check - consistency. > ~~~ > > 11. ReorderBufferChangeSize > > + if (tup) > + { > + sz += sizeof(HeapTupleData); > + len = tup->tuple.t_len; > + sz += len; > + } > > Why is the 'sz' increment split into 2 parts? > Because the other branches in ReorderBufferChangeSize do it that way. You're right it might be coded on a single line. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
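Regarding review item 2 above (DecodeSeqTuple copying two contiguous pieces), the merged copy might look like this - a sketch using the variable names from the quoted snippet:

/*
 * The two memcpy() calls read from adjacent source offsets and write
 * to adjacent destination offsets, so a single call is equivalent:
 */
memcpy((char *) tuple->tuple.t_data,
       data + sizeof(xl_seq_rec),
       SizeofHeapTupleHeader + datalen);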
Hi! Considering my findings about issues with the rd_newRelfilelocatorSubid field and how it makes that approach impossible, I decided to rip out those patches, and go back to the approach where reorderbuffer tracks new relfilenodes. This means the open questions I listed two days ago disappear, because all of that was about the alternative approach. I've also added a couple more tests into 034_sequences.pl, testing the basic cases with subtransactions that rollback (or not), etc. The attached patch also addresses the review comments by Peter Smith. The one remaining open question is ReorderBufferSequenceIsTransactional and whether it can do better than searching through all top-level transactions. The idea of 0002 was to only search the current top-level xact, but Amit pointed out we can't rely on seeing the assignment until we know we're in a consistent snapshot. I've yet to do some tests to measure how expensive this lookup can be in practice. But let's assume it's measurable and significant enough to matter. I wonder if we could salvage this optimization somehow. I'm thinking about three options: 1) Could ReorderBufferSequenceIsTransactional check the snapshot is already consistent etc. and use the optimized variant (looking only at the same top-level xact) in that case? And if not, fall back to searching all top-level xacts. In practice, the full search would be used only for a short initial period. 2) We could also make ReorderBufferSequenceIsTransactional always check the same top-level transaction first and then fall back, no matter whether the snapshot is consistent or not. The problem is this doesn't really optimize the common case where there are no new relfilenodes, so we won't find a match in the top-level xact, and will always search everything anyway. 3) Alternatively, we could maintain a global hash table, instead of in the top-level transaction (a rough sketch of this variant follows below). So there'd always be two copies, one in the xact itself and then in the global hash. Now there's either one (in current top-level xact), or two (subxact + top-level xact). I kinda like (3), because it just works and doesn't require the snapshot being consistent etc. Opinions? -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
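To make option (3) a bit more concrete, here is a minimal sketch of the global variant. The names are illustrative, reusing ReorderBufferSequenceEnt from the patch; the rb->sequences_hash field is hypothetical:

/*
 * Sketch of option (3): one reorderbuffer-wide hash of relfilenodes
 * created by any in-progress transaction, so the check is a single
 * lookup regardless of how many top-level transactions exist.
 */
typedef struct ReorderBufferSequenceEnt
{
	RelFileLocator rlocator;	/* hash key: the new relfilenode */
	TransactionId xid;			/* (sub)xact that created it */
} ReorderBufferSequenceEnt;

static bool
ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
									 RelFileLocator rlocator)
{
	bool		found;

	(void) hash_search(rb->sequences_hash, &rlocator, HASH_FIND, &found);

	/* found => relfilenode was created in this decoding session */
	return found;
}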
On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I have been hacking on improving the improvements outlined in my > preceding e-mail, but I have some bad news - I ran into an issue that I > don't know how to solve :-( > > Consider this transaction: > > BEGIN; > ALTER SEQUENCE s RESTART 1000; > > SAVEPOINT s1; > ALTER SEQUENCE s RESTART 2000; > ROLLBACK TO s1; > > INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); > COMMIT; > > If you try this with the approach relying on rd_newRelfilelocatorSubid > and rd_createSubid, it fails like this on the subscriber: > > ERROR: could not map filenode "base/5/16394" to relation OID > > This happens because ReorderBufferQueueSequence tries to do this in the > non-transactional branch: > > reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); > > and the relfilenode is the one created by the first ALTER. But this is > obviously wrong - the changes should have been treated as transactional, > because they are tied to the first ALTER. So how did we get there? > > Well, the whole problem is that in case of abort, AtEOSubXact_cleanup > resets the two fields to InvalidSubTransactionId. Which means the > rollback in the above transaction also forgets about the first ALTER. > Now that I look at the RelationData comments, it actually describes > exactly this situation: > > * > * rd_newRelfilelocatorSubid is the ID of the highest subtransaction > * the most-recent relfilenumber change has survived into or zero if > * not changed in the current transaction (or we have forgotten > * changing it). This field is accurate when non-zero, but it can be > * zero when a relation has multiple new relfilenumbers within a > * single transaction, with one of them occurring in a subsequently > * aborted subtransaction, e.g. > * BEGIN; > * TRUNCATE t; > * SAVEPOINT save; > * TRUNCATE t; > * ROLLBACK TO save; > * -- rd_newRelfilelocatorSubid is now forgotten > * > > The root of this problem is that we'd need some sort of "history" for > the field, so that when a subxact aborts, we can restore the previous > value. But we obviously don't have that, and I doubt we want to add that > to relcache - for example, it'd either need to impose some limit on the > history (and thus a failure when we reach the limit), or it'd need to > handle histories of arbitrary length. > Yeah, I think that would be really tricky and we may not want to go there. > At this point I don't see a solution for this, which means the best way > forward with the sequence decoding patch seems to be the original > approach, on the decoding side. > One thing that worries me about that approach is that it can suck with the workload that has a lot of DDLs that create XLOG_SMGR_CREATE records. We have previously fixed some such workloads in logical decoding where decoding a transaction containing truncation of a table with a lot of partitions (1000 or more) used to take a very long time. Don't we face performance issues in such scenarios? How do we see this work w.r.t to some sort of global sequences? There is some recent discussion where I have raised a similar point [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com -- With Regards, Amit Kapila.
On 11/29/23 14:42, Amit Kapila wrote: > On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I have been hacking on implementing the improvements outlined in my >> preceding e-mail, but I have some bad news - I ran into an issue that I >> don't know how to solve :-( >> >> Consider this transaction: >> >> BEGIN; >> ALTER SEQUENCE s RESTART 1000; >> >> SAVEPOINT s1; >> ALTER SEQUENCE s RESTART 2000; >> ROLLBACK TO s1; >> >> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); >> COMMIT; >> >> If you try this with the approach relying on rd_newRelfilelocatorSubid >> and rd_createSubid, it fails like this on the subscriber: >> >> ERROR: could not map filenode "base/5/16394" to relation OID >> >> This happens because ReorderBufferQueueSequence tries to do this in the >> non-transactional branch: >> >> reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); >> >> and the relfilenode is the one created by the first ALTER. But this is >> obviously wrong - the changes should have been treated as transactional, >> because they are tied to the first ALTER. So how did we get there? >> >> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup >> resets the two fields to InvalidSubTransactionId. Which means the >> rollback in the above transaction also forgets about the first ALTER. >> Now that I look at the RelationData comments, they actually describe >> exactly this situation: >> >> * >> * rd_newRelfilelocatorSubid is the ID of the highest subtransaction >> * the most-recent relfilenumber change has survived into or zero if >> * not changed in the current transaction (or we have forgotten >> * changing it). This field is accurate when non-zero, but it can be >> * zero when a relation has multiple new relfilenumbers within a >> * single transaction, with one of them occurring in a subsequently >> * aborted subtransaction, e.g. >> * BEGIN; >> * TRUNCATE t; >> * SAVEPOINT save; >> * TRUNCATE t; >> * ROLLBACK TO save; >> * -- rd_newRelfilelocatorSubid is now forgotten >> * >> >> The root of this problem is that we'd need some sort of "history" for >> the field, so that when a subxact aborts, we can restore the previous >> value. But we obviously don't have that, and I doubt we want to add that >> to relcache - for example, it'd either need to impose some limit on the >> history (and thus a failure when we reach the limit), or it'd need to >> handle histories of arbitrary length. >> > > Yeah, I think that would be really tricky and we may not want to go there. > >> At this point I don't see a solution for this, which means the best way >> forward with the sequence decoding patch seems to be the original >> approach, on the decoding side. >> > > One thing that worries me about that approach is that it can suck with > the workload that has a lot of DDLs that create XLOG_SMGR_CREATE > records. We have previously fixed some such workloads in logical > decoding where decoding a transaction containing truncation of a table > with a lot of partitions (1000 or more) used to take a very long time. > Don't we face performance issues in such scenarios? > I don't think we do, really. We will have to decode the SMGR records and add the relfilenodes to the hash table(s), but I don't think that affects the lookup performance too much. What I think might be a problem is if we have many top-level transactions, especially if those transactions do something that creates a relfilenode. 
Because then we'll have to do a hash_search for each of them, and that might be measurable even if each lookup is O(1). And we do the lookup for every sequence change ... > How do we see this work w.r.t to some sort of global sequences? There > is some recent discussion where I have raised a similar point [1]. > > [1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com > I think those are very different things, even though called "sequences". AFAIK solutions like snowflakeID or UUIDs don't require replication of any shared state (that's kinda the whole point), so I don't see why would it need some special support in logical decoding. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
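To illustrate where that cost would come from, this is roughly what the check described above has to do for every single sequence change - a sketch based on the description in this thread, not the exact patch code:

/*
 * With one relfilenode hash per top-level transaction, deciding whether
 * a sequence change is transactional means probing the hash of every
 * top-level transaction - each lookup is O(1), but the loop is linear
 * in the number of concurrent top-level transactions.
 */
static bool
ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
									 RelFileLocator rlocator)
{
	dlist_iter	iter;

	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
												iter.cur);
		bool		found;

		(void) hash_search(txn->sequences_hash, &rlocator, HASH_FIND, &found);
		if (found)
			return true;
	}

	return false;
}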
On 11/29/23 15:41, Tomas Vondra wrote: > ... >> >> One thing that worries me about that approach is that it can suck with >> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE >> records. We have previously fixed some such workloads in logical >> decoding where decoding a transaction containing truncation of a table >> with a lot of partitions (1000 or more) used to take a very long time. >> Don't we face performance issues in such scenarios? >> > > I don't think we do, really. We will have to decode the SMGR records and > add the relfilenodes to the hash table(s), but I don't think that affects the > lookup performance too much. What I think might be a problem is if we > have many top-level transactions, especially if those transactions do > something that creates a relfilenode. Because then we'll have to do a > hash_search for each of them, and that might be measurable even if each > lookup is O(1). And we do the lookup for every sequence change ... > I did some micro-benchmarking today, trying to identify cases where this would cause unexpected problems, either due to having to maintain all the relfilenodes, or due to having to do hash lookups for every sequence change. But I think it's fine, mostly ... I did all the following tests with 64 clients. I may try more, but even with this there should be a fair number of concurrent transactions, which determines the number of top-level transactions in reorderbuffer. I'll try with more clients tomorrow, but I don't think it'll change stuff. The test is fairly simple - run a particular number of transactions (might be 1000 * 64, or more). And then measure how long it takes to decode the changes using test_decoding. Now, the various workloads I tried: 1) "good case" - small OLTP transactions, a couple nextval('s') calls begin; insert into t values (1); select nextval('s'); insert into t values (1); commit; This is pretty fine, the sequence part of reorderbuffer is really not measurable, it's like 1% of the total CPU time. Which is expected, because we only WAL-log every 32nd increment or so. 2) "good case" - same as (1) but with enough nextval calls to always WAL-log begin; insert into t values (1); select nextval('s') from generate_series(1,40); insert into t values (1); commit; Here sequences are more measurable, it's like 15% of CPU time, but most of that comes from AbortCurrentTransaction() in the non-transactional branch of ReorderBufferQueueSequence. I don't think there's a way around that, and it's entirely unrelated to relfilenodes. The function checking if the change is transactional (ReorderBufferSequenceIsTransactional) is less than 1% of the profile - and this is the version that always walks all top-level transactions. 3) "bad case" - small transactions that generate a lot of relfilenodes select alter_sequence(); where the function is defined like this (I did create 1000 sequences before the test): CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ DECLARE v INT; BEGIN v := 1 + (random() * 999)::int; execute format('alter sequence s%s restart with 1000', v); perform nextval('s'); END; $$ LANGUAGE plpgsql; This performs terribly, but it's entirely unrelated to sequences. Current master has exactly the same problem, if transactions do DDL. Like this, for example: CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ DECLARE v INT; BEGIN v := 1 + (random() * 999)::int; execute format('create table t%s (a int)', v); execute format('drop table t%s', v); insert into t values (1); END; $$ LANGUAGE plpgsql; This has the same impact on master. 
The perf report shows this: --98.06%--pg_logical_slot_get_changes_guts | --97.88%--LogicalDecodingProcessRecord | --97.56%--xact_decode | --97.51%--DecodeCommit | |--91.92%--SnapBuildCommitTxn | | | --91.65%--SnapBuildBuildSnapshot | | | --91.14%--pg_qsort The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot takes so long is because: ----------------- Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) at snapbuild.c:498 498 + sizeof(TransactionId) * builder->committed.xcnt (gdb) p builder->committed.xcnt $4 = 11532 ----------------- And with each iteration it grows by 1. That looks quite weird, possibly a bug worth fixing, but unrelated to this patch. I can't investigate this more at the moment, not sure when/if I'll get to that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
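As a side note, the reason case (1) is so cheap is the WAL batching in nextval() - sequence.c emits one WAL record per SEQ_LOG_VALS (32) values fetched, pre-reserving the rest. Here's a self-contained toy model of that batching (SEQ_LOG_VALS is the real constant from src/backend/commands/sequence.c; everything else is a simulation, not server code):

#include <stdio.h>

#define SEQ_LOG_VALS 32

int
main(void)
{
	long		next = 0;		/* current sequence value */
	int			log_cnt = 0;	/* values covered by the last WAL record */
	int			wal_records = 0;

	for (int i = 0; i < 1000; i++)	/* simulate 1000 nextval() calls */
	{
		if (log_cnt == 0)
		{
			/* allowance exhausted: emit one record reserving a new batch */
			wal_records++;
			log_cnt = SEQ_LOG_VALS;
		}
		next++;
		log_cnt--;
	}

	/* prints "1000 values fetched, 32 WAL records" */
	printf("%ld values fetched, %d WAL records\n", next, wal_records);
	return 0;
}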
On Wed, Nov 29, 2023 at 11:45 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 11/27/23 23:06, Peter Smith wrote: > > FWIW, here are some more minor review comments for v20231127-3-0001 > > > > ====== > > .../replication/logical/reorderbuffer.c > > > > 3. > > + * To decide if a sequence change is transactional, we maintain a hash > > + * table of relfilenodes created in each (sub)transactions, along with > > + * the XID of the (sub)transaction that created the relfilenode. The > > + * entries from substransactions are copied to the top-level transaction > > + * to make checks cheaper. The hash table gets cleaned up when the > > + * transaction completes (commit/abort). > > > > /substransactions/subtransactions/ > > > > Will fix. FYI - I think this typo still exists in the patch v20231128-0001. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > 3) "bad case" - small transactions that generate a lot of relfilenodes > > select alter_sequence(); > > where the function is defined like this (I did create 1000 sequences > before the test): > > CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ > DECLARE > v INT; > BEGIN > v := 1 + (random() * 999)::int; > execute format('alter sequence s%s restart with 1000', v); > perform nextval('s'); > END; > $$ LANGUAGE plpgsql; > > This performs terribly, but it's entirely unrelated to sequences. > Current master has exactly the same problem, if transactions do DDL. > Like this, for example: > > CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ > DECLARE > v INT; > BEGIN > v := 1 + (random() * 999)::int; > execute format('create table t%s (a int)', v); > execute format('drop table t%s', v); > insert into t values (1); > END; > $$ LANGUAGE plpgsql; > > This has the same impact on master. The perf report shows this: > > --98.06%--pg_logical_slot_get_changes_guts > | > --97.88%--LogicalDecodingProcessRecord > | > --97.56%--xact_decode > | > --97.51%--DecodeCommit > | > |--91.92%--SnapBuildCommitTxn > | | > | --91.65%--SnapBuildBuildSnapshot > | | > | --91.14%--pg_qsort > > The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot > takes so long is because: > > ----------------- > Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) > at snapbuild.c:498 > 498 + sizeof(TransactionId) * builder->committed.xcnt > (gdb) p builder->committed.xcnt > $4 = 11532 > ----------------- > > And with each iteration it grows by 1. > Can we somehow avoid this either by keeping DDL-related xacts open or aborting them? Also, will it make any difference to use setval as do_setval() seems to be logging each time? If possible, can you share the scripts? Kuroda-San has access to the performance machine, he may be able to try it as well. -- With Regards, Amit Kapila.
Dear Tomas, > I did some micro-benchmarking today, trying to identify cases where this > would cause unexpected problems, either due to having to maintain all > the relfilenodes, or due to having to do hash lookups for every sequence > change. But I think it's fine, mostly ... > I also did performance tests (especially case 3). First of all, there are some differences from yours. 1. Patch 0002 was reverted because it has an issue. So this test checks whether the refactoring around ReorderBufferSequenceIsTransactional is really needed. 2. Per comments from Amit, I also measured the abort case. In this case, alter_sequence() is called but the transaction is aborted. 3. I measured with a varying number of clients {8, 16, 32, 64, 128}. In all cases, clients executed 1000 transactions. The performance machine has 128 cores, so the result for 128 clients might be saturated. 4. A short sleep (0.1s) was added in alter_sequence(), between "alter sequence" and nextval(), because while testing I found that the transaction is too short to execute in parallel. I think it is reasonable because ReorderBufferSequenceIsTransactional() might get worse as parallelism increases. I attached one backend process via perf and executed pg_logical_slot_get_changes(). The attached txt file shows which functions occupied CPU time, especially from pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional(). Here are my observations about them. * In the commit case, as you said, SnapBuildCommitTxn() seems dominant for the 8-64 clients cases. * For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges() wastes a lot of time. I think this is because the changes exceed logical_decoding_work_mem, so we do not have to analyze this case further. * In the abort case, the CPU time used by ReorderBufferSequenceIsTransactional() grows linearly. This means that we need to think of some solution to avoid the overhead of ReorderBufferSequenceIsTransactional(). ``` 8 clients 3.73% occupied time 16 7.26% 32 15.82% 64 29.14% 128 46.27% ``` * In the abort case, I also checked the CPU time used by ReorderBufferAddRelFileLocator(), but it does not seem to depend much on the number of clients. ``` 8 clients 3.66% occupied time 16 6.94% 32 4.65% 64 5.39% 128 3.06% ``` As a next step, I plan to run the case which uses the setval() function, because it generates more WAL than normal nextval(). What do you think? Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On 11/30/23 12:56, Amit Kapila wrote: > On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> 3) "bad case" - small transactions that generate a lot of relfilenodes >> >> select alter_sequence(); >> >> where the function is defined like this (I did create 1000 sequences >> before the test): >> >> CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ >> DECLARE >> v INT; >> BEGIN >> v := 1 + (random() * 999)::int; >> execute format('alter sequence s%s restart with 1000', v); >> perform nextval('s'); >> END; >> $$ LANGUAGE plpgsql; >> >> This performs terribly, but it's entirely unrelated to sequences. >> Current master has exactly the same problem, if transactions do DDL. >> Like this, for example: >> >> CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ >> DECLARE >> v INT; >> BEGIN >> v := 1 + (random() * 999)::int; >> execute format('create table t%s (a int)', v); >> execute format('drop table t%s', v); >> insert into t values (1); >> END; >> $$ LANGUAGE plpgsql; >> >> This has the same impact on master. The perf report shows this: >> >> --98.06%--pg_logical_slot_get_changes_guts >> | >> --97.88%--LogicalDecodingProcessRecord >> | >> --97.56%--xact_decode >> | >> --97.51%--DecodeCommit >> | >> |--91.92%--SnapBuildCommitTxn >> | | >> | --91.65%--SnapBuildBuildSnapshot >> | | >> | --91.14%--pg_qsort >> >> The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot >> takes so long is because: >> >> ----------------- >> Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) >> at snapbuild.c:498 >> 498 + sizeof(TransactionId) * builder->committed.xcnt >> (gdb) p builder->committed.xcnt >> $4 = 11532 >> ----------------- >> >> And with each iteration it grows by 1. >> > > Can we somehow avoid this either by keeping DDL-related xacts open or > aborting them? I'm not sure why the snapshot builder does this, i.e. why we end up accumulating that many xids, and I didn't have time to look closer. So I don't know if this would be a solution or not. > Also, will it make any difference to use setval as > do_setval() seems to be logging each time? > I think that's pretty much what case (2) does, as it calls nextval() enough times for each transaction to generate WAL. But I don't think this is a very sensible benchmark - it's an extreme case, but practical cases are far closer to case (1) because sequences are intermixed with other activity. No one really does just nextval() calls. > If possible, can you share the scripts? Kuroda-San has access to the > performance machine, he may be able to try it as well. > Sure, attached. But it's a very primitive script, nothing fancy. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 12/1/23 12:08, Hayato Kuroda (Fujitsu) wrote: > Dear Tomas, > >> I did some micro-benchmarking today, trying to identify cases where this >> would cause unexpected problems, either due to having to maintain all >> the relfilenodes, or due to having to do hash lookups for every sequence >> change. But I think it's fine, mostly ... >> > > I also did performance tests (especially case 3). First of all, there are some > differences from yours. > > 1. Patch 0002 was reverted because it has an issue. So this test checks whether > the refactoring around ReorderBufferSequenceIsTransactional is really needed. FWIW I also did the benchmarks without the 0002 patch, for the same reason. I forgot to mention that. > 2. Per comments from Amit, I also measured the abort case. In this case, > alter_sequence() is called but the transaction is aborted. > 3. I measured with a varying number of clients {8, 16, 32, 64, 128}. In all cases, > clients executed 1000 transactions. The performance machine has 128 cores, so the > result for 128 clients might be saturated. > 4. A short sleep (0.1s) was added in alter_sequence(), between > "alter sequence" and nextval(), because while testing I found that the > transaction is too short to execute in parallel. I think it is reasonable > because ReorderBufferSequenceIsTransactional() might get worse as parallelism > increases. > > I attached one backend process via perf and executed pg_logical_slot_get_changes(). > The attached txt file shows which functions occupied CPU time, especially from > pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional(). > Here are my observations about them. > > * In the commit case, as you said, SnapBuildCommitTxn() seems dominant for the 8-64 > clients cases. > * For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges() wastes > a lot of time. I think this is because the changes exceed logical_decoding_work_mem, > so we do not have to analyze this case further. > * In the abort case, the CPU time used by ReorderBufferSequenceIsTransactional() grows > linearly. This means that we need to think of some solution to avoid the overhead of > ReorderBufferSequenceIsTransactional(). > > ``` > 8 clients 3.73% occupied time > 16 7.26% > 32 15.82% > 64 29.14% > 128 46.27% > ``` Interesting, so what exactly does the transaction do? Anyway, I don't think this is very surprising - I believe it behaves like this because of having to search in many hash tables (one in each toplevel xact). And I think the solution I explained before (maintaining a single toplevel hash, instead of many per-top-level hashes) would address that. FWIW I find this case interesting, but not very practical, because no practical workload has that many aborts. > > * In the abort case, I also checked the CPU time used by ReorderBufferAddRelFileLocator(), but > it does not seem to depend much on the number of clients. > > ``` > 8 clients 3.66% occupied time > 16 6.94% > 32 4.65% > 64 5.39% > 128 3.06% > ``` > > As a next step, I plan to run the case which uses the setval() function, because it > generates more WAL than normal nextval(). > What do you think? Sure, although I don't think it's much different from the test selecting 40 values from the sequence (in each transaction). That generates about the same amount of WAL. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Dear Tomas, > > I also did performance tests (especially case 3). First of all, there are some > > differences from yours. > > > > 1. Patch 0002 was reverted because it has an issue. So this test checks whether > > the refactoring around ReorderBufferSequenceIsTransactional is really > needed. > > FWIW I also did the benchmarks without the 0002 patch, for the same > reason. I forgot to mention that. Oh, good news. So your benchmarks are quite meaningful. > > Interesting, so what exactly does the transaction do? It is quite simple - PSA the script file. It was executed with 64 clients. The definition of alter_sequence() is the same as you said. (I used a normal bash script to run them, but your approach may be smarter) > Anyway, I don't > think this is very surprising - I believe it behaves like this because > of having to search in many hash tables (one in each toplevel xact). And > I think the solution I explained before (maintaining a single toplevel > hash, instead of many per-top-level hashes) would address that. Agreed. And I can benchmark again for the new ones, maybe once we decide on a new approach. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On 12/3/23 13:55, Hayato Kuroda (Fujitsu) wrote: > Dear Tomas, > >>> I also did performance tests (especially case 3). First of all, there are some >>> differences from yours. >>> >>> 1. Patch 0002 was reverted because it has an issue. So this test checks whether >>> the refactoring around ReorderBufferSequenceIsTransactional is really >> needed. >> >> FWIW I also did the benchmarks without the 0002 patch, for the same >> reason. I forgot to mention that. > > Oh, good news. So your benchmarks are quite meaningful. > >> >> Interesting, so what exactly does the transaction do? > > It is quite simple - PSA the script file. It was executed with 64 clients. > The definition of alter_sequence() is the same as you said. > (I used a normal bash script to run them, but your approach may be smarter) > >> Anyway, I don't >> think this is very surprising - I believe it behaves like this because >> of having to search in many hash tables (one in each toplevel xact). And >> I think the solution I explained before (maintaining a single toplevel >> hash, instead of many per-top-level hashes) would address that. > > Agreed. And I can benchmark again for the new ones, maybe once we decide > on a new approach. > Thanks for the script. Are you also measuring the time it takes to decode this using test_decoding? FWIW I did a more comprehensive suite of tests over the weekend, with a couple more variations. I'm attaching the updated scripts, running it should be as simple as ./run.sh BRANCH TRANSACTIONS RUNS so perhaps ./run.sh master 1000 3 to do 3 runs with 1000 transactions per client. And it'll run a bunch of combinations hard-coded in the script, and write the timings into a CSV file (with "master" in each row). I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I did this with current master, the basic patch (without the 0002 part), and then with the optimized approach (single global hash table, see the 0004 part). That's what master / patched / optimized in the results is. Interestingly enough, the i5 handled this much faster, it seems to be better in single-core tasks. The xeon is still running, so the results for "optimized" only have one run (out of 3), but shouldn't change much. Attached is also a table summarizing this, and visualizing the timing change (vs. master) in the last couple columns. Green is "faster" than master (but we don't really expect that), and "red" means slower than master (the more red, the slower). The results are grouped by script (see the attached .tgz), with either 32 or 96 clients (which does affect the timing, but not between master and patch). Some executions have no pg_sleep() calls, some have 0.001 wait (but that doesn't seem to make much difference). Overall, I'd group the results into about three groups: 1) good cases [nextval, nextval-40, nextval-abort] These are cases that slow down a bit, but the slowdown is mostly within reasonable bounds (we're making the decoding do more stuff, so it'd be a bit silly to require that extra work to have no impact). And I do think this is reasonable, because this is pretty much an extreme / worst case behavior. People don't really do just nextval() calls, without doing anything else. Not to mention doing aborts for 100% of transactions. So in practice this is going to be within noise (and in those cases the results even show speedup, which seems a bit surprising). It's somewhat dependent on CPU too - on xeon there's hardly any regression. 
2) nextval-40-abort Here the slowdown is clear, but I'd argue it generally falls in the same group as (1). Yes, I'd be happier if it didn't behave like this, but if someone can show me a practical workload affected by this ... 3) irrelevant cases [all the alters taking insane amounts of time] I absolutely refuse to care about these extreme cases where decoding 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on xeon). If this was a problem for some practical workload, we'd have already heard about it I guess. And even if there was such a workload, it wouldn't be up to this patch to fix that. There's clearly something misbehaving in the snapshot builder. I was hopeful the global hash table would be an improvement, but that doesn't seem to be the case. I haven't done much profiling yet, but I'd guess most of the overhead is due to ReorderBufferQueueSequence() starting and aborting a transaction in the non-transactional case. Which is unfortunate, but I don't know if there's a way to optimize that. Some time ago I floated the idea of maybe "queuing" the sequence changes and only replaying them on the next commit, somehow. But we ran into problems with which snapshot to use that I didn't know how to solve. Maybe we should try again. The idea is we'd queue the non-transactional changes somewhere (can't be in the transaction, because we must keep them even if it aborts), and then "inject" them into the next commit. That'd mean we wouldn't do the separate start/abort for each change. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 12/3/23 18:52, Tomas Vondra wrote: > ... > > Some time ago I floated the idea of maybe "queuing" the sequence changes > and only replay them on the next commit, somehow. But we did ran into > problems with which snapshot to use, that I didn't know how to solve. > Maybe we should try again. The idea is we'd queue the non-transactional > changes somewhere (can't be in the transaction, because we must keep > them even if it aborts), and then "inject" them into the next commit. > That'd mean we wouldn't do the separate start/abort for each change. > Another idea is that maybe we could somehow inform ReorderBuffer whether the output plugin even is interested in sequences. That'd help with cases where we don't even want/need to replicate sequences, e.g. because the publication does not specify (publish=sequence). What happens now in that case is we call ReorderBufferQueueSequence(), it does the whole dance with starting/aborting the transaction, calls rb->sequence() which just does "meh" and doesn't do anything. Maybe we could just short-circuit this by asking the output plugin somehow. In an extreme case the plugin may not even specify the sequence callbacks, and we're still doing all of this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
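To sketch that short-circuit (the helper and its placement are hypothetical; rb->sequence is the output plugin callback mentioned above):

/*
 * Hypothetical check at the top of ReorderBufferQueueSequence(): if the
 * output plugin never registered a sequence callback, return before
 * doing the start/abort transaction dance or any relfilenode lookups.
 */
static bool
ReorderBufferWantsSequences(ReorderBuffer *rb)
{
	return (rb->sequence != NULL);
}

In practice the flag might also reflect the publication's publish=sequence option, but either way the test stays a cheap boolean check.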
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Thanks for the script. Are you also measuring the time it takes to > decode this using test_decoding? > > FWIW I did more comprehensive suite of tests over the weekend, with a > couple more variations. I'm attaching the updated scripts, running it > should be as simple as > > ./run.sh BRANCH TRANSACTIONS RUNS > > so perhaps > > ./run.sh master 1000 3 > > to do 3 runs with 1000 transactions per client. And it'll run a bunch of > combinations hard-coded in the script, and write the timings into a CSV > file (with "master" in each row). > > I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I > did this with current master, the basic patch (without the 0002 part), > and then with the optimized approach (single global hash table, see the > 0004 part). That's what master / patched / optimized in the results is. > > Interestingly enough, the i5 handled this much faster, it seems to be > better in single-core tasks. The xeon is still running, so the results > for "optimized" only have one run (out of 3), but shouldn't change much. > > Attached is also a table summarizing this, and visualizing the timing > change (vs. master) in the last couple columns. Green is "faster" than > master (but we don't really expect that), and "red" means slower than > master (the more red, the slower). > > There results are grouped by script (see the attached .tgz), with either > 32 or 96 clients (which does affect the timing, but not between master > and patch). Some executions have no pg_sleep() calls, some have 0.001 > wait (but that doesn't seem to make much difference). > > Overall, I'd group the results into about three groups: > > 1) good cases [nextval, nextval-40, nextval-abort] > > These are cases that slow down a bit, but the slowdown is mostly within > reasonable bounds (we're making the decoding to do more stuff, so it'd > be a bit silly to require that extra work to make no impact). And I do > think this is reasonable, because this is pretty much an extreme / worst > case behavior. People don't really do just nextval() calls, without > doing anything else. Not to mention doing aborts for 100% transactions. > > So in practice this is going to be within noise (and in those cases the > results even show speedup, which seems a bit surprising). It's somewhat > dependent on CPU too - on xeon there's hardly any regression. > > > 2) nextval-40-abort > > Here the slowdown is clear, but I'd argue it generally falls in the same > group as (1). Yes, I'd be happier if it didn't behave like this, but if > someone can show me a practical workload affected by this ... > > > 3) irrelevant cases [all the alters taking insane amounts of time] > > I absolutely refuse to care about these extreme cases where decoding > 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on > xeon). If this was a problem for some practical workload, we'd have > already heard about it I guess. And even if there was such workload, it > wouldn't be up to this patch to fix that. There's clearly something > misbehaving in the snapshot builder. > > > I was hopeful the global hash table would be an improvement, but that > doesn't seem to be the case. I haven't done much profiling yet, but I'd > guess most of the overhead is due to ReorderBufferQueueSequence() > starting and aborting a transaction in the non-transactinal case. Which > is unfortunate, but I don't know if there's a way to optimize that. 
> Before discussing the alternative ideas you shared, let me try to clarify my understanding so that we are on the same page. I see two observations based on the testing and discussion we had: (a) for non-transactional cases, the overhead observed is mainly due to starting/aborting a transaction for each change; (b) for transactional cases, we see overhead due to traversing all the top-level txns and checking the hash table for each one to find whether a change is transactional. Am I missing something? -- With Regards, Amit Kapila.
On 12/5/23 13:17, Amit Kapila wrote: > ... >> I was hopeful the global hash table would be an improvement, but that >> doesn't seem to be the case. I haven't done much profiling yet, but I'd >> guess most of the overhead is due to ReorderBufferQueueSequence() >> starting and aborting a transaction in the non-transactional case. Which >> is unfortunate, but I don't know if there's a way to optimize that. >> > > Before discussing the alternative ideas you shared, let me try to > clarify my understanding so that we are on the same page. I see two > observations based on the testing and discussion we had: (a) for > non-transactional cases, the overhead observed is mainly due to > starting/aborting a transaction for each change; Yes, I believe that's true. See the attached profiles for nextval.sql and nextval-40.sql from master and optimized build (with the global hash), and also a perf-diff. I only include the top 1000 lines for each profile, that should be enough. master - current master without patches applied optimized - master + sequence decoding with global hash table For nextval, there's almost no difference in the profile. Decoding the other changes (inserts) is the dominant part, as we only log sequences every 32 increments. For nextval-40, the main increase is likely due to this part |--11.09%--seq_decode | | | |--9.25%--ReorderBufferQueueSequence | | | | | |--3.56%--AbortCurrentTransaction | | | | | | | --3.53%--AbortSubTransaction | | | | | | | |--0.95%--AtSubAbort_Portals | | | | | | | | | --0.83%--hash_seq_search | | | | | | | --0.83%--ResourceOwnerReleaseInternal | | | | | |--2.06%--BeginInternalSubTransaction | | | | | | | --1.10%--CommitTransactionCommand | | | | | | | --1.07%--StartSubTransaction | | | | | |--1.28%--CleanupSubTransaction | | | | | | | --0.64%--AtSubCleanup_Portals | | | | | | | --0.55%--hash_seq_search | | | | | --0.67%--RelidByRelfilenumber So yeah, that's the transaction stuff in ReorderBufferQueueSequence. There's also a perf-diff, comparing individual functions. > (b) for transactional > cases, we see overhead due to traversing all the top-level txns and > checking the hash table for each one to find whether a change is > transactional. > Not really, no. As I explained in my preceding e-mail, this check makes almost no difference - I did expect it to matter, but it doesn't. And I was a bit disappointed the global hash table didn't move the needle. Most of the time is spent in 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) | ---DecodeCommit (inlined) | |--72.65%--SnapBuildCommitTxn | | | --72.61%--SnapBuildBuildSnapshot | | | --72.09%--pg_qsort | | | |--66.24%--pg_qsort | | | And there's almost no difference between master and build with sequence decoding - see the attached diff-alter-sequence.perf, comparing the two branches (perf diff -c delta-abs). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Some time ago I floated the idea of maybe "queuing" the sequence changes > and only replay them on the next commit, somehow. But we did ran into > problems with which snapshot to use, that I didn't know how to solve. > Maybe we should try again. The idea is we'd queue the non-transactional > changes somewhere (can't be in the transaction, because we must keep > them even if it aborts), and then "inject" them into the next commit. > That'd mean we wouldn't do the separate start/abort for each change. Why can't we use the same concept of SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the non-transactional changes (have some base snapshot before the first change), and whenever there is any catalog change, queue new snapshot change also in the queue of the non-transactional sequence change so that while sending it to downstream whenever it is necessary we will change the historic snapshot? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/5/23 13:17, Amit Kapila wrote: > > > (b) for transactional > > cases, we see overhead due to traversing all the top-level txns and > > checking the hash table for each one to find whether a change is > > transactional. > > > > Not really, no. As I explained in my preceding e-mail, this check makes > almost no difference - I did expect it to matter, but it doesn't. And I > was a bit disappointed the global hash table didn't move the needle. > > Most of the time is spent in > > 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) > | > ---DecodeCommit (inlined) > | > |--72.65%--SnapBuildCommitTxn > | | > | --72.61%--SnapBuildBuildSnapshot > | | > | --72.09%--pg_qsort > | | > | |--66.24%--pg_qsort > | | | > > And there's almost no difference between master and build with sequence > decoding - see the attached diff-alter-sequence.perf, comparing the two > branches (perf diff -c delta-abs). > I think in this case the commit time predominates, which hides the overhead. We didn't investigate in detail if that can be improved, but if we look at a similar case with aborts [1], it shows the overhead of ReorderBufferSequenceIsTransactional(). I understand that aborts won't be frequent and it is a sort of unrealistic test, but it still helps to show that there is overhead in ReorderBufferSequenceIsTransactional(). Now, I am not sure if we can ignore that case because, theoretically, the overhead can increase based on the number of top-level transactions. [1]: https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > I was also wondering what happens if the sequence changes are transactional but somehow the snap builder state changes to SNAPBUILD_FULL_SNAPSHOT in between processing of the smgr_decode() and the seq_decode(), which means the RelFileLocator will not be added to the hash table, and during seq_decode() we will consider the change non-transactional. I haven't fully analyzed what the real problem is in this case, but have we considered it? What happens if the transaction having both ALTER SEQUENCE and nextval() gets aborted, but the nextval() has been considered non-transactional because the smgr_decode() changes were not processed, as the snapshot builder state was not yet SNAPBUILD_FULL_SNAPSHOT? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > > Some time ago I floated the idea of maybe "queuing" the sequence changes > > and only replay them on the next commit, somehow. But we did ran into > > problems with which snapshot to use, that I didn't know how to solve. > > Maybe we should try again. The idea is we'd queue the non-transactional > > changes somewhere (can't be in the transaction, because we must keep > > them even if it aborts), and then "inject" them into the next commit. > > That'd mean we wouldn't do the separate start/abort for each change. > > Why can't we use the same concept of > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > non-transactional changes (have some base snapshot before the first > change), and whenever there is any catalog change, queue new snapshot > change also in the queue of the non-transactional sequence change so > that while sending it to downstream whenever it is necessary we will > change the historic snapshot? > Oh, do you mean maintain different historic snapshots and then switch based on the change we are processing? I guess the other thing we need to consider is the order of processing the changes if we maintain separate queues that need to be processed. -- With Regards, Amit Kapila.
On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/3/23 18:52, Tomas Vondra wrote: > > ... > > > > Another idea is that maybe we could somehow inform ReorderBuffer whether > the output plugin even is interested in sequences. That'd help with > cases where we don't even want/need to replicate sequences, e.g. because > the publication does not specify (publish=sequence). > > What happens now in that case is we call ReorderBufferQueueSequence(), > it does the whole dance with starting/aborting the transaction, calls > rb->sequence() which just does "meh" and doesn't do anything. Maybe we > could just short-circuit this by asking the output plugin somehow. > > In an extreme case the plugin may not even specify the sequence > callbacks, and we're still doing all of this. > We could explore this, but I guess it won't solve the problem we are facing in cases where all sequences are published and the plugin has specified the sequence callbacks. I think it would add some overhead for this check in the positive cases, where we decide to send the changes anyway. -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Why can't we use the same concept of > > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > > non-transactional changes (have some base snapshot before the first > > change), and whenever there is any catalog change, queue new snapshot > > change also in the queue of the non-transactional sequence change so > > that while sending it to downstream whenever it is necessary we will > > change the historic snapshot? > > > > Oh, do you mean maintain different historic snapshots and then switch > based on the change we are processing? I guess the other thing we need > to consider is the order of processing the changes if we maintain > separate queues that need to be processed. I mean we will not specifically maintain the historic changes, but if there is any catalog change where we are pushing the snapshot to all the transactions' change queues, at the same time we will push this snapshot in the non-transactional sequence queue as well. I am not sure what the problem with the ordering is, because we will be queueing all non-transactional sequence changes in a separate queue in the order they arrive, and as soon as we process the next commit we will process all the non-transactional changes at that time. Do you see an issue with that? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
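A rough sketch of the data structure this idea implies (entirely hypothetical, just to make the proposal concrete):

/*
 * Hypothetical queue for non-transactional sequence changes: changes
 * and snapshot updates are interleaved in arrival order, so replaying
 * them at the next commit can switch the historic snapshot at the
 * right points (mirroring what SnapBuildDistributeNewCatalogSnapshot
 * does for per-transaction change queues).
 */
typedef enum SeqQueueKind
{
	SEQ_QUEUE_CHANGE,			/* a non-transactional sequence change */
	SEQ_QUEUE_SNAPSHOT			/* switch historic snapshot from here on */
} SeqQueueKind;

typedef struct SeqQueueEntry
{
	SeqQueueKind kind;
	union
	{
		ReorderBufferChange *change;	/* SEQ_QUEUE_CHANGE */
		Snapshot	snapshot;			/* SEQ_QUEUE_SNAPSHOT */
	}			data;
	dlist_node	node;			/* kept in arrival order */
} SeqQueueEntry;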
On 12/6/23 10:05, Dilip Kumar wrote: > On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> > > I was also wondering what happens if the sequence changes are > transactional but somehow the snap builder state changes to > SNAPBUILD_FULL_SNAPSHOT in between processing of the smgr_decode() and > the seq_decode(), which means the RelFileLocator will not be added to the > hash table, and during seq_decode() we will consider the change > non-transactional. I haven't fully analyzed what the real problem is in > this case, but have we considered it? What happens if > the transaction having both ALTER SEQUENCE and nextval() gets aborted, > but the nextval() has been considered non-transactional because > the smgr_decode() changes were not processed, as the snapshot builder state > was not yet SNAPBUILD_FULL_SNAPSHOT? > Yes, if something like this happens, that'd be a problem: 1) decoding starts, with SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT 2) transaction that creates a new relfilenode gets decoded, but we skip it because we don't have the correct snapshot 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT 4) we decode a sequence change from nextval() for the sequence This would lead to us attempting to apply a sequence change for a relfilenode that's not visible yet (and may even get aborted). But can this even happen? Can we start decoding in the middle of a transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, which is also skipped until SNAPBUILD_FULL_SNAPSHOT? Or logical messages, where we also call the output plugin in non-transactional cases? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/6/23 12:05, Dilip Kumar wrote: > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >>> Why can't we use the same concept of >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the >>> non-transactional changes (have some base snapshot before the first >>> change), and whenever there is any catalog change, queue new snapshot >>> change also in the queue of the non-transactional sequence change so >>> that while sending it to downstream whenever it is necessary we will >>> change the historic snapshot? >>> >> >> Oh, do you mean maintain different historic snapshots and then switch >> based on the change we are processing? I guess the other thing we need >> to consider is the order of processing the changes if we maintain >> separate queues that need to be processed. > > I mean we will not specifically maintain the historic changes, but if > there is any catalog change where we are pushing the snapshot to all > the transaction's change queue, at the same time we will push this > snapshot in the non-transactional sequence queue as well. I am not > sure what is the problem with the ordering? because we will be > queueing all non-transactional sequence changes in a separate queue in > the order they arrive and as soon as we process the next commit we > will process all the non-transactional changes at that time. Do you > see issue with that? > Isn't this (in principle) the idea of queuing the non-transactional changes and then applying them on the next commit? Yes, I didn't get very far with that, but I got stuck exactly on tracking which snapshot to use, so if there's a way to do that, that'd fix my issue. Also, would this mean we don't need to track the relfilenodes, if we're able to query the catalog? Would we be able to check if the relfilenode was created by the current xact? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/6/23 11:19, Amit Kapila wrote: > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 12/3/23 18:52, Tomas Vondra wrote: >>> ... >>> >> >> Another idea is that maybe we could somehow inform ReorderBuffer whether >> the output plugin even is interested in sequences. That'd help with >> cases where we don't even want/need to replicate sequences, e.g. because >> the publication does not specify (publish=sequence). >> >> What happens now in that case is we call ReorderBufferQueueSequence(), >> it does the whole dance with starting/aborting the transaction, calls >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we >> could just short-circuit this by asking the output plugin somehow. >> >> In an extreme case the plugin may not even specify the sequence >> callbacks, and we're still doing all of this. >> > > We could explore this but I guess it won't solve the problem we are > facing in cases where all sequences are published and plugin has > specified the sequence callbacks. I think it would add some overhead > of this check in positive cases where we decide to anyway do send the > changes. Well, the idea is the check would be very simple (essentially just a boolean flag somewhere), so not really measurable. And if the plugin requests decoding sequences, I guess it's natural it may have a bit of overhead. It needs to do more things, after all. It needs to be acceptable, ofc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
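The early-exit check being discussed could look roughly like this minimal standalone model (ReorderBuffer and ReorderBufferQueueSequence mirror the names used in the thread; everything else is assumed for illustration):

```
#include <stddef.h>
#include <stdio.h>

typedef void (*sequence_cb) (const char *seqname, long last_value);

typedef struct ReorderBuffer
{
	sequence_cb sequence;		/* NULL if the plugin has no sequence callback */
} ReorderBuffer;

static void
ReorderBufferQueueSequence(ReorderBuffer *rb, const char *seqname, long last_value)
{
	/* the proposed cheap test: bail out before any transaction dance */
	if (rb->sequence == NULL)
		return;

	/* ... otherwise: start a xact, build the tuple, call the callback, abort ... */
	rb->sequence(seqname, last_value);
}

int
main(void)
{
	ReorderBuffer rb = {NULL};

	ReorderBufferQueueSequence(&rb, "s", 42);	/* returns immediately */
	printf("plugin not interested in sequences, decoding work skipped\n");
	return 0;
}
```

The point is that the negative case costs a single pointer test, while plugins that do register the callback pay the full (and then unavoidable) cost.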
On 12/6/23 09:56, Amit Kapila wrote: > On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 12/5/23 13:17, Amit Kapila wrote: >> >>> (b) for transactional >>> cases, we see overhead due to traversing all the top-level txns and >>> check the hash table for each one to find whether change is >>> transactional. >>> >> >> Not really, no. As I explained in my preceding e-mail, this check makes >> almost no difference - I did expect it to matter, but it doesn't. And I >> was a bit disappointed the global hash table didn't move the needle. >> >> Most of the time is spent in >> >> 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) >> | >> ---DecodeCommit (inlined) >> | >> |--72.65%--SnapBuildCommitTxn >> | | >> | --72.61%--SnapBuildBuildSnapshot >> | | >> | --72.09%--pg_qsort >> | | >> | |--66.24%--pg_qsort >> | | | >> >> And there's almost no difference between master and build with sequence >> decoding - see the attached diff-alter-sequence.perf, comparing the two >> branches (perf diff -c delta-abs). >> > > I think in this the commit time predominates which hides the overhead. > We didn't investigate in detail if that can be improved but if we see > a similar case of abort [1], it shows the overhead of > ReorderBufferSequenceIsTransactional(). I understand that aborts won't > be frequent and it is sort of unrealistic test but still helps to show > that there is overhead in ReorderBufferSequenceIsTransactional(). Now, > I am not sure if we can ignore that case because theoretically, the > overhead can increase based on the number of top-level transactions. > > [1]: https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com > But those profiles were with the "old" patch, with one hash table per top-level transaction. I see nothing like that with the patch [1] that replaces that with a single global hash table. With that patch, the ReorderBufferSequenceIsTransactional() took ~0.5% in any tests I did. What did have bigger impact is this: 46.12% 1.47% postgres [.] pg_logical_slot_get_changes_guts | |--45.12%--pg_logical_slot_get_changes_guts | | | |--42.34%--LogicalDecodingProcessRecord | | | | | |--12.82%--xact_decode | | | | | | | |--9.46%--DecodeAbort (inlined) | | | | | | | | | |--8.44%--ReorderBufferCleanupTXN | | | | | | | | | | | |--3.25%--ReorderBufferSequenceCleanup (in) | | | | | | | | | | | | | |--1.59%--hash_seq_search | | | | | | | | | | | | | |--0.80%--hash_search_with_hash_value | | | | | | | | | | | | | --0.59%--hash_search | | | | | | hash_bytes I guess that could be optimized, but it's also a direct consequence of the huge number of aborts for transactions that create relfilenode. For any other workload this will be negligible. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 6, 2023 at 7:20 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/6/23 11:19, Amit Kapila wrote: > > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 12/3/23 18:52, Tomas Vondra wrote: > >>> ... > >>> > >> > >> Another idea is that maybe we could somehow inform ReorderBuffer whether > >> the output plugin even is interested in sequences. That'd help with > >> cases where we don't even want/need to replicate sequences, e.g. because > >> the publication does not specify (publish=sequence). > >> > >> What happens now in that case is we call ReorderBufferQueueSequence(), > >> it does the whole dance with starting/aborting the transaction, calls > >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we > >> could just short-circuit this by asking the output plugin somehow. > >> > >> In an extreme case the plugin may not even specify the sequence > >> callbacks, and we're still doing all of this. > >> > > > > We could explore this but I guess it won't solve the problem we are > > facing in cases where all sequences are published and plugin has > > specified the sequence callbacks. I think it would add some overhead > > of this check in positive cases where we decide to anyway do send the > > changes. > > Well, the idea is the check would be very simple (essentially just a > boolean flag somewhere), so not really measurable. > > And if the plugin requests decoding sequences, I guess it's natural it > may have a bit of overhead. It needs to do more things, after all. It > needs to be acceptable, ofc. > I agree with you that if it can be done cheaply or without a measurable overhead then it would be a good idea and can serve other purposes as well. For example, see the discussion in [1]. I had in mind more of what the patch in email [1] is doing, where it needs to start/stop a xact, do relcache access, etc., which it seems can add some overhead if done for each change, though I haven't measured so can't be sure. [1] - https://www.postgresql.org/message-id/CAGfChW5Qo2SrjJ7rU9YYtZbRaWv6v-Z8MJn%3DdQNx4uCSqDEOHA%40mail.gmail.com -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/6/23 12:05, Dilip Kumar wrote: > > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >>> Why can't we use the same concept of > >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > >>> non-transactional changes (have some base snapshot before the first > >>> change), and whenever there is any catalog change, queue new snapshot > >>> change also in the queue of the non-transactional sequence change so > >>> that while sending it to downstream whenever it is necessary we will > >>> change the historic snapshot? > >>> > >> > >> Oh, do you mean maintain different historic snapshots and then switch > >> based on the change we are processing? I guess the other thing we need > >> to consider is the order of processing the changes if we maintain > >> separate queues that need to be processed. > > > > I mean we will not specifically maintain the historic changes, but if > > there is any catalog change where we are pushing the snapshot to all > > the transaction's change queue, at the same time we will push this > > snapshot in the non-transactional sequence queue as well. I am not > > sure what is the problem with the ordering? because we will be > > queueing all non-transactional sequence changes in a separate queue in > > the order they arrive and as soon as we process the next commit we > > will process all the non-transactional changes at that time. Do you > > see issue with that? > > > > Isn't this (in principle) the idea of queuing the non-transactional > changes and then applying them on the next commit? Yes, it is. > Yes, I didn't get > very far with that, but I got stuck exactly on tracking which snapshot > to use, so if there's a way to do that, that'd fix my issue. Thinking more about the snapshot issue: do we even need to bother about changing the snapshot at all while streaming the non-transactional sequence changes, or can we send all the non-transactional changes with a single snapshot? So mainly the snapshot logically gets changed due to these 2 events. Case 1: when any transaction which has done a catalog operation gets committed (this changes the global snapshot), and case 2: when within a transaction, there is some catalog change (this just updates the 'curcid' in the base snapshot of the transaction). Now, if we are thinking that we are streaming all the non-transactional sequence changes right before the next commit, then we are not bothered about case 1 at all, because all the changes we have queued so far are before this commit. And if we come to case 2: if we are performing any catalog change on the sequence itself, then the following changes on the same sequence will be considered transactional; and if the changes are just on some other catalog (not relevant to our sequence operation), then we should also not be worried about the command_id change, because the visibility of the catalog lookup for our sequence will be unaffected by it. In short, I am trying to say that we can safely queue the non-transactional sequence changes and stream them based on the snapshot we got when we decoded the first change; as long as we are planning to stream just before the next commit (or the next in-progress stream), we don't ever need to update the snapshot. > Also, would this mean we don't need to track the relfilenodes, if we're > able to query the catalog? Would we be able to check if the relfilenode > was created by the current xact?
I think by querying the catalog and checking the xmin we should be able to figure that out, but isn't that costlier than looking up the relfilenode in the hash? Because just for identifying whether the changes are transactional or non-transactional we would have to query the catalog; that means for each change, before we decide whether to add it to the transaction's change queue or the non-transactional change queue, we would have to query the catalog, i.e. we would have to start/stop a transaction? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
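For illustration, here is a minimal standalone sketch (simplified and assumed; not patch code) of the queue-and-replay-at-next-commit approach discussed above, where a later change to the same sequence supersedes earlier ones, which is what makes batching them attractive:

```
#include <stdio.h>
#include <string.h>

#define MAXQ 64

static struct
{
	char		seq[32];
	long		last_value;
}			queue[MAXQ];
static int	qlen = 0;

/*
 * Queue a non-transactional change; a later change of the same sequence
 * supersedes earlier ones, so at most one entry per sequence survives.
 */
static void
queue_sequence_change(const char *seq, long last_value)
{
	for (int i = 0; i < qlen; i++)
	{
		if (strcmp(queue[i].seq, seq) == 0)
		{
			queue[i].last_value = last_value;
			return;
		}
	}
	if (qlen < MAXQ)
	{
		snprintf(queue[qlen].seq, sizeof(queue[qlen].seq), "%s", seq);
		queue[qlen].last_value = last_value;
		qlen++;
	}
}

/* replay everything when the next commit is decoded */
static void
flush_on_commit(void)
{
	for (int i = 0; i < qlen; i++)
		printf("replay %s -> %ld\n", queue[i].seq, queue[i].last_value);
	qlen = 0;
}

int
main(void)
{
	queue_sequence_change("s", 33);
	queue_sequence_change("s", 66);	/* supersedes 33; only one change replayed */
	flush_on_commit();
	return 0;
}
```

As noted in the thread, short OLTP transactions rarely touch the same sequence twice before a commit, so the collapsing only pays off for larger transactions.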
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/6/23 12:05, Dilip Kumar wrote: > > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >>> Why can't we use the same concept of > >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > >>> non-transactional changes (have some base snapshot before the first > >>> change), and whenever there is any catalog change, queue new snapshot > >>> change also in the queue of the non-transactional sequence change so > >>> that while sending it to downstream whenever it is necessary we will > >>> change the historic snapshot? > >>> > >> > >> Oh, do you mean maintain different historic snapshots and then switch > >> based on the change we are processing? I guess the other thing we need > >> to consider is the order of processing the changes if we maintain > >> separate queues that need to be processed. > > > > I mean we will not specifically maintain the historic changes, but if > > there is any catalog change where we are pushing the snapshot to all > > the transaction's change queue, at the same time we will push this > > snapshot in the non-transactional sequence queue as well. I am not > > sure what is the problem with the ordering? > > Currently, we set up the historic snapshot before starting a transaction to process the change and then adapt the updates to it while processing the changes for the transaction. Now, while processing this new queue of non-transactional sequence messages, we probably need a separate snapshot and updates to it. So, either we need some sort of switching between snapshots or do it in different transactions. > > because we will be > > queueing all non-transactional sequence changes in a separate queue in > > the order they arrive and as soon as we process the next commit we > > will process all the non-transactional changes at that time. Do you > > see issue with that? > > > > Isn't this (in principle) the idea of queuing the non-transactional > changes and then applying them on the next commit? Yes, I didn't get > very far with that, but I got stuck exactly on tracking which snapshot > to use, so if there's a way to do that, that'd fix my issue. > > Also, would this mean we don't need to track the relfilenodes, if we're > able to query the catalog? Would we be able to check if the relfilenode > was created by the current xact? > I thought this new mechanism was for processing a queue of non-transactional sequence changes. The tracking of relfilenodes is to distinguish between transactional and non-transactional messages, so I think we probably still need that. -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Yes, if something like this happens, that'd be a problem: > > 1) decoding starts, with > > SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT > > 2) a transaction that creates a new relfilenode gets decoded, but we skip > it because we don't have the correct snapshot > > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT > > 4) we decode a sequence change from nextval() for the sequence > > This would lead to us attempting to apply a sequence change for a > relfilenode that's not visible yet (and may even get aborted). > > But can this even happen? Can we start decoding in the middle of a > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > which is also skipped until SNAPBUILD_FULL_SNAPSHOT? Or logical > messages, where we also call the output plugin in non-transactional cases? It's not a problem for logical messages, because whether the message is transactional or non-transactional is decided when the message itself is WAL-logged. But here our problem starts with deciding whether the change is transactional vs non-transactional, because if we insert the 'relfilenode' into the hash then a subsequent sequence change in the same transaction would be considered transactional, otherwise non-transactional. And XLOG_HEAP2_NEW_CID is just for changing the snapshot->curcid, which will only affect the catalog visibility of the upcoming operations in the same transaction; but that's not an issue, because if some of the changes of this transaction are seen when the snapbuild state is < SNAPBUILD_FULL_SNAPSHOT, then this transaction has to get committed before the state changes to SNAPBUILD_CONSISTENT_SNAPSHOT, i.e. the commit LSN of this transaction is going to be < start_decoding_at. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
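To make the decision rule concrete, here is a minimal standalone model (smgr_decode and seq_decode mirror the thread's names; everything else is assumed for illustration) showing how skipping the smgr record before SNAPBUILD_FULL_SNAPSHOT misclassifies a later change in the same transaction:

```
#include <stdbool.h>
#include <stdio.h>

typedef enum
{
	SNAPBUILD_BUILDING_SNAPSHOT,
	SNAPBUILD_FULL_SNAPSHOT,
	SNAPBUILD_CONSISTENT_SNAPSHOT
} SnapState;

#define MAX_RELFILENODES 16
static unsigned created[MAX_RELFILENODES];	/* relfilenodes created by the xact */
static int	ncreated = 0;

/* modeled smgr_decode(): remember relfilenodes created by the xact */
static void
smgr_decode(SnapState state, unsigned relfilenode)
{
	if (state < SNAPBUILD_FULL_SNAPSHOT)
		return;					/* record skipped: the hash never learns of it */
	if (ncreated < MAX_RELFILENODES)
		created[ncreated++] = relfilenode;
}

/* modeled seq_decode(): transactional iff the xact created the relfilenode */
static bool
seq_is_transactional(unsigned relfilenode)
{
	for (int i = 0; i < ncreated; i++)
		if (created[i] == relfilenode)
			return true;
	return false;
}

int
main(void)
{
	/* ALTER SEQUENCE creates relfilenode 1234 while still BUILDING ... */
	smgr_decode(SNAPBUILD_BUILDING_SNAPSHOT, 1234);
	/* ... the snapshot then reaches FULL, and setval()/nextval() is decoded */
	printf("sequence change treated as %s\n",
		   seq_is_transactional(1234) ? "transactional" : "non-transactional (wrong)");
	return 0;
}
```

Dropping the state check in the modeled smgr_decode() makes the classification come out right, which is one of the directions explored later in the thread.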
Hi, There's been a lot discussed over the past month or so, and it's become difficult to get a good idea of the current state - what issues remain to be solved, what's unrelated to this patch, and how to move it forward. Long-running threads tend to be confusing, so I had a short call with Amit yesterday to discuss the current state, and to make sure we're on the same page. I believe it was very helpful, and I've promised to post a short summary of the call - issues, what we agreed seems like a path forward, etc. Obviously, I might have misunderstood something, in which case Amit can correct me. And I'd certainly welcome opinions from others. In general, we discussed three areas - desirability of the feature, correctness and performance. I believe a brief summary of the agreement would be this: - desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...). - correctness: There's one possible correctness issue, when the snapshot changes to FULL between the record creating a sequence relfilenode and a record advancing that sequence. This needs to be verified/reproduced, and fixed. - performance issues: We've agreed the case with a lot of aborts (when DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've discussed whether the overhead with many sequence changes (nextval-40) is acceptable, and/or how to improve it. Next, I'll go over these points in more detail, with my understanding of what the challenges are, possible solutions etc. Most of this was discussed/agreed on the call, but some are ideas I had only after the call when writing this summary. 1) desirability of the feature Firstly, do we actually want/need this feature? I believe that's very much a question of what use cases we're targeting. If we only focus on distributed databases (particularly those with multiple active nodes), then we probably agree that the right solution is to not use sequences (~generators of incrementing values) but UUIDs or similar random identifiers (better not call them sequences, there's not much sequential about them). The huge advantage is that this does not require replicating any state between the nodes, so logical decoding can simply ignore them and replicate just the generated values. I don't think there's any argument about that. If I was building such a distributed system, I'd certainly use such random IDs. The question is what to do about the other use cases - online upgrades relying on logical decoding, failovers to logical replicas, and so on. Or what to do about existing systems that can't be easily changed to use different/random identifiers. Those are not really distributed systems and therefore don't quite need random IDs. Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can easily lead to massive write amplification, for example. There are variants like UUIDv7 reducing the impact, but there are other trade-offs too. My takeaway from this is there's still value in having this feature. 2) correctness The only correctness issue I'm aware of is the question of what happens when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding the relfilenode creation and the sequence increment, pointed out by Dilip in [1].
If this happens (and while I don't have a reproducer, I also don't have a very clear idea why it couldn't happen), it breaks how the patch decides between transactional and non-transactional sequence changes. So this seems like a fatal flaw - it definitely needs to be solved. I don't have a good idea how to do that, unfortunately. The problem is the dependency on an earlier record, and that this needs to be evaluated immediately (in the decode phase). Logical messages don't have the same issue because the "transactional" flag does not depend on earlier stuff, and other records are not interpreted until apply/commit, when we know everything relevant was decoded. I don't know what the solution is. Either we find a way to make sure not to lose/skip the smgr record, or we need to rethink how we determine the transactional flag (perhaps even try again adding it to the WAL record, but we didn't find a way to do that earlier). 3) performance issues We have discussed two cases - "ddl-abort" and "nextval-40". The "ddl-abort" is when the workload does a lot of DDL and then aborts them, leading to profiles dominated by DecodeCommit. The agreement here is that while this is a valid issue and we should try fixing it, it's unrelated to this patch. The issue exists even on master. So in the context of this patch we can ignore this issue. The "nextval-40" applies to workloads doing a lot of regular sequence changes. We only decode/apply changes written to WAL, and that happens only for every 32 increments or so. The test was with a very simple transaction (just a sequence advanced enough to write WAL + a 1-row insert), which means it's pretty much a worst-case impact. For larger transactions, it's going to be hardly measurable. Also, this only measured decoding, not apply (which will also make this less significant). Most of the overhead comes from ReorderBufferQueueSequence() starting and then aborting a transaction, per the profile in [2]. This only happens in the non-transactional case, but we expect that to be the common case in regular workloads. Anyway, let's say we want to mitigate this overhead. I think there are three ways to do that: a) find a way to not have to apply sequence changes immediately, but queue them until the next commit This would give a chance to combine multiple sequence changes into a single "replay change", reducing the overhead. There's a couple of problems with this, though. Firstly, it can't help OLTP workloads, because the transactions are short, so sequence changes are unlikely to combine. It's also not clear how expensive this would be - could it be expensive enough to outweigh the benefits? All of this is assuming it can be implemented; we don't have such a patch yet. I was speculating about something like this earlier, but I haven't managed to make that work. Doesn't mean it's impossible, ofc. b) provide a way for the output plugin to skip sequence decoding early The way the decoding is coded now, ReorderBufferQueueSequence does all the expensive dance even if the output plugin does not implement the sequence callbacks. Maybe we should have a way to allow skipping all of this early, right at the beginning of ReorderBufferQueueSequence (and thus before we even try to start/abort the transaction). Ofc, this is not a perfect solution either - it won't help workloads that actually need/want sequence decoding but where the decoding has significant overhead, or plugins that choose to support decoding sequences in general.
For example the built-in output plugin would certainly support sequences - and the overhead would still be there (even if no sequences are added to the publication). c) instruct people to increase the sequence cache from 32 to 1024 This would reduce the number of WAL messages that need to be decoded and replayed, reducing the overhead proportionally. Of course, this also means the sequence will "jump forward" more in case of a crash or failover to the logical replica, but I think that's an acceptable tradeoff. People should not expect sequences to be gap-less anyway. Considering nextval-40 is pretty much worst-case behavior, I think this might actually be an acceptable solution/workaround. regards [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com [2] https://www.postgresql.org/message-id/0bc34f71-7745-dc16-d765-5ba1f0776a3f%40enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
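For a rough sense of option c), a tiny standalone calculation, assuming one sequence WAL record is written per cache of values handed out (32 being the default discussed here):

```
#include <stdio.h>

/* one WAL record per "cache" of values handed out */
static long
wal_records(long nextval_calls, long cache)
{
	return (nextval_calls + cache - 1) / cache;
}

int
main(void)
{
	long		calls = 1000000;

	printf("cache=32:   %ld sequence WAL records\n", wal_records(calls, 32));	/* 31250 */
	printf("cache=1024: %ld sequence WAL records\n", wal_records(calls, 1024));	/* 977 */
	return 0;
}
```

The per-record decoding overhead shrinks by the same ~32x factor, at the cost of larger jumps after a crash or failover, as noted above.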
On Thu, Dec 7, 2023 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > Yes, if something like this happens, that'd be a problem: > > > > 1) decoding starts, with > > > > SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT > > > > 2) transaction that creates a new refilenode gets decoded, but we skip > > it because we don't have the correct snapshot > > > > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT > > > > 4) we decode sequence change from nextval() for the sequence > > > > This would lead to us attempting to apply sequence change for a > > relfilenode that's not visible yet (and may even get aborted). > > > > But can this even happen? Can we start decoding in the middle of a > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical > > messages, where we also call the output plugin in non-transactional cases. > > It's not a problem for logical messages because whether the message is > transaction or non-transactional is decided while WAL logs the message > itself. But here our problem starts with deciding whether the change > is transactional vs non-transactional, because if we insert the > 'relfilenode' in hash then the subsequent sequence change in the same > transaction would be considered transactional otherwise > non-transactional. > It is correct that we can make a wrong decision about whether a change is transactional or non-transactional when sequence DDL happens before the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens after that state. However, one thing to note here is that we won't try to stream such a change, because for non-transactional cases we don't proceed unless the snapshot is in a consistent state. Now, if the decision had been correct then we would probably have queued the sequence change and discarded it at commit. One thing where we deviate here is that for non-sequence transactional cases (including logical messages), we immediately start queuing the changes as soon as we reach the SNAPBUILD_FULL_SNAPSHOT state (provided SnapBuildProcessChange() returns true, which is quite possible) and take the final decision at commit/prepare/abort time. However, that won't be the case for sequences, because of the dependency of determining transactional cases on one of the prior records. Now, I am not completely sure at this stage if such a deviation can cause any problem and/or whether we are okay to have such a deviation for sequences. -- With Regards, Amit Kapila.
Dear hackers,

> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state.

I found a workload which the decoder distinguishes wrongly.

# Prerequisite

Apply the attached patch for inspecting the sequence status. It can be applied atop the v20231203 patch set. Also, a table and a sequence must be defined:

```
CREATE TABLE foo (var int);
CREATE SEQUENCE s;
```

# Workload

Then, you can execute concurrent transactions from three clients like below:

Client-1

BEGIN;
INSERT INTO foo VALUES (1);

Client-2

SELECT pg_create_logical_replication_slot('slot', 'test_decoding');

Client-3

BEGIN;
ALTER SEQUENCE s MAXVALUE 5000;

Client-1

COMMIT;

Client-3

SAVEPOINT s1;
SELECT setval('s', 2000);
ROLLBACK;

SELECT pg_logical_slot_get_changes('slot', 'test_decoding');

# Result and analysis

At first, the lines below are output to the log. This means that the WAL records for ALTER SEQUENCE were decoded but skipped because the snapshot was still being built.

```
...
LOG: logical decoding found initial starting point at 0/154D238
DETAIL: Waiting for transactions (approximately 1) older than 741 to end.
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: skipped
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: skipped
...
```

Note that the `seq_decode` line above was not emitted via `setval()`; it came from the ALTER SEQUENCE statement. Below is the call stack for inserting that WAL record.

```
XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
fill_seq_fork_with_data
fill_seq_with_data
AlterSequence
```

Then, the subsequent lines look like the following. This means that the snapshot has become FULL and `setval()` is wrongly regarded as non-transactional.

```
LOG: logical decoding found initial consistent point at 0/154D658
DETAIL: Waiting for transactions (approximately 1) older than 742 to end.
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: the sequence is non-transactional
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: not consistent: skipped
```

The change is then discarded because the snapshot has not become CONSISTENT yet, per the code below. If the change had been considered transactional, we would have queued it, though the transaction would then be skipped at commit.

```
else if (!transactional &&
         (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
          SnapBuildXactNeedsSkip(builder, buf->origptr)))
    return;
```

But anyway, we have found a case where we make a wrong decision. This example is lucky - it does not produce wrong output - but I'm not sure all cases are like that.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > But can this even happen? Can we start decoding in the middle of a > > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical > > > messages, where we also call the output plugin in non-transactional cases. > > > > It's not a problem for logical messages because whether the message is > > transaction or non-transactional is decided while WAL logs the message > > itself. But here our problem starts with deciding whether the change > > is transactional vs non-transactional, because if we insert the > > 'relfilenode' in hash then the subsequent sequence change in the same > > transaction would be considered transactional otherwise > > non-transactional. > > > > It is correct that we can make a wrong decision about whether a change > is transactional or non-transactional when sequence DDL happens before > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > after that state. However, one thing to note here is that we won't try > to stream such a change because for non-transactional cases we don't > proceed unless the snapshot is in a consistent state. Now, if the > decision had been correct then we would probably have queued the > sequence change and discarded at commit. > > One thing that we deviate here is that for non-sequence transactional > cases (including logical messages), we immediately start queuing the > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > SnapBuildProcessChange() returns true which is quite possible) and > take final decision at commit/prepare/abort time. However, that won't > be the case for sequences because of the dependency of determining > transactional cases on one of the prior records. Now, I am not > completely sure at this stage if such a deviation can cause any > problem and or whether we are okay to have such a deviation for > sequences. Okay, so this particular scenario that I raised is somehow safe. I mean, although we are considering a transactional sequence operation as non-transactional, we also know that if some of the changes for a transaction are skipped because the snapshot was not FULL, that means the transaction cannot be streamed, because that transaction has to be committed before the snapshot becomes CONSISTENT (based on the snapshot state change machinery). Ideally, based on the same logic that the snapshot is not consistent, the non-transactional sequence changes are also skipped. But the only thing that makes me a bit uncomfortable is that even though the result is not wrong, we have made some wrong intermediate decisions, i.e. considered a transactional change as non-transactional. One solution to this issue is that, even if the snapshot state does not reach FULL, just add the sequence relids to the hash; I mean, that hash is only maintained for deciding whether the sequence is changed in that transaction or not. So not adding such relids to the hash seems like the root cause of the issue. Honestly, I haven't analyzed this idea in detail about how easy it would be to add only these changes to the hash and what the other dependencies are, but this seems like a worthwhile direction IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > It is correct that we can make a wrong decision about whether a change > > is transactional or non-transactional when sequence DDL happens before > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > after that state. However, one thing to note here is that we won't try > > to stream such a change because for non-transactional cases we don't > > proceed unless the snapshot is in a consistent state. Now, if the > > decision had been correct then we would probably have queued the > > sequence change and discarded at commit. > > > > One thing that we deviate here is that for non-sequence transactional > > cases (including logical messages), we immediately start queuing the > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > SnapBuildProcessChange() returns true which is quite possible) and > > take final decision at commit/prepare/abort time. However, that won't > > be the case for sequences because of the dependency of determining > > transactional cases on one of the prior records. Now, I am not > > completely sure at this stage if such a deviation can cause any > > problem and or whether we are okay to have such a deviation for > > sequences. > > Okay, so this particular scenario that I raised is somehow saved, I > mean although we are considering transactional sequence operation as > non-transactional we also know that if some of the changes for a > transaction are skipped because the snapshot was not FULL that means > that transaction can not be streamed because that transaction has to > be committed before snapshot become CONSISTENT (based on the snapshot > state change machinery). Ideally based on the same logic that the > snapshot is not consistent the non-transactional sequence changes are > also skipped. But the only thing that makes me a bit uncomfortable is > that even though the result is not wrong we have made some wrong > intermediate decisions i.e. considered transactional change as > non-transactions. > > One solution to this issue is that, even if the snapshot state does > not reach FULL just add the sequence relids to the hash, I mean that > hash is only maintained for deciding whether the sequence is changed > in that transaction or not. So no adding such relids to hash seems > like a root cause of the issue. Honestly, I haven't analyzed this > idea in detail about how easy it would be to add only these changes to > the hash and what are the other dependencies, but this seems like a > worthwhile direction IMHO. I also thought about the same solution. I tried this solution as the attached patch on top of Hayato's diagnostic changes. Following log messages are seen in server error log. Those indicate that the sequence change was correctly deemed as a transactional change (line 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional). 2023-12-14 12:12:50.550 IST [321229] ERROR: relation "pg_replication_slot" does not exist at character 15 2023-12-14 12:12:50.550 IST [321229] STATEMENT: select * from pg_replication_slot; 2023-12-14 12:12:57.289 IST [321229] LOG: logical decoding found initial starting point at 0/1598D50 2023-12-14 12:12:57.289 IST [321229] DETAIL: Waiting for transactions (approximately 1) older than 759 to end. 2023-12-14 12:12:57.289 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: smgr_decode. 
snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: skipped 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.552 IST [321229] LOG: logical decoding found initial consistent point at 0/1599170 2023-12-14 12:13:49.552 IST [321229] DETAIL: Waiting for transactions (approximately 1) older than 760 to end. 2023-12-14 12:13:49.552 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT 2023-12-14 12:14:55.591 IST [321230] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional 2023-12-14 12:14:55.591 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.813 IST [321229] LOG: logical decoding found consistent point at 0/15992E8 2023-12-14 12:14:55.813 IST [321229] DETAIL: There are no running transactions. 2023-12-14 12:14:55.813 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); It looks like the solution works. But this is the only place where we process a change before the SNAPSHOT reaches FULL. But this is also the only record which affects a decision to queue/not a following change. So it should be ok. The sequence_hash'es are separate for each transaction and they are cleaned when processing the COMMIT record. So I think we don't have any side effects of adding a relfilenode to the sequence hash even though the snapshot is not FULL. As a side note: 1. The prologue of ReorderBufferSequenceCleanup() mentions only abort, but this function will be called for COMMIT as well. The prologue needs to be fixed. 2. Now that sequence hashes are per transaction, do we need ReorderBufferTXN in ReorderBufferSequenceEnt? -- Best Wishes, Ashutosh Bapat
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > It is correct that we can make a wrong decision about whether a change > > > is transactional or non-transactional when sequence DDL happens before > > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > > after that state. However, one thing to note here is that we won't try > > > to stream such a change because for non-transactional cases we don't > > > proceed unless the snapshot is in a consistent state. Now, if the > > > decision had been correct then we would probably have queued the > > > sequence change and discarded at commit. > > > > > > One thing that we deviate here is that for non-sequence transactional > > > cases (including logical messages), we immediately start queuing the > > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > > SnapBuildProcessChange() returns true which is quite possible) and > > > take final decision at commit/prepare/abort time. However, that won't > > > be the case for sequences because of the dependency of determining > > > transactional cases on one of the prior records. Now, I am not > > > completely sure at this stage if such a deviation can cause any > > > problem and or whether we are okay to have such a deviation for > > > sequences. > > > > Okay, so this particular scenario that I raised is somehow saved, I > > mean although we are considering transactional sequence operation as > > non-transactional we also know that if some of the changes for a > > transaction are skipped because the snapshot was not FULL that means > > that transaction can not be streamed because that transaction has to > > be committed before snapshot become CONSISTENT (based on the snapshot > > state change machinery). Ideally based on the same logic that the > > snapshot is not consistent the non-transactional sequence changes are > > also skipped. But the only thing that makes me a bit uncomfortable is > > that even though the result is not wrong we have made some wrong > > intermediate decisions i.e. considered transactional change as > > non-transactions. > > > > One solution to this issue is that, even if the snapshot state does > > not reach FULL just add the sequence relids to the hash, I mean that > > hash is only maintained for deciding whether the sequence is changed > > in that transaction or not. So no adding such relids to hash seems > > like a root cause of the issue. Honestly, I haven't analyzed this > > idea in detail about how easy it would be to add only these changes to > > the hash and what are the other dependencies, but this seems like a > > worthwhile direction IMHO. > > I also thought about the same solution. I tried this solution as the > attached patch on top of Hayato's diagnostic changes. I think you forgot to attach the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > It is correct that we can make a wrong decision about whether a change > > > is transactional or non-transactional when sequence DDL happens before > > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > > after that state. However, one thing to note here is that we won't try > > > to stream such a change because for non-transactional cases we don't > > > proceed unless the snapshot is in a consistent state. Now, if the > > > decision had been correct then we would probably have queued the > > > sequence change and discarded at commit. > > > > > > One thing that we deviate here is that for non-sequence transactional > > > cases (including logical messages), we immediately start queuing the > > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > > SnapBuildProcessChange() returns true which is quite possible) and > > > take final decision at commit/prepare/abort time. However, that won't > > > be the case for sequences because of the dependency of determining > > > transactional cases on one of the prior records. Now, I am not > > > completely sure at this stage if such a deviation can cause any > > > problem and or whether we are okay to have such a deviation for > > > sequences. > > > > Okay, so this particular scenario that I raised is somehow saved, I > > mean although we are considering transactional sequence operation as > > non-transactional we also know that if some of the changes for a > > transaction are skipped because the snapshot was not FULL that means > > that transaction can not be streamed because that transaction has to > > be committed before snapshot become CONSISTENT (based on the snapshot > > state change machinery). Ideally based on the same logic that the > > snapshot is not consistent the non-transactional sequence changes are > > also skipped. But the only thing that makes me a bit uncomfortable is > > that even though the result is not wrong we have made some wrong > > intermediate decisions i.e. considered transactional change as > > non-transactions. > > > > One solution to this issue is that, even if the snapshot state does > > not reach FULL just add the sequence relids to the hash, I mean that > > hash is only maintained for deciding whether the sequence is changed > > in that transaction or not. So no adding such relids to hash seems > > like a root cause of the issue. Honestly, I haven't analyzed this > > idea in detail about how easy it would be to add only these changes to > > the hash and what are the other dependencies, but this seems like a > > worthwhile direction IMHO. > > ... > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es as separate for each > transaction and they are cleaned when processing COMMIT record. But it is possible that even commit or abort also happens before the snapshot reaches the full state, in which case the hash table will have stale or invalid (for aborts) entries. That will probably be cleaned at a later point by running_xact records. Now, I think in theory, it is possible that the same RelFileLocator can again be allocated before we clean up the existing entry, which can probably confuse the system. It might or might not be a problem in practice, but I think the more assumptions we add for sequences, the more difficult it will become to ensure its correctness. -- With Regards, Amit Kapila.
On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I think you forgot to attach the patch. Sorry. Here it is. On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es as separate for each > transaction and they are cleaned when processing COMMIT record. > > > > But it is possible that even commit or abort also happens before the > snapshot reaches full state in which case the hash table will have > stale or invalid (for aborts) entries. That will probably be cleaned > at a later point by running_xact records. Why would cleaning wait till running_xact records? Won't the txn entry itself be removed when processing the commit/abort record? At the same time the sequence hash will be cleaned as well. > Now, I think in theory, it > is possible that the same RelFileLocator can again be allocated before > we clean up the existing entry which can probably confuse the system. How? The transaction allocating the first time would be cleaned before it happens the second time. So it shouldn't matter. -- Best Wishes, Ashutosh Bapat
On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I think you forgot to attach the patch. > > Sorry. Here it is. > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > It looks like the solution works. But this is the only place where we > > process a change before SNAPSHOT reaches FULL. But this is also the > > only record which affects a decision to queue/not a following change. > > So it should be ok. The sequence_hash'es as separate for each > > transaction and they are cleaned when processing COMMIT record. > > > > > > > But it is possible that even commit or abort also happens before the > > snapshot reaches full state in which case the hash table will have > > stale or invalid (for aborts) entries. That will probably be cleaned > > at a later point by running_xact records. > > Why would cleaning wait till running_xact records? Won't txn entry > itself be removed when processing commit/abort record? At the same the > sequence hash will be cleaned as well. > > > Now, I think in theory, it > > is possible that the same RelFileLocator can again be allocated before > > we clean up the existing entry which can probably confuse the system. > > How? The transaction allocating the first time would be cleaned before > it happens the second time. So shouldn't matter. > It can only be cleaned if we process it, but xact_decode won't allow us to process it, and I don't think it would be a good idea to add another hack for sequences here. See the code below:

xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
    SnapBuild  *builder = ctx->snapshot_builder;
    ReorderBuffer *reorder = ctx->reorder;
    XLogReaderState *r = buf->record;
    uint8       info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;

    /*
     * If the snapshot isn't yet fully built, we cannot decode anything, so
     * bail out.
     */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
        return;

-- With Regards, Amit Kapila.
On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I think you forgot to attach the patch. > > > > Sorry. Here it is. > > > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > It looks like the solution works. But this is the only place where we > > > process a change before SNAPSHOT reaches FULL. But this is also the > > > only record which affects a decision to queue/not a following change. > > > So it should be ok. The sequence_hash'es as separate for each > > > transaction and they are cleaned when processing COMMIT record. > > > > > > > > > > But it is possible that even commit or abort also happens before the > > > snapshot reaches full state in which case the hash table will have > > > stale or invalid (for aborts) entries. That will probably be cleaned > > > at a later point by running_xact records. > > > > Why would cleaning wait till running_xact records? Won't txn entry > > itself be removed when processing commit/abort record? At the same the > > sequence hash will be cleaned as well. > > > > > Now, I think in theory, it > > > is possible that the same RelFileLocator can again be allocated before > > > we clean up the existing entry which can probably confuse the system. > > > > How? The transaction allocating the first time would be cleaned before > > it happens the second time. So shouldn't matter. > > > > It can only be cleaned if we process it but xact_decode won't allow us > to process it and I don't think it would be a good idea to add another > hack for sequences here. See below code: > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > { > SnapBuild *builder = ctx->snapshot_builder; > ReorderBuffer *reorder = ctx->reorder; > XLogReaderState *r = buf->record; > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > /* > * If the snapshot isn't yet fully built, we cannot decode anything, so > * bail out. > */ > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > return; That may be true for a transaction which is decoded, but I think all the transactions which are added to ReorderBuffer should be cleaned up once they have been processed irrespective of whether they are decoded/sent downstream or not. In this case I see the sequence hash being cleaned up for the sequence related transaction in Hayato's reproducer. See attached patch with a diagnostic change and the output below (notice sequence cleanup called on transaction 767). 2023-12-14 21:06:36.756 IST [386957] LOG: logical decoding found initial starting point at 0/15B2F68 2023-12-14 21:06:36.756 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 767 to end. 2023-12-14 21:06:36.756 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: seq_decode. 
snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: skipped 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.710 IST [386957] LOG: logical decoding found initial consistent point at 0/15B3388 2023-12-14 21:07:05.710 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 768 to end. 2023-12-14 21:07:05.710 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:39.292 IST [386298] LOG: checkpoint starting: time 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: the sequence is transactional 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: sequence cleanup called on transaction 767 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: logical decoding found consistent point at 0/15B3518 2023-12-14 21:07:40.919 IST [386957] DETAIL: There are no running transactions. 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); We see similar output when pg_logical_slot_get_changes() is called. I haven't found the code path from where the sequence cleanup gets called. But it's being called. Am I missing something? -- Best Wishes, Ashutosh Bapat
On Thu, Dec 14, 2023, at 12:44 PM, Ashutosh Bapat wrote:
> I haven't found the code path from where the sequence cleanup gets
> called. But it's being called. Am I missing something?

ReorderBufferCleanupTXN.
On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > It can only be cleaned if we process it but xact_decode won't allow us > > to process it and I don't think it would be a good idea to add another > > hack for sequences here. See below code: > > > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > > { > > SnapBuild *builder = ctx->snapshot_builder; > > ReorderBuffer *reorder = ctx->reorder; > > XLogReaderState *r = buf->record; > > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > > > /* > > * If the snapshot isn't yet fully built, we cannot decode anything, so > > * bail out. > > */ > > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > > return; > > That may be true for a transaction which is decoded, but I think all > the transactions which are added to ReorderBuffer should be cleaned up > once they have been processed irrespective of whether they are > decoded/sent downstream or not. In this case I see the sequence hash > being cleaned up for the sequence related transaction in Hayato's > reproducer. > It was because the test you are using was not designed to show the problem I mentioned. In this case, the rollback was after a full snapshot state was reached. -- With Regards, Amit Kapila.
Hi,

I wanted to hop in here on one particular issue:

> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> better solution for distributed (esp. active-active) systems. But there
> are important use cases that are likely to keep using regular sequences
> (online upgrades of single-node instances, existing systems, ...).

+1.

Right now, the lack of sequence replication is a rather large foot-gun on logical replication upgrades. Copying the sequences over during the cutover period is doable, of course, but:

(a) There's no out-of-the-box tooling that does it, so everyone has to write some scripts just for that one function.

(b) It's one more thing that extends the cutover window.

I don't think it is a good idea to make it mandatory: for example, there's a strong use case for replicating a table but not a sequence associated with it. But it's definitely a missing feature in logical replication.
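(As an aside, the ad-hoc scripts mentioned in (a) typically boil down to something like the sketch below - run on the publisher, with the generated statements executed on the subscriber during cutover. This assumes the pg_sequences view, skips never-called sequences, and a production script would need to handle permissions and other edge cases:)

    -- Generate setval() calls for every sequence with a known value;
    -- illustrative only.
    SELECT format('SELECT setval(%L, %s, true);',
                  quote_ident(schemaname) || '.' || quote_ident(sequencename),
                  last_value)
      FROM pg_sequences
     WHERE last_value IS NOT NULL;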
On 12/19/23 13:54, Christophe Pettus wrote:
> Hi,
>
> I wanted to hop in here on one particular issue:
>
>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>> better solution for distributed (esp. active-active) systems. But there
>> are important use cases that are likely to keep using regular sequences
>> (online upgrades of single-node instances, existing systems, ...).
>
> +1.
>
> Right now, the lack of sequence replication is a rather large
> foot-gun on logical replication upgrades. Copying the sequences
> over during the cutover period is doable, of course, but:
>
> (a) There's no out-of-the-box tooling that does it, so everyone has
> to write some scripts just for that one function.
>
> (b) It's one more thing that extends the cutover window.
>

I agree it's an annoying gap for this use case. But if this is the only use case, maybe a better solution would be to provide such tooling instead of adding it to the logical decoding?

It might seem a bit strange if most data is copied by replication directly, while sequences need special handling, ofc.

> I don't think it is a good idea to make it mandatory: for example,
> there's a strong use case for replicating a table but not a sequence
> associated with it. But it's definitely a missing feature in
> logical replication.

I don't think the plan was to make replication of sequences mandatory, certainly not with the built-in replication. If you don't add sequences to the publication, the sequence changes will be skipped.

But it still needs to be part of the decoding, which adds overhead for all logical decoding uses, even if the sequence changes end up being discarded. That's somewhat annoying, especially considering sequences are a fairly common part of the WAL stream.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/15/23 03:33, Amit Kapila wrote:
> On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> It can only be cleaned if we process it but xact_decode won't allow us
>>> to process it and I don't think it would be a good idea to add another
>>> hack for sequences here. See below code:
>>>
>>> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>>> {
>>>     SnapBuild  *builder = ctx->snapshot_builder;
>>>     ReorderBuffer *reorder = ctx->reorder;
>>>     XLogReaderState *r = buf->record;
>>>     uint8       info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
>>>
>>>     /*
>>>      * If the snapshot isn't yet fully built, we cannot decode anything, so
>>>      * bail out.
>>>      */
>>>     if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
>>>         return;
>>
>> That may be true for a transaction which is decoded, but I think all
>> the transactions which are added to ReorderBuffer should be cleaned up
>> once they have been processed irrespective of whether they are
>> decoded/sent downstream or not. In this case I see the sequence hash
>> being cleaned up for the sequence related transaction in Hayato's
>> reproducer.
>>
>
> It was because the test you are using was not designed to show the
> problem I mentioned. In this case, the rollback was after a full
> snapshot state was reached.
>

Right, I haven't tried to reproduce this, but it very much looks like the entry would not be removed if the xact aborts/commits before the snapshot reaches FULL state.

I suppose one way to deal with this would be to first check if an entry for the same relfilenode exists. If it does, the original transaction must have terminated, but we haven't cleaned it up yet - in which case we can just "move" the relfilenode to the new one.

However, can't that happen even with full snapshots? I mean, let's say a transaction creates a relfilenode and terminates without writing an abort record (surely that's possible, right?). And then another xact comes and generates the same relfilenode (presumably that's unlikely, but perhaps possible?). Aren't we in pretty much the same situation, until the next RUNNING_XACTS cleans up the hash table?

I think tracking all relfilenodes would fix the original issue (with treating some changes as transactional), and the tweak that "moves" the relfilenode to the new xact would fix this other issue too.

That being said, I feel a bit uneasy about it, for similar reasons as Amit. If we start processing records before the full snapshot, that seems like moving the assumptions a bit. For example it means we'd create ReorderBufferTXN entries for cases that we'd have skipped before. OTOH this is (or should be) only a very temporary period while starting the replication, I believe.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

Here's a new version of this patch series. It rebases the 2023/12/03 version, and there are a couple of improvements to address the performance and correctness questions.

Since the 2023/12/03 version was posted, there were a couple of off-list discussions with several people - with Amit, as mentioned in [1], and then also internally and at pgconf.eu. My personal (very brief) takeaway from these discussions is this:

1) desirability: We want a built-in way to handle sequences in logical replication. I think everyone agrees this is not a way to do distributed sequences in active-active setups, but that there are other use cases that need this feature - typically upgrades / logical failover.

Multiple approaches were discussed (support in logical replication or a separate tool to be executed on the logical replica). Both might work; people usually end up with some sort of custom tool anyway. But it's cumbersome, and the consensus seems to be that the logical rep feature is better.

2) performance: There was concern about the performance impact, and that it affects everyone, including those who don't replicate sequences (as the overhead is mostly incurred before calls to output plugin etc.). I do agree with this, but I don't think sequences can be decoded in a much cheaper way.

There was a proposal [2] that maybe we could batch the non-transactional sequence changes in the "next" transaction, and distribute them similarly to how SnapBuildDistributeNewCatalogSnapshot() distributes catalog snapshots. But I doubt that'd actually work. Or more precisely - if we can make the code work, I think it would not solve the issue for some common cases. Consider for example a case with many concurrent top-level transactions, making this quite expensive. And I'd bet sequence changes are far more common than catalog changes.

However, I think we ultimately agreed that the overhead is acceptable if it only applies to use cases that actually need to decode sequences. So if there was a way to skip sequence decoding when not necessary, that would work. Unfortunately, that can't be based on simply checking which callbacks are defined by the output plugin, because e.g. pgoutput needs to handle both cases (so the callbacks need to be defined). Nor can it be determined based on what's included in the publication (as that's not available that early).

The agreement was that the best way is to have a CREATE SUBSCRIPTION option that would instruct the upstream to decode sequences. By default this option is 'off' (because that's the no-overhead case), but it can be enabled for each subscription. This is what 0005 implements, and interestingly enough, this is what an earlier version [3] from 2023/04/02 did. (There's a usage sketch after the attachment list below.)

This means that if you add a sequence to the publication, but leave "sequences=off" in CREATE SUBSCRIPTION, the sequence won't be replicated after all. That may seem a bit surprising, and I don't like it, but I don't think there's a better way to do this.

3) correctness: The last point is about making the "transactional" flag correct when the snapshot state changes mid-transaction, originally pointed out by Dilip [4]. Per [5] this however happens to work correctly, because while we identify the change as 'non-transactional' (which is incorrect), we immediately throw it away (so we don't try to apply it, which would error-out).

One option would be to document/describe this in the comments, per 0006. This means that when ReorderBufferSequenceIsTransactional() returns true, it's correct. But if it returns 'false', it means 'maybe'.

I agree it seems a bit strange, but with the extra comments I think it's OK. It simply means that if we get transactional=false incorrectly, we're guaranteed to not process it. Maybe we could rename the function to make this clear from the name.

The other solution proposed in the thread [6] was to always decode the relfilenode, and add it to the hash table. 0007 does this, and it works. But I agree this seems possibly worse than 0006 - it means we may be adding entries to the hash table, and it's not clear when exactly we'll clean them up etc. It'd be the only place processing stuff before the snapshot reaches FULL.

I personally would go with 0006, i.e. just explaining why doing it this way is correct.

regards

[1] https://www.postgresql.org/message-id/12822961-b7de-9d59-dd27-2e3dc3980c7e%40enterprisedb.com
[2] https://www.postgresql.org/message-id/CAFiTN-vm3-bGfm-uJdzRLERMHozW8xjZHu4rdmtWR-rP-SJYMQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/1f96b282-cb90-8302-cee8-7b3f5576a31c%40enterprisedb.com
[4] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAA4eK1LFise9iN%2BNN%3Dagrk4prR1qD%2BebvzNjKAWUog2%2Bhy3HxQ%40mail.gmail.com
[6] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v20240111-0001-Logical-decoding-of-sequences.patch
- v20240111-0002-Add-decoding-of-sequences-to-test_decoding.patch
- v20240111-0003-Add-decoding-of-sequences-to-built-in-repl.patch
- v20240111-0004-global-hash-table-of-sequence-relfilenodes.patch
- v20240111-0005-CREATE-SUBSCRIPTION-flag-to-enable-sequenc.patch
- v20240111-0006-add-comment-about-SNAPBUILD_FULL.patch
- v20240111-0007-decode-all-sequence-relfilenodes.patch
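For illustration, here is roughly how the subscription option from 0005 is meant to be used. The option name follows the patch discussion above, and the sequence-publication syntax follows the patch series rather than released PostgreSQL, so treat the exact spelling as tentative:

    -- On the publisher: include the sequence in a publication.
    CREATE PUBLICATION seq_pub FOR TABLE t, SEQUENCE t_id_seq;

    -- On the subscriber: sequence decoding is off by default (the
    -- no-overhead case) and has to be enabled per subscription.
    CREATE SUBSCRIPTION seq_sub
        CONNECTION 'host=publisher dbname=test'
        PUBLICATION seq_pub
        WITH (sequences = on);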
On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> 1) desirability: We want a built-in way to handle sequences in logical
> replication. I think everyone agrees this is not a way to do distributed
> sequences in active-active setups, but that there are other use cases
> that need this feature - typically upgrades / logical failover.

Yeah. I find it extremely hard to take seriously the idea that this isn't a valuable feature. How else are you supposed to do a logical failover without having your entire application break?

> 2) performance: There was concern about the performance impact, and that
> it affects everyone, including those who don't replicate sequences (as
> the overhead is mostly incurred before calls to output plugin etc.).
>
> The agreement was that the best way is to have a CREATE SUBSCRIPTION
> option that would instruct the upstream to decode sequences. By default
> this option is 'off' (because that's the no-overhead case), but it can
> be enabled for each subscription.

Seems reasonable, at least unless and until we come up with something better.

> 3) correctness: The last point is about making the "transactional" flag
> correct when the snapshot state changes mid-transaction, originally
> pointed out by Dilip [4]. Per [5] this however happens to work
> correctly, because while we identify the change as 'non-transactional'
> (which is incorrect), we immediately throw it away (so we don't try to
> apply it, which would error-out).

I've said this before, but I still find this really scary. It's unclear to me that we can simply classify updates as transactional or non-transactional and expect things to work. If it's possible, I hope we have a really good explanation somewhere of how and why it's possible. If we do, can somebody point me to it so I can read it?

To be possibly slightly more clear about my concern, I think the scary case is where we have transactional and non-transactional things happening to the same sequence in close temporal proximity, either within the same session or across two or more sessions. If a non-transactional change can get reordered ahead of some transactional change upon which it logically depends, or behind some transactional change that logically depends on it, then we have trouble. I also wonder if there are any cases where the same operation is partly transactional and partly non-transactional.

-- Robert Haas EDB: http://www.enterprisedb.com
On 1/23/24 21:47, Robert Haas wrote:
> On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> 1) desirability: We want a built-in way to handle sequences in logical
>> replication. I think everyone agrees this is not a way to do distributed
>> sequences in active-active setups, but that there are other use cases
>> that need this feature - typically upgrades / logical failover.
>
> Yeah. I find it extremely hard to take seriously the idea that this
> isn't a valuable feature. How else are you supposed to do a logical
> failover without having your entire application break?
>
>> 2) performance: There was concern about the performance impact, and that
>> it affects everyone, including those who don't replicate sequences (as
>> the overhead is mostly incurred before calls to output plugin etc.).
>>
>> The agreement was that the best way is to have a CREATE SUBSCRIPTION
>> option that would instruct the upstream to decode sequences. By default
>> this option is 'off' (because that's the no-overhead case), but it can
>> be enabled for each subscription.
>
> Seems reasonable, at least unless and until we come up with something better.
>
>> 3) correctness: The last point is about making the "transactional" flag
>> correct when the snapshot state changes mid-transaction, originally
>> pointed out by Dilip [4]. Per [5] this however happens to work
>> correctly, because while we identify the change as 'non-transactional'
>> (which is incorrect), we immediately throw it away (so we don't try to
>> apply it, which would error-out).
>
> I've said this before, but I still find this really scary. It's
> unclear to me that we can simply classify updates as transactional or
> non-transactional and expect things to work. If it's possible, I hope
> we have a really good explanation somewhere of how and why it's
> possible. If we do, can somebody point me to it so I can read it?
>

I did try to explain how this works (and why) in a couple places:

1) the commit message
2) reorderbuffer header comment
3) ReorderBufferSequenceIsTransactional comment (and nearby)

It's possible this does not meet your expectations, ofc. Maybe there should be a separate README for this - I haven't found anything like that for logical decoding in general, which is why I did (1)-(3).

> To be possibly slightly more clear about my concern, I think the scary
> case is where we have transactional and non-transactional things
> happening to the same sequence in close temporal proximity, either
> within the same session or across two or more sessions. If a
> non-transactional change can get reordered ahead of some transactional
> change upon which it logically depends, or behind some transactional
> change that logically depends on it, then we have trouble. I also
> wonder if there are any cases where the same operation is partly
> transactional and partly non-transactional.
>

I certainly understand this concern, and to some extent I even share it. Having to differentiate between transactional and non-transactional changes certainly confused me more than once. It's especially confusing, because the decoding implicitly changes the perceived ordering/atomicity of the events.

That being said, I don't think it can get reordered the way you're concerned about. The "transactionality" is determined by the relfilenode change, so how could the reordering happen? We'd have to misidentify a change in either direction - and for a nontransactional->transactional change that's clearly not possible.
There has to be a new relfilenode in that xact. In the other direction (transactional->nontransactional), it can happen if we fail to decode the relfilenode record. Which is what we discussed earlier, but came to the conclusion that it actually works OK. Of course, there might be bugs. I spent quite a bit of effort reviewing and testing this, but there still might be something wrong. But I think that applies to any feature. What would be worse is some sort of thinko in the approach in general. I don't have a good answer to that, unfortunately - I think it works, but how would I know for sure? We explored multiple alternative approaches and all of them crashed and burned ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I did try to explain how this works (and why) in a couple places:
>
> 1) the commit message
> 2) reorderbuffer header comment
> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>
> It's possible this does not meet your expectations, ofc. Maybe there
> should be a separate README for this - I haven't found anything like
> that for logical decoding in general, which is why I did (1)-(3).

I read over these and I do think they answer a bunch of questions, but I don't think they answer all of the questions.

Suppose T1 creates a sequence and commits. Then T2 calls nextval(). Then T3 drops the sequence. According to the commit message, T2's change will be "replayed immediately after decoding". But it's essential to replay T2's change after we replay T1 and before we replay T3, and the comments don't explain why that's guaranteed.

The answer might be "locks". If we always replay a transaction immediately when we see its commit record, then in the example above we're fine, because the commit record for the transaction that creates the sequence must precede the nextval() call, since the sequence won't be visible until the transaction commits, and also because T1 holds a lock on it at that point sufficient to lock out nextval. And the nextval record must precede the point where T3 takes an exclusive lock on the sequence.

Note, however, that this chain of reasoning critically depends on us never delaying application of a transaction. If we might reach T1's commit record and say "hey, let's hold on to this for a bit and replay it after we've decoded some more," everything immediately breaks, unless we also delay application of T2's non-transactional update in such a way that it's still guaranteed to happen after T1. I wonder if this kind of situation would be a problem for a future parallel-apply feature. It wouldn't work, for example, to hand T1 and T3 off (in that order) to a separate apply process but handle T2's "non-transactional" message directly, because it might handle that message before the application of T1 got completed.

This also seems to depend on every transactional operation that might affect a future non-transactional operation holding a lock that would conflict with that non-transactional operation. For example, if ALTER SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence, then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what the semantics of that would be -- would T1's second nextval() see the sequence restart, or what? But if the effect of T1's second nextval does depend in some way on the ALTER SEQUENCE operation which precedes it in the WAL stream, then we might have some trouble here, because both nextvals precede the commit of T2. Fortunately, this sequence of events is foreclosed by locking.

But I did find one somewhat-similar case in which that's not so.

S1: create table withseq (a bigint generated always as identity);
S1: begin;
S2: select nextval('withseq_a_seq');
S1: alter table withseq set unlogged;
S2: select nextval('withseq_a_seq');

I think this is a bug in the code that supports owned sequences rather than a problem that this patch should have to do something about. When a sequence is flipped between logged and unlogged directly, we take a stronger lock than we do here when it's done in this indirect way.

Also, I'm not quite sure if it would pose a problem for sequence decoding anyway: it changes the relfilenode, but not the value. But this is the *kind* of problem that could make the approach unsafe: supposedly transactional changes being interleaved with supposedly non-transactional changes, in such a way that the non-transactional changes might get applied at the wrong time relative to the transactional changes.

-- Robert Haas EDB: http://www.enterprisedb.com
On 1/26/24 15:39, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I did try to explain how this works (and why) in a couple places:
>>
>> 1) the commit message
>> 2) reorderbuffer header comment
>> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>>
>> It's possible this does not meet your expectations, ofc. Maybe there
>> should be a separate README for this - I haven't found anything like
>> that for logical decoding in general, which is why I did (1)-(3).
>
> I read over these and I do think they answer a bunch of questions, but
> I don't think they answer all of the questions.
>
> Suppose T1 creates a sequence and commits. Then T2 calls nextval().
> Then T3 drops the sequence. According to the commit message, T2's
> change will be "replayed immediately after decoding". But it's
> essential to replay T2's change after we replay T1 and before we
> replay T3, and the comments don't explain why that's guaranteed.
>
> The answer might be "locks". If we always replay a transaction
> immediately when we see its commit record, then in the example above
> we're fine, because the commit record for the transaction that creates
> the sequence must precede the nextval() call, since the sequence won't
> be visible until the transaction commits, and also because T1 holds a
> lock on it at that point sufficient to lock out nextval. And the
> nextval record must precede the point where T3 takes an exclusive lock
> on the sequence.
>

Right, locks + apply in commit order gives us this guarantee (I can't think of a case where it wouldn't be the case).

> Note, however, that this chain of reasoning critically depends on us
> never delaying application of a transaction. If we might reach T1's
> commit record and say "hey, let's hold on to this for a bit and replay
> it after we've decoded some more," everything immediately breaks,
> unless we also delay application of T2's non-transactional update in
> such a way that it's still guaranteed to happen after T1. I wonder if
> this kind of situation would be a problem for a future parallel-apply
> feature. It wouldn't work, for example, to hand T1 and T3 off (in that
> order) to a separate apply process but handle T2's "non-transactional"
> message directly, because it might handle that message before the
> application of T1 got completed.
>

Doesn't the whole logical replication critically depend on the commit order? If you decide to arbitrarily reorder/delay the transactions, all kinds of really bad things can happen. That's a generic problem, it applies to all kinds of objects, not just sequences - a parallel apply would need to detect this sort of dependency (e.g. INSERT + DELETE of the same key), and do something about it.

Similar for sequences, where the important event is allocation of a new relfilenode. If anything, it's easier for sequences, because the relfilenode tracking gives us an explicit (and easy) way to detect these dependencies between transactions.

> This also seems to depend on every transactional operation that might
> affect a future non-transactional operation holding a lock that would
> conflict with that non-transactional operation. For example, if ALTER
> SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence,
> then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART
> WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what
> the semantics of that would be -- would T1's second nextval() see the
> sequence restart, or what? But if the effect of T1's second nextval
> does depend in some way on the ALTER SEQUENCE operation which precedes
> it in the WAL stream, then we might have some trouble here, because
> both nextvals precede the commit of T2. Fortunately, this sequence of
> events is foreclosed by locking.
>

I don't quite follow :-( AFAIK this theory hinges on not having the right lock, but I believe ALTER SEQUENCE does obtain the lock (at least in cases that assign a new relfilenode). Which means such reordering should not be possible, because nextval() in other transactions will then wait until commit. And all nextval() calls in the same transaction will be treated as transactional.

So I think this works OK. If something does not lock the sequence in a way that would prevent other xacts from doing nextval() on it, it's not a change that would change the relfilenode - and so it does not switch the sequence into a transactional mode.

> But I did find one somewhat-similar case in which that's not so.
>
> S1: create table withseq (a bigint generated always as identity);
> S1: begin;
> S2: select nextval('withseq_a_seq');
> S1: alter table withseq set unlogged;
> S2: select nextval('withseq_a_seq');
>
> I think this is a bug in the code that supports owned sequences rather
> than a problem that this patch should have to do something about. When
> a sequence is flipped between logged and unlogged directly, we take a
> stronger lock than we do here when it's done in this indirect way.

Yes, I think this is a bug in handling of owned sequences - from the moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions generate duplicate values (until S1 is committed, at which point the values generated in S2 get "forgotten").

It seems we end up updating both relfilenodes, which is clearly wrong.

Seems like a bug independent of the decoding, IMO.

> Also, I'm not quite sure if it would pose a problem for sequence
> decoding anyway: it changes the relfilenode, but not the value. But
> this is the *kind* of problem that could make the approach unsafe:
> supposedly transactional changes being interleaved with supposedly
> non-transactional changes, in such a way that the non-transactional
> changes might get applied at the wrong time relative to the
> transactional changes.
>

I'm not sure what you mean by "changes relfilenode, not value" but I suspect it might break the sequence decoding - or at least confuse it.

I haven't checked what exactly happens when we change logged/unlogged for a sequence, but I assume it does change the relfilenode, which already is a change of a value - we WAL-log the new sequence state, at least. But it should be treated as "transactional" in the transaction that did the ALTER TABLE, because it created the relfilenode.

However, I'm not sure this is a valid argument against the sequence decoding patch. If something does not acquire the correct lock, it's not surprising something else breaks, if it relies on the lock.

Of course, I understand you're trying to make a broader point - that if something like this could happen in a "correct" case, it'd be a problem. But I don't think that's possible. The whole "transactional" thing is determined by having a new relfilenode for the sequence, and I can't imagine a case where we could assign a new relfilenode without a lock.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Right, locks + apply in commit order gives us this guarantee (I can't
> think of a case where it wouldn't be the case).

I couldn't find any cases of inadequate locking other than the one I mentioned.

> Doesn't the whole logical replication critically depend on the commit
> order? If you decide to arbitrarily reorder/delay the transactions, all
> kinds of really bad things can happen. That's a generic problem, it
> applies to all kinds of objects, not just sequences - a parallel apply
> would need to detect this sort of dependency (e.g. INSERT + DELETE of
> the same key), and do something about it.

Yes, but here I'm not just talking about the commit order. I'm talking about the order of applying non-transactional operations relative to commits.

Consider:

T1: CREATE SEQUENCE s;
T2: BEGIN;
T2: SELECT nextval('s');
T3: SELECT nextval('s');
T2: ALTER SEQUENCE s INCREMENT 2;
T2: SELECT nextval('s');
T2: COMMIT;

The commit order is T1 < T3 < T2, but T3 makes no transactional changes, so the commit order is really just T1 < T2. But it's completely wrong to say that all we need to do is apply T1 before we apply T2. The correct order of application is:

1. T1.
2. T2's first nextval
3. T3's nextval
4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and the subsequent nextval)

In other words, the fact that some sequence changes are non-transactional creates ordering hazards that don't exist if there are no non-transactional changes. So in that way, sequences are different from table modifications, where applying the transactions in order of commit is all we need to do. Here we need to apply the transactions in order of commit and also apply the non-transactional changes at the right point in the sequence. Consider the following alternative apply sequence:

1. T1.
2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and the subsequent nextval)
3. T3's nextval
4. T2's first nextval

That's still in commit order. It's also wrong.

Imagine that you commit this patch and someone later wants to do parallel logical apply. So every time they finish decoding a transaction, they stick it in a queue to be applied by the next available worker. But, non-transactional changes are very simple, so we just directly apply those in the main process. Well, kaboom! But now this can happen with the above example.

1. Decode T1. Add to queue for apply.
2. Before the (idle) apply worker has a chance to pull T1 out of the queue, decode the first nextval and try to apply it.

Oops. We're trying to apply a modification to a sequence that hasn't been created yet. I'm not saying that this kind of hypothetical is a reason not to commit the patch. But it seems like we're not on the same page about what the ordering requirements are here. I'm just making the argument that those non-transactional operations actually act like mini-transactions. They need to happen at the right time relative to the real transactions. A non-transactional operation needs to be applied after any transactions that commit before it is logged, and before any transactions that commit after it's logged.

> Yes, I think this is a bug in handling of owned sequences - from the
> moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions
> generate duplicate values (until S1 is committed, at which point the
> values generated in S2 get "forgotten").
>
> It seems we end up updating both relfilenodes, which is clearly wrong.
> > Seems like a bug independent of the decoding, IMO. Yeah. -- Robert Haas EDB: http://www.enterprisedb.com
On 2/13/24 17:37, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Right, locks + apply in commit order gives us this guarantee (I can't
>> think of a case where it wouldn't be the case).
>
> I couldn't find any cases of inadequate locking other than the one I mentioned.
>
>> Doesn't the whole logical replication critically depend on the commit
>> order? If you decide to arbitrarily reorder/delay the transactions, all
>> kinds of really bad things can happen. That's a generic problem, it
>> applies to all kinds of objects, not just sequences - a parallel apply
>> would need to detect this sort of dependency (e.g. INSERT + DELETE of
>> the same key), and do something about it.
>
> Yes, but here I'm not just talking about the commit order. I'm talking
> about the order of applying non-transactional operations relative to
> commits.
>
> Consider:
>
> T1: CREATE SEQUENCE s;
> T2: BEGIN;
> T2: SELECT nextval('s');
> T3: SELECT nextval('s');
> T2: ALTER SEQUENCE s INCREMENT 2;
> T2: SELECT nextval('s');
> T2: COMMIT;
>

It's not clear to me if you're talking about nextval() that happens to generate WAL, or nextval() covered by WAL generated by a previous call. I'm going to assume it's the former, i.e. nextval() that generated WAL describing the *next* sequence chunk, because without WAL there's nothing to apply and therefore no issue with T3 ordering.

The way I think about non-transactional sequence changes is as if they were tiny transactions that happen "fully" (including commit) at the LSN where the change is logged.

> The commit order is T1 < T3 < T2, but T3 makes no transactional
> changes, so the commit order is really just T1 < T2. But it's
> completely wrong to say that all we need to do is apply T1 before we
> apply T2. The correct order of application is:
>
> 1. T1.
> 2. T2's first nextval
> 3. T3's nextval
> 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
>

Is that quite true? If T3 generated WAL (for the nextval call), it will be applied at that particular LSN. AFAIK that guarantees it happens after the first T2 change (which is also non-transactional) and before the transactional T2 change (because that creates a new relfilenode).

> In other words, the fact that some sequence changes are
> non-transactional creates ordering hazards that don't exist if there
> are no non-transactional changes. So in that way, sequences are
> different from table modifications, where applying the transactions in
> order of commit is all we need to do. Here we need to apply the
> transactions in order of commit and also apply the non-transactional
> changes at the right point in the sequence. Consider the following
> alternative apply sequence:
>
> 1. T1.
> 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 3. T3's nextval
> 4. T2's first nextval
>
> That's still in commit order. It's also wrong.
>

Yes, this would be wrong. Thankfully the apply is not allowed to reorder the changes like this, because that's not what "non-transactional" means in this context. It does not mean we can arbitrarily reorder the changes, it only means the changes are applied as if they were independent transactions (but in the same order as they were executed originally). Both with respect to the other non-transactional changes, and to "commits" of other stuff.

(for serial apply, at least)

> Imagine that you commit this patch and someone later wants to do
> parallel logical apply. So every time they finish decoding a
> transaction, they stick it in a queue to be applied by the next
> available worker. But, non-transactional changes are very simple, so
> we just directly apply those in the main process. Well, kaboom! But
> now this can happen with the above example.
>
> 1. Decode T1. Add to queue for apply.
> 2. Before the (idle) apply worker has a chance to pull T1 out of the
> queue, decode the first nextval and try to apply it.
>
> Oops. We're trying to apply a modification to a sequence that hasn't
> been created yet. I'm not saying that this kind of hypothetical is a
> reason not to commit the patch. But it seems like we're not on the
> same page about what the ordering requirements are here. I'm just
> making the argument that those non-transactional operations actually
> act like mini-transactions. They need to happen at the right time
> relative to the real transactions. A non-transactional operation needs
> to be applied after any transactions that commit before it is logged,
> and before any transactions that commit after it's logged.
>

How is this issue specific to sequences? AFAIK this is a general problem with transactions that depend on each other. Consider for example this:

T1: INSERT INTO t (id) VALUES (1);
T2: DELETE FROM t WHERE id = 1;

If you parallelize this in a naive way, maybe T2 gets applied before T1. In which case the DELETE won't find the row yet.

There are different ways to address this. You can detect this type of conflict (e.g. a DELETE that doesn't find a match), drain the apply queue and retry the transaction. Or you may compare keysets of the transactions and make sure the apply waits until the conflicting one gets fully applied first. AFAIK for sequences it's not any different, except the key we'd have to compare is the sequence itself.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The way I think about non-transactional sequence changes is as if they
> were tiny transactions that happen "fully" (including commit) at the LSN
> where the change is logged.

100% this.

> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally). Both with respect to
> the other non-transactional changes, and to "commits" of other stuff.

Right, this is very important and I agree completely.

I'm feeling more confident about this now that I heard you say that stuff -- this is really the key issue I've been worried about since I first looked at this, and I wasn't sure that you were in agreement, but it sounds like you are. I think we should (a) fix the locking bug I found (but that can be independent of this patch) and (b) make sure that this patch documents the points from the quoted material above so that everyone who reads the code (and maybe tries to enhance it) is clear on what the assumptions are.

(I haven't checked whether it documents that stuff or not. I'm just saying it should, because I think it's a subtlety that someone might miss.)

-- Robert Haas EDB: http://www.enterprisedb.com
On 2/15/24 05:16, Robert Haas wrote:
> On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The way I think about non-transactional sequence changes is as if they
>> were tiny transactions that happen "fully" (including commit) at the LSN
>> where the change is logged.
>
> 100% this.
>
>> It does not mean we can arbitrarily reorder the changes, it only means
>> the changes are applied as if they were independent transactions (but in
>> the same order as they were executed originally). Both with respect to
>> the other non-transactional changes, and to "commits" of other stuff.
>
> Right, this is very important and I agree completely.
>
> I'm feeling more confident about this now that I heard you say that
> stuff -- this is really the key issue I've been worried about since I
> first looked at this, and I wasn't sure that you were in agreement,
> but it sounds like you are. I think we should (a) fix the locking bug
> I found (but that can be independent of this patch) and (b) make sure
> that this patch documents the points from the quoted material above so
> that everyone who reads the code (and maybe tries to enhance it) is
> clear on what the assumptions are.
>
> (I haven't checked whether it documents that stuff or not. I'm just
> saying it should, because I think it's a subtlety that someone might
> miss.)
>

Thanks for thinking about these issues with reordering events. Good that we seem to be in agreement, and that you feel more confident about this. I'll check if there's a good place to document this.

For me, the part that I feel most uneasy about is the decoding while the snapshot is still being built (and can flip to a consistent snapshot between the relfilenode creation and the sequence change, confusing the logic that decides which changes are transactional).

It seems "a bit weird" that we keep the "simple" logic that may end up with an incorrect "non-transactional" result, but happens to then work fine because we immediately discard the change.

But it still feels better than the alternative, which requires us to start decoding stuff (relfilenode creation) before the snapshot being built is consistent, which we didn't do before - or at least not in this particular way. While I don't have a practical example where it would cause trouble now, I have a nagging feeling it might easily cause trouble in the future by making some new features harder to implement.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 16, 2024 at 1:57 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> For me, the part that I feel most uneasy about is the decoding while the
> snapshot is still being built (and can flip to a consistent snapshot
> between the relfilenode creation and the sequence change, confusing the
> logic that decides which changes are transactional).
>
> It seems "a bit weird" that we keep the "simple" logic that may end up
> with an incorrect "non-transactional" result, but happens to then work
> fine because we immediately discard the change.
>
> But it still feels better than the alternative, which requires us to
> start decoding stuff (relfilenode creation) before the snapshot being
> built is consistent, which we didn't do before - or at least not in
> this particular way. While I don't have a practical example where it
> would cause trouble now, I have a nagging feeling it might easily cause
> trouble in the future by making some new features harder to implement.

I don't understand the issues here well enough to comment. Is there a good write-up someplace I can read to understand the design here?

Is the rule that changes are transactional if and only if the current transaction has assigned a new relfilenode to the sequence? Why does the logic get confused if the state of the snapshot changes?

My naive reaction is that it kinda sounds like you're relying on two different mistakes cancelling each other out, and that might be a bad idea, because maybe there's some situation where they don't. But I don't understand the issue well enough to have an educated opinion at this point.

-- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/19/23 13:54, Christophe Pettus wrote:
> > Hi,
> >
> > I wanted to hop in here on one particular issue:
> >
> >> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >> better solution for distributed (esp. active-active) systems. But there
> >> are important use cases that are likely to keep using regular sequences
> >> (online upgrades of single-node instances, existing systems, ...).
> >
> > +1.
> >
> > Right now, the lack of sequence replication is a rather large
> > foot-gun on logical replication upgrades. Copying the sequences
> > over during the cutover period is doable, of course, but:
> >
> > (a) There's no out-of-the-box tooling that does it, so everyone has
> > to write some scripts just for that one function.
> >
> > (b) It's one more thing that extends the cutover window.
> >
>
> I agree it's an annoying gap for this use case. But if this is the only
> use case, maybe a better solution would be to provide such tooling
> instead of adding it to the logical decoding?
>
> It might seem a bit strange if most data is copied by replication
> directly, while sequences need special handling, ofc.
>

One difference between the logical replication of tables and sequences is that we can guarantee with synchronous_commit (and synchronous_standby_names) that after failover a transaction's data is replicated or not, whereas for sequences we can't guarantee that because of their non-transactional nature. Say, there are two transactions T1 and T2: it is possible that T1's entire table data and sequence data are committed and replicated, but only T2's sequence data is replicated. So, after failover to the logical subscriber in such a case, if one routes T2 again to the new node (as it was not successful previously), it would needlessly perform the sequence changes again. I don't know how much that matters, but that would probably be the difference between the replication of tables and sequences.

I agree with your point above that for upgrades some tool like pg_copysequence, where we provide a way to copy sequence data to subscribers from the publisher, would suffice the need.

-- With Regards, Amit Kapila.
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Is the rule that changes are transactional if and only if the current
> transaction has assigned a new relfilenode to the sequence?

Yes, that's the rule.

> Why does the logic get confused if the state of the snapshot changes?

The rule doesn't get changed, but the way this identification is implemented in the decoding gets confused and misidentifies a transactional change as non-transactional. Whether the sequence change is transactional or not is identified based on what WAL we have decoded from the particular transaction, and whether we decode a particular WAL record or not depends upon the snapshot state (it's about what we decode, not necessarily what we send). So if the snapshot state changed mid-transaction, that means we haven't decoded the WAL record which created a new relfilenode, but we will decode the WAL record which is operating on the sequence. So here we will assume the change is non-transactional whereas it was actually transactional, because we did not decode some of the changes of the transaction which we rely on for identifying whether it is transactional or not.

> My naive reaction is that it kinda sounds like you're relying on two
> different mistakes cancelling each other out, and that might be a bad
> idea, because maybe there's some situation where they don't. But I
> don't understand the issue well enough to have an educated opinion at
> this point.

I would say the first one is a mistake in identifying a transactional change as non-transactional during decoding, and that mistake happens only when we decode the transaction partially. But we never stream partially decoded transactions downstream, which means that even though we have made a mistake in decoding it, we are not streaming it, so our mistake is not getting converted into a real problem. But again, I agree there is a temporary wrong decision, and if we try to do something else based on this decision then it could be an issue.

You might be interested in more detail [1] where I first reported this problem and also [2] where we concluded why this is not creating a real problem.

[1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
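To make the rule concrete, a small sketch (assuming a pre-existing sequence 's'; the comments describe the decoding behavior discussed above):

    -- Non-transactional: 's' keeps its old relfilenode, so the increment
    -- is replayed immediately during decoding - even though the
    -- transaction aborts (sequence increments survive a rollback anyway).
    BEGIN;
    SELECT nextval('s');
    ROLLBACK;

    -- Transactional: the sequence is created in this transaction (new
    -- relfilenode), so its increments are queued in the reorder buffer
    -- and replayed only if/when the transaction commits.
    BEGIN;
    CREATE SEQUENCE s2;
    SELECT nextval('s2');
    COMMIT;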
On Tue, Feb 20, 2024 at 1:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> You might be interested in more detail [1] where I first reported this
> problem and also [2] where we concluded why this is not creating a
> real problem.
>
> [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

Thanks. Dilip and I just spent a lot of time talking this through on a call. One of the key bits of logic is here:

+ /* Skip the change if already processed (per the snapshot). */
+ if (transactional &&
+     !SnapBuildProcessChange(builder, xid, buf->origptr))
+     return;
+ else if (!transactional &&
+          (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
+           SnapBuildXactNeedsSkip(builder, buf->origptr)))
+     return;

As a stylistic note, I think this would be more clear if it were written as:

if (transactional)
{
    if (!SnapBuildProcessChange())
        return;
}
else
{
    if (something else)
        return;
}

Now, on to correctness. It's possible for us to identify a transactional change as non-transactional if smgr_decode() was called for the relfilenode before SNAPBUILD_FULL_SNAPSHOT was reached. In that case, if !SnapBuildProcessChange() would have been true, then we need SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT || SnapBuildXactNeedsSkip(builder, buf->origptr) to also be true. Otherwise, we'll process this change when we wouldn't have otherwise.

But Dilip made an argument to me about this which seems correct to me. snapbuild.h says that SNAPBUILD_CONSISTENT is reached only when we find a point where all transactions that were running at the time we reached SNAPBUILD_FULL_SNAPSHOT have finished. So if this transaction is one for which we incorrectly identified the sequence change as non-transactional, then we cannot be in the SNAPBUILD_CONSISTENT state yet, so SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT will be true, and hence the whole "or" condition will be true and we'll return.

So far, so good. I think, anyway. I haven't comprehensively verified that the comment in snapbuild.h accurately reflects what the code actually does. But if it does, then presumably we shouldn't see a record for which we might have mistakenly identified a change as non-transactional after reaching SNAPBUILD_CONSISTENT, which seems to be good enough to guarantee that the mistake won't matter.

However, the logic in smgr_decode() doesn't only care about the snapshot state. It also cares about the fast-forward flag:

+ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+     ctx->fast_forward)
+     return;

Let's say fast_forward is true. Then smgr_decode() is going to skip recording anything about the relfilenode, so we'll identify all sequence changes as non-transactional. But look at how this case is handled in seq_decode():

+ if (ctx->fast_forward)
+ {
+     /*
+      * We need to set processing_required flag to notify the sequence
+      * change existence to the caller. Usually, the flag is set when
+      * either the COMMIT or ABORT records are decoded, but this must be
+      * turned on here because the non-transactional logical message is
+      * decoded without waiting for these records.
+      */
+     if (!transactional)
+         ctx->processing_required = true;
+
+     return;
+ }

This seems suspicious. Why are we testing the transactional flag here if it's guaranteed to be false?
My guess is that the person who wrote this code thought that the flag would be accurate even in this case, but that doesn't seem to be true. So this case probably needs some more thought. It's definitely not great that this logic is so complicated; it's really hard to verify that all the tests match up well enough to keep us out of trouble. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Let's say fast_forward is true. Then smgr_decode() is going to skip
> recording anything about the relfilenode, so we'll identify all
> sequence changes as non-transactional. But look at how this case is
> handled in seq_decode():
>
> + if (ctx->fast_forward)
> + {
> +     /*
> +      * We need to set processing_required flag to notify the sequence
> +      * change existence to the caller. Usually, the flag is set when
> +      * either the COMMIT or ABORT records are decoded, but this must be
> +      * turned on here because the non-transactional logical message is
> +      * decoded without waiting for these records.
> +      */
> +     if (!transactional)
> +         ctx->processing_required = true;
> +
> +     return;
> + }

It appears that the 'processing_required' flag was introduced as part of supporting upgrades for logical replication slots. Its purpose is to determine whether a slot is fully caught up, meaning that there are no pending decodable changes left before it can be upgraded.

So now if some change was transactional but we identified it as non-transactional, we will mark this flag 'ctx->processing_required = true;'. We temporarily set this flag incorrectly, but even if the change had been correctly identified initially, the flag would have been set to true again in the DecodeTXNNeedSkip() function, regardless of whether the transaction is committed or aborted. As a result, the flag would eventually be set to 'true', and the behavior would align with the intended logic.

But I am wondering why this flag is always set to true in DecodeTXNNeedSkip() irrespective of commit or abort. Because the aborted transactions are not supposed to be replayed? So if my observation is correct that this shouldn't be set to true for an aborted transaction, then we have a problem with sequences, where we identify transactional changes as non-transactional ones - for transactional changes this should depend upon the commit status.

On another thought, can there be a situation where we wrongly identified a change as non-transactional and set this flag, and the commit/abort record never appeared in the WAL and so was never decoded? That can also lead to an incorrect decision during the upgrade.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 2/20/24 06:54, Amit Kapila wrote:
> On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/19/23 13:54, Christophe Pettus wrote:
>>> Hi,
>>>
>>> I wanted to hop in here on one particular issue:
>>>
>>>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>>>> better solution for distributed (esp. active-active) systems. But there
>>>> are important use cases that are likely to keep using regular sequences
>>>> (online upgrades of single-node instances, existing systems, ...).
>>>
>>> +1.
>>>
>>> Right now, the lack of sequence replication is a rather large
>>> foot-gun on logical replication upgrades. Copying the sequences
>>> over during the cutover period is doable, of course, but:
>>>
>>> (a) There's no out-of-the-box tooling that does it, so everyone has
>>> to write some scripts just for that one function.
>>>
>>> (b) It's one more thing that extends the cutover window.
>>>
>>
>> I agree it's an annoying gap for this use case. But if this is the only
>> use case, maybe a better solution would be to provide such tooling
>> instead of adding it to the logical decoding?
>>
>> It might seem a bit strange if most data is copied by replication
>> directly, while sequences need special handling, ofc.
>>
>
> One difference between the logical replication of tables and sequences
> is that we can guarantee with synchronous_commit (and
> synchronous_standby_names) that after failover a transaction's data is
> replicated or not, whereas for sequences we can't guarantee that
> because of their non-transactional nature. Say, there are two
> transactions T1 and T2: it is possible that T1's entire table data and
> sequence data are committed and replicated, but only T2's sequence data
> is replicated. So, after failover to the logical subscriber in such a
> case, if one routes T2 again to the new node (as it was not successful
> previously), it would needlessly perform the sequence changes again. I
> don't know how much that matters, but that would probably be the
> difference between the replication of tables and sequences.
>

I don't quite follow what the problem with synchronous_commit is :-(

For sequences, we log the changes ahead, i.e. even if nextval() did not write anything into WAL, it's still safe because these changes are covered by the WAL generated some time ago (up to ~32 values back). And that's certainly subject to synchronous_commit, right?

There certainly are issues with sequences and syncrep:

https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com

but that's unrelated to logical replication.

FWIW I don't think we'd re-apply sequence changes needlessly, because the worker does update the origin after applying non-transactional changes. So after the replication gets restarted, we'd skip what we already applied, no?

But maybe there is an issue and I'm just not getting it. Could you maybe share an example of T1/T2, with a replication restart and what you think would happen?

> I agree with your point above that for upgrades some tool like
> pg_copysequence, where we provide a way to copy sequence data to
> subscribers from the publisher, would suffice the need.
>

Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet another tool users would need to use.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
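(The pre-logging is easy to observe from SQL, by the way - log_cnt on the sequence relation shows how many future values the last WAL record already covers. A minimal sketch, assuming a fresh default sequence:)

    CREATE SEQUENCE s3;
    SELECT nextval('s3');                -- WAL-logs this value plus ~32 ahead
    SELECT last_value, log_cnt FROM s3;  -- log_cnt should be 32 at this point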
On Tue, Feb 20, 2024 at 5:39 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/20/24 06:54, Amit Kapila wrote:
> > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 12/19/23 13:54, Christophe Pettus wrote:
> >>> Hi,
> >>>
> >>> I wanted to hop in here on one particular issue:
> >>>
> >>>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >>>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >>>> better solution for distributed (esp. active-active) systems. But there
> >>>> are important use cases that are likely to keep using regular sequences
> >>>> (online upgrades of single-node instances, existing systems, ...).
> >>>
> >>> +1.
> >>>
> >>> Right now, the lack of sequence replication is a rather large
> >>> foot-gun on logical replication upgrades. Copying the sequences
> >>> over during the cutover period is doable, of course, but:
> >>>
> >>> (a) There's no out-of-the-box tooling that does it, so everyone has
> >>> to write some scripts just for that one function.
> >>>
> >>> (b) It's one more thing that extends the cutover window.
> >>>
> >>
> >> I agree it's an annoying gap for this use case. But if this is the only
> >> use cases, maybe a better solution would be to provide such tooling
> >> instead of adding it to the logical decoding?
> >>
> >> It might seem a bit strange if most data is copied by replication
> >> directly, while sequences need special handling, ofc.
> >>
> >
> > One difference between the logical replication of tables and sequences
> > is that we can guarantee with synchronous_commit (and
> > synchronous_standby_names) that after failover transactions data is
> > replicated or not whereas for sequences we can't guarantee that
> > because of their non-transactional nature. Say, there are two
> > transactions T1 and T2, it is possible that T1's entire table data and
> > sequence data are committed and replicated but T2's sequence data is
> > replicated. So, after failover to logical subscriber in such a case if
> > one routes T2 again to the new node as it was not successful
> > previously then it would needlessly perform the sequence changes
> > again. I don't how much that matters but that would probably be the
> > difference between the replication of tables and sequences.
> >
>
> I don't quite follow what the problem with synchronous_commit is :-(
>
> For sequences, we log the changes ahead, i.e. even if nextval() did not
> write anything into WAL, it's still safe because these changes are
> covered by the WAL generated some time ago (up to ~32 values back). And
> that's certainly subject to synchronous_commit, right?
>
> There certainly are issues with sequences and syncrep:
>
> https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com
>
> but that's unrelated to logical replication.
>
> FWIW I don't think we'd re-apply sequence changes needlessly, because
> the worker does update the origin after applying non-transactional
> changes. So after the replication gets restarted, we'd skip what we
> already applied, no?
>

It will work for restarts, but I was trying to discuss what happens
when the publisher node goes down and we fail over to the subscriber
node and make it the primary node. After that, all unfinished
transactions will be re-routed to the new primary.
Consider a theoretical case where we send sequence changes of the yet
uncommitted transactions directly from the WAL buffers (something like
91f2cae7a4 does for physical replication) and then the primary or
publisher node crashes immediately. After failover to the subscriber
node, the application will re-route the unfinished transactions to the
new primary. In such a situation, I think there is a chance that we
will update the sequence value even though the subscriber has already
received/applied that update via replication. This is what I meant by
there probably being a difference between tables and sequences: for
tables, such a replicated change would be rolled back. Having said
that, this is probably no different from what would happen in the case
of physical replication.

> But maybe there is an issue and I'm just not getting it. Could you maybe
> share an example of T1/T2, with a replication restart and what you think
> would happen?
>
> > I agree with your point above that for upgrades some tool like
> > pg_copysequence where we can provide a way to copy sequence data to
> > subscribers from the publisher would suffice the need.
> >
>
> Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
> another tool users would need to use.
>

But can a logical replica be used for failover? We don't have any way
to replicate/sync the slots on subscribers, nor do we have a mechanism
to replicate existing publications. I think if we want to achieve
failover to a logical subscriber we need to replicate/sync the
required logical and physical slots to the subscribers. I haven't
thought it through completely, so there are probably more things to
consider before logical subscribers can be used as failover
candidates.

--
With Regards,
Amit Kapila.
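PS: To spell out the scenario I have in mind, a hypothetical timeline,
assuming the theoretical "send from WAL buffers" behavior above:

-- on the publisher
BEGIN;                                -- T2
INSERT INTO t VALUES (nextval('s'));  -- sequence increment WAL-logged
-- the (non-transactional) sequence change is streamed and applied on
-- the subscriber; the publisher then crashes before T2 commits

-- on the subscriber, now the new primary
BEGIN;                                -- application retries T2
INSERT INTO t VALUES (nextval('s'));  -- sequence advances again, on
COMMIT;                               -- top of the already-replicated
                                      -- change; the table row is
                                      -- inserted fresh, since T2's
                                      -- original row was never applied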
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/13/24 17:37, Robert Haas wrote:
>
> > In other words, the fact that some sequence changes are
> > non-transactional creates ordering hazards that don't exist if there
> > are no non-transactional changes. So in that way, sequences are
> > different from table modifications, where applying the transactions in
> > order of commit is all we need to do. Here we need to apply the
> > transactions in order of commit and also apply the non-transactional
> > changes at the right point in the sequence. Consider the following
> > alternative apply sequence:
> >
> > 1. T1.
> > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> > the subsequent nextval)
> > 3. T3's nextval
> > 4. T2's first nextval
> >
> > That's still in commit order. It's also wrong.
> >
>
> Yes, this would be wrong. Thankfully the apply is not allowed to reorder
> the changes like this, because that's not what "non-transactional" means
> in this context.
>
> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally).
>

In this regard, I have another scenario in mind where the apply order
could be different for changes within the same transaction. For
example, in transaction T1:

Begin;
Insert ..
Insert ..
nextval ..  -- assume this generates WAL
..
Insert ..
nextval ..  -- assume this generates WAL

In this case, if the nextval operations are applied in a different
order (i.e. before the inserts), there could be some inconsistency.
Say the apply does not follow the above order; then a trigger fired
for each row insert on both pub and sub, one that refers to the
current sequence value to make some decision, could behave differently
on the publisher and the subscriber. If this is not how the patch
behaves then that's fine, but otherwise isn't this something we should
be worried about?

--
With Regards,
Amit Kapila.
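PS: A minimal sketch of the kind of trigger I mean (hypothetical
names; it has to be marked ENABLE ALWAYS so it also fires when the
apply worker inserts the row on the subscriber):

CREATE SEQUENCE s;
CREATE TABLE t (v int, seq_snapshot bigint);

CREATE FUNCTION capture_seq() RETURNS trigger AS $$
BEGIN
    -- record the sequence's current state at the moment this row
    -- insert is applied; the value depends on whether the pending
    -- nextval changes were applied before or after the insert
    NEW.seq_snapshot := (SELECT last_value FROM s);
    RETURN NEW;
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER t_capture BEFORE INSERT ON t
    FOR EACH ROW EXECUTE FUNCTION capture_seq();
ALTER TABLE t ENABLE ALWAYS TRIGGER t_capture;

If the nextval-vs-insert ordering differs between the two nodes, the
seq_snapshot the trigger computes would differ as well.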
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 3:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Let's say fast_forward is true. Then smgr_decode() is going to skip
> > recording anything about the relfilenode, so we'll identify all
> > sequence changes as non-transactional. But look at how this case is
> > handled in seq_decode():
> >
> > + if (ctx->fast_forward)
> > + {
> > + /*
> > + * We need to set processing_required flag to notify the sequence
> > + * change existence to the caller. Usually, the flag is set when
> > + * either the COMMIT or ABORT records are decoded, but this must be
> > + * turned on here because the non-transactional logical message is
> > + * decoded without waiting for these records.
> > + */
> > + if (!transactional)
> > + ctx->processing_required = true;
> > +
> > + return;
> > + }
>
> It appears that the 'processing_required' flag was introduced as part
> of supporting upgrades for logical replication slots. Its purpose is
> to determine whether a slot is fully caught up, meaning that there are
> no pending decodable changes left before it can be upgraded.
>
> So if some change was transactional but we have identified it as
> non-transactional, we will set 'ctx->processing_required = true' here,
> i.e. we temporarily set the flag for the wrong reason. But even if the
> change had been classified correctly from the start, the flag would
> have been set to true anyway, in the DecodeTXNNeedSkip() function,
> regardless of whether the transaction is committed or aborted. As a
> result, the flag would eventually be set to 'true' either way, and the
> behavior would align with the intended logic.
>
> But I am wondering why this flag is always set to true in
> DecodeTXNNeedSkip() irrespective of the commit or abort. Is that
> because the aborted transactions are not supposed to be replayed? If
> my observation is correct and the flag should not be set to true for
> an aborted transaction, then we have a problem with sequences: we are
> identifying transactional changes as non-transactional ones, while for
> transactional changes this should depend upon the commit status.

I have checked this case with Amit Kapila. It seems that in the cases
where we have sent a prepared transaction or streamed an in-progress
transaction, we would need to send the abort as well. For that reason
we set 'ctx->processing_required' to true, so that if these WAL
records have not been streamed we do not allow the upgrade of such
slots.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > But I am wondering why this flag is always set to true in
> > DecodeTXNNeedSkip() irrespective of the commit or abort. Is that
> > because the aborted transactions are not supposed to be replayed? If
> > my observation is correct and the flag should not be set to true for
> > an aborted transaction, then we have a problem with sequences: we are
> > identifying transactional changes as non-transactional ones, while for
> > transactional changes this should depend upon the commit status.
>
> I have checked this case with Amit Kapila. It seems that in the cases
> where we have sent a prepared transaction or streamed an in-progress
> transaction, we would need to send the abort as well. For that reason
> we set 'ctx->processing_required' to true, so that if these WAL
> records have not been streamed we do not allow the upgrade of such
> slots.

I don't find this explanation clear enough for me to understand.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > But I am wondering why this flag is always set to true in
> > > DecodeTXNNeedSkip() irrespective of the commit or abort. Is that
> > > because the aborted transactions are not supposed to be replayed? If
> > > my observation is correct and the flag should not be set to true for
> > > an aborted transaction, then we have a problem with sequences: we are
> > > identifying transactional changes as non-transactional ones, while for
> > > transactional changes this should depend upon the commit status.
> >
> > I have checked this case with Amit Kapila. It seems that in the cases
> > where we have sent a prepared transaction or streamed an in-progress
> > transaction, we would need to send the abort as well. For that reason
> > we set 'ctx->processing_required' to true, so that if these WAL
> > records have not been streamed we do not allow the upgrade of such
> > slots.
>
> I don't find this explanation clear enough for me to understand.

An explanation of why we set 'ctx->processing_required' to true from
DecodeCommit as well as DecodeAbort:
--------------------------------------------------------------------
For upgrading logical replication slots, it's essential to ensure
these slots are completely synchronized with the subscriber. To
determine that, we process all the pending WAL in 'fast_forward' mode
and check whether there is any decodable WAL left. In short, any WAL
record type that we would stream to the subscriber in normal mode (not
fast_forward mode) is considered decodable, and that includes the
abort record. That's why, at the end of a transaction, commit or
abort, we need to set 'ctx->processing_required' to true: some
decodable WAL exists, so we cannot upgrade this slot.

Why is the below check safe?

> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

So the problem is that we might consider the transaction change as
non-transactional and mark this flag as true. But what would have
happened if we had identified it correctly as transactional? In such
cases we wouldn't have set this flag here, but then we would have set
it while processing DecodeAbort/DecodeCommit, so the net effect would
be the same, no? You may ask what happens if the abort/commit record
never appears in the WAL. But this flag is specifically for the
upgrade case, and in that case we have to do a clean shutdown, so that
may not be an issue. However, if in the future we try to use
'ctx->processing_required' for something else where a clean shutdown
is not guaranteed, then this flag could be set incorrectly.

I am not arguing that this is a perfect design, but I am just making a
point about why it would work.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
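PS: For reference, as I understand it this is the check that
pg_upgrade drives (the function can only be called while the server is
in binary upgrade mode, so this is illustrative rather than something
to run directly):

SELECT binary_upgrade_logical_slot_has_caught_up('myslot');
-- expected to return false when fast-forward decoding found decodable
-- WAL, i.e. when processing_required got set, in which case pg_upgrade
-- refuses to migrate the slot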
On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> So the problem is that we might consider the transaction change as
> non-transactional and mark this flag as true.

But it's not "might", right? It's absolutely 100% certain that we will
consider that transaction's changes as non-transactional ... because
when we're in fast-forward mode, the table of new relfilenodes is not
built, and so whenever we check whether any transaction made a new
relfilenode for this sequence, the answer will be no.

> But what would have happened if we had identified it correctly as
> transactional? In such cases we wouldn't have set this flag here, but
> then we would have set it while processing DecodeAbort/DecodeCommit,
> so the net effect would be the same, no? You may ask what happens if
> the abort/commit record never appears in the WAL. But this flag is
> specifically for the upgrade case, and in that case we have to do a
> clean shutdown, so that may not be an issue. However, if in the future
> we try to use 'ctx->processing_required' for something else where a
> clean shutdown is not guaranteed, then this flag could be set
> incorrectly.
>
> I am not arguing that this is a perfect design, but I am just making a
> point about why it would work.

Even if this argument is correct (and I don't know if it is), the code
and comments need some updating. We should not be testing a flag that
is guaranteed false with comments that make it sound like the value of
the flag is trustworthy when it isn't.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > So the problem is that we might consider the transaction change as
> > non-transactional and mark this flag as true.
>
> But it's not "might", right? It's absolutely 100% certain that we will
> consider that transaction's changes as non-transactional ... because
> when we're in fast-forward mode, the table of new relfilenodes is not
> built, and so whenever we check whether any transaction made a new
> relfilenode for this sequence, the answer will be no.
>
> > But what would have happened if we had identified it correctly as
> > transactional? In such cases we wouldn't have set this flag here, but
> > then we would have set it while processing DecodeAbort/DecodeCommit,
> > so the net effect would be the same, no? You may ask what happens if
> > the abort/commit record never appears in the WAL. But this flag is
> > specifically for the upgrade case, and in that case we have to do a
> > clean shutdown, so that may not be an issue. However, if in the future
> > we try to use 'ctx->processing_required' for something else where a
> > clean shutdown is not guaranteed, then this flag could be set
> > incorrectly.
> >
> > I am not arguing that this is a perfect design, but I am just making a
> > point about why it would work.
>
> Even if this argument is correct (and I don't know if it is), the code
> and comments need some updating. We should not be testing a flag that
> is guaranteed false with comments that make it sound like the value of
> the flag is trustworthy when it isn't.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,

Let me share a bit of an update regarding this patch and PG17.

I have discussed this patch and how to move it forward with a couple
of hackers (both within EDB and outside), and my takeaway is that the
patch is not quite baked yet, not enough to make it into PG17 :-(

There are two main reasons / concerns leading to this conclusion:

* correctness of the decoding part

There are (were) doubts about decoding during startup, before the
snapshot gets consistent, when we can get "temporarily incorrect"
decisions on whether a change is transactional. While the behavior is
ultimately correct (we treat all such changes as non-transactional and
discard them), it seems "dirty", and it's unclear to me whether it
might cause more serious issues down the line (not necessarily bugs,
but perhaps making it harder to implement future changes).

* handling of sequences in built-in replication

Per the patch, sequences need to be added to the publication
explicitly. But there were suggestions we might (should) add certain
sequences automatically - e.g. sequences backing SERIAL/BIGSERIAL
columns, etc. I'm not sure we really want to do that, and so far I
assumed we would start with the manual approach and move to automatic
addition in the future. But the agreement seems to be that such a
later switch would be a pretty significant "breaking change", and
something we probably don't want to do.

If someone has an opinion on either of the two issues (in either
direction), I'd like to hear it.

Obviously, I'm not particularly happy about this outcome. And I'm also
somewhat cautious because this patch was already committed and then
reverted in the PG16 cycle, and doing the same thing in PG17 is not on
my wish list.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company