Thread: Replication identifiers, take 3

Replication identifiers, take 3

From

Andres Freund

Date:

23 September 2014, 18:24:30

Hi,

I've previously started two threads about replication identifiers. Check
http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
and
http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
.

The've also been discussed in the course of another thread:
http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de

As the topic has garnered some heat and confusion I thought it'd be
worthwile to start afresh with an explanation why I think they're
useful.

I don't really want to discuss about implementation specifics for now,
but rather about (details of the) concept. Once we've hashed those out,
I'll adapt the existing patch to match them.

There are three primary use cases for replication identifiers:
1) The ability Monitor how for replication has progressed in a crashsafe manner to allow it to restart at the right
pointafter errors/crashes.

2) Efficiently identify the origin of individual changes and transactions. In multimaster and some cascading scenarios
itis necessary to do so to avoid sending out replayed changes again.

3) Allow to efficiently filter out replayed changes from logical decoding. It's currently possible to filter changes
frominside the output plugin, but it's much more efficient to filter them out before decoding.

== Logical Decoding Background ==

To understand the need for 1) it's important to roughly understand how
logical decoding/walsender streams changes and handles feedback from
the receiving side. A walsender performing logical decoding
*continously* sends out transactions. As long as there's new local
changes (i.e. unprocessed WAL) and the network buffers aren't full it
will send changes. *Without* waiting for the client. Everything else
would lead to horrible latency.

Because it sends data without waiting for the client to have processed
them it obviously can't remove resources that are needed to stream
them out again. The client or network connection could crash after
all.

To let the sender know when it can remove resources the receiver
regularly sends back 'feedback' messages acknowledging up to where
changes have been safely received. Whenever such a feedback message
arrives the sender can release resources that aren't needed to decode
the changes below that horizon.

When the receiver ask the server to stream changes out it tells the
sender at which LSN it should start sending changes. All
*transactions* that *commit* after that LSN are sent out. Possibly
again.

== Crashsafe apply ==

Based on those explanations, when building a logical replication
solution on top of logical decoding, one must remember the latest
*remote* LSN that already has been replayed. So that, when the apply
process or the whole database crashes, it is possibly to ask for all
changes since the last transaction that has been successfully applied.

The trivial solution here is to have a table (remote_node,
last_replayed_lsn) and update it for every replayed
transaction. Unfortunately that doesn't perform very well because that
table quickly gets heavily bloated. It's also hard to avoid page level
contention when replaying transaction from multiple remote
nodes. Additionally these changes have to be filtered out when
replicating these changes in a cascading fashion.

To do this more efficiently there needs to be a crashsafe way to
associate the latest successfully replayed remote transaction.

== Identify the origin of changes ==

Say you're building a replication solution that allows two nodes to
insert into the same table on two nodes. Ignoring conflict resolution
and similar fun, one needs to prevent the same change being replayed
over and over. In logical replication the changes to the heap have to
be WAL logged, and thus the *replay* of changes from a remote node
produce WAL which then will be decoded again.

To avoid that it's very useful to tag individual changes/transactions
with their 'origin'. I.e. mark changes that have been directly
triggered by the user sending SQL as originating 'locally' and changes
originating from replaying another node's changes as originating
somewhere else.

If that origin is exposed to logical decoding output plugins they can
easily check whether to stream out the changes/transactions or not.

It is possible to do this by adding extra columns to every table and
store the origin of a row in there, but that a) permanently needs
storage b) makes things much more invasive.

== Proposed solution ==

These two fundamental problems happen to have overlapping
requirements.

A rather efficient solution for 1) is to attach the 'origin node' and
the remote commit LSN to every local commit record produced by
replay. That allows to have a shared memory "table" (remote_node,
local_lsn, remote_lsn).

During replay that table is kept up2date in sync with transaction
commits. If updated within the transaction commit's critical section
it's guaranteed to be correct, even if transactions can abort due to
constraint violations and such. When the cluster crashes it can be
rebuilt during crash recovery, by updating values whenever a commit
record is read.

The primary complexity here is that the identification of the
'origin' node should be as small as possible to keep the WAL volume
down.

Similarly, to solve the problem of identifying the origin of changes
during decoding, the problem can be solved nicely by adding the origin
node of every change to changes/transactions. At first it might seem
to be sufficient to do so on transaction level, but for cascading
scenarios it's very useful to be able to have changes from different
source transactions combinded into a larger one.

Again the primary problem here is how to efficiently identify origin
nodes.

== Replication Identifiers ==

The above explains the need to have as small as possible identifiers
for nodes. Two years back I'd suggested that we rely on the user to
manually assign 16bit ids to individual nodes. Not very surprisingly
that was shot down because a) 16bit numbers are not descriptive b) a
per node identifier is problematic because it prohibits replication
inside the same cluster.

So, what I've proposed since is to have two different forms of
identifiers. A long one, that's as descriptive as
$replication_solution wants. And a small one (16bit in my case) that's
*only meaningful within one node*. The long, user facing, identifier
is locally mapped to the short one.

In the first version I proposed these long identifiers had a fixed
form, including the system identifier, timeline id, database id, and a
freeform name. That wasn't well received and I agree that that's too
narrow. I think it should be a freeform text of arbitrary length.

Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.

What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.

There's a example of how this can be used from SQL at
http://archives.postgresql.org/message-id/20131114172632.GE7522@alap2.anarazel.de
That version is built on top of commit timestamps, but that only shows
because pg_replication_identifier_setup_tx_origin() allows to set the
source transaction's timestamp.

With that, far too long, explanation, is it clearer what I think
replication identifiers are for? What's your thougts?

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Robert Haas

Date:

26 September 2014, 02:45:05

Thanks for this write-up.

On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> 1) The ability Monitor how for replication has progressed in a
>    crashsafe manner to allow it to restart at the right point after
>    errors/crashes.
> 2) Efficiently identify the origin of individual changes and
>    transactions. In multimaster and some cascading scenarios it is
>    necessary to do so to avoid sending out replayed changes again.
> 3) Allow to efficiently filter out replayed changes from logical
>    decoding. It's currently possible to filter changes from inside the
>    output plugin, but it's much more efficient to filter them out
>    before decoding.

I agree with these goals.

Let me try to summarize the information requirements for each of these
things.  For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message.  For #2, you need to know, when decoding each
change, what the origin node was.  And for #3, you need to know, when
decoding each change, whether it is of local origin.  The information
requirements for #3 are a subset of those for #2.

> A rather efficient solution for 1) is to attach the 'origin node' and
> the remote commit LSN to every local commit record produced by
> replay. That allows to have a shared memory "table" (remote_node,
> local_lsn, remote_lsn).

This seems OK to me, modulo some arguing about what the origin node
information ought to look like.  People who are not using logical
replication can use the compact form of the commit record in most
cases, and people who are using logical replication can pay for it.

> Similarly, to solve the problem of identifying the origin of changes
> during decoding, the problem can be solved nicely by adding the origin
> node of every change to changes/transactions. At first it might seem
> to be sufficient to do so on transaction level, but for cascading
> scenarios it's very useful to be able to have changes from different
> source transactions combinded into a larger one.

I think this is a lot more problematic.  I agree that having the data
in the commit record isn't sufficient here, because for filtering
purposes (for example) you really want to identify the problematic
transactions at the beginning, so you can chuck their WAL, rather than
reassembling the transaction first and then throwing it out.  But
putting the origin ID in every insert/update/delete is pretty
unappealing from my point of view - not just because it adds bytes to
WAL, though that's a non-trivial concern, but also because it adds
complexity - IMHO, a non-trivial amount of complexity.  I'd be a lot
happier with a solution where, say, we have a separate WAL record that
says "XID 1234 will be decoding for origin 567 until further notice".

> == Replication Identifiers ==
>
> The above explains the need to have as small as possible identifiers
> for nodes. Two years back I'd suggested that we rely on the user to
> manually assign 16bit ids to individual nodes. Not very surprisingly
> that was shot down because a) 16bit numbers are not descriptive b) a
> per node identifier is problematic because it prohibits replication
> inside the same cluster.
>
> So, what I've proposed since is to have two different forms of
> identifiers. A long one, that's as descriptive as
> $replication_solution wants. And a small one (16bit in my case) that's
> *only meaningful within one node*. The long, user facing, identifier
> is locally mapped to the short one.
>
> In the first version I proposed these long identifiers had a fixed
> form, including the system identifier, timeline id, database id, and a
> freeform name. That wasn't well received and I agree that that's too
> narrow. I think it should be a freeform text of arbitrary length.
>
> Note that it depends on the replication solution whether these
> external identifiers need to be coordinated across systems or not. I
> think it's *good* if we don't propose a solution for that - different
> replication solutions will have different requirements.

I'm pretty fuzzy on how this actually works.  Like, the short form
here is just getting injected into WAL by the apply process.  How does
it figure out what value to inject?  What if it injects a value that
doesn't have a short-to-long mapping?  What's the point of the
short-to-long mappings in the first place?  Is that only required
because of the possibility that there might be multiple replication
solutions in play on the same node?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Replication identifiers, take 3

From

Petr Jelinek

Date:

26 September 2014, 09:15:09

On 26/09/14 04:44, Robert Haas wrote:
> On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>
>> Note that it depends on the replication solution whether these
>> external identifiers need to be coordinated across systems or not. I
>> think it's *good* if we don't propose a solution for that - different
>> replication solutions will have different requirements.
>
> I'm pretty fuzzy on how this actually works.  Like, the short form
> here is just getting injected into WAL by the apply process.  How does
> it figure out what value to inject?  What if it injects a value that
> doesn't have a short-to-long mapping?  What's the point of the
> short-to-long mappings in the first place?  Is that only required
> because of the possibility that there might be multiple replication
> solutions in play on the same node?
>
From my perspective the short-to-long mapping is mainly convenience 
thing, long id should be something that can be used to map the 
identifier to the specific node for the purposes of configuration, 
monitoring, troubleshooting, etc. You also usually don't use just Oids 
to represent the DB objects, I see some analogy there.
This could be potentially done by the solution itself, not by the 
framework, but it seems logical (pardon the pun) that most (if not all) 
solutions will want some kind of mapping of the generated ids to 
something that represents the logical node.

So answer to your first two questions depends on the specific solution, 
it can map it from connection configuration, it can get the it from the 
output plugin as part of wire protocol, it can generate it based on some 
internal logic, etc.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Andres Freund

Date:

26 September 2014, 09:15:35

On 2014-09-25 22:44:49 -0400, Robert Haas wrote:
> Thanks for this write-up.
>
> On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 1) The ability Monitor how for replication has progressed in a
> >    crashsafe manner to allow it to restart at the right point after
> >    errors/crashes.
> > 2) Efficiently identify the origin of individual changes and
> >    transactions. In multimaster and some cascading scenarios it is
> >    necessary to do so to avoid sending out replayed changes again.
> > 3) Allow to efficiently filter out replayed changes from logical
> >    decoding. It's currently possible to filter changes from inside the
> >    output plugin, but it's much more efficient to filter them out
> >    before decoding.
>
> I agree with these goals.
>
> Let me try to summarize the information requirements for each of these
> things.  For #1, you need to know, after crash recovery, for each
> standby, the last commit LSN which the client has confirmed via a
> feedback message.

I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.

Say you replay the following:
SET synchronous_commit = off;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/10 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/20 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/30 */;

If postgres crashes at any point during this, we need to know whether we
successfully replayed up to 0/10, 0/20 or 0/30. Note that the problem
exists independent of s_c=off, it just excerbates the issue.

Then, after finishing recovery and discovering only 0/10 has persisted,
the standby can reconnect to the primary and do
START_REPLICATION SLOT .. LOGICAL 0/10;
and it'll receive all transactions that have committed since 0/10.

> For #2, you need to know, when decoding each
> change, what the origin node was.  And for #3, you need to know, when
> decoding each change, whether it is of local origin.  The information
> requirements for #3 are a subset of those for #2.

Right. For #3 it's more important to have the information available
efficiently on individual records.

> > A rather efficient solution for 1) is to attach the 'origin node' and
> > the remote commit LSN to every local commit record produced by
> > replay. That allows to have a shared memory "table" (remote_node,
> > local_lsn, remote_lsn).
>
> This seems OK to me, modulo some arguing about what the origin node
> information ought to look like.  People who are not using logical
> replication can use the compact form of the commit record in most
> cases, and people who are using logical replication can pay for it.

Exactly.

> > Similarly, to solve the problem of identifying the origin of changes
> > during decoding, the problem can be solved nicely by adding the origin
> > node of every change to changes/transactions. At first it might seem
> > to be sufficient to do so on transaction level, but for cascading
> > scenarios it's very useful to be able to have changes from different
> > source transactions combinded into a larger one.
>
> I think this is a lot more problematic.  I agree that having the data
> in the commit record isn't sufficient here, because for filtering
> purposes (for example) you really want to identify the problematic
> transactions at the beginning, so you can chuck their WAL, rather than
> reassembling the transaction first and then throwing it out.  But
> putting the origin ID in every insert/update/delete is pretty
> unappealing from my point of view - not just because it adds bytes to
> WAL, though that's a non-trivial concern, but also because it adds
> complexity - IMHO, a non-trivial amount of complexity.  I'd be a lot
> happier with a solution where, say, we have a separate WAL record that
> says "XID 1234 will be decoding for origin 567 until further notice".

I think it actually ends up much simpler than what you propose. In the
apply process, you simply execute
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
or it's C equivalent. That sets a global variable which XLogInsert()
includes the record.
Note that this doesn't actually require any additional space in the WAL
- padding bytes in struct XLogRecord are used to store the
identifier. These have been unused at least since 8.0.

I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keep
track of any state. That's different if you can switch the origin per
tx. At the very least you need a in memory entry for the origin.

> > == Replication Identifiers ==
> >
> > The above explains the need to have as small as possible identifiers
> > for nodes. Two years back I'd suggested that we rely on the user to
> > manually assign 16bit ids to individual nodes. Not very surprisingly
> > that was shot down because a) 16bit numbers are not descriptive b) a
> > per node identifier is problematic because it prohibits replication
> > inside the same cluster.
> >
> > So, what I've proposed since is to have two different forms of
> > identifiers. A long one, that's as descriptive as
> > $replication_solution wants. And a small one (16bit in my case) that's
> > *only meaningful within one node*. The long, user facing, identifier
> > is locally mapped to the short one.
> >
> > In the first version I proposed these long identifiers had a fixed
> > form, including the system identifier, timeline id, database id, and a
> > freeform name. That wasn't well received and I agree that that's too
> > narrow. I think it should be a freeform text of arbitrary length.
> >
> > Note that it depends on the replication solution whether these
> > external identifiers need to be coordinated across systems or not. I
> > think it's *good* if we don't propose a solution for that - different
> > replication solutions will have different requirements.
>
> I'm pretty fuzzy on how this actually works.  Like, the short form
> here is just getting injected into WAL by the apply process.  How does
> it figure out what value to inject?

Thy apply process once does
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
that looks up the internal identifier and stores it in a global
variable. That's then filled in struct XLogRecord.
To setup the origin LSN of a transaction
SELECT pg_replication_identifier_setup_tx_origin('0/123456', '2013-12-11 15:14:59.219737+01')
is used. If setup that'll emit the 'extended' commit record with the
remote commit LSN.

> What if it injects a value that doesn't have a short-to-long mapping?

Shouldn't be possible unless you drop a replication identifier after it
has been setup by *_replaying_from(). If we feel that's a actually
dangerous scenario we can prohibit it with a session level lock.

> What's the point of the short-to-long mappings in the first place?  Is
> that only required because of the possibility that there might be
> multiple replication solutions in play on the same node?

In my original proposal, 2 years+ back, I only used short numeric
ids. And people didn't like it because it requires coordination between
the replication solutions and possibly between servers. Using a string
identifier like in the above allows to easily build unique names; and
allows every solution to add the information it needs into replication
identifiers.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Robert Haas

Date:

26 September 2014, 13:53:15

On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Let me try to summarize the information requirements for each of these
>> things.  For #1, you need to know, after crash recovery, for each
>> standby, the last commit LSN which the client has confirmed via a
>> feedback message.
>
> I'm not sure I understand what you mean here? This is all happening on
> the *standby*. The standby needs to know, after crash recovery, the
> latest commit LSN from the primary that it has successfully replayed.

Ah, sorry, you're right: so, you need to know, after crash recovery,
for each machine you are replicating *from*, the last transaction (in
terms of LSN) from that server that you successfully replayed.

>> > Similarly, to solve the problem of identifying the origin of changes
>> > during decoding, the problem can be solved nicely by adding the origin
>> > node of every change to changes/transactions. At first it might seem
>> > to be sufficient to do so on transaction level, but for cascading
>> > scenarios it's very useful to be able to have changes from different
>> > source transactions combinded into a larger one.
>>
>> I think this is a lot more problematic.  I agree that having the data
>> in the commit record isn't sufficient here, because for filtering
>> purposes (for example) you really want to identify the problematic
>> transactions at the beginning, so you can chuck their WAL, rather than
>> reassembling the transaction first and then throwing it out.  But
>> putting the origin ID in every insert/update/delete is pretty
>> unappealing from my point of view - not just because it adds bytes to
>> WAL, though that's a non-trivial concern, but also because it adds
>> complexity - IMHO, a non-trivial amount of complexity.  I'd be a lot
>> happier with a solution where, say, we have a separate WAL record that
>> says "XID 1234 will be decoding for origin 567 until further notice".
>
> I think it actually ends up much simpler than what you propose. In the
> apply process, you simply execute
> SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
> or it's C equivalent. That sets a global variable which XLogInsert()
> includes the record.
> Note that this doesn't actually require any additional space in the WAL
> - padding bytes in struct XLogRecord are used to store the
> identifier. These have been unused at least since 8.0.

Sure, that's simpler for logical decoding, for sure.  That doesn't
make it the right decision for the system overall.

> I don't think a solution which logs the change of origin will be
> simpler. When the origin is in every record, you can filter without keep
> track of any state. That's different if you can switch the origin per
> tx. At the very least you need a in memory entry for the origin.

But again, that complexity pertains only to logical decoding.
Somebody who wants to tweak the WAL format for an UPDATE in the future
doesn't need to understand how this works, or care.  You know me: I've
been a huge advocate of logical decoding.  But just like row-level
security or BRIN indexes or any other feature, I think it needs to be
designed in a way that minimizes the impact it has on the rest of the
system.  I simply don't believe your contention that this isn't adding
any complexity to the code path for regular DML operations.  It's
entirely possible we could need bit space in those records in the
future for something that actually pertains to those operations; if
you've burned it for logical decoding, it'll be difficult to claw it
back.  And what if Tom gets around, some day, to doing that pluggable
heap AM work?  Then every heap AM has got to allow for those bits, and
maybe that doesn't happen to be free for them.

Admittedly, these are hypothetical scenarios, but I don't think
they're particularly far-fetched.  And as a fringe benefit, if you do
it the way that I'm proposing, you can use an OID instead of a 16-bit
thing that we picked to be 16 bits because that happens to be 100% of
the available bit-space.  Yeah, there's some complexity on decoding,
but it's minimal: one more piece of fixed-size state to track per XID.
That's trivial compared to what you've already got.

>> What's the point of the short-to-long mappings in the first place?  Is
>> that only required because of the possibility that there might be
>> multiple replication solutions in play on the same node?
>
> In my original proposal, 2 years+ back, I only used short numeric
> ids. And people didn't like it because it requires coordination between
> the replication solutions and possibly between servers. Using a string
> identifier like in the above allows to easily build unique names; and
> allows every solution to add the information it needs into replication
> identifiers.

I get that, but what I'm asking is why those mappings can't be managed
on a per-replication-solution basis.  I think that's just because
there's a limited namespace and so coordination is needed between
multiple replication solutions that might possibly be running on the
same system.  But I want to confirm if that's actually what you're
thinking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Replication identifiers, take 3

From

Andres Freund

Date:

26 September 2014, 14:22:00

On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Let me try to summarize the information requirements for each of these
> >> things.  For #1, you need to know, after crash recovery, for each
> >> standby, the last commit LSN which the client has confirmed via a
> >> feedback message.
> >
> > I'm not sure I understand what you mean here? This is all happening on
> > the *standby*. The standby needs to know, after crash recovery, the
> > latest commit LSN from the primary that it has successfully replayed.
> 
> Ah, sorry, you're right: so, you need to know, after crash recovery,
> for each machine you are replicating *from*, the last transaction (in
> terms of LSN) from that server that you successfully replayed.

Precisely.

> > I don't think a solution which logs the change of origin will be
> > simpler. When the origin is in every record, you can filter without keep
> > track of any state. That's different if you can switch the origin per
> > tx. At the very least you need a in memory entry for the origin.
> 
> But again, that complexity pertains only to logical decoding.

> Somebody who wants to tweak the WAL format for an UPDATE in the future
> doesn't need to understand how this works, or care.

I agree that that's a worthy goal. But I don't see how this isn't the
case with what I propose? This isn't happening on the level of
individual rmgrs/ams - there've been two padding bytes after 'xl_rmid'
in struct XLogRecord for a long time.

There's also the significant advantage that not basing this on the xid
allows it to work correctly with records not tied to a
transaction. There's not that much of that happening yet, but I've
several features in mind:

* separately replicate 2PC commits. 2PC commits don't have an xid anymore... With some tooling on top replication 2PC
intwo phases allow for very cool stuff. Like optionally synchronous multimaster replication.

* I have a pending patch that allows to send 'messages' through logical decoding - yielding a really fast and
persistentqueue. For that it's useful have transactional *and* nontransactional messages.

* Sanely replicating CONCURRENTLY stuff gets harder if you tie things to the xid.

The absolutely, super, uber most convincing reason is:
It's trivial to build tools to analyze how much WAL traffic is generated
by which replication stream and how much by originates locally. A
pg_xlogdump --stats=replication_identifier wouldn't be hard ;)

> You know me: I've
> been a huge advocate of logical decoding.  But just like row-level
> security or BRIN indexes or any other feature, I think it needs to be
> designed in a way that minimizes the impact it has on the rest of the
> system.

Totally agreed. And that always will take some arguing...

> I simply don't believe your contention that this isn't adding
> any complexity to the code path for regular DML operations.  It's
> entirely possible we could need bit space in those records in the
> future for something that actually pertains to those operations; if
> you've burned it for logical decoding, it'll be difficult to claw it
> back.  And what if Tom gets around, some day, to doing that pluggable
> heap AM work?  Then every heap AM has got to allow for those bits, and
> maybe that doesn't happen to be free for them.

As explained above this isn't happening on the level of individual AMs.

> Admittedly, these are hypothetical scenarios, but I don't think
> they're particularly far-fetched.  And as a fringe benefit, if you do
> it the way that I'm proposing, you can use an OID instead of a 16-bit
> thing that we picked to be 16 bits because that happens to be 100% of
> the available bit-space.  Yeah, there's some complexity on decoding,
> but it's minimal: one more piece of fixed-size state to track per XID.
> That's trivial compared to what you've already got.

But it forces you to track the xids/transactions. With my proposal you
can ignore transaction *entirely* unless they manipulate the
catalog. For concurrent OLTP workloads that's quite the advantage.

> >> What's the point of the short-to-long mappings in the first place?  Is
> >> that only required because of the possibility that there might be
> >> multiple replication solutions in play on the same node?
> >
> > In my original proposal, 2 years+ back, I only used short numeric
> > ids. And people didn't like it because it requires coordination between
> > the replication solutions and possibly between servers. Using a string
> > identifier like in the above allows to easily build unique names; and
> > allows every solution to add the information it needs into replication
> > identifiers.
> 
> I get that, but what I'm asking is why those mappings can't be managed
> on a per-replication-solution basis.  I think that's just because
> there's a limited namespace and so coordination is needed between
> multiple replication solutions that might possibly be running on the
> same system.  But I want to confirm if that's actually what you're
> thinking.

Yes, that and that such a mapping needs to be done across all database
are the primary reasons. As it's currently impossible to create further
shared relations you'd have to invent something living in the data
directory on filesystem level... Brr.

I think it'd also be much worse for debugging if there'd be no way to
map such a internal identifier back to the replication solution in some
form.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Robert Haas

Date:

26 September 2014, 14:40:44

On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> As explained above this isn't happening on the level of individual AMs.

Well, that's even worse.  You want to grab 100% of the available
generic bitspace applicable to all record types for purposes specific
to logical decoding (and it's still not really enough bits).

>> I get that, but what I'm asking is why those mappings can't be managed
>> on a per-replication-solution basis.  I think that's just because
>> there's a limited namespace and so coordination is needed between
>> multiple replication solutions that might possibly be running on the
>> same system.  But I want to confirm if that's actually what you're
>> thinking.
>
> Yes, that and that such a mapping needs to be done across all database
> are the primary reasons. As it's currently impossible to create further
> shared relations you'd have to invent something living in the data
> directory on filesystem level... Brr.
>
> I think it'd also be much worse for debugging if there'd be no way to
> map such a internal identifier back to the replication solution in some
> form.

OK.

One question I have is what the structure of the names should be.  It
seems some coordination could be needed here.  I mean, suppose BDR
uses bdr:$NODENAME and Slony uses
$SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
server uses xdb__$NODE_NAME.  That seems like it would be sad.  Maybe
we should decide that names ought to be of the form
<replication-solution>.<further-period-separated-components> or
something like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Replication identifiers, take 3

From

Andres Freund

Date:

26 September 2014, 14:56:50

On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > As explained above this isn't happening on the level of individual AMs.
> 
> Well, that's even worse.  You want to grab 100% of the available
> generic bitspace applicable to all record types for purposes specific
> to logical decoding (and it's still not really enough bits).

I don't think that's a fair characterization. Right now it's available
to precisely nobody. You can't put any data in there in *any* way. It
just has been sitting around unused for at least 8 years.

> One question I have is what the structure of the names should be.  It
> seems some coordination could be needed here.  I mean, suppose BDR
> uses bdr:$NODENAME and Slony uses
> $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
> server uses xdb__$NODE_NAME.  That seems like it would be sad.  Maybe
> we should decide that names ought to be of the form
> <replication-solution>.<further-period-separated-components> or
> something like that.

I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Robert Haas

Date:

26 September 2014, 15:02:26

On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
>> On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > As explained above this isn't happening on the level of individual AMs.
>>
>> Well, that's even worse.  You want to grab 100% of the available
>> generic bitspace applicable to all record types for purposes specific
>> to logical decoding (and it's still not really enough bits).
>
> I don't think that's a fair characterization. Right now it's available
> to precisely nobody. You can't put any data in there in *any* way. It
> just has been sitting around unused for at least 8 years.

Huh?  That's just to say that the unused bit space is, in fact,
unused.  But so what?  We've always been very careful about using up
things like infomask bits, because there are only so many bits
available, and when they're gone they are gone.

>> One question I have is what the structure of the names should be.  It
>> seems some coordination could be needed here.  I mean, suppose BDR
>> uses bdr:$NODENAME and Slony uses
>> $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
>> server uses xdb__$NODE_NAME.  That seems like it would be sad.  Maybe
>> we should decide that names ought to be of the form
>> <replication-solution>.<further-period-separated-components> or
>> something like that.
>
> I've also wondered about that. Perhaps we simply should have an
> additional 'name' column indicating the replication solution?

Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name.  Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Replication identifiers, take 3

From

Andres Freund

Date:

26 September 2014, 16:32:21

On 2014-09-26 11:02:16 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
> >> On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > As explained above this isn't happening on the level of individual AMs.
> >>
> >> Well, that's even worse.  You want to grab 100% of the available
> >> generic bitspace applicable to all record types for purposes specific
> >> to logical decoding (and it's still not really enough bits).
> >
> > I don't think that's a fair characterization. Right now it's available
> > to precisely nobody. You can't put any data in there in *any* way. It
> > just has been sitting around unused for at least 8 years.
> 
> Huh?  That's just to say that the unused bit space is, in fact,
> unused.  But so what?  We've always been very careful about using up
> things like infomask bits, because there are only so many bits
> available, and when they're gone they are gone.

I don't think that's a very meaningful comparison. The problem with
infomask bits is that it's impossible to change anything once added
because of pg_upgrade'ability. That problem does not exist for
XLogRecord. We've twiddled with the WAL format pretty much in every
release. We can reconsider every release.

I can't remember anyone but me thinking about using these two bytes. So
the comparison here really is using two free bytes vs. issuing at least
~30 (record + origin) for every replayed transaction. Don't think that's
a unfair tradeof.

> >> One question I have is what the structure of the names should be.  It
> >> seems some coordination could be needed here.  I mean, suppose BDR
> >> uses bdr:$NODENAME and Slony uses
> >> $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
> >> server uses xdb__$NODE_NAME.  That seems like it would be sad.  Maybe
> >> we should decide that names ought to be of the form
> >> <replication-solution>.<further-period-separated-components> or
> >> something like that.
> >
> > I've also wondered about that. Perhaps we simply should have an
> > additional 'name' column indicating the replication solution?
> 
> Yeah, maybe, but there's still the question of substructure within the
> non-replication-solution part of the name.  Not sure if we can assume
> that a bipartite identifier, specifically, is right, or whether some
> solutions will end up with different numbers of components.

Ah. I thought you only wanted to suggest a separator between the
replication solution and it's internal dat. But you actually want to
suggest an internal separator to be used in the solution's namespace?
I'm fine with that. I don't think we can suggest much beyond that -
different solutions will have fundamentally differing requirements about
which information to store.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Robert Haas

Date:

26 September 2014, 18:57:18

On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Huh?  That's just to say that the unused bit space is, in fact,
>> unused.  But so what?  We've always been very careful about using up
>> things like infomask bits, because there are only so many bits
>> available, and when they're gone they are gone.
>
> I don't think that's a very meaningful comparison. The problem with
> infomask bits is that it's impossible to change anything once added
> because of pg_upgrade'ability. That problem does not exist for
> XLogRecord. We've twiddled with the WAL format pretty much in every
> release. We can reconsider every release.
>
> I can't remember anyone but me thinking about using these two bytes. So
> the comparison here really is using two free bytes vs. issuing at least
> ~30 (record + origin) for every replayed transaction. Don't think that's
> a unfair tradeof.

Mmph.  You have a point about the WAL format being easier to change.
"Reconsidering", though, would mean that some developer who probably
isn't you needs those bytes for something that really is a more
general need than this, so they write a patch to get them back by
doing what I proposed - and then it gets rejected because it's not as
good for logical replication.  So I'm not sure I really buy this as an
argument.  For all practical purposes, if you grab them, they'll be
gone.

>> > I've also wondered about that. Perhaps we simply should have an
>> > additional 'name' column indicating the replication solution?
>>
>> Yeah, maybe, but there's still the question of substructure within the
>> non-replication-solution part of the name.  Not sure if we can assume
>> that a bipartite identifier, specifically, is right, or whether some
>> solutions will end up with different numbers of components.
>
> Ah. I thought you only wanted to suggest a separator between the
> replication solution and it's internal dat. But you actually want to
> suggest an internal separator to be used in the solution's namespace?
> I'm fine with that. I don't think we can suggest much beyond that -
> different solutions will have fundamentally differing requirements about
> which information to store.

Agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Replication identifiers, take 3

From

Andres Freund

Date:

26 September 2014, 22:05:14

On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Huh?  That's just to say that the unused bit space is, in fact,
> >> unused.  But so what?  We've always been very careful about using up
> >> things like infomask bits, because there are only so many bits
> >> available, and when they're gone they are gone.
> >
> > I don't think that's a very meaningful comparison. The problem with
> > infomask bits is that it's impossible to change anything once added
> > because of pg_upgrade'ability. That problem does not exist for
> > XLogRecord. We've twiddled with the WAL format pretty much in every
> > release. We can reconsider every release.
> >
> > I can't remember anyone but me thinking about using these two bytes. So
> > the comparison here really is using two free bytes vs. issuing at least
> > ~30 (record + origin) for every replayed transaction. Don't think that's
> > a unfair tradeof.
> 
> Mmph.  You have a point about the WAL format being easier to change.
> "Reconsidering", though, would mean that some developer who probably
> isn't you needs those bytes for something that really is a more
> general need than this, so they write a patch to get them back by
> doing what I proposed - and then it gets rejected because it's not as
> good for logical replication.  So I'm not sure I really buy this as an
> argument.  For all practical purposes, if you grab them, they'll be
> gone.

Sure, it'll possibly not be trivial to move them elsewhere. On the other
hand, the padding bytes have been unused for 8+ years without somebody
laying "claim" on them but "me". I don't think it's a good idea to leave
them there unused when nobody even has proposed another use for a long
while. That'll just end up with them continuing to be unused.  And
there's actually four more consecutive bytes on 64bit systems that are
unused.

Should there really be a dire need after that, we can simply bump the
record size. WAL volume wise it'd not be too bad to make the record a
tiny bit bigger - the header is only a relatively small fraction of the
entire content.

> >> > I've also wondered about that. Perhaps we simply should have an
> >> > additional 'name' column indicating the replication solution?
> >>
> >> Yeah, maybe, but there's still the question of substructure within the
> >> non-replication-solution part of the name.  Not sure if we can assume
> >> that a bipartite identifier, specifically, is right, or whether some
> >> solutions will end up with different numbers of components.
> >
> > Ah. I thought you only wanted to suggest a separator between the
> > replication solution and it's internal dat. But you actually want to
> > suggest an internal separator to be used in the solution's namespace?
> > I'm fine with that. I don't think we can suggest much beyond that -
> > different solutions will have fundamentally differing requirements about
> > which information to store.
> 
> Agreed.

So, let's recommend underscores as that separator?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Steve Singer

Date:

27 September 2014, 16:11:30

On 09/26/2014 06:05 PM, Andres Freund wrote:
> On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
> Sure, it'll possibly not be trivial to move them elsewhere. On the other
> hand, the padding bytes have been unused for 8+ years without somebody
> laying "claim" on them but "me". I don't think it's a good idea to leave
> them there unused when nobody even has proposed another use for a long
> while. That'll just end up with them continuing to be unused.  And
> there's actually four more consecutive bytes on 64bit systems that are
> unused.
>
> Should there really be a dire need after that, we can simply bump the
> record size. WAL volume wise it'd not be too bad to make the record a
> tiny bit bigger - the header is only a relatively small fraction of the
> entire content.

If we were now increasing the WAL record size anyway for some unrelated 
reason, would we be willing to increase it by a further 2 bytes for the 
node identifier?
If the answer is 'no' then I don't think we can justify using the 2 
padding bytes just because they are there and have been unused for 
years.  But if the answer is yes, we feel this important enough to 
justfiy a slightly (2 byte) larger WAL record header then we shouldn't 
use the excuse of maybe needing those 2 bytes for something else.   When 
something else comes along that needs the WAL space we'll have to 
increase the record size.

To say that if some other patch comes along that needs the space we'll 
redo this feature to use the method Robert describes is unrealistic.  If 
we think that the replication identifier  isn't 
general/important/necessary to justify 2 bytes of WAL header space then 
we should start out with something that doesn't use the WAL header,

Steve

>>>>> I've also wondered about that. Perhaps we simply should have an
>>>>> additional 'name' column indicating the replication solution?
>>>> Yeah, maybe, but there's still the question of substructure within the
>>>> non-replication-solution part of the name.  Not sure if we can assume
>>>> that a bipartite identifier, specifically, is right, or whether some
>>>> solutions will end up with different numbers of components.
>>> Ah. I thought you only wanted to suggest a separator between the
>>> replication solution and it's internal dat. But you actually want to
>>> suggest an internal separator to be used in the solution's namespace?
>>> I'm fine with that. I don't think we can suggest much beyond that -
>>> different solutions will have fundamentally differing requirements about
>>> which information to store.
>> Agreed.
> So, let's recommend underscores as that separator?
>
> Greetings,
>
> Andres Freund
>

Re: Replication identifiers, take 3

From

Steve Singer

Date:

27 September 2014, 16:16:27

On 09/26/2014 10:21 AM, Andres Freund wrote:
> On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
>> On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> Let me try to summarize the information requirements for each of these
>>>> things.  For #1, you need to know, after crash recovery, for each
>>>> standby, the last commit LSN which the client has confirmed via a
>>>> feedback message.
>>> I'm not sure I understand what you mean here? This is all happening on
>>> the *standby*. The standby needs to know, after crash recovery, the
>>> latest commit LSN from the primary that it has successfully replayed.
>> Ah, sorry, you're right: so, you need to know, after crash recovery,
>> for each machine you are replicating *from*, the last transaction (in
>> terms of LSN) from that server that you successfully replayed.
> Precisely.
>
>>> I don't think a solution which logs the change of origin will be
>>> simpler. When the origin is in every record, you can filter without keep
>>> track of any state. That's different if you can switch the origin per
>>> tx. At the very least you need a in memory entry for the origin.
>> But again, that complexity pertains only to logical decoding.
>> Somebody who wants to tweak the WAL format for an UPDATE in the future
>> doesn't need to understand how this works, or care.
> I agree that that's a worthy goal. But I don't see how this isn't the
> case with what I propose? This isn't happening on the level of
> individual rmgrs/ams - there've been two padding bytes after 'xl_rmid'
> in struct XLogRecord for a long time.
>
> There's also the significant advantage that not basing this on the xid
> allows it to work correctly with records not tied to a
> transaction. There's not that much of that happening yet, but I've
> several features in mind:
>
> * separately replicate 2PC commits. 2PC commits don't have an xid
>    anymore... With some tooling on top replication 2PC in two phases
>    allow for very cool stuff. Like optionally synchronous multimaster
>    replication.
> * I have a pending patch that allows to send 'messages' through logical
>    decoding - yielding a really fast and persistent queue. For that it's
>    useful have transactional *and* nontransactional messages.
> * Sanely replicating CONCURRENTLY stuff gets harder if you tie things to
>    the xid.

At what point in the decoding stream should something related to a 
CONCURRENTLY command show up?
Also, for a logical message queue why couldn't you have a xid associated 
with the message that had nothing else in the transaction?

l

Re: Replication identifiers, take 3

From

Andres Freund

Date:

29 September 2014, 08:15:46

On 2014-09-27 12:11:16 -0400, Steve Singer wrote:
> On 09/26/2014 06:05 PM, Andres Freund wrote:
> >On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
> >Sure, it'll possibly not be trivial to move them elsewhere. On the other
> >hand, the padding bytes have been unused for 8+ years without somebody
> >laying "claim" on them but "me". I don't think it's a good idea to leave
> >them there unused when nobody even has proposed another use for a long
> >while. That'll just end up with them continuing to be unused.  And
> >there's actually four more consecutive bytes on 64bit systems that are
> >unused.
> >
> >Should there really be a dire need after that, we can simply bump the
> >record size. WAL volume wise it'd not be too bad to make the record a
> >tiny bit bigger - the header is only a relatively small fraction of the
> >entire content.
> 
> If we were now increasing the WAL record size anyway for some unrelated
> reason, would we be willing to increase it by a further 2 bytes for the node
> identifier?
> If the answer is 'no' then I don't think we can justify using the 2 padding
> bytes just because they are there and have been unused for years.  But if
> the answer is yes, we feel this important enough to justfiy a slightly (2
> byte) larger WAL record header then we shouldn't use the excuse of maybe
> needing those 2 bytes for something else.   When something else comes along
> that needs the WAL space we'll have to increase the record size.

I don't think that's a good way to see this. By that argument these
bytes will never be used.

Also there's four more free bytes on 64bit systems...

> To say that if some other patch comes along that needs the space we'll redo
> this feature to use the method Robert describes is unrealistic.  If we think
> that the replication identifier  isn't general/important/necessary to
> justify 2 bytes of WAL header space then we should start out with something
> that doesn't use the WAL header,

Maintaining complexity also has its costs. And I think that's much more
concrete than some imaginary feature (of which nothing was heard for the
last 8+ years) also needing two bytes.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Robert Haas

Date:

29 September 2014, 14:59:30

On Sat, Sep 27, 2014 at 12:11 PM, Steve Singer <steve@ssinger.info> wrote:
> If we were now increasing the WAL record size anyway for some unrelated
> reason, would we be willing to increase it by a further 2 bytes for the node
> identifier?

Obviously not.  Otherwise Andres would be proposing to put an OID in
there instead of a kooky 16-bit identifier.

> If the answer is 'no' then I don't think we can justify using the 2 padding
> bytes just because they are there and have been unused for years.  But if
> the answer is yes, we feel this important enough to justfiy a slightly (2
> byte) larger WAL record header then we shouldn't use the excuse of maybe
> needing those 2 bytes for something else.   When something else comes along
> that needs the WAL space we'll have to increase the record size.
>
> To say that if some other patch comes along that needs the space we'll redo
> this feature to use the method Robert describes is unrealistic.  If we think
> that the replication identifier  isn't general/important/necessary to
> justify 2 bytes of WAL header space then we should start out with something
> that doesn't use the WAL header,

I lean in that direction too, but would welcome more input from others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Replication identifiers, take 3

From

Heikki Linnakangas

Date:

02 October 2014, 08:49:45

On 09/23/2014 09:24 PM, Andres Freund wrote:
> I've previously started two threads about replication identifiers. Check
> http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
> and
> http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
> .
>
> The've also been discussed in the course of another thread:
> http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de

And even earlier here:

http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-andres@2ndquadrant.com#1339586927-13156-10-git-send-email-andres@2ndquadrant.com
The thread branched a lot, the relevant branch is the one with subject 
"[PATCH 10/16] Introduce the concept that wal has a 'origin' node"

> == Identify the origin of changes ==
>
> Say you're building a replication solution that allows two nodes to
> insert into the same table on two nodes. Ignoring conflict resolution
> and similar fun, one needs to prevent the same change being replayed
> over and over. In logical replication the changes to the heap have to
> be WAL logged, and thus the *replay* of changes from a remote node
> produce WAL which then will be decoded again.
>
> To avoid that it's very useful to tag individual changes/transactions
> with their 'origin'. I.e. mark changes that have been directly
> triggered by the user sending SQL as originating 'locally' and changes
> originating from replaying another node's changes as originating
> somewhere else.
>
> If that origin is exposed to logical decoding output plugins they can
> easily check whether to stream out the changes/transactions or not.
>
>
> It is possible to do this by adding extra columns to every table and
> store the origin of a row in there, but that a) permanently needs
> storage b) makes things much more invasive.

An origin column in the table itself helps tremendously to debug issues 
with the replication system. In many if not most scenarios, I think 
you'd want to have that extra column, even if it's not strictly required.

> What I've previously suggested (and which works well in BDR) is to add
> the internal id to the XLogRecord struct. There's 2 free bytes of
> padding that can be used for that purpose.

Adding a field to XLogRecord for this feels wrong. This is for *logical* 
replication - why do you need to mess with something as physical as the 
WAL record format?

And who's to say that a node ID is the most useful piece of information 
for a replication system to add to the WAL header. I can easily imagine 
that you'd want to put a changeset ID or something else in there, 
instead. (I mentioned another example of this in 
http://www.postgresql.org/message-id/4FE17043.60403@enterprisedb.com)

If we need additional information added to WAL records, for extensions, 
then that should be made in an extensible fashion. IIRC (I couldn't find 
a link right now), when we discussed the changes to heap_insert et al 
for wal_level=logical, I already argued back then that we should make it 
possible for extensions to annotate WAL records, with things like "this 
is the primary key", or whatever information is needed for conflict 
resolution, or handling loops. I don't like it that we're adding little 
pieces of information to the WAL format, bit by bit.

- Heikki

Re: Replication identifiers, take 3

From

Andres Freund

Date:

02 October 2014, 09:30:21

On 2014-10-02 11:49:31 +0300, Heikki Linnakangas wrote:
> On 09/23/2014 09:24 PM, Andres Freund wrote:
> >I've previously started two threads about replication identifiers. Check
> >http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
> >and
> >http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
> >.
> >
> >The've also been discussed in the course of another thread:
> >http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de
> 
> And even earlier here:
>
http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-andres@2ndquadrant.com#1339586927-13156-10-git-send-email-andres@2ndquadrant.com
> The thread branched a lot, the relevant branch is the one with subject
> "[PATCH 10/16] Introduce the concept that wal has a 'origin' node"

Right. Long time ago already ;)

> >== Identify the origin of changes ==
> >
> >Say you're building a replication solution that allows two nodes to
> >insert into the same table on two nodes. Ignoring conflict resolution
> >and similar fun, one needs to prevent the same change being replayed
> >over and over. In logical replication the changes to the heap have to
> >be WAL logged, and thus the *replay* of changes from a remote node
> >produce WAL which then will be decoded again.
> >
> >To avoid that it's very useful to tag individual changes/transactions
> >with their 'origin'. I.e. mark changes that have been directly
> >triggered by the user sending SQL as originating 'locally' and changes
> >originating from replaying another node's changes as originating
> >somewhere else.
> >
> >If that origin is exposed to logical decoding output plugins they can
> >easily check whether to stream out the changes/transactions or not.
> >
> >
> >It is possible to do this by adding extra columns to every table and
> >store the origin of a row in there, but that a) permanently needs
> >storage b) makes things much more invasive.
> 
> An origin column in the table itself helps tremendously to debug issues with
> the replication system. In many if not most scenarios, I think you'd want to
> have that extra column, even if it's not strictly required.

I don't think you'll have much success convincing actual customers of
that. It's one thing to increase the size of the WAL stream a bit, it's
entirely different to persistently increase the table size of all their
tables.

> >What I've previously suggested (and which works well in BDR) is to add
> >the internal id to the XLogRecord struct. There's 2 free bytes of
> >padding that can be used for that purpose.
> 
> Adding a field to XLogRecord for this feels wrong. This is for *logical*
> replication - why do you need to mess with something as physical as the WAL
> record format?

XLogRecord isn't all that "physical". It doesn't encode anything in that
regard but the fact that there's backup blocks in the record. It's
essentially just an implementation detail of logging. Whether that's
physical or logical doesn't really matter much.

There's basically two primary reasons I think it's a good idea to add it
there:

a) There's many different type of records where it's useful to add the  origin. Adding the information to all these
willmake things more  complicated, using more space, and be more fragile. And I'm pretty  sure that the number of
thingspeople will want to expose over  logical replication will increase.

  I know of at least two things that have at least some working code:  Exposing 2PC to logical decoding to allow
optionallysynchronous  replication, and allowing to send transactional/nontransactional  'messages' via the WAL without
writingto a table.

  Now, we could add a framework to attach general information to every  record - but I have a very hard time seing how
thiswill be of  comparable complexity *and* efficiency.

b) It's dead simple with a pretty darn low cost. Both from a runtime as  well as a maintenance perspective.

c) There needs to be crash recovery interation anyway to compute the  state of how far replication succeeded before
crashing.So it's not  like we could make this completely extensible without core code  knowing.

> And who's to say that a node ID is the most useful piece of information for
> a replication system to add to the WAL header. I can easily imagine that
> you'd want to put a changeset ID or something else in there, instead. (I
> mentioned another example of this in
> http://www.postgresql.org/message-id/4FE17043.60403@enterprisedb.com)

I'm onboard with adding a extensible facility to attach data to
successful transactions. There've been at least two people asking me
directly about how to e.g. attach user information to transactions.

I don't think that's equivalent with what I'm talking about here
though. One important thing about this proposal is that it allows to
completely skip (nearly, except cache inval) all records with a
uninteresting origin id *before* decoding them. Without having to keep
any per transaction state about 'uninteresting' transactions.

> If we need additional information added to WAL records, for extensions, then
> that should be made in an extensible fashion

I can see how we'd do that for individual records (e.g. the various
commit records, after combining them), but i have a hard time seing the
cost of doing that for all records worth it. Especially as it seems
likely to require significant increases in wal volume?

> IIRC (I couldn't find a link
> right now), when we discussed the changes to heap_insert et al for
> wal_level=logical, I already argued back then that we should make it
> possible for extensions to annotate WAL records, with things like "this is
> the primary key", or whatever information is needed for conflict resolution,
> or handling loops. I don't like it that we're adding little pieces of
> information to the WAL format, bit by bit.

I don't think this is "adding little pieces of information to the WAL
format, bit by bit.". It's a relatively central piece for allowing
efficient and maintainable logical replication.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Replication identifiers, take 3

From

Robert Haas

Date:

02 October 2014, 12:28:40

On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> An origin column in the table itself helps tremendously to debug issues with
> the replication system. In many if not most scenarios, I think you'd want to
> have that extra column, even if it's not strictly required.

I like a lot of what you wrote here, but I strongly disagree with this
part.  A good replication solution shouldn't require changes to the
objects being replicated.  The triggers that Slony and other current
logical replication solutions use are a problem not only because
they're slow (although that is a problem) but also because they
represent a user-visible wart that people who don't otherwise care
about the fact that their database is being replicated have to be
concerned with.  I would agree that some people might, for particular
use cases, want to include origin information in the table that the
replication system knows about, but it shouldn't be required.

When you look at the replication systems that we have today, you've
basically got streaming replication, which is high-performance and
fairly hands-off (at least once you get it set up properly; that part
can be kind of a bear) but can't cross versions let alone database
systems and requires that the slaves be strictly read-only.  Then on
the flip side you've got things like Slony, Bucardo, and others.  Some
of these allow multi-master; all of them at least allow table-level
determination of which server has the writable copy.  Nearly all of
them are cross-version and some even allow replication into
non-PostgreSQL systems.  But they are slow *and administratively
complex*.  If we're able to get something that feels like streaming
replication from a performance and administrative complexity point of
view but can be cross-version and allow at least some writes on
slaves, that's going to be an epic leap forward for the project.

In my mind, that means it's got to be completely hands-off from a
schema design point of view: you should be able to start up a database
and design it however you want, put anything you like into it, and
then decide later that you want to bolt logical replication on top of
it, just as you can for streaming physical replication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Replication identifiers, take 3

From

Jim Nasby

Date:

04 October 2014, 20:13:56

On 10/2/14, 7:28 AM, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> >An origin column in the table itself helps tremendously to debug issues with
>> >the replication system. In many if not most scenarios, I think you'd want to
>> >have that extra column, even if it's not strictly required.
> I like a lot of what you wrote here, but I strongly disagree with this
> part.  A good replication solution shouldn't require changes to the
> objects being replicated.
I agree that asking users to modify objects is bad, but I also think that if you do have records coming into one table
frommultiple sources then you will need to know what system they originated on.

Maybe some sort of "hidden" column would work here? That means users don't need to modify anything (including anything
doingSELECT *), but the data is there.

As for space concerns I think the answer there is to somehow normalize the identifiers themselves. That has the added
benefitof allowing a rename of a source to propagate to all the data already replicated from that source.

Re: Replication identifiers, take 3

From

Simon Riggs

Date:

29 October 2014, 19:46:51

On 2 October 2014 09:49, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

>> What I've previously suggested (and which works well in BDR) is to add
>> the internal id to the XLogRecord struct. There's 2 free bytes of
>> padding that can be used for that purpose.
>
>
> Adding a field to XLogRecord for this feels wrong. This is for *logical*
> replication - why do you need to mess with something as physical as the WAL
> record format?

Some reasons why this would be OK:

1. We've long agreed that adding information to the WAL record is OK,
just as long as it doesn't interfere (much) with the primary purpose.
It seems OK to change wal_level so it is a list of additional info,
e.g. wal_level = 'hot standby, logical, originid', or just wal_level =
'$new_level_name' that includes originid info

2. We seem to have agreed elsewhere that extensions to WAL will be
allowed. So perhaps redefining those bytes is something that can be
re-used. That way we don't all have to agree what we'll use them for.
Just call a user defined function that returns a 2 byte integer, or
zero if no plugin.

3. As for how many bytes there are - I count 6 wasted bytes on a
MAXALIGN=8 record. Plus we discussed long ago ways we can save another
4 bytes on records that don't have a backup block, since in that case
xl_tot_len == xl_len. I've also got a feeling that WAL records that
are 2 billion bytes long might be overkill. I figure we could easily
make a length limit of something less than that - only commit records
can be longer than 2^19 bytes when they have >65536 subxids, which is
hardly common. Plus xl_prev is always at least 4 byte aligned, so
there are at least 3 bits to be reclaimed from there. (Plus we have 7
unused RmgrId bits, though maybe we want to keep those)

So I count about 14 bytes of potentially reclaimable space in the WAL
record header, or 10 with backup blocks. The reason we never reclaimed
it before was that benchmarks previously showed that reducing the
volume of WAL had little effect on performance, we weren't looking
specifically at WAL volume or info content. (And the perf result is
probably different now anyway).

Should we grovel around to reclaim any of that? Not necessary, the
next person with a good reason to use some space can do that.

Pick any of those: I see no reason to prevent reusing 2 bytes for a
useful purpose, when we do all agree it is a useful purpose.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Thread: Replication identifiers, take 3

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment