Thread: Replication identifiers, take 3
Hi,

I've previously started two threads about replication identifiers. Check
http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
and
http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de .

They've also been discussed in the course of another thread:
http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de

As the topic has garnered some heat and confusion I thought it'd be worthwhile to start afresh with an explanation of why I think they're useful. I don't really want to discuss implementation specifics for now, but rather about (details of) the concept. Once we've hashed those out, I'll adapt the existing patch to match them.

There are three primary use cases for replication identifiers:

1) The ability to monitor how far replication has progressed in a crashsafe manner, to allow it to restart at the right point after errors/crashes.
2) Efficiently identify the origin of individual changes and transactions. In multimaster and some cascading scenarios it is necessary to do so to avoid sending out replayed changes again.
3) Allow to efficiently filter out replayed changes from logical decoding. It's currently possible to filter changes from inside the output plugin, but it's much more efficient to filter them out before decoding.

== Logical Decoding Background ==

To understand the need for 1) it's important to roughly understand how logical decoding/walsender streams changes and handles feedback from the receiving side.

A walsender performing logical decoding *continuously* sends out transactions. As long as there are new local changes (i.e. unprocessed WAL) and the network buffers aren't full, it will send changes. *Without* waiting for the client. Everything else would lead to horrible latency.

Because it sends data without waiting for the client to have processed it, it obviously can't remove resources that are needed to stream the data out again. The client or network connection could crash after all.

To let the sender know when it can remove resources, the receiver regularly sends back 'feedback' messages acknowledging up to where changes have been safely received. Whenever such a feedback message arrives, the sender can release resources that aren't needed to decode the changes below that horizon.

When the receiver asks the server to stream changes out, it tells the sender at which LSN it should start sending changes. All *transactions* that *commit* after that LSN are sent out. Possibly again.

== Crashsafe apply ==

Based on those explanations, when building a logical replication solution on top of logical decoding, one must remember the latest *remote* LSN that has already been replayed. That way, when the apply process or the whole database crashes, it is possible to ask for all changes since the last transaction that has been successfully applied.

The trivial solution here is to have a table (remote_node, last_replayed_lsn) and update it for every replayed transaction. Unfortunately that doesn't perform very well because that table quickly gets heavily bloated. It's also hard to avoid page level contention when replaying transactions from multiple remote nodes. Additionally, these bookkeeping changes have to be filtered out when replicating in a cascading fashion.

To do this more efficiently there needs to be a crashsafe way to track the latest successfully replayed remote transaction.
== Identify the origin of changes ==

Say you're building a replication solution that allows two nodes to insert into the same table. Ignoring conflict resolution and similar fun, one needs to prevent the same change being replayed over and over. In logical replication the changes to the heap have to be WAL logged, and thus the *replay* of changes from a remote node produces WAL which then will be decoded again.

To avoid that it's very useful to tag individual changes/transactions with their 'origin'. I.e. mark changes that have been directly triggered by the user sending SQL as originating 'locally' and changes originating from replaying another node's changes as originating somewhere else.

If that origin is exposed to logical decoding output plugins they can easily check whether to stream out the changes/transactions or not.

It is possible to do this by adding extra columns to every table and storing the origin of a row in there, but that a) permanently needs storage and b) makes things much more invasive.

== Proposed solution ==

These two fundamental problems happen to have overlapping requirements.

A rather efficient solution for 1) is to attach the 'origin node' and the remote commit LSN to every local commit record produced by replay. That allows to have a shared memory "table" (remote_node, local_lsn, remote_lsn). During replay that table is kept up to date in sync with transaction commits. If updated within the transaction commit's critical section it's guaranteed to be correct, even if transactions can abort due to constraint violations and such. When the cluster crashes the table can be rebuilt during crash recovery, by updating values whenever a commit record is read. The primary complexity here is that the identifier of the 'origin' node should be as small as possible to keep the WAL volume down.

Similarly, the problem of identifying the origin of changes during decoding can be solved nicely by adding the origin node to every change/transaction. At first it might seem sufficient to do so at the transaction level, but for cascading scenarios it's very useful to be able to have changes from different source transactions combined into a larger one. Again, the primary problem here is how to efficiently identify origin nodes.

== Replication Identifiers ==

The above explains the need to have identifiers for nodes that are as small as possible. Two years back I'd suggested that we rely on the user to manually assign 16bit ids to individual nodes. Not very surprisingly that was shot down because a) 16bit numbers are not descriptive and b) a per node identifier is problematic because it prohibits replication inside the same cluster.

So, what I've proposed since is to have two different forms of identifiers. A long one, that's as descriptive as $replication_solution wants. And a small one (16bit in my case) that's *only meaningful within one node*. The long, user facing, identifier is locally mapped to the short one.

In the first version I proposed these long identifiers had a fixed form, including the system identifier, timeline id, database id, and a freeform name. That wasn't well received and I agree that that's too narrow. I think it should be freeform text of arbitrary length.

Note that it depends on the replication solution whether these external identifiers need to be coordinated across systems or not. I think it's *good* if we don't propose a solution for that - different replication solutions will have different requirements.
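To make the shared memory "table" described above more concrete, here is a minimal, self-contained C sketch of per-origin progress tracking. The type names, the fixed-size array, and the lock-free update are assumptions made purely for illustration; they are not the proposed patch's actual data structures.

    /*
     * Illustrative sketch of the in-memory progress "table"
     * (remote_node, local_lsn, remote_lsn) described above.
     * All names and the fixed-size array are assumptions, not the real patch.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;     /* stand-in for PostgreSQL's LSN type */
    typedef uint16_t RepNodeId;      /* the proposed small, node-local identifier */

    typedef struct ReplicationProgress
    {
        RepNodeId   origin;          /* which remote node this entry tracks */
        XLogRecPtr  remote_lsn;      /* last remote commit LSN replayed from it */
        XLogRecPtr  local_lsn;       /* LSN of the local commit record doing the replay */
    } ReplicationProgress;

    #define MAX_ORIGINS 16
    static ReplicationProgress progress[MAX_ORIGINS];

    /*
     * Advance the entry for one origin.  In the real proposal this would happen
     * inside the commit's critical section, and again for every commit record
     * read during crash recovery, so the table is rebuilt after a crash.
     */
    static void
    progress_advance(RepNodeId origin, XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
    {
        for (int i = 0; i < MAX_ORIGINS; i++)
        {
            if (progress[i].origin == origin || progress[i].origin == 0)
            {
                progress[i].origin = origin;
                progress[i].remote_lsn = remote_lsn;
                progress[i].local_lsn = local_lsn;
                return;
            }
        }
    }

    int
    main(void)
    {
        /* pretend we replayed two commits originating from remote node 1 */
        progress_advance(1, 0x10, 0x1000);
        progress_advance(1, 0x20, 0x1040);

        /* after a crash, replication can restart from the last replayed remote LSN */
        printf("restart origin 1 at remote LSN %llX\n",
               (unsigned long long) progress[0].remote_lsn);
        return 0;
    }

In a real implementation this array would of course live in shared memory and be protected appropriately; the point is only that one small fixed-size entry per origin replaces the bloat-prone progress table.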
What I've previously suggested (and which works well in BDR) is to add the internal id to the XLogRecord struct. There are 2 free bytes of padding that can be used for that purpose.

There's an example of how this can be used from SQL at
http://archives.postgresql.org/message-id/20131114172632.GE7522@alap2.anarazel.de
That version is built on top of commit timestamps, but that only shows because pg_replication_identifier_setup_tx_origin() allows setting the source transaction's timestamp.

With that, far too long, explanation: is it clearer what I think replication identifiers are for? What are your thoughts?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Thanks for this write-up.

On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> 1) The ability to monitor how far replication has progressed in a crashsafe manner, to allow it to restart at the right point after errors/crashes.
> 2) Efficiently identify the origin of individual changes and transactions. In multimaster and some cascading scenarios it is necessary to do so to avoid sending out replayed changes again.
> 3) Allow to efficiently filter out replayed changes from logical decoding. It's currently possible to filter changes from inside the output plugin, but it's much more efficient to filter them out before decoding.

I agree with these goals.

Let me try to summarize the information requirements for each of these things. For #1, you need to know, after crash recovery, for each standby, the last commit LSN which the client has confirmed via a feedback message. For #2, you need to know, when decoding each change, what the origin node was. And for #3, you need to know, when decoding each change, whether it is of local origin. The information requirements for #3 are a subset of those for #2.

> A rather efficient solution for 1) is to attach the 'origin node' and the remote commit LSN to every local commit record produced by replay. That allows to have a shared memory "table" (remote_node, local_lsn, remote_lsn).

This seems OK to me, modulo some arguing about what the origin node information ought to look like. People who are not using logical replication can use the compact form of the commit record in most cases, and people who are using logical replication can pay for it.

> Similarly, the problem of identifying the origin of changes during decoding can be solved nicely by adding the origin node to every change/transaction. At first it might seem sufficient to do so at the transaction level, but for cascading scenarios it's very useful to be able to have changes from different source transactions combined into a larger one.

I think this is a lot more problematic. I agree that having the data in the commit record isn't sufficient here, because for filtering purposes (for example) you really want to identify the problematic transactions at the beginning, so you can chuck their WAL, rather than reassembling the transaction first and then throwing it out. But putting the origin ID in every insert/update/delete is pretty unappealing from my point of view - not just because it adds bytes to WAL, though that's a non-trivial concern, but also because it adds complexity - IMHO, a non-trivial amount of complexity. I'd be a lot happier with a solution where, say, we have a separate WAL record that says "XID 1234 will be decoding for origin 567 until further notice".

> == Replication Identifiers ==
>
> The above explains the need to have identifiers for nodes that are as small as possible. Two years back I'd suggested that we rely on the user to manually assign 16bit ids to individual nodes. Not very surprisingly that was shot down because a) 16bit numbers are not descriptive and b) a per node identifier is problematic because it prohibits replication inside the same cluster.
>
> So, what I've proposed since is to have two different forms of identifiers. A long one, that's as descriptive as $replication_solution wants. And a small one (16bit in my case) that's *only meaningful within one node*. The long, user facing, identifier is locally mapped to the short one.
> In the first version I proposed these long identifiers had a fixed form, including the system identifier, timeline id, database id, and a freeform name. That wasn't well received and I agree that that's too narrow. I think it should be freeform text of arbitrary length.
>
> Note that it depends on the replication solution whether these external identifiers need to be coordinated across systems or not. I think it's *good* if we don't propose a solution for that - different replication solutions will have different requirements.

I'm pretty fuzzy on how this actually works. Like, the short form here is just getting injected into WAL by the apply process. How does it figure out what value to inject? What if it injects a value that doesn't have a short-to-long mapping? What's the point of the short-to-long mappings in the first place? Is that only required because of the possibility that there might be multiple replication solutions in play on the same node?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 26/09/14 04:44, Robert Haas wrote:
> On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Note that it depends on the replication solution whether these external identifiers need to be coordinated across systems or not. I think it's *good* if we don't propose a solution for that - different replication solutions will have different requirements.
>
> I'm pretty fuzzy on how this actually works. Like, the short form here is just getting injected into WAL by the apply process. How does it figure out what value to inject? What if it injects a value that doesn't have a short-to-long mapping? What's the point of the short-to-long mappings in the first place? Is that only required because of the possibility that there might be multiple replication solutions in play on the same node?

From my perspective the short-to-long mapping is mainly a convenience thing; the long id should be something that can be used to map the identifier to the specific node for the purposes of configuration, monitoring, troubleshooting, etc. You also usually don't use just Oids to represent the DB objects - I see some analogy there. This could potentially be done by the solution itself, not by the framework, but it seems logical (pardon the pun) that most (if not all) solutions will want some kind of mapping of the generated ids to something that represents the logical node.

So the answer to your first two questions depends on the specific solution: it can map it from connection configuration, it can get it from the output plugin as part of the wire protocol, it can generate it based on some internal logic, etc.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
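To illustrate the kind of short-to-long mapping being discussed, here is a small self-contained C sketch. The struct, the example names, and the lookup function are hypothetical; the actual catalog/API shape is exactly what the thread is debating.

    /*
     * Hypothetical sketch of mapping free-form external replication identifiers
     * to compact internal ids.  Names, ids and the lookup API are made up for
     * illustration only.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef uint16_t RepNodeId;

    typedef struct ReplicationIdentifier
    {
        RepNodeId   internal_id;      /* small id, only meaningful on this node */
        const char *external_name;    /* free-form, chosen by the replication solution */
    } ReplicationIdentifier;

    /* what a local mapping table might contain on one node */
    static const ReplicationIdentifier identifiers[] = {
        { 1, "bdr: node-a" },
        { 2, "slony_cluster1_node2" },
    };

    /* resolve the external name an apply process was configured with */
    static RepNodeId
    lookup_replication_identifier(const char *name)
    {
        for (size_t i = 0; i < sizeof(identifiers) / sizeof(identifiers[0]); i++)
        {
            if (strcmp(identifiers[i].external_name, name) == 0)
                return identifiers[i].internal_id;
        }
        return 0;                     /* 0 meaning "unknown/local" in this sketch */
    }

    int
    main(void)
    {
        printf("'bdr: node-a' -> internal id %u\n",
               (unsigned) lookup_replication_identifier("bdr: node-a"));
        return 0;
    }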
On 2014-09-25 22:44:49 -0400, Robert Haas wrote:
> Thanks for this write-up.
>
> On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 1) The ability to monitor how far replication has progressed in a crashsafe manner, to allow it to restart at the right point after errors/crashes.
> > 2) Efficiently identify the origin of individual changes and transactions. In multimaster and some cascading scenarios it is necessary to do so to avoid sending out replayed changes again.
> > 3) Allow to efficiently filter out replayed changes from logical decoding. It's currently possible to filter changes from inside the output plugin, but it's much more efficient to filter them out before decoding.
>
> I agree with these goals.
>
> Let me try to summarize the information requirements for each of these things. For #1, you need to know, after crash recovery, for each standby, the last commit LSN which the client has confirmed via a feedback message.

I'm not sure I understand what you mean here? This is all happening on the *standby*. The standby needs to know, after crash recovery, the latest commit LSN from the primary that it has successfully replayed.

Say you replay the following:

SET synchronous_commit = off;
BEGIN; INSERT INTO foo ... COMMIT /* original LSN 0/10 */;
BEGIN; INSERT INTO foo ... COMMIT /* original LSN 0/20 */;
BEGIN; INSERT INTO foo ... COMMIT /* original LSN 0/30 */;

If postgres crashes at any point during this, we need to know whether we successfully replayed up to 0/10, 0/20 or 0/30. Note that the problem exists independent of s_c=off, it just exacerbates the issue.

Then, after finishing recovery and discovering only 0/10 has persisted, the standby can reconnect to the primary and do
START_REPLICATION SLOT .. LOGICAL 0/10;
and it'll receive all transactions that have committed since 0/10.

> For #2, you need to know, when decoding each change, what the origin node was. And for #3, you need to know, when decoding each change, whether it is of local origin. The information requirements for #3 are a subset of those for #2.

Right. For #3 it's more important to have the information available efficiently on individual records.

> > A rather efficient solution for 1) is to attach the 'origin node' and the remote commit LSN to every local commit record produced by replay. That allows to have a shared memory "table" (remote_node, local_lsn, remote_lsn).
>
> This seems OK to me, modulo some arguing about what the origin node information ought to look like. People who are not using logical replication can use the compact form of the commit record in most cases, and people who are using logical replication can pay for it.

Exactly.

> > Similarly, the problem of identifying the origin of changes during decoding can be solved nicely by adding the origin node to every change/transaction. At first it might seem sufficient to do so at the transaction level, but for cascading scenarios it's very useful to be able to have changes from different source transactions combined into a larger one.
>
> I think this is a lot more problematic. I agree that having the data in the commit record isn't sufficient here, because for filtering purposes (for example) you really want to identify the problematic transactions at the beginning, so you can chuck their WAL, rather than reassembling the transaction first and then throwing it out.
> But putting the origin ID in every insert/update/delete is pretty unappealing from my point of view - not just because it adds bytes to WAL, though that's a non-trivial concern, but also because it adds complexity - IMHO, a non-trivial amount of complexity. I'd be a lot happier with a solution where, say, we have a separate WAL record that says "XID 1234 will be decoding for origin 567 until further notice".

I think it actually ends up much simpler than what you propose. In the apply process, you simply execute
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
or its C equivalent. That sets a global variable which XLogInsert() includes in the record.

Note that this doesn't actually require any additional space in the WAL - padding bytes in struct XLogRecord are used to store the identifier. These have been unused at least since 8.0.

I don't think a solution which logs the change of origin will be simpler. When the origin is in every record, you can filter without keeping track of any state. That's different if you can switch the origin per tx. At the very least you need an in-memory entry for the origin.

> > == Replication Identifiers ==
> >
> > The above explains the need to have identifiers for nodes that are as small as possible. Two years back I'd suggested that we rely on the user to manually assign 16bit ids to individual nodes. Not very surprisingly that was shot down because a) 16bit numbers are not descriptive and b) a per node identifier is problematic because it prohibits replication inside the same cluster.
> >
> > So, what I've proposed since is to have two different forms of identifiers. A long one, that's as descriptive as $replication_solution wants. And a small one (16bit in my case) that's *only meaningful within one node*. The long, user facing, identifier is locally mapped to the short one.
> >
> > In the first version I proposed these long identifiers had a fixed form, including the system identifier, timeline id, database id, and a freeform name. That wasn't well received and I agree that that's too narrow. I think it should be freeform text of arbitrary length.
> >
> > Note that it depends on the replication solution whether these external identifiers need to be coordinated across systems or not. I think it's *good* if we don't propose a solution for that - different replication solutions will have different requirements.
>
> I'm pretty fuzzy on how this actually works. Like, the short form here is just getting injected into WAL by the apply process. How does it figure out what value to inject?

The apply process once does
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
which looks up the internal identifier and stores it in a global variable. That's then filled into struct XLogRecord.

To set up the origin LSN of a transaction
SELECT pg_replication_identifier_setup_tx_origin('0/123456', '2013-12-11 15:14:59.219737+01');
is used. If set up, that'll emit the 'extended' commit record with the remote commit LSN.

> What if it injects a value that doesn't have a short-to-long mapping?

Shouldn't be possible unless you drop a replication identifier after it has been set up by *_replaying_from(). If we feel that's an actually dangerous scenario we can prohibit it with a session level lock.

> What's the point of the short-to-long mappings in the first place?
> Is that only required because of the possibility that there might be multiple replication solutions in play on the same node?

In my original proposal, 2+ years back, I only used short numeric ids. And people didn't like it because it requires coordination between the replication solutions and possibly between servers. Using a string identifier like in the above allows to easily build unique names; and allows every solution to add the information it needs into replication identifiers.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
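As a rough illustration of the mechanism described here - a session-global origin id that the WAL insert path copies into otherwise-unused header padding - consider the following self-contained sketch. FakeXLogRecord is a simplified stand-in, not the real struct XLogRecord layout, and every name in it is an assumption.

    /*
     * Simplified sketch: an apply process sets its origin id once, and every
     * record it subsequently inserts gets stamped with it.  FakeXLogRecord is
     * NOT the real XLogRecord layout; names are illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t RepNodeId;

    #define InvalidRepNodeId 0        /* records produced by ordinary local SQL */

    /* set once per apply session, e.g. by the proposed setup_replaying_from() */
    static RepNodeId replication_origin_id = InvalidRepNodeId;

    typedef struct FakeXLogRecord
    {
        uint32_t    xl_tot_len;       /* total record length */
        uint32_t    xl_xid;           /* xact id, if any */
        uint8_t     xl_rmid;          /* resource manager id */
        RepNodeId   xl_origin_id;     /* would live in today's two padding bytes */
    } FakeXLogRecord;

    /* the insert path stamps the session's origin into every record header */
    static FakeXLogRecord
    fake_xlog_insert(uint8_t rmid, uint32_t xid, uint32_t tot_len)
    {
        FakeXLogRecord rec = { tot_len, xid, rmid, replication_origin_id };
        return rec;
    }

    int
    main(void)
    {
        replication_origin_id = 7;    /* pretend we're replaying changes from node 7 */

        FakeXLogRecord rec = fake_xlog_insert(10, 1234, 64);
        printf("record stamped with origin %u\n", (unsigned) rec.xl_origin_id);
        return 0;
    }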
On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> Let me try to summarize the information requirements for each of these >> things. For #1, you need to know, after crash recovery, for each >> standby, the last commit LSN which the client has confirmed via a >> feedback message. > > I'm not sure I understand what you mean here? This is all happening on > the *standby*. The standby needs to know, after crash recovery, the > latest commit LSN from the primary that it has successfully replayed. Ah, sorry, you're right: so, you need to know, after crash recovery, for each machine you are replicating *from*, the last transaction (in terms of LSN) from that server that you successfully replayed. >> > Similarly, to solve the problem of identifying the origin of changes >> > during decoding, the problem can be solved nicely by adding the origin >> > node of every change to changes/transactions. At first it might seem >> > to be sufficient to do so on transaction level, but for cascading >> > scenarios it's very useful to be able to have changes from different >> > source transactions combinded into a larger one. >> >> I think this is a lot more problematic. I agree that having the data >> in the commit record isn't sufficient here, because for filtering >> purposes (for example) you really want to identify the problematic >> transactions at the beginning, so you can chuck their WAL, rather than >> reassembling the transaction first and then throwing it out. But >> putting the origin ID in every insert/update/delete is pretty >> unappealing from my point of view - not just because it adds bytes to >> WAL, though that's a non-trivial concern, but also because it adds >> complexity - IMHO, a non-trivial amount of complexity. I'd be a lot >> happier with a solution where, say, we have a separate WAL record that >> says "XID 1234 will be decoding for origin 567 until further notice". > > I think it actually ends up much simpler than what you propose. In the > apply process, you simply execute > SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier'); > or it's C equivalent. That sets a global variable which XLogInsert() > includes the record. > Note that this doesn't actually require any additional space in the WAL > - padding bytes in struct XLogRecord are used to store the > identifier. These have been unused at least since 8.0. Sure, that's simpler for logical decoding, for sure. That doesn't make it the right decision for the system overall. > I don't think a solution which logs the change of origin will be > simpler. When the origin is in every record, you can filter without keep > track of any state. That's different if you can switch the origin per > tx. At the very least you need a in memory entry for the origin. But again, that complexity pertains only to logical decoding. Somebody who wants to tweak the WAL format for an UPDATE in the future doesn't need to understand how this works, or care. You know me: I've been a huge advocate of logical decoding. But just like row-level security or BRIN indexes or any other feature, I think it needs to be designed in a way that minimizes the impact it has on the rest of the system. I simply don't believe your contention that this isn't adding any complexity to the code path for regular DML operations. 
It's entirely possible we could need bit space in those records in the future for something that actually pertains to those operations; if you've burned it for logical decoding, it'll be difficult to claw it back. And what if Tom gets around, some day, to doing that pluggable heap AM work? Then every heap AM has got to allow for those bits, and maybe that doesn't happen to be free for them. Admittedly, these are hypothetical scenarios, but I don't think they're particularly far-fetched. And as a fringe benefit, if you do it the way that I'm proposing, you can use an OID instead of a 16-bit thing that we picked to be 16 bits because that happens to be 100% of the available bit-space. Yeah, there's some complexity on decoding, but it's minimal: one more piece of fixed-size state to track per XID. That's trivial compared to what you've already got. >> What's the point of the short-to-long mappings in the first place? Is >> that only required because of the possibility that there might be >> multiple replication solutions in play on the same node? > > In my original proposal, 2 years+ back, I only used short numeric > ids. And people didn't like it because it requires coordination between > the replication solutions and possibly between servers. Using a string > identifier like in the above allows to easily build unique names; and > allows every solution to add the information it needs into replication > identifiers. I get that, but what I'm asking is why those mappings can't be managed on a per-replication-solution basis. I think that's just because there's a limited namespace and so coordination is needed between multiple replication solutions that might possibly be running on the same system. But I want to confirm if that's actually what you're thinking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
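For comparison, here is a self-contained sketch of the decoding-side bookkeeping that the alternative Robert describes would require: one fixed-size entry per XID, populated from a hypothetical "origin announcement" record. The record type, names, and sizes are invented for illustration; no such record type exists.

    /*
     * Sketch of the alternative: a separate WAL record announces
     * "XID 1234 will be decoding for origin 567 until further notice", and the
     * decoder keeps one piece of fixed-size state per XID.  Everything here is
     * illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    typedef uint32_t RepOriginOid;    /* an OID-sized origin id, as suggested */

    #define MAX_TRACKED_XIDS 64

    typedef struct XidOrigin
    {
        TransactionId xid;
        RepOriginOid  origin;
    } XidOrigin;

    static XidOrigin xid_origins[MAX_TRACKED_XIDS];

    /* called when the decoder sees the hypothetical "origin announcement" record */
    static void
    remember_xid_origin(TransactionId xid, RepOriginOid origin)
    {
        for (int i = 0; i < MAX_TRACKED_XIDS; i++)
        {
            if (xid_origins[i].xid == 0 || xid_origins[i].xid == xid)
            {
                xid_origins[i].xid = xid;
                xid_origins[i].origin = origin;
                return;
            }
        }
    }

    /* called for every data record: look up which origin its XID belongs to */
    static RepOriginOid
    origin_for_xid(TransactionId xid)
    {
        for (int i = 0; i < MAX_TRACKED_XIDS; i++)
        {
            if (xid_origins[i].xid == xid)
                return xid_origins[i].origin;
        }
        return 0;                     /* unknown => treat as local */
    }

    int
    main(void)
    {
        remember_xid_origin(1234, 567);
        printf("xid 1234 originates from %u\n", (unsigned) origin_for_xid(1234));
        return 0;
    }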
On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > Let me try to summarize the information requirements for each of these things. For #1, you need to know, after crash recovery, for each standby, the last commit LSN which the client has confirmed via a feedback message.
> >
> > I'm not sure I understand what you mean here? This is all happening on the *standby*. The standby needs to know, after crash recovery, the latest commit LSN from the primary that it has successfully replayed.
>
> Ah, sorry, you're right: so, you need to know, after crash recovery, for each machine you are replicating *from*, the last transaction (in terms of LSN) from that server that you successfully replayed.

Precisely.

> > I don't think a solution which logs the change of origin will be simpler. When the origin is in every record, you can filter without keeping track of any state. That's different if you can switch the origin per tx. At the very least you need an in-memory entry for the origin.
>
> But again, that complexity pertains only to logical decoding. Somebody who wants to tweak the WAL format for an UPDATE in the future doesn't need to understand how this works, or care.

I agree that that's a worthy goal. But I don't see how this isn't the case with what I propose? This isn't happening on the level of individual rmgrs/ams - there've been two padding bytes after 'xl_rmid' in struct XLogRecord for a long time.

There's also the significant advantage that not basing this on the xid allows it to work correctly with records not tied to a transaction. There's not that much of that happening yet, but I've several features in mind:

* Separately replicate 2PC commits. 2PC commits don't have an xid anymore... With some tooling on top, replicating 2PC in two phases allows for very cool stuff. Like optionally synchronous multimaster replication.
* I have a pending patch that allows to send 'messages' through logical decoding - yielding a really fast and persistent queue. For that it's useful to have transactional *and* nontransactional messages.
* Sanely replicating CONCURRENTLY stuff gets harder if you tie things to the xid.

The absolutely, super, uber most convincing reason is: It's trivial to build tools to analyze how much WAL traffic is generated by which replication stream and how much originates locally. A pg_xlogdump --stats=replication_identifier wouldn't be hard ;)

> You know me: I've been a huge advocate of logical decoding. But just like row-level security or BRIN indexes or any other feature, I think it needs to be designed in a way that minimizes the impact it has on the rest of the system.

Totally agreed. And that always will take some arguing...

> I simply don't believe your contention that this isn't adding any complexity to the code path for regular DML operations. It's entirely possible we could need bit space in those records in the future for something that actually pertains to those operations; if you've burned it for logical decoding, it'll be difficult to claw it back. And what if Tom gets around, some day, to doing that pluggable heap AM work? Then every heap AM has got to allow for those bits, and maybe that doesn't happen to be free for them.

As explained above this isn't happening on the level of individual AMs.

> Admittedly, these are hypothetical scenarios, but I don't think they're particularly far-fetched.
> And as a fringe benefit, if you do it the way that I'm proposing, you can use an OID instead of a 16-bit thing that we picked to be 16 bits because that happens to be 100% of the available bit-space. Yeah, there's some complexity on decoding, but it's minimal: one more piece of fixed-size state to track per XID. That's trivial compared to what you've already got.

But it forces you to track the xids/transactions. With my proposal you can ignore transactions *entirely* unless they manipulate the catalog. For concurrent OLTP workloads that's quite the advantage.

> > > What's the point of the short-to-long mappings in the first place? Is that only required because of the possibility that there might be multiple replication solutions in play on the same node?
> >
> > In my original proposal, 2+ years back, I only used short numeric ids. And people didn't like it because it requires coordination between the replication solutions and possibly between servers. Using a string identifier like in the above allows to easily build unique names; and allows every solution to add the information it needs into replication identifiers.
>
> I get that, but what I'm asking is why those mappings can't be managed on a per-replication-solution basis. I think that's just because there's a limited namespace and so coordination is needed between multiple replication solutions that might possibly be running on the same system. But I want to confirm if that's actually what you're thinking.

Yes, that and that such a mapping needs to be done across all databases are the primary reasons. As it's currently impossible to create further shared relations you'd have to invent something living in the data directory at the filesystem level... Brr.

I think it'd also be much worse for debugging if there'd be no way to map such an internal identifier back to the replication solution in some form.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
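The contrast with the per-XID bookkeeping sketched earlier is that, with the origin stamped into every record, a cascading node can decide per record whether it cares, without tracking transactions at all. A minimal illustration, again with invented names and a fake record type:

    /*
     * Sketch of origin-based filtering when every record carries its origin id.
     * FakeRecord and the ids are invented; the point is that no per-transaction
     * state is needed to skip records from an uninteresting origin.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t RepNodeId;

    #define InvalidRepNodeId 0        /* locally originating change */

    typedef struct FakeRecord
    {
        RepNodeId   xl_origin_id;     /* stamped at insert time */
    } FakeRecord;

    /* a node that only forwards locally originating changes */
    static bool
    want_record(const FakeRecord *rec)
    {
        return rec->xl_origin_id == InvalidRepNodeId;   /* no xid tracking needed */
    }

    int
    main(void)
    {
        FakeRecord local_change    = { InvalidRepNodeId };
        FakeRecord replayed_change = { 7 };

        printf("keep local change:    %d\n", want_record(&local_change));
        printf("keep replayed change: %d\n", want_record(&replayed_change));
        return 0;
    }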
On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote: > As explained above this isn't happening on the level of individual AMs. Well, that's even worse. You want to grab 100% of the available generic bitspace applicable to all record types for purposes specific to logical decoding (and it's still not really enough bits). >> I get that, but what I'm asking is why those mappings can't be managed >> on a per-replication-solution basis. I think that's just because >> there's a limited namespace and so coordination is needed between >> multiple replication solutions that might possibly be running on the >> same system. But I want to confirm if that's actually what you're >> thinking. > > Yes, that and that such a mapping needs to be done across all database > are the primary reasons. As it's currently impossible to create further > shared relations you'd have to invent something living in the data > directory on filesystem level... Brr. > > I think it'd also be much worse for debugging if there'd be no way to > map such a internal identifier back to the replication solution in some > form. OK. One question I have is what the structure of the names should be. It seems some coordination could be needed here. I mean, suppose BDR uses bdr:$NODENAME and Slony uses $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe we should decide that names ought to be of the form <replication-solution>.<further-period-separated-components> or something like that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-26 10:40:37 -0400, Robert Haas wrote: > On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > As explained above this isn't happening on the level of individual AMs. > > Well, that's even worse. You want to grab 100% of the available > generic bitspace applicable to all record types for purposes specific > to logical decoding (and it's still not really enough bits). I don't think that's a fair characterization. Right now it's available to precisely nobody. You can't put any data in there in *any* way. It just has been sitting around unused for at least 8 years. > One question I have is what the structure of the names should be. It > seems some coordination could be needed here. I mean, suppose BDR > uses bdr:$NODENAME and Slony uses > $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication > server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe > we should decide that names ought to be of the form > <replication-solution>.<further-period-separated-components> or > something like that. I've also wondered about that. Perhaps we simply should have an additional 'name' column indicating the replication solution? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-09-26 10:40:37 -0400, Robert Haas wrote: >> On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > As explained above this isn't happening on the level of individual AMs. >> >> Well, that's even worse. You want to grab 100% of the available >> generic bitspace applicable to all record types for purposes specific >> to logical decoding (and it's still not really enough bits). > > I don't think that's a fair characterization. Right now it's available > to precisely nobody. You can't put any data in there in *any* way. It > just has been sitting around unused for at least 8 years. Huh? That's just to say that the unused bit space is, in fact, unused. But so what? We've always been very careful about using up things like infomask bits, because there are only so many bits available, and when they're gone they are gone. >> One question I have is what the structure of the names should be. It >> seems some coordination could be needed here. I mean, suppose BDR >> uses bdr:$NODENAME and Slony uses >> $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication >> server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe >> we should decide that names ought to be of the form >> <replication-solution>.<further-period-separated-components> or >> something like that. > > I've also wondered about that. Perhaps we simply should have an > additional 'name' column indicating the replication solution? Yeah, maybe, but there's still the question of substructure within the non-replication-solution part of the name. Not sure if we can assume that a bipartite identifier, specifically, is right, or whether some solutions will end up with different numbers of components. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-26 11:02:16 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
> > > On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > > As explained above this isn't happening on the level of individual AMs.
> > >
> > > Well, that's even worse. You want to grab 100% of the available generic bitspace applicable to all record types for purposes specific to logical decoding (and it's still not really enough bits).
> >
> > I don't think that's a fair characterization. Right now it's available to precisely nobody. You can't put any data in there in *any* way. It just has been sitting around unused for at least 8 years.
>
> Huh? That's just to say that the unused bit space is, in fact, unused. But so what? We've always been very careful about using up things like infomask bits, because there are only so many bits available, and when they're gone they are gone.

I don't think that's a very meaningful comparison. The problem with infomask bits is that it's impossible to change anything once added because of pg_upgrade'ability. That problem does not exist for XLogRecord. We've twiddled with the WAL format pretty much in every release. We can reconsider every release.

I can't remember anyone but me thinking about using these two bytes. So the comparison here really is using two free bytes vs. issuing at least ~30 bytes (record + origin) for every replayed transaction. I don't think that's an unfair tradeoff.

> > > One question I have is what the structure of the names should be. It seems some coordination could be needed here. I mean, suppose BDR uses bdr:$NODENAME and Slony uses $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe we should decide that names ought to be of the form <replication-solution>.<further-period-separated-components> or something like that.
> >
> > I've also wondered about that. Perhaps we simply should have an additional 'name' column indicating the replication solution?
>
> Yeah, maybe, but there's still the question of substructure within the non-replication-solution part of the name. Not sure if we can assume that a bipartite identifier, specifically, is right, or whether some solutions will end up with different numbers of components.

Ah. I thought you only wanted to suggest a separator between the replication solution and its internal data. But you actually want to suggest an internal separator to be used in the solution's namespace? I'm fine with that. I don't think we can suggest much beyond that - different solutions will have fundamentally differing requirements about which information to store.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Huh? That's just to say that the unused bit space is, in fact, >> unused. But so what? We've always been very careful about using up >> things like infomask bits, because there are only so many bits >> available, and when they're gone they are gone. > > I don't think that's a very meaningful comparison. The problem with > infomask bits is that it's impossible to change anything once added > because of pg_upgrade'ability. That problem does not exist for > XLogRecord. We've twiddled with the WAL format pretty much in every > release. We can reconsider every release. > > I can't remember anyone but me thinking about using these two bytes. So > the comparison here really is using two free bytes vs. issuing at least > ~30 (record + origin) for every replayed transaction. Don't think that's > a unfair tradeof. Mmph. You have a point about the WAL format being easier to change. "Reconsidering", though, would mean that some developer who probably isn't you needs those bytes for something that really is a more general need than this, so they write a patch to get them back by doing what I proposed - and then it gets rejected because it's not as good for logical replication. So I'm not sure I really buy this as an argument. For all practical purposes, if you grab them, they'll be gone. >> > I've also wondered about that. Perhaps we simply should have an >> > additional 'name' column indicating the replication solution? >> >> Yeah, maybe, but there's still the question of substructure within the >> non-replication-solution part of the name. Not sure if we can assume >> that a bipartite identifier, specifically, is right, or whether some >> solutions will end up with different numbers of components. > > Ah. I thought you only wanted to suggest a separator between the > replication solution and it's internal dat. But you actually want to > suggest an internal separator to be used in the solution's namespace? > I'm fine with that. I don't think we can suggest much beyond that - > different solutions will have fundamentally differing requirements about > which information to store. Agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-26 14:57:12 -0400, Robert Haas wrote: > On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> Huh? That's just to say that the unused bit space is, in fact, > >> unused. But so what? We've always been very careful about using up > >> things like infomask bits, because there are only so many bits > >> available, and when they're gone they are gone. > > > > I don't think that's a very meaningful comparison. The problem with > > infomask bits is that it's impossible to change anything once added > > because of pg_upgrade'ability. That problem does not exist for > > XLogRecord. We've twiddled with the WAL format pretty much in every > > release. We can reconsider every release. > > > > I can't remember anyone but me thinking about using these two bytes. So > > the comparison here really is using two free bytes vs. issuing at least > > ~30 (record + origin) for every replayed transaction. Don't think that's > > a unfair tradeof. > > Mmph. You have a point about the WAL format being easier to change. > "Reconsidering", though, would mean that some developer who probably > isn't you needs those bytes for something that really is a more > general need than this, so they write a patch to get them back by > doing what I proposed - and then it gets rejected because it's not as > good for logical replication. So I'm not sure I really buy this as an > argument. For all practical purposes, if you grab them, they'll be > gone. Sure, it'll possibly not be trivial to move them elsewhere. On the other hand, the padding bytes have been unused for 8+ years without somebody laying "claim" on them but "me". I don't think it's a good idea to leave them there unused when nobody even has proposed another use for a long while. That'll just end up with them continuing to be unused. And there's actually four more consecutive bytes on 64bit systems that are unused. Should there really be a dire need after that, we can simply bump the record size. WAL volume wise it'd not be too bad to make the record a tiny bit bigger - the header is only a relatively small fraction of the entire content. > >> > I've also wondered about that. Perhaps we simply should have an > >> > additional 'name' column indicating the replication solution? > >> > >> Yeah, maybe, but there's still the question of substructure within the > >> non-replication-solution part of the name. Not sure if we can assume > >> that a bipartite identifier, specifically, is right, or whether some > >> solutions will end up with different numbers of components. > > > > Ah. I thought you only wanted to suggest a separator between the > > replication solution and it's internal dat. But you actually want to > > suggest an internal separator to be used in the solution's namespace? > > I'm fine with that. I don't think we can suggest much beyond that - > > different solutions will have fundamentally differing requirements about > > which information to store. > > Agreed. So, let's recommend underscores as that separator? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/26/2014 06:05 PM, Andres Freund wrote:
> On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
> Sure, it'll possibly not be trivial to move them elsewhere. On the other hand, the padding bytes have been unused for 8+ years without somebody laying "claim" on them but "me". I don't think it's a good idea to leave them there unused when nobody even has proposed another use for a long while. That'll just end up with them continuing to be unused. And there's actually four more consecutive bytes on 64bit systems that are unused.
>
> Should there really be a dire need after that, we can simply bump the record size. WAL volume wise it'd not be too bad to make the record a tiny bit bigger - the header is only a relatively small fraction of the entire content.

If we were now increasing the WAL record size anyway for some unrelated reason, would we be willing to increase it by a further 2 bytes for the node identifier?

If the answer is 'no' then I don't think we can justify using the 2 padding bytes just because they are there and have been unused for years. But if the answer is yes, we feel this is important enough to justify a slightly (2 byte) larger WAL record header, then we shouldn't use the excuse of maybe needing those 2 bytes for something else. When something else comes along that needs the WAL space we'll have to increase the record size.

To say that if some other patch comes along that needs the space we'll redo this feature to use the method Robert describes is unrealistic. If we think that the replication identifier isn't general/important/necessary enough to justify 2 bytes of WAL header space then we should start out with something that doesn't use the WAL header.

Steve

>>>>> I've also wondered about that. Perhaps we simply should have an additional 'name' column indicating the replication solution?
>>>> Yeah, maybe, but there's still the question of substructure within the non-replication-solution part of the name. Not sure if we can assume that a bipartite identifier, specifically, is right, or whether some solutions will end up with different numbers of components.
>>> Ah. I thought you only wanted to suggest a separator between the replication solution and its internal data. But you actually want to suggest an internal separator to be used in the solution's namespace? I'm fine with that. I don't think we can suggest much beyond that - different solutions will have fundamentally differing requirements about which information to store.
>> Agreed.
> So, let's recommend underscores as that separator?
>
> Greetings,
>
> Andres Freund
On 09/26/2014 10:21 AM, Andres Freund wrote:
> On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
>> On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> Let me try to summarize the information requirements for each of these things. For #1, you need to know, after crash recovery, for each standby, the last commit LSN which the client has confirmed via a feedback message.
>>> I'm not sure I understand what you mean here? This is all happening on the *standby*. The standby needs to know, after crash recovery, the latest commit LSN from the primary that it has successfully replayed.
>> Ah, sorry, you're right: so, you need to know, after crash recovery, for each machine you are replicating *from*, the last transaction (in terms of LSN) from that server that you successfully replayed.
> Precisely.
>
>>> I don't think a solution which logs the change of origin will be simpler. When the origin is in every record, you can filter without keeping track of any state. That's different if you can switch the origin per tx. At the very least you need an in-memory entry for the origin.
>> But again, that complexity pertains only to logical decoding. Somebody who wants to tweak the WAL format for an UPDATE in the future doesn't need to understand how this works, or care.
> I agree that that's a worthy goal. But I don't see how this isn't the case with what I propose? This isn't happening on the level of individual rmgrs/ams - there've been two padding bytes after 'xl_rmid' in struct XLogRecord for a long time.
>
> There's also the significant advantage that not basing this on the xid allows it to work correctly with records not tied to a transaction. There's not that much of that happening yet, but I've several features in mind:
>
> * Separately replicate 2PC commits. 2PC commits don't have an xid anymore... With some tooling on top, replicating 2PC in two phases allows for very cool stuff. Like optionally synchronous multimaster replication.
> * I have a pending patch that allows to send 'messages' through logical decoding - yielding a really fast and persistent queue. For that it's useful to have transactional *and* nontransactional messages.
> * Sanely replicating CONCURRENTLY stuff gets harder if you tie things to the xid.

At what point in the decoding stream should something related to a CONCURRENTLY command show up?

Also, for a logical message queue why couldn't you have an xid associated with the message that had nothing else in the transaction?
On 2014-09-27 12:11:16 -0400, Steve Singer wrote: > On 09/26/2014 06:05 PM, Andres Freund wrote: > >On 2014-09-26 14:57:12 -0400, Robert Haas wrote: > >Sure, it'll possibly not be trivial to move them elsewhere. On the other > >hand, the padding bytes have been unused for 8+ years without somebody > >laying "claim" on them but "me". I don't think it's a good idea to leave > >them there unused when nobody even has proposed another use for a long > >while. That'll just end up with them continuing to be unused. And > >there's actually four more consecutive bytes on 64bit systems that are > >unused. > > > >Should there really be a dire need after that, we can simply bump the > >record size. WAL volume wise it'd not be too bad to make the record a > >tiny bit bigger - the header is only a relatively small fraction of the > >entire content. > > If we were now increasing the WAL record size anyway for some unrelated > reason, would we be willing to increase it by a further 2 bytes for the node > identifier? > If the answer is 'no' then I don't think we can justify using the 2 padding > bytes just because they are there and have been unused for years. But if > the answer is yes, we feel this important enough to justfiy a slightly (2 > byte) larger WAL record header then we shouldn't use the excuse of maybe > needing those 2 bytes for something else. When something else comes along > that needs the WAL space we'll have to increase the record size. I don't think that's a good way to see this. By that argument these bytes will never be used. Also there's four more free bytes on 64bit systems... > To say that if some other patch comes along that needs the space we'll redo > this feature to use the method Robert describes is unrealistic. If we think > that the replication identifier isn't general/important/necessary to > justify 2 bytes of WAL header space then we should start out with something > that doesn't use the WAL header, Maintaining complexity also has its costs. And I think that's much more concrete than some imaginary feature (of which nothing was heard for the last 8+ years) also needing two bytes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Sep 27, 2014 at 12:11 PM, Steve Singer <steve@ssinger.info> wrote: > If we were now increasing the WAL record size anyway for some unrelated > reason, would we be willing to increase it by a further 2 bytes for the node > identifier? Obviously not. Otherwise Andres would be proposing to put an OID in there instead of a kooky 16-bit identifier. > If the answer is 'no' then I don't think we can justify using the 2 padding > bytes just because they are there and have been unused for years. But if > the answer is yes, we feel this important enough to justfiy a slightly (2 > byte) larger WAL record header then we shouldn't use the excuse of maybe > needing those 2 bytes for something else. When something else comes along > that needs the WAL space we'll have to increase the record size. > > To say that if some other patch comes along that needs the space we'll redo > this feature to use the method Robert describes is unrealistic. If we think > that the replication identifier isn't general/important/necessary to > justify 2 bytes of WAL header space then we should start out with something > that doesn't use the WAL header, I lean in that direction too, but would welcome more input from others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/23/2014 09:24 PM, Andres Freund wrote: > I've previously started two threads about replication identifiers. Check > http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de > and > http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de > . > > The've also been discussed in the course of another thread: > http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de And even earlier here: http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-andres@2ndquadrant.com#1339586927-13156-10-git-send-email-andres@2ndquadrant.com The thread branched a lot, the relevant branch is the one with subject "[PATCH 10/16] Introduce the concept that wal has a 'origin' node" > == Identify the origin of changes == > > Say you're building a replication solution that allows two nodes to > insert into the same table on two nodes. Ignoring conflict resolution > and similar fun, one needs to prevent the same change being replayed > over and over. In logical replication the changes to the heap have to > be WAL logged, and thus the *replay* of changes from a remote node > produce WAL which then will be decoded again. > > To avoid that it's very useful to tag individual changes/transactions > with their 'origin'. I.e. mark changes that have been directly > triggered by the user sending SQL as originating 'locally' and changes > originating from replaying another node's changes as originating > somewhere else. > > If that origin is exposed to logical decoding output plugins they can > easily check whether to stream out the changes/transactions or not. > > > It is possible to do this by adding extra columns to every table and > store the origin of a row in there, but that a) permanently needs > storage b) makes things much more invasive. An origin column in the table itself helps tremendously to debug issues with the replication system. In many if not most scenarios, I think you'd want to have that extra column, even if it's not strictly required. > What I've previously suggested (and which works well in BDR) is to add > the internal id to the XLogRecord struct. There's 2 free bytes of > padding that can be used for that purpose. Adding a field to XLogRecord for this feels wrong. This is for *logical* replication - why do you need to mess with something as physical as the WAL record format? And who's to say that a node ID is the most useful piece of information for a replication system to add to the WAL header. I can easily imagine that you'd want to put a changeset ID or something else in there, instead. (I mentioned another example of this in http://www.postgresql.org/message-id/4FE17043.60403@enterprisedb.com) If we need additional information added to WAL records, for extensions, then that should be made in an extensible fashion. IIRC (I couldn't find a link right now), when we discussed the changes to heap_insert et al for wal_level=logical, I already argued back then that we should make it possible for extensions to annotate WAL records, with things like "this is the primary key", or whatever information is needed for conflict resolution, or handling loops. I don't like it that we're adding little pieces of information to the WAL format, bit by bit. - Heikki
On 2014-10-02 11:49:31 +0300, Heikki Linnakangas wrote: > On 09/23/2014 09:24 PM, Andres Freund wrote: > >I've previously started two threads about replication identifiers. Check > >http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de > >and > >http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de > >. > > > >They've also been discussed in the course of another thread: > >http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de > > And even earlier here: > http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-andres@2ndquadrant.com#1339586927-13156-10-git-send-email-andres@2ndquadrant.com > The thread branched a lot, the relevant branch is the one with subject > "[PATCH 10/16] Introduce the concept that wal has a 'origin' node" Right. Long time ago already ;) > >== Identify the origin of changes == > > > >Say you're building a replication solution that allows two nodes to > >insert into the same table on two nodes. Ignoring conflict resolution > >and similar fun, one needs to prevent the same change being replayed > >over and over. In logical replication the changes to the heap have to > >be WAL logged, and thus the *replay* of changes from a remote node > >produce WAL which then will be decoded again. > > > >To avoid that it's very useful to tag individual changes/transactions > >with their 'origin'. I.e. mark changes that have been directly > >triggered by the user sending SQL as originating 'locally' and changes > >originating from replaying another node's changes as originating > >somewhere else. > > > >If that origin is exposed to logical decoding output plugins they can > >easily check whether to stream out the changes/transactions or not. > > > > > >It is possible to do this by adding extra columns to every table and > >store the origin of a row in there, but that a) permanently needs > >storage b) makes things much more invasive. > > An origin column in the table itself helps tremendously to debug issues with > the replication system. In many if not most scenarios, I think you'd want to > have that extra column, even if it's not strictly required. I don't think you'll have much success convincing actual customers of that. It's one thing to increase the size of the WAL stream a bit; it's entirely different to persistently increase the table size of all their tables. > >What I've previously suggested (and which works well in BDR) is to add > >the internal id to the XLogRecord struct. There's 2 free bytes of > >padding that can be used for that purpose. > > Adding a field to XLogRecord for this feels wrong. This is for *logical* > replication - why do you need to mess with something as physical as the WAL > record format? XLogRecord isn't all that "physical". It doesn't encode anything in that regard but the fact that there are backup blocks in the record. It's essentially just an implementation detail of logging. Whether that's physical or logical doesn't really matter much. There are basically two primary reasons I think it's a good idea to add it there: a) There are many different types of records where it's useful to add the origin. Adding the information to all of these will make things more complicated, use more space, and be more fragile. And I'm pretty sure that the number of things people will want to expose over logical replication will increase.
I know of at least two things that have at least some working code: Exposing 2PC to logical decoding to allow optionally synchronous replication, and allowing transactional/nontransactional 'messages' to be sent via the WAL without writing to a table. Now, we could add a framework to attach general information to every record - but I have a very hard time seeing how this will be of comparable complexity *and* efficiency. b) It's dead simple with a pretty darn low cost. Both from a runtime as well as a maintenance perspective. c) There needs to be crash recovery interaction anyway to compute the state of how far replication succeeded before crashing. So it's not like we could make this completely extensible without core code knowing. > And who's to say that a node ID is the most useful piece of information for > a replication system to add to the WAL header. I can easily imagine that > you'd want to put a changeset ID or something else in there, instead. (I > mentioned another example of this in > http://www.postgresql.org/message-id/4FE17043.60403@enterprisedb.com) I'm onboard with adding an extensible facility to attach data to successful transactions. There've been at least two people asking me directly about how to e.g. attach user information to transactions. I don't think that's equivalent with what I'm talking about here though. One important thing about this proposal is that it allows completely skipping (nearly, except cache inval) all records with an uninteresting origin id *before* decoding them. Without having to keep any per transaction state about 'uninteresting' transactions. > If we need additional information added to WAL records, for extensions, then > that should be made in an extensible fashion I can see how we'd do that for individual records (e.g. the various commit records, after combining them), but I have a hard time seeing the cost of doing that for all records as being worth it. Especially as it seems likely to require significant increases in WAL volume? > IIRC (I couldn't find a link > right now), when we discussed the changes to heap_insert et al for > wal_level=logical, I already argued back then that we should make it > possible for extensions to annotate WAL records, with things like "this is > the primary key", or whatever information is needed for conflict resolution, > or handling loops. I don't like it that we're adding little pieces of > information to the WAL format, bit by bit. I don't think this is "adding little pieces of information to the WAL format, bit by bit". It's a relatively central piece for allowing efficient and maintainable logical replication. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
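For reference, the "free" bytes being referred to are alignment padding in the fixed per-record WAL header. A rough sketch of the pre-9.5 layout (PostgreSQL typedefs; the struct name and comments here are illustrative rather than the literal definition):

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xlogdefs.h"

/*
 * Approximation of the fixed WAL record header before the 9.5 WAL format
 * rework.  On MAXALIGN=8 builds two bytes of alignment padding fall between
 * xl_rmid and xl_prev -- the space the patch proposes to reuse for a 16 bit
 * origin id.
 */
typedef struct XLogRecordSketch
{
    uint32        xl_tot_len;    /* total length of the entire record */
    TransactionId xl_xid;        /* xact id */
    uint32        xl_len;        /* length of the rmgr data */
    uint8         xl_info;       /* flag bits */
    RmgrId        xl_rmid;       /* resource manager for this record */
    /* 2 unused bytes of padding here */
    XLogRecPtr    xl_prev;       /* pointer to previous record in log */
    uint32        xl_crc;        /* CRC of this record (pg_crc32) */
} XLogRecordSketch;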
On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > An origin column in the table itself helps tremendously to debug issues with > the replication system. In many if not most scenarios, I think you'd want to > have that extra column, even if it's not strictly required. I like a lot of what you wrote here, but I strongly disagree with this part. A good replication solution shouldn't require changes to the objects being replicated. The triggers that Slony and other current logical replication solutions use are a problem not only because they're slow (although that is a problem) but also because they represent a user-visible wart that people who don't otherwise care about the fact that their database is being replicated have to be concerned with. I would agree that some people might, for particular use cases, want to include origin information in the table that the replication system knows about, but it shouldn't be required. When you look at the replication systems that we have today, you've basically got streaming replication, which is high-performance and fairly hands-off (at least once you get it set up properly; that part can be kind of a bear) but can't cross versions let alone database systems and requires that the slaves be strictly read-only. Then on the flip side you've got things like Slony, Bucardo, and others. Some of these allow multi-master; all of them at least allow table-level determination of which server has the writable copy. Nearly all of them are cross-version and some even allow replication into non-PostgreSQL systems. But they are slow *and administratively complex*. If we're able to get something that feels like streaming replication from a performance and administrative complexity point of view but can be cross-version and allow at least some writes on slaves, that's going to be an epic leap forward for the project. In my mind, that means it's got to be completely hands-off from a schema design point of view: you should be able to start up a database and design it however you want, put anything you like into it, and then decide later that you want to bolt logical replication on top of it, just as you can for streaming physical replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/2/14, 7:28 AM, Robert Haas wrote: > On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> >An origin column in the table itself helps tremendously to debug issues with >> >the replication system. In many if not most scenarios, I think you'd want to >> >have that extra column, even if it's not strictly required. > I like a lot of what you wrote here, but I strongly disagree with this > part. A good replication solution shouldn't require changes to the > objects being replicated. I agree that asking users to modify objects is bad, but I also think that if you do have records coming into one table from multiple sources then you will need to know what system they originated on. Maybe some sort of "hidden" column would work here? That means users don't need to modify anything (including anything doing SELECT *), but the data is there. As for space concerns I think the answer there is to somehow normalize the identifiers themselves. That has the added benefit of allowing a rename of a source to propagate to all the data already replicated from that source.
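Concretely, the normalization suggested above might look something like this (purely illustrative SQL; table and column names are made up and this is not part of the proposal):

-- a tiny catalog of origins, so the per-row cost is only a smallint
CREATE TABLE replication_origins (
    origin_id   smallint PRIMARY KEY,
    origin_name text NOT NULL UNIQUE
);

-- an example replicated table carrying a cheap per-row origin reference;
-- renaming a source only touches replication_origins, not the data
CREATE TABLE replicated_table (
    id         integer PRIMARY KEY,
    payload    text,
    row_origin smallint REFERENCES replication_origins (origin_id)
);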
On 2 October 2014 09:49, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> What I've previously suggested (and which works well in BDR) is to add >> the internal id to the XLogRecord struct. There's 2 free bytes of >> padding that can be used for that purpose. > > > Adding a field to XLogRecord for this feels wrong. This is for *logical* > replication - why do you need to mess with something as physical as the WAL > record format? Some reasons why this would be OK: 1. We've long agreed that adding information to the WAL record is OK, just as long as it doesn't interfere (much) with the primary purpose. It seems OK to change wal_level so it is a list of additional info, e.g. wal_level = 'hot standby, logical, originid', or just wal_level = '$new_level_name' that includes originid info 2. We seem to have agreed elsewhere that extensions to WAL will be allowed. So perhaps redefining those bytes is something that can be re-used. That way we don't all have to agree what we'll use them for. Just call a user defined function that returns a 2 byte integer, or zero if no plugin. 3. As for how many bytes there are - I count 6 wasted bytes on a MAXALIGN=8 record. Plus we discussed long ago ways we can save another 4 bytes on records that don't have a backup block, since in that case xl_tot_len == xl_len. I've also got a feeling that WAL records that are 2 billion bytes long might be overkill. I figure we could easily make a length limit of something less than that - only commit records can be longer than 2^19 bytes when they have >65536 subxids, which is hardly common. Plus xl_prev is always at least 4 byte aligned, so there are at least 3 bits to be reclaimed from there. (Plus we have 7 unused RmgrId bits, though maybe we want to keep those) So I count about 14 bytes of potentially reclaimable space in the WAL record header, or 10 with backup blocks. The reason we never reclaimed it before was that benchmarks previously showed that reducing the volume of WAL had little effect on performance, we weren't looking specifically at WAL volume or info content. (And the perf result is probably different now anyway). Should we grovel around to reclaim any of that? Not necessary, the next person with a good reason to use some space can do that. Pick any of those: I see no reason to prevent reusing 2 bytes for a useful purpose, when we do all agree it is a useful purpose. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, Here's my next attempt at producing something we can agree upon. The major change that might achieve that is that I've now provided a separate method to store the origin_id of a node. I've made it conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both paths. That new method uses Heikki's xlog rework to dynamically add the origin to the record if an origin is set up. That works surprisingly simply.

Other changes:
* Locking preventing several backends from replaying changes at the same time. This is actually overly restrictive in some cases, but I think good enough for now.
* Logical decoding grew a filter_by_origin callback that allows ignoring changes that were replayed on a remote system. Such filters are executed before much is done with records, potentially saving a fair bit of cost.
* Rebased. That took a bit due to the xlog and other changes.
* A couple more SQL interface functions (like dropping a replication identifier).

I also want to quickly recap replication identifiers, given that in-person conversations with several people proved that the concept was slightly misunderstood: Think about a logical replication solution trying to replay changes. The postmaster that the data is replayed into crashes every now and then. Replication identifiers allow you to do something like:

do_replication()
{
    source = ConnectToSourceSystem('mysource');
    target = ConnectToSourceSystem('target');

    # mark that we're replaying
    target.exec($$SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource')$$);

    # get how far we've replayed last time round
    remote_lsn = target.exec($$SELECT remote_lsn FROM pg_get_replication_identifier_progress WHERE external_id = 'myrep_mysource'$$);

    # and now replay changes
    copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', remote_lsn);

    while (record = copystream.get_record())
    {
        if (record.type = 'begin')
        {
            target.exec('BEGIN');
            # set up the position of this individual xact
            target.exec('SELECT pg_replication_identifier_setup_tx_origin($1, $2);',
                        record.origin_lsn, record.origin_commit_timestamp);
        }
        else if (record.type = 'change')
            target.exec(record.change_sql)
        else if (record.type = 'commit')
            target.exec('COMMIT');
    }
}

A non-pseudocode version of the above would be safe against crashes of both the source and the target system. If the target system crashes, the replication identifier logic will recover how far we replayed during crash recovery. If the source system crashes/disconnects we'll have the current value in memory. Note that this works perfectly well if the target system (and obviously the source system, but that's obvious) uses synchronous_commit = off - we'll not miss any changes.

Furthermore the fact that the origin of records is recorded allows to avoid decoding them in logical decoding. That has both efficiency advantages (we can do so before they are stored in memory/disk) and functionality advantages. Imagine using a logical replication solution to replicate inserts to a single table between two databases where inserts are allowed on both - unless you prevent the replicated inserts from being replicated again you obviously have a loop. This infrastructure lets you avoid that.
The SQL interface consists of:

# manage existence of identifiers
internal_id pg_replication_identifier_create(external_id);
void pg_replication_identifier_drop(external_id);

# replay management
void pg_replication_identifier_setup_replaying_from(external_id);
void pg_replication_identifier_reset_replaying_from();
bool pg_replication_identifier_is_replaying();
void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time);

# replication progress status view
SELECT * FROM pg_replication_identifier_progress;

# replication identifiers
SELECT * FROM pg_replication_identifier;

Petr has developed (for UDR, i.e. logical replication on top of 9.4) a SQL reimplementation of replication identifiers, and that has proven that for busier workloads doing a table update to store the replication progress indeed has a noticeable overhead. Especially if there's some longer running activity on the standby.

The bigger questions I have are:

1) Where to store the origin. I personally still think that using the padding is fine. Now that I have proven that it's pretty simple to store additional information, the argument that it might be needed for something else doesn't really hold anymore. But I can live with the other solution as well - 3 bytes additional overhead ain't so bad.

2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do we want to store the origin only on relevant records? That'd be XLOG_HEAP_INSERT/XLOG_HEAP_MULTI_INSERT/XLOG_HEAP_UPDATE // XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED. I'm thinking of something like XLogLogOriginIfAvailable() before the emitting XLogInsert()s.

3) There should be a lwlock for the individual replication identifier progress slots.

4) Right now identifier progress is stored during checkpoints in special files - maybe it'd be better to store them inside the checkpoint record somehow. We read that even after a clean shutdown, so that should be fine.

5) I think there are issues with a streaming replication standby if many identifiers are created/dropped. Those shouldn't be too hard to fix.

6) Obviously the hack in bootstrap.c to get riname marked NOT NULL isn't acceptable. Either I need to implement bootstrap support for marking varlenas NOT NULL as discussed nearby or replace the syscache lookup with an index lookup.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
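As a quick illustration of the interface listed above, managing an identifier end to end from SQL might look roughly like this (function and view names as listed; return values elided):

-- create a named identifier, returning its internal 16 bit id
SELECT pg_replication_identifier_create('myrep_mysource');

-- mark this backend as replaying from that source, check it, and reset it
SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource');
SELECT pg_replication_identifier_is_replaying();
SELECT pg_replication_identifier_reset_replaying_from();

-- inspect configured identifiers and their replication progress
SELECT * FROM pg_replication_identifier;
SELECT * FROM pg_replication_identifier_progress;

-- and drop it again
SELECT pg_replication_identifier_drop('myrep_mysource');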
On 2015-02-16 01:21:55 +0100, Andres Freund wrote: > Here's my next attept attempt at producing something we can agree > upon. > > The major change that might achieve that is that I've now provided a > separate method to store the origin_id of a node. I've made it > conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both > paths. That new method uses Heikki's xlog rework to dynamically add the > origin to the record if a origin is set up. That works surprisingly > simply. > > Other changes: > > * Locking preventing several backends to replay changes at the same > time. This is actually overly restrictive in some cases, but I think > good enough for now. > * Logical decoding grew a filter_by_origin callback that allows to > ignore changes that were replayed on a remote system. Such filters are > executed before much is done with records, potentially saving a fair > bit of costs. > * Rebased. That took a bit due the xlog and other changes. > * A couple more SQL interface functions (like dropping a replication > identifier). And here an actual patch. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 02/16/2015 02:21 AM, Andres Freund wrote: > Furthermore the fact that the origin of records is recorded allows to > avoid decoding them in logical decoding. That has both efficiency > advantages (we can do so before they are stored in memory/disk) and > functionality advantages. Imagine using a logical replication solution > to replicate inserts to a single table between two databases where > inserts are allowed on both - unless you prevent the replicated inserts > from being replicated again you obviously have a loop. This > infrastructure lets you avoid that. That makes sense. How does this work if you have multiple replication systems in use in the same cluster? You might use Slony to replicate one table to one system, and BDR to replicate another table with another system. Or the same replication software, but different hosts. - Heikki
On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote: > On 02/16/2015 02:21 AM, Andres Freund wrote: > >Furthermore the fact that the origin of records is recorded allows to > >avoid decoding them in logical decoding. That has both efficiency > >advantages (we can do so before they are stored in memory/disk) and > >functionality advantages. Imagine using a logical replication solution > >to replicate inserts to a single table between two databases where > >inserts are allowed on both - unless you prevent the replicated inserts > >from being replicated again you obviously have a loop. This > >infrastructure lets you avoid that. > > That makes sense. > > How does this work if you have multiple replication systems in use in the > same cluster? You might use Slony to replicate one table to one system, > and BDR to replicate another table with another system. Or the same > replication software, but different hosts. It should "just work". Replication identifiers are identified by free-form text; each replication solution can add the information/distinguishing data they need in there. BDR uses something like #define BDR_NODE_ID_FORMAT "bdr_"UINT64_FORMAT"_%u_%u_%u_%s" with remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name as parameters for the replication identifier name. I've been wondering whether the bdr_ part in the above should be split off into a separate field, similar to how the security label stuff does it. But I don't think it'd really buy us much, especially as we did not do that for logical slot names. Each of the replication solutions in use would probably ask its output plugin to only stream locally generated (i.e. origin_id = InvalidRepNodeId) changes, and possibly changes from a defined list of other known hosts in the cascading case. Does that answer your question? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
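To make the output-plugin side of that concrete, a filter_by_origin callback along the lines described in this thread might look like the sketch below; the header, type and callback names (RepNodeId, InvalidRepNodeId) follow the patch as posted and may not match the final API exactly:

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

/*
 * Sketch only: stream out locally originated changes and drop everything
 * that was itself replayed from some remote node.  Returning true means
 * "filter this data out" before any further decoding work is done.
 */
static bool
my_filter_by_origin_cb(LogicalDecodingContext *ctx, RepNodeId origin_id)
{
    return origin_id != InvalidRepNodeId;
}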
On 02/16/2015 11:18 AM, Andres Freund wrote: > On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote: >> How does this work if you have multiple replication systems in use in the >> same cluster? You might use Slony to replication one table to one system, >> and BDR to replication another table with another system. Or the same >> replication software, but different hosts. > > It should "just work". Replication identifiers are identified by a free > form text, each replication solution can add the > information/distinguising data they need in there. > > Bdr uses something like > #define BDR_NODE_ID_FORMAT "bdr_"UINT64_FORMAT"_%u_%u_%u_%s" > with > remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name > as parameters as a replication identifier name. > > I've been wondering whether the bdr_ part in the above should be split > of into a separate field, similar to how the security label stuff does > it. But I don't think it'd really buy us much, especially as we did > not do that for logical slot names. > > Each of the used replication solution would probably ask their output > plugin to only stream locally generated (i.e. origin_id = > InvalidRepNodeId) changes, and possibly from a defined list of other > known hosts in the cascading case. > > Does that answer your question? Yes, thanks. Note to self and everyone else looking at this: It's important to keep in mind is that the replication IDs are completely internal to the local cluster. They are *not* sent over the wire. That's not completely true if you also use physical replication, though. The replication IDs will be included in the WAL stream. Can you have logical decoding running in a hot standby server? And how does the progress report stuff that's checkpointed in pg_logical/checkpoints work in a hot standby? (Sorry if I'm not making sense, I haven't really thought hard about this myself) At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. - Heikki
On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: > Yes, thanks. Note to self and everyone else looking at this: It's important > to keep in mind is that the replication IDs are completely internal to the > local cluster. They are *not* sent over the wire. Well, if you *want* to, you could send the external (free form text) replication identifiers over the wire in the output plugin. There are scenarios where that might make sense. > That's not completely true if you also use physical replication, though. The > replication IDs will be included in the WAL stream. Right. But then a physical rep server isn't realy a different server. > Can you have logical decoding running in a hot standby server? Not at the moment, there's some minor stuff missing (following timelines, enforcing tuple visibility on the primary). > And how does the progress report stuff that's checkpointed in > pg_logical/checkpoints work in a hot standby? (Sorry if I'm not > making sense, I haven't really thought hard about this myself) It doesn't work that greatly atm, they'd need to be WAL logged for that - which would not be hard. It'd be better to include the information in the checkpoint, but that unfortunately doesn't really work, because we store the checkpoint in the control file. And that has to be <= 512 bytes. What, I guess, we could do is log it in the checkpoint, after determining the redo pointer, and store the LSN for it in the checkpoint record proper. That'd mean we'd read one more record at startup, but that isn't particularly bad. > At a quick glance, this basic design seems workable. I would suggest > expanding the replication IDs to regular 4 byte oids. Two extra bytes is a > small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 16/02/15 10:46, Andres Freund wrote: > On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: > >> At a quick glance, this basic design seems workable. I would suggest >> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a >> small price to pay, to make it work more like everything else in the system. > > I don't know. Growing from 3 to 5 byte overhead per relevant record (or > even 0 to 5 in case the padding is reused) is rather noticeable. If we > later find it to be a limit (I seriously doubt that), we can still > increase it in a major release without anybody really noticing. > I agree that limiting the overhead is important. But I have related though about this - I wonder if it's worth to try to map this more directly to the CommitTsNodeId. I mean it is currently using that for the actual storage, but I am thinking of having the Replication Identifiers be the public API and the CommitTsNodeId stuff be just hidden implementation detail. This thought is based on the fact that CommitTsNodeId provides somewhat meaningless number while Replication Identifiers give us nice name for that number. And if we do that then the type should perhaps be same for both? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Now that the issue with padding seems to no longer exist, since the patch works both with and without padding, I went through the code and here are some comments I have (in no particular order).

In CheckPointReplicationIdentifier:
> + * FIXME: Add a CRC32 to the end.
The function already does it (I guess you forgot to remove the comment).

Using max_replication_slots as the limit for replication_identifier states does not really make sense to me, as replication_identifiers track remote info while slots are local, and in case of master-slave replication you need replication identifiers but don't need slots.

In bootstrap.c:
> #define MARKNOTNULL(att) \
> ((att)->attlen > 0 || \
> (att)->atttypid == OIDVECTOROID || \
> - (att)->atttypid == INT2VECTOROID)
> + (att)->atttypid == INT2VECTOROID || \
> + strcmp(NameStr((att)->attname), "riname") == 0 \
> + )
Huh? Can this be solved in a nicer way?

Since we call XLogFlush with local_lsn as a parameter, shouldn't we check that it's actually within a valid range? Currently we'll get errors like this if it's set to an invalid value: ERROR: xlog flush request 123/123 is not satisfied --- flushed only to 0/168FB18

In AdvanceReplicationIdentifier:
> + /*
> + * XXX: should we restore into a hashtable and dump into shmem only after
> + * recovery finished?
> + */
Probably not, given that the function is also callable via the SQL interface.

As I wrote in another email, I would like to integrate the RepNodeId and CommitTSNodeId into a single thing.

There are no docs for the new SQL interfaces. The replication_identifier.c file might also deserve some intro/notes text.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2015-02-22 04:59:30 +0100, Petr Jelinek wrote: > Now that the issue with padding seems to no longer exists since the patch > works both with and without padding, I went through the code and here are > some comments I have (in no particular order). > > In CheckPointReplicationIdentifier: > >+ * FIXME: Add a CRC32 to the end. > > The function already does it (I guess you forgot to remove the comment). Yep. I locally have a WIP version that's much cleaned up and doesn't contain it anymore. > Using max_replication_slots as limit for replication_identifier states does > not really make sense to me as replication_identifiers track remote info > while and slots are local and in case of master-slave replication you need > replication identifiers but don't need slots. On the other hand, it's quite cheap if unused. Not sure if several variables are worth it. > In bootstrap.c: > > #define MARKNOTNULL(att) \ > > ((att)->attlen > 0 || \ > > (att)->atttypid == OIDVECTOROID || \ > >- (att)->atttypid == INT2VECTOROID) > >+ (att)->atttypid == INT2VECTOROID || \ > >+ strcmp(NameStr((att)->attname), "riname") == 0 \ > >+ ) > > Huh? Can this be solved in a nicer way? Yes. I'd mentioned that this is just a temporary hack ;). I've since pushed a more proper solution; BKI_FORCE_NOT_NULL can now be specified on the column. > Since we call XLogFlush with local_lsn as parameter, shouldn't we check that > it's actually within valid range? > Currently we'll get errors like this if set to invalid value: > ERROR: xlog flush request 123/123 is not satisfied --- flushed only to > 0/168FB18 I think we should rather remove local_lsn from all parameters that are user callable, adding them there was a mistake. It's really only relevant for the cases where it's called by commit. > In AdvanceReplicationIndentifier: > >+ /* > >+ * XXX: should we restore into a hashtable and dump into shmem only after > >+ * recovery finished? > >+ */ > > Probably no given that the function is also callable via SQL interface. Well, it's still a good idea regardless... > As I wrote in another email, I would like to integrate the RepNodeId and > CommitTSNodeId into single thing. Will reply separately there. > There are no docs for the new sql interfaces. Yea. The whole thing needs docs. Thanks, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote: > On 16/02/15 10:46, Andres Freund wrote: > >On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: > > > >>At a quick glance, this basic design seems workable. I would suggest > >>expanding the replication IDs to regular 4 byte oids. Two extra bytes is a > >>small price to pay, to make it work more like everything else in the system. > > > >I don't know. Growing from 3 to 5 byte overhead per relevant record (or > >even 0 to 5 in case the padding is reused) is rather noticeable. If we > >later find it to be a limit (I seriously doubt that), we can still > >increase it in a major release without anybody really noticing. > > > > I agree that limiting the overhead is important. > > But I have related though about this - I wonder if it's worth to try to map > this more directly to the CommitTsNodeId. Maybe. I'd rather go the other way round though; replication_identifier.c/h's stuff seems much more generic than CommitTsNodeId. > I mean it is currently using that > for the actual storage, but I am thinking of having the Replication > Identifiers be the public API and the CommitTsNodeId stuff be just hidden > implementation detail. This thought is based on the fact that CommitTsNodeId > provides somewhat meaningless number while Replication Identifiers give us > nice name for that number. And if we do that then the type should perhaps be > same for both? I'm not sure. Given that this is included in a significant number of recordsd I'd really rather not increase the overhead as described above. Maybe we can just limit CommitTsNodeId to the same size for now? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 22/02/15 09:57, Andres Freund wrote: > On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote: >> On 16/02/15 10:46, Andres Freund wrote: >>> On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: >>> >>>> At a quick glance, this basic design seems workable. I would suggest >>>> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a >>>> small price to pay, to make it work more like everything else in the system. >>> >>> I don't know. Growing from 3 to 5 byte overhead per relevant record (or >>> even 0 to 5 in case the padding is reused) is rather noticeable. If we >>> later find it to be a limit (I seriously doubt that), we can still >>> increase it in a major release without anybody really noticing. >>> >> >> I agree that limiting the overhead is important. >> >> But I have related though about this - I wonder if it's worth to try to map >> this more directly to the CommitTsNodeId. > > Maybe. I'd rather go the other way round though; > replication_identifier.c/h's stuff seems much more generic than > CommitTsNodeId. > Probably not more generic but definitely nicer as the nodes are named and yes it has richer API. >> I mean it is currently using that >> for the actual storage, but I am thinking of having the Replication >> Identifiers be the public API and the CommitTsNodeId stuff be just hidden >> implementation detail. This thought is based on the fact that CommitTsNodeId >> provides somewhat meaningless number while Replication Identifiers give us >> nice name for that number. And if we do that then the type should perhaps be >> same for both? > > I'm not sure. Given that this is included in a significant number of > recordsd I'd really rather not increase the overhead as described > above. Maybe we can just limit CommitTsNodeId to the same size for now? > That would make sense. I also wonder about the default nodeid infrastructure the committs provides. I can't think of use-case which it can be used for and which isn't solved by replication identifiers - in the end the main reason I added that infra was to make it possible to write something like replication identifiers as part of an extension. In any case I don't think the default nodeid can be used in parallel with replication identifiers since one will overwrite the SLRU record for the other. Maybe it's enough if this is documented but I think it might be better if this patch removed that default committs nodeid infrastructure. It's just few lines of code which nobody should be using yet. Thinking about this some more and rereading the code I see that you are setting TransactionTreeSetCommitTsData during xlog replay, but not during transaction commit, that does not seem correct as the local records will not have any nodeid/origin. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2/15/15 7:24 PM, Andres Freund wrote: > On 2015-02-16 01:21:55 +0100, Andres Freund wrote: >> Here's my next attept attempt at producing something we can agree >> upon. >> >> The major change that might achieve that is that I've now provided a >> separate method to store the origin_id of a node. I've made it >> conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both >> paths. That new method uses Heikki's xlog rework to dynamically add the >> origin to the record if a origin is set up. That works surprisingly >> simply. I'm trying to figure out what this feature is supposed to do, but the patch contains no documentation or a commit message. Where is one supposed to start?
On 2015-02-22 10:03:59 -0500, Peter Eisentraut wrote: > On 2/15/15 7:24 PM, Andres Freund wrote: > > On 2015-02-16 01:21:55 +0100, Andres Freund wrote: > >> Here's my next attept attempt at producing something we can agree > >> upon. > >> > >> The major change that might achieve that is that I've now provided a > >> separate method to store the origin_id of a node. I've made it > >> conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both > >> paths. That new method uses Heikki's xlog rework to dynamically add the > >> origin to the record if a origin is set up. That works surprisingly > >> simply. > > I'm trying to figure out what this feature is supposed to do, but the > patch contains no documentation or a commit message. Where is one > supposed to start? For a relatively short summary: http://archives.postgresql.org/message-id/20150216002155.GI15326%40awork2.anarazel.de For a longer one: http://www.postgresql.org/message-id/20140923182422.GA15776@alap3.anarazel.de Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, Here's the next version of this patch. I've tried to address the biggest issue (documentation) and some more. Now that both the more flexible commit WAL record format and the BKI_FORCE_NOT_NULL thing are in, it looks much cleaner.

Changes:
* Loads of documentation and comments
* Revamped locking strategy. There's now an LWLock protecting the replication progress array and a spinlock for the individual slots.
* Simpler checkpoint format.
* A new pg_replication_identifier_progress() function returning an individual identifier's replication progress; previously there was only the view showing all of them.
* Lots of minor cleanup
* Some more tests

I'd greatly appreciate some feedback on the documentation. I'm not entirely sure how much detail to go into, or where exactly in the docs to place it. I do wonder if we shouldn't merge this with the logical decoding section and whether we could also document commit timestamps somewhere in there.

I've verified that this correctly works on a standby; replication progress is replicated correctly. I think there are two holes though: changing the replication progress without replicating anything, and dropping a replication identifier with some replication progress, might not work correctly. That's fairly easily fixed and I intend to do so. Other than that I'm not aware of outstanding issues.

Comments?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
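As a small illustration of the new function mentioned above (names per the patch description; the exact output columns are not shown):

-- per-identifier progress, instead of scanning the whole view
SELECT pg_replication_identifier_progress('myrep_mysource');

-- the view still shows progress for all identifiers at once
SELECT * FROM pg_replication_identifier_progress;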
On 24/03/15 16:33, Andres Freund wrote: > Hi, > > Here's the next version of this patch. I've tried to address the biggest > issue (documentation) and some more. Now that both the more flexible > commit WAL record format and the BKI_FORCE_NOT_NULL thing are in, it > looks much cleaner. > Nice, I see you also did the closer integration with CommitTs. I've only skimmed the patch so far, but it looks quite good; I'll take a closer look around the end of the week. > > I'd greatly appreciate some feedback on the documentation. I'm not > entirely sure how much detail to go into, or where exactly in the docs > to place it. I do wonder if we shouldn't merge this with the logical > decoding section and whether we could also document commit timestamps > somewhere in there. Perhaps we should have a logical replication developer documentation section and put all three of those as subsections of that? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> At a quick glance, this basic design seems workable. I would suggest >> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a >> small price to pay, to make it work more like everything else in the system. > > I don't know. Growing from 3 to 5 byte overhead per relevant record (or > even 0 to 5 in case the padding is reused) is rather noticeable. If we > later find it to be a limit (I seriously doubt that), we can still > increase it in a major release without anybody really noticing. You might notice that Heikki is making the same point here that I've attempted to make multiple times in the past: limiting to replication identifier to 2 bytes because that's how much padding space you happen to have available is optimizing for the wrong thing. What we should be optimizing for is consistency and uniformity of design. System catalogs have OIDs, so this one should, too. You're not going to be able to paper over the fact that the column has some funky data type that is unlike every other column in the system. To the best of my knowledge, the statement that there is a noticeable performance cost for those 2 extra bytes is also completely unsupported by any actual benchmarking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
So I took a more in-depth look, and I have a couple of comments.

I would really like to have something like a "Logical Replication Infrastructure" doc section that would have both decoding and identifiers (and possibly even CommitTs) underneath.

There is a typo in the docs:
> + <para>
> + The optional <function>filter_by_origin_cb</function> callback
> + is called to determine wheter data that has been replayed
wheter -> whether

And finally I have an issue with how the new identifiers are allocated. Currently, if you create identifier 'foo', remove identifier 'foo' and create identifier 'bar', the identifier 'bar' will have the same id as the old 'foo' identifier. This can be a problem because the identifier id is used as the origin of the data, and the replication solution using the replication identifiers can end up thinking that data came from node 'bar' even though it came from node 'foo', which no longer exists. This can have bad effects for example on conflict detection or debugging problems with replication.

Maybe another reason to use standard Oids?

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote: > And finally I have issue with how the new identifiers are allocated. > Currently, if you create identifier 'foo', remove identifier 'foo' and > create identifier 'bar', the identifier 'bar' will have same id as the old > 'foo' identifier. This can be problem because the identifier id is used as > origin of the data and the replication solution using the replication > identifiers can end up thinking that data came from node 'bar' even though > they came from the node 'foo' which no longer exists. This can have bad > effects for example on conflict detection or debugging problems with > replication. > > Maybe another reason to use standard Oids? As the same reason exists for oids, just somewhat less likely, I don't see it as a reason for much. It's really not that hard to get oid conflicts once your server has lived for a while. As soon as the oid counter has wrapped around once, it's not unlikely to have conflicts. And with temp tables (or much more extremely WITH OID tables) and such it's not that hard to reach that point. The only material difference this makes is that it's much easier to notice the problem during development. Greetings, Andres Freund
On 2015-04-07 16:30:25 +0200, Andres Freund wrote: > And with temp tables (or much more extremely WITH OID tables) > and such it's not that hard to reach that point. Oh, and obviously toast data. A couple tables with toasted columns is also a good way to rapidly consume oids. Greetings, Andres Freund
On 2015-03-24 23:11:26 -0400, Robert Haas wrote: > On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> At a quick glance, this basic design seems workable. I would suggest > >> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a > >> small price to pay, to make it work more like everything else in the system. > > > > I don't know. Growing from 3 to 5 byte overhead per relevant record (or > > even 0 to 5 in case the padding is reused) is rather noticeable. If we > > later find it to be a limit (I seriously doubt that), we can still > > increase it in a major release without anybody really noticing. > > You might notice that Heikki is making the same point here that I've > attempted to make multiple times in the past: limiting to replication > identifier to 2 bytes because that's how much padding space you happen > to have available is optimizing for the wrong thing. What we should > be optimizing for is consistency and uniformity of design. System > catalogs have OIDs, so this one should, too. You're not going to be > able to paper over the fact that the column has some funky data type > that is unlike every other column in the system. > > To the best of my knowledge, the statement that there is a noticeable > performance cost for those 2 extra bytes is also completely > unsupported by any actual benchmarking. I'm starting benchmarks now. But I have to say: I find the idea that you'd need more than 2^16 identifiers anytime soon not very credible. The likelihood that replication identifiers are the limiting factor towards that seems incredibly small. Just consider how you'd apply changes from so many remotes; how to stream changes to them; how to even configure such a complex setup. We can easily change the size limits in the next major release without anybody being inconvenienced. We've gone through quite some lengths reducing the overhead of WAL. I don't understand why it's important that we do not make compromises here; but why that doesn't matter elsewhere. Greetings, Andres Freund
On 4/7/15 9:30 AM, Andres Freund wrote: > On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote: >> And finally I have issue with how the new identifiers are allocated. >> Currently, if you create identifier 'foo', remove identifier 'foo' and >> create identifier 'bar', the identifier 'bar' will have same id as the old >> 'foo' identifier. This can be problem because the identifier id is used as >> origin of the data and the replication solution using the replication >> identifiers can end up thinking that data came from node 'bar' even though >> they came from the node 'foo' which no longer exists. This can have bad >> effects for example on conflict detection or debugging problems with >> replication. >> >> Maybe another reason to use standard Oids? > > As the same reason exists for oids, just somewhat less likely, I don't > see it as a reason for much. It's really not that hard to get oid > conflicts once your server has lived for a while. As soon as the oid > counter has wrapped around once, it's not unlikely to have > conflicts. And with temp tables (or much more extremely WITH OID tables) > and such it's not that hard to reach that point. The only material > difference this makes is that it's much easier to notice the problem > during development. Why not just create a sequence? I suspect it may not be as fast to assign as an OID, but it's not like you'd be doing this all the time. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
> Why not just create a sequence? I suspect it may not be as fast to assign as > an OID, but it's not like you'd be doing this all the time. What does that have to do with the thread? Greetings, Andres Freund
On 4/7/15 10:58 AM, Andres Freund wrote: >> Why not just create a sequence? I suspect it may not be as fast to assign as >> an OID, but it's not like you'd be doing this all the time. > > What does that have to do with the thread? The original bit was... > And finally I have issue with how the new identifiers are allocated. > Currently, if you create identifier 'foo', remove identifier 'foo' and > create identifier 'bar', the identifier 'bar' will have same id as the old > 'foo' identifier. This can be problem because the identifier id is used as > origin of the data and the replication solution using the replication > identifiers can end up thinking that data came from node 'bar' even though > they came from the node 'foo' which no longer exists. This can have bad > effects for example on conflict detection or debugging problems with > replication. > > Maybe another reason to use standard Oids? Wasn't the reason for using OIDs so that we're not doing the equivalent of max(identifier) + 1? Perhaps I'm just confused... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-04-07 16:30:25 +0200, Andres Freund wrote: >> And with temp tables (or much more extremely WITH OID tables) >> and such it's not that hard to reach that point. > > Oh, and obviously toast data. A couple tables with toasted columns is > also a good way to rapidly consume oids. You are forgetting as well large objects on the stack, when client application does not assign an OID by itself. -- Michael
On 08/04/15 06:59, Michael Paquier wrote: > On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2015-04-07 16:30:25 +0200, Andres Freund wrote: >>> And with temp tables (or much more extremely WITH OID tables) >>> and such it's not that hard to reach that point. >> >> Oh, and obviously toast data. A couple tables with toasted columns is >> also a good way to rapidly consume oids. > > You are forgetting as well large objects on the stack, when client > application does not assign an OID by itself. > And you guys are not getting my point. What I proposed was to not reuse the RI id immediately because that can make debugging issues with replication/conflict handling harder when something happens after cluster configuration has changed. Whether it's done using Oid or some other way, I don't really care and wrapping around eventually is ok, since the old origin info for transactions will be cleared out during the freeze at the latest anyway. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2015-04-08 14:17:04 +0200, Petr Jelinek wrote: > And you guys are not getting my point. What I proposed was to not reuse the > RI id immediately because that can make debugging issues with > replication/conflict handling harder when something happens after cluster > configuration has changed. If that's the goal, you shouldn't delete the replication identifier at that point. That's the only sane way preventing it from being reused. > Whether it's done using Oid or some other way, I don't really care and > wrapping around eventually is ok, since the old origin info for > transactions will be cleared out during the freeze at the latest > anyway. How are you proposing to do the allocation then? There's no magic preventing immediate reuse with oids or anything else. The oid counter might *already* have wrapped around and point exactly to the identifier you're about to delete. Then when you deleted it it's going to be reused for the next allocated oid. Andres Freund
On 08/04/15 14:22, Andres Freund wrote: > On 2015-04-08 14:17:04 +0200, Petr Jelinek wrote: >> And you guys are not getting my point. What I proposed was to not reuse the >> RI id immediately because that can make debugging issues with >> replication/conflict handling harder when something happens after cluster >> configuration has changed. > > If that's the goal, you shouldn't delete the replication identifier at > that point. That's the only sane way preventing it from being reused. > Ok, I am happy with that solution. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2015-04-07 17:08:16 +0200, Andres Freund wrote: > I'm starting benchmarks now. What I'm benchmarking here is the WAL overhead, since that's what we're debating. The test setup I used was a pgbench scale 10 instance. I've run with full_page_writes=off to have more reproducible results. This of course over-emphasizes the overhead a bit. But for a long checkpoint interval and a memory resident working set it's not that unrealistic. I ran 50k transactions for each run:

baseline:
- 20445024
- 20437128
- 20436864
- avg: 20439672

extern 2byte identifiers:
- 23318368
- 23148648
- 23128016
- avg: 23198344
- avg overhead: 13.5%

padding 2byte identifiers:
- 21160408
- 21319720
- 21164280
- avg: 21214802
- avg overhead: 3.8%

extern 4byte identifiers:
- 23514216
- 23540128
- 23523080
- avg: 23525808
- avg overhead: 15.1%

To me that shows pretty clearly that a) reusing the padding is worthwhile and b) even without that, using 2 byte instead of 4 byte identifiers is beneficial. Now, especially in the case of extern identifiers we *can* optimize a bit more. But there's no way we can get the efficiency of the version reusing padding. To run the benchmarks you need to SELECT pg_replication_identifier_create('frak'); before starting pgbench with the attached file. Greetings, Andres Freund
Attachment
On 10/04/15 18:03, Andres Freund wrote: > On 2015-04-07 17:08:16 +0200, Andres Freund wrote: >> I'm starting benchmarks now. > > What I'm benchmarking here is the WAL overhead, since that's what we're > debating. > > The test setup I used was a pgbench scale 10 instance. I've run with > full_page_write=off to have more reproducible results. This of course > over-emphasizes the overhead a bit. But for a long checkpoint interval > and a memory resident working set it's not that unrealistic. > > I ran 50k transactions in a signle b > baseline: > - 20445024 > - 20437128 > - 20436864 > - avg: 20439672 > extern 2byte identifiers: > - 23318368 > - 23148648 > - 23128016 > - avg: 23198344 > - avg overhead: 13.5% > padding 2byte identifiers: > - 21160408 > - 21319720 > - 21164280 > - avg: 21214802 > - avg overhead: 3.8% > extern 4byte identifiers: > - 23514216 > - 23540128 > - 23523080 > - avg: 23525808 > - avg overhead: 15.1% > > To me that shows pretty clearly that a) reusing the padding is > worthwhile b) even without that using 2byte instead of 4 byte > identifiers is beneficial. > My opinion is that 10% of WAL size difference is quite high price to pay so that we can keep the padding for some other, yet unknown feature that hasn't come up in several years, which would need those 2 bytes. But if we are willing to pay it then we can really go all the way and just use Oids... -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 04/12/2015 02:56 AM, Petr Jelinek wrote: > On 10/04/15 18:03, Andres Freund wrote: >> On 2015-04-07 17:08:16 +0200, Andres Freund wrote: >>> I'm starting benchmarks now. >> >> What I'm benchmarking here is the WAL overhead, since that's what we're >> debating. >> >> The test setup I used was a pgbench scale 10 instance. I've run with >> full_page_write=off to have more reproducible results. This of course >> over-emphasizes the overhead a bit. But for a long checkpoint interval >> and a memory resident working set it's not that unrealistic. >> >> I ran 50k transactions in a signle b >> baseline: >> - 20445024 >> - 20437128 >> - 20436864 >> - avg: 20439672 >> extern 2byte identifiers: >> - 23318368 >> - 23148648 >> - 23128016 >> - avg: 23198344 >> - avg overhead: 13.5% >> padding 2byte identifiers: >> - 21160408 >> - 21319720 >> - 21164280 >> - avg: 21214802 >> - avg overhead: 3.8% >> extern 4byte identifiers: >> - 23514216 >> - 23540128 >> - 23523080 >> - avg: 23525808 >> - avg overhead: 15.1% >> >> To me that shows pretty clearly that a) reusing the padding is >> worthwhile b) even without that using 2byte instead of 4 byte >> identifiers is beneficial. >> > > My opinion is that 10% of WAL size difference is quite high price to pay > so that we can keep the padding for some other, yet unknown feature that > hasn't come up in several years, which would need those 2 bytes. > > But if we are willing to pay it then we can really go all the way and > just use Oids... This needs to be weighed against removing the padding bytes altogether. See attached. That would reduce the WAL size further when you don't need replication IDs. It's very straightforward, but need to do some performance/scalability testing to make sure that using memcpy instead of a straight 32-bit assignment doesn't hurt performance, since it happens in very performance critical paths. I'm surprised there's such a big difference between the "extern" and "padding" versions above. At a quick approximation, storing the ID as a separate "fragment", along with XLogRecordDataHeaderShort and XLogRecordDataHeaderLong, should add one byte of overhead plus the ID itself. So that would be 3 extra bytes for 2-byte identifiers, or 5 bytes for 4-byte identifiers. Does that mean that the average record length is only about 30 bytes? That's what it seems like, if adding the "extern 2 byte identifiers" added about 10% of overhead compared to the "padding 2 byte identifiers" version. That doesn't sound right, 30 bytes is very little. Perhaps the size of the records created by pgbench happen to cross a 8-byte alignment boundary at that point, making a big difference. In another workload, there might be no difference at all, due to alignment. Also, you don't need to tag every record type with the replication ID. All indexam records can skip it, for starters, since logical decoding doesn't care about them. That should remove a fair amount of bloat. - Heikki
On 2015-04-12 22:02:38 +0300, Heikki Linnakangas wrote:
> This needs to be weighed against removing the padding bytes
> altogether.

Hrmpf. Says the person that used a lot of padding, without much discussion, for the WAL level infrastructure making pg_rewind more maintainable. And you deemed it perfectly ok to use them up to avoid *increasing* the WAL size with the *additional data* (which so far nothing but pg_rewind needs in that way), while they perfectly well could have been used to shrink the WAL size to less than it now is. And those changes are *far*, *far* harder to back out/refactor than this (which is pretty localized and thus can easily be changed); to the point that I think it's infeasible to do so...

If you want to shrink the WAL size, send in a patch independently. Not as a way to block somebody else implementing something.

> I'm surprised there's such a big difference between the "extern" and
> "padding" versions above. At a quick approximation, storing the ID as a
> separate "fragment", along with XLogRecordDataHeaderShort and
> XLogRecordDataHeaderLong, should add one byte of overhead plus the ID
> itself. So that would be 3 extra bytes for 2-byte identifiers, or 5 bytes
> for 4-byte identifiers. Does that mean that the average record length is
> only about 30 bytes?

Yes, nearly. That's xlogdump --stats=record from the above scenario with replication identifiers used and reusing the padding:

Type                       N      (%)      Record size      (%)   FPI size     (%)   Combined size      (%)
----                       -      ---      -----------      ---   --------     ---   -------------      ---
Transaction/COMMIT     50003  ( 16.89)         2600496  ( 23.38)         0  ( -nan)        2600496  ( 23.38)
CLOG/ZEROPAGE              1  (  0.00)              28  (  0.00)         0  ( -nan)             28  (  0.00)
Standby/RUNNING_XACTS      5  (  0.00)             248  (  0.00)         0  ( -nan)            248  (  0.00)
Heap2/CLEAN            46034  ( 15.55)         1473088  ( 13.24)         0  ( -nan)        1473088  ( 13.24)
Heap2/VISIBLE              2  (  0.00)              56  (  0.00)         0  ( -nan)             56  (  0.00)
Heap/INSERT            49682  ( 16.78)         1341414  ( 12.06)         0  ( -nan)        1341414  ( 12.06)
Heap/HOT_UPDATE       150013  ( 50.67)         5700494  ( 51.24)         0  ( -nan)        5700494  ( 51.24)
Heap/INPLACE               5  (  0.00)             130  (  0.00)         0  ( -nan)            130  (  0.00)
Heap/INSERT+INIT         318  (  0.11)            8586  (  0.08)         0  ( -nan)           8586  (  0.08)
Btree/VACUUM               2  (  0.00)              56  (  0.00)         0  ( -nan)             56  (  0.00)
                    --------               --------              --------                 --------
Total                 296065              11124596 [100.00%]            0 [0.00%]         11124596 [100.00%]

(The FPI percentage display above is arguably borked. Interesting.)

So the average record size is ~37.5 bytes, including the increased commit record size due to the origin information (which is the part that increases the size for the version that reuses the padding). This *most definitely* isn't representative of every workload. But it *is* *a* common type of workload.

Note that --stats will *not* show the size difference in xlog records when adding data as an additional chunk vs. padding, as it uses XLogRecGetDataLen() to compute the record length... That confused me for a while.

> That doesn't sound right, 30 bytes is very little.

Well, it's mostly HOT_UPDATEs and INSERTs into not-indexed tables, so that's not too surprising. Obviously that'd look different with FPIs enabled.

> Perhaps the size
> of the records created by pgbench happen to cross a 8-byte alignment
> boundary at that point, making a big difference. In another workload,
> there might be no difference at all, due to alignment.

Right.

> Also, you don't need to tag every record type with the replication ID. All
> indexam records can skip it, for starters, since logical decoding doesn't
> care about them. That should remove a fair amount of bloat.

Yes. I mentioned that.
It's additional complexity because now the decision has to be made at each xlog insertion callsite, which makes refactoring this into a different representation a bit harder. I don't think it will make that much of a difference in the above workload (just CLEAN will be smaller); but it clearly might in others.

I've attached a rebased patch that adds the decision about origin logging to the relevant XLogInsert() callsites for "external" 2 byte identifiers and removes the pad-reusing version, in the interest of moving forward. I still don't see a point in using 4 byte identifiers atm; given the above numbers that just seems like a waste for unrealistic use cases (>2^16 nodes). It's just two lines to change if we feel the need in the future.

Working on fixing the issue with WAL logging of deletions and rearranging the docs as Petr suggested. Not sure if the latter will really look good, but I guess we'll see ;)

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
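For reference, the ~37.5 byte average record size mentioned above follows directly from the totals of the --stats output (total combined size divided by the number of records); a quick check:

  -- Average WAL record size implied by the xlogdump --stats totals quoted above.
  SELECT round(11124596.0 / 296065, 1) AS avg_record_bytes;  -- 37.6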
Subject: fix xlogdump percentage display (was Re: Replication identifiers, take 4)
From: Abhijit Menon-Sen
At 2015-04-17 10:54:51 +0200, andres@anarazel.de wrote: > > (The FPI percentage display above is arguably borked. Interesting.) Sorry for the trouble. Patch attached. -- Abhijit
On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote:
> Hrmpf. Says the person that used a lot of padding, without much
> discussion, for the WAL level infrastructure making pg_rewind more
> maintainable.
Sounds bad. What padding are we talking about?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/17/2015 12:04 PM, Simon Riggs wrote: > On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote: > >> Hrmpf. Says the person that used a lot of padding, without much >> discussion, for the WAL level infrastructure making pg_rewind more >> maintainable. > > Sounds bad. What padding are we talking about? In the new WAL format, the data chunks are stored unaligned, without padding, to save space. The new format is quite different to the old one, so it's not straightforward to compare how much that saved. The fixed-size XLogRecord header is 8 bytes shorter in the new format, because it doesn't have the xl_len field anymore. But the same information is stored elsewhere in the record, where it takes 2 or 5 bytes (XLogRecordDataHeaderShort/Long). But it's a fair point that we could've just made small adjustments to the old format, without revamping every record type and the way the block information is stored, and that the space saving of the new format should be compared with that instead, for a fair comparison. As an example, one simple thing we could've done with the old format: remove xl_len, and store the length in place of the two unused padding bytes instead, as long as it fits in 16 bits. For longer records, set a flag and store it right after XLogRecord header. For practically all WAL records, that would've shrunk XLogRecord from 32 to 24 bytes, and made each record 8 bytes shorter. I ran the same pgbench test Andres used, with scale 10, and 50000 transactions, and compared the WAL size between master and 9.4: master: 20738352 9.4: 23915800 According to pg_xlogdump, there were 301153 WAL records. If you take the 9.4 figure, and imagine that we had saved those 8 bytes on each WAL record, 9.4 would've been 21506576 bytes instead. So yeah, we could've achieved much of the WAL savings with that much smaller change. That's a useful thing to compare with. BTW, those numbers are with wal_level=minimal. With wal_level=logical, the WAL size from the same test on master was 26503520 bytes. That's quite a bump. Looking at pg_xlogdump output, it seems that it's all because the commit records are wider. - Heikki
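The hypothetical saving described above checks out against the record count from pg_xlogdump; with 301153 records at 8 bytes each (sizes taken from the paragraph above):

  -- 9.4's WAL size minus 8 bytes per record, compared with master.
  SELECT 23915800 - 301153 * 8 AS hypothetical_94_size,  -- 21506576
         20738352              AS master_size;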
On 17 April 2015 at 18:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 04/17/2015 12:04 PM, Simon Riggs wrote:
>> On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote:
>>
>>> Hrmpf. Says the person that used a lot of padding, without much
>>> discussion, for the WAL level infrastructure making pg_rewind more
>>> maintainable.
>>
>> Sounds bad. What padding are we talking about?
>
> In the new WAL format, the data chunks are stored unaligned, without padding, to save space. The new format is quite different to the old one, so it's not straightforward to compare how much that saved.
The key point here is the whole WAL format was changed to accommodate a minor requirement for one utility. Please notice that nobody tried to stop you doing that.
The changes Andres is requesting have a very significant effect on a major new facility. Perhaps there is concern that it is an external utility?
If we can trust Heikki to include code into core that was written externally then I think we can do the same for Andres.
I think it's time to stop the padding discussion and commit something useful. We need this.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/17/2015 08:36 PM, Simon Riggs wrote: > On 17 April 2015 at 18:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> On 04/17/2015 12:04 PM, Simon Riggs wrote: >> >>> On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote: >>> >>> Hrmpf. Says the person that used a lot of padding, without much >>>> discussion, for the WAL level infrastructure making pg_rewind more >>>> maintainable. >>> >>> Sounds bad. What padding are we talking about? >> >> In the new WAL format, the data chunks are stored unaligned, without >> padding, to save space. The new format is quite different to the old one, >> so it's not straightforward to compare how much that saved. > > The key point here is the whole WAL format was changed to accommodate a > minor requirement for one utility. Please notice that nobody tried to stop > you doing that. > > The changes Andres is requesting have a very significant effect on a major > new facility. Perhaps there is concern that it is an external utility? > > If we can trust Heikki to include code into core that was written > externally then I think we can do the same for Andres. I'm not concerned of the fact it is an external utility. Well, it concerns me a little bit, because that means that it'll get little testing with PostgreSQL. But that has nothing to do with the WAL size question. > I think its time to stop the padding discussion and commit something > useful. We need this. To be honest, I'm not entirely sure what we're arguing over. I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all other system objects we have. Andreas did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable, Andreas seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them for the replication ids. That would be an even better use for them. - Heikki
On 17 April 2015 at 19:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> To be honest, I'm not entirely sure what we're arguing over.

When arguing over something you consider small, it is customary to allow the author precedence. We can't do things our own way all the time.

I didn't much like pg_rewind, but it doesn't hurt and you like it, so I didn't object. We've all got better things to do.

> I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all other system objects we have. Andreas did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable, Andreas seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them for the replication ids. That would be an even better use for them.

The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that.

The change does nothing useful, since I doubt anyone will ever need >32768 nodes in their cluster.

Increasing WAL size for any non-zero amount is needlessly wasteful for a change with only cosmetic value. But for a change that has significant value for database resilience, it is a sensible use of bytes.

+1 to Andres' very reasonable suggestion. Lets commit this and go home.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 17/04/15 22:36, Simon Riggs wrote: > > I said that IMO the difference in WAL size is so small that we > should just use 4-byte OIDs for the replication identifiers, instead > of trying to make do with 2 bytes. Not because I find it too likely > that you'll run out of IDs (although it could happen), but more to > make replication IDs more like all other system objects we have. > Andreas did some pgbench benchmarking to show that the difference in > WAL size is about 10%. The WAL records generated by pgbench happen > to have just the right sizes so that the 2-3 extra bytes bump them > over to the next alignment boundary. That's why there is such a big > difference - on average it'll be less. I think that's acceptable, > Andreas seems to think otherwise. But if the WAL size really is so > precious, we could remove the two padding bytes from XLogRecord, > instead of dedicating them for the replication ids. That would be an > even better use for them. > > > The argument to move to 4 bytes is a poor one. If it was reasonable in > terms of code or cosmetic value then all values used in the backend > would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we > don't do that. > > The change does nothing useful, since I doubt anyone will ever need > >32768 nodes in their cluster. > And if they did there would be other much bigger problems than replication identifier being 16bit (it's actually >65534 as it's unsigned btw). Considering the importance and widespread use of replication I think we should really make sure the related features have small overhead. > Increasing WAL size for any non-zero amount is needlessly wasteful for a > change with only cosmetic value. But for a change that has significant > value for database resilience, it is a sensible use of bytes. > > +1 to Andres' very reasonable suggestion. Lets commit this and go home. > +1 -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 04/17/2015 11:36 PM, Simon Riggs wrote: > On 17 April 2015 at 19:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> To be honest, I'm not entirely sure what we're arguing over. > > When arguing over something you consider small, it is customary to allow > the author precedence. We can't do things our own way all the time. Sure, I'm not going to throw a tantrum if Andres commits this as it is. >> I said that IMO the difference in WAL size is so small that we should just >> use 4-byte OIDs for the replication identifiers, instead of trying to make >> do with 2 bytes. Not because I find it too likely that you'll run out of >> IDs (although it could happen), but more to make replication IDs more like >> all other system objects we have. Andreas did some pgbench benchmarking to >> show that the difference in WAL size is about 10%. The WAL records >> generated by pgbench happen to have just the right sizes so that the 2-3 >> extra bytes bump them over to the next alignment boundary. That's why there >> is such a big difference - on average it'll be less. I think that's >> acceptable, Andreas seems to think otherwise. But if the WAL size really is >> so precious, we could remove the two padding bytes from XLogRecord, instead >> of dedicating them for the replication ids. That would be an even better >> use for them. > > The argument to move to 4 bytes is a poor one. If it was reasonable in > terms of code or cosmetic value then all values used in the backend would > be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do > that. That's a straw man argument. I'm not saying we should never use 2 byte values anywhere. OID is usually used as the primary key in system tables. There are exceptions, but that is nevertheless the norm. I'm saying that saving in WAL size is not worth making an exception here, and we should go with the simplest option of using OIDs. - Heikki
On 04/17/2015 11:45 PM, Petr Jelinek wrote: >> >The argument to move to 4 bytes is a poor one. If it was reasonable in >> >terms of code or cosmetic value then all values used in the backend >> >would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we >> >don't do that. >> > >> >The change does nothing useful, since I doubt anyone will ever need >> > >32768 nodes in their cluster. >> > > And if they did there would be other much bigger problems than > replication identifier being 16bit (it's actually >65534 as it's > unsigned btw). Can you name some of the bigger problems you'd have? Obviously, if you have 100000 high-volume OLTP nodes connected to a single server, feeding transactions as a continous stream, you're going to choke the system. But you might have 100000 tiny satellite databases that sync up with the master every few hours, and each of them do only a few updates per day. - Heikki
On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote: > Can you name some of the bigger problems you'd have? Several parts of the system are O(#max_replication_slots). Having 100k outgoing logical replication slots is going to be expensive. Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be the biggest problem. Greetings, Andres Freund
On 04/17/2015 11:54 AM, Andres Freund wrote: > I've attached a rebased patch, that adds decision about origin logging > to the relevant XLogInsert() callsites for "external" 2 byte identifiers > and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about "replication identifier" as the new fundamental concept. The system table is called "pg_replication_identifier". But that's like talking about "index identifiers", instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of "replication identifier" -> "replication origin" to the patch. Or even "transaction origin". - Heikki
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: > I just realized that it talks about "replication identifier" as the new > fundamental concept. The system table is called "pg_replication_identifier". > But that's like talking about "index identifiers", instead of just indexes, > and calling the system table pg_index_oid. > > The important concept this patch actually adds is the *origin* of each > transaction. That term is already used in some parts of the patch. I think > we should roughly do a search-replace of "replication identifier" -> > "replication origin" to the patch. Or even "transaction origin". Sounds good to me. Greetings, Andres Freund
On 20 April 2015 at 09:28, Andres Freund <andres@anarazel.de> wrote:
> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
>> I just realized that it talks about "replication identifier" as the new
>> fundamental concept. The system table is called "pg_replication_identifier".
>> But that's like talking about "index identifiers", instead of just indexes,
>> and calling the system table pg_index_oid.
>>
>> The important concept this patch actually adds is the *origin* of each
>> transaction. That term is already used in some parts of the patch. I think
>> we should roughly do a search-replace of "replication identifier" ->
>> "replication origin" to the patch. Or even "transaction origin".
>
> Sounds good to me.
+1
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-04-20 10:28:02 +0200, Andres Freund wrote: > On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: > > I just realized that it talks about "replication identifier" as the new > > fundamental concept. The system table is called "pg_replication_identifier". > > But that's like talking about "index identifiers", instead of just indexes, > > and calling the system table pg_index_oid. > > > > The important concept this patch actually adds is the *origin* of each > > transaction. That term is already used in some parts of the patch. I think > > we should roughly do a search-replace of "replication identifier" -> > > "replication origin" to the patch. Or even "transaction origin". > > Sounds good to me. I'm working on changing this (I've implemented the missing WAL bits). I'd like to discuss the new terms for a sec, before I go and revise the docs. I'm now calling the feature 'replication progress tracking'. There's "replication origins" and there's progress tracking infrastructure that tracks how far data from a "replication origin" has replicated. Catalog wise there's an actual table 'pg_replication_origin' that maps between 'roident' and 'roname'. There's a pg_replication_progress view (used to be named pg_replication_identifier_progress). I'm not sure if the latter name isn't too generic? Maybe pg_logical_replication_progress? I've now named the functions: * pg_replication_origin_create * pg_replication_origin_drop * pg_replication_origin_get (map from name to id) * pg_replication_progress_setup_origin : configure session to replicate from a specific origin * pg_replication_progress_reset_origin * pg_replication_progress_setup_tx_details : configure per transaction details (LSN and timestamp currently) * pg_replication_progress_is_replaying : Is a origin configured for the session * pg_replication_progress_advance : "manually" set the replication progress to a value. Primarily useful for copying valuesfrom other systems and such. * pg_replication_progress_get : How far did replay progress for a certain origin * pg_get_replication_progress : SRF returning the replay progress for all origin. Any comments? Andres
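To make the proposed naming a little more concrete, here is a rough usage sketch from the point of view of a hypothetical apply worker. Only the function names are taken from the list above; the argument lists and types are assumptions made purely for illustration and may not match whatever eventually gets committed:

  -- Hypothetical usage sketch; only the function names come from the proposal,
  -- the signatures are assumed for illustration.

  -- Register the upstream node once and look up its local 16 bit id:
  SELECT pg_replication_origin_create('node_a');
  SELECT pg_replication_origin_get('node_a');

  -- In the apply session, mark everything replayed as coming from node_a:
  SELECT pg_replication_progress_setup_origin('node_a');
  SELECT pg_replication_progress_is_replaying();

  -- Per replayed transaction, record the remote commit LSN and timestamp:
  SELECT pg_replication_progress_setup_tx_details('0/1234568'::pg_lsn, now());

  -- Progress can also be moved manually (e.g. when copying state) and inspected:
  SELECT pg_replication_progress_advance('node_a', '0/1234568'::pg_lsn);
  SELECT pg_replication_progress_get('node_a');
  SELECT * FROM pg_get_replication_progress();

  -- Detach the session from the origin again:
  SELECT pg_replication_progress_reset_origin();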
Andres Freund wrote: > I'm working on changing this (I've implemented the missing WAL > bits). I'd like to discuss the new terms for a sec, before I go and > revise the docs. > > I'm now calling the feature 'replication progress tracking'. There's > "replication origins" and there's progress tracking infrastructure that > tracks how far data from a "replication origin" has replicated. Sounds good to me. > Catalog wise there's an actual table 'pg_replication_origin' that maps > between 'roident' and 'roname'. There's a pg_replication_progress view > (used to be named pg_replication_identifier_progress). I'm not sure if > the latter name isn't too generic? Maybe > pg_logical_replication_progress? I think if we wanted "pg_logical_replication_progress" (and I don't really agree that we do) then we would add the "logical" bit to the names above as well. This seems unnecessary. pg_replication_progress seems okay to me. > I've now named the functions: > > * pg_replication_origin_create > * pg_replication_origin_drop > * pg_replication_origin_get (map from name to id) > * pg_replication_progress_setup_origin : configure session to replicate > from a specific origin > * pg_replication_progress_reset_origin > * pg_replication_progress_is_replaying : Is a origin configured for the session > * pg_replication_progress_advance : "manually" set the replication > progress to a value. Primarily useful for copying values from other > systems and such. These all look acceptable to me. > * pg_replication_progress_get : How far did replay progress for a > certain origin > * pg_get_replication_progress : SRF returning the replay progress for > all origin. This combination seems confusing. In some other thread not too long ago there was the argument that "all functions 'get' something, so that verb should not appear in the function name". That would call for "pg_replication_progress" on the singleton. Maybe to distinguish the SRF, add "all" as a suffix? > * pg_replication_progress_setup_tx_details : configure per transaction > details (LSN and timestamp currently) Not sure about the "tx" here. We use "xact" as an abbreviation for "transaction" in most places. If nowadays we don't like that term, maybe just spell out "transaction" in full. I assume this function pairs up with pg_replication_progress_setup_origin, yes? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-04-21 12:20:42 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > Catalog wise there's an actual table 'pg_replication_origin' that maps > > between 'roident' and 'roname'. There's a pg_replication_progress view > > (used to be named pg_replication_identifier_progress). I'm not sure if > > the latter name isn't too generic? Maybe > > pg_logical_replication_progress? > > I think if we wanted "pg_logical_replication_progress" (and I don't > really agree that we do) then we would add the "logical" bit to the > names above as well. This seems unnecessary. pg_replication_progress > seems okay to me. Cool. > > * pg_replication_progress_get : How far did replay progress for a > > certain origin > > * pg_get_replication_progress : SRF returning the replay progress for > > all origin. > > This combination seems confusing. In some other thread not too long ago > there was the argument that "all functions 'get' something, so that verb > should not appear in the function name". > That would call for "pg_replication_progress" on the singleton. Hm. I don't like that. That'd e.g. clash with the above view. I think it's good to distinguish between functions (that have a verb in the name) and views/tables (that don't). I agree that the above combination isn't optimal. Although pg_get (and pg_stat_get) is what's used for a lot of other SRF backed views. Maybe naming the SRF pg_get_all_replication_progress? > > * pg_replication_progress_setup_tx_details : configure per transaction > > details (LSN and timestamp currently) > > Not sure about the "tx" here. We use "xact" as an abbreviation for > "transaction" in most places. Oh, yea. Xact is more consistent. > If nowadays we don't like that term, maybe just spell out > "transaction" in full. I assume this function pairs up with > pg_replication_progress_setup_origin, yes? pg_replication_progress_setup_origin sets up the per session state, setup_xact_details the "per replayed transaction" state. Greetings, Andres Freund
On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote: > I've now named the functions: > > * pg_replication_origin_create > * pg_replication_origin_drop > * pg_replication_origin_get (map from name to id) > * pg_replication_progress_setup_origin : configure session to replicate > from a specific origin > * pg_replication_progress_reset_origin > * pg_replication_progress_setup_tx_details : configure per transaction > details (LSN and timestamp currently) > * pg_replication_progress_is_replaying : Is a origin configured for the session > * pg_replication_progress_advance : "manually" set the replication > progress to a value. Primarily useful for copying values from other > systems and such. > * pg_replication_progress_get : How far did replay progress for a > certain origin > * pg_get_replication_progress : SRF returning the replay progress for > all origin. > > Any comments? Why are we using functions for this rather than DDL? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-04-21 16:26:08 -0400, Robert Haas wrote: > On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote: > > I've now named the functions: > > > > * pg_replication_origin_create > > * pg_replication_origin_drop > > * pg_replication_origin_get (map from name to id) > > * pg_replication_progress_setup_origin : configure session to replicate > > from a specific origin > > * pg_replication_progress_reset_origin > > * pg_replication_progress_setup_tx_details : configure per transaction > > details (LSN and timestamp currently) > > * pg_replication_progress_is_replaying : Is a origin configured for the session > > * pg_replication_progress_advance : "manually" set the replication > > progress to a value. Primarily useful for copying values from other > > systems and such. > > * pg_replication_progress_get : How far did replay progress for a > > certain origin > > * pg_get_replication_progress : SRF returning the replay progress for > > all origin. > > > > Any comments? > > Why are we using functions for this rather than DDL? Unless I miss something the only two we really could use ddl for is pg_replication_origin_create/pg_replication_origin_drop. We could use DDL for them if we really want, but I'm not really seeing the advantage. Greetings, Andres Freund
On 21/04/15 22:36, Andres Freund wrote: > On 2015-04-21 16:26:08 -0400, Robert Haas wrote: >> On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote: >>> I've now named the functions: >>> >>> * pg_replication_origin_create >>> * pg_replication_origin_drop >>> * pg_replication_origin_get (map from name to id) >>> * pg_replication_progress_setup_origin : configure session to replicate >>> from a specific origin >>> * pg_replication_progress_reset_origin >>> * pg_replication_progress_setup_tx_details : configure per transaction >>> details (LSN and timestamp currently) >>> * pg_replication_progress_is_replaying : Is a origin configured for the session >>> * pg_replication_progress_advance : "manually" set the replication >>> progress to a value. Primarily useful for copying values from other >>> systems and such. >>> * pg_replication_progress_get : How far did replay progress for a >>> certain origin >>> * pg_get_replication_progress : SRF returning the replay progress for >>> all origin. >>> >>> Any comments? >> >> Why are we using functions for this rather than DDL? > > Unless I miss something the only two we really could use ddl for is > pg_replication_origin_create/pg_replication_origin_drop. We could use > DDL for them if we really want, but I'm not really seeing the advantage. > I think the only value of having DDL for this would be consistency (catalog objects are created via DDL) as it looks like something that will be called only by extensions and not users during normal operation. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2015-03-24 22:22:29 +0100, Petr Jelinek wrote: > Perhaps we should have some Logical replication developer documentation > section and put all those three as subsections of that? So I just played around with this and didn't find it worthwhile. Primarily because there are lots of uses of logical decoding besides building a logical replication solution. I've reverted to putting it into a separate chapter 'besides' logical decoding. Greetings, Andres Freund
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: > On 04/17/2015 11:54 AM, Andres Freund wrote: > >I've attached a rebased patch, that adds decision about origin logging > >to the relevant XLogInsert() callsites for "external" 2 byte identifiers > >and removes the pad-reusing version in the interest of moving forward. > > Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: > > I just realized that it talks about "replication identifier" as the new > fundamental concept. The system table is called "pg_replication_identifier". > But that's like talking about "index identifiers", instead of just indexes, > and calling the system table pg_index_oid. > > The important concept this patch actually adds is the *origin* of each > transaction. That term is already used in some parts of the patch. I think > we should roughly do a search-replace of "replication identifier" -> > "replication origin" to the patch. Or even "transaction origin". Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. I'm pretty happy with this state. Greetings, Andres Freund
On 24/04/15 14:32, Andres Freund wrote: > On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: >> On 04/17/2015 11:54 AM, Andres Freund wrote: >>> I've attached a rebased patch, that adds decision about origin logging >>> to the relevant XLogInsert() callsites for "external" 2 byte identifiers >>> and removes the pad-reusing version in the interest of moving forward. >> >> Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: >> >> I just realized that it talks about "replication identifier" as the new >> fundamental concept. The system table is called "pg_replication_identifier". >> But that's like talking about "index identifiers", instead of just indexes, >> and calling the system table pg_index_oid. >> >> The important concept this patch actually adds is the *origin* of each >> transaction. That term is already used in some parts of the patch. I think >> we should roughly do a search-replace of "replication identifier" -> >> "replication origin" to the patch. Or even "transaction origin". > > Attached is a patch that does this, and some more, renaming. That was > more work than I'd imagined. I've also made the internal naming in > origin.c more consistent/simpler and did a bunch of other cleanup. > There are few oversights in the renaming: doc/src/sgml/func.sgml: + Return the replay position for the passed in replication + identifier. The parameter <parameter>flush</parameter> src/include/replication/origin.h: + * replication_identifier.h ---- +extern PGDLLIMPORT RepOriginId replident_sesssion_origin; +extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn; +extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp; (these are used then in multiple places in code afterwards and also mentioned in comment above replorigin_advance) src/backend/replication/logical/origin.c: + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("replication identiefer ---- + default: + elog(PANIC, "replident_redo: unknown op code ---- + * This function may only be called if a origin was setup with + * replident_session_setup(). I also think the "replident_checkpoint" file should be renamed to "replorigin_checkpoint". -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4/24/15 8:32 AM, Andres Freund wrote: > On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: >> On 04/17/2015 11:54 AM, Andres Freund wrote: >>> I've attached a rebased patch, that adds decision about origin logging >>> to the relevant XLogInsert() callsites for "external" 2 byte identifiers >>> and removes the pad-reusing version in the interest of moving forward. >> >> Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: >> >> I just realized that it talks about "replication identifier" as the new >> fundamental concept. The system table is called "pg_replication_identifier". >> But that's like talking about "index identifiers", instead of just indexes, >> and calling the system table pg_index_oid. >> >> The important concept this patch actually adds is the *origin* of each >> transaction. That term is already used in some parts of the patch. I think >> we should roughly do a search-replace of "replication identifier" -> >> "replication origin" to the patch. Or even "transaction origin". > > Attached is a patch that does this, and some more, renaming. That was > more work than I'd imagined. I've also made the internal naming in > origin.c more consistent/simpler and did a bunch of other cleanup. > > I'm pretty happy with this state. Shouldn't this be backed up by pg_dump(all?)?
On April 24, 2015 10:26:23 PM GMT+02:00, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 4/24/15 8:32 AM, Andres Freund wrote:
>> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
>>> On 04/17/2015 11:54 AM, Andres Freund wrote:
>>>> I've attached a rebased patch, that adds decision about origin logging
>>>> to the relevant XLogInsert() callsites for "external" 2 byte identifiers
>>>> and removes the pad-reusing version in the interest of moving forward.
>>>
>>> Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
>>>
>>> I just realized that it talks about "replication identifier" as the new
>>> fundamental concept. The system table is called "pg_replication_identifier".
>>> But that's like talking about "index identifiers", instead of just indexes,
>>> and calling the system table pg_index_oid.
>>>
>>> The important concept this patch actually adds is the *origin* of each
>>> transaction. That term is already used in some parts of the patch. I think
>>> we should roughly do a search-replace of "replication identifier" ->
>>> "replication origin" to the patch. Or even "transaction origin".
>>
>> Attached is a patch that does this, and some more, renaming. That was
>> more work than I'd imagined. I've also made the internal naming in
>> origin.c more consistent/simpler and did a bunch of other cleanup.
>>
>> I'm pretty happy with this state.
>
> Shouldn't this be backed up by pg_dump(all?)?

Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg_dump obviously can't know about them. So the represented data won't make much sense.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
On 4/24/15 4:29 PM, Andres Freund wrote: >> Shouldn't this be backed up by pg_dump(all?)? > > Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. Theother side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented datawon't make much sense. I agree it might not be the best match. But we should consider that we used to say, a backup by pg_dumpall plus configuration files is a complete backup. Now we have replication slots and possibly replication identifiers and maybe similar features in the future that are not covered by this backup method.
On Tue, Apr 28, 2015 at 10:00 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 4/24/15 4:29 PM, Andres Freund wrote: >>> Shouldn't this be backed up by pg_dump(all?)? >> >> Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. Theother side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented datawon't make much sense. > > I agree it might not be the best match. But we should consider that we > used to say, a backup by pg_dumpall plus configuration files is a > complete backup. Now we have replication slots and possibly replication > identifiers and maybe similar features in the future that are not > covered by this backup method. That's true. But if you did backup the replication slots with pg_dump, and then you restored them as part of restoring the dump, your replication setup would be just as broken as if you had never backed up those replication slots at all. I think the problem here is that replication slots are part of *cluster* configuration, not individual node configuration. If you back up a set of nodes that make up a cluster, and then restore them, you might hope that you will end up with working slots established between the same pairs of machines that had working slots between them before. But I don't see a way to make that happen when you look at it from the point of view of backing up and restoring just one node. As we get better clustering facilities into core, we may develop more instances of this problem; the best way of solving it is not clear to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 25 April 2015 at 00:32, Andres Freund <andres@anarazel.de> wrote:
> Attached is a patch that does this, and some more, renaming. That was
> more work than I'd imagined. I've also made the internal naming in
> origin.c more consistent/simpler and did a bunch of other cleanup.
Hi,
It looks like bowerbird is not too happy with this, neither is my build environment.
The attached seems to fix it.
I put the include in the origin.h rather than in the origin.c as the DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c.
Regards
David Rowley
On 29/04/15 22:12, David Rowley wrote: > On 25 April 2015 at 00:32, Andres Freund <andres@anarazel.de > <mailto:andres@anarazel.de>> wrote: > > > Attached is a patch that does this, and some more, renaming. That was > more work than I'd imagined. I've also made the internal naming in > origin.c more consistent/simpler and did a bunch of other cleanup. > > > Hi, > > It looks like bowerbird is not too happy with this, neither is my build > environment. > > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-04-29%2019%3A31%3A06 > > The attached seems to fix it. > I put the include in the origin.h rather than in the origin.c as the > DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c. > I think correct fix is using PG_UINT16_MAX. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services