Thread: Streaming replication and non-blocking I/O
I find the backend libpq changes related to non-blocking I/O quite complex. Can we find a simpler solution?

The problem we're trying to solve is that while the walsender backend sends a lot of WAL records to the client, the client can send a lot of messages to the backend. If the volume of messages from client to server exceeds both the input buffer in the server and the output buffer in the client, the client will block until the server has read some data. But if the client is blocked, it will not process incoming data from the server, and eventually the server will block too. And we have a deadlock. This: http://florin.bjdean.id.au/docs/omnimark/omni55/docs/html/concept/717.htm is a pretty good description of the problem.

The first question is: do we really need to be prepared for that? The XLogRecPtr acknowledgment messages the client sends are very small, and if the client is mindful about not sending them too often - perhaps max 1 ack per received XLOG message - the receive buffer in the backend should never fill up in practice.

If that's deemed not good enough, we could modify just internal_flush() so that it uses secure_poll to wait for the possibility to either read or write, instead of blocking on write alone. Whenever there's incoming data, read it into PqRecvBuffer for later processing, which keeps the OS input buffer from filling up. If PqRecvBuffer fills up, it can be extended, or we can start dropping old XLogRecPtr messages from it.

In any case, we'll need something like pq_wait to check if a message can be read without blocking, but that's just a small additional function, as opposed to a whole new API for assembling and sending messages without blocking.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Tue, Dec 8, 2009 at 11:23 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> The first question is: do we really need to be prepared for that? The
> XLogRecPtr acknowledgment messages the client sends are very small, and
> if the client is mindful about not sending them too often - perhaps max
> 1 ack per received XLOG message - the receive buffer in the backend
> should never fill up in practice.

It's OK to drop that feature.

> If that's deemed not good enough, we could modify just internal_flush()
> so that it uses secure_poll to wait for the possibility to either read
> or write, instead of blocking for just write. Whenever there's incoming
> data, read them into PqRecvBuffer for later processing, which keeps the
> OS input buffer from filling up. If PqRecvBuffer fills up, it can be
> extended, or we can start dropping old XLogRecPtr messages from it.

Extending PqRecvBuffer seems better, because XLogRecPtr messages come in several types (i.e., we cannot just drop an old message without parsing all the messages in the buffer).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Tue, Dec 8, 2009 at 11:23 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> If that's deemed not good enough, we could modify just internal_flush()
>> so that it uses secure_poll to wait for the possibility to either read
>> or write, instead of blocking for just write. Whenever there's incoming
>> data, read them into PqRecvBuffer for later processing, which keeps the
>> OS input buffer from filling up. If PqRecvBuffer fills up, it can be
>> extended, or we can start dropping old XLogRecPtr messages from it.
>
> Extending PqRecvBuffer seems better, because XLogRecPtr messages
> come in several types (i.e., we cannot just drop an old message without
> parsing all the messages in the buffer).

True. Another idea I had was to introduce a callback that backend libpq can call when the buffer fills. Walsender would set the callback to ProcessStreamMsgs().

But if everyone is happy with just relying on the OS buffer to not fill up, let's just drop it.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Wed, Dec 9, 2009 at 3:58 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> True. Another idea I had was to introduce a callback that backend libpq
> can call when the buffer fills. Walsender would set the callback to
> ProcessStreamMsgs().
>
> But if everyone is happy with just relying on the OS buffer to not fill
> up, let's just drop it.

The OS buffer is expected to be able to store a large number of XLogRecPtr messages, because each message is small. So it's also OK to just drop it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Wed, Dec 9, 2009 at 3:58 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> But if everyone is happy with just relying on the OS buffer to not fill
>> up, let's just drop it.

> The OS buffer is expected to be able to store a large number of
> XLogRecPtr messages, because each message is small. So it's also OK
> to just drop it.

It certainly seems to be something we could improve later, when and if evidence emerges that it's a real-world problem. For now, simple is beautiful.

			regards, tom lane
On Thu, Dec 10, 2009 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The OS buffer is expected to be able to store a large number of
>> XLogRecPtr messages, because each message is small. So it's also OK
>> to just drop it.
>
> It certainly seems to be something we could improve later, when and
> if evidence emerges that it's a real-world problem. For now,
> simple is beautiful.

I just dropped the backend libpq changes related to non-blocking I/O.

git://git.postgresql.org/git/users/fujii/postgres.git
branch: replication

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Thu, Dec 10, 2009 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The OS buffer is expected to be able to store a large number of
>>> XLogRecPtr messages, because each message is small. So it's also OK
>>> to just drop it.
>> It certainly seems to be something we could improve later, when and
>> if evidence emerges that it's a real-world problem. For now,
>> simple is beautiful.
>
> I just dropped the backend libpq changes related to non-blocking I/O.
>
> git://git.postgresql.org/git/users/fujii/postgres.git
> branch: replication

Thanks, much simpler now.

Changing the finish_time argument of pqWaitTimed() into timeout_ms changes the behavior of the connect_timeout option to PQconnectdb. It should wait for at most connect_timeout seconds in total, but now it waits for connect_timeout seconds at each step in the connection process: opening a socket, authenticating, etc.

Could we change the API of PQgetXLogData to be more like PQgetCopyData? I'm thinking of removing the timeout argument, and instead looping with select/poll and PQconsumeInput in the caller. That probably means introducing a new state analogous to PGASYNC_COPY_IN. I haven't thought this fully through yet, but it seems like it would be good to have a consistent API.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Changing the finish_time argument of pqWaitTimed() into timeout_ms changes
> the behavior of the connect_timeout option to PQconnectdb. It should wait
> for at most connect_timeout seconds in total, but now it waits for
> connect_timeout seconds at each step in the connection process: opening
> a socket, authenticating, etc.

Refresh my memory as to why this patch is touching any of that code at all?

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Changing the finish_time argument of pqWaitTimed() into timeout_ms changes
>> the behavior of the connect_timeout option to PQconnectdb. It should wait
>> for at most connect_timeout seconds in total, but now it waits for
>> connect_timeout seconds at each step in the connection process: opening
>> a socket, authenticating, etc.
>
> Refresh my memory as to why this patch is touching any of that code at
> all?

Walreceiver wants to wait for data to arrive from the master, or for a signal. PQgetXLogData(), which is the libpq function to read a piece of WAL, takes a timeout argument to support that. Walreceiver calls PQgetXLogData() in an endless loop, checking for a received SIGHUP or death of the postmaster at every iteration.

In synchronous replication mode, I presume it's also going to listen for a signal from the startup process, so that it can send an acknowledgment to the master as soon as a COMMIT record has been replayed that a backend on the master is waiting for.

To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed to take a timeout instead of a finish_time argument. That was a mistake, because it breaks PQconnectdb, and as I said, I don't think PQgetXLogData() should have a timeout argument to begin with. Instead, it should have a boolean 'async' argument to return immediately if there's no data, and the walreceiver main loop should call poll()/select() to wait. I.e. just like PQgetCopyData() works.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Sun, Dec 13, 2009 at 5:42 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Walreceiver wants to wait for data to arrive from the master, or for a
> signal. PQgetXLogData(), which is the libpq function to read a piece of
> WAL, takes a timeout argument to support that. Walreceiver calls
> PQgetXLogData() in an endless loop, checking for a received SIGHUP or
> death of the postmaster at every iteration.
>
> In synchronous replication mode, I presume it's also going to listen
> for a signal from the startup process, so that it can send an
> acknowledgment to the master as soon as a COMMIT record has been
> replayed that a backend on the master is waiting for.

Right.

> To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed
> to take a timeout instead of a finish_time argument. That was a mistake,
> because it breaks PQconnectdb, and as I said, I don't think
> PQgetXLogData() should have a timeout argument to begin with. Instead,
> it should have a boolean 'async' argument to return immediately if
> there's no data, and the walreceiver main loop should call poll()/select()
> to wait. I.e. just like PQgetCopyData() works.

Seems good. I'll revise the code.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Sun, Dec 13, 2009 at 5:42 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed
>> to take a timeout instead of a finish_time argument. That was a mistake,
>> because it breaks PQconnectdb, and as I said, I don't think
>> PQgetXLogData() should have a timeout argument to begin with. Instead,
>> it should have a boolean 'async' argument to return immediately if
>> there's no data, and the walreceiver main loop should call poll()/select()
>> to wait. I.e. just like PQgetCopyData() works.

> Seems good. I'll revise the code.

Do we need a new "PQgetXLogData" function at all? Seems like you could shove the data through the COPY protocol and not have to touch libpq at all, rather than duplicating a nontrivial amount of code there.

			regards, tom lane
On Mon, Dec 14, 2009 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do we need a new "PQgetXLogData" function at all? Seems like you could
> shove the data through the COPY protocol and not have to touch libpq
> at all, rather than duplicating a nontrivial amount of code there.

Yeah, I also think that all the data (the WAL data itself, its LSN, and the flag bits) which PQgetXLogData handles could be shoved through the COPY protocol. But, outside libpq, it's somewhat messy to extract the LSN and the flag bits from the data buffer which PQgetCopyData returns, by using ntohs(). So I provided the new libpq function only for replication. That is, I didn't want to expose the low network layer which libpq should handle.

I think that such a friendly function would be useful to implement a standby program (e.g., a stand-alone walreceiver tool) outside the core.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Dec 12, 2009 at 5:09 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Could we change the API of PQgetXLogData to be more like PQgetCopyData?
> I'm thinking of removing the timeout argument, and instead looping with
> select/poll and PQconsumeInput in the caller. That probably means
> introducing a new state analogous to PGASYNC_COPY_IN. I haven't thought
> this fully through yet, but it seems like it would be good to have a
> consistent API.

On a related issue, so far I haven't considered how to output the notice messages at all :( In the current SR, they are always written to stderr by the defaultNoticeProcessor using fprintf, whether or not log_destination is specified. This is bizarre, and would need to be fixed.

I'm going to set a new function calling ereport as the current notice processor by using PQsetNoticeProcessor. But the problem is that only the complete message like "NOTICE: xxx" is passed to such a notice processor, i.e., the error level itself is not passed. So I wonder which error level should be used to output the notice message. There are some approaches to address this:

1. Always use a specific level, without regard to the actual one
2. Reverse-engineer the level from the complete message
3. Change some libpq functions so as to pass the error level to the notice processor

But nothing really stands out. Do you have another good idea?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Mon, Dec 14, 2009 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Do we need a new "PQgetXLogData" function at all? Seems like you could
>> shove the data through the COPY protocol and not have to touch libpq
>> at all, rather than duplicating a nontrivial amount of code there.

> Yeah, I also think that all the data (the WAL data itself, its LSN, and
> the flag bits) which PQgetXLogData handles could be shoved
> through the COPY protocol. But, outside libpq, it's somewhat messy
> to extract the LSN and the flag bits from the data buffer which
> PQgetCopyData returns, by using ntohs(). So I provided the new
> libpq function only for replication. That is, I didn't want to expose
> the low network layer which libpq should handle.

I find that a completely unconvincing division of labor. Who is to say that the LSN is the only part of the data that needs special treatment?

The very, very large practical problem with this is that if you decide to change the behavior at any time, the only way to be sure that the WAL receiver is using the right libpq version is to perform a soname major version bump. The transformations done by libpq will essentially become part of its ABI, and not a very visible part at that.

I am going to insist that no such logic be placed in libpq. From a packager's standpoint that's insanity.

			regards, tom lane
Fujii Masao <masao.fujii@gmail.com> writes:
> I'm going to set a new function calling ereport as the current notice
> processor by using PQsetNoticeProcessor. But the problem is that only the
> complete message like "NOTICE: xxx" is passed to such a notice processor,
> i.e., the error level itself is not passed.

Use PQsetNoticeReceiver. The other one is just there for backwards compatibility.

			regards, tom lane
Tom Lane wrote:
> The very, very large practical problem with this is that if you decide
> to change the behavior at any time, the only way to be sure that the WAL
> receiver is using the right libpq version is to perform a soname major
> version bump. The transformations done by libpq will essentially become
> part of its ABI, and not a very visible part at that.

Not having to change the libpq API would certainly be a big advantage.

It's going to be a bit more complicated in walsender/walreceiver to work with the libpq COPY API. We're going to need a WAL sending/receiving protocol on top of it, defined in terms of rows and columns passed through the COPY protocol.

One problem is that the standby is supposed to send back acknowledgments to the master, telling it how far it has received/replayed the WAL. Is there any way to send information back to the server while a COPY OUT is in progress? That's not absolutely necessary with asynchronous replication, but it will be with synchronous.

One idea is to stop/start the COPY between every batch of WAL records sent, giving the client (= walreceiver) a chance to send messages back. But that will lead to extra round trips.

BTW, something that's been bothering me a bit with this patch is that we now have to link the backend with libpq. I don't see an immediate problem with that, but I'm not a packager. Does anyone see a problem with that?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> It's going to be a bit more complicated in walsender/walreceiver to work
> with the libpq COPY API. We're going to need a WAL sending/receiving
> protocol on top of it, defined in terms of rows and columns passed
> through the COPY protocol.

AFAIR, libpq knows essentially nothing of the data being passed through COPY --- it just treats that as a byte stream. I think you can define any data format you want; it doesn't need to look exactly like a COPY of a table would. In fact it's probably a lot better if it DOESN'T look like COPY data once it gets past libpq, so that you can check that it is WAL and not COPY data.

> One problem is that the standby is supposed to send back acknowledgments
> to the master, telling it how far it has received/replayed the WAL. Is
> there any way to send information back to the server, while a COPY OUT
> is in progress? That's not absolutely necessary with asynchronous
> replication, but will be with synchronous.

Well, a real COPY would of course not stop to look for incoming messages, but I don't think that's inherent in the protocol. You would likely need some libpq adjustments so it didn't throw an error when you tried that, but it would be a small and one-time adjustment.

> BTW, something that's been bothering me a bit with this patch is that we
> now have to link the backend with libpq. I don't see an immediate
> problem with that, but I'm not a packager. Does anyone see a problem
> with that?

Yeah, I have a problem with that. What's the backend doing with libpq? It's not receiving this data, it's sending it.

			regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Yeah, I have a problem with that. What's the backend doing with libpq?
>> It's not receiving this data, it's sending it.

> walreceiver is a postmaster subprocess too.

Hm. Perhaps it should be a loadable plugin and not hard-linked into the backend? Compare dblink.

The main concern I have with hard-linking libpq is that it has a lot of symbol conflicts with the backend --- and at least the ones from src/port/ aren't easily removed. I foresee problems that will be very difficult to fix on platforms where we can't filter the set of link symbols exposed by libpq. Linking a thread-enabled libpq into the backend could also create problems on some platforms --- it would likely cause a thread-capable libc to get linked, which is not what we want.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> BTW, something that's been bothering me a bit with this patch is that we
>> now have to link the backend with libpq. I don't see an immediate
>> problem with that, but I'm not a packager. Does anyone see a problem
>> with that?
>
> Yeah, I have a problem with that. What's the backend doing with libpq?
> It's not receiving this data, it's sending it.

walreceiver is a postmaster subprocess too.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Tue, Dec 15, 2009 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm. Perhaps it should be a loadable plugin and not hard-linked into the
> backend? Compare dblink.

You mean that such a plugin is supplied in shared_preload_libraries, a new process is forked, and the shared memory related to walreceiver is created by using shmem_startup_hook? Since this approach would solve the problem discussed previously, ISTM this makes sense.

http://archives.postgresql.org/pgsql-hackers/2009-11/msg00031.php

Some additional code might be required to control the termination of walreceiver.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Dec 15, 2009 at 3:47 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Tom Lane wrote:
>> The very, very large practical problem with this is that if you decide
>> to change the behavior at any time, the only way to be sure that the WAL
>> receiver is using the right libpq version is to perform a soname major
>> version bump. The transformations done by libpq will essentially become
>> part of its ABI, and not a very visible part at that.
>
> Not having to change the libpq API would certainly be a big advantage.

Done; I replaced PQgetXLogData and PQputXLogRecPtr with PQgetCopyData and PQputCopyData.

git://git.postgresql.org/git/users/fujii/postgres.git
branch: replication

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Tue, Dec 15, 2009 at 3:47 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Tom Lane wrote:
>>> The very, very large practical problem with this is that if you decide
>>> to change the behavior at any time, the only way to be sure that the WAL
>>> receiver is using the right libpq version is to perform a soname major
>>> version bump. The transformations done by libpq will essentially become
>>> part of its ABI, and not a very visible part at that.
>> Not having to change the libpq API would certainly be a big advantage.
>
> Done; I replaced PQgetXLogData and PQputXLogRecPtr with PQgetCopyData and
> PQputCopyData.

Great! The logical next step is to move the handling of TimelineID and system identifier out of libpq as well.

I'm thinking of refactoring the protocol along these lines:

0. Begin by connecting to the master just like a normal backend does. We don't necessarily need the new ProtocolVersion code either, though it's probably still a good idea to reject connections to older server versions.

1. Get the system identifier of the master.

Slave -> Master: Query message, with a query string like "GET_SYSTEM_IDENTIFIER"

Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages. The system identifier is returned in the DataRow message.

This is identical to what happens when a query is executed against a normal backend using the simple query protocol, so walsender can use PQexec() for this.

2. Another query exchange like above, for timeline ID. (Or these two steps can be joined into one query, to eliminate one round-trip.)

3. Request a backup history file, if needed:

Slave -> Master: Query message, with a query string like "GET_BACKUP_HISTORY_FILE XXX" where XXX is an XLogRecPtr or file name.

Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages as usual. The file contents are returned in the DataRow message.

4. Start replication

Slave -> Master: Query message, with query string "START REPLICATION: XXXX", where XXXX is the RecPtr of the starting point.

Master -> Slave: CopyOutResponse followed by a continuous stream of CopyData messages with WAL contents.

This minimizes the changes to the protocol and libpq, with a clear way of extending by adding new commands. Similar to what you did a long time ago, connecting as an actual backend at first and then switching to walsender mode after running a few queries, but this would all be handled in a separate loop in walsender instead of running as a full-blown backend. We'll still need small changes to libpq to allow sending messages back to the server in COPY_IN mode (maybe add a new COPY_IN_OUT mode for that).

Thoughts?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
I'm interested in abstracting out features of replication from libpq too. It would be nice if we could implement different communication bus modules.

For example, if you have dozens of replicas you may want to use something like spread to distribute the records using multicast.

Sorry for top posting -- I haven't yet figured out how not to in this client.

On 16 Dec 2009 09:54, "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com> wrote:
> [...]
On Wed, Dec 16, 2009 at 6:53 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Great! The logical next step is to move the handling of TimelineID and
> system identifier out of libpq as well.

All right.

> 0. Begin by connecting to the master just like a normal backend does. We
> don't necessarily need the new ProtocolVersion code either, though it's
> probably still a good idea to reject connections to older server versions.

And, I think that such a backend should switch to walsender mode when the startup packet arrives. Otherwise, we would have to authenticate the backend twice in different contexts, i.e., as a normal backend and as walsender, so settings for each context would be required in pg_hba.conf. This is odd, I think. Thoughts?

> 1. Get the system identifier of the master.
>
> Slave -> Master: Query message, with a query string like
> "GET_SYSTEM_IDENTIFIER"
>
> Master -> Slave: RowDescription, DataRow, CommandComplete, and
> ReadyForQuery messages. The system identifier is returned in the DataRow
> message.
>
> This is identical to what happens when a query is executed against a
> normal backend using the simple query protocol, so walsender can use
> PQexec() for this.

s/walsender/walreceiver ?

A signal cannot cancel PQexec() while it is waiting for a message from the server. We might need to change the SIGTERM handler of walreceiver so as to call proc_exit() immediately if the signal arrives during PQexec().

> 2. Another query exchange like above, for timeline ID. (Or these two
> steps can be joined into one query, to eliminate one round-trip.)
>
> 3. Request a backup history file, if needed:
>
> Slave -> Master: Query message, with a query string like
> "GET_BACKUP_HISTORY_FILE XXX" where XXX is an XLogRecPtr or file name.
>
> Master -> Slave: RowDescription, DataRow, CommandComplete, and
> ReadyForQuery messages as usual. The file contents are returned in the
> DataRow message.
>
> 4. Start replication
>
> Slave -> Master: Query message, with query string "START REPLICATION:
> XXXX", where XXXX is the RecPtr of the starting point.
>
> Master -> Slave: CopyOutResponse followed by a continuous stream of
> CopyData messages with WAL contents.

Seems OK.

> This minimizes the changes to the protocol and libpq, with a clear way
> of extending by adding new commands. Similar to what you did a long time
> ago, connecting as an actual backend at first and then switching to
> walsender mode after running a few queries, but this would all be
> handled in a separate loop in walsender instead of running as a
> full-blown backend.

Agreed. Only walsender should be allowed to handle the query strings that you proposed, so that we avoid touching the parser.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Wed, Dec 16, 2009 at 6:53 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> 0. Begin by connecting to the master just like a normal backend does. We
>> don't necessarily need the new ProtocolVersion code either, though it's
>> probably still a good idea to reject connections to older server versions.
>
> And, I think that such a backend should switch to walsender mode when the
> startup packet arrives. Otherwise, we would have to authenticate the
> backend twice in different contexts, i.e., as a normal backend and as
> walsender, so settings for each context would be required in pg_hba.conf.
> This is odd, I think. Thoughts?

True.

>> This is identical to what happens when a query is executed against a
>> normal backend using the simple query protocol, so walsender can use
>> PQexec() for this.
>
> s/walsender/walreceiver ?

Right.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, Dec 17, 2009 at 9:02 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> And, I think that such a backend should switch to walsender mode when the
>> startup packet arrives. Otherwise, we would have to authenticate the
>> backend twice in different contexts, i.e., as a normal backend and as
>> walsender, so settings for each context would be required in pg_hba.conf.
>> This is odd, I think. Thoughts?
>
> True.

Currently this switch depends on whether XLOG_STREAMING_CODE is sent from the standby or not, which in turn depends on whether PQstartXLogStreaming() is called or not. But, as the next step, we should get rid of such libpq changes as well.

I'm thinking of making the standby send the "walsender-switch-code" the same way as application_name: walreceiver always specifies an option like "replication=on" in the conninfo string and calls PQconnectdb(), which sends the code as part of the startup packet. And no environment variable should be defined for it, to avoid misconfiguration by users, I think.

Thoughts? Better ideas?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote: > I'm thinking of making the standby send the "walsender-switch-code" the same way > as application_name; walreceiver always specifies the option like > "replication=on" > in conninfo string and calls PQconnectdb(), which sends the code as a part of > startup packet. And, the environment variable for that should not be defined to > avoid user's mis-configuration, I think. Sounds good. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 17, 2009 at 10:25 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Fujii Masao wrote: >> I'm thinking of making the standby send the "walsender-switch-code" the same way >> as application_name; walreceiver always specifies the option like >> "replication=on" >> in conninfo string and calls PQconnectdb(), which sends the code as a part of >> startup packet. And, the environment variable for that should not be defined to >> avoid user's mis-configuration, I think. > > Sounds good. Okay. Design clarification again:

0. Begin by connecting to the master using PQconnectdb() with a new conninfo option specifying the request for replication. The startup packet with the request is sent to the master, and the backend then switches to walsender mode. The walsender goes into its main loop and waits for requests from the walreceiver.

1. Get the system identifier of the master. Slave -> Master: Query message, with a query string like "GET_SYSTEM_IDENTIFIER". Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages. The system identifier is returned in the DataRow message.

2. Another query exchange like the above, for the timeline ID. Slave -> Master: Query message, with a query string like "GET_TIMELINE". Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages. The timeline ID is returned in the DataRow message.

3. Request a backup history file, if needed. Slave -> Master: Query message, with a query string like "GET_BACKUP_HISTORY_FILE XXX", where XXX is an XLogRecPtr. Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages as usual. The file contents are returned in the DataRow message.

In 1, 2, and 3, the walreceiver uses PQexec() to send the Query message and receive the results.

4. Start replication. Slave -> Master: Query message, with query string "START REPLICATION: XXXX", where XXXX is the RecPtr of the starting point.
Master -> Slave: CopyOutResponse followed by a continuous stream of CopyData messages with WAL contents. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
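On the walsender side, step 4 implies parsing the WAL location out of the "START REPLICATION: XXXX" query string before streaming can start. A minimal sketch of that parsing, assuming the usual hexadecimal hi/lo notation for WAL locations; the XLogRecPtr struct here is only a stand-in for the backend's real definition, and the function name is made up for the example:

```c
#include <stdio.h>
#include <stdint.h>

/* Stand-in for the backend's XLogRecPtr (see src/include/access/xlogdefs.h). */
typedef struct XLogRecPtr
{
    uint32_t xlogid;   /* high 32 bits of the WAL location */
    uint32_t xrecoff;  /* low 32 bits: byte offset */
} XLogRecPtr;

/*
 * Parse a "START REPLICATION: X/X" command string into an XLogRecPtr.
 * Returns 1 on success, 0 if the string is not a valid command.
 */
int
parse_start_replication(const char *query, XLogRecPtr *recptr)
{
    unsigned int hi, lo;

    if (sscanf(query, "START REPLICATION: %X/%X", &hi, &lo) != 2)
        return 0;
    recptr->xlogid = hi;
    recptr->xrecoff = lo;
    return 1;
}
```

The walsender would then compare the parsed location against the WAL it has available to decide where the CopyData stream begins.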
On Fri, Dec 18, 2009 at 11:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Okey. Design clarification again; > > 0. Begin by connecting to the master using PQconnectdb() with new conninfo > option specifying the request of replication. The startup packet with the > request is sent to the master, then the backend switches to the walsender > mode. The walsender goes into the main loop and wait for the request from > the walreceiver. <snip> > 4. Start replication > > Slave -> Master: Query message, with query string "START REPLICATION: > XXXX", where XXXX is the RecPtr of the starting point. > > Master -> Slave: CopyOutResponse followed by a continuous stream of > CopyData messages with WAL contents. Done. Currently there is no new libpq function for replication. The walreceiver uses only existing functions like PQconnectdb, PQexec, PQgetCopyData, etc. git://git.postgresql.org/git/users/fujii/postgres.git branch: replication Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Fri, Dec 18, 2009 at 11:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Okey. Design clarification again; >> >> 0. Begin by connecting to the master using PQconnectdb() with new conninfo >> option specifying the request of replication. The startup packet with the >> request is sent to the master, then the backend switches to the walsender >> mode. The walsender goes into the main loop and wait for the request from >> the walreceiver. > <snip> >> 4. Start replication >> >> Slave -> Master: Query message, with query string "START REPLICATION: >> XXXX", where XXXX is the RecPtr of the starting point. >> >> Master -> Slave: CopyOutResponse followed by a continuous stream of >> CopyData messages with WAL contents. > > Done. Currently there is no new libpq function for replication. The > walreceiver uses only existing functions like PQconnectdb, PQexec, > PQgetCopyData, etc. Ok thanks, sounds good, I'll take a look. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Fujii Masao wrote: > On Tue, Dec 15, 2009 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hm. Perhaps it should be a loadable plugin and not hard-linked into the >> backend? Compare dblink. > > You mean that such plugin is supplied in shared_preload_libraries, > a new process is forked and the shared-memory related to walreceiver > is created by using shmem_startup_hook? Since this approach would > solve the problem discussed previously, ISTM this makes sense. > http://archives.postgresql.org/pgsql-hackers/2009-11/msg00031.php > > Some additional code might be required to control the termination > of walreceiver. I'm not sure which problem in that thread you're referring to, but I can see two options: 1. Use dlopen()/dlsym() in walreceiver to use libpq. A bit awkward, though we could write a bunch of macros to hide that and make the libpq calls look normal. 2. Move walreceiver altogether into a loadable module, which is linked as usual to libpq. Like e.g. contrib/dblink. Thoughts? Both seem reasonable to me. I tested the 2nd option (see 'replication' branch in my git repository), splitting walreceiver.c into two: the functions that run in the walreceiver process, and the functions that are called from other processes to control walreceiver. That's quite a nice separation, though of course we could do that with the 1st approach as well. PS. I just merged with CVS HEAD. Streaming replication is pretty awesome with Hot Standby! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Fujii Masao wrote: > I'm not sure which problem in that thread you're referring to, but I can > see two options: > 1. Use dlopen()/dlsym() in walreceiver to use libpq. A bit awkward, > though we could write a bunch of macros to hide that and make the libpq > calls look normal. > 2. Move walreceiver altogether into a loadable module, which is linked > as usual to libpq. Like e.g contrib/dblink. > Thoughts? Both seem reasonable to me. From a packager's standpoint the second is much saner. If you want to use dlopen() then you will have to know the exact name of the .so file (e.g. libpq.so.5.3) and possibly its location too. Or you will have to persuade packagers that they should ship bare "libpq.so" symlinks, which is contrary to packaging standards on most Linux distros. (walreceiver.so wouldn't be subject to those standards, but libpq is because it's a regular library that can also be hard-linked by applications.) regards, tom lane
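Tom's packaging point is easy to demonstrate with any system library. A hedged sketch of option 1's mechanism (assuming a Linux-like system where the runtime library is installed under its versioned soname, e.g. "libm.so.6", while the bare "libm.so" symlink only ships with the development package — the same situation as "libpq.so.5.3" vs. "libpq.so"):

```c
#include <dlfcn.h>
#include <stddef.h>

typedef double (*math_fn)(double);

/*
 * Resolve a function from a shared library loaded at runtime.  The
 * caller must pass the exact versioned soname (e.g. "libm.so.6"):
 * on most distros the unversioned symlink may simply not be there,
 * which is why dlopen()ing libpq would force packagers' hands.
 */
math_fn
resolve_math_fn(const char *soname, const char *symbol)
{
    void *handle = dlopen(soname, RTLD_NOW);

    if (handle == NULL)
        return NULL;
    /* Object-to-function-pointer cast: not strictly ISO C, but
     * well-defined on POSIX systems, which is what dlsym requires. */
    return (math_fn) dlsym(handle, symbol);
}
```

The macro wrapper Heikki mentions would do exactly this once per libpq entry point, caching the resolved pointers.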
On Tue, Dec 22, 2009 at 2:31 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > 2. Move walreceiver altogether into a loadable module, which is linked > as usual to libpq. Like e.g contrib/dblink. > > Thoughts? Both seem reasonable to me. I tested the 2nd option (see > 'replication' branch in my git repository), splitting walreceiver.c into > two: the functions that run in the walreceiver process, and the > functions that are called from other processes to control walreceiver. > That's a quite nice separation, though of course we could do that with > the 1st approach as well. Though I'm not sure I understand what a loadable module means, I wonder how the walreceiver module is loaded. AFAIK, we need to manually install the dblink functions by executing dblink.sql before using them. Likewise, if we choose the 2nd option, must we manually install the walreceiver module before starting replication? Or do we automatically install it by executing system_views.sql, like pg_start_backup? I'd like to reduce the number of installation operations as much as possible. Is my concern beside the point? > PS. I just merged with CVS HEAD. Streaming replication is pretty awesome > with Hot Standby! Thanks! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > Though I seem not to understand what a loadable module means, I wonder > how the walreceiver module is loaded. Put it in shared_preload_libraries, perhaps. regards, tom lane
Fujii Masao wrote: > On Tue, Dec 22, 2009 at 2:31 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> 2. Move walreceiver altogether into a loadable module, which is linked >> as usual to libpq. Like e.g contrib/dblink. >> >> Thoughts? Both seem reasonable to me. I tested the 2nd option (see >> 'replication' branch in my git repository), splitting walreceiver.c into >> two: the functions that run in the walreceiver process, and the >> functions that are called from other processes to control walreceiver. >> That's a quite nice separation, though of course we could do that with >> the 1st approach as well. > > Though I seem not to understand what a loadable module means, I wonder > how the walreceiver module is loaded. AFAIK, we need to manually install > the dblink functions by executing dblink.sql before using them. Likewise, > if we choose the 2nd option, we must manually install the walreceiver > module before starting replication? I think we can just use load_external_function() to load the library and call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the library name. Walreceiver is quite tightly coupled with the rest of the backend anyway, so I don't think we need to come up with a pluggable API at the moment. That's the way I did it yesterday, see 'replication' branch in my git repository, but it looks like I fumbled the commit so that some of the changes were committed as part of the merge commit with origin/master (=CVS HEAD). Sorry about that. shared_preload_libraries seems like a bad place because the library doesn't need to be loaded in all backends. Just the walreceiver process. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 22, 2009 at 3:30 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I think we can just use load_external_function() to load the library and > call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the > library name. Walreceiver is quite tightly coupled with the rest of the > backend anyway, so I don't think we need to come up with a pluggable API > at the moment. > > That's the way I did it yesterday, see 'replication' branch in my git > repository, but it looks like I fumbled the commit so that some of the > changes were committed as part of the merge commit with origin/master > (=CVS HEAD). Sorry about that. Umm.., I still cannot find the place where the walreceiver module is loaded by using load_external_function() in your 'replication' branch. Also the compilation of that branch fails. Is the 'pushed' branch the latest? Sorry if I'm missing something. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Dec 22, 2009 at 6:30 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I think we can just use load_external_function() to load the library and > call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the > library name. Walreceiver is quite tightly coupled with the rest of the > backend anyway, so I don't think we need to come up with a pluggable API > at the moment. Please? I am really interested in replacing walsender and walreceiver with something which uses a communication bus like spread instead of a single point-to-point connection. ISTM if we start with something tightly coupled it'll be hard to decouple later. Whereas if we start with a limited interface we'll learn just how much information is really required by the modules and will have fewer surprises later when we find surprising interdependencies. -- greg
Fujii Masao wrote: > On Tue, Dec 22, 2009 at 3:30 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I think we can just use load_external_function() to load the library and >> call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the >> library name. Walreceiver is quite tightly coupled with the rest of the >> backend anyway, so I don't think we need to come up with a pluggable API >> at the moment. >> >> That's the way I did it yesterday, see 'replication' branch in my git >> repository, but it looks like I fumbled the commit so that some of the >> changes were committed as part of the merge commit with origin/master >> (=CVS HEAD). Sorry about that. > > Umm.., I still cannot find the place where the walreceiver module is > loaded by using load_external_function() in your 'replication' branch. > Also the compilation of that branch fails. Is the 'pushed' branch the > latest? Sorry if I'm missing something. Ah, I see. The changes were not included in the merge commit after all, but I had simply forgotten to "git add" them. Sorry about that, should be there now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Greg Stark wrote: > On Tue, Dec 22, 2009 at 6:30 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I think we can just use load_external_function() to load the library and >> call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the >> library name. Walreceiver is quite tightly coupled with the rest of the >> backend anyway, so I don't think we need to come up with a pluggable API >> at the moment. > > Please? I am really interested in replacing walsender and walreceiver > with something which uses a communication bus like spread instead of a > single point to point connection. I think you'd still need to be able to request older WAL segments to resync after a lost connection, restore from base backup etc., which don't really fit into a publish/subscribe style communication bus. I'm sure it could all be solved though. It would be a pretty cool feature for scaling to a large number of slaves. > ISTM if we start with something tightly coupled it'll be hard to > decouple later. Whereas if we start with a limited interface we'll > learn just how much information is really required by the modules and > will have fewer surprises later when we find suprising > interdependencies. I'm all ears if you have a concrete proposal. I'm not too worried about it being hard to decouple later. The interface is actually quite limited already, as the communication between processes is done via shared memory. It probably wouldn't be hard to turn it into an API, but I don't think there's a hurry to do that until someone actually steps up to write an alternative walreceiver/walsender. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 22, 2009 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Ah, I see. The changes were not included in the merge commit after all, > but I had simple forgot to "git add" them. Sorry about that, should be > there now. Thanks for doing "git push" again! But the compilation still fails. Attached patch addresses this problem. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
I've merged the replication branch with PostgreSQL CVS HEAD now, including the patch for end-of-backup WAL records I committed earlier today. See 'replication' branch in my git repository. There's also a couple of other small changes: I believe the SSL stuff isn't really necessary, so I removed it. I also moved the START_REPLICATION phase from the walreceiver main loop to WalRcvConnect, as it's simpler that way. I will continue reviewing.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Jan 5, 2010 at 12:22 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I've merged the replication branch with PostgreSQL CVS HEAD now, > including the patch for end-of-backup WAL records I committed earlier > today. See 'replication' branch in my git repository. > > There's also a couple of other small changes: I believe the SSL stuff > isn't really necessary, so I removed it. I also moved the > START_REPLICATION phase from the walreceiver main loop to WalRcvConnect, > as it's simpler that way. I also fixed a couple of small bugs:

* The ErrorResponse message from the primary server had been ignored
* The segment-boundary had been wrongly handled
* Valid replication starting location had been wrongly regarded as invalid

git://git.postgresql.org/git/users/fujii/postgres.git branch: replication

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Dec 22, 2009 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> Umm.., I still cannot find the place where the walreceiver module is >> loaded by using load_external_function() in your 'replication' branch. >> Also the compilation of that branch fails. Is the 'pushed' branch the >> latest? Sorry if I'm missing something. > > Ah, I see. The changes were not included in the merge commit after all, > but I had simple forgot to "git add" them. Sorry about that, should be > there now. This change which moves walreceiver process into a dynamically loaded module caused the following compile error on my MinGW environment. --------------------------- gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g -I. -I../../../../src/interfaces/libpq -I../../../../src/include -I./src/include/port/win32 -DEXEC_BACKEND "-I../../../../src/include/port/win32" -DBUILDING_DLL -c -o walreceiverproc.o walreceiverproc.c dlltool --export-all --output-def libwalreceiverprocdll.def walreceiverproc.o dllwrap -o walreceiverproc.dll --dllname walreceiverproc.dll --def libwalreceiverprocdll.def walreceiverproc.o -L../../../../src/backend -lpostgres -L../../../../src/interfaces/libpq -L../../../../src/port -lpq Info: resolving _pg_signal_mask by linking to __imp__pg_signal_mask (auto-import) Info: resolving _pg_signal_queue by linking to __imp__pg_signal_queue (auto-import) Info: resolving _InterruptPending by linking to __imp__InterruptPending (auto-import) Info: resolving _assert_enabled by linking to __imp__assert_enabled (auto-import) Info: resolving _WalRcv by linking to __imp__WalRcv (auto-import) Info: resolving _proc_exit_inprogress by linking to __imp__proc_exit_inprogress (auto-import) Info: resolving _BlockSig by linking to __imp__BlockSig (auto-import) Info: resolving _sync_method by linking to __imp__sync_method (auto-import) Info: resolving _MyProcPid by linking to 
__imp__MyProcPid (auto-import) Info: resolving _CurrentResourceOwner by linking to __imp__CurrentResourceOwner (auto-import) Info: resolving _TopMemoryContext by linking to __imp__TopMemoryContext (auto-import) Info: resolving _CurrentMemoryContext by linking to __imp__CurrentMemoryContext (auto-import) Info: resolving _PG_exception_stack by linking to __imp__PG_exception_stack (auto-import) Info: resolving _UnBlockSig by linking to __imp__UnBlockSig (auto-import) Info: resolving _ThisTimeLineID by linking to __imp__ThisTimeLineID (auto-import) Info: resolving _error_context_stack by linking to __imp__error_context_stack (auto-import) Info: resolving _InterruptHoldoffCount by linking to __imp__InterruptHoldoffCount (auto-import) c:\MinGW\bin\..\lib\gcc\mingw32\3.4.2\..\..\..\..\mingw32\bin\ld.exe: warning: auto-importing has been activated without --enable-auto-import specified on the command line. This should work unless it involves constant data structures referencing symbols from auto-imported DLLs. 
fu000001.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000003.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000005.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000006.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000008.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000009.o:(.idata$2+0xc): more undefined references to `libpostgres_a_iname' follow
nmth000000.o:(.idata$4+0x0): undefined reference to `_nm__pg_signal_mask'
nmth000002.o:(.idata$4+0x0): undefined reference to `_nm__pg_signal_queue'
nmth000004.o:(.idata$4+0x0): undefined reference to `_nm__InterruptPending'
nmth000007.o:(.idata$4+0x0): undefined reference to `_nm__assert_enabled'
nmth000012.o:(.idata$4+0x0): undefined reference to `_nm__WalRcv'
nmth000018.o:(.idata$4+0x0): undefined reference to `_nm__proc_exit_inprogress'
nmth000020.o:(.idata$4+0x0): undefined reference to `_nm__BlockSig'
nmth000023.o:(.idata$4+0x0): undefined reference to `_nm__sync_method'
nmth000026.o:(.idata$4+0x0): undefined reference to `_nm__MyProcPid'
nmth000028.o:(.idata$4+0x0): undefined reference to `_nm__CurrentResourceOwner'
nmth000030.o:(.idata$4+0x0): undefined reference to `_nm__TopMemoryContext'
nmth000032.o:(.idata$4+0x0): undefined reference to `_nm__CurrentMemoryContext'
nmth000035.o:(.idata$4+0x0): undefined reference to `_nm__PG_exception_stack'
nmth000037.o:(.idata$4+0x0): undefined reference to `_nm__UnBlockSig'
nmth000039.o:(.idata$4+0x0): undefined reference to `_nm__ThisTimeLineID'
nmth000041.o:(.idata$4+0x0): undefined reference to `_nm__error_context_stack'
nmth000043.o:(.idata$4+0x0): undefined reference to `_nm__InterruptHoldoffCount'
collect2: ld returned 1 exit status
c:\MinGW\bin\dllwrap.exe: c:\MinGW\bin\gcc exited with status 1
make[2]: *** [walreceiverproc.dll] Error 1
make[2]: Leaving directory `/c/postgres/mmm/src/backend/postmaster/walreceiverproc'
make[1]: *** [all] Error 2
make[1]: Leaving directory
`/c/postgres/mmm/src' make: *** [all] Error 2 --------------------------- Though I marked the variables shown in the above message as PGDLLIMPORT, the "make" still fails in the same way. I struggled with this issue for some time, but could not fix it yet :( Frankly I'm not familiar with that area. So it would be nice if someone could analyze this issue. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 12, 2010 at 17:58, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Dec 22, 2009 at 8:49 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >>> Umm.., I still cannot find the place where the walreceiver module is >>> loaded by using load_external_function() in your 'replication' branch. >>> Also the compilation of that branch fails. Is the 'pushed' branch the >>> latest? Sorry if I'm missing something. >> >> Ah, I see. The changes were not included in the merge commit after all, >> but I had simple forgot to "git add" them. Sorry about that, should be >> there now. > > This change which moves walreceiver process into a dynamically loaded > module caused the following compile error on my MinGW environment. That sounds strange - it should pick those up from the -lpostgres. Any chance you have an old postgres binary around from a non-syncrep build or something? > --------------------------- > > Though I marked the variables shown in the above message as PGDLLIMPORT, > the "make" still fails in the same way. I struggled with this issue > for some time, but > could not fix it yet :( > > Frankly I'm not familiar with that area. So it would be nice if > someone could analyze > this issue. Do you have an environment to try to build it under msvc? In my experience, that gives you easier-to-understand error messages in a lot of cases like this - it removes the mingw black magic. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Thanks for your advice! On Wed, Jan 13, 2010 at 3:37 AM, Magnus Hagander <magnus@hagander.net> wrote: >> This change which moves walreceiver process into a dynamically loaded >> module caused the following compile error on my MinGW environment. > > That sounds strange - it should pick those up from the -lpostgres. Any > chance you have an old postgres binary around from a non-syncrep build > or something? No, there is no old postgres binary. > Do you have an environment to try to build it under msvc? No, unfortunately. > in my > experience, that gives you easier-to-understand error messages in a > lot of cases like this - it removets the mingw black magic. OK. I'll try to build it under msvc. But since there seems to be a long way to go before doing that, I would appreciate if someone could give me some advice. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > Done. Currently there is no new libpq function for replication. The > walreceiver uses only existing functions like PQconnectdb, PQexec, > PQgetCopyData, etc. > > git://git.postgresql.org/git/users/fujii/postgres.git > branch: replication Thanks! I'm afraid we haven't quite nailed the select/poll issue yet. You copied pq_wait() from the libpq pqSocketCheck(), but there's one big difference between the backend and the frontend: the frontend always puts the connection to non-blocking mode, while the backend uses blocking mode. At least with SSL, I think it's possible for pq_wait() to return false positives, if the SSL layer decides to renegotiate the connection, causing data to flow in the other direction in the underlying TCP connection. A false positive would cause walsender to block indefinitely on the pq_getbyte() call. I don't even want to think about the changes required to put the backend socket to non-blocking mode; I don't know that code well enough. Maybe we could temporarily put it to non-blocking mode, read to see if there's any data available, and put it back to blocking mode. But even then I think we'd need to modify at least secure_read() to work correctly with SSL in non-blocking mode. Another idea is to use poll() to check for POLLHUP, on those platforms that have poll(). AFAICS there is no equivalent for that in select(), so for platforms that don't have poll() we would have to simply ignore the issue or write some other platform-specific work-around (Windows WSAEventSelect() seems to have a FD_CLOSE event for that). That would be quite a localized change. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Jan 13, 2010 at 7:27 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > the frontend always puts the > connection to non-blocking mode, while the backend uses blocking mode. Really? By default (i.e., without expressly setting it using PQsetnonblocking()), the connection is set to blocking mode even in the frontend. Am I missing something? > At least with SSL, I think it's possible for pq_wait() to return false > positives, if the SSL layer decides to renegotiate the connection > causing data to flow in the other direction in the underlying TCP > connection. A false positive would lead cause walsender to block > indefinitely on the pq_getbyte() call. Sorry. I could not understand that issue scenario. Could you explain it in more detail? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Wed, Jan 13, 2010 at 7:27 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> the frontend always puts the >> connection to non-blocking mode, while the backend uses blocking mode. > > Really? By default (i.e., without the expressly setting by using > PQsetnonblocking()), the connection is set to blocking mode even > in frontend. Am I missing something? That's right. The underlying socket is always put to non-blocking mode in libpq. PQsetnonblocking() only affects whether libpq commands wait and retry if the output buffer is full. >> At least with SSL, I think it's possible for pq_wait() to return false >> positives, if the SSL layer decides to renegotiate the connection >> causing data to flow in the other direction in the underlying TCP >> connection. A false positive would lead cause walsender to block >> indefinitely on the pq_getbyte() call. > > Sorry. I could not understand that issue scenario. Could you explain > it in more detail?

1. Walsender calls pq_wait() which calls select(), waiting for timeout, or data to become available for reading in the underlying socket.

2. Client issues an SSL renegotiation by sending a message to the server.

3. Server receives the message, and select() returns indicating that data has arrived.

4. Walsender calls HandleEndOfRep() which calls pq_getbyte(). pq_readbyte() calls SSL_read(), which receives the renegotiation message and handles it. No application data has arrived, however, so SSL_read() blocks waiting for some to arrive. It never does.

I don't understand enough of SSL to know if renegotiation can actually happen like that, but the man page of SSL_read() suggests so. But a similar thing can happen if an SSL record is broken into two TCP packets.
2010/1/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > Fujii Masao wrote: >> On Wed, Jan 13, 2010 at 7:27 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> the frontend always puts the >>> connection to non-blocking mode, while the backend uses blocking mode. >> >> Really? By default (i.e., without the expressly setting by using >> PQsetnonblocking()), the connection is set to blocking mode even >> in frontend. Am I missing something? > > That's right. The underlying socket is always put to non-blocking mode > in libpq. PQsetnonblocking() only affects whether libpq commands wait > and retry if the output buffer is full. > >>> At least with SSL, I think it's possible for pq_wait() to return false >>> positives, if the SSL layer decides to renegotiate the connection >>> causing data to flow in the other direction in the underlying TCP >>> connection. A false positive would lead cause walsender to block >>> indefinitely on the pq_getbyte() call. >> >> Sorry. I could not understand that issue scenario. Could you explain >> it in more detail? > > 1. Walsender calls pq_wait() which calls select(), waiting for timeout, > or data to become available for reading in the underlying socket. > > 2. Client issues an SSL renegotiation by sending a message to the server > > 3. Server receives the message, and select() returns indicating that > data has arrived > > 4. Walsender calls HandleEndOfRep() which calls pq_getbyte(). > pq_readbyte() calls SSL_read(), which receives the renegotiation message > and handles it. No application data has arrived, however, so SSL_read() > blocks for some to arrive. It never does. > > I don't understand enough of SSL to know if renegotiation can actually > happen like that, but the man page of SSL_read() suggests so. But a > similar thing can happen if an SSL record is broken into two TCP > packets. 
> select() returns immediately as the first packet arrives, but > SSL_read() will block until the 2nd packet arrives. I *think* renegotiation happens based on amount of content, not amount of time. But it could still happen in corner cases I think. If the renegotiation happens right after a complete packet has been sent (which would be the logical place), but not fast enough that the SSL library gets it in one read() from the socket, you could end up in that situation. (if the SSL library gets the renegotiation request as part of the first read(), it would probably do the renegotiation before returning from that call to SSL_read(), in which case the socket would be in the correct state before you call select) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
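The failure mode described here does not even need SSL to reproduce in miniature: select() reports the socket readable as soon as the first byte of a multi-byte "record" arrives, even though a blocking read of the whole record would hang. A self-contained demonstration over a Unix socketpair (the function name is made up for the demo):

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Deliver only the first byte of a two-byte "record", then check what
 * select() and a non-blocking peek say about the receiving socket.
 * select() reports the socket readable, but only 1 of the 2 record
 * bytes can actually be read -- so a loop that blocks until the full
 * record arrives would hang, just like SSL_read() on an SSL record
 * split across TCP packets.  Returns the number of bytes available,
 * or -1 on failure.
 */
int
partial_record_demo(void)
{
    int sv[2];
    char buf[2];
    fd_set rfds;
    struct timeval tv = {0, 0};
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[0], "X", 1) != 1)      /* half of a 2-byte record */
        return -1;

    FD_ZERO(&rfds);
    FD_SET(sv[1], &rfds);
    if (select(sv[1] + 1, &rfds, NULL, NULL, &tv) != 1)
        return -1;                      /* "readable", says select() */

    /* Peek without blocking: only one of the two bytes is there. */
    n = recv(sv[1], buf, sizeof(buf), MSG_DONTWAIT | MSG_PEEK);
    close(sv[0]);
    close(sv[1]);
    return (int) n;
}
```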
After reading up on SSL_read() and SSL_pending(), it seems that there is unfortunately no reliable way of checking if there is incoming data that can be read using SSL_read() without blocking, short of putting the socket to non-blocking mode. It also seems that we can't rely on poll() returning POLLHUP if the remote end has disconnected; it's not doing that at least on my laptop. So, the only solution I can see is to put the socket to non-blocking mode. But to keep the change localized, let's switch to non-blocking mode only temporarily, just when polling to see if there's data to read (or EOF), and switch back immediately afterwards. I've added a pq_getbyte_if_available() function to pqcomm.c to do that. The API to the upper levels is quite nice: the function returns a byte if one is available without blocking. Only minimal changes are required elsewhere. See that in my git repository. Attached is a new version of the whole streaming replication patch, for the benefit of archives and git non-users. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Jan 14, 2010 at 9:14 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > After reading up on SSL_read() and SSL_pending(), it seems that there is > unfortunately no reliable way of checking if there is incoming data that > can be read using SSL_read() without blocking, short of putting the > socket to non-blocking mode. It also seems that we can't rely on poll() > returning POLLHUP if the remote end has disconnected; it's not doing > that at least on my laptop. > > So, the only solution I can see is to put the socket to non-blocking > mode. But to keep the change localized, let's switch to non-blocking > mode only temporarily, just when polling to see if there's data to read > (or EOF), and switch back immediately afterwards. Agreed. Though I also read some pages referring to that issue, I was not able to find any better action other than the temporary switch of the blocking mode. > I've added a pq_getbyte_if_available() function to pqcomm.c to do that. > The API to the upper levels is quite nice, the function returns a byte > if one is available without blocking. Only minimal changes are required > elsewhere. Great! Thanks a lot! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Wed, Jan 13, 2010 at 3:37 AM, Magnus Hagander <magnus@hagander.net> wrote: >>> This change which moves walreceiver process into a dynamically loaded >>> module caused the following compile error on my MinGW environment. >> That sounds strange - it should pick those up from the -lpostgres. Any >> chance you have an old postgres binary around from a non-syncrep build >> or something? > > No, there is no old postgres binary. > >> Do you have an environment to try to build it under msvc? > > No, unfortunately. > >> in my >> experience, that gives you easier-to-understand error messages in a >> lot of cases like this - it removes the mingw black magic. > > OK. I'll try to build it under msvc. > > But since there seems to be a long way to go before doing that, > I would appreciate it if someone could give me some advice. It looks like dawn_bat is experiencing the same problem. I don't think we want to sprinkle all those variables with PGDLLIMPORT, and it didn't fix the problem for you earlier anyway. Is there some other way to fix this? Do people still use MinGW for any real work? Could we just drop walreceiver support from MinGW builds? Or maybe we should consider splitting walreceiver into two parts after all. Only the bare minimum that needs to access libpq would go into the shared object, and the rest would be linked with the backend as usual. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
2010/1/15 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > Fujii Masao wrote: >> On Wed, Jan 13, 2010 at 3:37 AM, Magnus Hagander <magnus@hagander.net> wrote: >>>> This change which moves walreceiver process into a dynamically loaded >>>> module caused the following compile error on my MinGW environment. >>> That sounds strange - it should pick those up from the -lpostgres. Any >>> chance you have an old postgres binary around from a non-syncrep build >>> or something? >> >> No, there is no old postgres binary. >> >>> Do you have an environment to try to build it under msvc? >> >> No, unfortunately. >> >>> in my >>> experience, that gives you easier-to-understand error messages in a >>> lot of cases like this - it removes the mingw black magic. >> >> OK. I'll try to build it under msvc. >> >> But since there seems to be a long way to go before doing that, >> I would appreciate it if someone could give me some advice. > > It looks like dawn_bat is experiencing the same problem. I don't think > we want to sprinkle all those variables with PGDLLIMPORT, and it didn't > fix the problem for you earlier anyway. Is there some other way to fix this? > > Do people still use MinGW for any real work? Could we just drop > walreceiver support from MinGW builds? We don't know if this works on MSVC, because MSVC doesn't actually try to build the walreceiver. I'm going to look at that tomorrow. If we get the same issues there, we have a problem in our code. If not, we need to figure out what's up with mingw. > Or maybe we should consider splitting walreceiver into two parts after > all. Only the bare minimum that needs to access libpq would go into the > shared object, and the rest would be linked with the backend as usual. That would certainly be one option. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Heikki Linnakangas wrote: > Do people still use MinGW for any real work? Could we just drop > walreceiver support from MinGW builds? > > Or maybe we should consider splitting walreceiver into two parts after > all. Only the bare minimum that needs to access libpq would go into the > shared object, and the rest would be linked with the backend as usual. > > I use MinGW when doing Windows work (e.g. the threading piece in parallel pg_restore). And I think it is generally desirable to be able to build on Windows using an open source tool chain. I'd want a damn good reason to abandon its use. And I don't like the idea of not supporting walreceiver on it either. Please find another solution if possible. cheers andrew
2010/1/15 Andrew Dunstan <andrew@dunslane.net>: > > > Heikki Linnakangas wrote: >> >> Do people still use MinGW for any real work? Could we just drop >> walreceiver support from MinGW builds? >> >> Or maybe we should consider splitting walreceiver into two parts after >> all. Only the bare minimum that needs to access libpq would go into the >> shared object, and the rest would be linked with the backend as usual. >> >> > > I use MinGW when doing Windows work (e.g. the threading piece in parallel pg_restore). And I think it is generally desirable to be able to build on Windows using an open source tool chain. I'd want a damn good reason to abandon its use. And I don't like the idea of not supporting walreceiver on it either. Please find another solution if possible. > Yeah. FWIW, I don't use mingw to do any windows development, but definitely +1 on working hard to keep support for it if at all possible. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander wrote: > 2010/1/15 Andrew Dunstan <andrew@dunslane.net>: >> >> Heikki Linnakangas wrote: >>> Do people still use MinGW for any real work? Could we just drop >>> walreceiver support from MinGW builds? >>> >>> Or maybe we should consider splitting walreceiver into two parts after >>> all. Only the bare minimum that needs to access libpq would go into the >>> shared object, and the rest would be linked with the backend as usual. >>> >> I use MinGW when doing Windows work (e.g. the threading piece in parallel pg_restore). And I think it is generally desirable to be able to build on Windows using an open source tool chain. I'd want a damn good reason to abandon its use. And I don't like the idea of not supporting walreceiver on it either. Please find another solution if possible. > > Yeah. FWIW, I don't use mingw to do any windows development, but > definitely +1 on working hard to keep support for it if at all > possible. Ok. I'll look at splitting walreceiver code between the shared module and backend binary slightly differently. At first glance, it doesn't seem that hard after all, and will make the code more modular anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Magnus Hagander wrote: >> Yeah. FWIW, I don't use mingw do do any windows development, but >> definitely +1 on working hard to keep support for it if at all >> possible. > Ok. I'll look at splitting walreceiver code between the shared module > and backend binary slightly differently. At first glance, it doesn't > seem that hard after all, and will make the code more modular anyway. This is probably going in the wrong direction. There is no good reason why that module should be failing to link, and I don't think it's going to be "more modular" if you're forced to avoid any global variable references at all in some arbitrary portion of the code. I think it's a tools/build process problem and should be attacked that way. regards, tom lane
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100115 15:20]: > Ok. I'll look at splitting walreceiver code between the shared module > and backend binary slightly differently. At first glance, it doesn't > seem that hard after all, and will make the code more modular anyway. Maybe an insane question, but why can't postmaster just "exec" walreceiver? I mean, because of windows, we already have that code around, and then walreceiver could link directly to libpq and not have to worry at all about linking all of the postmaster backends to libpq... But I do understand that's a radical change... a. -- Aidan Van Dyk <aidan@highrise.ca> http://www.highrise.ca/ Create like a god, command like a king, work like a slave.
I wrote: > I think it's a tools/build process problem and should be attacked that > way. Specifically, I think you missed out $(BE_DLLLIBS) in SHLIB_LINK. We'll find out at the next mingw build... regards, tom lane
Aidan Van Dyk <aidan@highrise.ca> writes: > Maybe an insane question, but why can postmaster just not "exec" > walreceiver? It'd greatly complicate access to shared memory. regards, tom lane
Tom Lane wrote: > I wrote: >> I think it's a tools/build process problem and should be attacked that >> way. > > Specifically, I think you missed out $(BE_DLLLIBS) in SHLIB_LINK. > We'll find out at the next mingw build... Thanks. But what is BE_DLLLIBS? I can't find any description of it. I suspect the MinGW build will fail because of the missing PGDLLIMPORTs. Before we sprinkle all the global variables it touches with that, let me explain what I meant by dividing walreceiver code differently between dynamically loaded module and backend code. Right now I have to go to sleep, though, but I'll try to get back to it during the weekend. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> Specifically, I think you missed out $(BE_DLLLIBS) in SHLIB_LINK. >> We'll find out at the next mingw build... > Thanks. But what is BE_DLLLIBS? I can't find any description of it. It was the wrong theory anyway --- it already is included (in Makefile.shlib). But what it does is provide -lpostgres on platforms where that is needed, such as mingw. > I suspect the MinGW build will fail because of the missing PGDLLIMPORTs. Yeah. On closer investigation the problem seems to be -DBUILDING_DLL, which flips the meaning of PGDLLIMPORT. contrib/dblink, which surely works and has the same linkage requirements as walreceiver, does *not* use that. I've committed a patch to change that, we'll soon see if it works... > Before we sprinkle all the global variables it touches with that, let me > explain what I meant by dividing walreceiver code differently between > dynamically loaded module and backend code. Right now I have to go to > sleep, though, but I'll try to get back to during the weekend. Yeah, nothing to be done till we get another buildfarm cycle anyway. regards, tom lane
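For readers unfamiliar with the Windows linkage issue Tom is describing: PGDLLIMPORT is roughly the following macro (a simplified sketch, not the verbatim src/include/port/win32.h), which is why compiling a loadable module with -DBUILDING_DLL flips the meaning and makes the module try to export the backend's globals instead of importing them:

```c
/*
 * Simplified sketch of the PGDLLIMPORT mechanism.  On Windows, a global
 * variable in the backend executable must be explicitly exported there
 * and imported in loadable modules; BUILDING_DLL selects which side of
 * the fence the current compilation is on.  Elsewhere the macro is a
 * no-op, which is why the problem only shows up on MinGW/MSVC buildfarm
 * members.
 */
#if defined(_WIN32) || defined(__CYGWIN__)
#ifdef BUILDING_DLL
#define PGDLLIMPORT __declspec(dllexport)   /* building the backend itself */
#else
#define PGDLLIMPORT __declspec(dllimport)   /* building a loadable module */
#endif
#else
#define PGDLLIMPORT                         /* expands to nothing elsewhere */
#endif

/* In a shared header: a backend global that a module may reference. */
extern PGDLLIMPORT int example_backend_global;

/* The defining translation unit (in the backend) provides the storage. */
int example_backend_global = 42;
```

The variable name here is purely illustrative; the real fix under discussion is about which existing backend globals carry the marker.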
Tom Lane wrote: >> Before we sprinkle all the global variables it touches with that, let me >> explain what I meant by dividing walreceiver code differently between >> dynamically loaded module and backend code. Right now I have to go to >> sleep, though, but I'll try to get back to during the weekend. >> > > Yeah, nothing to be done till we get another buildfarm cycle anyway. > > > I ran an extra cycle. Still a bit of work to do: <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=dawn_bat&dt=2010-01-15%2023:04:54> cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I ran an extra cycle. Still a bit of work to do: > <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=dawn_bat&dt=2010-01-15%2023:04:54> Well, at least now we're down to the variables that haven't got PGDLLIMPORT, rather than wondering what's wrong with the build ... regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Before we sprinkle all the global variables it touches with that, let me
>> explain what I meant by dividing walreceiver code differently between
>> dynamically loaded module and backend code. Right now I have to go to
>> sleep, though, but I'll try to get back to it during the weekend.
>
> Yeah, nothing to be done till we get another buildfarm cycle anyway.

Ok, looks like you did that anyway, let's see if it fixed it. Thanks.

So what I'm playing with is to pull walreceiver back into the backend executable. To avoid the link dependency, walreceiver doesn't access libpq directly, but loads a module dynamically which implements this interface:

bool walrcv_connect(char *conninfo, XLogRecPtr startpoint)
    Establish connection to the primary, and start streaming from 'startpoint'. Returns true on success.

bool walrcv_receive(int timeout, XLogRecPtr *recptr, char **buffer, int *len)
    Retrieve any WAL record available through the connection, blocking for a maximum of 'timeout' ms.

void walrcv_disconnect(void)
    Disconnect.

This is the kind of API Greg Stark requested earlier (http://archives.postgresql.org/message-id/407d949e0912220336u595a05e0x20bd91b9fbc08d4d@mail.gmail.com), though I'm not planning to make it pluggable for 3rd party implementations yet.

The module doesn't need to touch backend internals much at all, no tinkering with shared memory for example, so I would feel much better about moving that out of src/backend. Not sure where, though; it's not an executable, so src/bin is hardly the right place, but I wouldn't want to put it in contrib either, because it should still be built and installed by default. So I'm inclined to still leave it in src/backend/replication/.

I've pushed that 'replication-dynmodule' branch in my git repo. The diff is hard to read, because it mostly just moves code around, but I've attached libpqwalreceiver.c here, which is the dynamic module part. 
You can also browse the tree via the web interface (http://git.postgresql.org/gitweb?p=users/heikki/postgres.git;a=tree;h=refs/heads/replication-dynmodule;hb=replication-dynmodule)

I like this division of labor much more than making the whole walreceiver process a dynamically loaded module, so barring objections I will review and test this more, and commit next week.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

/*-------------------------------------------------------------------------
 *
 * libpqwalreceiver.c
 *
 * The WAL receiver process (walreceiver) is new as of Postgres 8.5. It
 * is the process in the standby server that takes charge of receiving
 * XLOG records from a primary server during streaming replication.
 *
 * When the startup process determines that it's time to start streaming,
 * it instructs postmaster to start walreceiver. Walreceiver first
 * connects to the primary server (it will be served by a walsender process
 * in the primary server), and then keeps receiving XLOG records and
 * writing them to the disk as long as the connection is alive. As XLOG
 * records are received and flushed to disk, it updates the
 * WalRcv->receivedUpTo variable in shared memory, to inform the startup
 * process of how far it can proceed with XLOG replay.
 *
 * Normal termination is by SIGTERM, which instructs the walreceiver to
 * exit(0). Emergency termination is by SIGQUIT; like any postmaster child
 * process, the walreceiver will simply abort and exit on SIGQUIT. A close
 * of the connection and a FATAL error are treated not as a crash but as
 * normal operation.
 *
 * Walreceiver is a postmaster child process like others, but it's compiled
 * as a dynamic module to avoid linking libpq with the main server binary.
 *
 * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
 *
 *
 * IDENTIFICATION
 *    $PostgreSQL$
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include <unistd.h>

#include "libpq-fe.h"
#include "access/xlog.h"
#include "miscadmin.h"
#include "replication/walreceiver.h"
#include "utils/builtins.h"

#ifdef HAVE_POLL_H
#include <poll.h>
#endif
#ifdef HAVE_SYS_POLL_H
#include <sys/poll.h>
#endif
#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif

PG_MODULE_MAGIC;

void        _PG_init(void);

/* streamConn is a PGconn object of a connection to walsender from walreceiver */
static PGconn *streamConn = NULL;

static bool justconnected = false;

/* Buffer for currently read records */
static char *recvBuf = NULL;

/* Prototypes for interface functions */
static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
static bool libpqrcv_receive(int timeout, XLogRecPtr *recptr, char **buffer,
                             int *len);
static void libpqrcv_disconnect(void);

/* Prototypes for private functions */
static bool libpq_select(int timeout_ms);

/*
 * Module load callback
 */
void
_PG_init(void)
{
    walrcv_connect = libpqrcv_connect;
    walrcv_receive = libpqrcv_receive;
    walrcv_disconnect = libpqrcv_disconnect;
}

/*
 * Establish the connection to the primary server for XLOG streaming
 */
static bool
libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
{
    char        conninfo_repl[MAXCONNINFO + 14];
    char       *primary_sysid;
    char        standby_sysid[32];
    TimeLineID  primary_tli;
    TimeLineID  standby_tli;
    PGresult   *res;
    char        cmd[64];

    Assert(startpoint.xlogid != 0 || startpoint.xrecoff != 0);

    /*
     * Set up a connection for XLOG streaming
     */
    snprintf(conninfo_repl, sizeof(conninfo_repl), "%s replication=true",
             conninfo);

    streamConn = PQconnectdb(conninfo_repl);
    if (PQstatus(streamConn) != CONNECTION_OK)
        ereport(ERROR,
                (errmsg("could not connect to the primary server: %s",
                        PQerrorMessage(streamConn))));

    /*
     * Get the system identifier and timeline ID as a DataRow message
     * from the primary server.
     */
    res = PQexec(streamConn, "IDENTIFY_SYSTEM");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        PQclear(res);
        ereport(ERROR,
                (errmsg("could not receive the SYSID and timeline ID from "
                        "the primary server: %s",
                        PQerrorMessage(streamConn))));
    }
    if (PQnfields(res) != 2 || PQntuples(res) != 1)
    {
        int         ntuples = PQntuples(res);
        int         nfields = PQnfields(res);

        PQclear(res);
        ereport(ERROR,
                (errmsg("invalid response from primary server"),
                 errdetail("expected 1 tuple with 2 fields, got %d tuples with %d fields",
                           ntuples, nfields)));
    }
    primary_sysid = PQgetvalue(res, 0, 0);
    primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);

    /*
     * Confirm that the system identifier of the primary is the same as ours.
     */
    snprintf(standby_sysid, sizeof(standby_sysid), UINT64_FORMAT,
             GetSystemIdentifier());
    if (strcmp(primary_sysid, standby_sysid) != 0)
    {
        PQclear(res);
        ereport(ERROR,
                (errmsg("system differs between the primary and standby"),
                 errdetail("the primary SYSID is %s, standby SYSID is %s",
                           primary_sysid, standby_sysid)));
    }

    /*
     * Confirm that the current timeline of the primary is the same as the
     * recovery target timeline.
     */
    standby_tli = GetRecoveryTargetTLI();
    PQclear(res);
    if (primary_tli != standby_tli)
        ereport(ERROR,
                (errmsg("timeline %u of the primary does not match recovery target timeline %u",
                        primary_tli, standby_tli)));
    ThisTimeLineID = primary_tli;

    /* Start streaming from the point requested by startup process */
    snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
             startpoint.xlogid, startpoint.xrecoff);
    res = PQexec(streamConn, cmd);
    if (PQresultStatus(res) != PGRES_COPY_OUT)
        ereport(ERROR,
                (errmsg("could not start XLOG streaming: %s",
                        PQerrorMessage(streamConn))));
    PQclear(res);

    justconnected = true;

    return true;
}

/*
 * Wait until we can read WAL stream, or timeout.
 *
 * Returns true if data has become available for reading, false if timed out
 * or interrupted by signal.
 *
 * This is based on pqSocketCheck.
 */
static bool
libpq_select(int timeout_ms)
{
    int         ret;

    Assert(streamConn != NULL);
    if (PQsocket(streamConn) < 0)
        ereport(ERROR,
                (errcode_for_socket_access(),
                 errmsg("socket not open")));

    /* We use poll(2) if available, otherwise select(2) */
    {
#ifdef HAVE_POLL
        struct pollfd input_fd;

        input_fd.fd = PQsocket(streamConn);
        input_fd.events = POLLIN | POLLERR;
        input_fd.revents = 0;

        ret = poll(&input_fd, 1, timeout_ms);
#else                           /* !HAVE_POLL */
        fd_set      input_mask;
        struct timeval timeout;
        struct timeval *ptr_timeout;

        FD_ZERO(&input_mask);
        FD_SET(PQsocket(streamConn), &input_mask);

        if (timeout_ms < 0)
            ptr_timeout = NULL;
        else
        {
            timeout.tv_sec = timeout_ms / 1000;
            timeout.tv_usec = (timeout_ms % 1000) * 1000;
            ptr_timeout = &timeout;
        }

        ret = select(PQsocket(streamConn) + 1, &input_mask,
                     NULL, NULL, ptr_timeout);
#endif                          /* HAVE_POLL */
    }

    if (ret == 0 || (ret < 0 && errno == EINTR))
        return false;
    if (ret < 0)
        ereport(ERROR,
                (errcode_for_socket_access(),
                 errmsg("select() failed: %m")));
    return true;
}

/*
 * Disconnect from the primary server.
 */
static void
libpqrcv_disconnect(void)
{
    PQfinish(streamConn);
    justconnected = false;
}

/*
 * Receive any WAL records available from XLOG stream, blocking for
 * maximum of 'timeout' ms.
 *
 * Returns:
 *
 * True if data was received. *recptr, *buffer and *len are set to
 * the WAL location of the received data, buffer holding it, and length,
 * respectively.
 *
 * False if no data was available within timeout, or wait was interrupted
 * by signal.
 *
 * The buffer returned is only valid until the next call of this function or
 * libpq_connect/disconnect.
 *
 * ereports on error.
 */
static bool
libpqrcv_receive(int timeout, XLogRecPtr *recptr, char **buffer, int *len)
{
    int         rawlen;

    if (recvBuf != NULL)
        PQfreemem(recvBuf);
    recvBuf = NULL;

    /*
     * If the caller requested to block, wait for data to arrive. But if
     * this is the first call after connecting, don't wait, because
     * there might already be some data in libpq buffer that we haven't
     * returned to caller.
     */
    if (timeout > 0 && !justconnected)
    {
        if (!libpq_select(timeout))
            return false;

        if (PQconsumeInput(streamConn) == 0)
            ereport(ERROR,
                    (errmsg("could not read xlog records: %s",
                            PQerrorMessage(streamConn))));
    }
    justconnected = false;

    /* Receive CopyData message */
    rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
    if (rawlen == 0)            /* no records available yet, then return */
        return false;
    if (rawlen == -1)           /* end-of-streaming or error */
    {
        PGresult   *res;

        res = PQgetResult(streamConn);
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
        {
            PQclear(res);
            ereport(ERROR,
                    (errmsg("replication terminated by primary server")));
        }
        PQclear(res);
        ereport(ERROR,
                (errmsg("could not read xlog records: %s",
                        PQerrorMessage(streamConn))));
    }
    if (rawlen < -1)
        ereport(ERROR,
                (errmsg("could not read xlog records: %s",
                        PQerrorMessage(streamConn))));

    if (rawlen < sizeof(XLogRecPtr))
        ereport(ERROR,
                (errmsg("invalid WAL message received from primary")));

    /* Return received WAL records to caller */
    *recptr = *((XLogRecPtr *) recvBuf);
    *buffer = recvBuf + sizeof(XLogRecPtr);
    *len = rawlen - sizeof(XLogRecPtr);

    return true;
}
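The backend side of this arrangement reduces to calling through hook pointers that the module's _PG_init() fills in at load time. A toy, self-contained sketch of that indirection (hypothetical simplified signatures; the real backend would load the module with the dynamic loader rather than calling module_init() directly):

```c
#include <stddef.h>

/* Hook pointers the backend exposes; normally filled in by the loaded module */
typedef int (*walrcv_connect_fn) (const char *conninfo);
typedef void (*walrcv_disconnect_fn) (void);

static walrcv_connect_fn walrcv_connect = NULL;
static walrcv_disconnect_fn walrcv_disconnect = NULL;

/* --- what would live in the dynamically loaded module --- */
static int
module_connect(const char *conninfo)
{
    (void) conninfo;            /* a real implementation would dial out */
    return 1;
}

static void
module_disconnect(void)
{
}

/* The module's _PG_init() equivalent: register the implementations */
static void
module_init(void)
{
    walrcv_connect = module_connect;
    walrcv_disconnect = module_disconnect;
}

/* --- backend side: load the module, then call only through the hooks --- */
static int
start_streaming(const char *conninfo)
{
    module_init();              /* in reality: dynamically load the module */
    if (walrcv_connect == NULL || walrcv_disconnect == NULL)
        return 0;               /* module failed to initialize the hooks */
    return walrcv_connect(conninfo);
}
```

The appeal of this split is that the backend binary never references a libpq symbol; everything goes through the function pointers, so only the small module needs to link against libpq.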
Heikki Linnakangas wrote: > I've pushed that 'replication-dynmodule' branch in my git repo. The diff > is hard to read, because it mostly just moves code around, but I've > attached libpqwalreceiver.c here, which is the dynamic module part. You > can also browse the tree via the web interface > (http://git.postgresql.org/gitweb?p=users/heikki/postgres.git;a=tree;h=refs/heads/replication-dynmodule;hb=replication-dynmodule) I just noticed that the comment at the top of libpqwalreceiver.c is a leftover, not much relevant to the contents of the file anymore; all the signal handling and interaction with the startup process is in src/backend/replication/walreceiver.c now. That obviously needs to be fixed before committing. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > The module doesn't need to touch backend internals much at all, no > tinkering with shared memory for example, so I would feel much better > about moving that out of src/backend. Not sure where, though; it's not > an executable, so src/bin is hardly the right place, but I wouldn't want > to put it in contrib either, because it should still be built and > installed by default. So I'm inclined to still leave it in > src/backend/replication/ It should be possible to be in contrib and installed by default, even with the current tool set, by tweaking initdb to install the contrib into template1. But that would be a packaging / dependency issue I guess then. Of course the extension system would ideally "create extension foo;" for all foo in contrib at initdb time, then a user would have to "install extension foo;" and be done with it. Regards, -- dim
Dimitri Fontaine wrote: > It should be possible to be in contrib and installed by default, even > And it could be uninstalled too. Let's not do it for core functionalities. -- Euler Taveira de Oliveira http://www.timbira.com/