Thread: Streaming replication and non-blocking I/O
I find the backend libpq changes related to non-blocking I/O quite complex. Can we find a simpler solution?

The problem we're trying to solve is that while the walsender backend sends a lot of WAL records to the client, the client can send a lot of messages to the backend. If the volume of messages from client to server exceeds both the input buffer in the server and the output buffer in the client, the client will block until the server has read some data. But if the client is blocked, it will not process incoming data from the server, and eventually the server will block too. And we have a deadlock. This: http://florin.bjdean.id.au/docs/omnimark/omni55/docs/html/concept/717.htm is a pretty good description of the problem.

The first question is: do we really need to be prepared for that? The XLogRecPtr acknowledgment messages the client sends are very small, and if the client is mindful about not sending them too often - perhaps max 1 ack per received XLOG message - the receive buffer in the backend should never fill up in practice.

If that's deemed not good enough, we could modify just internal_flush() so that it uses secure_poll to wait for the possibility to either read or write, instead of blocking on write alone. Whenever there's incoming data, read it into PqRecvBuffer for later processing, which keeps the OS input buffer from filling up. If PqRecvBuffer fills up, it can be extended, or we can start dropping old XLogRecPtr messages from it.

In any case, we'll need something like pq_wait to check if a message can be read without blocking, but that's just a small additional function, as opposed to a whole new API for assembling and sending messages without blocking.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Tue, Dec 8, 2009 at 11:23 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> The first question is: do we really need to be prepared for that? The
> XLogRecPtr acknowledgment messages the client sends are very small, and
> if the client is mindful about not sending them too often - perhaps max
> 1 ack per received XLOG message - the receive buffer in the backend
> should never fill up in practice.

It's OK to drop that feature.

> If that's deemed not good enough, we could modify just internal_flush()
> so that it uses secure_poll to wait for the possibility to either read
> or write, instead of blocking for just write. Whenever there's incoming
> data, read them into PqRecvBuffer for later processing, which keeps the
> OS input buffer from filling up. If PqRecvBuffer fills up, it can be
> extended, or we can start dropping old XLogRecPtr messages from it.

Extending PqRecvBuffer seems better, because XLogRecPtr messages come in several types (i.e., we cannot just drop an old message without parsing all the messages in the buffer).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Tue, Dec 8, 2009 at 11:23 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> If that's deemed not good enough, we could modify just internal_flush()
>> so that it uses secure_poll to wait for the possibility to either read
>> or write, instead of blocking for just write. Whenever there's incoming
>> data, read them into PqRecvBuffer for later processing, which keeps the
>> OS input buffer from filling up. If PqRecvBuffer fills up, it can be
>> extended, or we can start dropping old XLogRecPtr messages from it.
>
> Extending PqRecvBuffer seems better, because XLogRecPtr messages
> come in several types (i.e., we cannot just drop an old message without
> parsing all the messages in the buffer).

True. Another idea I had was to introduce a callback that backend libpq can call when the buffer fills. Walsender would set the callback to ProcessStreamMsgs().

But if everyone is happy with just relying on the OS buffer to not fill up, let's just drop it.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Wed, Dec 9, 2009 at 3:58 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> True. Another idea I had was to introduce a callback that backend libpq
> can call when the buffer fills. Walsender would set the callback to
> ProcessStreamMsgs().
>
> But if everyone is happy with just relying on the OS buffer to not fill
> up, let's just drop it.

The OS buffer is expected to be able to store a large number of XLogRecPtr messages, because each message is small. So it's also OK to just drop it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Wed, Dec 9, 2009 at 3:58 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> But if everyone is happy with just relying on the OS buffer to not fill
>> up, let's just drop it.

> The OS buffer is expected to be able to store a large number of
> XLogRecPtr messages, because each message is small. So it's also OK
> to just drop it.

It certainly seems to be something we could improve later, when and if evidence emerges that it's a real-world problem. For now, simple is beautiful.

			regards, tom lane
On Thu, Dec 10, 2009 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The OS buffer is expected to be able to store a large number of
>> XLogRecPtr messages, because each message is small. So it's also OK
>> to just drop it.
>
> It certainly seems to be something we could improve later, when and
> if evidence emerges that it's a real-world problem. For now,
> simple is beautiful.

I just dropped the backend libpq changes related to non-blocking I/O.

git://git.postgresql.org/git/users/fujii/postgres.git
branch: replication

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Thu, Dec 10, 2009 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The OS buffer is expected to be able to store a large number of
>>> XLogRecPtr messages, because each message is small. So it's also OK
>>> to just drop it.
>> It certainly seems to be something we could improve later, when and
>> if evidence emerges that it's a real-world problem. For now,
>> simple is beautiful.
>
> I just dropped the backend libpq changes related to non-blocking I/O.
>
> git://git.postgresql.org/git/users/fujii/postgres.git
> branch: replication

Thanks, much simpler now.

Changing the finish_time argument of pqWaitTimed() into timeout_ms changes the behavior of the connect_timeout option to PQconnectdb. It should wait for at most connect_timeout seconds in total, but now it waits for connect_timeout seconds at each step in the connection process: opening a socket, authenticating, etc.

Could we change the API of PQgetXLogData to be more like PQgetCopyData? I'm thinking of removing the timeout argument, and instead looping with select/poll and PQconsumeInput in the caller. That probably means introducing a new state analogous to PGASYNC_COPY_IN. I haven't thought this fully through yet, but it seems like it would be good to have a consistent API.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Changing the finish_time argument of pqWaitTimed() into timeout_ms changes
> the behavior of the connect_timeout option to PQconnectdb. It should wait
> for at most connect_timeout seconds in total, but now it waits for
> connect_timeout seconds at each step in the connection process: opening
> a socket, authenticating, etc.

Refresh my memory as to why this patch is touching any of that code at all?

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Changing the finish_time argument of pqWaitTimed() into timeout_ms changes
>> the behavior of the connect_timeout option to PQconnectdb. It should wait
>> for at most connect_timeout seconds in total, but now it waits for
>> connect_timeout seconds at each step in the connection process: opening
>> a socket, authenticating, etc.
>
> Refresh my memory as to why this patch is touching any of that code at
> all?

Walreceiver wants to wait for data to arrive from the master, or for a signal. PQgetXLogData(), which is the libpq function to read a piece of WAL, takes a timeout argument to support that. Walreceiver calls PQgetXLogData() in an endless loop, checking for a received SIGHUP or death of the postmaster at every iteration.

In synchronous replication mode, I presume it's also going to listen for a signal from the startup process, so that it can send an acknowledgment to the master as soon as a COMMIT record has been replayed that a backend on the master is waiting for.

To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed to take a timeout instead of a finish_time argument. That was a mistake, because it breaks PQconnectdb, and as I said, I don't think PQgetXLogData() should have a timeout argument to begin with. Instead, it should have a boolean 'async' argument to return immediately if there's no data, and the walreceiver main loop should call poll()/select() to wait. I.e. just like PQgetCopyData() works.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Sun, Dec 13, 2009 at 5:42 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Walreceiver wants to wait for data to arrive from the master, or for a
> signal. PQgetXLogData(), which is the libpq function to read a piece of
> WAL, takes a timeout argument to support that. Walreceiver calls
> PQgetXLogData() in an endless loop, checking for a received SIGHUP or
> death of the postmaster at every iteration.
>
> In synchronous replication mode, I presume it's also going to listen
> for a signal from the startup process, so that it can send an
> acknowledgment to the master as soon as a COMMIT record has been
> replayed that a backend on the master is waiting for.

Right.

> To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed
> to take a timeout instead of a finish_time argument. That was a mistake,
> because it breaks PQconnectdb, and as I said, I don't think
> PQgetXLogData() should have a timeout argument to begin with. Instead,
> it should have a boolean 'async' argument to return immediately if
> there's no data, and the walreceiver main loop should call poll()/select()
> to wait. I.e. just like PQgetCopyData() works.

Seems good. I'll revise the code.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Sun, Dec 13, 2009 at 5:42 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed
>> to take a timeout instead of a finish_time argument. That was a mistake,
>> because it breaks PQconnectdb, and as I said, I don't think
>> PQgetXLogData() should have a timeout argument to begin with. Instead,
>> it should have a boolean 'async' argument to return immediately if
>> there's no data, and the walreceiver main loop should call poll()/select()
>> to wait. I.e. just like PQgetCopyData() works.

> Seems good. I'll revise the code.

Do we need a new "PQgetXLogData" function at all? Seems like you could shove the data through the COPY protocol and not have to touch libpq at all, rather than duplicating a nontrivial amount of code there.

			regards, tom lane
On Mon, Dec 14, 2009 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do we need a new "PQgetXLogData" function at all? Seems like you could
> shove the data through the COPY protocol and not have to touch libpq
> at all, rather than duplicating a nontrivial amount of code there.

Yeah, I also think that all the data (the WAL data itself, its LSN, and the flag bits) which PQgetXLogData handles could be shoved through the COPY protocol. But, outside libpq, it's somewhat messy to extract the LSN and the flag bits from the data buffer which PQgetCopyData returns, by using ntohs(). So I provided the new libpq function only for replication. That is, I didn't want to expose the low network layer which libpq should handle.

I think that such a friendly function would be useful to implement a standby program (e.g., a stand-alone walreceiver tool) outside the core.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Dec 12, 2009 at 5:09 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Could we change the API of PQgetXLogData to be more like PQgetCopyData?
> I'm thinking of removing the timeout argument, and instead looping with
> select/poll and PQconsumeInput in the caller. That probably means
> introducing a new state analogous to PGASYNC_COPY_IN. I haven't thought
> this fully through yet, but it seems like it would be good to have a
> consistent API.

On a related issue, so far I haven't considered how to output the notice messages at all :( In the current SR, they are always written to stderr by the defaultNoticeProcessor using fprintf, whether or not log_destination is specified. This is bizarre, and would need to be fixed.

I'm going to set a new function calling ereport as the current notice processor by using PQsetNoticeProcessor. But the problem is that only the complete message like "NOTICE: xxx" is passed to such a notice processor, i.e., the error level itself is not passed. So I wonder which error level should be used to output the notice message. There are some approaches to address this:

1. Always use a specific level, without regard to the actual one
2. Reverse-engineer the level from the complete message
3. Change some libpq functions so as to pass the error level to the notice processor

But nothing really stands out. Do you have another good idea?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Mon, Dec 14, 2009 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Do we need a new "PQgetXLogData" function at all? Seems like you could
>> shove the data through the COPY protocol and not have to touch libpq
>> at all, rather than duplicating a nontrivial amount of code there.

> Yeah, I also think that all the data (the WAL data itself, its LSN, and
> the flag bits) which PQgetXLogData handles could be shoved
> through the COPY protocol. But, outside libpq, it's somewhat messy
> to extract the LSN and the flag bits from the data buffer which
> PQgetCopyData returns, by using ntohs(). So I provided the new
> libpq function only for replication. That is, I didn't want to expose
> the low network layer which libpq should handle.

I find that a completely unconvincing division of labor. Who is to say that the LSN is the only part of the data that needs special treatment?

The very, very large practical problem with this is that if you decide to change the behavior at any time, the only way to be sure that the WAL receiver is using the right libpq version is to perform a soname major version bump. The transformations done by libpq will essentially become part of its ABI, and not a very visible part at that.

I am going to insist that no such logic be placed in libpq. From a packager's standpoint that's insanity.

			regards, tom lane
Fujii Masao <masao.fujii@gmail.com> writes:
> I'm going to set a new function calling ereport as the current notice
> processor by using PQsetNoticeProcessor. But the problem is that only the
> complete message like "NOTICE: xxx" is passed to such a notice processor,
> i.e., the error level itself is not passed.

Use PQsetNoticeReceiver. The other one is just there for backwards compatibility.

			regards, tom lane
Tom Lane wrote:
> The very, very large practical problem with this is that if you decide
> to change the behavior at any time, the only way to be sure that the WAL
> receiver is using the right libpq version is to perform a soname major
> version bump. The transformations done by libpq will essentially become
> part of its ABI, and not a very visible part at that.

Not having to change the libpq API would certainly be a big advantage.

It's going to be a bit more complicated in walsender/walreceiver to work with the libpq COPY API. We're going to need a WAL sending/receiving protocol on top of it, defined in terms of rows and columns passed through the COPY protocol.

One problem is that the standby is supposed to send back acknowledgments to the master, telling it how far it has received/replayed the WAL. Is there any way to send information back to the server while a COPY OUT is in progress? That's not absolutely necessary with asynchronous replication, but it will be with synchronous.

One idea is to stop/start the COPY between every batch of WAL records sent, giving the client (= walreceiver) a chance to send messages back. But that will lead to extra round trips.

BTW, something that's been bothering me a bit with this patch is that we now have to link the backend with libpq. I don't see an immediate problem with that, but I'm not a packager. Does anyone see a problem with that?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> It's going to be a bit more complicated in walsender/walreceiver to work
> with the libpq COPY API. We're going to need a WAL sending/receiving
> protocol on top of it, defined in terms of rows and columns passed
> through the COPY protocol.

AFAIR, libpq knows essentially nothing of the data being passed through COPY --- it just treats that as a byte stream. I think you can define any data format you want; it doesn't need to look exactly like a COPY of a table would. In fact it's probably a lot better if it DOESN'T look like COPY data once it gets past libpq, so that you can check that it is WAL and not COPY data.

> One problem is that the standby is supposed to send back acknowledgments
> to the master, telling it how far it has received/replayed the WAL. Is
> there any way to send information back to the server, while a COPY OUT
> is in progress? That's not absolutely necessary with asynchronous
> replication, but will be with synchronous.

Well, a real COPY would of course not stop to look for incoming messages, but I don't think that's inherent in the protocol. You would likely need some libpq adjustments so it didn't throw an error when you tried that, but it would be a small and one-time adjustment.

> BTW, something that's been bothering me a bit with this patch is that we
> now have to link the backend with libpq. I don't see an immediate
> problem with that, but I'm not a packager. Does anyone see a problem
> with that?

Yeah, I have a problem with that. What's the backend doing with libpq? It's not receiving this data, it's sending it.

			regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Yeah, I have a problem with that. What's the backend doing with libpq?
>> It's not receiving this data, it's sending it.

> walreceiver is a postmaster subprocess too.

Hm. Perhaps it should be a loadable plugin and not hard-linked into the backend? Compare dblink.

The main concern I have with hard-linking libpq is that it has a lot of symbol conflicts with the backend --- and at least the ones from src/port/ aren't easily removed. I foresee problems that will be very difficult to fix on platforms where we can't filter the set of link symbols exposed by libpq. Linking a thread-enabled libpq into the backend could also create problems on some platforms --- it would likely cause a thread-capable libc to get linked, which is not what we want.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> BTW, something that's been bothering me a bit with this patch is that we
>> now have to link the backend with libpq. I don't see an immediate
>> problem with that, but I'm not a packager. Does anyone see a problem
>> with that?
>
> Yeah, I have a problem with that. What's the backend doing with libpq?
> It's not receiving this data, it's sending it.

walreceiver is a postmaster subprocess too.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Tue, Dec 15, 2009 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm. Perhaps it should be a loadable plugin and not hard-linked into the
> backend? Compare dblink.

You mean that such a plugin is supplied in shared_preload_libraries, a new process is forked, and the shared memory related to walreceiver is created by using shmem_startup_hook? Since this approach would solve the problem discussed previously, ISTM this makes sense.

http://archives.postgresql.org/pgsql-hackers/2009-11/msg00031.php

Some additional code might be required to control the termination of walreceiver.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Dec 15, 2009 at 3:47 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Tom Lane wrote:
>> The very, very large practical problem with this is that if you decide
>> to change the behavior at any time, the only way to be sure that the WAL
>> receiver is using the right libpq version is to perform a soname major
>> version bump. The transformations done by libpq will essentially become
>> part of its ABI, and not a very visible part at that.
>
> Not having to change the libpq API would certainly be a big advantage.

Done; I replaced PQgetXLogData and PQputXLogRecPtr with PQgetCopyData and PQputCopyData.

git://git.postgresql.org/git/users/fujii/postgres.git
branch: replication

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Tue, Dec 15, 2009 at 3:47 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Tom Lane wrote:
>>> The very, very large practical problem with this is that if you decide
>>> to change the behavior at any time, the only way to be sure that the WAL
>>> receiver is using the right libpq version is to perform a soname major
>>> version bump. The transformations done by libpq will essentially become
>>> part of its ABI, and not a very visible part at that.
>> Not having to change the libpq API would certainly be a big advantage.
>
> Done; I replaced PQgetXLogData and PQputXLogRecPtr with PQgetCopyData and
> PQputCopyData.

Great! The logical next step is to move the handling of TimelineID and system identifier out of libpq as well.

I'm thinking of refactoring the protocol along these lines:

0. Begin by connecting to the master just like a normal backend does. We don't necessarily need the new ProtocolVersion code either, though it's probably still a good idea to reject connections to older server versions.

1. Get the system identifier of the master.

Slave -> Master: Query message, with a query string like "GET_SYSTEM_IDENTIFIER"

Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages. The system identifier is returned in the DataRow message.

This is identical to what happens when a query is executed against a normal backend using the simple query protocol, so walsender can use PQexec() for this.

2. Another query exchange like above, for timeline ID. (Or these two steps can be joined into one query, to eliminate one round-trip.)

3. Request a backup history file, if needed:

Slave -> Master: Query message, with a query string like "GET_BACKUP_HISTORY_FILE XXX" where XXX is an XLogRecPtr or file name.

Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages as usual. The file contents are returned in the DataRow message.

4. Start replication

Slave -> Master: Query message, with query string "START REPLICATION: XXXX", where XXXX is the RecPtr of the starting point.

Master -> Slave: CopyOutResponse followed by a continuous stream of CopyData messages with WAL contents.

This minimizes the changes to the protocol and libpq, with a clear way of extending by adding new commands. Similar to what you did a long time ago, connecting as an actual backend at first and then switching to walsender mode after running a few queries, but this would all be handled in a separate loop in walsender instead of running as a full-blown backend. We'll still need small changes to libpq to allow sending messages back to the server in COPY_IN mode (maybe add a new COPY_IN_OUT mode for that).

Thoughts?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
I'm interested in abstracting out features of replication from libpq too. It would be nice if we could implement different communication bus modules.

For example, if you have dozens of replicas you may want to use something like spread to distribute the records using multicast.

Sorry for top posting -- I haven't yet figured out how not to in this client.

On 16 Dec 2009 09:54, "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com> wrote:
> [...]
On Wed, Dec 16, 2009 at 6:53 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Great! The logical next step is to move the handling of TimelineID and
> system identifier out of libpq as well.

All right.

> 0. Begin by connecting to the master just like a normal backend does. We
> don't necessarily need the new ProtocolVersion code either, though it's
> probably still a good idea to reject connections to older server versions.

And, I think that such a backend should switch to walsender mode when the startup packet arrives. Otherwise, we would have to authenticate the backend twice in different contexts, i.e., as a normal backend and as walsender, so settings for each context would be required in pg_hba.conf. This is odd, I think. Thoughts?

> 1. Get the system identifier of the master.
>
> Slave -> Master: Query message, with a query string like
> "GET_SYSTEM_IDENTIFIER"
>
> Master -> Slave: RowDescription, DataRow, CommandComplete, and
> ReadyForQuery messages. The system identifier is returned in the DataRow
> message.
>
> This is identical to what happens when a query is executed against a
> normal backend using the simple query protocol, so walsender can use
> PQexec() for this.

s/walsender/walreceiver ?

A signal cannot cancel PQexec() while it is waiting for a message from the server. We might need to change the SIGTERM handler of walreceiver so as to call proc_exit() immediately if the signal arrives during PQexec().

> 2. Another query exchange like above, for timeline ID. (Or these two
> steps can be joined into one query, to eliminate one round-trip.)
>
> 3. Request a backup history file, if needed:
>
> Slave -> Master: Query message, with a query string like
> "GET_BACKUP_HISTORY_FILE XXX" where XXX is an XLogRecPtr or file name.
>
> Master -> Slave: RowDescription, DataRow, CommandComplete, and
> ReadyForQuery messages as usual. The file contents are returned in the
> DataRow message.
>
> 4. Start replication
>
> Slave -> Master: Query message, with query string "START REPLICATION:
> XXXX", where XXXX is the RecPtr of the starting point.
>
> Master -> Slave: CopyOutResponse followed by a continuous stream of
> CopyData messages with WAL contents.

Seems OK.

> This minimizes the changes to the protocol and libpq, with a clear way
> of extending by adding new commands. Similar to what you did a long time
> ago, connecting as an actual backend at first and then switching to
> walsender mode after running a few queries, but this would all be
> handled in a separate loop in walsender instead of running as a
> full-blown backend.

Agreed. Only walsender should be allowed to handle the query strings that you proposed, so that we avoid touching the parser.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Wed, Dec 16, 2009 at 6:53 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> 0. Begin by connecting to the master just like a normal backend does. We
>> don't necessarily need the new ProtocolVersion code either, though it's
>> probably still a good idea to reject connections to older server versions.
>
> And, I think that such a backend should switch to walsender mode when the
> startup packet arrives. Otherwise, we would have to authenticate the
> backend twice in different contexts, i.e., as a normal backend and as
> walsender, so settings for each context would be required in pg_hba.conf.
> This is odd, I think. Thoughts?

True.

>> This is identical to what happens when a query is executed against a
>> normal backend using the simple query protocol, so walsender can use
>> PQexec() for this.
>
> s/walsender/walreceiver ?

Right.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, Dec 17, 2009 at 9:02 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> And, I think that such a backend should switch to walsender mode when the
>> startup packet arrives. Otherwise, we would have to authenticate the
>> backend twice in different contexts, i.e., as a normal backend and as
>> walsender, so settings for each context would be required in pg_hba.conf.
>> This is odd, I think. Thoughts?
>
> True.

Currently this switch depends on whether XLOG_STREAMING_CODE is sent from the standby or not, which in turn depends on whether PQstartXLogStreaming() is called or not. But, as the next step, we should get rid of such libpq changes as well.

I'm thinking of making the standby send the "walsender-switch-code" the same way as application_name: walreceiver always specifies an option like "replication=on" in the conninfo string and calls PQconnectdb(), which sends the code as part of the startup packet. And no environment variable should be defined for it, to avoid misconfiguration by users, I think.

Thoughts? Better ideas?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote: > I'm thinking of making the standby send the "walsender-switch-code" the same way > as application_name; walreceiver always specifies the option like > "replication=on" > in conninfo string and calls PQconnectdb(), which sends the code as a part of > startup packet. And, the environment variable for that should not be defined to > avoid user's mis-configuration, I think. Sounds good. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 17, 2009 at 10:25 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Fujii Masao wrote: >> I'm thinking of making the standby send the "walsender-switch-code" the same way >> as application_name; walreceiver always specifies the option like >> "replication=on" >> in conninfo string and calls PQconnectdb(), which sends the code as a part of >> startup packet. And, the environment variable for that should not be defined to >> avoid user's mis-configuration, I think. > > Sounds good. Okay. Design clarification again:

0. Begin by connecting to the master using PQconnectdb() with a new conninfo option specifying the request for replication. The startup packet with the request is sent to the master, and the backend then switches to walsender mode. The walsender goes into its main loop and waits for requests from the walreceiver.

1. Get the system identifier of the master. Slave -> Master: Query message, with a query string like "GET_SYSTEM_IDENTIFIER". Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages. The system identifier is returned in the DataRow message.

2. Another query exchange like the above, for the timeline ID. Slave -> Master: Query message, with a query string like "GET_TIMELINE". Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages. The timeline ID is returned in the DataRow message.

3. Request a backup history file, if needed. Slave -> Master: Query message, with a query string like "GET_BACKUP_HISTORY_FILE XXX", where XXX is an XLogRecPtr. Master -> Slave: RowDescription, DataRow, CommandComplete, and ReadyForQuery messages as usual. The file contents are returned in the DataRow message.

In 1, 2, and 3, the walreceiver uses PQexec() to send the Query message and receive the results.

4. Start replication. Slave -> Master: Query message, with query string "START REPLICATION: XXXX", where XXXX is the RecPtr of the starting point.
Master -> Slave: CopyOutResponse followed by a continuous stream of CopyData messages with WAL contents. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
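On the walsender side, step 4 implies parsing the WAL location out of the "START REPLICATION: XXXX" query string before streaming can start. A minimal sketch of that parsing, assuming the usual hexadecimal hi/lo notation for WAL locations; the XLogRecPtr struct here is only a stand-in for the backend's real definition, and the function name is made up for the example:

```c
#include <stdio.h>
#include <stdint.h>

/* Stand-in for the backend's XLogRecPtr (see src/include/access/xlogdefs.h). */
typedef struct XLogRecPtr
{
    uint32_t xlogid;   /* high 32 bits of the WAL location */
    uint32_t xrecoff;  /* low 32 bits: byte offset */
} XLogRecPtr;

/*
 * Parse a "START REPLICATION: X/X" command string into an XLogRecPtr.
 * Returns 1 on success, 0 if the string is not a valid command.
 */
int
parse_start_replication(const char *query, XLogRecPtr *recptr)
{
    unsigned int hi, lo;

    if (sscanf(query, "START REPLICATION: %X/%X", &hi, &lo) != 2)
        return 0;
    recptr->xlogid = hi;
    recptr->xrecoff = lo;
    return 1;
}
```

The walsender would then compare the parsed location against the WAL it has available to decide where the CopyData stream begins.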
On Fri, Dec 18, 2009 at 11:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Okey. Design clarification again; > > 0. Begin by connecting to the master using PQconnectdb() with new conninfo > option specifying the request of replication. The startup packet with the > request is sent to the master, then the backend switches to the walsender > mode. The walsender goes into the main loop and wait for the request from > the walreceiver. <snip> > 4. Start replication > > Slave -> Master: Query message, with query string "START REPLICATION: > XXXX", where XXXX is the RecPtr of the starting point. > > Master -> Slave: CopyOutResponse followed by a continuous stream of > CopyData messages with WAL contents. Done. Currently there is no new libpq function for replication. The walreceiver uses only existing functions like PQconnectdb, PQexec, PQgetCopyData, etc. git://git.postgresql.org/git/users/fujii/postgres.git branch: replication Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Fri, Dec 18, 2009 at 11:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Okey. Design clarification again; >> >> 0. Begin by connecting to the master using PQconnectdb() with new conninfo >> option specifying the request of replication. The startup packet with the >> request is sent to the master, then the backend switches to the walsender >> mode. The walsender goes into the main loop and wait for the request from >> the walreceiver. > <snip> >> 4. Start replication >> >> Slave -> Master: Query message, with query string "START REPLICATION: >> XXXX", where XXXX is the RecPtr of the starting point. >> >> Master -> Slave: CopyOutResponse followed by a continuous stream of >> CopyData messages with WAL contents. > > Done. Currently there is no new libpq function for replication. The > walreceiver uses only existing functions like PQconnectdb, PQexec, > PQgetCopyData, etc. Ok thanks, sounds good, I'll take a look. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Fujii Masao wrote: > On Tue, Dec 15, 2009 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hm. Perhaps it should be a loadable plugin and not hard-linked into the >> backend? Compare dblink. > > You mean that such plugin is supplied in shared_preload_libraries, > a new process is forked and the shared-memory related to walreceiver > is created by using shmem_startup_hook? Since this approach would > solve the problem discussed previously, ISTM this makes sense. > http://archives.postgresql.org/pgsql-hackers/2009-11/msg00031.php > > Some additional code might be required to control the termination > of walreceiver. I'm not sure which problem in that thread you're referring to, but I can see two options: 1. Use dlopen()/dlsym() in walreceiver to use libpq. A bit awkward, though we could write a bunch of macros to hide that and make the libpq calls look normal. 2. Move walreceiver altogether into a loadable module, which is linked as usual to libpq. Like e.g. contrib/dblink. Thoughts? Both seem reasonable to me. I tested the 2nd option (see 'replication' branch in my git repository), splitting walreceiver.c into two: the functions that run in the walreceiver process, and the functions that are called from other processes to control walreceiver. That's quite a nice separation, though of course we could do that with the 1st approach as well. PS. I just merged with CVS HEAD. Streaming replication is pretty awesome with Hot Standby! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Fujii Masao wrote: > I'm not sure which problem in that thread you're referring to, but I can > see two options: > 1. Use dlopen()/dlsym() in walreceiver to use libpq. A bit awkward, > though we could write a bunch of macros to hide that and make the libpq > calls look normal. > 2. Move walreceiver altogether into a loadable module, which is linked > as usual to libpq. Like e.g contrib/dblink. > Thoughts? Both seem reasonable to me. From a packager's standpoint the second is much saner. If you want to use dlopen() then you will have to know the exact name of the .so file (e.g. libpq.so.5.3) and possibly its location too. Or you will have to persuade packagers that they should ship bare "libpq.so" symlinks, which is contrary to packaging standards on most Linux distros. (walreceiver.so wouldn't be subject to those standards, but libpq is because it's a regular library that can also be hard-linked by applications.) regards, tom lane
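Tom's packaging point is easy to demonstrate with any system library. A hedged sketch of option 1's mechanism (assuming a Linux-like system where the runtime library is installed under its versioned soname, e.g. "libm.so.6", while the bare "libm.so" symlink only ships with the development package — the same situation as "libpq.so.5.3" vs. "libpq.so"):

```c
#include <dlfcn.h>
#include <stddef.h>

typedef double (*math_fn)(double);

/*
 * Resolve a function from a shared library loaded at runtime.  The
 * caller must pass the exact versioned soname (e.g. "libm.so.6"):
 * on most distros the unversioned symlink may simply not be there,
 * which is why dlopen()ing libpq would force packagers' hands.
 */
math_fn
resolve_math_fn(const char *soname, const char *symbol)
{
    void *handle = dlopen(soname, RTLD_NOW);

    if (handle == NULL)
        return NULL;
    /* Object-to-function-pointer cast: not strictly ISO C, but
     * well-defined on POSIX systems, which is what dlsym requires. */
    return (math_fn) dlsym(handle, symbol);
}
```

The macro wrapper Heikki mentions would do exactly this once per libpq entry point, caching the resolved pointers.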
On Tue, Dec 22, 2009 at 2:31 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > 2. Move walreceiver altogether into a loadable module, which is linked > as usual to libpq. Like e.g contrib/dblink. > > Thoughts? Both seem reasonable to me. I tested the 2nd option (see > 'replication' branch in my git repository), splitting walreceiver.c into > two: the functions that run in the walreceiver process, and the > functions that are called from other processes to control walreceiver. > That's a quite nice separation, though of course we could do that with > the 1st approach as well. Though I'm not sure I understand what a loadable module means, I wonder how the walreceiver module is loaded. AFAIK, we need to manually install the dblink functions by executing dblink.sql before using them. Likewise, if we choose the 2nd option, must we manually install the walreceiver module before starting replication? Or do we automatically install it by executing system_views.sql, like pg_start_backup? I'd like to reduce the number of installation operations as much as possible. Is my concern beside the point? > PS. I just merged with CVS HEAD. Streaming replication is pretty awesome > with Hot Standby! Thanks! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > Though I seem not to understand what a loadable module means, I wonder > how the walreceiver module is loaded. Put it in shared_preload_libraries, perhaps. regards, tom lane
Fujii Masao wrote: > On Tue, Dec 22, 2009 at 2:31 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> 2. Move walreceiver altogether into a loadable module, which is linked >> as usual to libpq. Like e.g contrib/dblink. >> >> Thoughts? Both seem reasonable to me. I tested the 2nd option (see >> 'replication' branch in my git repository), splitting walreceiver.c into >> two: the functions that run in the walreceiver process, and the >> functions that are called from other processes to control walreceiver. >> That's a quite nice separation, though of course we could do that with >> the 1st approach as well. > > Though I seem not to understand what a loadable module means, I wonder > how the walreceiver module is loaded. AFAIK, we need to manually install > the dblink functions by executing dblink.sql before using them. Likewise, > if we choose the 2nd option, we must manually install the walreceiver > module before starting replication? I think we can just use load_external_function() to load the library and call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the library name. Walreceiver is quite tightly coupled with the rest of the backend anyway, so I don't think we need to come up with a pluggable API at the moment. That's the way I did it yesterday, see 'replication' branch in my git repository, but it looks like I fumbled the commit so that some of the changes were committed as part of the merge commit with origin/master (=CVS HEAD). Sorry about that. shared_preload_libraries seems like a bad place because the library doesn't need to be loaded in all backends. Just the walreceiver process. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 22, 2009 at 3:30 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I think we can just use load_external_function() to load the library and > call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the > library name. Walreceiver is quite tightly coupled with the rest of the > backend anyway, so I don't think we need to come up with a pluggable API > at the moment. > > That's the way I did it yesterday, see 'replication' branch in my git > repository, but it looks like I fumbled the commit so that some of the > changes were committed as part of the merge commit with origin/master > (=CVS HEAD). Sorry about that. Umm.., I still cannot find the place where the walreceiver module is loaded by using load_external_function() in your 'replication' branch. Also the compilation of that branch fails. Is the 'pushed' branch the latest? Sorry if I'm missing something. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Dec 22, 2009 at 6:30 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I think we can just use load_external_function() to load the library and > call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the > library name. Walreceiver is quite tightly coupled with the rest of the > backend anyway, so I don't think we need to come up with a pluggable API > at the moment. Please? I am really interested in replacing walsender and walreceiver with something which uses a communication bus like spread instead of a single point-to-point connection. ISTM if we start with something tightly coupled it'll be hard to decouple later. Whereas if we start with a limited interface we'll learn just how much information is really required by the modules and will have fewer surprises later when we find surprising interdependencies. -- greg
Fujii Masao wrote: > On Tue, Dec 22, 2009 at 3:30 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I think we can just use load_external_function() to load the library and >> call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the >> library name. Walreceiver is quite tightly coupled with the rest of the >> backend anyway, so I don't think we need to come up with a pluggable API >> at the moment. >> >> That's the way I did it yesterday, see 'replication' branch in my git >> repository, but it looks like I fumbled the commit so that some of the >> changes were committed as part of the merge commit with origin/master >> (=CVS HEAD). Sorry about that. > > Umm.., I still cannot find the place where the walreceiver module is > loaded by using load_external_function() in your 'replication' branch. > Also the compilation of that branch fails. Is the 'pushed' branch the > latest? Sorry if I'm missing something. Ah, I see. The changes were not included in the merge commit after all, but I had simply forgotten to "git add" them. Sorry about that, should be there now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Greg Stark wrote: > On Tue, Dec 22, 2009 at 6:30 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I think we can just use load_external_function() to load the library and >> call WalReceiverMain from AuxiliaryProcessMain(). Ie. hard-code the >> library name. Walreceiver is quite tightly coupled with the rest of the >> backend anyway, so I don't think we need to come up with a pluggable API >> at the moment. > > Please? I am really interested in replacing walsender and walreceiver > with something which uses a communication bus like spread instead of a > single point to point connection. I think you'd still need to be able to request older WAL segments to resync after a lost connection, restore from base backup etc., which don't really fit into a publish/subscribe style communication bus. I'm sure it could all be solved though. It would be a pretty cool feature for scaling to a large number of slaves. > ISTM if we start with something tightly coupled it'll be hard to > decouple later. Whereas if we start with a limited interface we'll > learn just how much information is really required by the modules and > will have fewer surprises later when we find suprising > interdependencies. I'm all ears if you have a concrete proposal. I'm not too worried about it being hard to decouple later. The interface is actually quite limited already, as the communication between processes is done via shared memory. It probably wouldn't be hard to turn it into an API, but I don't think there's a hurry to do that until someone actually steps up to write an alternative walreceiver/walsender. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 22, 2009 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Ah, I see. The changes were not included in the merge commit after all, > but I had simple forgot to "git add" them. Sorry about that, should be > there now. Thanks for doing "git push" again! But the compilation still fails. Attached patch addresses this problem. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
I've merged the replication branch with PostgreSQL CVS HEAD now, including the patch for end-of-backup WAL records I committed earlier today. See 'replication' branch in my git repository. There's also a couple of other small changes: I believe the SSL stuff isn't really necessary, so I removed it. I also moved the START_REPLICATION phase from the walreceiver main loop to WalRcvConnect, as it's simpler that way. I will continue reviewing.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Jan 5, 2010 at 12:22 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I've merged the replication branch with PostgreSQL CVS HEAD now, > including the patch for end-of-backup WAL records I committed earlier > today. See 'replication' branch in my git repository. > > There's also a couple of other small changes: I believe the SSL stuff > isn't really necessary, so I removed it. I also moved the > START_REPLICATION phase from the walreceiver main loop to WalRcvConnect, > as it's simpler that way. I also fixed a couple of small bugs:

* The ErrorResponse message from the primary server had been ignored
* The segment-boundary had been wrongly handled
* Valid replication starting location had been wrongly regarded as invalid

git://git.postgresql.org/git/users/fujii/postgres.git branch: replication

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Dec 22, 2009 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> Umm.., I still cannot find the place where the walreceiver module is >> loaded by using load_external_function() in your 'replication' branch. >> Also the compilation of that branch fails. Is the 'pushed' branch the >> latest? Sorry if I'm missing something. > > Ah, I see. The changes were not included in the merge commit after all, > but I had simple forgot to "git add" them. Sorry about that, should be > there now. This change which moves walreceiver process into a dynamically loaded module caused the following compile error on my MinGW environment. --------------------------- gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g -I. -I../../../../src/interfaces/libpq -I../../../../src/include -I./src/include/port/win32 -DEXEC_BACKEND "-I../../../../src/include/port/win32" -DBUILDING_DLL -c -o walreceiverproc.o walreceiverproc.c dlltool --export-all --output-def libwalreceiverprocdll.def walreceiverproc.o dllwrap -o walreceiverproc.dll --dllname walreceiverproc.dll --def libwalreceiverprocdll.def walreceiverproc.o -L../../../../src/backend -lpostgres -L../../../../src/interfaces/libpq -L../../../../src/port -lpq Info: resolving _pg_signal_mask by linking to __imp__pg_signal_mask (auto-import) Info: resolving _pg_signal_queue by linking to __imp__pg_signal_queue (auto-import) Info: resolving _InterruptPending by linking to __imp__InterruptPending (auto-import) Info: resolving _assert_enabled by linking to __imp__assert_enabled (auto-import) Info: resolving _WalRcv by linking to __imp__WalRcv (auto-import) Info: resolving _proc_exit_inprogress by linking to __imp__proc_exit_inprogress (auto-import) Info: resolving _BlockSig by linking to __imp__BlockSig (auto-import) Info: resolving _sync_method by linking to __imp__sync_method (auto-import) Info: resolving _MyProcPid by linking to 
__imp__MyProcPid (auto-import) Info: resolving _CurrentResourceOwner by linking to __imp__CurrentResourceOwner (auto-import) Info: resolving _TopMemoryContext by linking to __imp__TopMemoryContext (auto-import) Info: resolving _CurrentMemoryContext by linking to __imp__CurrentMemoryContext (auto-import) Info: resolving _PG_exception_stack by linking to __imp__PG_exception_stack (auto-import) Info: resolving _UnBlockSig by linking to __imp__UnBlockSig (auto-import) Info: resolving _ThisTimeLineID by linking to __imp__ThisTimeLineID (auto-import) Info: resolving _error_context_stack by linking to __imp__error_context_stack (auto-import) Info: resolving _InterruptHoldoffCount by linking to __imp__InterruptHoldoffCount (auto-import) c:\MinGW\bin\..\lib\gcc\mingw32\3.4.2\..\..\..\..\mingw32\bin\ld.exe: warning: auto-importing has been activated without --enable-auto-import specified on the command line. This should work unless it involves constant data structures referencing symbols from auto-imported DLLs. 
fu000001.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000003.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000005.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000006.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000008.o:(.idata$2+0xc): undefined reference to `libpostgres_a_iname'
fu000009.o:(.idata$2+0xc): more undefined references to `libpostgres_a_iname' follow
nmth000000.o:(.idata$4+0x0): undefined reference to `_nm__pg_signal_mask'
nmth000002.o:(.idata$4+0x0): undefined reference to `_nm__pg_signal_queue'
nmth000004.o:(.idata$4+0x0): undefined reference to `_nm__InterruptPending'
nmth000007.o:(.idata$4+0x0): undefined reference to `_nm__assert_enabled'
nmth000012.o:(.idata$4+0x0): undefined reference to `_nm__WalRcv'
nmth000018.o:(.idata$4+0x0): undefined reference to `_nm__proc_exit_inprogress'
nmth000020.o:(.idata$4+0x0): undefined reference to `_nm__BlockSig'
nmth000023.o:(.idata$4+0x0): undefined reference to `_nm__sync_method'
nmth000026.o:(.idata$4+0x0): undefined reference to `_nm__MyProcPid'
nmth000028.o:(.idata$4+0x0): undefined reference to `_nm__CurrentResourceOwner'
nmth000030.o:(.idata$4+0x0): undefined reference to `_nm__TopMemoryContext'
nmth000032.o:(.idata$4+0x0): undefined reference to `_nm__CurrentMemoryContext'
nmth000035.o:(.idata$4+0x0): undefined reference to `_nm__PG_exception_stack'
nmth000037.o:(.idata$4+0x0): undefined reference to `_nm__UnBlockSig'
nmth000039.o:(.idata$4+0x0): undefined reference to `_nm__ThisTimeLineID'
nmth000041.o:(.idata$4+0x0): undefined reference to `_nm__error_context_stack'
nmth000043.o:(.idata$4+0x0): undefined reference to `_nm__InterruptHoldoffCount'
collect2: ld returned 1 exit status
c:\MinGW\bin\dllwrap.exe: c:\MinGW\bin\gcc exited with status 1
make[2]: *** [walreceiverproc.dll] Error 1
make[2]: Leaving directory `/c/postgres/mmm/src/backend/postmaster/walreceiverproc'
make[1]: *** [all] Error 2
make[1]: Leaving directory
`/c/postgres/mmm/src' make: *** [all] Error 2 --------------------------- Though I marked the variables shown in the above message as PGDLLIMPORT, the "make" still fails in the same way. I struggled with this issue for some time, but could not fix it yet :( Frankly I'm not familiar with that area. So it would be nice if someone could analyze this issue. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 12, 2010 at 17:58, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Dec 22, 2009 at 8:49 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >>> Umm.., I still cannot find the place where the walreceiver module is >>> loaded by using load_external_function() in your 'replication' branch. >>> Also the compilation of that branch fails. Is the 'pushed' branch the >>> latest? Sorry if I'm missing something. >> >> Ah, I see. The changes were not included in the merge commit after all, >> but I had simple forgot to "git add" them. Sorry about that, should be >> there now. > > This change which moves walreceiver process into a dynamically loaded > module caused the following compile error on my MinGW environment. That sounds strange - it should pick those up from the -lpostgres. Any chance you have an old postgres binary around from a non-syncrep build or something? > --------------------------- > > Though I marked the variables shown in the above message as PGDLLIMPORT, > the "make" still fails in the same way. I struggled with this issue > for some time, but > could not fix it yet :( > > Frankly I'm not familiar with that area. So it would be nice if > someone could analyze > this issue. Do you have an environment to try to build it under msvc? In my experience, that gives you easier-to-understand error messages in a lot of cases like this - it removes the mingw black magic. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Thanks for your advice! On Wed, Jan 13, 2010 at 3:37 AM, Magnus Hagander <magnus@hagander.net> wrote: >> This change which moves walreceiver process into a dynamically loaded >> module caused the following compile error on my MinGW environment. > > That sounds strange - it should pick those up from the -lpostgres. Any > chance you have an old postgres binary around from a non-syncrep build > or something? No, there is no old postgres binary. > Do you have an environment to try to build it under msvc? No, unfortunately. > in my > experience, that gives you easier-to-understand error messages in a > lot of cases like this - it removets the mingw black magic. OK. I'll try to build it under msvc. But since there seems to be a long way to go before doing that, I would appreciate if someone could give me some advice. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > Done. Currently there is no new libpq function for replication. The > walreceiver uses only existing functions like PQconnectdb, PQexec, > PQgetCopyData, etc. > > git://git.postgresql.org/git/users/fujii/postgres.git > branch: replication Thanks! I'm afraid we haven't quite nailed the select/poll issue yet. You copied pq_wait() from the libpq pqSocketCheck(), but there's one big difference between the backend and the frontend: the frontend always puts the connection to non-blocking mode, while the backend uses blocking mode. At least with SSL, I think it's possible for pq_wait() to return false positives, if the SSL layer decides to renegotiate the connection, causing data to flow in the other direction in the underlying TCP connection. A false positive would cause walsender to block indefinitely on the pq_getbyte() call. I don't even want to think about the changes required to put the backend socket to non-blocking mode; I don't know that code well enough. Maybe we could temporarily put it to non-blocking mode, read to see if there's any data available, and put it back to blocking mode. But even then I think we'd need to modify at least secure_read() to work correctly with SSL in non-blocking mode. Another idea is to use poll() to check for POLLHUP, on those platforms that have poll(). AFAICS there is no equivalent for that in select(), so for platforms that don't have poll() we would have to simply ignore the issue or write some other platform-specific work-around (Windows WSAEventSelect() seems to have a FD_CLOSE event for that). That would be quite a localized change. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Jan 13, 2010 at 7:27 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > the frontend always puts the > connection to non-blocking mode, while the backend uses blocking mode. Really? By default (i.e., without expressly setting it using PQsetnonblocking()), the connection is set to blocking mode even in the frontend. Am I missing something? > At least with SSL, I think it's possible for pq_wait() to return false > positives, if the SSL layer decides to renegotiate the connection > causing data to flow in the other direction in the underlying TCP > connection. A false positive would lead cause walsender to block > indefinitely on the pq_getbyte() call. Sorry. I could not understand that issue scenario. Could you explain it in more detail? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Wed, Jan 13, 2010 at 7:27 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> the frontend always puts the >> connection to non-blocking mode, while the backend uses blocking mode. > > Really? By default (i.e., without the expressly setting by using > PQsetnonblocking()), the connection is set to blocking mode even > in frontend. Am I missing something? That's right. The underlying socket is always put to non-blocking mode in libpq. PQsetnonblocking() only affects whether libpq commands wait and retry if the output buffer is full. >> At least with SSL, I think it's possible for pq_wait() to return false >> positives, if the SSL layer decides to renegotiate the connection >> causing data to flow in the other direction in the underlying TCP >> connection. A false positive would lead cause walsender to block >> indefinitely on the pq_getbyte() call. > > Sorry. I could not understand that issue scenario. Could you explain > it in more detail?

1. Walsender calls pq_wait() which calls select(), waiting for timeout, or data to become available for reading in the underlying socket.

2. Client issues an SSL renegotiation by sending a message to the server.

3. Server receives the message, and select() returns indicating that data has arrived.

4. Walsender calls HandleEndOfRep() which calls pq_getbyte(). pq_readbyte() calls SSL_read(), which receives the renegotiation message and handles it. No application data has arrived, however, so SSL_read() blocks waiting for some to arrive. It never does.

I don't understand enough of SSL to know if renegotiation can actually happen like that, but the man page of SSL_read() suggests so. But a similar thing can happen if an SSL record is broken into two TCP packets.
2010/1/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > Fujii Masao wrote: >> On Wed, Jan 13, 2010 at 7:27 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> the frontend always puts the >>> connection to non-blocking mode, while the backend uses blocking mode. >> >> Really? By default (i.e., without the expressly setting by using >> PQsetnonblocking()), the connection is set to blocking mode even >> in frontend. Am I missing something? > > That's right. The underlying socket is always put to non-blocking mode > in libpq. PQsetnonblocking() only affects whether libpq commands wait > and retry if the output buffer is full. > >>> At least with SSL, I think it's possible for pq_wait() to return false >>> positives, if the SSL layer decides to renegotiate the connection >>> causing data to flow in the other direction in the underlying TCP >>> connection. A false positive would lead cause walsender to block >>> indefinitely on the pq_getbyte() call. >> >> Sorry. I could not understand that issue scenario. Could you explain >> it in more detail? > > 1. Walsender calls pq_wait() which calls select(), waiting for timeout, > or data to become available for reading in the underlying socket. > > 2. Client issues an SSL renegotiation by sending a message to the server > > 3. Server receives the message, and select() returns indicating that > data has arrived > > 4. Walsender calls HandleEndOfRep() which calls pq_getbyte(). > pq_readbyte() calls SSL_read(), which receives the renegotiation message > and handles it. No application data has arrived, however, so SSL_read() > blocks for some to arrive. It never does. > > I don't understand enough of SSL to know if renegotiation can actually > happen like that, but the man page of SSL_read() suggests so. But a > similar thing can happen if an SSL record is broken into two TCP > packets. 
> select() returns immediately as the first packet arrives, but > SSL_read() will block until the 2nd packet arrives. I *think* renegotiation happens based on amount of content, not amount of time. But it could still happen in corner cases I think. If the renegotiation happens right after a complete packet has been sent (which would be the logical place), but not fast enough that the SSL library gets it in one read() from the socket, you could end up in that situation. (if the SSL library gets the renegotiation request as part of the first read(), it would probably do the renegotiation before returning from that call to SSL_read(), in which case the socket would be in the correct state before you call select) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
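The failure mode described here does not even need SSL to reproduce in miniature: select() reports the socket readable as soon as the first byte of a multi-byte "record" arrives, even though a blocking read of the whole record would hang. A self-contained demonstration over a Unix socketpair (the function name is made up for the demo):

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Deliver only the first byte of a two-byte "record", then check what
 * select() and a non-blocking peek say about the receiving socket.
 * select() reports the socket readable, but only 1 of the 2 record
 * bytes can actually be read -- so a loop that blocks until the full
 * record arrives would hang, just like SSL_read() on an SSL record
 * split across TCP packets.  Returns the number of bytes available,
 * or -1 on failure.
 */
int
partial_record_demo(void)
{
    int sv[2];
    char buf[2];
    fd_set rfds;
    struct timeval tv = {0, 0};
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[0], "X", 1) != 1)      /* half of a 2-byte record */
        return -1;

    FD_ZERO(&rfds);
    FD_SET(sv[1], &rfds);
    if (select(sv[1] + 1, &rfds, NULL, NULL, &tv) != 1)
        return -1;                      /* "readable", says select() */

    /* Peek without blocking: only one of the two bytes is there. */
    n = recv(sv[1], buf, sizeof(buf), MSG_DONTWAIT | MSG_PEEK);
    close(sv[0]);
    close(sv[1]);
    return (int) n;
}
```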
After reading up on SSL_read() and SSL_pending(), it seems that there is unfortunately no reliable way of checking if there is incoming data that can be read using SSL_read() without blocking, short of putting the socket to non-blocking mode. It also seems that we can't rely on poll() returning POLLHUP if the remote end has disconnected; it's not doing that at least on my laptop. So, the only solution I can see is to put the socket to non-blocking mode. But to keep the change localized, let's switch to non-blocking mode only temporarily, just when polling to see if there's data to read (or EOF), and switch back immediately afterwards. I've added a pq_getbyte_if_available() function to pqcomm.c to do that. The API to the upper levels is quite nice: the function returns a byte if one is available without blocking. Only minimal changes are required elsewhere. See that in my git repository. Attached is a new version of the whole streaming replication patch, for the benefit of archives and git non-users. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Jan 14, 2010 at 9:14 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > After reading up on SSL_read() and SSL_pending(), it seems that there is > unfortunately no reliable way of checking if there is incoming data that > can be read using SSL_read() without blocking, short of putting the > socket to non-blocking mode. It also seems that we can't rely on poll() > returning POLLHUP if the remote end has disconnected; it's not doing > that at least on my laptop. > > So, the only solution I can see is to put the socket to non-blocking > mode. But to keep the change localized, let's switch to non-blocking > mode only temporarily, just when polling to see if there's data to read > (or EOF), and switch back immediately afterwards. Agreed. Though I also read some pages referring to that issue, I was not able to find any better action other than the temporary switch of the blocking mode. > I've added a pq_getbyte_if_available() function to pqcomm.c to do that. > The API to the upper levels is quite nice, the function returns a byte > if one is available without blocking. Only minimal changes are required > elsewhere. Great! Thanks a lot! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Wed, Jan 13, 2010 at 3:37 AM, Magnus Hagander <magnus@hagander.net> wrote: >>> This change which moves walreceiver process into a dynamically loaded >>> module caused the following compile error on my MinGW environment. >> That sounds strange - it should pick those up from the -lpostgres. Any >> chance you have an old postgres binary around from a non-syncrep build >> or something? > > No, there is no old postgres binary. > >> Do you have an environment to try to build it under msvc? > > No, unfortunately. > >> in my >> experience, that gives you easier-to-understand error messages in a >> lot of cases like this - it removes the mingw black magic. > > OK. I'll try to build it under msvc. > > But since there seems to be a long way to go before doing that, > I would appreciate it if someone could give me some advice. It looks like dawn_bat is experiencing the same problem. I don't think we want to sprinkle all those variables with PGDLLIMPORT, and it didn't fix the problem for you earlier anyway. Is there some other way to fix this? Do people still use MinGW for any real work? Could we just drop walreceiver support from MinGW builds? Or maybe we should consider splitting walreceiver into two parts after all. Only the bare minimum that needs to access libpq would go into the shared object, and the rest would be linked with the backend as usual. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
2010/1/15 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > Fujii Masao wrote: >> On Wed, Jan 13, 2010 at 3:37 AM, Magnus Hagander <magnus@hagander.net> wrote: >>>> This change which moves walreceiver process into a dynamically loaded >>>> module caused the following compile error on my MinGW environment. >>> That sounds strange - it should pick those up from the -lpostgres. Any >>> chance you have an old postgres binary around from a non-syncrep build >>> or something? >> >> No, there is no old postgres binary. >> >>> Do you have an environment to try to build it under msvc? >> >> No, unfortunately. >> >>> in my >>> experience, that gives you easier-to-understand error messages in a >>> lot of cases like this - it removes the mingw black magic. >> >> OK. I'll try to build it under msvc. >> >> But since there seems to be a long way to go before doing that, >> I would appreciate it if someone could give me some advice. > > It looks like dawn_bat is experiencing the same problem. I don't think > we want to sprinkle all those variables with PGDLLIMPORT, and it didn't > fix the problem for you earlier anyway. Is there some other way to fix this? > > Do people still use MinGW for any real work? Could we just drop > walreceiver support from MinGW builds? We don't know if this works on MSVC, because MSVC doesn't actually try to build the walreceiver. I'm going to look at that tomorrow. If we get the same issues there, we have a problem in our code. If not, we need to figure out what's up with mingw. > Or maybe we should consider splitting walreceiver into two parts after > all. Only the bare minimum that needs to access libpq would go into the > shared object, and the rest would be linked with the backend as usual. That would certainly be one option. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Heikki Linnakangas wrote: > Do people still use MinGW for any real work? Could we just drop > walreceiver support from MinGW builds? > > Or maybe we should consider splitting walreceiver into two parts after > all. Only the bare minimum that needs to access libpq would go into the > shared object, and the rest would be linked with the backend as usual. > > I use MinGW when doing Windows work (e.g. the threading piece in parallel pg_restore). And I think it is generally desirable to be able to build on Windows using an open source tool chain. I'd want a damn good reason to abandon its use. And I don't like the idea of not supporting walreceiver on it either. Please find another solution if possible. cheers andrew
2010/1/15 Andrew Dunstan <andrew@dunslane.net>: > > > Heikki Linnakangas wrote: >> >> Do people still use MinGW for any real work? Could we just drop >> walreceiver support from MinGW builds? >> >> Or maybe we should consider splitting walreceiver into two parts after >> all. Only the bare minimum that needs to access libpq would go into the >> shared object, and the rest would be linked with the backend as usual. >> >> > > I use MinGW when doing Windows work (e.g. the threading piece in parallel pg_restore). And I think it is generally desirable to be able to build on Windows using an open source tool chain. I'd want a damn good reason to abandon its use. And I don't like the idea of not supporting walreceiver on it either. Please find another solution if possible. > Yeah. FWIW, I don't use mingw to do any windows development, but definitely +1 on working hard to keep support for it if at all possible. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander wrote: > 2010/1/15 Andrew Dunstan <andrew@dunslane.net>: >> >> Heikki Linnakangas wrote: >>> Do people still use MinGW for any real work? Could we just drop >>> walreceiver support from MinGW builds? >>> >>> Or maybe we should consider splitting walreceiver into two parts after >>> all. Only the bare minimum that needs to access libpq would go into the >>> shared object, and the rest would be linked with the backend as usual. >>> >> I use MinGW when doing Windows work (e.g. the threading piece in parallel pg_restore). And I think it is generally desirable to be able to build on Windows using an open source tool chain. I'd want a damn good reason to abandon its use. And I don't like the idea of not supporting walreceiver on it either. Please find another solution if possible. > > Yeah. FWIW, I don't use mingw to do any windows development, but > definitely +1 on working hard to keep support for it if at all > possible. Ok. I'll look at splitting walreceiver code between the shared module and backend binary slightly differently. At first glance, it doesn't seem that hard after all, and will make the code more modular anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Magnus Hagander wrote: >> Yeah. FWIW, I don't use mingw do do any windows development, but >> definitely +1 on working hard to keep support for it if at all >> possible. > Ok. I'll look at splitting walreceiver code between the shared module > and backend binary slightly differently. At first glance, it doesn't > seem that hard after all, and will make the code more modular anyway. This is probably going in the wrong direction. There is no good reason why that module should be failing to link, and I don't think it's going to be "more modular" if you're forced to avoid any global variable references at all in some arbitrary portion of the code. I think it's a tools/build process problem and should be attacked that way. regards, tom lane
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100115 15:20]: > Ok. I'll look at splitting walreceiver code between the shared module > and backend binary slightly differently. At first glance, it doesn't > seem that hard after all, and will make the code more modular anyway. Maybe an insane question, but why can't postmaster just "exec" walreceiver? I mean, because of windows, we already have that code around, and then walreceiver could link directly to libpq and not have to worry at all about linking all of the postmaster backends to libpq... But I do understand that's a radical change... a. -- Aidan Van Dyk <aidan@highrise.ca> http://www.highrise.ca/ Create like a god, command like a king, work like a slave.
I wrote: > I think it's a tools/build process problem and should be attacked that > way. Specifically, I think you missed out $(BE_DLLLIBS) in SHLIB_LINK. We'll find out at the next mingw build... regards, tom lane
Aidan Van Dyk <aidan@highrise.ca> writes: > Maybe an insane question, but why can postmaster just not "exec" > walreceiver? It'd greatly complicate access to shared memory. regards, tom lane
Tom Lane wrote: > I wrote: >> I think it's a tools/build process problem and should be attacked that >> way. > > Specifically, I think you missed out $(BE_DLLLIBS) in SHLIB_LINK. > We'll find out at the next mingw build... Thanks. But what is BE_DLLLIBS? I can't find any description of it. I suspect the MinGW build will fail because of the missing PGDLLIMPORTs. Before we sprinkle all the global variables it touches with that, let me explain what I meant by dividing walreceiver code differently between dynamically loaded module and backend code. Right now I have to go to sleep, though, but I'll try to get back to it during the weekend. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> Specifically, I think you missed out $(BE_DLLLIBS) in SHLIB_LINK. >> We'll find out at the next mingw build... > Thanks. But what is BE_DLLLIBS? I can't find any description of it. It was the wrong theory anyway --- it already is included (in Makefile.shlib). But what it does is provide -lpostgres on platforms where that is needed, such as mingw. > I suspect the MinGW build will fail because of the missing PGDLLIMPORTs. Yeah. On closer investigation the problem seems to be -DBUILDING_DLL, which flips the meaning of PGDLLIMPORT. contrib/dblink, which surely works and has the same linkage requirements as walreceiver, does *not* use that. I've committed a patch to change that, we'll soon see if it works... > Before we sprinkle all the global variables it touches with that, let me > explain what I meant by dividing walreceiver code differently between > dynamically loaded module and backend code. Right now I have to go to > sleep, though, but I'll try to get back to during the weekend. Yeah, nothing to be done till we get another buildfarm cycle anyway. regards, tom lane
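For readers unfamiliar with the Windows linkage issue Tom is describing: PGDLLIMPORT is roughly the following macro (a simplified sketch, not the verbatim src/include/port/win32.h), which is why compiling a loadable module with -DBUILDING_DLL flips the meaning and makes the module try to export the backend's globals instead of importing them:

```c
/*
 * Simplified sketch of the PGDLLIMPORT mechanism.  On Windows, a global
 * variable in the backend executable must be explicitly exported there
 * and imported in loadable modules; BUILDING_DLL selects which side of
 * the fence the current compilation is on.  Elsewhere the macro is a
 * no-op, which is why the problem only shows up on MinGW/MSVC buildfarm
 * members.
 */
#if defined(_WIN32) || defined(__CYGWIN__)
#ifdef BUILDING_DLL
#define PGDLLIMPORT __declspec(dllexport)   /* building the backend itself */
#else
#define PGDLLIMPORT __declspec(dllimport)   /* building a loadable module */
#endif
#else
#define PGDLLIMPORT                         /* expands to nothing elsewhere */
#endif

/* In a shared header: a backend global that a module may reference. */
extern PGDLLIMPORT int example_backend_global;

/* The defining translation unit (in the backend) provides the storage. */
int example_backend_global = 42;
```

The variable name here is purely illustrative; the real fix under discussion is about which existing backend globals carry the marker.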
Tom Lane wrote: >> Before we sprinkle all the global variables it touches with that, let me >> explain what I meant by dividing walreceiver code differently between >> dynamically loaded module and backend code. Right now I have to go to >> sleep, though, but I'll try to get back to during the weekend. >> > > Yeah, nothing to be done till we get another buildfarm cycle anyway. > > > I ran an extra cycle. Still a bit of work to do: <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=dawn_bat&dt=2010-01-15%2023:04:54> cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I ran an extra cycle. Still a bit of work to do: > <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=dawn_bat&dt=2010-01-15%2023:04:54> Well, at least now we're down to the variables that haven't got PGDLLIMPORT, rather than wondering what's wrong with the build ... regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Before we sprinkle all the global variables it touches with that, let me
>> explain what I meant by dividing walreceiver code differently between
>> dynamically loaded module and backend code. Right now I have to go to
>> sleep, though, but I'll try to get back to it during the weekend.
>
> Yeah, nothing to be done till we get another buildfarm cycle anyway.

Ok, looks like you did that anyway, let's see if it fixed it. Thanks.

So what I'm playing with is to pull walreceiver back into the backend executable. To avoid the link dependency, walreceiver doesn't access libpq directly, but loads a module dynamically which implements this interface:

bool walrcv_connect(char *conninfo, XLogRecPtr startpoint)
    Establish connection to the primary, and start streaming from 'startpoint'. Returns true on success.

bool walrcv_receive(int timeout, XLogRecPtr *recptr, char **buffer, int *len)
    Retrieve any WAL record available through the connection, blocking for a maximum of 'timeout' ms.

void walrcv_disconnect(void)
    Disconnect.

This is the kind of API Greg Stark requested earlier (http://archives.postgresql.org/message-id/407d949e0912220336u595a05e0x20bd91b9fbc08d4d@mail.gmail.com), though I'm not planning to make it pluggable for 3rd party implementations yet.

The module doesn't need to touch backend internals much at all, no tinkering with shared memory for example, so I would feel much better about moving that out of src/backend. Not sure where, though; it's not an executable, so src/bin is hardly the right place, but I wouldn't want to put it in contrib either, because it should still be built and installed by default. So I'm inclined to still leave it in src/backend/replication/.

I've pushed that 'replication-dynmodule' branch in my git repo. The diff is hard to read, because it mostly just moves code around, but I've attached libpqwalreceiver.c here, which is the dynamic module part. 
You can also browse the tree via the web interface (http://git.postgresql.org/gitweb?p=users/heikki/postgres.git;a=tree;h=refs/heads/replication-dynmodule;hb=replication-dynmodule)

I like this division of labor much more than making the whole walreceiver process a dynamically loaded module, so barring objections I will review and test this more, and commit next week.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

/*-------------------------------------------------------------------------
 *
 * libpqwalreceiver.c
 *
 * The WAL receiver process (walreceiver) is new as of Postgres 8.5. It
 * is the process in the standby server that takes charge of receiving
 * XLOG records from a primary server during streaming replication.
 *
 * When the startup process determines that it's time to start streaming,
 * it instructs postmaster to start walreceiver. Walreceiver first
 * connects to the primary server (it will be served by a walsender process
 * in the primary server), and then keeps receiving XLOG records and
 * writing them to the disk as long as the connection is alive. As XLOG
 * records are received and flushed to disk, it updates the
 * WalRcv->receivedUpTo variable in shared memory, to inform the startup
 * process of how far it can proceed with XLOG replay.
 *
 * Normal termination is by SIGTERM, which instructs the walreceiver to
 * exit(0). Emergency termination is by SIGQUIT; like any postmaster child
 * process, the walreceiver will simply abort and exit on SIGQUIT. A close
 * of the connection and a FATAL error are treated not as a crash but as
 * normal operation.
 *
 * Walreceiver is a postmaster child process like others, but it's compiled
 * as a dynamic module to avoid linking libpq with the main server binary.
 *
 * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
 *
 *
 * IDENTIFICATION
 *    $PostgreSQL$
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include <unistd.h>

#include "libpq-fe.h"
#include "access/xlog.h"
#include "miscadmin.h"
#include "replication/walreceiver.h"
#include "utils/builtins.h"

#ifdef HAVE_POLL_H
#include <poll.h>
#endif
#ifdef HAVE_SYS_POLL_H
#include <sys/poll.h>
#endif
#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif

PG_MODULE_MAGIC;

void        _PG_init(void);

/* streamConn is a PGconn object of a connection to walsender from walreceiver */
static PGconn *streamConn = NULL;

static bool justconnected = false;

/* Buffer for currently read records */
static char *recvBuf = NULL;

/* Prototypes for interface functions */
static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
static bool libpqrcv_receive(int timeout, XLogRecPtr *recptr, char **buffer,
                             int *len);
static void libpqrcv_disconnect(void);

/* Prototypes for private functions */
static bool libpq_select(int timeout_ms);

/*
 * Module load callback
 */
void
_PG_init(void)
{
    walrcv_connect = libpqrcv_connect;
    walrcv_receive = libpqrcv_receive;
    walrcv_disconnect = libpqrcv_disconnect;
}

/*
 * Establish the connection to the primary server for XLOG streaming
 */
static bool
libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
{
    char        conninfo_repl[MAXCONNINFO + 14];
    char       *primary_sysid;
    char        standby_sysid[32];
    TimeLineID  primary_tli;
    TimeLineID  standby_tli;
    PGresult   *res;
    char        cmd[64];

    Assert(startpoint.xlogid != 0 || startpoint.xrecoff != 0);

    /*
     * Set up a connection for XLOG streaming
     */
    snprintf(conninfo_repl, sizeof(conninfo_repl), "%s replication=true",
             conninfo);

    streamConn = PQconnectdb(conninfo_repl);
    if (PQstatus(streamConn) != CONNECTION_OK)
        ereport(ERROR,
                (errmsg("could not connect to the primary server: %s",
                        PQerrorMessage(streamConn))));

    /*
     * Get the system identifier and timeline ID as a DataRow message
     * from the primary server.
     */
    res = PQexec(streamConn, "IDENTIFY_SYSTEM");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        PQclear(res);
        ereport(ERROR,
                (errmsg("could not receive the SYSID and timeline ID from "
                        "the primary server: %s",
                        PQerrorMessage(streamConn))));
    }
    if (PQnfields(res) != 2 || PQntuples(res) != 1)
    {
        int         ntuples = PQntuples(res);
        int         nfields = PQnfields(res);

        PQclear(res);
        ereport(ERROR,
                (errmsg("invalid response from primary server"),
                 errdetail("expected 1 tuple with 2 fields, got %d tuples with %d fields",
                           ntuples, nfields)));
    }
    primary_sysid = PQgetvalue(res, 0, 0);
    primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);

    /*
     * Confirm that the system identifier of the primary is the same as ours.
     */
    snprintf(standby_sysid, sizeof(standby_sysid), UINT64_FORMAT,
             GetSystemIdentifier());
    if (strcmp(primary_sysid, standby_sysid) != 0)
    {
        PQclear(res);
        ereport(ERROR,
                (errmsg("system differs between the primary and standby"),
                 errdetail("the primary SYSID is %s, standby SYSID is %s",
                           primary_sysid, standby_sysid)));
    }

    /*
     * Confirm that the current timeline of the primary is the same as the
     * recovery target timeline.
     */
    standby_tli = GetRecoveryTargetTLI();
    PQclear(res);
    if (primary_tli != standby_tli)
        ereport(ERROR,
                (errmsg("timeline %u of the primary does not match recovery target timeline %u",
                        primary_tli, standby_tli)));
    ThisTimeLineID = primary_tli;

    /* Start streaming from the point requested by startup process */
    snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
             startpoint.xlogid, startpoint.xrecoff);
    res = PQexec(streamConn, cmd);
    if (PQresultStatus(res) != PGRES_COPY_OUT)
        ereport(ERROR,
                (errmsg("could not start XLOG streaming: %s",
                        PQerrorMessage(streamConn))));
    PQclear(res);

    justconnected = true;

    return true;
}

/*
 * Wait until we can read WAL stream, or timeout.
 *
 * Returns true if data has become available for reading, false if timed out
 * or interrupted by signal.
 *
 * This is based on pqSocketCheck.
 */
static bool
libpq_select(int timeout_ms)
{
    int         ret;

    Assert(streamConn != NULL);
    if (PQsocket(streamConn) < 0)
        ereport(ERROR,
                (errcode_for_socket_access(),
                 errmsg("socket not open")));

    /* We use poll(2) if available, otherwise select(2) */
    {
#ifdef HAVE_POLL
        struct pollfd input_fd;

        input_fd.fd = PQsocket(streamConn);
        input_fd.events = POLLIN | POLLERR;
        input_fd.revents = 0;

        ret = poll(&input_fd, 1, timeout_ms);
#else                           /* !HAVE_POLL */
        fd_set      input_mask;
        struct timeval timeout;
        struct timeval *ptr_timeout;

        FD_ZERO(&input_mask);
        FD_SET(PQsocket(streamConn), &input_mask);

        if (timeout_ms < 0)
            ptr_timeout = NULL;
        else
        {
            timeout.tv_sec = timeout_ms / 1000;
            timeout.tv_usec = (timeout_ms % 1000) * 1000;
            ptr_timeout = &timeout;
        }

        ret = select(PQsocket(streamConn) + 1, &input_mask,
                     NULL, NULL, ptr_timeout);
#endif                          /* HAVE_POLL */
    }

    if (ret == 0 || (ret < 0 && errno == EINTR))
        return false;
    if (ret < 0)
        ereport(ERROR,
                (errcode_for_socket_access(),
                 errmsg("select() failed: %m")));
    return true;
}

/*
 * Disconnect from the primary server.
 */
static void
libpqrcv_disconnect(void)
{
    PQfinish(streamConn);
    justconnected = false;
}

/*
 * Receive any WAL records available from XLOG stream, blocking for
 * maximum of 'timeout' ms.
 *
 * Returns:
 *
 * True if data was received. *recptr, *buffer and *len are set to
 * the WAL location of the received data, buffer holding it, and length,
 * respectively.
 *
 * False if no data was available within timeout, or wait was interrupted
 * by signal.
 *
 * The buffer returned is only valid until the next call of this function or
 * libpq_connect/disconnect.
 *
 * ereports on error.
 */
static bool
libpqrcv_receive(int timeout, XLogRecPtr *recptr, char **buffer, int *len)
{
    int         rawlen;

    if (recvBuf != NULL)
        PQfreemem(recvBuf);
    recvBuf = NULL;

    /*
     * If the caller requested to block, wait for data to arrive. But if
     * this is the first call after connecting, don't wait, because
     * there might already be some data in libpq buffer that we haven't
     * returned to caller.
     */
    if (timeout > 0 && !justconnected)
    {
        if (!libpq_select(timeout))
            return false;

        if (PQconsumeInput(streamConn) == 0)
            ereport(ERROR,
                    (errmsg("could not read xlog records: %s",
                            PQerrorMessage(streamConn))));
    }
    justconnected = false;

    /* Receive CopyData message */
    rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
    if (rawlen == 0)            /* no records available yet, then return */
        return false;
    if (rawlen == -1)           /* end-of-streaming or error */
    {
        PGresult   *res;

        res = PQgetResult(streamConn);
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
        {
            PQclear(res);
            ereport(ERROR,
                    (errmsg("replication terminated by primary server")));
        }
        PQclear(res);
        ereport(ERROR,
                (errmsg("could not read xlog records: %s",
                        PQerrorMessage(streamConn))));
    }
    if (rawlen < -1)
        ereport(ERROR,
                (errmsg("could not read xlog records: %s",
                        PQerrorMessage(streamConn))));

    if (rawlen < sizeof(XLogRecPtr))
        ereport(ERROR,
                (errmsg("invalid WAL message received from primary")));

    /* Return received WAL records to caller */
    *recptr = *((XLogRecPtr *) recvBuf);
    *buffer = recvBuf + sizeof(XLogRecPtr);
    *len = rawlen - sizeof(XLogRecPtr);

    return true;
}
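The backend side of this arrangement reduces to calling through hook pointers that the module's _PG_init() fills in at load time. A toy, self-contained sketch of that indirection (hypothetical simplified signatures; the real backend would load the module with the dynamic loader rather than calling module_init() directly):

```c
#include <stddef.h>

/* Hook pointers the backend exposes; normally filled in by the loaded module */
typedef int (*walrcv_connect_fn) (const char *conninfo);
typedef void (*walrcv_disconnect_fn) (void);

static walrcv_connect_fn walrcv_connect = NULL;
static walrcv_disconnect_fn walrcv_disconnect = NULL;

/* --- what would live in the dynamically loaded module --- */
static int
module_connect(const char *conninfo)
{
    (void) conninfo;            /* a real implementation would dial out */
    return 1;
}

static void
module_disconnect(void)
{
}

/* The module's _PG_init() equivalent: register the implementations */
static void
module_init(void)
{
    walrcv_connect = module_connect;
    walrcv_disconnect = module_disconnect;
}

/* --- backend side: load the module, then call only through the hooks --- */
static int
start_streaming(const char *conninfo)
{
    module_init();              /* in reality: dynamically load the module */
    if (walrcv_connect == NULL || walrcv_disconnect == NULL)
        return 0;               /* module failed to initialize the hooks */
    return walrcv_connect(conninfo);
}
```

The appeal of this split is that the backend binary never references a libpq symbol; everything goes through the function pointers, so only the small module needs to link against libpq.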
Heikki Linnakangas wrote: > I've pushed that 'replication-dynmodule' branch in my git repo. The diff > is hard to read, because it mostly just moves code around, but I've > attached libpqwalreceiver.c here, which is the dynamic module part. You > can also browse the tree via the web interface > (http://git.postgresql.org/gitweb?p=users/heikki/postgres.git;a=tree;h=refs/heads/replication-dynmodule;hb=replication-dynmodule) I just noticed that the comment at the top of libpqwalreceiver.c is a leftover, not much relevant to the contents of the file anymore; all the signal handling and interaction with the startup process is in src/backend/replication/walreceiver.c now. That obviously needs to be fixed before committing. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > The module doesn't need to touch backend internals much at all, no > tinkering with shared memory for example, so I would feel much better > about moving that out of src/backend. Not sure where, though; it's not > an executable, so src/bin is hardly the right place, but I wouldn't want > to put it in contrib either, because it should still be built and > installed by default. So I'm inclined to still leave it in > src/backend/replication/ It should be possible to be in contrib and installed by default, even with the current tool set, by tweaking initdb to install the contrib into template1. But that would be a packaging / dependency issue I guess then. Of course the extension system would ideally "create extension foo;" for all foo in contrib at initdb time, then a user would have to "install extension foo;" and be done with it. Regards, -- dim
Dimitri Fontaine wrote: > It should be possible to be in contrib and installed by default, even > And it could be uninstalled too. Let's not do it for core functionalities. -- Euler Taveira de Oliveira http://www.timbira.com/