Thread: Trouble with replication

Trouble with replication

From

David Greco

Date:

05 June 2013, 20:40:12

I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:

2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed

",,,,,,,,,""

Checking the master, I see that file has in fact been removed from the pg_xlog directory. The master has archive_command setup to ship the wal files to the slave, and the slave is setup with a recovery_command to read them from that directory. In fact, that WAL segment exists in the slave’s pg_xlog directory as well.

Now, from what I can tell, the master archived this wal file out of its xlog directory (based on the keep wal segments setting). Then, why did the slave not pick it up from the directory that it was archived to? It is my understanding that the log shipping via archive_command from master to slave is precisely there to prevent this scenario. What am I doing wrong? Below are some of the pertinent settings.

On Master:

wal_level = hot_standby

archive_mode = on

archive_command = 'rsync -zaq %p postgres@pg-slave:/var/lib/pgsql/wal_restore/%f && test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp %p /var/lib/pgsql/backup/wal_archive/'

archive_timeout = 300

max_wal_senders = 5

wal_keep_segments = 0 # not sure why I’ve set it to this?

On Slave:

wal_level = hot_standby

archive_mode = on

archive_command = 'test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp -i %p /var/lib/pgsql/backup/wal_archive/%f < /dev/null'

hot_standby = on

wal_keep_segments = 1

On slave – recovery.conf:

standby_mode = 'on'

primary_conninfo = 'host=pg-master port=5432 user=replicator'

restore_command = 'cp /var/lib/psql/wal_restore/%f %p'

archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_restore/ %r'

Re: Trouble with replication

From

John R Pierce

Date:

05 June 2013, 21:00:35

On 6/5/2013 1:39 PM, David Greco wrote:

I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:

2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed
",,,,,,,,,""

I suspect you'll need to set the parameter that controls how many WAL log files are kept on the server high enough to cover this replication catchup period.

geographically separated high latency connections are very problematic for any sort of replication.

-- 
john r pierce                                      37N 122W
somewhere on the middle of the left coast

Re: Trouble with replication

From

David Greco

Date:

05 June 2013, 22:24:06

From: pgsql-general-owner@postgresql.org [pgsql-general-owner@postgresql.org] on behalf of John R Pierce [pierce@hogranch.com]

Sent: Wednesday, June 05, 2013 5:00 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Trouble with replication

On 6/5/2013 1:39 PM, David Greco wrote:

I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:

2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed
",,,,,,,,,""

On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?

Re: Trouble with replication

From

Michael Paquier

Date:

06 June 2013, 01:42:41

On Thu, Jun 6, 2013 at 7:23 AM, David Greco <David_Greco@harte-hanks.com> wrote:

On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?

In your case you need to transfer the WAL files using streaming replication, so you need to set wal_keep_segments to a value high enough on master such as the slave can can up. For reference:
http://www.postgresql.org/docs/9.2/static/runtime-config-replication.html

--
Michael

Re: Trouble with replication

From

David Greco

Date:

06 June 2013, 12:20:12

From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Wednesday, June 05, 2013 9:43 PM
To: David Greco
Cc: John R Pierce; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Trouble with replication

On Thu, Jun 6, 2013 at 7:23 AM, David Greco <David_Greco@harte-hanks.com> wrote:

On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?

--
Michael

Then what is the purpose to shipping the archived WAL files to the slave? i.e. if wal_keep_segments has to be high enough to cover any replication lag anyways, then should I even bother shipping them over?

Re: Trouble with replication

From

David Greco

Date:

06 June 2013, 12:22:31

On Thu, Jun 6, 2013 at 7:23 AM, David Greco <David_Greco@harte-hanks.com> wrote:

On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?

--
Michael

That document mentions “(However, the standby server can recover by fetching the segment from archive, if WAL archiving is in use.)”. That appears to be my problem, that is not actually occurring in my case.

Re: Trouble with replication

From

Jeff Janes

Date:

06 June 2013, 16:52:03

On Wed, Jun 5, 2013 at 1:39 PM, David Greco <David_Greco@harte-hanks.com> wrote:

I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:

2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed
",,,,,,,,,""

What are the messages before and after this?

Checking the master, I see that file has in fact been removed from the pg_xlog directory. The master has archive_command setup to ship the wal files to the slave, and the slave is setup with a recovery_command to read them from that directory.

Are you sure that these are set up correctly? What happens if you comment out primary_conninfo, so that the archive directory is the only way to deliver the files?

In fact, that WAL segment exists in the slave’s pg_xlog directory as well.

But is the existing file identical to the one the master (and the one in the archivedir)? It is probably a recycled file that has not yet been overwritten with received contents. That is, it has the contents of some past log file, but the name of some future one.

Now, from what I can tell, the master archived this wal file out of its xlog directory (based on the keep wal segments setting). Then, why did the slave not pick it up from the directory that it was archived to? It is my understanding that the log shipping via archive_command from master to slave is precisely there to prevent this scenario. What am I doing wrong? Below are some of the pertinent settings.

In my hands, this is what happens. After losing contact with the primary, it starts pulling files from the archive until it runs out of those, then tries to reconnect to the primary.

Cheers,

Jeff

Re: Trouble with replication

From

Michael Paquier

Date:

06 June 2013, 23:00:44

On Thu, Jun 6, 2013 at 9:19 PM, David Greco <David_Greco@harte-hanks.com> wrote:

Then what is the purpose to shipping the archived WAL files to the slave? i.e. if wal_keep_segments has to be high enough to cover any replication lag anyways, then should I even bother shipping them over?

Oh. I just noticed that you set up restore_command on slave, so if streaming replication failed due to a WAL file already removed on master, slave process will try to fetch missing WAL files from the archive.

Could you provide more logs of slave? Are you sure that the missing WAL file was not fetched from the archive after failing to get it through streaming replication?
More details at paragraph "Standby Server Operation" here:
http://www.postgresql.org/docs/9.2/static/warm-standby.html

--
Michael

Re: Trouble with replication

From

David Greco

Date:

10 June 2013, 13:30:13

From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Thursday, June 06, 2013 7:01 PM
To: David Greco
Cc: John R Pierce; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Trouble with replication

On Thu, Jun 6, 2013 at 9:19 PM, David Greco <David_Greco@harte-hanks.com> wrote:

Could you provide more logs of slave? Are you sure that the missing WAL file was not fetched from the archive after failing to get it through streaming replication?

The errors continued ad infinitum on the slave. I’ve since redone the replication setup with keep WAL segments set on the master to a rather large number, enough to nearly fill the drive dedicated to XLOG. Replication appears to be working properly now. Best I can figure, it had something to do with the pg_restore used to populate the master? This would be a large 30GB transaction.

Re: Trouble with replication

From

againstdemons84

Date:

12 September 2013, 13:11:22

I appreciate that this is a few months old however we had apparently
identical symptoms but on a Windows platform.

In the end we discovered that the reason the WAL archives that existed on
the slave were not being restored was because the Windows account postgres
was running as (Network Service) did not have permissions to access the WAL
Archive directory. Once we gave it permissions we started seeing restoration
messages in the log:

"2013-09-12 13:32:42 BST LOG:  restored log file "0000000100000090000000FD"
from archive"

Additionally, when we gave Network Service write permissions to the WAL
Archive directory, our pg_archivecleanup command also started succeeding
after apparently silently failing (I suspect stderr wasn't being written to
the log but I'm not certain).

As I say, I know that I'm referring to Windows here however this thread was
the only one we found describing "WAL archives shipped but not restored by
slave" so I've posted here in case permissions was also your issue David.





--
View this message in context: http://postgresql.1045698.n5.nabble.com/Trouble-with-replication-tp5758062p5770583.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.