Thread: Trouble with replication
I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:
2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed
",,,,,,,,,""
Checking the master, I see that file has in fact been removed from the pg_xlog directory. The master has archive_command setup to ship the wal files to the slave, and the slave is setup with a recovery_command to read them from that directory. In fact, that WAL segment exists in the slave’s pg_xlog directory as well.
Now, from what I can tell, the master archived this wal file out of its xlog directory (based on the keep wal segments setting). Then, why did the slave not pick it up from the directory that it was archived to? It is my understanding that the log shipping via archive_command from master to slave is precisely there to prevent this scenario. What am I doing wrong? Below are some of the pertinent settings.
On Master:
wal_level = hot_standby
archive_mode = on
archive_command = 'rsync -zaq %p postgres@pg-slave:/var/lib/pgsql/wal_restore/%f && test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp %p /var/lib/pgsql/backup/wal_archive/'
archive_timeout = 300
max_wal_senders = 5
wal_keep_segments = 0 # not sure why I’ve set it to this?
On Slave:
wal_level = hot_standby
archive_mode = on
archive_command = 'test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp -i %p /var/lib/pgsql/backup/wal_archive/%f < /dev/null'
hot_standby = on
wal_keep_segments = 1
On slave – recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=pg-master port=5432 user=replicator'
restore_command = 'cp /var/lib/psql/wal_restore/%f %p'
archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_restore/ %r'
I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:
2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed
",,,,,,,,,""
I suspect you'll need to set the parameter that controls how many WAL log files are kept on the server high enough to cover this replication catchup period.
geographically separated high latency connections are very problematic for any sort of replication.
-- john r pierce 37N 122W somewhere on the middle of the left coast
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Trouble with replication
I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:
2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed
",,,,,,,,,""
I suspect you'll need to set the parameter that controls how many WAL log files are kept on the server high enough to cover this replication catchup period.
geographically separated high latency connections are very problematic for any sort of replication.
On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?
On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?
http://www.postgresql.org/docs/9.2/static/runtime-config-replication.html
Michael
From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Wednesday, June 05, 2013 9:43 PM
To: David Greco
Cc: John R Pierce; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Trouble with replication
On Thu, Jun 6, 2013 at 7:23 AM, David Greco <David_Greco@harte-hanks.com> wrote:
On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?
In your case you need to transfer the WAL files using streaming replication, so you need to set wal_keep_segments to a value high enough on master such as the slave can can up. For reference:
http://www.postgresql.org/docs/9.2/static/runtime-config-replication.html
--
Michael
Then what is the purpose to shipping the archived WAL files to the slave? i.e. if wal_keep_segments has to be high enough to cover any replication lag anyways, then should I even bother shipping them over?
From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Wednesday, June 05, 2013 9:43 PM
To: David Greco
Cc: John R Pierce; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Trouble with replication
On Thu, Jun 6, 2013 at 7:23 AM, David Greco <David_Greco@harte-hanks.com> wrote:
On the master or on the slave, or on both? I thought shipping the archived WAL files from the master to the slave did this already?
In your case you need to transfer the WAL files using streaming replication, so you need to set wal_keep_segments to a value high enough on master such as the slave can can up. For reference:
http://www.postgresql.org/docs/9.2/static/runtime-config-replication.html
--
Michael
That document mentions “(However, the standby server can recover by fetching the segment from archive, if WAL archiving is in use.)”. That appears to be my problem, that is not actually occurring in my case.
I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:
2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000022 has already been removed
",,,,,,,,,""
Checking the master, I see that file has in fact been removed from the pg_xlog directory. The master has archive_command setup to ship the wal files to the slave, and the slave is setup with a recovery_command to read them from that directory.
In fact, that WAL segment exists in the slave’s pg_xlog directory as well.
Now, from what I can tell, the master archived this wal file out of its xlog directory (based on the keep wal segments setting). Then, why did the slave not pick it up from the directory that it was archived to? It is my understanding that the log shipping via archive_command from master to slave is precisely there to prevent this scenario. What am I doing wrong? Below are some of the pertinent settings.
Then what is the purpose to shipping the archived WAL files to the slave? i.e. if wal_keep_segments has to be high enough to cover any replication lag anyways, then should I even bother shipping them over?
More details at paragraph "Standby Server Operation" here:
http://www.postgresql.org/docs/9.2/static/warm-standby.html
Michael
From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Thursday, June 06, 2013 7:01 PM
To: David Greco
Cc: John R Pierce; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Trouble with replication
On Thu, Jun 6, 2013 at 9:19 PM, David Greco <David_Greco@harte-hanks.com> wrote:
Then what is the purpose to shipping the archived WAL files to the slave? i.e. if wal_keep_segments has to be high enough to cover any replication lag anyways, then should I even bother shipping them over?
Oh. I just noticed that you set up restore_command on slave, so if streaming replication failed due to a WAL file already removed on master, slave process will try to fetch missing WAL files from the archive.
Could you provide more logs of slave? Are you sure that the missing WAL file was not fetched from the archive after failing to get it through streaming replication?
The errors continued ad infinitum on the slave. I’ve since redone the replication setup with keep WAL segments set on the master to a rather large number, enough to nearly fill the drive dedicated to XLOG. Replication appears to be working properly now. Best I can figure, it had something to do with the pg_restore used to populate the master? This would be a large 30GB transaction.
I appreciate that this is a few months old however we had apparently identical symptoms but on a Windows platform. In the end we discovered that the reason the WAL archives that existed on the slave were not being restored was because the Windows account postgres was running as (Network Service) did not have permissions to access the WAL Archive directory. Once we gave it permissions we started seeing restoration messages in the log: "2013-09-12 13:32:42 BST LOG: restored log file "0000000100000090000000FD" from archive" Additionally, when we gave Network Service write permissions to the WAL Archive directory, our pg_archivecleanup command also started succeeding after apparently silently failing (I suspect stderr wasn't being written to the log but I'm not certain). As I say, I know that I'm referring to Windows here however this thread was the only one we found describing "WAL archives shipped but not restored by slave" so I've posted here in case permissions was also your issue David. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Trouble-with-replication-tp5758062p5770583.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.