Thread: testing HS/SR - invalid magic number
This replication test, which was working well earlier (it ran daily), stopped working after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least recent).

This test copies a schema from an 8.4.3 instance to the 9.0 primary:

  pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \
    | psql -1qtA -h /tmp -p 7575 -d replicas

I tried it twice, with a complete reinstall from the cvs repository; the last one just now (created 20100414 0128), and once a few hours earlier.

-- logfile primary:

LOG: database system was shut down at 2010-04-14 02:15:01 CEST
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: received smart shutdown request
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
2010-04-14 02:15:14 CEST 14756 start=2010-04-14 02:15:14 CEST LOG: database system was shut down at 2010-04-14 02:15:12 CEST
2010-04-14 02:15:14 CEST 14752 start=2010-04-14 02:15:13 CEST LOG: database system is ready to accept connections
2010-04-14 02:18:23 CEST 15100 [unknown] rijkers start=2010-04-14 02:18:22 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=55248
[...]
2010-04-14 02:21:10 CEST 15100 [unknown] rijkers start=2010-04-14 02:18:22 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:11 CEST 15377 [unknown] rijkers start=2010-04-14 02:21:11 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50869
2010-04-14 02:21:11 CEST 15377 [unknown] rijkers start=2010-04-14 02:21:11 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:16 CEST 15384 [unknown] rijkers start=2010-04-14 02:21:16 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50870
2010-04-14 02:21:16 CEST 15384 [unknown] rijkers start=2010-04-14 02:21:16 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:21 CEST 15400 [unknown] rijkers start=2010-04-14 02:21:21 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50871
2010-04-14 02:21:22 CEST 15400 [unknown] rijkers start=2010-04-14 02:21:21 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:26 CEST 15406 [unknown] rijkers start=2010-04-14 02:21:26 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50872
etc.

-- logfile standby:

2010-04-14 02:18:22 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: database system was interrupted; last known up at 2010-04-14 02:17:54 CEST
2010-04-14 02:18:22 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: entering standby mode
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001': No such file or directory
2010-04-14 02:18:23 CEST 5601 start=2010-04-14 02:18:22 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:18:24 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: redo starts at 0/1000020
2010-04-14 02:18:24 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: consistent recovery state reached at 0/2000000
2010-04-14 02:18:24 CEST 5597 start=2010-04-14 02:18:22 CEST LOG: database system is ready to accept read only connections
2010-04-14 02:21:11 CEST 5601 start=2010-04-14 02:18:22 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in log file 0, segment 50, offset 13795328
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in log file 0, segment 50, offset 13795328
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:16 CEST 5790 start=2010-04-14 02:21:16 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:21:16 CEST 5790 start=2010-04-14 02:21:16 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:16 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in log file 0, segment 50, offset 13795328
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:21 CEST 5795 start=2010-04-14 02:21:21 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:21:22 CEST 5795 start=2010-04-14 02:21:21 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
etc. ... that block of lines then repeats endlessly.
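The standby's log lines above imply a recovery.conf roughly along the following lines. Everything here is reconstructed, not taken from the actual file: the restore_command path comes from the "cp: cannot stat ..." messages, while the primary_conninfo host, port, and user are assumptions based on the replication connections in the primary log and the psql command above.

  standby_mode     = 'on'
  # host/port/user are assumptions: the replication connections come from 127.0.0.1
  # as user rijkers, and the 9.0 primary accepts connections on port 7575
  primary_conninfo = 'host=127.0.0.1 port=7575 user=rijkers'
  # path taken from the "cp: cannot stat ..." lines in the standby log
  restore_command  = 'cp /var/data1/pg_stuff/dump/hotslave/replication_archive/%f %p'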
On Wed, April 14, 2010 02:34, Erik Rijkers wrote:
> This replication test, which was working well earlier (it ran daily), stopped working
> after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least
> recent).
>
> This test copies a schema from an 8.4.3 instance to the 9.0 primary:
>
> pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \
>   | psql -1qtA -h /tmp -p 7575 -d replicas
>

The above indeed no longer works, but the logfiles I sent a little earlier were from a different pg_dump statement:

  pg_dump -c -h /tmp -p 5432 -t public.mytable --no-owner --no-privileges ms \
    | psql -1qtA -h /tmp -p 7575 -d replicas

... copying table by table, to see if that improved matters, which it didn't.
Erik Rijkers wrote:
> On Wed, April 14, 2010 02:34, Erik Rijkers wrote:
>> This replication test, which was working well earlier (it ran daily), stopped working
>> after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least
>> recent).
>>
>> This test copies a schema from an 8.4.3 instance to the 9.0 primary:
>>
>> pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \
>>   | psql -1qtA -h /tmp -p 7575 -d replicas
>>
>
> The above indeed no longer works, but the logfiles I sent a little earlier were from
> a different pg_dump statement:
>
> pg_dump -c -h /tmp -p 5432 -t public.mytable --no-owner --no-privileges ms \
>   | psql -1qtA -h /tmp -p 7575 -d replicas
>
> ... copying table by table, to see if that improved matters, which it didn't.

Streaming replication is completely oblivious to what's in the WAL it streams; you would probably get the same result with a bulk insert or update.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
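Since the WAL contents don't matter, the problem should be reproducible without pg_dump at all: any statement that writes WAL faster than the standby can fetch it will do. A minimal sketch, run against the primary; the table name, row count, and filler width are made up purely to generate a large burst of WAL:

  psql -h /tmp -p 7575 -d replicas -c "
    create table wal_burst as
      select g as id, repeat('x', 500) as filler
      from generate_series(1, 2000000) g;"

A load like this churns through many 16 MB WAL segments in a short time; if too few old segments are kept and the archive isn't reachable, the primary recycles segments the standby still needs and the same "has already been removed" loop appears.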
Erik Rijkers wrote:
> This replication test, which was working well earlier (it ran daily), stopped working
> after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least
> recent).
> ...
> -- logfile standby:
> ...
> 2010-04-14 02:21:11 CEST 5601 start=2010-04-14 02:18:22 CEST FATAL: could not receive data from
> WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
>
> cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032':
> No such file or directory
> 2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in
> log file 0, segment 50, offset 13795328
> cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032':
> No such file or directory
> 2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST LOG: streaming replication
> successfully connected to primary

This is probably because of this change:

> date: 2010/04/12 09:52:29;  author: heikki;  state: Exp;  lines: +71 -23
> Change the logic to decide when to delete old WAL segments, so that it
> doesn't take into account how far the WAL senders are. This way a hung
> WAL sender doesn't prevent old WAL segments from being recycled/removed
> in the primary, ultimately causing the disk to fill up. Instead add
> standby_keep_segments setting to control how many old WAL segments are
> kept in the primary. This also makes it more reliable to use streaming
> replication without WAL archiving, assuming that you set
> standby_keep_segments high enough.

If you generate enough WAL records in the master that the standby can't keep up, the primary will eventually delete a WAL segment that hasn't been streamed to the standby yet, hence the "requested WAL segment 000000010000000000000032 has already been removed" error.

It shouldn't remove the segment before it's successfully archived, though, and your logs show that the standby can't find that file in the archive either. Is archiving set up properly?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
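For reference, the primary-side settings involved after that commit look something like the sketch below. The values are illustrative, not taken from Erik's configuration; the archive directory is borrowed from the standby log above and only helps if the primary can actually write there (a shared mount, or a command that copies to the standby host).

  # postgresql.conf on the primary (9.0 / cvs HEAD of April 2010)
  wal_level             = hot_standby
  max_wal_senders       = 1
  # old 16 MB segments kept for standbys that fall behind; beyond this,
  # lagging standbys must fall back to the archive
  standby_keep_segments = 32
  # the archive is the safety net: a segment is not recycled until
  # archive_command has succeeded for it
  archive_mode          = on
  archive_command       = 'cp %p /var/data1/pg_stuff/dump/hotslave/replication_archive/%f'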
On Wed, April 14, 2010 08:23, Heikki Linnakangas wrote:
> Erik Rijkers wrote:
>> This replication test, which was working well earlier (it ran daily), stopped working
>
> This is probably because of this change:
>
>> date: 2010/04/12 09:52:29;  author: heikki;  state: Exp;  lines: +71 -23
>> Change the logic to decide when to delete old WAL segments, so that it
>> doesn't take into account how far the WAL senders are. This way a hung
>> WAL sender doesn't prevent old WAL segments from being recycled/removed
>> in the primary, ultimately causing the disk to fill up. Instead add
>> standby_keep_segments setting to control how many old WAL segments are
>> kept in the primary. This also makes it more reliable to use streaming
>> replication without WAL archiving, assuming that you set
>> standby_keep_segments high enough.
>
> If you generate enough WAL records in the master that the standby can't
> keep up, the primary will eventually delete a WAL segment that hasn't
> been streamed to the standby yet, hence the "requested WAL segment
> 000000010000000000000032 has already been removed" error.

Ah, I hadn't seen that change; setting standby_keep_segments to some higher value (I chose 10000 for now) prevents these errors.

> It shouldn't
> remove the segment before it's successfully archived, though, and your
> logs show that the standby can't find that file in the archive either.
> Is archiving set up properly?

It wasn't, really - I was archiving on the primary server, instead of copying to the standby server. I'll fix that, thanks.

Btw, typo alert: documentation 18.5.4, subhead 'standby_keep_segments', says: 'replciation' for 'replication'.

http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION

thanks,

Erik Rijkers
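A sketch of the corrected setup described above, with completed segments pushed to the standby host instead of archived locally on the primary. 'standbyhost' and the use of rsync are placeholders; the target directory is the one the standby's restore_command already reads:

  # postgresql.conf on the primary
  standby_keep_segments = 10000   # ~160 GB of retained WAL in the worst case
  archive_mode          = on
  # copy each finished segment into the standby's archive directory; if this command
  # exits non-zero the primary retries and keeps the segment, so nothing is lost
  archive_command       = 'rsync -a %p standbyhost:/var/data1/pg_stuff/dump/hotslave/replication_archive/%f'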
Erik Rijkers wrote:
> Btw, typo alert: documentation 18.5.4, subhead 'standby_keep_segments',
> says: 'replciation' for 'replication'.
>
> http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION

Thanks, fixed.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com