Thread: testing HS/SR - invalid magic number
This replication test, which was working well earlier (it ran daily), stopped working after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least recent).

This test copies a schema from an 8.4.3 instance to the 9.0 primary:

  pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \
    | psql -1qtA -h /tmp -p 7575 -d replicas

I tried it twice, with a complete reinstall from the cvs repository; the last one just now (created 20100414 0128), and once a few hours earlier.

-- logfile primary:

LOG: database system was shut down at 2010-04-14 02:15:01 CEST
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: received smart shutdown request
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
2010-04-14 02:15:14 CEST 14756 start=2010-04-14 02:15:14 CEST LOG: database system was shut down at 2010-04-14 02:15:12 CEST
2010-04-14 02:15:14 CEST 14752 start=2010-04-14 02:15:13 CEST LOG: database system is ready to accept connections
2010-04-14 02:18:23 CEST 15100 [unknown] rijkers start=2010-04-14 02:18:22 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=55248
[...]
2010-04-14 02:21:10 CEST 15100 [unknown] rijkers start=2010-04-14 02:18:22 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:11 CEST 15377 [unknown] rijkers start=2010-04-14 02:21:11 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50869
2010-04-14 02:21:11 CEST 15377 [unknown] rijkers start=2010-04-14 02:21:11 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:16 CEST 15384 [unknown] rijkers start=2010-04-14 02:21:16 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50870
2010-04-14 02:21:16 CEST 15384 [unknown] rijkers start=2010-04-14 02:21:16 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:21 CEST 15400 [unknown] rijkers start=2010-04-14 02:21:21 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50871
2010-04-14 02:21:22 CEST 15400 [unknown] rijkers start=2010-04-14 02:21:21 CEST FATAL: requested WAL segment 000000010000000000000032 has already been removed
2010-04-14 02:21:26 CEST 15406 [unknown] rijkers start=2010-04-14 02:21:26 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50872
etc.

-- logfile standby:

2010-04-14 02:18:22 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: database system was interrupted; last known up at 2010-04-14 02:17:54 CEST
2010-04-14 02:18:22 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: entering standby mode
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001': No such file or directory
2010-04-14 02:18:23 CEST 5601 start=2010-04-14 02:18:22 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:18:24 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: redo starts at 0/1000020
2010-04-14 02:18:24 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: consistent recovery state reached at 0/2000000
2010-04-14 02:18:24 CEST 5597 start=2010-04-14 02:18:22 CEST LOG: database system is ready to accept read only connections
2010-04-14 02:21:11 CEST 5601 start=2010-04-14 02:18:22 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in log file 0, segment 50, offset 13795328
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in log file 0, segment 50, offset 13795328
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:16 CEST 5790 start=2010-04-14 02:21:16 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:21:16 CEST 5790 start=2010-04-14 02:21:16 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:16 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in log file 0, segment 50, offset 13795328
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032': No such file or directory
2010-04-14 02:21:21 CEST 5795 start=2010-04-14 02:21:21 CEST LOG: streaming replication successfully connected to primary
2010-04-14 02:21:22 CEST 5795 start=2010-04-14 02:21:21 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
etc. ... that block of lines then repeats endlessly.
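The standby's log lines above imply a recovery.conf roughly along the following lines. Everything here is reconstructed, not taken from the actual file: the restore_command path comes from the "cp: cannot stat ..." messages, while the primary_conninfo host, port, and user are assumptions based on the replication connections in the primary log and the psql command above.

  standby_mode     = 'on'
  # host/port/user are assumptions: the replication connections come from 127.0.0.1
  # as user rijkers, and the 9.0 primary accepts connections on port 7575
  primary_conninfo = 'host=127.0.0.1 port=7575 user=rijkers'
  # path taken from the "cp: cannot stat ..." lines in the standby log
  restore_command  = 'cp /var/data1/pg_stuff/dump/hotslave/replication_archive/%f %p'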
On Wed, April 14, 2010 02:34, Erik Rijkers wrote:
> This replication test, which was working well earlier (it ran daily), stopped working
> after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least
> recent).
>
> This test copies a schema from an 8.4.3 instance to the 9.0 primary:
>
> pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \
>   | psql -1qtA -h /tmp -p 7575 -d replicas
>

The above indeed no longer works, but the logfiles I sent a little earlier were from a different pg_dump statement:

  pg_dump -c -h /tmp -p 5432 -t public.mytable --no-owner --no-privileges ms \
    | psql -1qtA -h /tmp -p 7575 -d replicas

... copying table by table, to see if that improved matters, which it didn't.
Erik Rijkers wrote:
> On Wed, April 14, 2010 02:34, Erik Rijkers wrote:
>> This replication test, which was working well earlier (it ran daily), stopped working
>> after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least
>> recent).
>>
>> This test copies a schema from an 8.4.3 instance to the 9.0 primary:
>>
>> pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \
>>   | psql -1qtA -h /tmp -p 7575 -d replicas
>>
>
> The above indeed no longer works, but the logfiles I sent a little earlier were from
> a different pg_dump statement:
>
> pg_dump -c -h /tmp -p 5432 -t public.mytable --no-owner --no-privileges ms \
>   | psql -1qtA -h /tmp -p 7575 -d replicas
>
> ... copying table by table, to see if that improved matters, which it didn't.

Streaming replication is completely oblivious to what's in the WAL it streams; you would probably get the same result with a bulk insert or update.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
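Since the WAL contents don't matter, the problem should be reproducible without pg_dump at all: any statement that writes WAL faster than the standby can fetch it will do. A minimal sketch, run against the primary; the table name, row count, and filler width are made up purely to generate a large burst of WAL:

  psql -h /tmp -p 7575 -d replicas -c "
    create table wal_burst as
      select g as id, repeat('x', 500) as filler
      from generate_series(1, 2000000) g;"

A load like this churns through many 16 MB WAL segments in a short time; if too few old segments are kept and the archive isn't reachable, the primary recycles segments the standby still needs and the same "has already been removed" loop appears.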
Erik Rijkers wrote:
> This replication test, which was working well earlier (it ran daily), stopped working
> after a reinstall of new instances of cvs HEAD. I think the change must have been today (or at least
> recent).
> ...
> -- logfile standby:
> ...
> 2010-04-14 02:21:11 CEST 5601 start=2010-04-14 02:18:22 CEST FATAL: could not receive data from
> WAL stream: FATAL: requested WAL segment 000000010000000000000032 has already been removed
>
> cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032':
> No such file or directory
> 2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number 0000 in
> log file 0, segment 50, offset 13795328
> cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000032':
> No such file or directory
> 2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST LOG: streaming replication
> successfully connected to primary

This is probably because of this change:

> date: 2010/04/12 09:52:29;  author: heikki;  state: Exp;  lines: +71 -23
> Change the logic to decide when to delete old WAL segments, so that it
> doesn't take into account how far the WAL senders are. This way a hung
> WAL sender doesn't prevent old WAL segments from being recycled/removed
> in the primary, ultimately causing the disk to fill up. Instead add
> standby_keep_segments setting to control how many old WAL segments are
> kept in the primary. This also makes it more reliable to use streaming
> replication without WAL archiving, assuming that you set
> standby_keep_segments high enough.

If you generate enough WAL records in the master that the standby can't keep up, the primary will eventually delete a WAL segment that hasn't been streamed to the standby yet, hence the "requested WAL segment 000000010000000000000032 has already been removed" error.

It shouldn't remove the segment before it's successfully archived, though, and your logs show that the standby can't find that file in the archive either. Is archiving set up properly?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
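For reference, the primary-side settings involved after that commit look something like the sketch below. The values are illustrative, not taken from Erik's configuration; the archive directory is borrowed from the standby log above and only helps if the primary can actually write there (a shared mount, or a command that copies to the standby host).

  # postgresql.conf on the primary (9.0 / cvs HEAD of April 2010)
  wal_level             = hot_standby
  max_wal_senders       = 1
  # old 16 MB segments kept for standbys that fall behind; beyond this,
  # lagging standbys must fall back to the archive
  standby_keep_segments = 32
  # the archive is the safety net: a segment is not recycled until
  # archive_command has succeeded for it
  archive_mode          = on
  archive_command       = 'cp %p /var/data1/pg_stuff/dump/hotslave/replication_archive/%f'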
On Wed, April 14, 2010 08:23, Heikki Linnakangas wrote:
> Erik Rijkers wrote:
>> This replication test, which was working well earlier (it ran daily), stopped working
>
> This is probably because of this change:
>
>> date: 2010/04/12 09:52:29;  author: heikki;  state: Exp;  lines: +71 -23
>> Change the logic to decide when to delete old WAL segments, so that it
>> doesn't take into account how far the WAL senders are. This way a hung
>> WAL sender doesn't prevent old WAL segments from being recycled/removed
>> in the primary, ultimately causing the disk to fill up. Instead add
>> standby_keep_segments setting to control how many old WAL segments are
>> kept in the primary. This also makes it more reliable to use streaming
>> replication without WAL archiving, assuming that you set
>> standby_keep_segments high enough.
>
> If you generate enough WAL records in the master that the standby can't
> keep up, the primary will eventually delete a WAL segment that hasn't
> been streamed to the standby yet, hence the "requested WAL segment
> 000000010000000000000032 has already been removed" error.

Ah, I hadn't seen that change; setting standby_keep_segments to some higher value (I chose 10000 for now) prevents these errors.

> It shouldn't
> remove the segment before it's successfully archived, though, and your
> logs show that the standby can't find that file in the archive either.
> Is archiving set up properly?

It wasn't, really - I was archiving on the primary server, instead of copying to the standby server. I'll fix that, thanks.

Btw, typo alert: documentation 18.5.4, subhead 'standby_keep_segments', says: 'replciation' for 'replication'.

http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION

thanks,

Erik Rijkers
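A sketch of the corrected setup described above, with completed segments pushed to the standby host instead of archived locally on the primary. 'standbyhost' and the use of rsync are placeholders; the target directory is the one the standby's restore_command already reads:

  # postgresql.conf on the primary
  standby_keep_segments = 10000   # ~160 GB of retained WAL in the worst case
  archive_mode          = on
  # copy each finished segment into the standby's archive directory; if this command
  # exits non-zero the primary retries and keeps the segment, so nothing is lost
  archive_command       = 'rsync -a %p standbyhost:/var/data1/pg_stuff/dump/hotslave/replication_archive/%f'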
Erik Rijkers wrote:
> Btw, typo alert: documentation 18.5.4, subhead 'standby_keep_segments',
> says: 'replciation' for 'replication'.
>
> http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION

Thanks, fixed.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com