pg_standby stuck on a wal file size <16MB - Mailing list pgsql-general
From: Vladimir Kosilov
Subject: pg_standby stuck on a wal file size <16MB
Msg-id: 482F2841.7040400@gmail.com
Responses: Re: pg_standby stuck on a wal file size <16MB
List: pgsql-general
I've been having a problem with a pgsql-8.2.5 master/slave warm standby replication setup where occasionally the master node generates a WAL file smaller than the expected 16MB. pg_standby on the slave gets stuck on such short files, and replication halts from that moment on. We have to do pg_start_backup / rsync of the data to the slave / pg_stop_backup and restart the slave in order to recover. The database I'm replicating is write-mostly.

This is the process on the slave that is waiting on the log file which is smaller in size:

postgres 14277 11857 0 03:34 ? 00:00:00 sh -c /usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t /usr/local/recovery_control/quit_recovery /usr/local/wal_archive 000000010000034000000020 pg_xlog/RECOVERYXLOG 2>> /usr/local/pgsql/data/standby.log
postgres 14278 14277 0 03:34 ? 00:00:00 /usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t /usr/local/recovery_control/quit_recovery /usr/local/wal_archive 000000010000034000000020 pg_xlog/RECOVERYXLOG

Here is a sample listing of the wal_archive directory on the slave; note that the size of 000000010000034000000020 is less than the expected 16MB:

...
-rw------- 1 postgres postgres 16777216 May 17 03:19 000000010000034000000017
-rw------- 1 postgres postgres 16777216 May 17 03:21 000000010000034000000018
-rw------- 1 postgres postgres 16777216 May 17 03:22 000000010000034000000019
-rw------- 1 postgres postgres 16777216 May 17 03:24 00000001000003400000001A
-rw------- 1 postgres postgres 16777216 May 17 03:26 00000001000003400000001B
-rw------- 1 postgres postgres 16777216 May 17 03:27 00000001000003400000001C
-rw------- 1 postgres postgres 16777216 May 17 03:29 00000001000003400000001D
-rw------- 1 postgres postgres 16777216 May 17 03:30 00000001000003400000001E
-rw------- 1 postgres postgres 16777216 May 17 03:32 00000001000003400000001F
-rw------- 1 postgres postgres 13746176 May 17 03:34 000000010000034000000020
-rw------- 1 postgres postgres 16777216 May 17 03:35 000000010000034000000021
-rw------- 1 postgres postgres 16777216 May 17 03:37 000000010000034000000022
-rw------- 1 postgres postgres 16777216 May 17 03:38 000000010000034000000023
-rw------- 1 postgres postgres 16777216 May 17 03:40 000000010000034000000024
-rw------- 1 postgres postgres 16777216 May 17 03:41 000000010000034000000025
-rw------- 1 postgres postgres 16777216 May 17 03:43 000000010000034000000026
-rw------- 1 postgres postgres 16777216 May 17 03:45 000000010000034000000027
...

Skipping a bit further, I see there is at least one other instance where a WAL file is shorter than the normal 16MB:

-rw------- 1 postgres postgres 16777216 May 17 05:42 00000001000003400000006F
-rw------- 1 postgres postgres 16777216 May 17 05:44 000000010000034000000070
-rw------- 1 postgres postgres 16777216 May 17 05:46 000000010000034000000071
-rw------- 1 postgres postgres 16777216 May 17 05:47 000000010000034000000072
-rw------- 1 postgres postgres    16384 May 17 05:50 000000010000034000000073
-rw------- 1 postgres postgres 16777216 May 17 05:51 000000010000034000000074
-rw------- 1 postgres postgres 16777216 May 17 05:52 000000010000034000000075
-rw------- 1 postgres postgres 16777216 May 17 05:54 000000010000034000000076

Why would the master node create a WAL file smaller than the normal segment size, and how can this be avoided? I need a reliable replication mechanism, even at the cost of a longer recovery on the standby.

master's postgresql.conf:

..
fsync = on
wal_sync_method = open_sync
wal_buffers = 128
checkpoint_segments = 64
archive_command = 'test ! -f /usr/local/wal_archive_local/%f && cp %p /usr/local/wal_archive_local/%f'
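(Just to illustrate the idea, not what I'm actually running: a variant of the archive_command above that copies to a temporary name and renames it should keep the local archive from ever exposing a half-written segment, since mv within the same filesystem is atomic. The .tmp suffix is only for this sketch.)

archive_command = 'test ! -f /usr/local/wal_archive_local/%f && cp %p /usr/local/wal_archive_local/%f.tmp && mv /usr/local/wal_archive_local/%f.tmp /usr/local/wal_archive_local/%f'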
Archive files are then moved from the master to the standby every other minute:

rsync -aq --remove-sent-files /usr/local/wal_archive_local/ slave::wal_archive/

slave's recovery.conf:

restore_command = '/usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t /usr/local/recovery_control/quit_recovery /usr/local/wal_archive %f %p 2>> /usr/local/pgsql/data/standby.log'

Both servers are identical Dell PE1950 servers with 4 SAS HDs w/ hardware RAID 1+0, running:

2.6.18-8.1.8.el5 #1 SMP Tue Jul 10 06:39:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Thanks much in advance,
V.
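P.S. As far as I can tell, this is where it hangs: pg_standby, run with -w 0 as above, keeps sleeping until the requested segment reaches the full 16MB before handing it to recovery, so a short archived file blocks it indefinitely. Very roughly, in shell terms (a simplified sketch only; the real pg_standby is C and checks more than just the size):

SEGSIZE=16777216                                      # expected WAL segment size
FILE=/usr/local/wal_archive/000000010000034000000020
# keep waiting while the file is missing or not yet full-sized
while [ "$(stat -c %s "$FILE" 2>/dev/null || echo 0)" -ne "$SEGSIZE" ]; do
    sleep 10                                          # the -s 10 sleep interval
done
cp "$FILE" pg_xlog/RECOVERYXLOG                       # hand the segment to recovery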