Thread: hot standby lagging vs warm that is up-to-date
hello! i am facing a rather 'weird' issue, so please share your ideas/thoughts if you have any.

i have a setup with a master server and a hot standby. the database settings on both are identical, and the specs of the servers are the same except for the disks: the disks on the standby are much slower than the master's.

what happens is that the standby regularly falls behind. when this happens, streaming replication stops working and the server starts applying the wal archives (which are being rsync-ed from the master) from the point where streaming replication was interrupted. when this is done and there are no more archives to apply, streaming replication is back online.

the problem is that, during the apply of the archives, the process sometimes gets 'stuck' for too long on some archives (maybe even more than 30 minutes for a single archive, or even 2 hours on some occasions). at that point, running an 'iostat' command shows one of the disks (not always the same disk) being used 100%.

if i stop the standby server and bring it back online in a 'warm standby' setup (by using the pg_standby utility in the recovery.conf file), then the apply of all the archives is very fast (even for those archives that were stuck in the hot-standby setup), and iostat never shows more than 10-20% util on the disks where the data reside.

has anyone seen anything similar? please let me know which extra information would be useful.

some specs for the servers: 16 cpus, 64GB ram, red hat 5.6. the settings from the master's postgresql.conf are the following (those of the standby are the same, except that the hot_standby option is switched to 'on'):

 version                         | PostgreSQL 9.0.5 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
 archive_command                 | cp %p /archives/%f
 archive_mode                    | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_max_workers          | 5
 autovacuum_vacuum_cost_delay    | 10ms
 autovacuum_vacuum_scale_factor  | 0.2
 bgwriter_delay                  | 400ms
 bgwriter_lru_maxpages           | 50
 checkpoint_completion_target    | 0.9
 checkpoint_segments             | 300
 checkpoint_timeout              | 8min
 effective_cache_size            | 50GB
 hot_standby                     | on
 lc_collate                      | en_US.UTF-8
 lc_ctype                        | en_US.UTF-8
 listen_addresses                | *
 log_checkpoints                 | on
 log_destination                 | stderr
 log_filename                    | postgresql-%a.log
 log_line_prefix                 | %t [%p]: [%l-1] user=%u,db=%d,remote=%r
 log_min_duration_statement      | 1s
 log_rotation_age                | 1d
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 2GB
 max_connections                 | 1200
 max_prepared_transactions       | 1000
 max_stack_depth                 | 6MB
 max_wal_senders                 | 5
 port                            | 5432
 server_encoding                 | UTF8
 shared_buffers                  | 10GB
 synchronous_commit              | off
 temp_buffers                    | 12800
 TimeZone                        | UTC
 wal_buffers                     | 16MB
 wal_keep_segments               | 768
 wal_level                       | hot_standby
 work_mem                        | 30MB

thank you in advance!
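to make the two setups i am switching between clearer, the recovery.conf on the standby looks roughly like the sketch below (the connection string and archive path are placeholders, not the exact values from my setup):

# hot standby / streaming replication setup
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=repl'   # placeholder conninfo
restore_command = 'cp /archives/%f %p'                             # rsync-ed archives, used when streaming is interrupted

# warm standby setup (what i switch to when replay gets stuck)
restore_command = 'pg_standby /archives %f %p %r'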
MirrorX <mirrorx@gmail.com> writes:
> i am facing a rather 'weird' issue, so please share your ideas/thoughts if
> you have any.
> i have a setup with a master server and a hot standby. the database settings
> on both are identical, and the specs of the servers are the same except for
> the disks: the disks on the standby are much slower than the master's.

That is not a good situation to be in.  Replay of WAL logs is typically
less efficient than original creation of them (there are various reasons
for this, but the big picture is that the replay environment can't use as
much caching as a normal server process).  If the master's workload is
mostly-write then you need a slave with at least the same spec I/O system,
or it's very likely to fall behind.

> the problem is that, during the apply of the archives, the process sometimes
> gets 'stuck' for too long on some archives (maybe even more than 30 minutes
> for a single archive, or even 2 hours on some occasions). at that point,
> running an 'iostat' command shows one of the disks (not always the same
> disk) being used 100%.

Hm.  Can you get a stack trace of the startup process when it's doing that?

			regards, tom lane
this is the stack trace, thx in advance :D

$ ps aux | grep 4500
postgres  4500  4.2 14.0 14134172 13924240 ?  Ds  Aug29  60:44 postgres: startup process   recovering 00000002000040A3000000B4

$ gstack 4500
#0  0x00000037d10c63a0 in __read_nocancel () from /lib64/libc.so.6
#1  0x00000000005dc88a in FileRead ()
#2  0x00000000005edcb3 in mdread ()
#3  0x00000000005d9160 in ReadBuffer_common ()
#4  0x00000000005d9854 in ReadBufferWithoutRelcache ()
#5  0x000000000048b46c in XLogReadBufferExtended ()
#6  0x00000000004776cb in btree_redo ()
#7  0x0000000000487fe0 in StartupXLOG ()
#8  0x000000000048a0a8 in StartupProcessMain ()
#9  0x000000000049fec3 in AuxiliaryProcessMain ()
#10 0x00000000005c21e6 in StartChildProcess ()
#11 0x00000000005c44e7 in PostmasterMain ()
#12 0x000000000056e8de in main ()

postgres=# select now(), pg_last_xlog_replay_location();
              now              | pg_last_xlog_replay_location
-------------------------------+------------------------------
 2012-08-30 14:34:10.700851+00 | 40A3/B4D62038

postgres=# select now(), pg_last_xlog_replay_location();
              now             | pg_last_xlog_replay_location
------------------------------+------------------------------
 2012-08-30 14:36:49.67801+00 | 40A3/B4D62038
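fwiw, the stuck replay location lines up with the segment the startup process is shown recovering: 40A3/B4D62038 falls in 16MB segment 0xB4 of log file 0x40A3 (B4D62038 / 16MB = B4), which matches 00000002000040A3000000B4 from the ps output. if anyone wants to double-check that mapping, it can be done on the master (this function cannot be executed on the standby while it is in recovery, and the master reports its own timeline id in the first 8 hex digits):

select pg_xlogfile_name('40A3/B4D62038');   -- returns the WAL segment file name containing this location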
it is still stuck on the same wal record ->

              now              | pg_last_xlog_replay_location
-------------------------------+------------------------------
 2012-08-30 14:57:57.617171+00 | 40A3/B4D62038

and iostat shows 100% util on one disk ->

sdj  90.60  0.00  176.00  0.00  44032.00  0.00  250.18  1.60  8.69  5.67  99.76

there is no problem if this is only due to the slow disk; i just wanted a 2nd opinion on this matter in case it is some kind of a bug. thx again :)
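for completeness, this is roughly how i watch the replay position against the disk load while it is stuck (just a rough sketch; the device name and the 10s interval are only examples):

while true; do
    # replay position on the standby; it should advance if recovery is making progress
    psql -At -c "select now(), pg_last_xlog_replay_location();"
    # extended stats for the busy device (the last column is %util)
    iostat -dx sdj 1 1 | grep sdj
    sleep 10
done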