Thread: Monitoring Replication
Hello all,

I use Nagios to monitor various things on a few servers and have recently set up a hot-standby server and would obviously like to include the state of streaming replication in my monitoring.

I know about the pg_stat_replication view on the master and the pg_last_xlog_receive_location() system function on the standby... and while there is no traffic I know that the values from the sent_location column from the master view should match the value returned by pg_last_xlog_receive_location on the standby. I also assume that if streaming replication fails completely the pg_stat_replication view on the master should simply return no records... so that should be easy to detect.

The confusion I have is how exactly can I determine just how far behind the replication is during loads? Currently with no traffic (servers not in production yet) sent_location on the master is "A/10018560" and pg_last_xlog_receive_location() on the standby also returns "A/10018560"... How far apart can these be for me to start worrying? I could make a bit more sense of all this if they were simple timestamps or something, but the hex values returned boggle my mind.

Any advice on these issues or other tips on monitoring the replication would be greatly appreciated.

Thanks,
Brandon
On Wed, Oct 12, 2011, Brandon Phelps wrote:
> I use Nagios to monitor various things on a few servers and have
> recently set up a hot-standby server and would obviously like to
> include the state of streaming replication in my monitoring.
>
> [...]
>
> The confusion I have is how exactly can I determine just how far
> behind the replication is during loads? Currently with no traffic
> (servers not in production yet) sent_location on the master is
> "A/10018560" and pg_last_xlog_receive_location() on the standby also
> returns "A/10018560"... How far apart can these be for me to start
> worrying? I could make a bit more sense of all this if they were
> simple timestamps or something, but the hex values returned boggle my
> mind.
>
> Any advice on these issues or other tips on monitoring the replication
> would be greatly appreciated.

Brandon: I'm using this script for Mon, you should be able to adapt it
to whatever language and monitoring system you please.

http://www.martini.nu/misc/db_replication.monitor.txt

--
Mahlon E. Smith
http://www.martini.nu/contact.html
There is also http://bucardo.org/wiki/Check_postgres, but I haven't been able to get it to work for monitoring replication. I am using a custom script similar to Mahlon's, but written in Perl. Looking at Mahlon's code has shown me an error in how I have been thinking about calculating the replication lag. Thanks :)
On Wed, Oct 12, 2011 at 3:28 PM, Mahlon E. Smith <mahlon@martini.nu> wrote:
On Wed, Oct 12, 2011, Brandon Phelps wrote:
> I use Nagios to monitor various things on a few servers and have
> recently set up a hot-standby server and would obviously like to
> include the state of streaming replication in my monitoring.
>
> [...]
>
> The confusion I have is how exactly can I determine just how far
> behind the replication is during loads? Currently with no traffic
> (servers not in production yet) sent_location on the master is
> "A/10018560" and pg_last_xlog_receive_location() on the standby also
> returns "A/10018560"... How far apart can these be for me to start
> worrying? I could make a bit more sense of all this if they were
> simple timestamps or something, but the hex values returned boggle my
> mind.
>
> Any advice on these issues or other tips on monitoring the replication
> would be greatly appreciated.

Brandon: I'm using this script for Mon, you should be able to adapt it
to whatever language and monitoring system you please.
http://www.martini.nu/misc/db_replication.monitor.txt
--
Mahlon E. Smith
http://www.martini.nu/contact.html
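
The hex locations discussed in this thread can be turned into a plain byte count. What follows is a minimal sketch, not taken from either script mentioned above. It assumes the WAL layout of the PostgreSQL releases current at the time of this thread (before 9.3), where the part before the slash counts log files of 0xFF000000 bytes each and the part after the slash is an offset within that file. Later releases changed this layout, which is why the multiplier is a parameter. The function names and the warning/critical thresholds are hypothetical; the 0/1/2 codes follow the usual Nagios plugin convention.

```python
# Assumption: pre-9.3 layout, where one xlog id spans 0xFF000000 bytes.
BYTES_PER_XLOG_ID = 0xFF000000


def xlog_to_bytes(location, bytes_per_id=BYTES_PER_XLOG_ID):
    """Convert an 'A/10018560'-style location to an absolute byte position."""
    log_id, offset = location.strip().split("/")
    return int(log_id, 16) * bytes_per_id + int(offset, 16)


def replication_lag_bytes(sent_location, receive_location):
    """Bytes sent by the master that the standby has not yet received."""
    return xlog_to_bytes(sent_location) - xlog_to_bytes(receive_location)


def nagios_status(lag_bytes, warn=16 * 1024 * 1024, crit=64 * 1024 * 1024):
    """Map a lag to a (exit_code, message) pair.

    The thresholds are hypothetical placeholders; pick values that
    match your own write load.
    """
    if lag_bytes >= crit:
        return 2, "CRITICAL: standby is %d bytes behind" % lag_bytes
    if lag_bytes >= warn:
        return 1, "WARNING: standby is %d bytes behind" % lag_bytes
    return 0, "OK: standby is %d bytes behind" % lag_bytes


if __name__ == "__main__":
    # The idle-system values quoted in the original question.
    lag = replication_lag_bytes("A/10018560", "A/10018560")
    code, message = nagios_status(lag)
    print(message)
```

In a real check, the two arguments would come from sent_location in pg_stat_replication on the master and from pg_last_xlog_receive_location() on the standby, and the script would exit with the returned code.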