Thread: Timeout and Synch Rep

Timeout and Synch Rep

From

Josh Berkus

Date:

07 October 2010, 16:50:55

All,

In my effort to make the discussion around the design decisions of synch
rep less opaque, I'm starting a separate thread about what has developed
to be one of the more contentious issues.

I'm going to champion timeouts because I plan to use them. In fact, if
synch rep with a timeout is available, I plan to deploy it within 2 weeks
of 9.1 being released. Without a timeout (i.e. "wait forever" is the
only mode), that project will probably never use synch rep.

Let me give you my use-case so that you can understand why I want a timeout.

Client is a telecommunications service provider. They have a primary
server and a failover server for data updates. They also have two async
slaves on older machines for reporting purposes. The failover
currently does NOT accept any queries in order to keep it as current as
possible.

They would like the failover to be synchronous so that they can
guarantee no data loss in the event of a master failure. However, zero
data loss is less important to them than uptime ... they have a five-nines
SLA with their clients, and the hardware on the master is very good.

So, if something happens to the standby, and it cannot return an ack in
30 seconds, they would like it to degrade to asynch mode. At that
point, they would also like to trigger a nagios alert which will wake up
the sysadmin with flashing red lights. Once he has resolved the
problem, he would like to promote the now-asynch standby back to synch
standby.

Yes, this means that, in the event of a standby failure, they have a
window where any failure on the master will mean data loss. The user
regards this risk as acceptable, given that both the master and the
failover are located in the same data center in any case, so there is
always a risk of a sufficient disaster wiping out all data back to the
daily backup.
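
A minimal sketch of the degrade-on-timeout behaviour described above,
for illustration only. This is not PostgreSQL code: StandbyLink, Mode,
on_degrade and resync are hypothetical names, and a single shared ack
flag stands in for per-commit acknowledgements. The 30-second figure
would be passed as ack_timeout_s.

```python
import threading
from enum import Enum


class Mode(Enum):
    SYNC = "sync"
    ASYNC = "async"


class StandbyLink:
    """Hypothetical model of the master's view of one standby."""

    def __init__(self, ack_timeout_s, on_degrade):
        self.ack_timeout_s = ack_timeout_s  # e.g. 30 in the use-case above
        self.on_degrade = on_degrade        # e.g. a hook that raises an alert
        self.mode = Mode.SYNC
        self._ack = threading.Event()

    def receive_ack(self):
        # Called when the standby acknowledges the pending commit.
        self._ack.set()

    def commit(self):
        """Return True only if the standby acknowledged this commit."""
        if self.mode is Mode.ASYNC:
            return False  # committed locally without waiting
        acked = self._ack.wait(self.ack_timeout_s)
        self._ack.clear()
        if not acked:
            # No ack within the timeout: stop waiting on future commits
            # and tell the operator.
            self.mode = Mode.ASYNC
            self.on_degrade()
        return acked

    def resync(self):
        # Operator action once the standby problem has been resolved.
        self._ack.clear()
        self.mode = Mode.SYNC
```

The alert fires once, at the moment of degradation; later commits
proceed without waiting until the operator calls resync().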

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

Re: Timeout and Synch Rep

From

Fujii Masao

Date:

08 October 2010, 10:31:10

On Fri, Oct 8, 2010 at 4:50 AM, Josh Berkus <josh@agliodbs.com> wrote:
> In my effort to make the discussion around the design decisions of synch
> rep less opaque, I'm starting a separate thread about what has developed
> to be one of the more contentious issues.
>
> I'm going to champion timeouts because I plan to use them.  In fact, if
> synch rep with a timeout is available, I plan to deploy it within 2 weeks
> of 9.1 being released.  Without a timeout (i.e. "wait forever" is the
> only mode), that project will probably never use synch rep.
>
> Let me give you my use-case so that you can understand why I want a timeout.
>
> Client is a telecommunications service provider.  They have a primary
> server and a failover server for data updates.  They also have two async
> slaves on older machines for reporting purposes.   The failover
> currently does NOT accept any queries in order to keep it as current as
> possible.
>
> They would like the failover to be synchronous so that they can
> guarantee no data loss in the event of a master failure.  However, zero
> data loss is less important to them than uptime ... they have a five-nines
> SLA with their clients, and the hardware on the master is very good.
>
> So, if something happens to the standby, and it cannot return an ack in
> 30 seconds, they would like it to degrade to asynch mode.  At that
> point, they would also like to trigger a nagios alert which will wake up
> the sysadmin with flashing red lights.  Once he has resolved the
> problem, he would like to promote the now-asynch standby back to synch
> standby.
>
> Yes, this means that, in the event of a standby failure, they have a
> window where any failure on the master will mean data loss.  The user
> regards this risk as acceptable, given that both the master and the
> failover are located in the same data center in any case, so there is
> always a risk of a sufficient disaster wiping out all data back to the
> daily backup.

This explains very well why some systems require the timeout.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: Timeout and Synch Rep

From

Thom Brown

Date:

08 October 2010, 10:40:35

On 7 October 2010 20:50, Josh Berkus <josh@agliodbs.com> wrote:
> All,
> So, if something happens to the standby, and it cannot return an ack in
> 30 seconds, they would like it to degrade to asynch mode.  At that
> point, they would also like to trigger a nagios alert which will wake up
> the sysadmin with flashing red lights.

How?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

Re: Timeout and Synch Rep

From

Josh Berkus

Date:

08 October 2010, 13:36:29

>> So, if something happens to the standby, and it cannot return an ack in
>> 30 seconds, they would like it to degrade to asynch mode.  At that
>> point, they would also like to trigger a nagios alert which will wake up
>> the sysadmin with flashing red lights.
>
> How?

TBD, and before 9.1.  It's clear to *me* that we're going to need some 
read-only system views around replication in order to make monitoring 
work.  Also, given that we have this shiny new LISTEN/NOTIFY 
implementation, it would be peachy keen to maybe make use of it to emit 
an event when replication changes, no?

Otherwise, DBAs are forced to tail the logs to figure out what 
replication status is.  This is NOT adequate.
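
A rough sketch of what the monitoring side could look like, assuming
some future view or event reports the standby's state as a plain
string. The "sync"/"async" values and the check_standby name are
hypothetical; only the exit codes follow the standard Nagios plugin
convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_standby(reported_mode):
    """Map a reported standby state (hypothetical) to a Nagios result.

    Returns an (exit_code, message) pair; a real plugin would print
    the message and exit with the code.
    """
    if reported_mode == "sync":
        return OK, "OK - standby is synchronous"
    if reported_mode == "async":
        return CRITICAL, "CRITICAL - standby degraded to asynchronous"
    return UNKNOWN, "UNKNOWN - replication status unavailable: %r" % (reported_mode,)
```

With something like this, the check reads one value instead of
parsing the server log.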


--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com