Thread: Timeout and Synch Rep
All,

In my effort to make the discussion around the design decisions of synch rep less opaque, I'm starting a separate thread about what has developed into one of the more contentious issues.

I'm going to champion timeouts because I plan to use them. In fact, I plan to deploy synch rep with a timeout if it's available within 2 weeks of 9.1 being released. Without a timeout (i.e. "wait forever" is the only mode), that project will probably never use synch rep.

Let me give you my use-case so that you can understand why I want a timeout.

Client is a telecommunications service provider. They have a primary server and a failover server for data updates. They also have two async slaves on older machines for reporting purposes. The failover currently does NOT accept any queries in order to keep it as current as possible.

They would like the failover to be synchronous so that they can guarantee no data loss in the event of a master failure. However, zero data loss is less important to them than uptime ... they have a five-nines SLA with their clients, and the hardware on the master is very good.

So, if something happens to the standby, and it cannot return an ack in 30 seconds, they would like it to degrade to asynch mode. At that point, they would also like to trigger a nagios alert which will wake up the sysadmin with flashing red lights. Once he has resolved the problem, he would like to promote the now-asynch standby back to synch standby.

Yes, this means that, in the event of a standby failure, they have a window where any failure on the master will mean data loss. The user regards this risk as acceptable, given that both the master and the failover are located in the same data center in any case, so there is always a risk of a sufficient disaster wiping out all data back to the daily backup.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
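[Editorial sketch] The degrade-on-timeout behaviour requested above can be modelled in a few lines. This is only a toy model, not PostgreSQL code or any proposed patch: the class `SyncRepGate`, its methods, the status strings, and the `on_degrade` callback are all hypothetical names invented for illustration, and a real implementation would live inside the server's commit path rather than in a Python queue.

```python
import queue
import time


class SyncRepGate:
    """Toy model: a commit waits for a standby ack, degrading to async on timeout."""

    def __init__(self, timeout_s, on_degrade):
        self.timeout_s = timeout_s      # e.g. 30 seconds in the use-case above
        self.on_degrade = on_degrade    # hypothetical alert hook (e.g. page the sysadmin)
        self.mode = "sync"
        self._acks = queue.Queue()

    def ack(self, lsn):
        """Called when the standby confirms it has received WAL up to lsn."""
        self._acks.put(lsn)

    def commit(self, lsn):
        """Return once the commit at lsn is acked, or once the timeout expires."""
        if self.mode == "async":
            return "committed-async"    # already degraded: do not wait at all
        deadline = time.monotonic() + self.timeout_s
        while True:
            remaining = deadline - time.monotonic()
            try:
                if remaining <= 0:
                    raise queue.Empty
                acked = self._acks.get(timeout=remaining)
            except queue.Empty:
                # No ack within the timeout: degrade and raise the alert once.
                self.mode = "async"
                self.on_degrade(lsn)
                return "committed-async"
            if acked >= lsn:
                return "committed-sync"

    def resync(self):
        """Operator action: promote the standby back to synchronous."""
        self.mode = "sync"
```

The point of the sketch is the trade-off described in the message: once the gate degrades, later commits stop waiting entirely, so uptime is preserved at the cost of a data-loss window until someone calls `resync()`.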
On Fri, Oct 8, 2010 at 4:50 AM, Josh Berkus <josh@agliodbs.com> wrote:
> In my effort to make the discussion around the design decisions of synch
> rep less opaque, I'm starting a separate thread about what has developed
> to be one of the more contentious issues.
>
> I'm going to champion timeouts because I plan to use them. In fact, I
> plan to deploy synch rep with a timeout if it's available within 2 weeks
> of 9.1 being released. Without a timeout (i.e. "wait forever" is the
> only mode), that project will probably never use synch rep.
>
> Let me give you my use-case so that you can understand why I want a timeout.
>
> Client is a telecommunications service provider. They have a primary
> server and a failover server for data updates. They also have two async
> slaves on older machines for reporting purposes. The failover
> currently does NOT accept any queries in order to keep it as current as
> possible.
>
> They would like the failover to be synchronous so that they can
> guarantee no data loss in the event of a master failure. However, zero
> data loss is less important to them than uptime ... they have a five-nines
> SLA with their clients, and the hardware on the master is very good.
>
> So, if something happens to the standby, and it cannot return an ack in
> 30 seconds, they would like it to degrade to asynch mode. At that
> point, they would also like to trigger a nagios alert which will wake up
> the sysadmin with flashing red lights. Once he has resolved the
> problem, he would like to promote the now-asynch standby back to synch
> standby.
>
> Yes, this means that, in the event of a standby failure, they have a
> window where any failure on the master will mean data loss. The user
> regards this risk as acceptable, given that both the master and the
> failover are located in the same data center in any case, so there is
> always a risk of a sufficient disaster wiping out all data back to the
> daily backup.

This explains very well why some systems require the timeout.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 7 October 2010 20:50, Josh Berkus <josh@agliodbs.com> wrote:
> All,
> So, if something happens to the standby, and it cannot return an ack in
> 30 seconds, they would like it to degrade to asynch mode. At that
> point, they would also like to trigger a nagios alert which will wake up
> the sysadmin with flashing red lights.

How?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
>> So, if something happens to the standby, and it cannot return an ack in
>> 30 seconds, they would like it to degrade to asynch mode. At that
>> point, they would also like to trigger a nagios alert which will wake up
>> the sysadmin with flashing red lights.
>
> How?

TBD, and before 9.1. It's clear to *me* that we're going to need some read-only system views around replication in order to make monitoring work. Also, given that we have this shiny new LISTEN/NOTIFY implementation, it would be peachy keen to maybe make use of it to emit an event when replication status changes, no?

Otherwise, DBAs are forced to tail the logs to figure out what the replication status is. This is NOT adequate.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
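
[Editorial sketch] To make the monitoring side of the "How?" concrete: a Nagios plugin reports its result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line message. The function below maps a replication state onto that convention. Where the state comes from is exactly the part the thread leaves as TBD, so the `state` argument and its values `"sync"` and `"async"` are assumptions made for illustration, not the output of any existing view or event.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_standby(state):
    """Map a (hypothetical) standby replication state to a Nagios result.

    Returns (exit_code, message). The state strings are assumed values;
    the thread does not define how the status would be exposed.
    """
    if state == "sync":
        return OK, "OK - standby is synchronous"
    if state == "async":
        # The failover has degraded: this is the wake-the-sysadmin case.
        return CRITICAL, "CRITICAL - standby degraded to async"
    return UNKNOWN, "UNKNOWN - unrecognized state %r" % (state,)
```

A wrapper script would obtain the state from whatever interface ends up being provided, print the message, and exit with the returned code.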