Re: Sync Rep for 2011CF1 - Mailing list pgsql-hackers
From | Aidan Van Dyk |
---|---|
Subject | Re: Sync Rep for 2011CF1 |
Date | |
Msg-id | AANLkTi=6DZbQ0vcwvCV+K20phS4jv8bnRhENLXJuyvVf@mail.gmail.com |
In response to | Re: Sync Rep for 2011CF1 (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Sync Rep for 2011CF1
|
List | pgsql-hackers |
On Fri, Jan 21, 2011 at 1:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>> Again, I'm trying to stop "forward progress" as soon as possible when
>> a sync slave isn't replicating. And I'ld like clients to fail with
>> errors sooner (hopefully they get to the commit point) rather than
>> accumulate the WAL synced to the master and just wait at the commit.
>
> Well, stopping all WAL activity with an error sounds *more* reasonable
> than refusing all logins, but I'm not personally sold on it. For
> example, a brief network disruption on the connection between master
> and standby would cause the master to grind to a halt... and then
> almost immediately resume operations.

Yup. And I'm OK with that. In my case, it would be much better to have a few quick failures, which can complete automatically a few seconds later, than to have a big buildup of transactions to re-verify by hand upon starting manual processing.

But again, I'll stress that I'm talking about when the master has no sync slave connected. A "brief network disruption" between the master and slave isn't likely to disconnect the slave. TCP is pretty good at handling those. If the master thinks it has a sync slave connected, I'm fine with it continuing to queue WAL for it even if it's lagging noticeably.

> More generally, if you have
> short-running transactions, there's not much difference between
> wait-at-commit and wait-at-WAL, and if you have long-running
> transactions, then wait-at-WAL might be gumming up the works more than
> necessary.

Again, when there is no sync slave *connected*, I don't want to wait *at all*. I want to fail ASAP. If there is a sync slave, and it's just slow, I don't really care where it waits.

From my experience, if the slave is not connected (i.e. the TCP connection has been disconnected), then we're in something like:

1) Proper slave shutdown: pilot error here stopping it if the master requires it
2) Master start, slave not connected yet: I'm fine with getting errors here... We *hope* a slave will be here soon, but...
3) Network has separated master/slave: TCP means it's been like this for a long time already...
4) Slave hardware/OS low-level hang/crash: TCP means it's been like this for a while already before the master's OS tears down the connection
5) Slave has crashed (or rebooted) and the slave OS has closed/rejected our TCP connection

In all of these, I'd love for my master not to be generating WAL and letting clients think they are making progress. And I'm hoping that for #3 and #4 above, PG will have keepalive-type traffic that will prevent me from queuing WAL for normal TCP connection time values.

> One idea might be to wait both before and after commit. If
> allow_standalone_primary is off, and a commit is attempted, we check
> whether there's a slave connected, and if not, wait for one to
> connect. Then, we write and sync the commit WAL record. Next, we
> wait for the WAL to be ack'd. Of course, the standby might disappear
> between the first check and the second, but it would greatly reduce
> the possibility of the master being ahead of the standby after a
> crash, which might be useful for some people.

Ya, but that becomes much more expensive. Instead of it just being "write WAL, fsync WAL, send WAL, wait for slave", it becomes "write WAL, fsync WAL, send WAL, wait for slave fsync, write WAL, fsync WAL, send WAL, wait for slave fsync". And its expense is there all the time, rather than just when the "no slave no go" situations arise.

And it doesn't reduce the transactions I need to verify by hand either, because that waiting/error still only happens at the COMMIT statement from the client.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
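
For illustration only, the policy argued for in this message can be sketched in a few lines of Python. This is not PostgreSQL code: `StandbyState`, `Action`, `on_wal_write`, and the `lag_bytes` field are hypothetical names invented for the sketch. It only captures the distinction drawn above between a sync slave that is not connected (fail right away) and one that is connected but slow (keep queuing WAL and wait).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FAIL_FAST = "fail_fast"            # return an error to the client immediately
    QUEUE_AND_WAIT = "queue_and_wait"  # keep queuing WAL and wait for the standby


@dataclass
class StandbyState:
    """Hypothetical view of the sync standby as seen from the master."""
    connected: bool     # is the TCP connection to the sync standby still up?
    lag_bytes: int = 0  # how far behind the standby is (illustrative metric)


def on_wal_write(standby: StandbyState) -> Action:
    """Decide what to do when a transaction wants to generate WAL."""
    if not standby.connected:
        # No sync standby connected: stop forward progress and fail ASAP.
        return Action.FAIL_FAST
    # A connected standby that is merely slow is fine, however far it lags.
    return Action.QUEUE_AND_WAIT
```

Under these assumptions, `on_wal_write(StandbyState(connected=False))` gives `Action.FAIL_FAST`, while a connected standby gives `Action.QUEUE_AND_WAIT` regardless of its lag.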