Thread: sync rep and smart shutdown
There is an open item for synchronous replication and smart shutdown, with a link to here:

http://archives.postgresql.org/pgsql-hackers/2011-03/msg01391.php

The issue is not straightforward, however, so I want to get some broader input before proceeding.

In short, the problem is that if synchronous replication is in use, no standbys are connected, and a smart shutdown is requested, any future commits will wait for a wake-up that will never come, because by that point postmaster is no longer accepting connections - thus no standby can reconnect to release waiters. Or, if there is a standby connected when the smart shutdown is requested, but it subsequently gets disconnected, it won't be able to reconnect, and again all waiters will get stuck.

There are a couple of plausible ways to proceed here:

1. Do nothing. If this happens to you, you will need to request fast or immediate shutdown to get the system unstuck. Since it's pretty easy for this to happen already anyway (all you need is one connection to sit open doing nothing), most people probably already have provision for this and likely wouldn't be terribly inconvenienced by one more corner case. On the flip side, I would rather that we were moving in the direction of making it more likely for smart shutdown to actually shut down the system, rather than less likely.

2. When a smart shutdown is initiated, shut off synchronous replication. This definitely makes sure you won't get stuck waiting for sync rep, but on the other hand you probably configured sync rep because you wanted, uh, sync rep. Or alternatively, continue to allow sync rep for as long as there is a sync standby connected, but if the last sync standby drops off then shut it off.

3. Accept new replication connections even when the system is undergoing a smart shutdown. This is the approach that the above-linked patch tries to take, and it seems superficially sensible, but it doesn't really work.

Currently, once a shutdown has been initiated and any on-line backup has been stopped, we stop creating regular backends; we instead only create dead-end backends that just return an error message and exit. Once no regular backends remain, we then stop accepting connections AT ALL and wait for the dead-end backends to drain out.

What this patch proposes to do (though it isn't real clear from the way it's written) is continue creating regular backends but boot out all but superuser and replication connections as soon as possible. However, that misses the reason why the current code works the way that it does: to make sure that even in the face of a continuing stream of connection requests, we actually eventually manage to stop talking and shut down. Basically, this patch would fix the smart-shutdown-sync-rep interaction at the expense of making smart shutdown considerably more fragile in other cases, which does not seem like a good trade-off. AFAICT, this whole approach is doomed to failure.

Anyone else have an idea or opinion?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
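The connection handling described in the message above can be pictured with a small model. This is only an illustrative sketch, not the postmaster's actual code: the names `Phase`, `Disposition`, `handle_connection`, and `standby_can_reconnect` are hypothetical, and the on-line backup condition is left out for brevity.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical, simplified phases of the server as described in the thread."""
    RUNNING = auto()          # normal operation
    SMART_SHUTDOWN = auto()   # shutdown requested, regular backends still exist
    DRAINING = auto()         # no regular backends remain


class Disposition(Enum):
    REGULAR_BACKEND = auto()   # a normal session is created
    DEAD_END_BACKEND = auto()  # returns an error message and exits
    NOT_ACCEPTED = auto()      # connections are no longer accepted at all


def handle_connection(phase: Phase) -> Disposition:
    """What happens to a new connection request in each phase (assumed model)."""
    if phase is Phase.RUNNING:
        return Disposition.REGULAR_BACKEND
    if phase is Phase.SMART_SHUTDOWN:
        return Disposition.DEAD_END_BACKEND
    return Disposition.NOT_ACCEPTED


def standby_can_reconnect(phase: Phase) -> bool:
    """A standby only helps sync rep waiters if it gets a real backend."""
    return handle_connection(phase) is Disposition.REGULAR_BACKEND
```

In this model, `standby_can_reconnect` is false for every phase after the shutdown request, which is why a commit waiting on sync rep at that point has nobody left to wake it.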
Robert Haas <robertmhaas@gmail.com> writes:
> There is an open item for synchronous replication and smart shutdown,
> with a link to here:
> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01391.php

> There are a couple of plausible ways to proceed here:

> 1. Do nothing.

> 2. When a smart shutdown is initiated, shut off synchronous
> replication.

> 3. Accept new replication connections even when the system is
> undergoing a smart shutdown.

I agree that #3 is impractical and #2 is a bad idea, which seems to leave us with #1 (unless anyone has a #4)? This is probably just something we should figure is going to be one of the rough edges in the first release of sync rep.

A #4 idea did just come to mind: once we realize that there are no working replication connections, automatically do a fast shutdown instead, ie, forcibly roll back those transactions that are never gonna complete. Or at least have the postmaster bleat about it. But I'm not sure what it'd take to code that, and am also unsure that it's something to undertake at this stage of the cycle.

regards, tom lane
On Fri, Apr 8, 2011 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> There is an open item for synchronous replication and smart shutdown,
>> with a link to here:
>> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01391.php
>
>> There are a couple of plausible ways to proceed here:
>
>> 1. Do nothing.
>
>> 2. When a smart shutdown is initiated, shut off synchronous
>> replication.
>
>> 3. Accept new replication connections even when the system is
>> undergoing a smart shutdown.
>
> I agree that #3 is impractical and #2 is a bad idea, which seems to
> leave us with #1 (unless anyone has a #4)? This is probably just
> something we should figure is going to be one of the rough edges
> in the first release of sync rep.

That's kind of where my mind was headed too, although I was (probably vainly) hoping for a better option.

> A #4 idea did just come to mind: once we realize that there are no
> working replication connections, automatically do a fast shutdown
> instead, ie, forcibly roll back those transactions that are never
> gonna complete. Or at least have the postmaster bleat about it.
> But I'm not sure what it'd take to code that, and am also unsure
> that it's something to undertake at this stage of the cycle.

Well, you certainly can't do that. By the time a transaction is waiting for sync rep, it's too late to roll back; the commit record is already, and necessarily, on disk. But in theory we could notice that all of the remaining backends are waiting for sync rep, and switch to a fast shutdown.

Several people have suggested refinements for smart shutdown in general, such as switching to fast shutdown after a certain number of seconds, or having backends exit at the end of the current transaction (or immediately if idle). Such things would both make this problem less irksome and increase the overall utility of smart shutdown tremendously. So maybe it's not worth expending too much effort on it right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
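The "notice that all of the remaining backends are waiting for sync rep" idea from the message above could look roughly like the check below. It is a sketch of the decision only, under stated assumptions: `Backend`, `should_escalate_to_fast`, and the `standbys_connected` argument are hypothetical names, and the rule that a connected standby suppresses escalation is an added assumption, not something the thread specifies.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Backend:
    """Hypothetical view of one remaining regular backend."""
    pid: int
    waiting_for_sync_rep: bool


def should_escalate_to_fast(backends: Sequence[Backend],
                            standbys_connected: int) -> bool:
    """Decide whether a smart shutdown should switch to a fast shutdown.

    Escalate only when every remaining backend is stuck waiting for
    sync rep. Assumption: if a standby is still connected, the waiters
    may yet be released, so do not escalate in that case.
    """
    if standbys_connected > 0:
        return False
    if not backends:
        # Nothing is left to wait for; smart shutdown can finish by itself.
        return False
    return all(b.waiting_for_sync_rep for b in backends)
```

Note that escalating this way would not roll the waiting transactions back: as the message says, their commit records are already on disk by the time they wait.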
On Sat, Apr 9, 2011 at 3:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> There are a couple of plausible ways to proceed here:
>>
>>> 1. Do nothing.
>>
>>> 2. When a smart shutdown is initiated, shut off synchronous
>>> replication.
>>
>>> 3. Accept new replication connections even when the system is
>>> undergoing a smart shutdown.
>>
>> I agree that #3 is impractical and #2 is a bad idea, which seems to
>> leave us with #1 (unless anyone has a #4)? This is probably just
>> something we should figure is going to be one of the rough edges
>> in the first release of sync rep.
>
> That's kind of where my mind was headed too, although I was (probably
> vainly) hoping for a better option.

Though I proposed #3, I can live with #1 for now. Even if smart shutdown gets stuck, we can resolve that by requesting fast shutdown or emptying synchronous_standby_names.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center