Thread: stray SIGALRM
In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
seconds, even if authentication has been successful. Most of the time
this doesn't hurt anyone, but there are cases, such as when the backend
is doing the open() of a backend copy, where it breaks things and results
in an error getting reported to the client. In particular, if you're doing
a copy from a FIFO, it is normal for open() to block until the process at
the other end has data ready, so you're very likely to have it interrupted
by the SIGALRM and fail.

To see the SIGALRM, just run psql, then determine your backend's pid,
attach an strace to it, and wait 60 seconds, or whatever you've got
authentication_timeout set to.

This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
side-effect of speeding things up by getting rid of setitimer() calls;
it's not obvious what's a good way to fix it without losing the benefits
of that commit.

Thanks to Alvaro and Andres for helping me get from "why is my copy
getting these signals" to understanding what's actually going on.

Richard
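P.S. For anyone who wants to see the failure mode outside the backend,
here's a minimal standalone sketch (not backend code; the FIFO path and
the 2-second alarm standing in for authentication_timeout are made up
for illustration). Because the handler is installed without SA_RESTART,
the blocking open() fails with EINTR when the timer fires, which is
exactly what happens to the COPY-from-FIFO open():

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void
    handle_alrm(int signo)
    {
        /* empty: the point is just to interrupt a blocking syscall */
    }

    int
    main(void)
    {
        struct sigaction sa;
        int         fd;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handle_alrm;    /* note: sa_flags lacks SA_RESTART */
        sigaction(SIGALRM, &sa, NULL);

        mkfifo("/tmp/demo_fifo", 0600); /* hypothetical path */
        alarm(2);                       /* stand-in for authentication_timeout */

        fd = open("/tmp/demo_fifo", O_RDONLY);  /* blocks: no writer yet */
        if (fd < 0)
            printf("open() failed: %s\n", strerror(errno)); /* EINTR here */
        return 0;
    }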
Richard Poole <richard@2ndQuadrant.com> writes:
> In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
> seconds, even if authentication has been successful. Most of the time
> this doesn't hurt anyone, but there are cases, such as when the backend
> is doing the open() of a backend copy, where it breaks things and results
> in an error getting reported to the client. In particular, if you're doing
> a copy from a FIFO, it is normal for open() to block until the process at
> the other end has data ready, so you're very likely to have it interrupted
> by the SIGALRM and fail.

> To see the SIGALRM, just run psql, then determine your backend's pid,
> attach an strace to it, and wait 60 seconds, or whatever you've got
> authentication_timeout set to.

> This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
> side-effect of speeding things up by getting rid of setitimer() calls;
> it's not obvious what's a good way to fix it without losing the benefits
> of that commit.

Ugh.  It doesn't sound very practical to try to guarantee that every
single kernel call in the backend is set up to recover from EINTR,
even though ideally they should all be able to cope.  Maybe we have to
revert those signal-handling changes.

			regards, tom lane
I wrote:
> Richard Poole <richard@2ndQuadrant.com> writes:
>> This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
>> side-effect of speeding things up by getting rid of setitimer() calls;
>> it's not obvious what's a good way to fix it without losing the benefits
>> of that commit.

> Ugh.  It doesn't sound very practical to try to guarantee that every
> single kernel call in the backend is set up to recover from EINTR,
> even though ideally they should all be able to cope.

On reflection though, we *do* need to make them cope, because even
without lazy SIGALRM disable, any such place is still at risk.  We
surely must allow for the possibility of SIGHUP arriving at any instant,
for example.

Have you identified the specific place that's giving you trouble?

			regards, tom lane
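P.S. By "cope" I mean the standard retry idiom at each call site,
roughly like this (the generic Unix pattern, not a quote from our
sources; the function name is illustrative):

    #include <errno.h>
    #include <fcntl.h>

    /* Retry a blocking open() until it either succeeds or fails for a
     * reason other than signal interruption. */
    static int
    open_retry(const char *path, int flags)
    {
        int         fd;

        do
        {
            fd = open(path, flags);
        } while (fd < 0 && errno == EINTR);

        return fd;
    }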
On 2013-06-15 10:45:28 -0400, Tom Lane wrote:
> I wrote:
>> Richard Poole <richard@2ndQuadrant.com> writes:
>>> This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
>>> side-effect of speeding things up by getting rid of setitimer() calls;
>>> it's not obvious what's a good way to fix it without losing the benefits
>>> of that commit.

>> Ugh.  It doesn't sound very practical to try to guarantee that every
>> single kernel call in the backend is set up to recover from EINTR,
>> even though ideally they should all be able to cope.

> On reflection though, we *do* need to make them cope, because even
> without lazy SIGALRM disable, any such place is still at risk.  We
> surely must allow for the possibility of SIGHUP arriving at any instant,
> for example.

All the signal handlers we register, including SIGHUP's, set SA_RESTART,
except the one for SIGALRM... I wonder if we can rejigger things so we
don't need that exception? I am not sure if there's still a reason for
that decision inside the backend.

Greetings,

Andres Freund
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-15 10:45:28 -0400, Tom Lane wrote:
>> On reflection though, we *do* need to make them cope, because even
>> without lazy SIGALRM disable, any such place is still at risk.  We
>> surely must allow for the possibility of SIGHUP arriving at any instant,
>> for example.

> All the signal handlers we register, including SIGHUP's, set SA_RESTART,
> except the one for SIGALRM... I wonder if we can rejigger things so we
> don't need that exception? I am not sure if there's still a reason for
> that decision inside the backend.

Hmm.  Personally I'd rather go in the other direction:
http://www.postgresql.org/message-id/12819.1183306286@sss.pgh.pa.us
but maybe the path of least resistance is to set SA_RESTART for SIGALRM
too.  Given our current usage of it, there seems no worse harm in
allowing kernel calls to restart after a SIGALRM than after any other
signal.

			regards, tom lane
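P.S. Concretely, that would mean installing the SIGALRM handler with
SA_RESTART set, along these lines (a sketch using the bare POSIX API;
the function name is illustrative, not our actual pqsignal() code):

    #include <signal.h>
    #include <string.h>

    typedef void (*sig_handler_t) (int);

    static sig_handler_t
    install_alarm_handler(sig_handler_t handler)
    {
        struct sigaction act,
                    oact;

        memset(&act, 0, sizeof(act));
        act.sa_handler = handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_RESTART;  /* let interrupted syscalls resume */

        if (sigaction(SIGALRM, &act, &oact) < 0)
            return SIG_ERR;
        return oact.sa_handler;     /* return the previous handler */
    }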
On 2013-06-15 11:29:45 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-06-15 10:45:28 -0400, Tom Lane wrote:
>>> On reflection though, we *do* need to make them cope, because even
>>> without lazy SIGALRM disable, any such place is still at risk.  We
>>> surely must allow for the possibility of SIGHUP arriving at any instant,
>>> for example.

>> All the signal handlers we register, including SIGHUP's, set SA_RESTART,
>> except the one for SIGALRM... I wonder if we can rejigger things so we
>> don't need that exception? I am not sure if there's still a reason for
>> that decision inside the backend.

> Hmm.  Personally I'd rather go in the other direction:
> http://www.postgresql.org/message-id/12819.1183306286@sss.pgh.pa.us
> but maybe the path of least resistance is to set SA_RESTART for SIGALRM
> too.  Given our current usage of it, there seems no worse harm in
> allowing kernel calls to restart after a SIGALRM than after any other
> signal.

I am not actually objecting to that reasoning; I think it would be
rather useful to get there. But I don't think it's realistic to do it at
this point in the 9.3 cycle. It seems like we would be fighting bugs
around that for quite a while: we have a large number of syscall sites
where we don't retry on EINTR/EAGAIN. And, as you note, that's not even
counting third-party code.

I'd be happy if we decide to go that way at the beginning of the 9.4
cycle, i.e. do it right now on HEAD for all syscalls. That way we have a
chance to find most of the bad callsites before releasing.

Greetings,

Andres Freund
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-15 11:29:45 -0400, Tom Lane wrote:
>> Hmm.  Personally I'd rather go in the other direction:
>> http://www.postgresql.org/message-id/12819.1183306286@sss.pgh.pa.us

> I am not actually objecting to that reasoning; I think it would be
> rather useful to get there.

> But I don't think it's realistic to do it at this point in the 9.3
> cycle. It seems like we would be fighting bugs around that for quite a
> while: we have a large number of syscall sites where we don't retry on
> EINTR/EAGAIN. And, as you note, that's not even counting third-party
> code.

Yeah, it's the issue of third-party code within the backend (perl,
python, etc etc etc etc) that really makes complete EINTR-proofing seem
a bit impractical.  Back in the day I was also worried about platforms
that didn't have SA_RESTART, but that's probably pretty much the empty
set by now (is anyone aware of a modern platform on which configure
fails to set HAVE_POSIX_SIGNALS?).  Also, our switch to latches for
sleeping purposes should have ameliorated the issue of signals failing
to wake processes when we wanted them to.

Let's turn on SA_RESTART for SIGALRM in HEAD and 9.3 and see what beta
testing says.

			regards, tom lane
I wrote:
> ... Also, our switch to latches for sleeping purposes
> should have ameliorated the issue of signals failing to wake processes
> when we wanted them to.

> Let's turn on SA_RESTART for SIGALRM in HEAD and 9.3 and see what beta
> testing says.

I experimented with this a bit on my old HPUX box (which is one of the
ones where SA_RESTART signals don't terminate a select()), and soon
found an annoying behavior:

regression=# \timing
Timing is on.
regression=# set statement_timeout TO 2000;
SET
Time: 4.983 ms
regression=# select generate_series(1,1000000);
ERROR:  canceling statement due to statement timeout
Time: 2015.015 ms
regression=# select pg_sleep(10);
ERROR:  canceling statement due to statement timeout
Time: 3009.932 ms

Note that the pg_sleep() call took three seconds to cancel rather than
two.  This happens because pg_sleep() is implemented with a loop around
pg_usleep(1000000), and the latter no longer is terminated by a SIGALRM,
so the timeout isn't noticed until the next one-second loop iteration.

It seems like it'd be a good idea to replace the pg_sleep implementation
with something involving WaitLatch, which would not only ensure prompt
response to SIGALRM (and other signals, eg SIGQUIT), but would eliminate
useless process wakeups during a sleep.

In general, we might want to consider replacing long sleep intervals
with WaitLatch operations.  I thought for a bit about trying to turn
pg_usleep itself into a WaitLatch call; but it's also used in frontend
code where that wouldn't work, and anyway it's not clear this would be
a good thing for short sleeps.

			regards, tom lane
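P.S. Roughly what I have in mind, as a sketch against the latch API as
it stands today (the fmgr boilerplate is elided, the function name is
illustrative, and a real version would also want to react to
WL_POSTMASTER_DEATH rather than merely requesting it):

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "storage/proc.h"
    #include "utils/timestamp.h"

    static void
    sleep_using_latch(float8 secs)
    {
        TimestampTz endtime =
            TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                        (int) (secs * 1000.0));

        for (;;)
        {
            long        secs_left;
            int         usecs_left;
            long        delay_ms;

            CHECK_FOR_INTERRUPTS();     /* notice cancel/die promptly */

            TimestampDifference(GetCurrentTimestamp(), endtime,
                                &secs_left, &usecs_left);
            if (secs_left == 0 && usecs_left == 0)
                break;                  /* deadline reached */
            delay_ms = secs_left * 1000L + usecs_left / 1000 + 1;

            /* sleep until timeout, latch set, or postmaster death */
            (void) WaitLatch(&MyProc->procLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                             delay_ms);
            ResetLatch(&MyProc->procLatch);
        }
    }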
Tom Lane wrote:
> In general, we might want to consider replacing long sleep intervals
> with WaitLatch operations.  I thought for a bit about trying to turn
> pg_usleep itself into a WaitLatch call; but it's also used in frontend
> code where that wouldn't work, and anyway it's not clear this would be
> a good thing for short sleeps.

How about having an #ifndef FRONTEND code path that uses the latch, and
a plain sleep otherwise?  And maybe use the plain sleep for short sleeps
in the backend also, to avoid the latch overhead.  I notice we already
have three implementations of pg_usleep.

Álvaro Herrera
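P.S. Something like this is what I mean, as a sketch (the one-second
cutoff for "short" sleeps is an arbitrary illustrative choice, and the
backend branch leans on the latch API):

    /* needs <sys/select.h> and <sys/time.h>; in the backend, also
     * "storage/latch.h" and "storage/proc.h" for WaitLatch and MyProc */

    void
    pg_usleep(long microsec)
    {
        if (microsec <= 0)
            return;

    #ifndef FRONTEND
        if (microsec >= 1000000L)   /* long sleeps: wait on the latch */
        {
            (void) WaitLatch(&MyProc->procLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                             microsec / 1000);
            ResetLatch(&MyProc->procLatch);
            return;
        }
    #endif

        /* short sleeps, and all frontend callers: plain kernel sleep */
        {
            struct timeval delay;

            delay.tv_sec = microsec / 1000000L;
            delay.tv_usec = microsec % 1000000L;
            (void) select(0, NULL, NULL, NULL, &delay);
        }
    }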
* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> How about having an #ifndef FRONTEND code path that uses the latch, and
> a plain sleep otherwise?  And maybe use the plain sleep for short sleeps
> in the backend also, to avoid the latch overhead.  I notice we already
> have three implementations of pg_usleep.

Is there really serious overhead from using latches..?  I thought much
of the point of that approach was specifically to minimize overhead...

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> Is there really serious overhead from using latches..?

IIRC there's at least one gettimeofday() involved in WaitLatch.  There
are platforms on which that already costs more than you want to spend
for, say, CommitDelay.

			regards, tom lane