Thread: stray SIGALRM
In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
seconds, even if authentication has been successful. Most of the time
this doesn't hurt anyone, but there are cases, such as when the backend
is doing the open() of a backend copy, where it breaks things and results
in an error getting reported to the client. In particular, if you're doing
a copy from a FIFO, it is normal for open() to block until the process at
the other end has data ready, so you're very likely to have it interrupted
by the SIGALRM and fail.

To see the SIGALRM, just run psql, then determine your backend's pid,
attach an strace to it, and wait 60 seconds, or whatever you've got
authentication_timeout set to.

This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
side-effect of speeding things up by getting rid of setitimer() calls;
it's not obvious what's a good way to fix it without losing the benefits
of that commit.

Thanks to Alvaro and Andres for helping me get from "why is my copy
getting these signals" to understanding what's actually going on.

Richard
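P.S. For anyone who wants to see the failure mode outside the backend,
here's a minimal standalone sketch (not backend code; the FIFO path and
the 2-second alarm standing in for authentication_timeout are made up
for illustration). Because the handler is installed without SA_RESTART,
the blocking open() fails with EINTR when the timer fires, which is
exactly what happens to the COPY-from-FIFO open():

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void
    handle_alrm(int signo)
    {
        /* empty: the point is just to interrupt a blocking syscall */
    }

    int
    main(void)
    {
        struct sigaction sa;
        int         fd;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handle_alrm;    /* note: sa_flags lacks SA_RESTART */
        sigaction(SIGALRM, &sa, NULL);

        mkfifo("/tmp/demo_fifo", 0600); /* hypothetical path */
        alarm(2);                       /* stand-in for authentication_timeout */

        fd = open("/tmp/demo_fifo", O_RDONLY);  /* blocks: no writer yet */
        if (fd < 0)
            printf("open() failed: %s\n", strerror(errno)); /* EINTR here */
        return 0;
    }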
Richard Poole <richard@2ndQuadrant.com> writes:
> In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
> seconds, even if authentication has been successful. Most of the time
> this doesn't hurt anyone, but there are cases, such as when the backend
> is doing the open() of a backend copy, where it breaks things and results
> in an error getting reported to the client. In particular, if you're doing
> a copy from a FIFO, it is normal for open() to block until the process at
> the other end has data ready, so you're very likely to have it interrupted
> by the SIGALRM and fail.

> To see the SIGALRM, just run psql, then determine your backend's pid,
> attach an strace to it, and wait 60 seconds, or whatever you've got
> authentication_timeout set to.

> This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
> side-effect of speeding things up by getting rid of setitimer() calls;
> it's not obvious what's a good way to fix it without losing the benefits
> of that commit.

Ugh.  It doesn't sound very practical to try to guarantee that every
single kernel call in the backend is set up to recover from EINTR,
even though ideally they should all be able to cope.  Maybe we have to
revert those signal-handling changes.

			regards, tom lane
I wrote:
> Richard Poole <richard@2ndQuadrant.com> writes:
>> This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
>> side-effect of speeding things up by getting rid of setitimer() calls;
>> it's not obvious what's a good way to fix it without losing the benefits
>> of that commit.

> Ugh.  It doesn't sound very practical to try to guarantee that every
> single kernel call in the backend is set up to recover from EINTR,
> even though ideally they should all be able to cope.

On reflection though, we *do* need to make them cope, because even
without lazy SIGALRM disable, any such place is still at risk.  We
surely must allow for the possibility of SIGHUP arriving at any instant,
for example.

Have you identified the specific place that's giving you trouble?

			regards, tom lane
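P.S. By "cope" I mean the standard retry idiom at each call site,
roughly like this (the generic Unix pattern, not a quote from our
sources; the function name is illustrative):

    #include <errno.h>
    #include <fcntl.h>

    /* Retry a blocking open() until it either succeeds or fails for a
     * reason other than signal interruption. */
    static int
    open_retry(const char *path, int flags)
    {
        int         fd;

        do
        {
            fd = open(path, flags);
        } while (fd < 0 && errno == EINTR);

        return fd;
    }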
On 2013-06-15 10:45:28 -0400, Tom Lane wrote:
> I wrote:
>> Richard Poole <richard@2ndQuadrant.com> writes:
>>> This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
>>> side-effect of speeding things up by getting rid of setitimer() calls;
>>> it's not obvious what's a good way to fix it without losing the benefits
>>> of that commit.

>> Ugh.  It doesn't sound very practical to try to guarantee that every
>> single kernel call in the backend is set up to recover from EINTR,
>> even though ideally they should all be able to cope.

> On reflection though, we *do* need to make them cope, because even
> without lazy SIGALRM disable, any such place is still at risk.  We
> surely must allow for the possibility of SIGHUP arriving at any instant,
> for example.

All the signal handlers we register, including SIGHUP's, set SA_RESTART,
except the one for SIGALRM... I wonder if we can rejigger things so we
don't need that exception? I am not sure if there's still a reason for
that decision inside the backend.

Greetings,

Andres Freund
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-15 10:45:28 -0400, Tom Lane wrote:
>> On reflection though, we *do* need to make them cope, because even
>> without lazy SIGALRM disable, any such place is still at risk.  We
>> surely must allow for the possibility of SIGHUP arriving at any instant,
>> for example.

> All the signal handlers we register, including SIGHUP's, set SA_RESTART,
> except the one for SIGALRM... I wonder if we can rejigger things so we
> don't need that exception? I am not sure if there's still a reason for
> that decision inside the backend.

Hmm.  Personally I'd rather go in the other direction:
http://www.postgresql.org/message-id/12819.1183306286@sss.pgh.pa.us
but maybe the path of least resistance is to set SA_RESTART for SIGALRM
too.  Given our current usage of it, there seems no worse harm in
allowing kernel calls to restart after a SIGALRM than after any other
signal.

			regards, tom lane
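P.S. Concretely, that would mean installing the SIGALRM handler with
SA_RESTART set, along these lines (a sketch using the bare POSIX API;
the function name is illustrative, not our actual pqsignal() code):

    #include <signal.h>
    #include <string.h>

    typedef void (*sig_handler_t) (int);

    static sig_handler_t
    install_alarm_handler(sig_handler_t handler)
    {
        struct sigaction act,
                    oact;

        memset(&act, 0, sizeof(act));
        act.sa_handler = handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_RESTART;  /* let interrupted syscalls resume */

        if (sigaction(SIGALRM, &act, &oact) < 0)
            return SIG_ERR;
        return oact.sa_handler;     /* return the previous handler */
    }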
On 2013-06-15 11:29:45 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-06-15 10:45:28 -0400, Tom Lane wrote:
>>> On reflection though, we *do* need to make them cope, because even
>>> without lazy SIGALRM disable, any such place is still at risk.  We
>>> surely must allow for the possibility of SIGHUP arriving at any instant,
>>> for example.

>> All the signal handlers we register, including SIGHUP's, set SA_RESTART,
>> except the one for SIGALRM... I wonder if we can rejigger things so we
>> don't need that exception? I am not sure if there's still a reason for
>> that decision inside the backend.

> Hmm.  Personally I'd rather go in the other direction:
> http://www.postgresql.org/message-id/12819.1183306286@sss.pgh.pa.us
> but maybe the path of least resistance is to set SA_RESTART for SIGALRM
> too.  Given our current usage of it, there seems no worse harm in
> allowing kernel calls to restart after a SIGALRM than after any other
> signal.

I am not actually objecting to that reasoning; I think it would be
rather useful to get there. But I don't think it's realistic to do it at
this point in the 9.3 cycle. It seems like we would be fighting bugs
around that for quite a while: we have a large number of syscall sites
where we don't retry on EINTR/EAGAIN. And, as you note, that's not even
counting third-party code.

I'd be happy if we decide to go that way at the beginning of the 9.4
cycle, i.e. do it right now on HEAD for all syscalls. That way we have a
chance to find most of the bad callsites before releasing.

Greetings,

Andres Freund
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-15 11:29:45 -0400, Tom Lane wrote:
>> Hmm.  Personally I'd rather go in the other direction:
>> http://www.postgresql.org/message-id/12819.1183306286@sss.pgh.pa.us

> I am not actually objecting to that reasoning; I think it would be
> rather useful to get there.

> But I don't think it's realistic to do it at this point in the 9.3
> cycle. It seems like we would be fighting bugs around that for quite a
> while: we have a large number of syscall sites where we don't retry on
> EINTR/EAGAIN. And, as you note, that's not even counting third-party
> code.

Yeah, it's the issue of third-party code within the backend (perl,
python, etc etc etc etc) that really makes complete EINTR-proofing seem
a bit impractical.  Back in the day I was also worried about platforms
that didn't have SA_RESTART, but that's probably pretty much the empty
set by now (is anyone aware of a modern platform on which configure
fails to set HAVE_POSIX_SIGNALS?).  Also, our switch to latches for
sleeping purposes should have ameliorated the issue of signals failing
to wake processes when we wanted them to.

Let's turn on SA_RESTART for SIGALRM in HEAD and 9.3 and see what beta
testing says.

			regards, tom lane
I wrote:
> ... Also, our switch to latches for sleeping purposes
> should have ameliorated the issue of signals failing to wake processes
> when we wanted them to.

> Let's turn on SA_RESTART for SIGALRM in HEAD and 9.3 and see what beta
> testing says.

I experimented with this a bit on my old HPUX box (which is one of the
ones where SA_RESTART signals don't terminate a select()), and soon
found an annoying behavior:

regression=# \timing
Timing is on.
regression=# set statement_timeout TO 2000;
SET
Time: 4.983 ms
regression=# select generate_series(1,1000000);
ERROR:  canceling statement due to statement timeout
Time: 2015.015 ms
regression=# select pg_sleep(10);
ERROR:  canceling statement due to statement timeout
Time: 3009.932 ms

Note that the pg_sleep() call took three seconds to cancel rather than
two.  This happens because pg_sleep() is implemented with a loop around
pg_usleep(1000000), and the latter no longer is terminated by a SIGALRM,
so the timeout isn't noticed until the next one-second loop iteration.

It seems like it'd be a good idea to replace the pg_sleep implementation
with something involving WaitLatch, which would not only ensure prompt
response to SIGALRM (and other signals, eg SIGQUIT), but would eliminate
useless process wakeups during a sleep.

In general, we might want to consider replacing long sleep intervals
with WaitLatch operations.  I thought for a bit about trying to turn
pg_usleep itself into a WaitLatch call; but it's also used in frontend
code where that wouldn't work, and anyway it's not clear this would be
a good thing for short sleeps.

			regards, tom lane
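P.S. Roughly what I have in mind, as a sketch against the latch API as
it stands today (the fmgr boilerplate is elided, the function name is
illustrative, and a real version would also want to react to
WL_POSTMASTER_DEATH rather than merely requesting it):

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "storage/proc.h"
    #include "utils/timestamp.h"

    static void
    sleep_using_latch(float8 secs)
    {
        TimestampTz endtime =
            TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                        (int) (secs * 1000.0));

        for (;;)
        {
            long        secs_left;
            int         usecs_left;
            long        delay_ms;

            CHECK_FOR_INTERRUPTS();     /* notice cancel/die promptly */

            TimestampDifference(GetCurrentTimestamp(), endtime,
                                &secs_left, &usecs_left);
            if (secs_left == 0 && usecs_left == 0)
                break;                  /* deadline reached */
            delay_ms = secs_left * 1000L + usecs_left / 1000 + 1;

            /* sleep until timeout, latch set, or postmaster death */
            (void) WaitLatch(&MyProc->procLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                             delay_ms);
            ResetLatch(&MyProc->procLatch);
        }
    }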
Tom Lane wrote:
> In general, we might want to consider replacing long sleep intervals
> with WaitLatch operations.  I thought for a bit about trying to turn
> pg_usleep itself into a WaitLatch call; but it's also used in frontend
> code where that wouldn't work, and anyway it's not clear this would be
> a good thing for short sleeps.

How about having an #ifndef FRONTEND code path that uses the latch, and
a plain sleep otherwise?  And maybe use the plain sleep for short sleeps
in the backend also, to avoid the latch overhead.  I notice we already
have three implementations of pg_usleep.

Álvaro Herrera
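P.S. Something like this is what I mean, as a sketch (the one-second
cutoff for "short" sleeps is an arbitrary illustrative choice, and the
backend branch leans on the latch API):

    /* needs <sys/select.h> and <sys/time.h>; in the backend, also
     * "storage/latch.h" and "storage/proc.h" for WaitLatch and MyProc */

    void
    pg_usleep(long microsec)
    {
        if (microsec <= 0)
            return;

    #ifndef FRONTEND
        if (microsec >= 1000000L)   /* long sleeps: wait on the latch */
        {
            (void) WaitLatch(&MyProc->procLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                             microsec / 1000);
            ResetLatch(&MyProc->procLatch);
            return;
        }
    #endif

        /* short sleeps, and all frontend callers: plain kernel sleep */
        {
            struct timeval delay;

            delay.tv_sec = microsec / 1000000L;
            delay.tv_usec = microsec % 1000000L;
            (void) select(0, NULL, NULL, NULL, &delay);
        }
    }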
* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> How about having an #ifndef FRONTEND code path that uses the latch, and
> a plain sleep otherwise?  And maybe use the plain sleep for short sleeps
> in the backend also, to avoid the latch overhead.  I notice we already
> have three implementations of pg_usleep.

Is there really serious overhead from using latches..?  I thought much
of the point of that approach was specifically to minimize overhead...

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> Is there really serious overhead from using latches..?

IIRC there's at least one gettimeofday() involved in WaitLatch.  There
are platforms on which that already costs more than you want to spend
for, say, CommitDelay.

			regards, tom lane