Thread: Bug in walsender when calling out to do_pg_stop_backup (and others?)
When walsender calls out to do_pg_stop_backup() (during base backups), it is not possible to terminate the process with a SIGTERM - it requires a SIGKILL. This can leave unkillable backends, for example if archive_mode is on and archive_command is failing (or not set). A similar thing would happen in other cases if walsender calls out to something that would block (do_pg_start_backup(), for example), but the stop one is easy to provoke. ISTM one way to fix it is the attached, which is to have walsender set the "global" flags saying that we have received SIGTERM, which in turn causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly exit the process. AFAICT it works fine. Any holes in this approach? Second, I wonder if we should add a SIGINT handler as well, which would make it possible to send a cancel signal. Given that the end result would be the same (at least if we want to keep with the "walsender is simple" path), I'm not sure it's necessary - but it would at least help those doing pg_cancel_backend()... thoughts? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Attachment
On Wed, Oct 5, 2011 at 10:30 PM, Magnus Hagander <magnus@hagander.net> wrote: > When walsender calls out to do_pg_stop_backup() (during base backups), > it is not possible to terminate the process with a SIGTERM - it > requires a SIGKILL. This can leave unkillable backends for example if > archive_mode is on and archive_command is failing (or not set). A > similar thing would happen in other cases if walsender calls out to > something that would block (do_pg_start_backup() for example), but the > stop one is easy to provoke. Good catch! > ISTM one way to fix it is the attached, which is to have walsender set > the "global" flags saying that we have received SIGTERM, which in turn > causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly > exit the process. AFAICT it works fine. Any holes in this approach? Very simple patch. Looks fine. > Second, I wonder if we should add a SIGINT handler as well, that would > make it possible to send a cancel signal. Given that the end result > would be the same (at least if we want to keep with the "walsender is > simple" path), I'm not sure it's necessary - but it would at least > help those doing pg_cancel_backend()... thoughts? I don't think that's necessary because, as you suggested, there is no use case for it *now*. We can add that handler when someone proposes a feature which requires it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Oct5, 2011, at 15:30 , Magnus Hagander wrote: > When walsender calls out to do_pg_stop_backup() (during base backups), > it is not possible to terminate the process with a SIGTERM - it > requires a SIGKILL. This can leave unkillable backends for example if > archive_mode is on and archive_command is failing (or not set). A > similar thing would happen in other cases if walsender calls out to > something that would block (do_pg_start_backup() for example), but the > stop one is easy to provoke. Hm, this seems to be related to another buglet I noticed a while ago, but then forgot about again. If one terminates pg_basebackup while it's waiting for all required WAL to be archived, the backend process only exits once that waiting phase is over. If, like in your failure case, archive_command fails indefinitely (or isn't set), the backend process stays around forever. Your patch would improve that only insofar as it'd at least allow an immediate shutdown request to succeed - as it stands, that doesn't work because, as you mentioned, the blocked walsender doesn't handle SIGTERM. The question is, should we do more? To me, it'd make sense to terminate a backend once its connection is gone. We could, for example, make pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a broken connection the same way as a SIGINT or SIGTERM. Thoughts? > ISTM one way to fix it is the attached, which is to have walsender set > the "global" flags saying that we have received SIGTERM, which in turn > causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly > exit the process. AFAICT it works fine. Any holes in this approach? Seems sensible, but my knowledge of these code paths is quite limited... > Second, I wonder if we should add a SIGINT handler as well, that would > make it possible to send a cancel signal. 
Given that the end result > would be the same (at least if we want to keep with the "walsender is > simple" path), I'm not sure it's necessary - but it would at least > help those doing pg_cancel_backend()... thoughts? If all that's needed is a simple SIGINT handler that sets one flag, I'd say let's add it. best regards, Florian Pflug
On Thu, Oct 6, 2011 at 14:34, Florian Pflug <fgp@phlo.org> wrote: > On Oct5, 2011, at 15:30 , Magnus Hagander wrote: >> When walsender calls out to do_pg_stop_backup() (during base backups), >> it is not possible to terminate the process with a SIGTERM - it >> requires a SIGKILL. This can leave unkillable backends for example if >> archive_mode is on and archive_command is failing (or not set). A >> similar thing would happen in other cases if walsender calls out to >> something that would block (do_pg_start_backup() for example), but the >> stop one is easy to provoke. > > Hm, this seems to be related to another buglet I noticed a while ago, > but then forgot about again. If one terminates pg_basebackup while it's > waiting for all required WAL to be archived, the backend process only > exits once that waiting phase is over. If, like in your failure case, > archive_command fails indefinitely (or isn't set), the backend process > stays around forever. Yes. > Your patch would improve that only insofar as it'd at least allow an > immediate shutdown request to succeed - as it stands, that doesn't work > because, as you mentioned, the blocked walsender doesn't handle SIGTERM. Exactly. > The question is, should we do more? To me, it'd make sense to terminate > a backend once its connection is gone. We could, for example, make > pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a > broken connection the same way as a SIGINT or SIGTERM. The problem here is that we're hanging at a place where we don't touch the socket. So we won't notice the socket is gone. We'd have to do a select() or something like that at regular intervals to make sure it's there, no? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Thu, Oct 6, 2011 at 04:22, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Oct 5, 2011 at 10:30 PM, Magnus Hagander <magnus@hagander.net> wrote: >> When walsender calls out to do_pg_stop_backup() (during base backups), >> it is not possible to terminate the process with a SIGTERM - it >> requires a SIGKILL. This can leave unkillable backends for example if >> archive_mode is on and archive_command is failing (or not set). A >> similar thing would happen in other cases if walsender calls out to >> something that would block (do_pg_start_backup() for example), but the >> stop one is easy to provoke. > > Good catch! > >> ISTM one way to fix it is the attached, which is to have walsender set >> the "global" flags saying that we have received sigterm, which in turn >> causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly >> exit the process. AFAICT it works fine. Any holes in this approach? > > Very simple patch. Looks fine. Ok, thanks. Applied. >> Second, I wonder if we should add a SIGINT handler as well, that would >> make it possible to send a cancel signal. Given that the end result >> would be the same (at least if we want to keep with the "walsender is >> simple" path), I'm not sure it's necessary - but it would at least >> help those doing pg_cancel_backend()... thoughts? > > I don't think that's necessary because, as you suggested, there is no > use case for *now*. We can add that handler when someone proposes > the feature which requires that. Yeah. It's certainly not something backpatch-worthy. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Oct6, 2011, at 21:48 , Magnus Hagander wrote: >> The question is, should we do more? To me, it'd make sense to terminate >> a backend once its connection is gone. We could, for example, make >> pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a >> broken connection the same way as a SIGINT or SIGTERM. > > The problem here is that we're hanging at a place where we don't touch > the socket. So we won't notice the socket is gone. We'd have to do a > select() or something like that at regular intervals to make sure it's > there, no? We do emit NOTICEs saying "pg_stop_backup still waiting ... " repeatedly, so we should notice a dead connection sooner or later. When I tried, I even got a message in the log complaining about the "broken pipe". As it stands, the interval between two NOTICEs grows exponentially - we send the first after waiting 5 seconds, the next after waiting 60 seconds, and then after waiting 120, 240, 480, ... seconds. This means that the backend would in the worst case linger the same amount of time *after* pg_basebackup was cancelled as pg_basebackup waited *before* it was cancelled. It'd be nice to generally terminate a backend if the client vanishes, but so far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately sends a signal *every time* the fd becomes readable or writeable, not only on EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could make the postmaster keep the fd's around even after forking a backend, and make it watch for broken connections using select(). But with a large max_backends setting, we'd risk running out of fds in the postmaster... best regards, Florian Pflug
On Thu, Oct 6, 2011 at 23:46, Florian Pflug <fgp@phlo.org> wrote: > On Oct6, 2011, at 21:48 , Magnus Hagander wrote: >>> The question is, should we do more? To me, it'd make sense to terminate >>> a backend once its connection is gone. We could, for example, make >>> pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a >>> broken connection the same way as a SIGINT or SIGTERM. >> >> The problem here is that we're hanging at a place where we don't touch >> the socket. So we won't notice the socket is gone. We'd have to do a >> select() or something like that at regular intervals to make sure it's >> there, no? > > We do emit NOTICEs saying "pg_stop_backup still waiting ... " repeatedly, > so we should notice a dead connection sooner or later. When I tried, I even > got a message in the log complaining about the "broken pipe". Ah, good point, that should be doable. Forgot about that message... > As it stands, the interval between two NOTICEs grows exponentially - we > send the first after waiting 5 seconds, the next after waiting 60 seconds, > and then after waiting 120, 240, 480, ... seconds. This means that the > backend would in the worst case linger the same amount of time *after* > pg_basebackup was cancelled as pg_basebackup waited *before* it was > cancelled. > > It'd be nice to generally terminate a backend if the client vanishes, but so > far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately > sends a signal *every time* the fd becomes readable or writeable, not only on > EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could > make the postmaster keep the fd's around even after forking a backend, and > make it watch for broken connections using select(). But with a large max_backends > setting, we'd risk running out of fds in the postmaster... Ugh. Yeah. But at least catching it and terminating it when we *do* notice it's down would certainly make sense... 
-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Oct10, 2011, at 21:25 , Magnus Hagander wrote: > On Thu, Oct 6, 2011 at 23:46, Florian Pflug <fgp@phlo.org> wrote: >> It'd be nice to generally terminate a backend if the client vanishes, but so >> far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately >> sends a signal *every time* the fd becomes readable or writeable, not only on >> EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could >> make the postmaster keep the fd's around even after forking a backend, and >> make it watch for broken connections using select(). But with a large max_backends >> setting, we'd risk running out of fds in the postmaster... > > Ugh. Yeah. But at least catching it and terminating it when we *do* > notice it's down would certainly make sense... I'll try to put together a patch that sets a flag if we discover a broken connection in pq_flush, and tests that flag in CHECK_FOR_INTERRUPTS. Unless you wanna, of course. best regards, Florian Pflug
On Tue, Oct 11, 2011 at 03:29, Florian Pflug <fgp@phlo.org> wrote: > On Oct10, 2011, at 21:25 , Magnus Hagander wrote: >> On Thu, Oct 6, 2011 at 23:46, Florian Pflug <fgp@phlo.org> wrote: >>> It'd be nice to generally terminate a backend if the client vanishes, but so >>> far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately >>> sends a signal *every time* the fd becomes readable or writeable, not only on >>> EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could >>> make the postmaster keep the fd's around even after forking a backend, and >>> make it watch for broken connections using select(). But with a large max_backends >>> setting, we'd risk running out of fds in the postmaster... >> >> Ugh. Yeah. But at least catching it and terminating it when we *do* >> notice it's down would certainly make sense... > > I'll try to put together a patch that sets a flag if we discover a broken > connection in pq_flush, and tests that flag in CHECK_FOR_INTERRUPTS. Unless you > wanna, of course. Please do, I won't have time to even think about it until after pgconf.eu anyway ;) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Oct11, 2011, at 09:21 , Magnus Hagander wrote: > On Tue, Oct 11, 2011 at 03:29, Florian Pflug <fgp@phlo.org> wrote: >> On Oct10, 2011, at 21:25 , Magnus Hagander wrote: >>> On Thu, Oct 6, 2011 at 23:46, Florian Pflug <fgp@phlo.org> wrote: >>>> It'd be nice to generally terminate a backend if the client vanishes, but so >>>> far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately >>>> sends a signal *every time* the fd becomes readable or writeable, not only on >>>> EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could >>>> make the postmaster keep the fd's around even after forking a backend, and >>>> make it watch for broken connections using select(). But with a large max_backends >>>> setting, we'd risk running out of fds in the postmaster... >>> >>> Ugh. Yeah. But at least catching it and terminating it when we *do* >>> notice it's down would certainly make sense... >> >> I'll try to put together a patch that sets a flag if we discover a broken >> connection in pq_flush, and tests that flag in CHECK_FOR_INTERRUPTS. Unless you >> wanna, of course. > > Please do, I won't have time to even think about it until after > pgconf.eu anyway ;) Ok, here's a first cut. I've based this on how query cancellation due to recovery conflicts works - internal_flush() sets QueryCancelPending and ClientConnectionLostPending. If QueryCancelPending is set, CHECK_FOR_INTERRUPTS checks ClientConnectionLostPending, and if it's set it does ereport(FATAL). I've only done light testing so far - basically the only case I've tested is killing pg_basebackup while it's waiting for all required WAL to be archived. best regards, Florian Pflug
Attachment
On Oct19, 2011, at 17:47 , Greg Jaskiewicz wrote: > On 15 Oct 2011, at 11:31, Florian Pflug wrote: >> >> Ok, here's a first cut. > > So I looked at the patch, and the first thing that pops out is the lack of the volatile keyword where the ClientConnectionLostPending variable is defined. Is that done on purpose? That's on purpose. volatile is only necessary for variables which are either accessed from within signal handlers or which live in shared memory. Neither is true for ClientConnectionLostPending, so non-volatile should be fine. > Otherwise the patch itself looks ok. > I haven't tested the code, just reviewed the patch itself. And it obviously needs testing; should be easy to follow your original problem description. Yeah, further testing is on my todo list. The interesting case is probably what happens if the connection is dropped while there's already a cancellation request pending. And also the other way around - a cancellation request arriving after we've already discovered that the connection is gone. > Btw, I just tried to do it through commitfest.postgresql.org, but before I get my head around how to add myself to the reviewer list there - I thought I'll just send this response here. You just need to click "Edit Patch" and put your name into the Reviewer field. You do need a postgres community account to edit patches, but the signup process is quite quick and painless AFAIR. best regards, Florian Pflug
On 19 Oct 2011, at 17:54, Florian Pflug wrote: > On Oct19, 2011, at 17:47 , Greg Jaskiewicz wrote: >> On 15 Oct 2011, at 11:31, Florian Pflug wrote: >>> >>> Ok, here's a first cut. >> >> So I looked at the patch, and the first thing that pops out is the lack of the volatile keyword where the ClientConnectionLostPending variable is defined. Is that done on purpose? > > That's on purpose. volatile is only necessary for variables which are either accessed from within signal handlers or which live in shared memory. Neither is true for ClientConnectionLostPending, so non-volatile should be fine. Ok, cool. I'm aware of the reasons behind volatile, just noticed that some other flags used in a similar way are marked as such. At the end of the day, this is just a hint to the compiler anyway. > >> Otherwise the patch itself looks ok. >> I haven't tested the code, just reviewed the patch itself. And it obviously needs testing; should be easy to follow your original problem description. > > Yeah, further testing is on my todo list. The interesting case is probably what happens if the connection is dropped while there's already a cancellation request pending. And also the other way around - a cancellation request arriving after we've already discovered that the connection is gone. > >> Btw, I just tried to do it through commitfest.postgresql.org, but before I get my head around how to add myself to the reviewer list there - I thought I'll just send this response here. > > You just need to click "Edit Patch" and put your name into the Reviewer field. You do need a postgres community account to edit patches, but the signup process is quite quick and painless AFAIR. Ok, clicking 'edit patch' sounds a bit big. Probably better would be to just be able to click on some sort of "I'm in" button/checkbox type of thing.
On Oct19, 2011, at 18:05 , Greg Jaskiewicz wrote: > On 19 Oct 2011, at 17:54, Florian Pflug wrote: > >> On Oct19, 2011, at 17:47 , Greg Jaskiewicz wrote: >>> On 15 Oct 2011, at 11:31, Florian Pflug wrote: >>>> >>>> Ok, here's a first cut. >>> >>> So I looked at the patch, and the first thing that pops out is the lack of the volatile keyword where the ClientConnectionLostPending variable is defined. Is that done on purpose? >> >> That's on purpose. volatile is only necessary for variables which are either accessed from within signal handlers or which live in shared memory. Neither is true for ClientConnectionLostPending, so non-volatile should be fine. > Ok, cool. > I'm aware of the reasons behind volatile, just noticed that some other flags used in a similar way are marked as such. At the end of the day, this is just a hint to the compiler anyway. All the other flags which indicate cancellation reasons are set from signal handlers, I believe. We could of course mark ClientConnectionLostPending as volatile just to be consistent. Not sure whether that's a good idea, or not. It might prevent a mistake should we ever add code to detect lost connections asynchronously (i.e., from somewhere else than pq_flush). And the cost is probably negligible, because CHECK_FOR_INTERRUPTS tests for InterruptPending before calling ProcessInterrupts(), so we only pay the cost of volatile if there's actually an interrupt pending. But I still think it's better to add qualifiers such as volatile only when really necessary. A comment about why it *isn't* volatile is probably in order, though, so I'll add that in the next version of the patch. best regards, Florian Pflug PS: Thanks for the review. It's very much appreciated!
On 15 Oct 2011, at 11:31, Florian Pflug wrote: > > Ok, here's a first cut. So I looked at the patch, and the first thing that pops out is the lack of the volatile keyword where the ClientConnectionLostPending variable is defined. Is that done on purpose? Otherwise the patch itself looks ok. I haven't tested the code, just reviewed the patch itself. And it obviously needs testing; should be easy to follow your original problem description. Btw, I just tried to do it through commitfest.postgresql.org, but before I get my head around how to add myself to the reviewer list there - I thought I'll just send this response here.
On 19 Oct 2011, at 18:28, Florian Pflug wrote: > All the other flags which indicate cancellation reasons are set from signal handlers, I believe. We could of course mark ClientConnectionLostPending as volatile just to be consistent. Not sure whether that's a good idea, or not. It might prevent a mistake should we ever add code to detect lost connections asynchronously (i.e., from somewhere else than pq_flush). And the cost is probably negligible, because CHECK_FOR_INTERRUPTS tests for InterruptPending before calling ProcessInterrupts(), so we only pay the cost of volatile if there's actually an interrupt pending. But I still think it's better to add qualifiers such as volatile only when really necessary. A comment about why it *isn't* volatile is probably in order, though, so I'll add that in the next version of the patch. > Makes sense. I had to ask, because it sticks out. And indeed there is a possibility that someone will one day try to use it from a signal handler, etc. > best regards, > Florian Pflug > > PS: Thanks for the review. It's very much appreciated! No problem. Got inspired by the talk I attended here at pgconf.eu. Seems like a good idea to do these things; I have years of experience as a developer and am doing (relatively) well thanks to PostgreSQL - makes sense to contribute something back.
Greg Jaskiewicz wrote: > > On 19 Oct 2011, at 18:28, Florian Pflug wrote: > > All the other flags which indicate cancellation reasons are set from signal handlers, I believe. We could of course mark ClientConnectionLostPending as volatile just to be consistent. Not sure whether that's a good idea, or not. It might prevent a mistake should we ever add code to detect lost connections asynchronously (i.e., from somewhere else than pq_flush). And the cost is probably negligible, because CHECK_FOR_INTERRUPTS tests for InterruptPending before calling ProcessInterrupts(), so we only pay the cost of volatile if there's actually an interrupt pending. But I still think it's better to add qualifiers such as volatile only when really necessary. A comment about why it *isn't* volatile is probably in order, though, so I'll add that in the next version of the patch. > > > Makes sense. > > I had to ask, because it sticks out. And indeed there is a possibility > that someone will one day try to use it from a signal handler, etc. A C comment might help there. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: Bug in walsender when calling out to do_pg_stop_backup (and others?)
From: Heikki Linnakangas
On 19.10.2011 19:41, Greg Jaskiewicz wrote: > On 19 Oct 2011, at 18:28, Florian Pflug wrote: >> All the other flags which indicate cancellation reasons are set from signal handlers, I believe. We could of course mark ClientConnectionLostPending as volatile just to be consistent. Not sure whether that's a good idea, or not. It might prevent a mistake should we ever add code to detect lost connections asynchronously (i.e., from somewhere else than pq_flush). And the cost is probably negligible, because CHECK_FOR_INTERRUPTS tests for InterruptPending before calling ProcessInterrupts(), so we only pay the cost of volatile if there's actually an interrupt pending. But I still think it's better to add qualifiers such as volatile only when really necessary. A comment about why it *isn't* volatile is probably in order, though, so I'll add that in the next version of the patch. >> > Makes sense. > > I had to ask, because it sticks out. And indeed there is a possibility that someone will one day try to use it from a signal handler, etc. Let's just mark it as volatile. It's not clear to me that we'll never set or read the flag while in a signal handler. We probably don't, but what if ImmediateInterruptOK is 'true', for example, just when we fail to send a response and try to set the flags? In that case, we have to be careful that ClientConnectionLost is set before InterruptPending (which we can only guarantee if both are volatile, I believe); otherwise a signal might arrive when we've just set InterruptPending, but not yet ClientConnectionLost. ProcessInterrupts() would clear InterruptPending, but not see ClientConnectionLost, so we would lose the interrupt. That's not serious, the window for it would be extremely small, and I don't think it can actually happen as the code stands, because we never try to send anything while ImmediateInterruptOK == true. But better safe than sorry.
One small change I'd like to make is to treat the loss of connection more as a new "top-level" event, rather than as a new reason for query cancellation. A lost connection is more like receiving SIGTERM, really. Please take a look at the attached patch. I've been testing this by doing "SELECT pg_sleep(1), a from generate_series(1,1000) a;", and killing the connection with "killall -9 psql". One remaining issue I have with this is the wording of the error message. The closest existing message we have is in old-mode COPY: ereport(FATAL, (errcode(ERRCODE_CONNECTION_FAILURE), errmsg("connection lost during COPY to stdout"))); In the patch, I made the new message just "connection lost", but does anyone else have a better opinion on that? Perhaps it should be "connection lost while sending response to client". Or "connection lost during execution of statement", but that might not be accurate if we're not executing a statement at the moment, but just sending a "sync" message or similar. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Detecting loss of client connection (was Re: Bug in walsender when calling out to do_pg_stop_backup (and others?))
From: Heikki Linnakangas
Thinking about this some more, why don't we just elog(FATAL) in internal_flush(), if the write fails, instead of setting the flag and waiting for the next CHECK_FOR_INTERRUPTS(). That sounds scary at first, but why not? There's this comment in internal_flush(): > /* > * Careful: an ereport() that tries to write to the client would > * cause recursion to here, leading to stack overflow and core > * dump! This message must go *only* to the postmaster log. That's understandable. > * If a client disconnects while we're in the midst of output, we > * might write quite a bit of data before we get to a safe query > * abort point. So, suppress duplicate log messages. But what about this? Tracing back the callers, I don't see any that would be upset if we just threw an error there. One scary aspect is if you're within a critical section, but I don't think we currently send any messages while in a critical section. And we could refrain from throwing the error if we're in a critical section, to be safe. > */ > if (errno != last_reported_send_errno) > { > last_reported_send_errno = errno; > ereport(COMMERROR, > (errcode_for_socket_access(), > errmsg("could not send data to client: %m"))); > } -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Bug in walsender when calling out to do_pg_stop_backup (and others?)
From: Heikki Linnakangas
On 03.12.2011 18:37, Heikki Linnakangas wrote: > One small change I'd like to make is to treat the loss of connection > more as a new "top-level" event, rather than as a new reason for query > cancellation. A lost connection is more like receiving SIGTERM, really. > Please take a look at the attached patch. I've been testing this by > doing "SELECT pg_sleep(1), a from generate_series(1,1000) a;", and > killing the connection with "killall -9 psql". Ok, committed this. > One remaining issue I have with this is the wording of the error > message. The closest existing message we have is in old-mode COPY: > > ereport(FATAL, > (errcode(ERRCODE_CONNECTION_FAILURE), > errmsg("connection lost during COPY to stdout"))); > > In the patch, I made the new message just "connection lost", but does > anyone else have a better opinion on that? Perhaps it should be > "connection lost while sending response to client". Or "connection lost > during execution of statement", but that might not be accurate if we're > not executing a statement at the moment, but just sending a "sync" > message or similar. I made the error message "connection to client lost" in the end. I still wonder if it would be safe to simply elog(FATAL) in internal_flush(), but didn't dare to do that without more thorough analysis. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com