Thread: Client failure allows backend to continue
As part of the training class I did, some people tested what happens
when the client allocates tons of memory to store a result and aborts.

What we found was that though elog was properly called:

	elog(COMMERROR, "pq_recvbuf: recv() failed: %m");

(I think that was the message.) the backend did not exit and kept
eating CPU.  I think the problem is that the elog code only exits on
ERROR, not COMMERROR.  Is there some way to fix this?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> As part of the training class I did, some people tested what happens
> when the client allocates tons of memory to store a result and aborts.
>
> What we found was that though elog was properly called:
>
> 	elog(COMMERROR, "pq_recvbuf: recv() failed: %m");
>
> (I think that was the message.) the backend did not exit and kept
> eating CPU.  I think the problem is that the elog code only exits on
> ERROR, not COMMERROR.  Is there some way to fix this?

There's been talk of setting the QueryCancel flag after detecting a
client communication failure ... but no one has ever done the legwork
to see if that works nicely, or what downsides it might have.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > As part of the training class I did, some people tested what happens
> > when the client allocates tons of memory to store a result and aborts.
> >
> > What we found was that though elog was properly called:
> >
> > 	elog(COMMERROR, "pq_recvbuf: recv() failed: %m");
> >
> > (I think that was the message.) the backend did not exit and kept
> > eating CPU.  I think the problem is that the elog code only exits on
> > ERROR, not COMMERROR.  Is there some way to fix this?
>
> There's been talk of setting the QueryCancel flag after detecting a
> client communication failure ... but no one has ever done the legwork
> to see if that works nicely, or what downsides it might have.

Why is COMMERROR not doing the longjump like ERROR?

-- 
  Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Why is COMMERROR not doing the longjump like ERROR?

Because it's defined to be like LOG.

A more useful reply might be that I'm not sure it's safe to abort in the
client I/O routines.

			regards, tom lane
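To make the distinction in this reply concrete, here is a minimal, self-contained sketch of a reporting routine in which only an ERROR-style level unwinds to a recovery point, while a COMMERROR-style level logs and returns to its caller. This is not PostgreSQL's elog implementation: every name in it (SketchLevel, sketch_elog, sketch_returns_to_caller) is hypothetical, and the level set is an assumption made for illustration.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical levels modeled on the thread; not PostgreSQL's real values. */
typedef enum { SK_LOG, SK_COMMERROR, SK_ERROR } SketchLevel;

/* Recovery point; stands in for the backend's main loop. */
static jmp_buf sketch_main_loop;

/* Write a message to stderr; only SK_ERROR unwinds to the recovery point. */
static void sketch_elog(SketchLevel level, const char *fmt, ...)
{
    va_list ap;

    assert(level == SK_LOG || level == SK_COMMERROR || level == SK_ERROR);

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);

    if (level == SK_ERROR)
        longjmp(sketch_main_loop, 1);
    /* SK_LOG and SK_COMMERROR fall through, so the caller keeps running. */
}

/*
 * Returns 1 if sketch_elog() came back to its caller,
 * 0 if control jumped to the recovery point instead.
 */
int sketch_returns_to_caller(SketchLevel level)
{
    /* volatile so the value is well defined after longjmp() */
    volatile int returned = 0;

    if (setjmp(sketch_main_loop) == 0) {
        sketch_elog(level, "recv() failed (simulated)");
        returned = 1;
    }
    return returned;
}
```

Under these assumptions, sketch_returns_to_caller(SK_COMMERROR) reports that the caller simply carried on, which matches the behavior described at the start of the thread: the failure is logged, but nothing stops the work in progress.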
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Why is COMMERROR not doing the longjump like ERROR?
>
> Because it's defined to be like LOG.
>
> A more useful reply might be that I'm not sure it's safe to abort in the
> client I/O routines.

Well, if we get an I/O error, I can't imagine why we would continue
doing anything --- are any of those recoverable?

Do we need a separate error type for I/O messages?

-- 
  Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Well, if we get an I/O error, I can't imagine why we would continue
> doing anything --- are any of those recoverable?

Well, that's what's not clear --- it's hard to tell if a write failure
is a hard error or just transient.  If we make like elog(ERROR),
returning to the main loop, and then a read from the client *doesn't*
fail, we'll try to continue ... but we've just screwed the pooch,
because we have not sent a complete message and therefore certainly have
messed up frontend/backend synchronization.  I have no idea whether it's
really possible to recover from this situation or not, but that approach
surely won't work.

If you want to take a kamikaze any-comm-error-means-we're-dead approach,
you might think about elog(FATAL).  But that tries to send a message to
the client.  Instant infinite loop, if the error is hard.

Complaints to the postmaster log, and abort at the next safe place
(*not* partway through message output) seem like the way to go to me.

> Do we need a separate error type for I/O messages?

Uh ... see COMMERROR.

			regards, tom lane
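The "complain to the log, abort at the next safe place" idea can be sketched roughly as follows. This is a toy model under stated assumptions, not backend code: the flag sketch_client_lost only loosely stands in for the QueryCancel flag mentioned earlier, and sketch_send_message, sketch_check_for_abort and sketch_run_query are hypothetical helpers invented for the example. The send routine never aborts and never writes to the client on failure; it only logs and sets the flag, which is checked between whole messages.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Hypothetical flag, loosely modeled on the QueryCancel idea in the thread. */
static volatile sig_atomic_t sketch_client_lost = 0;

/*
 * Simulated send of one complete message. It starts failing once
 * msg_no reaches fail_at. On failure it complains to the server log
 * (stderr here) and records the failure; it does not abort, and it
 * does not try to tell the broken client anything.
 */
static int sketch_send_message(int msg_no, int fail_at)
{
    if (msg_no >= fail_at) {
        fprintf(stderr, "LOG: send failed on message %d (simulated)\n", msg_no);
        sketch_client_lost = 1;
        return -1;
    }
    return 0;
}

/* Safe point: called between whole messages, never partway through one. */
static int sketch_check_for_abort(void)
{
    return sketch_client_lost != 0;
}

/*
 * Produce up to 'total' rows, sending each as one message.
 * Returns how many rows were produced before the query stopped.
 */
int sketch_run_query(int total, int fail_at)
{
    int produced = 0;

    sketch_client_lost = 0;
    for (int row = 0; row < total; row++) {
        if (sketch_check_for_abort())
            break;              /* give up at a message boundary */
        produced++;
        (void) sketch_send_message(row, fail_at);
    }
    assert(produced <= total);
    return produced;
}
```

In this model a failure on one message stops the query before the next row is produced, so the backend stops eating CPU without unwinding in the middle of message output. Whether the real I/O routines can be handled this simply is exactly the open question raised above.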
Well, setting query_cancel then seems like a logical solution because
it will exit at a reasonable point, hopefully.  Right now we have
statement_timeout and that exits at a given time, but I suppose it
doesn't exit while data is transferring, so it may be different.

-- 
  Bruce Momjian