Re: PQexec() hangs on OOM - Mailing list pgsql-bugs
From | Amit Kapila |
---|---|
Subject | Re: PQexec() hangs on OOM |
Date | |
Msg-id | CAA4eK1LyEcuX0QzpmQY5JtQkH8u6gCy8kFskn0+GOOgz5y6W2A@mail.gmail.com Whole thread Raw |
In response to | Re: PQexec() hangs on OOM (Michael Paquier <michael.paquier@gmail.com>) |
Responses |
Re: PQexec() hangs on OOM
|
List | pgsql-bugs |
On Sat, Sep 19, 2015 at 10:44 AM, Michael Paquier <michael.paquier@gmail.com > wrote: > > > On Fri, Sep 18, 2015 at 11:32 PM, Amit Kapila wrote: > > IIRC, this is required to sanely report "out of memory" error in case > > of replication protocol (master-standby setup). This loop and in-particular > > this check is quite similar to PQexecFinish() functions check and loop > > where we return last result. I think it is better to keep both the places > > in-sync > > and also I think this is required to report the error appropriately. I have > > tried manual debugging for the out of memory error for this case and > > it works well with this check and without the check it doesn't report > > the error in an appropriate way(don't remember exactly what was > > the difference). If required, I can again try to reproduce the scenario > > and share the exact report. > > I just had a look at that again... Put for example a call to pg_usleep in libpqrcv_identify_system after executing IDENTIFY_SYSTEM and before calling PQresultStatus, then take a breakpoint on the WAL receiver process when starting up a standby. This gives plenty of room to emulate the OOM failure in getCopyStart. When the check on PGRES_FATAL_ERROR is not added and when emulating the OOM immediately, libpqrcv_PQexec loops twice and thinks it can start up strrep but fails afterwards. Here is the failure seen from the standby: > LOG: started streaming WAL from primary at 0/3000000 on timeline 1 > FATAL: could not send data to WAL stream: server closed the connection unexpectedly > And from the master: > LOG: unexpected EOF on standby connection > The WAL receiver process ultimately restarts after. > > When the check on PGRES_FATAL_ERROR is added, strrep fails to start appropriately and the error is fetched correctly by the WAL receiver: > FATAL: could not start WAL streaming: out of memory > > In short: Amit seems right to have added this check. > Okay, in that case, you can revert that change in your first patch and then that patch will be Ready for committer. In second patch [1], the handling is to continue on error as below: - if (getParamDescriptions(conn)) + if (getParamDescriptions(conn, msgLength)) return; - break; + /* getParamDescriptions() moves inStart itself */ + continue; Now, assume there is "out of memory" error in getParamDescription(), the next message is 'T' which leads to a call to getRowDescriptions() and in that function if there is an error in below part of code, then I think it will just overwrite the previous "out of memory" error (I have tried by manual debugging and forcefully generating the below error): getRowDescriptions() { .. if (pqGetInt(&(result->numAttributes), 2, conn)) { /* We should not run out of data here, so complain */ errmsg = libpq_gettext("insufficient data in \"T\" message"); goto advance_and_error; } .. } Here, the point to note is that for similar handling of error from getRowDescriptions(), we just skip the consecutive 'D' messages by checking resultStatus. I think overwriting of error for this case is not the right behaviour. [1] - 0002-Fix-OOM-error-handling-in-BIND-protocol-of-libpq-2.patch With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
pgsql-bugs by date: