OOM in libpq and infinite loop with getCopyStart() - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | OOM in libpq and infinite loop with getCopyStart() |
Date | |
Msg-id | CAB7nPqSuM7MqC+viXPgy3vu+kc7ZC_HNSrSej+mvc8N7ca9WdA@mail.gmail.com Whole thread Raw |
Responses |
Re: OOM in libpq and infinite loop with getCopyStart()
Re: OOM in libpq and infinite loop with getCopyStart() |
List | pgsql-hackers |
Hi all, One year and a half ago Heikki has reported that libpq could run on an infinite loop in a couple of code paths should an OOM happen in PQExec(): http://www.postgresql.org/message-id/547480DE.4040408@vmware.com The code paths are: - BIND messages for getRowDescriptions(), or 'T' messages, fixed per 7b96bf44 - Code paths handling some calls to PQmakeEmptyPGresult(), fixed per 414bef3 - Start message for COPY command, not fixed yet. The two first problems have been addressed, not the last one. And as the last thread became long and confusing here is a revival of the topic on a new thread, with a fresh patch, and a fresh study of the problem submitted for this CF because it would be nice to get things fixed, or improved in some cases, but see below... Here is a summary of what this patch does and improves in some cases: 1) COPY IN and COPY OUT, without the patch, those two remain stuck in an infinite loop... With the patch an error "out of memory" is reported to the caller. I had a look as well at some community utilities, like pgloader and pg_bulkload, though none of them are using getCopyStart(). Do other people have ideas to things to look at? pgadmin for example? Now in the case of COPY_BOTH. let's take a couple of examples: 2) pg_receivexlog: In the case of pg_receivexlog, the patch allows libpq to throw correctly an error back to the caller, then pg_receivexlog will try to send START_REPLICATION again at its next loop: pg_receivexlog: could not send replication command "START_REPLICATION": out of memory Without the patch pg_receivexlog would just ignore the error, and at the next iteration of ParseInput() the message obtained from server would be treated because the message cursor does not move on in this case. 3) pg_recvlogical Similarly pg_recvlogical fails with the following error with the patch attached: pg_recvlogical: could not send replication command "START_REPLICATION SLOT "toto" LOGICAL 0/0": out of memory And after the failure it loops back to the next attempt, and is able to perform its streaming stuff. Without the patch, similarly to pg_receivexlog, COPY_BOTH would take the code path of ParseInput() once again and it would treat the message, ignoring any OOM problems on the way. 4) libpqwalreceiver: upon noticing the OOM error with the patch attached, a standby would fail after sending START_REPLICATION, log the OOM error properly and attempt to restart later a new WAL receiver process to try again. Without this patch, the WAL receiver would still fail to start, and log the following, unhelpful message: 2016-03-01 14:07:00 JST [84058]: [22-1] db=,user=,app=,client= LOG: invalid record length at 0/3000970 That's not really informational for the user :( Note though that in this case the WAL receiver does not remain in an infinite loop, and that it is able to restart properly because it fails with this "invalid record length error", and then start streaming at its next attempt when the WAL receiver is started again. So, that's how things are standing. I still see that it would be a gain in terms of failure visibility to log properly this OOM failure in all cases. Now it is true that if there are some applications that may expect libpq to *not* return an error and treat the COPY start message at the next loop of ParseInput(), though by looking at what is in-core things can be handled properly. Thoughts? I have registered that in the CF app, and a patch is attached. -- Michael
Attachment
pgsql-hackers by date: