Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw
Msg-id: CAA4eK1LTh=hJYauQa9ghLU75Skoez=+6HYE8ci4L2sJPkdRdDQ@mail.gmail.com
In response to: Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Thu, May 4, 2017 at 1:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 10:27 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> The logs above show that 34 seconds elapsed between starting to abort
>> the transaction and knowing that the foreign server is unreachable. It
>> looks like it took that much time for the local server to realise that
>> the foreign server is not reachable. Looking at PQcancel code, it
>> seems to be trying to connect to the foreign server to cancel the
>> query. But somehow it doesn't seem to honor connect_timeout setting.
>> Is that expected?
>
> Well, there's no code to do anything else. Regular connections go
> through connectDBComplete() which uses non-blocking mode and timed
> waits to respect connection_timeout. PQcancel() has no such handling.
> internal_cancel just opens a socket and, without setting any options
> on it, calls connect(). No loop, no timed waits, nada. So it's going
> to take as long as it takes for the operating system to notice.
>
>> Irrespective of what PQcancel does, it looks like postgres_fdw should
>> just slam the connection if query is being aborted because of
>> statement_timeout. But then pgfdw_xact_callback() doesn't seem to have
>> a way to know whether this ABORT is because of user's request to
>> cancel the query, statement timeout, an abort because of some other
>> error or a user requested abort. Except statement timeout (may be
>> user's request to cancel the query?), it should try to keep the
>> connection around to avoid any future reconnection. But I am not able
>> to see how we can provide that information to pgfdw_xact_callback().
>
> I don't think we can. In general, PostgreSQL has no facility for
> telling error cleanup handlers why the abort happened, and it wouldn't
> really be the right thing anyway.
> The fact that statement_timeout
> fired doesn't necessarily mean that the connection is dead; it could
> equally well mean that the query ran for a long time. I think we
> basically have two choices. One is to bound the amount of time we
> spend performing error cleanup, and the other is just to always drop
> the connection. A problem with the latter is that it might do the
> wrong thing if we're aborting a subtransaction but not the whole
> transaction. In that case, we need to undo changes since the relevant
> savepoint, but not the ones made before that; closing the connection
> amounts to a full rollback.
>
> Therefore, I think the patch you proposed in
> https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com
> isn't correct, because if the cancel fails it will slam the connection
> shut regardless of whether we're in pgfdw_xact_callback or
> pgfdw_subxact_callback.
>
> It looks to me like there's a fairly lengthy list of conceptually
> separate but practically related problems here:
>
> 1. PQcancel() ignores the keepalive parameters, because it makes no
> attempt to set the relevant socket options before connecting.
>
> 2. PQcancel() ignores connection_timeout, because it doesn't use
> non-blocking mode and has no wait loop.
>
> 3. There is no asynchronous equivalent of PQcancel(), so we can't use
> a loop to allow responding to interrupts while the cancel is
> outstanding (see pgfdw_get_result for an example).
>
> 4. pgfdw_xact_callback() and pgfdw_subxact_callback() use PQexec()
> rather than the asynchronous interfaces for sending queries and
> checking for results, so the SQL commands they send are not
> interruptible.
>
> 5. pgfdw_xact_callback() and pgfdw_subxact_callback() try to get the
> connection back to a good state by using ABORT TRANSACTION for the
> former and ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT for the latter.
> But if it doesn't work, then they just emit a WARNING and continue on
> as if they'd succeeded. That seems highly likely to make the next use
> of that connection fail in unpredictable ways.

In pgfdw_xact_callback, if the execution of ABORT TRANSACTION fails for any reason, then I think it will close the connection. The relevant code is:

    if (PQstatus(entry->conn) != CONNECTION_OK ||
        PQtransactionStatus(entry->conn) != PQTRANS_IDLE)

Basically, if abort transaction fails, then the transaction status won't be PQTRANS_IDLE.

Also, does it matter if pgfdw_subxact_callback fails during execution of "ROLLBACK TO SAVEPOINT"? Anyway, the user needs to execute a rollback-to-savepoint command before proceeding, and that should be good enough.

About the patch:

    + /* Rollback all remote subtransactions during abort */
    + snprintf(sql, sizeof(sql),
    +          "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
    +          curlevel, curlevel);
    + if (!pgfdw_exec_cleanup_query(entry->conn, sql))
    +     abort_cleanup_failure = true;

If the above execution fails for any reason, then it won't even allow committing the main transaction, which I am not sure is the right thing to do. I have tried it by manually editing the above command in the debugger, and the result is as below.

Create a foreign table and try the below scenario.

postgres=# begin;
BEGIN
postgres=# savepoint s1;
SAVEPOINT
postgres=# insert into foreign_test values(8);
INSERT 0 1
postgres=# savepoint s2;
SAVEPOINT
postgres=# insert into foreign_test values(generate_series(1,1000000));
Cancel request sent
WARNING:  syntax error at or near "OOLLBACK"
ERROR:  canceling statement due to user request
-- In the debugger, I have changed ROLLBACK to mimic the failure.
postgres=# Rollback to savepoint s2;
ROLLBACK
postgres=# commit;
ERROR:  connection to server "foreign_server" was lost

Considering the above, I am not sure if we should treat all failures during abort of a subtransaction as indicating a pending-cleanup state and then disallow further commits.
     #include "utils/memutils.h"
    -
    +#include "utils/syscache.h"

Looks like a spurious line removal.

> 6. postgres_fdw doesn't use PQsetnonblocking() anywhere, so even if we
> converted pgfdw_xact_callback() and pgfdw_subxact_callback() to use
> pgfdw_exec_query() or something like it rather than PQexec() to submit
> queries, they might still block if we fail to send the query, as the
> comments for pgfdw_exec_query() explain. This is possible but not
> particularly likely to happen for queries being sent out of error
> handling paths, because odds are good that the connection was in use
> just before. However, if the user is pressing ^C because the remote
> server isn't responding, it's quite probable that we'll run into this
> exact issue.
>
> 7. postgres_fdw never considers slamming the connection shut as a
> response to trouble. It seems pretty clear that this is only a safe
> response if we're aborting the toplevel transaction. If we did it for
> a subtransaction, we'd end up reconnecting if the server were accessed
> again, which would at the very least change the snapshot (note that we
> use at least REPEATABLE READ on the remote side regardless of the
> local isolation level) and might discard changes made on the remote
> side at outer transaction levels. Even for a top-level transaction,
> it might not always be the right way to proceed, as it incurs some
> overhead to reconnect, but it seems worth considering at least in the
> case where we've already got some indication that the connection is
> unhealthy (e.g. a previous attempt to clean up the connection state
> failed, as in #5, or PQcancel failed, as in Ashutosh's proposed
> patch).
>
> 8. Before 9.6, PQexec() is used in many more places, so many more
> queries are entirely non-interruptible.
>
> It seems pretty clear to me that we are not going to fix all of these
> issues in one patch. Here's a sketch of an idea for how to start
> making things better:
>
> - Add an in_abort_cleanup flag to ConnCacheEntry.
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begins abort
> cleanup, check whether the flag is set. If so, slam the connection
> shut unless that's already been done; furthermore, if the flag is set
> and we're in pgfdw_xact_callback (i.e. this is a toplevel abort),
> forget about the connection entry entirely. On the other hand, if the
> flag is not set, set the flag and attempt abort cleanup. If we
> succeed, clear the flag.
>
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begins
> pre-commit processing, check whether the flag is set. If so, throw an
> ERROR, so that we switch over to abort processing.
>
> - Change uses of PQexec() in the abort path to use pgfdw_exec_query()
> instead. If we exit pgfdw_exec_query() with an error, we'll re-enter
> the abort path, but now in_abort_cleanup will be set, so we'll just
> drop the connection (and force any outer transaction levels to abort
> as well).
>
> - For bonus points, give pgfdw_exec_query() an optional timeout
> argument, and set it to 30 seconds or so when we're doing abort
> cleanup. If the timeout expires, it errors out (which again
> re-enters the abort path, dropping the connection and forcing outer
> transaction levels to abort as well).

Why do we need this 30-second timeout for abort cleanup failures? Isn't it sufficient if we propagate the error only on failures?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com