Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw
Msg-id: CA+TgmobP_huxdQ8QWr1SPpzdWtrcmWYrLpHacsqftUoaYNenLA@mail.gmail.com
In response to: Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List: pgsql-hackers
On Thu, Apr 20, 2017 at 10:27 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The logs above show that 34 seconds elapsed between starting to abort
> the transaction and knowing that the foreign server is unreachable. It
> looks like it took that much time for the local server to realise that
> the foreign server is not reachable. Looking at PQcancel code, it
> seems to be trying to connect to the foreign server to cancel the
> query. But somehow it doesn't seem to honor connect_timeout setting.
> Is that expected?

Well, there's no code to do anything else. Regular connections go through connectDBComplete(), which uses non-blocking mode and timed waits to respect connect_timeout. PQcancel() has no such handling: internal_cancel() just opens a socket and, without setting any options on it, calls connect(). No loop, no timed waits, nada. So it's going to take as long as it takes for the operating system to notice.

> Irrespective of what PQcancel does, it looks like postgres_fdw should
> just slam the connection if query is being aborted because of
> statement_timeout. But then pgfdw_xact_callback() doesn't seem to have
> a way to know whether this ABORT is because of user's request to
> cancel the query, statement timeout, an abort because of some other
> error or a user requested abort. Except statement timeout (may be
> user's request to cancel the query?), it should try to keep the
> connection around to avoid any future reconnection. But I am not able
> to see how can we provide that information to pgfdw_xact_callback().

I don't think we can. In general, PostgreSQL has no facility for telling error cleanup handlers why the abort happened, and it wouldn't really be the right thing anyway. The fact that statement_timeout fired doesn't necessarily mean that the connection is dead; it could equally well mean that the query ran for a long time. I think we basically have two choices.
One is to bound the amount of time we spend performing error cleanup, and the other is just to always drop the connection. A problem with the latter is that it might do the wrong thing if we're aborting a subtransaction but not the whole transaction. In that case, we need to undo changes since the relevant savepoint, but not the ones made before that; closing the connection amounts to a full rollback. Therefore, I think the patch you proposed in https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com isn't correct, because if the cancel fails it will slam the connection shut regardless of whether we're in pgfdw_xact_callback() or pgfdw_subxact_callback().

It looks to me like there's a fairly lengthy list of conceptually separate but practically related problems here:

1. PQcancel() ignores the keepalive parameters, because it makes no attempt to set the relevant socket options before connecting.

2. PQcancel() ignores connect_timeout, because it doesn't use non-blocking mode and has no wait loop.

3. There is no asynchronous equivalent of PQcancel(), so we can't use a loop to allow responding to interrupts while the cancel is outstanding (see pgfdw_get_result for an example).

4. pgfdw_xact_callback() and pgfdw_subxact_callback() use PQexec() rather than the asynchronous interfaces for sending queries and checking for results, so the SQL commands they send are not interruptible.

5. pgfdw_xact_callback() and pgfdw_subxact_callback() try to get the connection back to a good state by using ABORT TRANSACTION for the former and ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT for the latter. But if that doesn't work, they just emit a WARNING and continue on as if they'd succeeded. That seems highly likely to make the next use of that connection fail in unpredictable ways.

6. postgres_fdw doesn't use PQsetnonblocking() anywhere, so even if we converted pgfdw_xact_callback() and pgfdw_subxact_callback() to use pgfdw_exec_query() or something like it rather than PQexec() to submit queries, they might still block if we fail to send the query, as the comments for pgfdw_exec_query() explain. This is possible but not particularly likely to happen for queries sent from error handling paths, because odds are good that the connection was working shortly before. However, if the user is pressing ^C because the remote server isn't responding, it's quite probable that we'll run into this exact issue.

7. postgres_fdw never considers slamming the connection shut as a response to trouble. It seems pretty clear that this is only a safe response if we're aborting the toplevel transaction. If we did it for a subtransaction, we'd end up reconnecting if the server were accessed again, which would at the very least change the snapshot (note that we use at least REPEATABLE READ on the remote side regardless of the local isolation level) and might discard changes made on the remote side at outer transaction levels. Even for a top-level transaction, it might not always be the right way to proceed, as it incurs some overhead to reconnect, but it seems worth considering at least in the case where we've already got some indication that the connection is unhealthy (e.g. a previous attempt to clean up the connection state failed, as in #5, or PQcancel failed, as in Ashutosh's proposed patch).

8. Before 9.6, PQexec() is used in many more places, so many more queries are entirely non-interruptible.

It seems pretty clear to me that we are not going to fix all of these issues in one patch. Here's a sketch of an idea for how to start making things better:

- Add an in_abort_cleanup flag to ConnCacheEntry.

- Before pgfdw_xact_callback() or pgfdw_subxact_callback() begins abort cleanup, check whether the flag is set.
If so, slam the connection shut unless that's already been done; furthermore, if the flag is set and we're in pgfdw_xact_callback() (i.e. this is a toplevel abort), forget about the connection entry entirely. On the other hand, if the flag is not set, set the flag and attempt abort cleanup. If we succeed, clear the flag.

- Before pgfdw_xact_callback() or pgfdw_subxact_callback() begins pre-commit processing, check whether the flag is set. If so, throw an ERROR, so that we switch over to abort processing.

- Change uses of PQexec() in the abort path to use pgfdw_exec_query() instead. If we exit pgfdw_exec_query() with an error, we'll re-enter the abort path, but now in_abort_cleanup will be set, so we'll just drop the connection (and force any outer transaction levels to abort as well).

- For bonus points, give pgfdw_exec_query() an optional timeout argument, and set it to 30 seconds or so when we're doing abort cleanup. If the timeout expires, it errors out (which again re-enters the abort path, dropping the connection and forcing outer transaction levels to abort as well).

Assuming the above works out, I propose to back-patch f039eaac7131ef2a4cf63a10cf98486f8bcd09d2, 1b812afb0eafe125b820cc3b95e7ca03821aa675, and the changes above to all supported releases. That would fix problems 4, 5, 7, and 8 from the above list for all branches.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company