Re: [HACKERS] Walsender timeouts and large transactions - Mailing list pgsql-hackers
From | Petr Jelinek |
---|---|
Subject | Re: [HACKERS] Walsender timeouts and large transactions |
Date | |
Msg-id | 9f192946-aa3a-dbb2-768b-ee17ebc8c7eb@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] Walsender timeouts and large transactions (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: [HACKERS] Walsender timeouts and large transactions
|
List | pgsql-hackers |
Hi, sorry for the delay but I didn't have much time in past weeks to follow this thread. On 02/10/17 05:44, Kyotaro HORIGUCHI wrote: > Hello Sokolov. > > At Fri, 29 Sep 2017 15:19:23 +0300, Sokolov Yura <funny.falcon@postgrespro.ru> wrote in <d076dae18b437be89c787a854034f3f2@postgrespro.ru> >> I don't want to make test to lasts so long and generate so many data. >> That is why I used such small timeouts for tests. > > I understand your point, but still *I* think such a short timeout > is out of expectation by design. (But it can be set.) > > Does anyone have opinions on this? Yes, it's not practically useful to have such small wal_sender_timeout given that the main purpose of that is to detect network issues. > >> Test is failing if there is "short quit" after >> `!pq_is_send_pending()`, >> so I doubt your patch will pass the test. > > It is because I think that the test "should" fail since the > timeout is out of expected range. I (and perhaps also Petr) is > thinking that the problem is just that a large transaction causes > a timeout with an ordinary timeout. My test case is based on the > assumption. > > Your test is for a timeout during replication-startup with > extremely short timeout. This may be a different problem to > discuss, but perhaps better to be solved together. > > I'd like to have opnions from others on this point. I think it's different problem and because of what I wrote above it does not seem to me like something we should spend out time on trying to fix. > > Uggh! I misunderstood there. It wais for writing socket so the > sleep is wrong and WaitLatchOrSocket is right. > > After all, I put +1 for Petr's latest patch. Sorry for my > carelessness. > Great, I attached version 3 which is just rebased on current master and also does not move the GetCurrentTimestamp() call because the concern about skewing the timestamp during config reload (and also network flush as I realized later) is valid. It's rather hard to test all the scenarios that this patch fixes in automated fashion without generating a lot of wal or adding sleeps to the processing. That's why I didn't produce usable TAP test. Since it seems like some of my reasoning is unclear I will try to describe it once more. The main problem we have is that unless we call the ProcessRepliesIfAny() before the wal_sender_timeout expires we'll die because of timeout eventually. The current coding will skip that call if there is a long transaction being processed (if network is fast enough). This is what the first part (first 2 hunks) of the patch solves. There is also issue that while this is happening the walsender ignores signals so it's impossible to stop it which is why I added the CHECK_FOR_INTERRUPTS to the fast path. The second problem is that if we solved just the first one, then if downstream (and again network) is fast enough, the ProcessRepliesIfAny() will not do anything useful because downstream is not going to send any response while the network buffer contains any data. This is caused by the fact that we normally code the receiver side to receive until there is something and only send reply when there is a "pause" in the stream. To get around this problem we also need to make sure that we send WalSndKeepaliveIfNecessary() periodically and that will not happen on fast network unless we do the second part of the patch (3rd hunk), ie, move the pq_is_send_pending() after the keepalive handling. This code is specific to logical decoding walsender interface so it only happens for logical decoding/replication (which means it should be backported all the way to 9.4). The physical one These two issues happen quite normally in the wild as all we need is big data load in single transaction, or update of large part of an table or something similar for this to happen with default settings. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: