should there be a hard-limit on the number of transactions pending undo? - Mailing list pgsql-hackers
From | Robert Haas
Subject | should there be a hard-limit on the number of transactions pending undo?
Date |
Msg-id | CA+Tgmoafc_RZNW7G2w1CpkDUjaUzTRJaE_Uh6dx_sDLGJFsKKg@mail.gmail.com
In response to | Re: POC: Cleaning up orphaned files using undo logs (Andres Freund <andres@anarazel.de>)
Responses | Re: should there be a hard-limit on the number of transactions pending undo? (3 replies)
List | pgsql-hackers
On Tue, Jul 16, 2019 at 6:23 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, that seems like a question independent of the "completeness"
> requirement. If desirable, it seems trivial to either have
> RollbackHashEntry have per-persistence level status (for one entry per
> xid), or not (for per-persistence entries).

I want to talk more about the "completeness" issue, which is basically a question of whether we should (a) put a hard limit on the number of transactions that have unprocessed undo and that are either aborted or in progress (and thus capable of aborting), as proposed by Andres, or (b) not have such a hard limit, as originally proposed by me. I think everyone who is not working on this code has installed an automatic filter rule to send the original thread to /dev/null, so I'm changing the subject line in the hopes of getting some of those people to pay attention. If it doesn't work, at least the concerns will be memorialized in case it comes up later.

I originally proposed (b) because if undo application is failing for some reason, it seemed better not to bring the whole system to a screeching halt, but rather just cause incremental performance degradation or something. However, Andres has pointed out that this may postpone remedial action and let bugs go undetected, so it might not actually be better. Also, some of the things we need to do, like computing the oldest XID whose undo has not been retired, are tricky/expensive if you don't have complete data in memory, and you can't be sure of having complete data in shared memory without a hard limit. No matter which way we go, failures to apply undo had better be really rare, or we're going to be in really serious trouble, so we're only concerned here with how to handle what is hopefully a very rare scenario, not the common case.

I want to consider three specific scenarios that could cause undo application to fail, and then offer some observations about them.

Scenario #1:

1. Sessions 1..N each begin a transaction and write a bunch of data to a table (at least enough that we'll try to perform undo in the background).
2. Session N+1 begins a transaction and tries to lock the same table. It blocks.
3. Sessions 1..N abort, successfully pushing the undo work into the background.
4. Session N+1 now acquires the lock and sits on it.
5. Optionally, repeat steps 1-4 K times, each time for a different table.

Scenario #2:

1. Any number of sessions begin a transaction, write a bunch of data, and then abort.
2. They all try to perform undo in the foreground.
3. They get killed using pg_terminate_backend().

Scenario #3:

1. A transaction begins, does some work, and then aborts.
2. When undo processing occurs, 1% of such transactions fail during undo apply because of a bug in the table AM.
3. When undo processing retries after a failure, it fails again because the bug is triggered by something about the contents of the undo record, rather than by, say, concurrency.

In scenario #1, the problem is mostly self-correcting. When we decide that we've got too many things queued up for background processing, and start to force undo processing to happen in the foreground, it will start succeeding, because the foreground process will have retained the lock that it took before writing any data and can therefore undo those writes without having to wait for the lock. However, this will do nothing to help the requests that are already in the background, which will just sit there until they can get the lock.
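For illustration only, here is roughly the shape of the back-pressure decision I have in mind at abort time. This is just a sketch under my own assumptions; every identifier in it (UndoRequest, ApplyOrScheduleUndo, background_undo_threshold, pending_undo_hard_limit, and the helper functions) is invented for this example rather than taken from the actual patches:

/*
 * Illustration only: a possible shape for the back-pressure decision made
 * when a transaction aborts.  All names here are invented for the example.
 */
typedef struct UndoRequest
{
    TransactionId xid;          /* transaction being rolled back */
    Oid         relid;          /* a relation touched by the transaction */
    Size        undo_size;      /* total amount of undo to be applied */
} UndoRequest;

static void
ApplyOrScheduleUndo(UndoRequest *req)
{
    /*
     * Small rollbacks are always handled in the foreground, and large ones
     * are too once the shared queue of pending undo gets close to its limit.
     * Foreground undo can't block on a lock, because the aborting backend
     * still holds every lock it took before writing the data.
     */
    if (req->undo_size < background_undo_threshold ||
        PendingUndoRequestCount() >= pending_undo_hard_limit)
        PerformUndoActionsInForeground(req);
    else
        PushUndoRequestToBackground(req);   /* an undo worker picks it up */
}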
I think there is a good argument that they should not actually wait for the lock, or should wait only for a certain time period, and then give up and put the transaction on the error queue for reprocessing at a later time. Otherwise, we're pinning down undo workers, which could easily lead to starvation, just as it does for autovacuum.

On the whole, this doesn't sound too bad. We shouldn't be able to fill up the queue with small transactions, because of the restriction that we only push undo work into the background when the transaction is big enough, and if we fill it up with big transactions, then (1) back-pressure will prevent the problem from crippling the system and (2) eventually the problem will be self-correcting, because when the transaction in session N+1 ends, the undo will all go through and everything will be fine. The only real user impact of this scenario is that unrelated work on the system might notice that large rollbacks are now happening in the foreground rather than the background, and if that causes a problem, the DBA can fix it by terminating session N+1. Even if she doesn't, you shouldn't ever hit the hard cap.

However, if prepared transactions are in use, we could have a variant of scenario #1 in which each transaction is first prepared, and then the prepared transaction is rolled back. Unlike the ordinary case, this can lead to a nearly-unbounded growth in the number of transactions that are pending undo, because we don't have a way to transfer the locks held by the PGPROC used for the prepare to some running session that could perform the undo. It's not necessary to have a large value for max_prepared_transactions; it only has to be greater than 0, because we can keep reusing the same slots with different tables. That is, let N = max_prepared_xacts, and let K be anything at all; session N+1 can just stay in the same transaction and keep on taking new locks one at a time until the lock table fills up; not sure exactly how long that will take, but it's probably a five digit number of transactions, or maybe six. In this case, we can't force undo into the foreground, so we can exceed the number of transactions that are supposed to be backgrounded. We'll eventually have to just start refusing new transactions permission to attach to an undo log, and they'll error out.

Although unpleasant, I don't think that this scenario is a death sentence for the idea of having a hard cap on the table size, because if the cap is 100k or so, you shouldn't really hit it unless you specifically make it your goal to do so. At least, not this way. But if you have a lower cap, like 1k, it doesn't seem crazy to think that you could hit this in a non-artificial scenario; you just need lots of rolled-back prepared transactions plus some long-running DDL.

We could mitigate the prepared transaction scenario by providing a way to transfer locks from the prepared transaction to the backend doing the ROLLBACK PREPARED and then make it try to execute the undo actions. I think that would bring this scenario into parity with the non-prepared case. We could still try to background large rollbacks, but if the queue gets too full then ROLLBACK PREPARED would do the work instead, and, with the hypothetical lock transfer mechanism, that would dodge the locking issues.

In scenario #2, the undo work is going to have to be retried in the background, and perforce that means reacquiring locks that have been released, and so there is a chance of long lock waits and/or deadlock that cannot really be avoided.
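To make the "give up and requeue" idea from a few paragraphs back concrete, a background undo worker could use the existing conditional lock primitives rather than blocking. Again, this is only a sketch: ConditionalLockRelationOid and UnlockRelationOid are real lock-manager entry points, but UndoRequest is the same invented struct as in the earlier sketch, ExecuteUndoActions and PushToErrorQueue are invented names, and the choice of RowExclusiveLock as the mode undo apply needs is an assumption on my part:

#include "storage/lmgr.h"

/*
 * Illustration only: a background undo worker giving up instead of blocking
 * behind session N+1's lock.
 */
static void
ProcessBackgroundUndoRequest(UndoRequest *req)
{
    /* Try to take the table lock without blocking. */
    if (!ConditionalLockRelationOid(req->relid, RowExclusiveLock))
    {
        /*
         * Someone (session N+1 in scenario #1) holds a conflicting lock.
         * Rather than pinning down this worker indefinitely, put the request
         * on the error queue to be retried later.
         */
        PushToErrorQueue(req);
        return;
    }

    ExecuteUndoActions(req);
    UnlockRelationOid(req->relid, RowExclusiveLock);
}

The point of the non-blocking attempt is exactly the starvation argument above: a worker that never sleeps on someone else's lock can always move on to the next request.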
I think there is basically no way at all to avoid an unbounded accumulation of transactions requiring undo in this case, just as in the similar case where the cluster is repeatedly shut down or repeatedly crashes. Eventually, if you have a hard cap on the number of transactions requiring undo, you're going to hit it, and have to start refusing new undo-using transactions. As Thomas pointed out, that might still be better than some other systems which use undo, where the system doesn't open for any transactions at all after a restart until all undo is retired, and/or where undo is never processed in the background. But it's a possible concern. On the other hand, if you don't have a hard cap, the system may just get further and further behind until it eventually melts, and that's also a possible concern.

How plausible is this scenario? For most users, cluster restarts and crashes are uncommon, so that variant isn't likely to happen unless something else is going badly wrong. As to the scenario as written, it's not crazy to think that a DBA might try to kill off sessions that are sitting there stuck in undo processing for long periods of time, but that doesn't make it a good idea. Whatever problems it causes are analogous to the problems you get if you keep killing off autovacuum processes: the system is trying to make you do the right thing, and if you fight it, you will have some kind of trouble no matter what design decisions we make.

In scenario #3, the hard limit is likely to bring things to a screeching halt pretty quickly; you'll just run out of space in the in-memory data structures. Otherwise, the problem will not be obvious unless you're keeping an eye on the error messages in your logs; the first sign of trouble may be that the undo logs fill up the disk. It's not really clear which is better. There is value in knowing about the problem sooner (because then you can file a bug report right away and get a fix sooner), but there is also value in having the system limp along instead of grinding to a halt (because then you might not be totally down while you're waiting for that bug fix to become available).

One other thing that seems worth noting is that we have to consider what happens after a restart. After a crash, and depending on exactly how we design it perhaps also after a non-crash restart, we won't immediately know how many outstanding transactions need undo; we'll have to grovel through the undo logs to find out. If we've got a hard cap, we can't allow new undo-using transactions to start until we finish that work. It's possible that, at the moment of the crash, the maximum number of items had already been pushed into the background, and every foreground session was busy trying to undo an abort as well. If so, we're already up against the limit. We'll have to scan through all of the undo logs and examine each transaction to get a count of how many transactions are already in a needs-undo-work state; only once we have that value do we know whether it's OK to admit new transactions to using the undo machinery, and how many we can admit. In typical cases, that won't take long at all, because there won't be any pending undo work, or not much, and we'll very quickly read the handful of transaction headers that we need to consult and away we go. However, if the hard limit is pretty big, and we're pretty close to it, counting might take a long time. It seems bothersome to have this interval between when we start accepting transactions and when we can accept transactions that use undo.
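To spell out the kind of post-restart accounting I'm describing, here is a sketch of that scan. Every identifier in it is invented for illustration (the real undo layer spells its logs, pointers, and headers differently); it only shows the shape of the work, not the actual code:

/*
 * Illustration only: after a restart, count transactions whose undo has not
 * yet been applied, so we know how many new undo-using transactions a hard
 * cap would still allow.  All names are invented.
 */
static int
CountTransactionsPendingUndo(void)
{
    int         pending = 0;

    /* Walk every undo log that survived the restart. */
    for (UndoLog *log = FirstUndoLog(); log != NULL; log = NextUndoLog(log))
    {
        /*
         * Walk the transaction headers between the log's discard pointer and
         * its insert pointer.  Anything not yet discarded belongs to a
         * transaction whose undo may still need to be applied.
         */
        for (UndoRecPtr ptr = log->oldest_undiscarded;
             ptr < log->insert_location;
             ptr = NextTransactionHeader(log, ptr))
        {
            UndoTransactionHeader hdr = ReadTransactionHeader(log, ptr);

            if (TransactionHasUnappliedUndo(&hdr))
                pending++;
        }
    }

    /*
     * Only once this count is known do we know whether, and how many, new
     * undo-using transactions can be admitted.
     */
    return pending;
}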
Instead of throwing an ERROR, we can probably just teach the system to wait for the background process to finish doing the counting; that's what Amit's patch does currently. Or, we could not even open for connections until the counting has been completed.

When I first thought about this, I was really concerned about the idea of a hard limit, but the more I think about it the less problematic it seems. I think in the end it boils down to a question of: when things break, what behavior would users prefer? You can either have a fairly quick, hard breakage which will definitely get your attention, or you can have a long, slow process of gradual degradation that doesn't actually stop the system until, say, the XIDs stuck in the undo processing queue become old enough to threaten wraparound, or the disk fills up. Which is less evil?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company