Re: Allow "snapshot too old" error, to prevent bloat - Mailing list pgsql-hackers
From: Kevin Grittner
Subject: Re: Allow "snapshot too old" error, to prevent bloat
Msg-id: 68410939.1739882.1424361693462.JavaMail.yahoo@mail.yahoo.com
In response to: Re: Allow "snapshot too old" error, to prevent bloat (Stephen Frost <sfrost@snowman.net>)
Responses: Re: Allow "snapshot too old" error, to prevent bloat
           Re: Allow "snapshot too old" error, to prevent bloat
List: pgsql-hackers
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (kgrittn@ymail.com) wrote:
>> Stephen Frost <sfrost@snowman.net> wrote:
>>> In the end, with a single long-running transaction, the worst bloat
>>> you would have is double the size of the system at the time the
>>> long-running transaction started.
>>
>> I agree that limiting bloat to one dead tuple for every live one
>> for each old snapshot is a limit that has value, and it was unfair
>> of me to characterize that as not being a limit.  Sorry for that.
>>
>> This possible solution was discussed with the user whose feedback
>> caused me to write this patch, but there were several reasons they
>> dismissed that as a viable total solution for them, two of which I
>> can share:
>>
>> (1) They have a pool of connections, each of which can have several
>> long-running cursors, so the limit from that isn't just doubling
>> the size of their database; it is limiting it to some two-or-three
>> digit multiple of the necessary size.
>
> This strikes me as a bit off-the-cuff; was an analysis done which
> determined that would be the result?

If it sounds off-the-cuff, it is because I am summarizing the mountain
of data which has been accumulated over weeks of discussions and many
rounds of multi-day tests using their actual software, driven by a
software test driver that simulates users, with "think time" and all.
To be usable, it needs to run on a particular set of hardware with a
simulated load of 3000 users.  In a two-day test without the patches
I'm proposing, their actual, working production software running
against PostgreSQL bloated the database by 37%, with the bloat
increasing at a linear rate.  Unpatched, CPU time consumption also
grew at a linear rate due to the bloat, until on the second day it
saturated the CPUs and the driver was unable to get acceptable user
performance.  Because of that we haven't moved on to test what the
bloat rate would be like with 3000 users, but I assume it would be
higher than with 300 users.
With the two patches I submitted, bloat stabilized at less than 5%,
except for some btree indexes which followed a pattern of inserting at
the end and deleting most (but not all) of the entries over time.
I've been working on a separate patch for that, but it's not ready to
present, so I probably won't post it until the first CF of the next
release -- unless someone's interested enough to want to discuss it
before then.

>> (2) They are already prepared to deal with "snapshot too old"
>> errors on queries that run more than about 20 minutes and which
>> access tables which are modified.  They would rather do that than
>> suffer the bloat beyond that point.
>
> That, really, is the crux here- they've already got support for
> dealing with it the way Oracle did and they'd like PG to do that
> too.  Unfortunately, that, by itself, isn't a good reason for a
> particular capability (we certainly aren't going to be trying to
> duplicate PL/SQL in PG any time soon).

I understand that, and if the community doesn't want this style of
limitation on bloat we can make it part of PPAS (which *does*
duplicate PL/SQL, among other things).  We are offering it to the
community first.

> That said, there are capabilities in other RDBMS's which are
> valuable and which we *do* want, so the fact that Oracle does this
> also isn't a reason to not include it. :-)

>>> I'm not against having a knob like this, which is defaulted to
>>> off,
>>
>> Thanks!  I'm not sure that amounts to a +1, but at least it doesn't
>> sound like a -1. :-)
>
> So, at the time I wrote that, I wasn't sure if it was a +1 or not
> myself.
> I've been thinking about it since then, however, and I'm leaning
> more towards having the capability than not, so perhaps that's a +1,
> but it doesn't excuse the need to come up with an implementation
> that everyone can be happy with, and what you've come up with so far
> doesn't have a lot of appeal, based on the feedback (I've only
> glanced through it myself, but I agree with Andres and Tom that it's
> a larger footprint than we'd want for this, and the number of places
> having to be touched is concerning as it could lead to future bugs).

This conversation has gotten a little confusing because some of my
posts seem to have been held up in the email systems for some reason,
and are arriving out of order (and at least one I don't see yet).
But I have been responding to these points.

For one thing, I stated four days ago that I now feel the metric for
determining that a snapshot is "old" should be *time*, not
transaction IDs consumed, and offered to modify the patch in that
direction if people agreed.  I now see one person agreeing and
several people arguing that I should do just that (as though they had
not seen me propose it), so I'll try to get a new version out today
like that.  In any event, I am sure it is possible and feel that it
would be better.  (Having posted that now at least 4 times, hopefully
people will pick up on it.)

For another thing, as mentioned earlier, this is the btree footprint:

 src/backend/access/nbtree/nbtinsert.c |  7 ++++---
 src/backend/access/nbtree/nbtpage.c   |  2 +-
 src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
 src/include/access/nbtree.h           |  7 ++++---
 4 files changed, 43 insertions(+), 16 deletions(-)

That consists mainly of two things.
First, there are 13 places (all in nbtsearch.c) where, immediately
following a call to BufferGetPage(), a line like this was added:

    TestForOldSnapshot(snapshot, rel, page);

Second, some functions that were not previously passed a snapshot had
one added to the function signature, and the snapshot was passed in
from the caller where needed.  C files other than nbtsearch.c were
modified only to pass a NULL in to calls to the functions with
modified signatures -- in a total of four places.  That is it.  To be
sure, I initially had to spend some time figuring out which
BufferGetPage() calls needed to be followed by this test, but it's
not rocket science.

> A lot of that would go away if there was a way to avoid having to
> mess with the index AMs, I'd think, but I wonder if we'd actually
> need more done there- it's not immediately obvious to me how an
> index-only scan is safe with this.  Whenever an actual page is
> visited, we can check the LSN, and the relation can't be truncated
> by vacuum since the transaction will still have a lock on the table
> which prevents it, but does the visibility-map update check make
> sure to never mark pages all-visible when one of these old
> transactions is running around?  On what basis?

I will review that; it may indeed need some attention.  The index
entry is removed before the line pointer can be freed, but the
question is whether there is a window where pruning or vacuum has
removed the tuple and changed the page's visibility map bit to say
that everything on the page is visible to all before the index entry
is removed.  There may be a need to, as you say, suppress setting the
all-visible flag in some particular cases.
>>> but I do think we'd be a lot better off with a system that could
>>> realize when rows are not visible to any currently running
>>> transaction and clean them up.
>>
>> +1
>>
>> But they are not mutually exclusive; I see them as complementary.
>
> I can see how they would be, provided we can be confident that we're
> going to actually throw an error when the snapshot is out of date
> and not end up returning incorrect results.  We need to be darn sure
> of that, both now and in a few years from now when many of us may
> have forgotten about this knob.. ;)

I get that this is not zero-cost to maintain, and that it might not
be of interest to the community largely for that reason.  If you have
any ideas about how to improve that, I'm all ears.  But do consider
the actual scope of what was needed for the btree coverage (above).
(Note that the index-only scan issue doesn't look like anything
AM-specific; if something is needed for that, it would be in the
pruning/vacuum area.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company