Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: snapshot too old issues, first around wraparound and then more. |
Date | |
Msg-id | CAH2-WzmM5wmuw8xMmapBFkSx5puMg4oU7r8PvncC2e=VUJ-Tcg@mail.gmail.com Whole thread Raw |
In response to | Re: snapshot too old issues, first around wraparound and then more. (Andres Freund <andres@anarazel.de>) |
Responses |
Re: snapshot too old issues, first around wraparound and then more.
Re: snapshot too old issues, first around wraparound and then more. |
List | pgsql-hackers |
On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote: > Since this is a feature that can result in wrong query results (and > quite possibly crashes / data corruption), I don't think we can just > leave this unfixed. But given the amount of code / infrastructure > changes required to get this into a working feature, I don't see how we > can unleash those changes onto the stable branches. I don't think that the feature can be allowed to remain in anything like its current form. The current design is fundamentally unsound. > I don't really know what to do here. Causing problems by neutering a > feature in the back branch *sucks*. While not quite as bad, removing a > feature without a replacement in a major release is pretty harsh > too. But I don't really see any other realistic path forward. I have an idea that might allow us to insulate some users from the problem caused by a full revert (or disabling the feature) in the backbranches. I wouldn't usually make such a radical suggestion, but the current situation is exceptional. Anything that avoids serious pain for users deserves to be considered. Kevin said this about the feature very recently: """ Keep in mind that the real goal of this feature is not to eagerly _see_ "snapshot too old" errors, but to prevent accidental debilitating bloat due to one misbehaving user connection. This is particularly easy to see (and therefore unnervingly common) for those using ODBC, which in my experience tends to correspond to the largest companies which are using PostgreSQL. In some cases, the snapshot which is preventing removal of the rows will never be used again; removal of the rows will not actually affect the result of any query, but only the size and performance of the database. This is a "soft limit" -- kinda like max_wal_size. Where there was a trade-off between accuracy of the limit and performance, the less accurate way was intentionally chosen. I apologize for not making that more clear in comments. """ ODBC uses cursors in rather strange ways, often to implement a kind of ODBC-level cache. See the description of "Use Declare/Fetch" from https://odbc.postgresql.org/docs/config.html to get some idea of what this can look like. I think that it's worth considering whether or not there are a significant number of "snapshot too old" users that rarely or never rely on old snapshots used by new queries. Kevin said that this happens "in some cases", but how many cases? Might it be that many "snapshot too old" users could get by with a version of the feature that makes the most conservative possible assumptions, totally giving up on the idea of differentiating which blocks are truly safe to access with an "old" snapshot? (In other words, one that assumes that they're *all* unsafe for an "old" snapshot.) I'm thinking of a version of "snapshot too old" that amounts to a statement timeout that gets applied for xmin horizon type purposes in the conventional way, while only showing an error to the client if and when they access literally any buffer (though not when the relation is a system catalog). Is it possible that something along those lines is appreciably better than nothing to users? If it is, and if we can find a way to manage the transition, then maybe we could tolerate supporting this greatly simplified implementation of "snapshot too old". I feel slightly silly for even suggesting this. I have to ask. Maybe nobody noticed a problem with the feature before now (at least in part) because they didn't truly care about old snapshots anyway. They just wanted to avoid a significant impact from buggy code that leaks cursors and things like that. Or, they were happy as long as they could still access ODBC's "100 rows in a cache" through the cursor. The docs say that a old_snapshot_threshold setting in the hours is about the lowest reasonable setting for production use, which seems rather high to me. It almost seems as if the feature specifically targets misbehaving applications already. -- Peter Geoghegan
pgsql-hackers by date: