Thread: [bugfix] commit timestamps ERROR on lookup of FrozenTransactionId
Hi all

Today I ran into an issue where commit timestamp lookups were failing with

    ERROR: cannot retrieve commit timestamp for transaction 2

which is of course FrozenTransactionId.

TransactionIdGetCommitTsData(...) ERRORs on !TransactionIdIsNormal(), which I think is wrong. Attached is a patch to make it return 0 for FrozenTransactionId and BootstrapTransactionId, like it does for xids that are too old.

Note that the prior behaviour was as designed and has tests to enforce it. I just think it's wrong, and it's also not documented.

IMO this should be back-patched to 9.6 and, without the TAP test part, to 9.5.

--
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
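The attachment isn't reproduced inline. Going only by the description above (not the attached patch itself), the proposed change in TransactionIdGetCommitTsData() would look roughly like the following sketch: keep ERRORing only on invalid xids, and route the permanent xids through the same "no timestamp recorded" path already used for xids older than the commit_ts cutoff.

    /*
     * Sketch only, based on the description above, not the attached patch:
     * keep ERRORing on InvalidTransactionId, but treat the permanent xids
     * (Bootstrap and Frozen) as "committed long ago, no timestamp recorded",
     * the same way xids older than the commit_ts cutoff are treated.
     */
    if (!TransactionIdIsValid(xid))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("cannot retrieve commit timestamp for transaction %u",
                        xid)));
    else if (!TransactionIdIsNormal(xid))
    {
        /* frozen and bootstrap xids are always considered committed */
        *ts = 0;
        if (nodeid)
            *nodeid = InvalidRepOriginId;
        return false;
    }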
On 23 November 2016 at 20:58, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> Today I ran into an issue where commit timestamp lookups were failing with
>
>    ERROR: cannot retrieve commit timestamp for transaction 2
>
> which is of course FrozenTransactionId.
>
> TransactionIdGetCommitTsData(...) ERRORs on !TransactionIdIsNormal(),
> which I think is wrong. Attached is a patch to make it return 0 for
> FrozenTransactionId and BootstrapTransactionId, like it does for xids
> that are too old.
>
> Note that the prior behaviour was as designed and has tests to enforce
> it. I just think it's wrong, and it's also not documented.
>
> IMO this should be back-patched to 9.6 and, without the TAP test part, to 9.5.

Updated to correct the other expected file, since there's an alternate.

--
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi,

On 2016-11-23 20:58:22 +0800, Craig Ringer wrote:
> Today I ran into an issue where commit timestamp lookups were failing with
>
>    ERROR: cannot retrieve commit timestamp for transaction 2
>
> which is of course FrozenTransactionId.
>
> TransactionIdGetCommitTsData(...) ERRORs on !TransactionIdIsNormal(),
> which I think is wrong. Attached is a patch to make it return 0 for
> FrozenTransactionId and BootstrapTransactionId, like it does for xids
> that are too old.

Why? It seems quite correct to not allow lookups for special case values, as it seems sensible to give them special treatment at the call site.

> IMO this should be back-patched to 9.6 and, without the TAP test part,
> to 9.5.

Why would we want to backpatch a behaviour change, where arguments exist for both the current and the proposed behaviour?

Andres
On 24 November 2016 at 02:32, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2016-11-23 20:58:22 +0800, Craig Ringer wrote:
>> Today I ran into an issue where commit timestamp lookups were failing with
>>
>>    ERROR: cannot retrieve commit timestamp for transaction 2
>>
>> which is of course FrozenTransactionId.
>>
>> TransactionIdGetCommitTsData(...) ERRORs on !TransactionIdIsNormal(),
>> which I think is wrong. Attached is a patch to make it return 0 for
>> FrozenTransactionId and BootstrapTransactionId, like it does for xids
>> that are too old.
>
> Why? It seems quite correct to not allow lookups for special case
> values, as it seems sensible to give them special treatment at the call
> site.

It's surprising behaviour that doesn't make sense. Look at it this way:

- We do some work, generating rows that have commit timestamps
- TransactionIdGetCommitTsData() on those rows returns their cts fine
- The commit timestamp data ages out
- TransactionIdGetCommitTsData() returns 0 on these rows
- vacuum comes along and freezes the rows, even though nothing's changed
- TransactionIdGetCommitTsData() suddenly ERRORs

Nothing has meaningfully changed on these rows. They have gone from "old, committed, past the commit timestamp threshold" to "old, committed, past the commit timestamp threshold, frozen". It makes no sense to ERROR when vacuum gets around to freezing the tuples, when we don't also ERROR when we pass the cts threshold.

ERRORing on BootstrapTransactionId is slightly more reasonable since those rows can never have had a cts in the first place, but it's also unnecessary since they're effectively "oldest always-committed xids".

Making it ERROR on FrozenTransactionId was a mistake and should be corrected.

>> IMO this should be back-patched to 9.6 and, without the TAP test part,
>> to 9.5.
>
> Why would we want to backpatch a behaviour change, where arguments exist
> for both the current and the proposed behaviour?

I don't think it's crucial since callers can just work around it, but IMO the current behaviour is a design oversight that should be corrected and can be safely and sensibly corrected. Nobody's going to rely on FrozenTransactionId ERRORing.

I don't think a backpatch is crucial though; as you note, C-level callers can work around the problem pretty simply, and that's just what I've done in pglogical for existing versions. I just think it's ugly, should be fixed, and is safe to fix.

It's slightly harder for SQL-level callers to work around, since they must hardcode a CASE that tests for xmin = XID '1' OR xmin = XID '2', and it's much less reasonable to expect SQL-level callers to deal with this sort of mess with low-level state.

--
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
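For C callers on servers without such a fix, the workaround alluded to above amounts to a small wrapper along these lines (a sketch only; the function name is made up for illustration and this is not the pglogical code):

    /*
     * Illustrative wrapper (hypothetical name, not pglogical's code):
     * treat frozen/bootstrap xids as "committed, but no commit timestamp
     * recorded", instead of letting TransactionIdGetCommitTsData() ERROR.
     */
    static bool
    get_commit_ts_workaround(TransactionId xid, TimestampTz *ts,
                             RepOriginId *nodeid)
    {
        if (!TransactionIdIsNormal(xid))
        {
            /* frozen/bootstrap (and invalid) xids: no timestamp available */
            *ts = 0;
            if (nodeid)
                *nodeid = InvalidRepOriginId;
            return false;
        }

        return TransactionIdGetCommitTsData(xid, ts, nodeid);
    }

SQL-level callers have no equivalent place to hang this, which is the CASE-on-xmin workaround mentioned above.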
I considered the argument here for a bit and I think Craig is right -- FrozenXid eventually makes it to a tuple's xmin, where it becomes a burden to the caller, making our interface bug-prone. Sure, you can special-case it, but you don't until it first happens ... and it may not until you're deep into production.

Even the code comment is confused: "error if the given Xid doesn't normally commit". But surely FrozenXid *does* commit, in the sense that it appears in committed tuples' Xmin.

We already have a good mechanism for replying to the query with "this value is too old for us to have its commit TS", which is a false return value. We should use that.

I think not backpatching is worse, because then users have to be aware that they need to handle the FrozenXid case specially, but only on 9.5/9.6 ...

I think the reason it took this long to pop up is that it has taken this long to get to replication systems on which this issue matters.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I considered the argument here for a bit and I think Craig is right --

FWIW, I agree. We shouldn't require every call site to special-case this, and we definitely don't want it to require special cases in SQL code.

(And I'm for back-patching, too.)

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I considered the argument here for a bit and I think Craig is right --
>
> FWIW, I agree. We shouldn't require every call site to special-case this,
> and we definitely don't want it to require special cases in SQL code.
>
> (And I'm for back-patching, too.)

Pushed.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Craig Ringer wrote:

> Updated to correct the other expected file, since there's an alternate.

FWIW I don't know what you did here, but you did not patch the alternate expected file.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 25 November 2016 at 02:44, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Craig Ringer wrote:
>
>> Updated to correct the other expected file, since there's an alternate.
>
> FWIW I don't know what you did here, but you did not patch the
> alternate expected file.

Damn. Attached the first patch a second time is what I did.

--
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services