Thread: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
Hi, I've been investigating some failures in test_decoding regression tests, and it seems to me the error-handling in ReorderBufferCommit() is somewhat broken, leading to segfault crashes. The problematic part is this: PG_CATCH() { /* * Force cache invalidation to happen outside of a valid transaction * to prevent catalog access as we just caughtan error. */ AbortCurrentTransaction(); /* make sure there's no cache pollution */ ReorderBufferExecuteInvalidations(rb, txn); ... } Triggering it trivial - just add elog(ERROR,...) at the beginning of the PG_TRY() block. The problem is that AbortCurrentTransaction() apparently releases the memory where txn is allocated, making it entirely bogus. So in assert builds txn->ivalidations are 0x7f7f7f7f7f7f7f7f, triggering a segfault. Similar issues apply to subsequent calls in the catch block, that also use txn in some way (e.g. through snapshot_now). I suppose this is not quite intentional, but rather an example that error-handling code is an order of magnitude more complicated to write and test. I've only noticed as I'm investigating some regression failures on Postgres-XL 10, which does not support subtransactions and so the BeginInternalSubTransaction() call in the try branch always fails, triggering the issue. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote: > I've been investigating some failures in test_decoding regression tests, > and it seems to me the error-handling in ReorderBufferCommit() is > somewhat broken, leading to segfault crashes. > > The problematic part is this: > > PG_CATCH() > { > /* > * Force cache invalidation to happen outside of a valid transaction > * to prevent catalog access as we just caught an error. > */ > AbortCurrentTransaction(); > > /* make sure there's no cache pollution */ > ReorderBufferExecuteInvalidations(rb, txn); > > ... > > } > > Triggering it trivial - just add elog(ERROR,...) at the beginning of the > PG_TRY() block. That's not really a valid thing to do, you should put it after the BeginInternalSubTransaction()/StartTransactionCommand(). It's basically assumed that those won't fail - arguably they should be outside of a PG_TRY then, but that's a different matter. If you start decoding outside of SQL failing before those will lead to rolling back the parent tx... > I suppose this is not quite intentional, but rather an example that > error-handling code is an order of magnitude more complicated to write > and test. I've only noticed as I'm investigating some regression > failures on Postgres-XL 10, which does not support subtransactions and > so the BeginInternalSubTransaction() call in the try branch always > fails, triggering the issue. So, IIUC, there's no live problem in postgres core, besides some ugly & undocumented assumptions? Greetings, Andres Freund
On 08/30/2017 01:34 AM, Andres Freund wrote: > Hi, > > On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote: >> I've been investigating some failures in test_decoding regression tests, >> and it seems to me the error-handling in ReorderBufferCommit() is >> somewhat broken, leading to segfault crashes. >> >> The problematic part is this: >> >> PG_CATCH() >> { >> /* >> * Force cache invalidation to happen outside of a valid transaction >> * to prevent catalog access as we just caught an error. >> */ >> AbortCurrentTransaction(); >> >> /* make sure there's no cache pollution */ >> ReorderBufferExecuteInvalidations(rb, txn); >> >> ... >> >> } >> >> Triggering it trivial - just add elog(ERROR,...) at the beginning of the >> PG_TRY() block. > > That's not really a valid thing to do, you should put it after the > BeginInternalSubTransaction()/StartTransactionCommand(). It's basically > assumed that those won't fail - arguably they should be outside of a > PG_TRY then, but that's a different matter. If you start decoding > outside of SQL failing before those will lead to rolling back the parent > tx... > > >> I suppose this is not quite intentional, but rather an example that >> error-handling code is an order of magnitude more complicated to write >> and test. I've only noticed as I'm investigating some regression >> failures on Postgres-XL 10, which does not support subtransactions and >> so the BeginInternalSubTransaction() call in the try branch always >> fails, triggering the issue. > > So, IIUC, there's no live problem in postgres core, besides some ugly & > undocumented assumptions? > I'm not really following your reasoning. You may very well be right that the BeginInternalSubTransaction() example is not quite valid on postgres core, but I don't see how that implies there can be no other errors in the PG_TRY block. It was merely an explanation how I noticed this issue. To make it absolutely clear, I claim that the PG_CATCH() block is demonstrably broken as it calls AbortCurrentTransaction() and then accesses already freed memory. The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in the switch, and AFAICS hitting any of them will have exactly the same effect as failure in BeginInternalSubTransaction(). And I suppose there many other ways to trigger an error and get into the catch block. In other words, why would we have the block at all? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote: > > > On 08/30/2017 01:34 AM, Andres Freund wrote: > > Hi, > > > > On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote: > >> I've been investigating some failures in test_decoding regression tests, > >> and it seems to me the error-handling in ReorderBufferCommit() is > >> somewhat broken, leading to segfault crashes. > >> > >> The problematic part is this: > >> > >> PG_CATCH() > >> { > >> /* > >> * Force cache invalidation to happen outside of a valid transaction > >> * to prevent catalog access as we just caught an error. > >> */ > >> AbortCurrentTransaction(); > >> > >> /* make sure there's no cache pollution */ > >> ReorderBufferExecuteInvalidations(rb, txn); > >> > >> ... > >> > >> } > >> > >> Triggering it trivial - just add elog(ERROR,...) at the beginning of the > >> PG_TRY() block. > > > > That's not really a valid thing to do, you should put it after the > > BeginInternalSubTransaction()/StartTransactionCommand(). It's basically > > assumed that those won't fail - arguably they should be outside of a > > PG_TRY then, but that's a different matter. If you start decoding > > outside of SQL failing before those will lead to rolling back the parent > > tx... > > > > > >> I suppose this is not quite intentional, but rather an example that > >> error-handling code is an order of magnitude more complicated to write > >> and test. I've only noticed as I'm investigating some regression > >> failures on Postgres-XL 10, which does not support subtransactions and > >> so the BeginInternalSubTransaction() call in the try branch always > >> fails, triggering the issue. > > > > So, IIUC, there's no live problem in postgres core, besides some ugly & > > undocumented assumptions? > > > > I'm not really following your reasoning. You may very well be right that > the BeginInternalSubTransaction() example is not quite valid on postgres > core, but I don't see how that implies there can be no other errors in > the PG_TRY block. It was merely an explanation how I noticed this issue. > > To make it absolutely clear, I claim that the PG_CATCH() block is > demonstrably broken as it calls AbortCurrentTransaction() and then > accesses already freed memory. Yea, but that's not what happens normally. The txn memory isn't allocated in the context created by the started [sub]transaction. I think you're just seeing what you're seeing because an error thrown before the BeginInternalSubTransaction() succeeds, will roll back the *containing* transaction (the one that did the SQL SELECT * FROM pg_*logical*() call), and free all the entire reorderbuffer stuff. > The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in > the switch, and AFAICS hitting any of them will have exactly the same > effect as failure in BeginInternalSubTransaction(). No, absolutely not. Once the subtransaction starts, the AbortCurrentTransaction() will abort that subtransaction, rather than the containing one. - Andres
On 08/30/2017 02:19 AM, Andres Freund wrote: > On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote: >> >> I'm not really following your reasoning. You may very well be right that >> the BeginInternalSubTransaction() example is not quite valid on postgres >> core, but I don't see how that implies there can be no other errors in >> the PG_TRY block. It was merely an explanation how I noticed this issue. >> >> To make it absolutely clear, I claim that the PG_CATCH() block is >> demonstrably broken as it calls AbortCurrentTransaction() and then >> accesses already freed memory. > > Yea, but that's not what happens normally. The txn memory isn't > allocated in the context created by the started [sub]transaction. I > think you're just seeing what you're seeing because an error thrown > before the BeginInternalSubTransaction() succeeds, will roll back the > *containing* transaction (the one that did the SQL SELECT * FROM > pg_*logical*() call), and free all the entire reorderbuffer stuff. > > >> The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in >> the switch, and AFAICS hitting any of them will have exactly the same >> effect as failure in BeginInternalSubTransaction(). > > No, absolutely not. Once the subtransaction starts, the > AbortCurrentTransaction() will abort that subtransaction, rather than > the containing one. > Ah, I see. You're right elog(ERROR) calls after the subtransaction start are handled correctly and don't trigger a segfault. Sorry for the noise and thanks for the explanation. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services