Thread: condition variable cleanup and subtransactions
Hi, On 2017-12-21 02:42:25 -0800, Andres Freund wrote: > Trying to debug this I found another issue. I'd placed a sleep(10) in > ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for > some reason. Segfault time: > > #0 0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at /home/andres/src/postgresql/src/include/storage/s_lock.h:228 > #1 0x000055bfbac42b4d in ConditionVariableCancelSleep () at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173 > #2 0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478 > #3 0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:4387 > So, afaics no workers had yet attached, the leader accepted the cancel > interrupt, the dsm segments were destroyed, and as part of cleanup > cv_sleep_target was supposed to be reset, which fails, because it's > memory has since been freed. Looking at how that can happen. Oh. This seems to be a condition variable bug independent of PHJ. The problem is that the DSM segment etc all get cleaned up in *subtransaction* abort. Afaict it's a bug that AbortTransaction() does ConditionVariableCancelSleep() but AbortSubTransaction() does not, despite the latter releasing dsm segments via ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS). Adding that seems to fix the crash. This seems like something we need to backpatch. Greetings, Andres Freund
On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote: > Oh. This seems to be a condition variable bug independent of PHJ. The > problem is that the DSM segment etc all get cleaned up in > *subtransaction* abort. > > Afaict it's a bug that AbortTransaction() does > ConditionVariableCancelSleep() but AbortSubTransaction() does not, > despite the latter releasing dsm segments via > ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS). > > Adding that seems to fix the crash. > > This seems like something we need to backpatch. Agreed. That affects any user of condition variables inside DSM segments, including the released Parallel Index Scan and Parallel Bitmap Heap Scan code. -- Thomas Munro http://www.enterprisedb.com
On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote: > Oh. This seems to be a condition variable bug independent of PHJ. The > problem is that the DSM segment etc all get cleaned up in > *subtransaction* abort. > > Afaict it's a bug that AbortTransaction() does > ConditionVariableCancelSleep() but AbortSubTransaction() does not, > despite the latter releasing dsm segments via > ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS). > > Adding that seems to fix the crash. > > This seems like something we need to backpatch. Agreed. That affects any user of condition variables inside DSM segments, including the released Parallel Index Scan and Parallel Bitmap Heap Scan code. -- Thomas Munro http://www.enterprisedb.com
On Thu, Dec 21, 2017 at 6:46 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote: >> Oh. This seems to be a condition variable bug independent of PHJ. The >> problem is that the DSM segment etc all get cleaned up in >> *subtransaction* abort. >> >> Afaict it's a bug that AbortTransaction() does >> ConditionVariableCancelSleep() but AbortSubTransaction() does not, >> despite the latter releasing dsm segments via >> ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS). >> >> Adding that seems to fix the crash. >> >> This seems like something we need to backpatch. > > Agreed. That affects any user of condition variables inside DSM > segments, including the released Parallel Index Scan and Parallel > Bitmap Heap Scan code. Fixed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 21, 2017 at 6:46 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote: >> Oh. This seems to be a condition variable bug independent of PHJ. The >> problem is that the DSM segment etc all get cleaned up in >> *subtransaction* abort. >> >> Afaict it's a bug that AbortTransaction() does >> ConditionVariableCancelSleep() but AbortSubTransaction() does not, >> despite the latter releasing dsm segments via >> ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS). >> >> Adding that seems to fix the crash. >> >> This seems like something we need to backpatch. > > Agreed. That affects any user of condition variables inside DSM > segments, including the released Parallel Index Scan and Parallel > Bitmap Heap Scan code. Fixed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company