Re: BUG #16990: Random PANIC in qemu user context - Mailing list pgsql-bugs
| From | Tom Lane |
| --- | --- |
| Subject | Re: BUG #16990: Random PANIC in qemu user context |
| Date | |
| Msg-id | 3714052.1619980429@sss.pgh.pa.us |
| In response to | BUG #16990: Random PANIC in qemu user context (PG Bug reporting form <noreply@postgresql.org>) |
| Responses | Re: BUG #16990: Random PANIC in qemu user context |
| List | pgsql-bugs |
PG Bug reporting form <noreply@postgresql.org> writes:
> Within a GitHub Actions workflow, a qemu chrooted environment is created from
> a RaspiOS lite image, within which the latest available postgresql is installed
> from apt (postgresql 11.11).
> Then tests of embedded software are executed, which include creating a
> postgresql database and performing a few benign operations (as far as
> PostgreSQL is concerned). Tests run perfectly fine in a desktop-like
> environment as well as on real devices.
> Within this qemu context, randomly yet quite frequently, postgresql PANICs.
> The latest log was the following:
> 2021-05-02 09:22:21.591 BST [15024] PANIC: stuck spinlock detected at
> LWLockWaitListLock,
> /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832

Hm.  Looking at the lwlock.c source code, that's not actually a stuck
spinlock (in the sense of a loop around a TAS() call), but a loop waiting
for an LWLock's LW_FLAG_LOCKED bit to become clear.  It's morally the same
thing though, in that we don't expect the conflicting lock to be held for
more than a few instructions, so we just busy-wait and delay until the
lock can be obtained.

Seems like there are a few possible explanations:

1. Compiler bug generating incorrect code for the wait loop (e.g., failing
to re-fetch the volatile variable each time through).  The difficulty with
this theory is that then you'd expect to see the same freezeup in normal
non-qemu execution.  But maybe qemu slows things down enough that the
window for contention on an LWLock can be hit, whereas you'd hardly ever
see that without qemu.  Seems unlikely, but maybe it'd be worth
disassembling LWLockWaitListLock to check.

2. qemu bug in emulating the atomic-update instructions that are used to
set/clear LW_FLAG_LOCKED.  This doesn't seem very probable either, but
maybe it's the most likely of a bad lot.

3. qemu is so slow that the spinlock delay times out.  I don't believe
this one either, mainly because we haven't seen it in our own occasional
uses of qemu; and if it were that slow it'd be entirely unusable.  The
spinlock timeout is normally multiple seconds, which is several orders of
magnitude longer than such locks ought to be held.

4. Postgres bug causing the lock to never get released.  This theory has
the same problem as #1, i.e. you have to explain why it's not seen in any
other environment.

5. The lock does get released, but there are enough processes contending
for it that some process times out before it successfully acquires the
lock.  Perhaps that could happen under a very high-load scenario, but that
doesn't seem like the category of test that would be sane to run under
qemu.

Not sure what to tell you, other than "make sure qemu and your build
toolchain are up-to-date".

			regards, tom lane
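
For readers unfamiliar with the mechanism Tom describes, the following is a minimal, self-contained sketch of the general pattern only: atomically try to set a LOCKED flag bit; if another process already holds it, sleep briefly and retry with backoff; and report a "stuck spinlock" only after a generous timeout.  The function names, flag value, and timing constants here are illustrative assumptions, not the actual lwlock.c / s_lock.c code (which drives its waits through perform_spin_delay() and per-platform atomics).

```c
/*
 * Illustrative sketch of a busy-wait on a flag bit (NOT PostgreSQL source).
 * The holder is expected to clear the bit within a few instructions, so a
 * waiter just retries with short sleeps, and only after a long timeout does
 * it conclude the lock is stuck.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define LOCKED_FLAG        0x1u
#define MAX_DELAY_USEC     100000UL     /* cap each sleep at 100 ms (arbitrary) */
#define STUCK_TIMEOUT_USEC 60000000UL   /* give up after ~60 s of waiting (arbitrary) */

static void
wait_list_lock(atomic_uint *state)
{
	unsigned long delay = 1;            /* start with a 1 us sleep */
	unsigned long waited = 0;

	for (;;)
	{
		unsigned int old = atomic_fetch_or(state, LOCKED_FLAG);

		if ((old & LOCKED_FLAG) == 0)
			return;                     /* we set the bit: lock acquired */

		/* Someone else holds it; wait a bit and retry. */
		if (waited >= STUCK_TIMEOUT_USEC)
		{
			fprintf(stderr, "PANIC: stuck spinlock detected\n");
			abort();
		}
		usleep((useconds_t) delay);
		waited += delay;
		if (delay < MAX_DELAY_USEC)
			delay *= 2;                 /* simple exponential backoff */
	}
}

static void
wait_list_unlock(atomic_uint *state)
{
	atomic_fetch_and(state, ~LOCKED_FLAG);
}

int
main(void)
{
	atomic_uint state = 0;

	wait_list_lock(&state);
	/* ... manipulate the protected wait list here ... */
	wait_list_unlock(&state);
	printf("lock acquired and released without getting stuck\n");
	return 0;
}
```

The point of the sketch is why explanations #1 and #2 above are plausible at all: if the atomic set/clear of the flag is miscompiled or misemulated, or the waiter never re-reads fresh state, the retry loop can run until the timeout and PANIC even though nothing is "really" holding the lock.  If you want to follow the suggestion in #1, something along the lines of running gdb on the server binary (on Debian-derived systems presumably /usr/lib/postgresql/11/bin/postgres; the exact path and availability of symbols are assumptions) and issuing `disassemble LWLockWaitListLock` would let you inspect the generated wait loop.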