Re: BUG #16990: Random PANIC in qemu user context - Mailing list pgsql-bugs
From | Paul Guyot |
---|---|
Subject | Re: BUG #16990: Random PANIC in qemu user context |
Date | |
Msg-id | 86C24765-95F7-464F-9677-B09A396A5F69@kallisys.net |
In response to | Re: BUG #16990: Random PANIC in qemu user context (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: BUG #16990: Random PANIC in qemu user context |
List | pgsql-bugs |
> Not sure what to tell you, other than "make sure qemu and your
> build toolchain are up-to-date".

In this scenario, I use PostgreSQL 11.11 as compiled by the Raspbian folks. I also used the qemu binary provided by Ubuntu for focal, which happens to be 4.2 (not the latest).

I found the corresponding function by using readelf to locate the string constant. For the record, the C function is here:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L811

The tight read loop is as follows:

    32b548: e28d0004 add r0, sp, #4
    32b54c: eb000679 bl 32cf38 <perform_spin_delay@@Base>
    32b550: e5943004 ldr r3, [r4, #4]
    32b554: e3130201 tst r3, #268435456 ; 0x10000000
    32b558: 1afffffa bne 32b548 <RememberSimpleDeadLock@@Base+0xc4>

At address 32b550, it does perform a read, honoring the volatile pointer.

I guess the lock is acquired by the same function:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L824

The corresponding code is the following:

    32b508: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
    32b50c: e1953f9f ldrex r3, [r5]
    32b510: e3832201 orr r2, r3, #268435456 ; 0x10000000
    32b514: e1851f92 strex r1, r2, [r5]
    32b518: e3510000 cmp r1, #0
    32b51c: 1afffffa bne 32b50c <RememberSimpleDeadLock@@Base+0x88>
    32b520: e3130201 tst r3, #268435456 ; 0x10000000
    32b524: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
    32b528: 0a00000e beq 32b568 <RememberSimpleDeadLock@@Base+0xe4>

mcr 15, 0, r0, cr7, cr10, {5} is __sync_synchronize() and, based on the previous instructions, r5 is equal to r4+4 as used in the tight loop.

I also guess the corresponding unlock function just follows, and disassembling it reveals the same use of __sync_synchronize().
    32b644: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
    32b648: e1932f9f ldrex r2, [r3]
    32b64c: e3c22201 bic r2, r2, #268435456 ; 0x10000000
    32b650: e1831f92 strex r1, r2, [r3]
    32b654: e3510000 cmp r1, #0
    32b658: 1afffffa bne 32b648 <RememberSimpleDeadLock@@Base+0x1c4>
    32b65c: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
    32b660: e8bd8070 pop {r4, r5, r6, pc}

The QEMU user emulation documentation mentions something specific to threading on ARM.
https://qemu.readthedocs.io/en/latest/user/main.html

> Threading:
> On Linux, QEMU can emulate the clone syscall and create a real host thread (with a separate virtual CPU) for each emulated thread. Note that not all targets currently emulate atomic operations correctly. x86 and Arm use a global lock in order to preserve their semantics.

I have yet to determine what impact it could have here. Can we imagine a situation where the memory barrier was not honored and an unlock would be overwritten with a lock?

Finally, I tried running the whole script with taskset -c 0 (which is fine for the tests, as the target system, a Raspberry Pi Zero, is single core, while GitHub Linux runners have 2 vCPUs).
https://github.com/pguyot/pynab/commit/91011e68e446c69e317fd1198c58f85ff0cd5fb1
https://github.com/pguyot/pynab/runs/2486051700?check_suite_focus=true

I have run it four times so far, and no PostgreSQL PANIC has happened. So your hypothesis of a bug (limitation) of qemu 4.2 seems probable…

FYI, newer ARM architectures, starting with armv7l, have a dedicated instruction for memory barriers, which is not used here as it is not recognized by the Raspberry Pi Zero CPU.

Paul
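For readers who prefer C to ARM listings, the three fragments above can be approximated with C11 atomics. This is only a sketch of the pattern the disassembly suggests, not the PostgreSQL source: the names wait_list_lock, wait_list_unlock, spin_delay and lost_unlock_leaves_flag_set are hypothetical, and only the 0x10000000 mask is taken from the listings. The last function replays, in a single thread, the interleaving asked about above (an unlock landing between another thread's ldrex and a strex that wrongly succeeds); whether qemu 4.2 can actually produce it is an assumption, not something established here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 28, the constant tested and set in the listings. */
#define FLAG_LOCKED ((uint32_t) 0x10000000)

/* Hypothetical stand-in for the back-off helper called in the read loop. */
static void spin_delay(void) { }

/* Acquire: atomically set the flag, then spin on plain reads while held. */
static void wait_list_lock(_Atomic uint32_t *state)
{
    for (;;) {
        /* Plays the role of the ldrex/orr/strex loop with its barriers. */
        uint32_t old = atomic_fetch_or(state, FLAG_LOCKED);
        if (!(old & FLAG_LOCKED))
            return;             /* flag was clear before: lock acquired */

        /* Tight read loop: delay, reload without a barrier, test bit 28. */
        do {
            spin_delay();
            old = atomic_load_explicit(state, memory_order_relaxed);
        } while (old & FLAG_LOCKED);
        /* Flag looked clear: retry the atomic acquire. */
    }
}

/* Release: atomically clear the flag (the ldrex/bic/strex loop). */
static void wait_list_unlock(_Atomic uint32_t *state)
{
    uint32_t old = atomic_fetch_and(state, ~FLAG_LOCKED);
    assert(old & FLAG_LOCKED);  /* releasing a lock that was not held */
    (void) old;
}

/*
 * Single-threaded replay of the hypothesised failure. A holds the lock,
 * B starts an acquire, A unlocks, and B's store is (incorrectly) allowed
 * to succeed with the stale value it loaded.
 */
static bool lost_unlock_leaves_flag_set(void)
{
    uint32_t state = FLAG_LOCKED;      /* A holds the lock */
    uint32_t b_seen = state;           /* B: ldrex sees the flag set */
    state &= ~FLAG_LOCKED;             /* A: unlock completes in between */
    state = b_seen | FLAG_LOCKED;      /* B: strex wrongly reports success */

    bool b_thinks_it_owns = !(b_seen & FLAG_LOCKED);
    /* Flag is set, yet neither A nor B believes it holds the lock. */
    return (state & FLAG_LOCKED) && !b_thinks_it_owns;
}
```

In that replay the flag ends up set with no owner, so every later waiter would stay in the read loop until the spin delay gives up. That would be consistent with the PANIC disappearing when everything is pinned to one CPU with taskset, but it remains a guess.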