Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Non-reproducible AIO failure |
Date | |
Msg-id | CA+hUKGK2woMXTbG9xsuQ-d3o8N8du40F6tH9sAiKCY3eTN_VXQ@mail.gmail.com |
In response to | Non-reproducible AIO failure (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Non-reproducible AIO failure |
List | pgsql-hackers |
On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure. The trace goes back no further than
>
> (lldb) bt
> * thread #1
>   * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
>
> That's quite odd in itself: while I don't find the debugging
> environment on macOS to be the greatest, it's not normally
> this unhelpful.

(And Alexander reported the same off-list.) It's interesting that the
elog.c backtrace stuff is able to analyse the stack and it looks
normal AFAICS. Could that be interfering with the stack in the core?!
I doubt it ... I kinda wonder if the debugger might be confused about
libsystem sometimes since it has ceased to be a regular Mach-O file on
disk, but IDK; maybe gdb (from MacPorts etc) would offer a clue?

So far we have:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"),
File: "bufmgr.c", Line: 1605, PID: 20931
0   postgres   0x0000000105299c84 ExceptionalCondition + 108
1   postgres   0x00000001051159ac WaitReadBuffers + 616
2   postgres   0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres   0x0000000105111630 read_stream_next_buffer + 300
4   postgres   0x0000000104e0b994 heap_fetch_next_buffer + 136
5   postgres   0x0000000104e018f4 heapgettup_pagemode + 204

Hmm, looking around that code and wracking my brain for things that
might happen on one OS but not others, I wonder about partial I/Os.
Perhaps combined with some overlapping requests.

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c",
Line: 161, PID: 32355
0   postgres   0x0000000104f078f4 ExceptionalCondition + 236
1   postgres   0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres   0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres   0x0000000104c2d4e8 FileStartReadV + 252
4   postgres   0x0000000104c807c8 mdstartreadv + 668
5   postgres   0x0000000104c83db0 smgrstartreadv + 116

But this one seems like a more basic confusion... wild writes
somewhere? Hmm, we need to see what's in that struct.

If we can't get a debugger to break there or a core file to be
analysable, maybe we should try logging as much info as possible at
those points to learn a bit more? I would be digging like that myself
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3). It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down and I can't even convince it to dump core files right
now. Could you guys please share your exact repro steps?
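
To make the "log as much as possible at those points" idea concrete, here is a minimal, self-contained sketch. It assumes a made-up stand-in struct (DemoHandle) and made-up helper names (demo_describe_handle, demo_check_before_start); none of these are PostgreSQL's real handle layout or functions, and the real fields would have to be taken from the AIO code itself. The point is only the pattern: dump every field of the struct just before the check that would otherwise trap.

```c
#include <stdio.h>

/* Hypothetical stand-in for an AIO handle; NOT the real PostgreSQL struct. */
typedef enum
{
    DEMO_OP_INVALID = 0,
    DEMO_OP_READV,
    DEMO_OP_WRITEV
} DemoOp;

typedef struct DemoHandle
{
    int     state;       /* hypothetical lifecycle state */
    DemoOp  op;          /* expected to be DEMO_OP_INVALID before start */
    int     owner_pid;   /* hypothetical owning process */
    long    generation;  /* hypothetical reuse counter */
} DemoHandle;

/*
 * Format every field into buf so the whole struct can be logged in one
 * line.  Returns the length snprintf() would have written.
 */
int
demo_describe_handle(const DemoHandle *h, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "handle %p: state=%d op=%d owner_pid=%d generation=%ld",
                    (const void *) h, h->state, (int) h->op,
                    h->owner_pid, h->generation);
}

/*
 * Check the invariant that would otherwise be asserted.  On violation,
 * log the full struct to stderr first.  Returns 1 if ok, 0 if violated.
 */
int
demo_check_before_start(const DemoHandle *h)
{
    if (h->op != DEMO_OP_INVALID)
    {
        char    buf[256];

        demo_describe_handle(h, buf, sizeof(buf));
        fprintf(stderr, "unexpected op before start: %s\n", buf);
        return 0;
    }
    return 1;
}
```

In a real patch the equivalent logging call would sit immediately before the failing Assert, so the struct contents land in the server log even when no usable core file is produced.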