Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Non-reproducible AIO failure
Date
Msg-id CA+hUKGK2woMXTbG9xsuQ-d3o8N8du40F6tH9sAiKCY3eTN_VXQ@mail.gmail.com
In response to Non-reproducible AIO failure  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure.  The trace goes back no further than
>
> (lldb) bt
> * thread #1
>   * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
>
> That's quite odd in itself: while I don't find the debugging
> environment on macOS to be the greatest, it's not normally
> this unhelpful.

(And Alexander reported the same off-list.)  It's interesting that the
elog.c backtrace stuff is able to analyse the stack and it looks
normal AFAICS.  Could that be interfering with the stack in the core?!
I doubt it ... I kinda wonder if the debugger might be confused about
libsystem sometimes since it has ceased to be a regular Mach-O file on
disk, but IDK; maybe gdb (from MacPorts etc) would offer a clue?

So far we have:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931
0   postgres                            0x0000000105299c84 ExceptionalCondition + 108
1   postgres                            0x00000001051159ac WaitReadBuffers + 616
2   postgres                            0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000105111630 read_stream_next_buffer + 300
4   postgres                            0x0000000104e0b994 heap_fetch_next_buffer + 136
5   postgres                            0x0000000104e018f4 heapgettup_pagemode + 204

Hmm, looking around that code and wracking my brain for things that
might happen on one OS but not others, I wonder about partial I/Os.
Perhaps combined with some overlapping requests.
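
For anyone following along who hasn't dealt with partial I/Os: a read can
legitimately return fewer bytes than were asked for, and the caller has to
notice and continue from where it stopped.  Here is a minimal sketch of that
pattern using plain synchronous pread().  This is NOT PostgreSQL's AIO code,
and read_fully() and read_file_range() are hypothetical names made up for
illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): read up to len bytes starting
 * at offset, continuing after short reads.  Returns the number of bytes
 * read, which is less than len only if EOF was reached, or -1 on error.
 */
static ssize_t
read_fully(int fd, char *buf, size_t len, off_t offset)
{
    size_t      done = 0;

    assert(buf != NULL);

    while (done < len)
    {
        ssize_t     n = pread(fd, buf + done, len - done,
                              offset + (off_t) done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;          /* real error, errno is set by pread() */
        }
        if (n == 0)
            break;              /* EOF: nothing more to read */

        /* Possibly a short read: account for it and ask for the rest. */
        done += (size_t) n;
    }
    return (ssize_t) done;
}

/* Hypothetical convenience wrapper: open path, read a range, close. */
static ssize_t
read_file_range(const char *path, char *buf, size_t len, off_t offset)
{
    int         fd = open(path, O_RDONLY);
    ssize_t     result;

    if (fd < 0)
        return -1;
    result = read_fully(fd, buf, len, offset);
    close(fd);
    return result;
}
```

The point is only that the "continue after a short read" step is a place
where bookkeeping can go wrong, especially if a second request overlapping
the same range is in flight at the same time.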

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0   postgres                            0x0000000104f078f4 ExceptionalCondition + 236
1   postgres                            0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres                            0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres                            0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                            0x0000000104c807c8 mdstartreadv + 668
5   postgres                            0x0000000104c83db0 smgrstartreadv + 116

But this one seems like a more basic confusion...  wild writes
somewhere?  Hmm, we need to see what's in that struct.

If we can't get a debugger to break there or a core file to be
analysable, maybe we should try logging as much info as possible at
those points to learn a bit more?  I would be digging like that myself
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3).  It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down and I can't even convince it to dump core files right
now.  Could you guys please share your exact repro steps?
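
To make the logging idea a bit more concrete, here is a rough sketch of the
kind of thing I mean: format the interesting fields of the handle into one
line just before the assertion would fire.  DemoIoHandle and its fields are
hypothetical stand-ins, not the real handle struct layout, and in the server
the line would presumably go out through elog(LOG, ...) rather than into a
caller-supplied buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the AIO handle; the field names here are
 * assumptions for illustration, not the real struct definition.
 */
typedef struct DemoIoHandle
{
    int         state;          /* lifecycle state of the handle */
    int         op;             /* operation, 0 standing in for "invalid" */
    int         result_status;  /* status of the last result */
    unsigned    generation;     /* reuse counter */
} DemoIoHandle;

/*
 * Format one diagnostic line describing the handle.  "where" names the
 * call site.  Returns what snprintf() returns.
 */
static int
describe_handle(char *out, size_t outlen, const char *where,
                const DemoIoHandle *ioh)
{
    assert(out != NULL && where != NULL && ioh != NULL);

    return snprintf(out, outlen,
                    "%s: state=%d op=%d result_status=%d generation=%u",
                    where, ioh->state, ioh->op, ioh->result_status,
                    ioh->generation);
}
```

Emitting that at the two failing call sites would at least show whether the
handle looks stale (wrong generation or state) or scribbled on (nonsense
values everywhere).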


