Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | Re: Non-reproducible AIO failure |
Date | |
Msg-id | f975aa25-eddd-45ca-90b5-855338ed75d2@gmail.com Whole thread Raw |
In response to | Non-reproducible AIO failure (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
Hello Thomas, 24.05.2025 14:42, Thomas Munro wrote: > On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So it seems that "very low-probability issue in our Mac AIO code" is >> the most probable description. > There isn't any macOS-specific AIO code so my first guess would be > that it might be due to aarch64 weak memory reordering (though Andres > speculated that itt should all be one backend, huh), if it's not just > a timing luck thing. Alexander, were the other OSes you tried all on > x86? As I wrote off-list before, I had tried x86_64 only, but since then I tried to reproduce the issue on an aarch64 server with Ubuntu 24.04, running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've also multiplied "test: brin ..." line x10. But the issue is still not reproduced (in 8+ hours). However, I've managed to get an AIO-related assertion failure on macOS 14.5 (Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl the same way: 2 # +++ tap check in src/test/recovery_2 +++ 3 # +++ tap check in src/test/recovery_3 +++ 4 # +++ tap check in src/test/recovery_4 +++ 5 # +++ tap check in src/test/recovery_5 +++ 1 # +++ tap check in src/test/recovery_1 +++ 7 # +++ tap check in src/test/recovery_7 +++ 6 # +++ tap check in src/test/recovery_6 +++ 8 # +++ tap check in src/test/recovery_8 +++ 1 [16:07:31] t/027_stream_regress.pl .. 1 Dubious, test returned 29 (wstat 7424, 0x1d00) 1 All 1 subtests passed 1 [16:07:31] 1 1 Test Summary Report 1 ------------------- 1 t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0) 1 Non-zero exit status: 29 1 Parse errors: No plan found in TAP output 1 Files=1, Tests=1, 197 wallclock secs ( 0.01 usr 0.01 sys + 4.47 cusr 5.79 csys = 10.28 CPU) 1 Result: FAIL TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355 0 postgres 0x0000000104f078f4 ExceptionalCondition + 236 1 postgres 0x0000000104c0ebd4 pgaio_io_before_start + 260 2 postgres 0x0000000104c0ea94 pgaio_io_start_readv + 36 3 postgres 0x0000000104c2d4e8 FileStartReadV + 252 4 postgres 0x0000000104c807c8 mdstartreadv + 668 5 postgres 0x0000000104c83db0 smgrstartreadv + 116 6 postgres 0x0000000104c1675c AsyncReadBuffers + 1444 7 postgres 0x0000000104c15868 StartReadBuffersImpl + 1196 8 postgres 0x0000000104c153ac StartReadBuffers + 64 9 postgres 0x0000000104c129b4 read_stream_start_pending_read + 1204 10 postgres 0x0000000104c12018 read_stream_look_ahead + 812 11 postgres 0x0000000104c11cd0 read_stream_next_buffer + 1396 12 postgres 0x0000000104606c38 heap_fetch_next_buffer + 284 13 postgres 0x00000001045f79d4 heapgettup_pagemode + 192 14 postgres 0x00000001045f7fa4 heap_getnextslot + 84 15 postgres 0x000000010497a404 table_scan_getnextslot + 340 16 postgres 0x000000010497a210 SeqNext + 148 17 postgres 0x000000010497a8e4 ExecScanFetch + 772 18 postgres 0x000000010497a4b8 ExecScanExtended + 164 19 postgres 0x0000000104979bf0 ExecSeqScanWithProject + 244 20 postgres 0x0000000104953ce0 ExecProcNode + 68 21 postgres 0x000000010494f148 MultiExecPrivateHash + 64 22 postgres 0x000000010494ed48 MultiExecHash + 100 23 postgres 0x0000000104925a18 MultiExecProcNode + 180 24 postgres 0x0000000104957220 ExecHashJoinImpl + 628 25 postgres 0x00000001049566dc ExecHashJoin + 32 26 postgres 0x0000000104925958 ExecProcNodeFirst + 100 27 postgres 0x000000010491c08c ExecProcNode + 68 28 postgres 0x000000010491692c ExecutePlan + 268 29 postgres 0x000000010491679c standard_ExecutorRun + 504 30 pg_stat_statements.dylib 0x00000001055e132c pgss_ExecutorRun + 212 31 postgres 0x0000000104916580 ExecutorRun + 72 32 postgres 0x000000010491f59c ParallelQueryMain + 508 33 postgres 0x00000001046968d4 ParallelWorkerMain + 1712 34 postgres 0x0000000104b476f4 BackgroundWorkerMain + 824 35 postgres 0x0000000104b4b87c postmaster_child_launch + 492 36 postgres 0x0000000104b56610 StartBackgroundWorker + 416 37 postgres 0x0000000104b51378 maybe_start_bgworkers + 552 38 postgres 0x0000000104b540cc LaunchMissingBackgroundProcesses + 1316 39 postgres 0x0000000104b51650 ServerLoop + 616 40 postgres 0x0000000104b4fef0 PostmasterMain + 6632 41 postgres 0x00000001049bfe20 main + 952 42 dyld 0x00000001849d60e0 start + 2360 2025-05-24 16:07:23.427 EDT [22120:4] LOG: background worker "parallel worker" (PID 32355) was terminated by signal 6: Abort trap: 6 2025-05-24 16:07:23.427 EDT [22120:5] DETAIL: Failed process was running: select count(*) from join_foo left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss I'm yet to see the Assert triggered on the buildfarm, but this one looks interesting too. (I can share the complete patch + script for such testing, if it can be helpful.) Best regards, Alexander Lakhin Neon (https://neon.tech)
pgsql-hackers by date: