Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | Re: Non-reproducible AIO failure |
Date | |
Msg-id | effb92fd-978f-479c-9fe9-aef04d04c96f@gmail.com Whole thread Raw |
In response to | Non-reproducible AIO failure (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Non-reproducible AIO failure
|
List | pgsql-hackers |
Hello Thomas, 25.05.2025 05:45, Thomas Munro wrote: >> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355 > Can you get a core and print *ioh in the debugger? It looks like I managed to restore all the conditions needed to reproduce that Assert more or less reliably (within a couple of hours), so I can continue experiments. I've added the following debugging: +++ b/src/backend/storage/aio/aio_io.c @@ -158,6 +158,7 @@ pgaio_io_before_start(PgAioHandle *ioh) Assert(ioh->state == PGAIO_HS_HANDED_OUT); Assert(pgaio_my_backend->handed_out_io == ioh); Assert(pgaio_io_has_target(ioh)); +if (ioh->op != PGAIO_OP_INVALID) elog(LOG, "!!!pgaio_io_before_start| ioh: %p, ioh->op: %d", ioh, ioh->op); Assert(ioh->op == PGAIO_OP_INVALID); --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -67,6 +67,7 @@ #include "utils/resowner.h" #include "utils/timestamp.h" +#include "storage/aio_internal.h" ... @@ -1855,7 +1860,11 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress) pgaio_submit_staged(); ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return); +fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(2): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle *)ioh)->op, blocknum); } +else +fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(1): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle *)ioh)->op, blocknum); + /* * Check if we can start IO on the first to-be-read buffer. ... (using fprintf here to rule out side effects of interrupt processing inside elog(), and it looks like fprintf there eases the reproducing) --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -712,7 +712,7 @@ sub init print $conf "wal_log_hints = on\n"; print $conf "hot_standby = on\n"; # conservative settings to ensure we can run multiple postmasters: - print $conf "shared_buffers = 1MB\n"; + print $conf "shared_buffers = 512kB\n"; print $conf "max_connections = 10\n"; (it looks like it increases the probability of failure too, will gather more statistics...) And got: ... !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 121 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 122 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 123 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 124 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 125 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 126 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 1, blocknum: 127 2025-05-30 14:38:40.812 EDT [97859:121] pg_regress/replica_identity LOG: !!!pgaio_io_before_start| ioh: 0x106efd540, ioh->op: 1 2025-05-30 14:38:40.812 EDT [97859:122] pg_regress/replica_identity STATEMENT: SELECT c.oid, n.nspname, c.relname FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname OPERATOR(pg_catalog.~) '^(test_replica_identity3)$' COLLATE pg_catalog.default AND pg_catalog.pg_table_is_visible(c.oid) ORDER BY 2, 3; TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 97859 0 postgres 0x0000000105823878 ExceptionalCondition + 236 1 postgres 0x000000010552ab54 pgaio_io_before_start + 452 2 postgres 0x000000010552a954 pgaio_io_start_readv + 36 3 postgres 0x000000010554946c FileStartReadV + 252 4 postgres 0x000000010559c74c mdstartreadv + 668 5 postgres 0x000000010559fd34 smgrstartreadv + 116 6 postgres 0x00000001055326e0 AsyncReadBuffers + 1664 7 postgres 0x00000001055317e8 StartReadBuffersImpl + 1196 8 postgres 0x000000010553132c StartReadBuffers + 64 9 postgres 0x000000010552e934 read_stream_start_pending_read + 1204 10 postgres 0x000000010552df98 read_stream_look_ahead + 812 11 postgres 0x000000010552dc50 read_stream_next_buffer + 1396 12 postgres 0x0000000104f22af8 heap_fetch_next_buffer + 284 13 postgres 0x0000000104f13894 heapgettup_pagemode + 192 14 postgres 0x0000000104f13e64 heap_getnextslot + 84 15 postgres 0x00000001052962c4 table_scan_getnextslot + 340 16 postgres 0x00000001052960d0 SeqNext + 148 17 postgres 0x00000001052967a4 ExecScanFetch + 772 18 postgres 0x0000000105296378 ExecScanExtended + 164 ... And another failed run: ... !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 30 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 31 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 34 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 35 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 36 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 37 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 38 !!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 253 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 39 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 40 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 41 !!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 254 !!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 255 2025-05-30 22:48:00.265 EDT [5043:16] pg_regress/memoize LOG: !!!pgaio_io_before_start| ioh: 0x102d14f10, ioh->op: 1 2025-05-30 22:48:00.265 EDT [5043:17] pg_regress/memoize CONTEXT: PL/pgSQL function explain_memoize(text,boolean) line 5 at FOR over EXECUTE statement 2025-05-30 22:48:00.265 EDT [5043:18] pg_regress/memoize STATEMENT: SELECT explain_memoize(' SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN LATERAL (SELECT t1.twenty AS c1, t2.unique1 AS c2, t2.two FROM tenk1 t2) s ON t1.two = s.two WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false); TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 5043 ... 0 postgres 0x000000010163b878 ExceptionalCondition + 236 1 postgres 0x0000000101342b54 pgaio_io_before_start + 452 2 postgres 0x0000000101342954 pgaio_io_start_readv + 36 3 postgres 0x000000010136146c FileStartReadV + 252 4 postgres 0x00000001013b474c mdstartreadv + 668 5 postgres 0x00000001013b7d34 smgrstartreadv + 116 6 postgres 0x000000010134a6e0 AsyncReadBuffers + 1664 7 postgres 0x00000001013497e8 StartReadBuffersImpl + 1196 8 postgres 0x000000010134932c StartReadBuffers + 64 9 postgres 0x0000000101346934 read_stream_start_pending_read + 1204 10 postgres 0x0000000101345f98 read_stream_look_ahead + 812 11 postgres 0x0000000101345c50 read_stream_next_buffer + 1396 12 postgres 0x0000000100d3aaf8 heap_fetch_next_buffer + 284 13 postgres 0x0000000100d2b894 heapgettup_pagemode + 192 14 postgres 0x0000000100d2be64 heap_getnextslot + 84 15 postgres 0x00000001010ae2c4 table_scan_getnextslot + 340 16 postgres 0x00000001010ae0d0 SeqNext + 148 ... I run the same loop as shown at [1], but with a random number of parallel processes: np=$((1 + $RANDOM % 5)), and as I can see the failure is more likely with np == 5. I also tried running stress-ng --hdd 1 --timeout 60m concurrently, but I'm still not sure if it affects the reproducing directly. I use MacBook M1 and don't know how to disable E/P-cores, moreover, from my observation, the test performs slower (up to 10%) when it runs for hours in a loop, probably because of the CPU heating up and lowering it's frequency, so if the issue is really timing-dependent, it's not that easy to get more or less the same timings once more (I tried to add sleep into the loop, to let CPU cool down...). [1] https://www.postgresql.org/message-id/87673f76-1ed7-44dc-8e8b-6aaf78e2817a%40gmail.com Best regards, Alexander Lakhin Neon (https://neon.tech)
pgsql-hackers by date: