IO in wrong state on riscv64 - Mailing list pgsql-hackers
From: Alexander Lakhin
Subject: IO in wrong state on riscv64
Date:
Msg-id: d79691be-22bd-457d-9d90-18033b78c40a@gmail.com
List: pgsql-hackers
Hello Andres,

I've spotted one more interesting AIO-related failure on a riscv64 machine [1]:

=== dumping /home/gburd/build/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/regression.diffs ===
diff -U3 /home/gburd/build/HEAD/pgsql/src/test/regress/expected/tsearch.out /home/gburd/build/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/tsearch.out
--- /home/gburd/build/HEAD/pgsql/src/test/regress/expected/tsearch.out	2025-10-09 00:06:10.612959611 +0000
+++ /home/gburd/build/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/tsearch.out	2025-10-09 00:27:46.697588198 +0000
@@ -163,11 +163,7 @@
 (1 row)
 
 SELECT count(*) FROM test_tsvector WHERE a @@ '!qe <2> qt';
- count
--------
-     6
-(1 row)
-
+ERROR: IO in wrong state: 0
 SELECT count(*) FROM test_tsvector WHERE a @@ '!(pl <-> yh)';
  count
 -------

I've managed to reproduce it using qemu-system-riscv64 with Debian trixie (6.11.10-riscv64, clang 19.1.4):

$ CC=clang-19 ./configure --enable-debug --enable-cassert --enable-tap-tests && time make -j8 -s
...
configure: using CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wmissing-variable-declarations -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-format-truncation -Wno-cast-function-type-strict -g -O2

when running `make check` in a loop (it fails within 10 iterations for me; a sketch of such a loop follows the config below) with the reduced parallel_schedule:

test: test_setup
test: boolean char name varchar text int2 int4 int8 oid float8 x10

and extra.config:

log_line_prefix = '%m [%p:%l] %q%a '
log_connections = 'true'
log_disconnections = 'true'
log_statement = 'all'
fsync = off
restart_after_crash = off
debug_parallel_query = regress
shared_buffers = 512kB
backtrace_functions = 'pgaio_io_wait'
# mostly reflects greenfly's config, plus shared_buffers decreased
# below 027_stream_regress's 1MB.
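For reference, a minimal sketch of such a loop; the driver itself is only illustrative, not the literal script from this setup. It assumes the reduced schedule above has replaced src/test/regress/parallel_schedule and that the extra.config shown above is saved as /tmp/extra.config and handed to pg_regress via --temp-config through EXTRA_REGRESS_OPTS:

# illustrative repro loop (paths and the EXTRA_REGRESS_OPTS mechanism are
# assumptions); stops at the first iteration where `make check` fails
for i in $(seq 1 10); do
    echo "=== make check, iteration $i ==="
    make -s check EXTRA_REGRESS_OPTS="--temp-config=/tmp/extra.config" || break
done

pg_regress exits non-zero when any test fails, so the loop stops at the first failing iteration, with regression.diffs left behind for inspection.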
With the following debug logging added:

--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -578,6 +578,20 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
 	PgAioHandleState state;
 	bool		am_owner;
 
+int inistate = ioh->state;
+if (inistate == PGAIO_HS_IDLE)
+{
+	fprintf(stderr, "!!!pgaio_io_wait[%d]| initial ioh->state: %d\n", getpid(), inistate);
+}
+for (int i = 0; i < 10; i++)
+{
+	int curstate = ioh->state;
+	if (curstate != inistate)
+	{
+		elog(PANIC, "!!!pgaio_io_wait| ioh->state changed from %d to %d at iteration %d", inistate, curstate, i);
+		break;
+	}
+}
 	am_owner = ioh->owner_procno == MyProcNumber;
 
 	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
@@ -599,6 +613,10 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
 	{
 		if (pgaio_io_was_recycled(ioh, ref_generation, &state))
 			return;
+if (ioh->state == PGAIO_HS_IDLE)
+{
+	fprintf(stderr, "!!!pgaio_io_wait[%d]| ioh->state: %d (initial state: %d)\n", getpid(), ioh->state, inistate);
+}
 
 		switch ((PgAioHandleState) state)
 		{

I'm observing several cases:

1)
2025-10-11 11:45:17.947 UTC [1191355:258] pg_regress/float8 LOG: statement: SELECT x, sind(x), sind(x) IN (-1,-0.5,0,0.5,1) AS sind_exact FROM (VALUES (0), (30), (90), (150), (180), (210), (270), (330), (360)) AS t(x);
2025-10-11 11:45:17.947 UTC [1191359:206] pg_regress/int8 LOG: statement: SELECT * FROM generate_series('+4567890123456789'::int8, '+4567890123456799'::int8);
!!!pgaio_io_wait[1191744]| ioh->state: 0 (initial state: 5)
2025-10-11 11:45:17.993 GMT [1191744:1] ERROR: IO in wrong state: 0
2025-10-11 11:45:17.993 GMT [1191744:2] BACKTRACE:
	postgres: parallel worker for PID 1191355 (+0x4372a0) [0x555561b8a2a0]
	postgres: parallel worker for PID 1191355 (+0x442c66) [0x555561b95c66]
	postgres: parallel worker for PID 1191355 (StartBufferIO+0x66) [0x555561b95ac4]
	postgres: parallel worker for PID 1191355 (+0x43e014) [0x555561b91014]
	postgres: parallel worker for PID 1191355 (StartReadBuffer+0x308) [0x555561b90958]
...
	postgres: parallel worker for PID 1191355 (+0x3194b2) [0x555561a6c4b2]
	/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffb9ef791c]
	/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffb9ef79c4]
	postgres: parallel worker for PID 1191355 (_start+0x20) [0x5555618382c8]
...
2025-10-11 11:45:17.999 UTC [1191355:259] pg_regress/float8 ERROR: IO in wrong state: 0
2025-10-11 11:45:17.999 UTC [1191355:260] pg_regress/float8 BACKTRACE:
	postgres: debian regression [local] SELECT(ProcessParallelMessages+0x114) [0x5555618dca6c]
	postgres: debian regression [local] SELECT(ProcessInterrupts+0x340) [0x555561bce864]
	postgres: debian regression [local] SELECT(+0x2e0e9c) [0x555561a33e9c]
	postgres: debian regression [local] SELECT(standard_ExecutorRun+0x10c) [0x555561a19c20]
...
	/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffb9ef791c]
	/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffb9ef79c4]
	postgres: debian regression [local] SELECT(_start+0x20) [0x5555618382c8]
2025-10-11 11:45:17.999 UTC [1191355:261] pg_regress/float8 STATEMENT: SELECT x, sind(x), sind(x) IN (-1,-0.5,0,0.5,1) AS sind_exact FROM (VALUES (0), (30), (90), (150), (180), (210), (270), (330), (360)) AS t(x);
2025-10-11 11:45:17.999 UTC [1189893:98] LOG: background worker "parallel worker" (PID 1191744) exited with exit code 1

2)
!!!pgaio_io_wait[1353521]| ioh->state: 0 (initial state: 4)
...
2025-10-11 12:48:00.503 UTC [1353521:1] ERROR: IO in wrong state: 0
2025-10-11 12:48:00.503 UTC [1353521:2] BACKTRACE:
	postgres: parallel worker for PID 1353449 (+0x4372a0) [0x5555905562a0]
	postgres: parallel worker for PID 1353449 (+0x442c66) [0x555590561c66]
	postgres: parallel worker for PID 1353449 (StartBufferIO+0x66) [0x555590561ac4]
	postgres: parallel worker for PID 1353449 (+0x43e014) [0x55559055d014]
	postgres: parallel worker for PID 1353449 (StartReadBuffer+0x308) [0x55559055c958]
...
	/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffaacf791c]
	/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffaacf79c4]
	postgres: parallel worker for PID 1353449 (_start+0x20) [0x5555902042c8]
2025-10-11 12:48:00.503 UTC [1353521:3] STATEMENT: SELECT c.f1 FROM NAME_TBL c WHERE c.f1 ~ '[0-9]';
...
2025-10-11 12:48:00.512 UTC [1352470:51] LOG: background worker "parallel worker" (PID 1353521) exited with exit code 1
2025-10-11 12:48:00.513 UTC [1353449:24] pg_regress/name ERROR: IO in wrong state: 0
2025-10-11 12:48:00.513 UTC [1353449:25] pg_regress/name BACKTRACE:
	postgres: debian regression [local] SELECT(ProcessParallelMessages+0x114) [0x5555902a8a6c]
	postgres: debian regression [local] SELECT(ProcessInterrupts+0x340) [0x55559059a864]
	postgres: debian regression [local] SELECT(+0x2e0e9c) [0x5555903ffe9c]
	postgres: debian regression [local] SELECT(standard_ExecutorRun+0x10c) [0x5555903e5c20]
	postgres: debian regression [local] SELECT(+0x481a30) [0x5555905a0a30]
	postgres: debian regression [local] SELECT(PortalRun+0x180) [0x5555905a06e6]
	postgres: debian regression [local] SELECT(+0x480880) [0x55559059f880]
	postgres: debian regression [local] SELECT(PostgresMain+0xdfc) [0x55559059d7d8]
	postgres: debian regression [local] SELECT(+0x4793f0) [0x5555905983f0]
	postgres: debian regression [local] SELECT(postmaster_child_launch+0x132) [0x5555904f787c]
	postgres: debian regression [local] SELECT(+0x3dcd52) [0x5555904fbd52]
	postgres: debian regression [local] SELECT(InitProcessGlobals+0) [0x5555904f9c96]
	postgres: debian regression [local] SELECT(+0x3194b2) [0x5555904384b2]
	/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffaacf791c]
	/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffaacf79c4]
	postgres: debian regression [local] SELECT(_start+0x20) [0x5555902042c8]
2025-10-11 12:48:00.513 UTC [1353449:26] pg_regress/name STATEMENT: SELECT c.f1 FROM NAME_TBL c WHERE c.f1 ~ '[0-9]';

I also observed authentication failures caused by this error.

3)
!!!pgaio_io_wait[1169773]| initial ioh->state: 0
2025-10-11 11:34:46.793 GMT [1169773:1] PANIC: !!!pgaio_io_wait| ioh->state changed from 0 to 1 at iteration 0
# no other iteration number observed
2025-10-11 11:34:46.793 GMT [1169773:2] BACKTRACE:
	postgres: parallel worker for PID 1169684 (+0x4372a0) [0x55555c3492a0]
	postgres: parallel worker for PID 1169684 (+0x442c66) [0x55555c354c66]
	postgres: parallel worker for PID 1169684 (StartBufferIO+0x66) [0x55555c354ac4]
	postgres: parallel worker for PID 1169684 (+0x43e014) [0x55555c350014]
	postgres: parallel worker for PID 1169684 (StartReadBuffer+0x308) [0x55555c34f958]
...
	/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fff8faf791c]
	/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fff8faf79c4]
	postgres: parallel worker for PID 1169684 (_start+0x20) [0x55555bff72c8]
2025-10-11 11:34:46.804 UTC [1169684:93] pg_regress/text ERROR: !!!pgaio_io_wait| ioh->state changed from 0 to 1 at iteration 0
2025-10-11 11:34:46.804 UTC [1169684:94] pg_regress/text BACKTRACE:
	postgres: debian regression [local] SELECT(ProcessParallelMessages+0x114) [0x55555c09ba6c]
	postgres: debian regression [local] SELECT(ProcessInterrupts+0x340) [0x55555c38d864]
	postgres: debian regression [local] SELECT(+0x2e0e9c) [0x55555c1f2e9c]
	postgres: debian regression [local] SELECT(standard_ExecutorRun+0x10c) [0x55555c1d8c20]
...
	/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fff8faf791c]
	/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fff8faf79c4]
	postgres: debian regression [local] SELECT(_start+0x20) [0x55555bff72c8]

With CFLAGS="-O0", it fails much sooner...

The latter case is probably fine; I only added that loop to check whether the field visibly changes on the fly...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=greenfly&dt=2025-10-09%2000%3A06%3A03

Best regards,
Alexander