Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) - Mailing list pgsql-hackers
From | Alexander Lakhin
Subject | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date |
Msg-id | 7b0b1053-0465-b3a6-0e8f-b5291b7510fa@gmail.com
In response to | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) (Alexander Lakhin <exclusion@gmail.com>)
Responses | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
List | pgsql-hackers
Hello,

03.09.2023 00:00, Alexander Lakhin wrote:
> I'll try to test this guess on the target machine...

I got access to dikkop thanks to Tomas Vondra and started reproducing the issue. It was rather difficult to catch the lockup, as Tomas and Tom noticed before. I tried using stress-ng to influence reproduction and gradually managed to reduce the join_hash test to just a single query (had I been venturous enough, I could have reduced the test to it in a single step, but I chose to move little by little). In the end I constructed a reproducing script that could catch the lockup on that ARM machine within an hour on average.

Today I booted FreeBSD-14.0-ALPHA4 on my secondary AMD (Ryzen 7 3700X) machine and managed to reproduce the issue with that script (see attachment). (I run the script just after `gmake && gmake check`.) To reproduce it faster on my machine, I run it on tmpfs and use parallel:

for ((i=1;i<=4;i++)); do echo "ITERATION $i"; parallel --halt now,fail=1 -j8 --linebuffer --tag time .../repro+.sh {} ::: 01 02 03 04 || break; done

It produces the lockup as follows:

ITERATION 1
...
02 waiting for server to shut down.... done
02 server stopped
02
02 real    3m2.420s
02 user    0m2.896s
02 sys     0m1.685s
04 TIMEOUT on iteration 448
04
04 real    3m16.212s
04 user    0m2.322s
04 sys     0m1.904s
...

psql -p 15435 regression -c "SELECT * FROM pg_stat_activity;"
 16384 | regression | 53696 |       | 10 | user | psql |  |  | -1 | 2023-09-08 18:44:27.572426+00 | 2023-09-08 18:44:27.573633+00 | 2023-09-08 18:44:27.795518+00 | 2023-09-08 18:44:27.795519+00 | IPC | HashBuildHashOuter | active |  | 731 |  | explain (analyze) select count(*) from simple r join simple s using (id); | client backend
 16384 | regression | 53894 | 53696 | 10 | user | psql |  |  |    | 2023-09-08 18:44:27.796749+00 | 2023-09-08 18:44:27.573633+00 | 2023-09-08 18:44:27.795518+00 | 2023-09-08 18:44:27.799261+00 | IPC | HashBuildHashOuter | active |  | 731 |  | explain (analyze) select count(*) from simple r join simple s using (id); | parallel worker
 16384 | regression | 53896 | 53696 | 10 | user | psql |  |  |    | 2023-09-08 18:44:27.797436+00 | 2023-09-08 18:44:27.573633+00 | 2023-09-08 18:44:27.795518+00 | 2023-09-08 18:44:27.799291+00 | IPC | HashBuildHashInner | active |  | 731 |  | explain (analyze) select count(*) from simple r join simple s using (id); | parallel worker

procstat -i 53896
 53896 postgres        URG      P-C

tail server04.log
2023-09-08 18:44:27.777 UTC|user|regression|53696|64fb6b8b.d1c0|LOG: statement: explain (analyze) select count(*) from simple r join simple s using (id);
2023-09-08 18:44:27.786 UTC|user|regression|53696|64fb6b8b.d1c0|LOG: statement: explain (analyze) select count(*) from simple r join simple s using (id);
2023-09-08 18:44:27.795 UTC|user|regression|53696|64fb6b8b.d1c0|LOG: statement: explain (analyze) select count(*) from simple r join simple s using (id);
2023-09-08 18:45:38.685 UTC|[unknown]|[unknown]|66915|64fb6bd2.10563|LOG: connection received: host=[local]
2023-09-08 18:45:38.685 UTC|user|regression|66915|64fb6bd2.10563|LOG: connection authorized: user=user database=regression application_name=psql
2023-09-08 18:45:38.686 UTC|user|regression|66915|64fb6bd2.10563|LOG: statement: SELECT * FROM pg_stat_activity;

It takes less than 10 minutes on average for me.

I checked REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE forcefully undefined); they are all affected. I could not reproduce the lockup on my Ubuntu box (with HAVE_SYS_EPOLL_H undefined manually).
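(A note on why undefining those macros matters: the wait-set implementation is chosen at compile time, so with epoll/kqueue knocked out, the poll()-based code path is the one that gets built. The toy program below only illustrates that fall-through; it is my own sketch modelled on the shape of the selection logic in src/backend/storage/ipc/latch.c, not a verbatim copy.)

#include <stdio.h>

/*
 * Hypothetical fall-through, modelled on (not copied from) latch.c:
 * prefer epoll, then kqueue, then plain poll().
 */
#if defined(HAVE_SYS_EPOLL_H)
#define WAIT_IMPL "epoll"
#elif defined(HAVE_KQUEUE)
#define WAIT_IMPL "kqueue"
#else
#define WAIT_IMPL "poll"        /* the path exercised in the tests above */
#endif

int
main(void)
{
    printf("wait-set implementation: %s\n", WAIT_IMPL);
    return 0;
}

Compiling it with -DHAVE_KQUEUE or -DHAVE_SYS_EPOLL_H versus no defines shows which branch gets selected, which is what undefining those macros in the real build is meant to force.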
And surprisingly for me, I could not reproduce it on master and REL_16_STABLE. A `git bisect` for this behavior change pointed at 7389aad63 (though maybe that commit just greatly decreased the probability of the failure; I'm going to double-check this). In particular, it changed this:

-	/*
-	 * Ignore SIGURG for now. Child processes may change this (see
-	 * InitializeLatchSupport), but they will not receive any such signals
-	 * until they wait on a latch.
-	 */
-	pqsignal_pm(SIGURG, SIG_IGN);	/* ignored */
-#endif
+	/* This may configure SIGURG, depending on platform. */
+	InitializeLatchSupport();
+	InitProcessLocalLatch();

With debug logging added, I see (on 7389aad63~1) that one process really sends SIGURG to another, and the latter reaches poll(), but it just gets no signal: its signal handler is not called, and poll() just keeps waiting... So it looks like the ARM weak memory model is not the root cause of the issue. But as far as I can see, it is still specific to FreeBSD (though not to a particular compiler; I used gcc and clang with the same success).

Best regards,
Alexander
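P.S. For anyone who wants to poke at the signal-delivery angle without a full PostgreSQL build: here is a minimal standalone sketch of the self-pipe pattern that the poll()-based latch path relies on. It is my own simplification (the names and structure are mine, not the actual latch code): a SIGURG handler writes a byte into a pipe, and the process waits in poll() on the pipe's read end, so an undelivered SIGURG leaves poll() blocked forever, which matches the hang above.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int	selfpipe[2];

static void
sigurg_handler(int signo)
{
	int			save_errno = errno;

	(void) signo;
	/* Wake up the poll() below; if the pipe is already full, a wakeup is pending anyway. */
	(void) write(selfpipe[1], "x", 1);
	errno = save_errno;
}

int
main(void)
{
	struct sigaction sa;
	struct pollfd pfd;

	if (pipe(selfpipe) != 0)
	{
		perror("pipe");
		return 1;
	}
	fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
	fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigurg_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	sigaction(SIGURG, &sa, NULL);

	printf("pid %d waiting; send it SIGURG to wake it up\n", (int) getpid());
	fflush(stdout);

	pfd.fd = selfpipe[0];
	pfd.events = POLLIN;

	/* If SIGURG is never delivered, this blocks forever (the lockup symptom). */
	while (poll(&pfd, 1, -1) < 0 && errno == EINTR)
		continue;

	printf("woken up\n");
	return 0;
}

Run it, then send it SIGURG from another shell (kill -URG <pid>); if the signal is delivered, poll() returns right away, and if not, the program sits there much like the stuck parallel worker.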
Attachment