Thread: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
From
Tomas Vondra
Date:
Hi, I have noticed one of my buildfarm machines - widowbird - did not report any results since February 17. And it seems to be stuck somewhere in amcheck: $ ps ax | grep postgres 1180067 ? Ss 0:02 /mnt/data/buildfarm/buildroot/HEAD/inst/bin/postgres -D data-C 1180069 ? Ss 0:00 postgres: checkpointer 1180070 ? Ss 0:00 postgres: background writer 1180072 ? Ss 0:00 postgres: walwriter 1180073 ? Ss 0:01 postgres: autovacuum launcher 1180074 ? Ss 0:00 postgres: logical replication launcher 1180107 ? Ss 0:05 postgres: buildfarm contrib_regression_amcheck [local] INSERT 1180111 ? Ss 0:00 postgres: autovacuum worker 1180134 ? Ss 0:00 postgres: autovacuum worker 1180135 ? Ss 0:00 postgres: autovacuum worker 1374029 pts/0 S+ 0:00 grep --color=auto postgres So there's PID 1180107, executing an insert, but not progressing. The backtrace looks like this (first couple lines, full backtrace attached): #0 0x0000007fa64b8ddc in __GI_epoll_pwait (epfd=5, events=0x55ad6285a8, maxevents=1, timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42 #1 0x0000007fa64b8fe8 in epoll_wait (epfd=<optimized out>, events=<optimized out>, maxevents=<optimized out>, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:32 #2 0x000000558f043588 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ff8ed4e18, cur_timeout=-1, set=0x55ad628540) at latch.c:1571 #3 WaitEventSetWait (set=0x55ad628540, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ff8ed4e18, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217781) at latch.c:1519 #4 0x000000558f043778 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217781) at latch.c:538 #5 0x000000558f052274 in ConditionVariableTimedSleep (cv=0x7f9ac9deb0, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217781) at condition_variable.c:163 #6 0x000000558f05286c in ConditionVariableTimedSleep (wait_event_info=134217781, timeout=-1, cv=<optimized out>) at condition_variable.c:135 #7 0x000000558ed2fc90 in AdvanceXLInsertBuffer (upto=upto@entry=608174080, tli=tli@entry=1, opportunistic=opportunistic@entry=false) at xlog.c:2224 So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another interesting fact is it's testing 75dfde13639, which is just a couple commits after 6a2275b895: commit 6a2275b8953a4462d44daf001bdd60b3d48f0946 Author: Alexander Korotkov <akorotkov@postgresql.org> Date: Mon Feb 17 04:19:01 2025 +0200 Get rid of WALBufMappingLock Allow multiple backends to initialize WAL buffers concurrently. This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without taking a single LWLock in exclusive mode. ... which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the last (successful) run on widorbird was on eaf502747b, which already includes 6a2275b895, so maybe it's unrelated. Is there something else I could collect from the stuck instance, before I restart it? regards -- Tomas Vondra
Attachment
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
From
Peter Geoghegan
Date:
On Wed, Feb 26, 2025 at 5:09 PM Tomas Vondra <tomas@vondra.me> wrote: > So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another > interesting fact is it's testing 75dfde13639, which is just a couple > commits after 6a2275b895: > which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the > last (successful) run on widorbird was on eaf502747b, which already > includes 6a2275b895, so maybe it's unrelated. Are you aware that Alexander reverted this only a day later, in commit 3fb58625? -- Peter Geoghegan
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
From
Tomas Vondra
Date:
On 2/26/25 23:13, Peter Geoghegan wrote: > On Wed, Feb 26, 2025 at 5:09 PM Tomas Vondra <tomas@vondra.me> wrote: >> So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another >> interesting fact is it's testing 75dfde13639, which is just a couple >> commits after 6a2275b895: > >> which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the >> last (successful) run on widorbird was on eaf502747b, which already >> includes 6a2275b895, so maybe it's unrelated. > > Are you aware that Alexander reverted this only a day later, in commit 3fb58625? > Oh, I forgot about that. I guess if the theory is the commit might be causing this, and it's been already reverted, I should just restart the instance, so that it tests with the revert. regards -- Tomas Vondra
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
From
Peter Geoghegan
Date:
On Wed, Feb 26, 2025 at 5:16 PM Tomas Vondra <tomas@vondra.me> wrote: > Oh, I forgot about that. I guess if the theory is the commit might be > causing this, and it's been already reverted, I should just restart the > instance, so that it tests with the revert. According to the revert commit's commit message there was also problems on batta at the time. Might have been the same issue that you saw on widowbird. Alexander? -- Peter Geoghegan
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
From
Michael Paquier
Date:
On Wed, Feb 26, 2025 at 05:23:51PM -0500, Peter Geoghegan wrote: > According to the revert commit's commit message there was also > problems on batta at the time. Might have been the same issue that you > saw on widowbird. Alexander? This problem has already been discussed, and I saw the same backtrace as you do on my hosts: https://www.postgresql.org/message-id/Z7PSeT7AbrjmiCCV@paquier.xyz It is not the first time where we have a problematic patch causing some animals to freeze, so I would recommend using a wait_timeout => 43200 or similar in the configuration of your buildfarm animals so as they would unfreeze themselves without the need for a manual intervention. This setting has saved me quite a few maintenance cycles. -- Michael
Attachment
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
From
Alexander Korotkov
Date:
On Thu, Feb 27, 2025 at 12:24 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Feb 26, 2025 at 5:16 PM Tomas Vondra <tomas@vondra.me> wrote: > > Oh, I forgot about that. I guess if the theory is the commit might be > > causing this, and it's been already reverted, I should just restart the > > instance, so that it tests with the revert. > > According to the revert commit's commit message there was also > problems on batta at the time. Might have been the same issue that you > saw on widowbird. Alexander? Yes, the backtrace posted by Tomas from widowbird looks quite the same as posted by Michael from batta. My apologies for committing not well-tested patch. ------ Regards, Alexander Korotkov Supabase