Thread: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)

Hi,

I have noticed one of my buildfarm machines - widowbird - did not report
any results since February 17. And it seems to be stuck somewhere in
amcheck:

$ ps ax | grep postgres
1180067 ?        Ss     0:02
/mnt/data/buildfarm/buildroot/HEAD/inst/bin/postgres -D data-C
1180069 ?        Ss     0:00 postgres: checkpointer
1180070 ?        Ss     0:00 postgres: background writer
1180072 ?        Ss     0:00 postgres: walwriter
1180073 ?        Ss     0:01 postgres: autovacuum launcher
1180074 ?        Ss     0:00 postgres: logical replication launcher
1180107 ?        Ss     0:05 postgres: buildfarm
contrib_regression_amcheck [local] INSERT
1180111 ?        Ss     0:00 postgres: autovacuum worker
1180134 ?        Ss     0:00 postgres: autovacuum worker
1180135 ?        Ss     0:00 postgres: autovacuum worker
1374029 pts/0    S+     0:00 grep --color=auto postgres

So there's PID 1180107, executing an insert, but not progressing. The
backtrace looks like this (first couple lines, full backtrace attached):

#0  0x0000007fa64b8ddc in __GI_epoll_pwait (epfd=5, events=0x55ad6285a8,
maxevents=1, timeout=timeout@entry=-1, set=set@entry=0x0) at
../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1  0x0000007fa64b8fe8 in epoll_wait (epfd=<optimized out>,
events=<optimized out>, maxevents=<optimized out>, timeout=timeout@entry=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:32
#2  0x000000558f043588 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ff8ed4e18, cur_timeout=-1, set=0x55ad628540) at
latch.c:1571
#3  WaitEventSetWait (set=0x55ad628540, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ff8ed4e18,
nevents=nevents@entry=1,
    wait_event_info=wait_event_info@entry=134217781) at latch.c:1519
#4  0x000000558f043778 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1,
wait_event_info=wait_event_info@entry=134217781)
    at latch.c:538
#5  0x000000558f052274 in ConditionVariableTimedSleep (cv=0x7f9ac9deb0,
timeout=timeout@entry=-1,
wait_event_info=wait_event_info@entry=134217781) at condition_variable.c:163
#6  0x000000558f05286c in ConditionVariableTimedSleep
(wait_event_info=134217781, timeout=-1, cv=<optimized out>) at
condition_variable.c:135
#7  0x000000558ed2fc90 in AdvanceXLInsertBuffer
(upto=upto@entry=608174080, tli=tli@entry=1,
opportunistic=opportunistic@entry=false) at xlog.c:2224

So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another
interesting fact is it's testing 75dfde13639, which is just a couple
commits after 6a2275b895:

    commit 6a2275b8953a4462d44daf001bdd60b3d48f0946
    Author: Alexander Korotkov <akorotkov@postgresql.org>
    Date:   Mon Feb 17 04:19:01 2025 +0200

    Get rid of WALBufMappingLock

    Allow multiple backends to initialize WAL buffers concurrently.
    This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in
    parallel without taking a single LWLock in exclusive mode.

    ...

which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the
last (successful) run on widowbird was on eaf502747b, which already
includes 6a2275b895, so maybe it's unrelated.

Is there something else I could collect from the stuck instance, before
I restart it?

regards

-- 
Tomas Vondra

On Wed, Feb 26, 2025 at 5:09 PM Tomas Vondra <tomas@vondra.me> wrote:
> So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another
> interesting fact is it's testing 75dfde13639, which is just a couple
> commits after 6a2275b895:

> which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the
> last (successful) run on widowbird was on eaf502747b, which already
> includes 6a2275b895, so maybe it's unrelated.

Are you aware that Alexander reverted this only a day later, in commit 3fb58625?

--
Peter Geoghegan



On 2/26/25 23:13, Peter Geoghegan wrote:
> On Wed, Feb 26, 2025 at 5:09 PM Tomas Vondra <tomas@vondra.me> wrote:
>> So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another
>> interesting fact is it's testing 75dfde13639, which is just a couple
>> commits after 6a2275b895:
> 
>> which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the
>> last (successful) run on widowbird was on eaf502747b, which already
>> includes 6a2275b895, so maybe it's unrelated.
> 
> Are you aware that Alexander reverted this only a day later, in commit 3fb58625?
> 

Oh, I forgot about that. I guess if the theory is that the commit is
causing this, and it's already been reverted, I should just restart the
instance so that it tests with the revert.

regards

-- 
Tomas Vondra




On Wed, Feb 26, 2025 at 5:16 PM Tomas Vondra <tomas@vondra.me> wrote:
> Oh, I forgot about that. I guess if the theory is that the commit is
> causing this, and it's already been reverted, I should just restart the
> instance so that it tests with the revert.

According to the revert commit's commit message, there were also
problems on batta at the time. It might have been the same issue that
you saw on widowbird. Alexander?


--
Peter Geoghegan



On Wed, Feb 26, 2025 at 05:23:51PM -0500, Peter Geoghegan wrote:
> According to the revert commit's commit message, there were also
> problems on batta at the time. It might have been the same issue that
> you saw on widowbird. Alexander?

This problem has already been discussed, and I saw the same backtrace
as yours on my own hosts:
https://www.postgresql.org/message-id/Z7PSeT7AbrjmiCCV@paquier.xyz

It is not the first time that a problematic patch has caused some
animals to freeze, so I would recommend setting wait_timeout => 43200
or similar in the configuration of your buildfarm animals, so that
they unfreeze themselves without the need for manual intervention.
This setting has saved me quite a few maintenance cycles.
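For what it's worth, that would presumably look something like the
following in the animal's build-farm.conf (the exact placement within
the config hash is my assumption -- check against your existing file):

```perl
# build-farm.conf -- inside the animal's configuration hash
wait_timeout => 43200,    # abandon a run stuck for more than 12 hours
```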
--
Michael

On Thu, Feb 27, 2025 at 12:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Feb 26, 2025 at 5:16 PM Tomas Vondra <tomas@vondra.me> wrote:
> > Oh, I forgot about that. I guess if the theory is that the commit is
> > causing this, and it's already been reverted, I should just restart the
> > instance so that it tests with the revert.
>
> According to the revert commit's commit message, there were also
> problems on batta at the time. It might have been the same issue that
> you saw on widowbird. Alexander?

Yes, the backtrace posted by Tomas from widowbird looks much the same
as the one posted by Michael from batta.  My apologies for committing
an insufficiently tested patch.

------
Regards,
Alexander Korotkov
Supabase