From: Michael Paquier
Subject: Re: BF mamba failure
Msg-id: aMNrSgJt6_oRqh45@paquier.xyz
In response to: Re: BF mamba failure (Kouber Saparev <kouber@gmail.com>)
List: pgsql-hackers
On Thu, Sep 11, 2025 at 04:35:01PM +0300, Kouber Saparev wrote:
> The pattern is the same, although I am not 100% sure that the same replica
> is having this - it is a cascaded streaming replica, i.e. a replica of
> another replica. Once we had this in October 2024, with version 15.4, then
> in August 2025 with 17.3, and now in September again (17.3). The database
> is working for month(s) perfectly fine in a heavy production workload (lots
> of WALs etc.), and then all of a sudden it shuts down.

The shutdown is caused by the startup process choking on redo; an
error during WAL replay is fatal for a standby, hence the whole node
going down.

> Thanks for the feedback, and let me know if I could provide any additional
> info.

Okay, the bit about the cascading standby is a useful piece of
information.  Do you have some data about the relation this is
choking on, based on the OID reported in the error message?  Is it
actively used in read-only workloads, with the relation looked at on
the cascading standby?  Is hot_standby_feedback enabled in the
cascading standby?  How has this cascading standby been created?
Does the workload of the primary involve a high consumption of OIDs
for relations, say many temporary tables?
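
For example, something like this should cover most of that (a rough
sketch, where "mydb" and 16384 are placeholders for the database and
the OID from the error message):

    # On the primary, identify the relation behind the OID.
    psql -d mydb -c "SELECT oid, relname, relkind, relpersistence
                     FROM pg_class WHERE oid = 16384;"

    # On the cascading standby, check the feedback setting and
    # whether the relation is actually read there.
    psql -d mydb -c "SHOW hot_standby_feedback;"
    psql -d mydb -c "SELECT seq_scan, idx_scan
                     FROM pg_stat_user_tables WHERE relid = 16384;"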

Another thing that may help is the WAL record history.  Are you, for
example, seeing attempts to drop the same pgstats entry twice in the
WAL records?  Perhaps the origin of the problem is in this area; a
refcount of 2 is of course relevant to that.
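
The drops of stats entries ride on the transaction commit and abort
records, so pg_waldump should, if I recall correctly, show them in
the record descriptions as kind/dboid/objoid entries.  Something like
this (path and segment name being placeholders):

    # Scan the Transaction records for dropped stats entries.
    pg_waldump -p /path/to/pg_wal -r Transaction \
        000000010000000000000042 | grep -i 'dropped stats'

A duplicate drop of the same triplet in this output would point at
the area to dig into.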

I have looked around a bit, but nothing has popped up here, so as far
as I know you seem to be the only one impacted by this.

1d6a03ea4146 and dc5f9054186a are in 17.3, so perhaps something is
still off with the drop of stats entries when applied to cascading
standbys.  A vital piece of information would also be the
"generation" counter, which shows up in the error report thanks to
bdda6ba30cbe, included in 17.6.  A first step would be to update to
17.6 and see how things go for these cascading setups.  If it takes a
couple of weeks to get one report, this hunt may take a few months at
least, unless somebody, me or someone else, is able to find the race
condition here.
--
Michael
