Re: BF mamba failure

From Kouber Saparev
Subject Re: BF mamba failure
Date
Msg-id CAN4RuQvQ3ATcYvfTR1LzJnUJXpo_F8mgz-+WxoZsyusLLmCwYA@mail.gmail.com
In response to Re: BF mamba failure  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers


On Fri, Sep 12, 2025 at 3:37, Michael Paquier <michael@paquier.xyz> wrote:
> Okay, the bit about the cascading standby is a useful piece of
> information.  Do you have some data about the relation reported in the
> error message this is choking on based on its OID?  Is this actively
> used in read-only workloads, with the relation looked at in the
> cascading standby?

This objoid=767325170 is non-existent, and the one from the previous shutdown (objoid=4169049057) does not exist either. So I guess it is something quasi-temporary that was dropped afterwards.
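By non-existent I mean that a catalog lookup of the sort below comes back empty (the OID being the one from the error message):

    SELECT oid, relname, relpersistence
    FROM pg_class
    WHERE oid = 767325170;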
 
> Is hot_standby_feedback enabled in the cascading
> standby?

Yes, hot_standby_feedback = on.
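As far as I understand, feedback from a cascading standby is relayed upstream hop by hop, so it can be cross-checked from SQL, just as a sketch:

    -- on the cascading standby
    SHOW hot_standby_feedback;

    -- on the upstream server: the xmin reported through feedback, if any
    SELECT application_name, backend_xmin, state
    FROM pg_stat_replication;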
 
> With which process has this cascading standby been created?
> Does the workload of the primary involve a high consumption of OIDs
> for relations, say many temporary tables?

Yes, we have around 150 entries added and deleted per second in pg_class, and around 800 in pg_attribute. So something is actively creating and dropping tables all the time.
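For the record, one way to measure that churn is to sample the cumulative counters on the catalogs twice, a known interval apart, and divide by the elapsed seconds:

    SELECT relname, n_tup_ins, n_tup_del
    FROM pg_stat_sys_tables
    WHERE relname IN ('pg_class', 'pg_attribute');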
 

> Another thing that may help is the WAL record history.  Are you for
> example seeing attempts to drop twice the same pgstats entry in WAL
> records?  Perhaps the origin of the problem is in this area.  A
> refcount of 2 is relevant, of course.

How could we dig into this, i.e. how do we inspect such attempts in the WAL records?
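For instance, would running pg_waldump over the kept segments and filtering on the Transaction rmgr be the right approach? The commit/abort records seem to be where the dropped stats entries are listed. Or, staying in SQL with the pg_walinspect extension, something along these lines (the two LSNs are just placeholders for the range to scan):

    CREATE EXTENSION IF NOT EXISTS pg_walinspect;

    SELECT start_lsn, xid, record_type, description
    FROM pg_get_wal_records_info('A12/FE000028', 'A13/04000028')
    WHERE resource_manager = 'Transaction'
      AND description LIKE '%dropped stats%'
      AND description LIKE '%767325170%';  -- the objoid from the error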
 

> I have looked a bit around but nothing has popped up here, so as far
> as I know you seem to be the only one impacted by that.
>
> 1d6a03ea4146 and dc5f9054186a are in 17.3, so perhaps something is
> still off with the drop when applied to cascading standbys.  A vital
> piece of information may also be with "generation", which would show
> up in the error report thanks to bdda6ba30cbe, and that's included in
> 17.6.  A first thing would be to update to 17.6 and see how things
> go for these cascading setups.  If it takes a couple of weeks to have
> one report, we have a hunt that may take a few months at least, except
> if somebody is able to find out the race condition here, me or someone
> else.


Is it enough to upgrade the replicas, or do we need to upgrade the primary as well?

--
Kouber 
