Okay, the bit about the cascading standby is a useful piece of information. Do you have any data about the relation the error message is choking on, based on its OID? Is it actively used in read-only workloads, i.e. is the relation being queried on the cascading standby?
This objoid=767325170 does not exist, and neither does the one reported at the previous shutdown (objoid=4169049057). So I guess it is something quasi-temporary that was dropped afterwards.
Is hot_standby_feedback enabled in the cascading standby?
Yes, hot_standby_feedback = on.
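For reference, the setting can be confirmed directly on the cascading standby with something like the following (connection options omitted):

    # Run against the cascading standby, not the primary.
    psql -Atc "SHOW hot_standby_feedback"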
How has this cascading standby been created? Does the workload on the primary involve a high consumption of relation OIDs, say many temporary tables?
Yes, we have around 150 entries added and deleted per second in pg_class, and around 800 in pg_attribute. So something is actively creating and dropping tables all the time.
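For what it's worth, one rough way to estimate such rates, assuming the cumulative n_tup_ins/n_tup_del counters of pg_stat_sys_tables on the primary, is to sample them twice and divide the deltas by the interval, e.g.:

    # Sample the catalog insert/delete counters twice, 60 seconds apart;
    # (second value - first value) / 60 gives a per-second rate.
    # Connection options are omitted and would need adjusting.
    psql -Atc "SELECT relname, n_tup_ins, n_tup_del
               FROM pg_stat_sys_tables
               WHERE relname IN ('pg_class', 'pg_attribute')"
    sleep 60
    psql -Atc "SELECT relname, n_tup_ins, n_tup_del
               FROM pg_stat_sys_tables
               WHERE relname IN ('pg_class', 'pg_attribute')"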
Another thing that may help is the WAL record history. Are you, for example, seeing attempts to drop the same pgstats entry twice in the WAL records? Perhaps the origin of the problem is in this area. The refcount of 2 is relevant here, of course.
How could we dig into this, i.e. inspect such attempts in the WAL records?
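Presumably pg_waldump would be the tool for that; if I understand correctly, stats entries dropped at transaction end show up in the description of commit/abort records as "dropped stats: kind/dboid/objoid" items, so one could scan the relevant segments and look for the same entry being dropped more than once. A rough sketch (the WAL path and segment names are made up and would need adjusting):

    # Limit the dump to the Transaction resource manager, then search for the
    # objoid from the error report among the dropped stats entries.
    pg_waldump --rmgr=Transaction -p /path/to/pg_wal \
        000000010000000A00000001 000000010000000A000000FF \
      | grep 'dropped stats' \
      | grep '767325170'

Seeing the same kind/dboid/objoid triplet dropped twice in that output would point at the double-drop scenario.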
I have looked around a bit but nothing has popped up here, so as far as I know you seem to be the only one impacted by this.
Commits 1d6a03ea4146 and dc5f9054186a are in 17.3, so perhaps something is still off with the drop when it is applied on cascading standbys. A vital piece of information may also come from the "generation", which would show up in the error report thanks to bdda6ba30cbe, and that is included in 17.6. A first step would be to update to 17.6 and see how things go for these cascading setups. If it takes a couple of weeks to get one report, we are in for a hunt that may take a few months at least, unless somebody, me or someone else, is able to find the race condition here.
Is it enough to upgrade the replicas, or do we need to upgrade the primary as well?