Thread: Recent pg_rewind test failures in buildfarm

Recent pg_rewind test failures in buildfarm

From

Tom Lane

Date:

15 April, 05:58:28

In the last day or so, both skink and mamba have hit this
in the pg_rewind test suite [1][2]:

#3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66 
#4  0x01d7994c in pgstat_report_stat (force=force@entry=true) at pgstat.c:734
#5  0x01d7a248 in pgstat_shutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:615
#6  0x01d1e7c0 in shmem_exit (code=code@entry=0) at ipc.c:243
#7  0x01d1e92c in proc_exit_prepare (code=code@entry=0) at ipc.c:198
#8  0x01d1e9fc in proc_exit (code=0) at ipc.c:111
#9  0x01cd5bd0 in ProcessRepliesIfAny () at walsender.c:2294
#10 0x01cd8084 in WalSndLoop (send_data=send_data@entry=0x1cd7628 <XLogSendPhysical>) at walsender.c:2790
#11 0x01cd8f40 in StartReplication (cmd=0xfd048210) at walsender.c:960
#12 exec_replication_command (cmd_string=cmd_string@entry=0xfd53c098 "START_REPLICATION 0/3000000 TIMELINE 2") at
walsender.c:2135
#13 0x01d5bd84 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4767

That assert appears to be several years old, and the
008_min_recovery_point.pl test script that's triggering it hasn't
changed very recently either, so I'm baffled where to start digging.
It has the odor of a timing problem, so maybe we just started hitting
this by chance.  Still ... anybody have an idea?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-13%2018%3A55%3A03
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-04-15%2001%3A00%3A04

Re: Recent pg_rewind test failures in buildfarm

From

Alexander Lakhin

Date:

15 April, 08:00:00

Hello Tom,

15.04.2025 05:58, Tom Lane wrote:
> In the last day or so, both skink and mamba have hit this
> in the pg_rewind test suite [1][2]:
>
> #3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66
 
> ...
>
> That assert appears to be several years old, and the
> 008_min_recovery_point.pl test script that's triggering it hasn't
> changed very recently either, so I'm baffled where to start digging.
> It has the odor of a timing problem, so maybe we just started hitting
> this by chance.  Still ... anybody have an idea?
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-13%2018%3A55%3A03
> [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-04-15%2001%3A00%3A04
>

FWIW, there was also another test (028_row_filter.pl) failed due to that
assert triggered: [3].

I've managed to reproduce this failure locally when running 10 instances
of 008_min_recovery_point.pl in parallel under Valgrind.

Will try to investigate the issue during this week.

[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-11%2007%3A41%3A36

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: Recent pg_rewind test failures in buildfarm

From

Bertrand Drouvot

Date:

15 April, 08:50:32

Hi,

On Tue, Apr 15, 2025 at 08:00:00AM +0300, Alexander Lakhin wrote:
> Hello Tom,
> 
> 15.04.2025 05:58, Tom Lane wrote:
> > In the last day or so, both skink and mamba have hit this
> > in the pg_rewind test suite [1][2]:
> > 
> > #3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66
 
> > ...
> > 
> > That assert appears to be several years old, and the
> > 008_min_recovery_point.pl test script that's triggering it hasn't
> > changed very recently either, so I'm baffled where to start digging.
> > It has the odor of a timing problem, so maybe we just started hitting
> > this by chance.  Still ... anybody have an idea?

Sorry, can't look at the details right now but it might be linked to 
039549d70f6 which is recent enough and in this area. Will give it a look once
I've time.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Recent pg_rewind test failures in buildfarm

From

Michael Paquier

Date:

15 April, 09:02:15

On Mon, Apr 14, 2025 at 10:58:28PM -0400, Tom Lane wrote:
> That assert appears to be several years old, and the
> 008_min_recovery_point.pl test script that's triggering it hasn't
> changed very recently either, so I'm baffled where to start digging.
> It has the odor of a timing problem, so maybe we just started hitting
> this by chance.  Still ... anybody have an idea?

Likely 039549d70f6a.  The proc_exit(0) calls done at this stage are
reporting that we have non-flushed stats, but we should have some
remaining around.  This sequence is part of the pgstat_shutdown_hook
callback, which has been initialized by pgstat_initialize() done for
the WAL senders in BaseInit().  It looks like we are going to need to
be more aggressive with the stats flushes, so as we have nothing left
when reaching this stage of the shutdown in ProcessRepliesIfAny().

(Sorry for the fuzzy reply, I cannot dive into all the details now,
that's just a quick take on the matter so I may be wrong with my
assumptions.)
--
Michael

Attachment

signature.asc

Re: Recent pg_rewind test failures in buildfarm

From

Andres Freund

Date:

15 April, 13:04:54

Hi,

On 2025-04-14 22:58:28 -0400, Tom Lane wrote:
> In the last day or so, both skink and mamba have hit this
> in the pg_rewind test suite [1][2]:
> 
> #3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66
 
> #4  0x01d7994c in pgstat_report_stat (force=force@entry=true) at pgstat.c:734
> #5  0x01d7a248 in pgstat_shutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:615
> #6  0x01d1e7c0 in shmem_exit (code=code@entry=0) at ipc.c:243
> #7  0x01d1e92c in proc_exit_prepare (code=code@entry=0) at ipc.c:198
> #8  0x01d1e9fc in proc_exit (code=0) at ipc.c:111
> #9  0x01cd5bd0 in ProcessRepliesIfAny () at walsender.c:2294
> #10 0x01cd8084 in WalSndLoop (send_data=send_data@entry=0x1cd7628 <XLogSendPhysical>) at walsender.c:2790
> #11 0x01cd8f40 in StartReplication (cmd=0xfd048210) at walsender.c:960
> #12 exec_replication_command (cmd_string=cmd_string@entry=0xfd53c098 "START_REPLICATION 0/3000000 TIMELINE 2") at
walsender.c:2135
> #13 0x01d5bd84 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4767
> 
> That assert appears to be several years old, and the
> 008_min_recovery_point.pl test script that's triggering it hasn't
> changed very recently either, so I'm baffled where to start digging.
> It has the odor of a timing problem, so maybe we just started hitting
> this by chance.  Still ... anybody have an idea?

See also https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq

I am fairly certain it's the fault of 039549d70f6

Greetings,

Andres Freund

Re: Recent pg_rewind test failures in buildfarm

From

Andres Freund

Date:

15 April, 13:20:51

Hi,

On 2025-04-15 06:04:54 -0400, Andres Freund wrote:
> See also https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq

Huh, turns out my emails of the last few days had gotten stuck. I only noticed
when trying to open the above link and it was dead.  I had been wondering why
neither Bertrand nor Michael had reacted to the mail...

Greetings,

Andres Freund

Re: Recent pg_rewind test failures in buildfarm

From

Michael Paquier

Date:

21 April, 04:17:40

On Tue, Apr 15, 2025 at 05:50:32AM +0000, Bertrand Drouvot wrote:
> Sorry, can't look at the details right now but it might be linked to
> 039549d70f6 which is recent enough and in this area. Will give it a look once
> I've time.

Playing catch-up with various things this week, and I have been
looking at this one.

So, we are triggering this assertion in the shutdown sequence of the
WAL sender because there is nothing to flush based on what the
callbacks are reporting, still pending_since could have been set by a
previous call of pgstat_report_stat(), which could come from a
PostgresMain() path for example, depending on the frequency of such
calls.  The important point is that we don't lose WAL sender stats at
shutdown, and well, we don't lose any data for the WAL sender based on
what this assertion tells us, just that there is some friction with
the new I/O and backend flush calls.

pg_stat_io has been added in v16, but isn't that something that could
be reached even today down to v15?  For example, imagine the case of a
background worker that does periodic stats reports with interactions
on existing stats.  pgstats stored in shmem has been added in v15.

Thoughts?
--
Michael

Attachment

Re: Recent pg_rewind test failures in buildfarm

From

Bertrand Drouvot

Date:

22 April, 09:22:51

Hi,

On Mon, Apr 21, 2025 at 10:17:40AM +0900, Michael Paquier wrote:
> On Tue, Apr 15, 2025 at 05:50:32AM +0000, Bertrand Drouvot wrote:
> > Sorry, can't look at the details right now but it might be linked to 
> > 039549d70f6 which is recent enough and in this area. Will give it a look once
> > I've time.
> 
> Playing catch-up with various things this week, and I have been
> looking at this one.
> 
> So, we are triggering this assertion in the shutdown sequence of the
> WAL sender because there is nothing to flush based on what the
> callbacks are reporting, still pending_since could have been set by a
> previous call of pgstat_report_stat(), which could come from a
> PostgresMain() path for example, depending on the frequency of such
> calls.  The important point is that we don't lose WAL sender stats at
> shutdown,

Right.

> and well, we don't lose any data for the WAL sender based on
> what this assertion tells us, just that there is some friction with
> the new I/O and backend flush calls.

I do agree.

> pg_stat_io has been added in v16, but isn't that something that could
> be reached even today down to v15?  For example, imagine the case of a
> background worker that does periodic stats reports with interactions
> on existing stats.  pgstats stored in shmem has been added in v15.
> 
> Thoughts?

Yeah, unless that might come from fc415edf8ca but I don't think that's the case.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Recent pg_rewind test failures in buildfarm

From

Michael Paquier

Date:

23 April, 07:55:31

On Tue, Apr 22, 2025 at 06:22:51AM +0000, Bertrand Drouvot wrote:
> Yeah, unless that might come from fc415edf8ca but I don't think that's the case.

I've been eyeing this whole area of the code for a few hours to
convince myself on HEAD, and I cannot really find a defect directly
related to it.

I am wondering if we should do a PGSTAT_BACKEND_FLUSH_WAL in the WAL
sender, but the end of any transaction happening in a logical WAL
sender would make sure that this happens on a periodic basis.

Anyway, sorry for make everybody waiting here.  I've now removed the
assertion down to v15.
--
Michael

Attachment

signature.asc