Thread: tests against running server occasionally fail, postgres_fdw & tenk1

tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

09 December 2022, 00:15:11

Hi,

Since

commit 3f0e786ccbf
Author: Andres Freund <andres@anarazel.de>
Date:   2022-12-07 12:13:35 -0800

    meson: Add 'running' test setup, as a replacement for installcheck

CI tests the pg_regress/isolationtester tests that support doing so against a
running server.


Unfortunately cfbot shows that that doesn't work entirely reliably.

The most frequent case is postgres_fdw, which somewhat regularly fails with a
regression.diff like this:

diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out
/tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out    2022-12-08 20:35:24.772888000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out    2022-12-08
20:43:38.199450000+0000
 
@@ -9911,8 +9911,7 @@
     WHERE application_name = 'fdw_retry_check';
  pg_terminate_backend
 ----------------------
- t
-(1 row)
+(0 rows)

 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.


See e.g.
https://cirrus-ci.com/task/5925540020879360

https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log


The following comment in the test provides a hint what might be happening:

-- If debug_discard_caches is active, it results in
-- dropping remote connections after every transaction, making it
-- impossible to test termination meaningfully.  So turn that off
-- for this test.
SET debug_discard_caches = 0;


I guess that a cache reset message arrives and leads to the connection being
terminated. Unfortunately that's hard to see right now, as the relevant log
messages are output with DEBUG3 - it's quite verbose, so enabling it for all
tests will be painful.

I *think* I have seen this failure locally at least once, when running the
test normally.


I'll try to reproduce this locally for a bit. If I can't, the only other idea
I have for debugging this is to change log_min_messages in that section of the
postgres_fdw test to DEBUG3.



The second failure case I observed, at a lower frequency, is in the main
regression tests:
https://cirrus-ci.com/task/5640584912699392

https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/regress-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/runningcheck.log

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out    2022-12-08 16:49:28.239508000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out    2022-12-08
16:57:17.286650000+0000
 
@@ -1910,11 +1910,15 @@
 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)
 ORDER BY unique1;
-                      QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
-   Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: unique1
+   ->  Bitmap Heap Scan on tenk1
+         Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+         ->  Bitmap Index Scan on tenk1_unique1
+               Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)

 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)


Which I think we've seen a number of times before, even in the temp-install
case. We fixed one source of this issue in this thread:
https://www.postgresql.org/message-id/1346227.1649887693%40sss.pgh.pa.us
but it looks like there's some remaining instability.

According to the server log (link above), there's no autovacuum on
tenk1.

Unfortunately we don't log non-automatic vacuums unless they are run with
verbose, so we can't see what horizon was used (see heap_vacuum_rel()'s
computation of 'instrument').

I don't have a better idea than to either change the above, or to revert
91998539b227dfc6dd091714da7d106f2c95a321.

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

09 December 2022, 00:36:07

Hi,

On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> commit 3f0e786ccbf
> Author: Andres Freund <andres@anarazel.de>
> Date:   2022-12-07 12:13:35 -0800
> 
>     meson: Add 'running' test setup, as a replacement for installcheck
> 
> CI tests the pg_regress/isolationtester tests that support doing so against a
> running server.
> 
> 
> Unfortunately cfbot shows that that doesn't work entirely reliably.
> 
> The most frequent case is postgres_fdw, which somewhat regularly fails with a
> regression.diff like this:
> 
> diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out
/tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out    2022-12-08 20:35:24.772888000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out    2022-12-08
20:43:38.199450000+0000
 
> @@ -9911,8 +9911,7 @@
>      WHERE application_name = 'fdw_retry_check';
>   pg_terminate_backend
>  ----------------------
> - t
> -(1 row)
> +(0 rows)
> 
>  -- This query should detect the broken connection when starting new remote
>  -- transaction, reestablish new connection, and then succeed.
> 
> 
> See e.g.
> https://cirrus-ci.com/task/5925540020879360
>
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
> https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log
> 
> 
> The following comment in the test provides a hint what might be happening:
> 
> -- If debug_discard_caches is active, it results in
> -- dropping remote connections after every transaction, making it
> -- impossible to test termination meaningfully.  So turn that off
> -- for this test.
> SET debug_discard_caches = 0;
> 
> 
> I guess that a cache reset message arrives and leads to the connection being
> terminated. Unfortunately that's hard to see right now, as the relevant log
> messages are output with DEBUG3 - it's quite verbose, so enabling it for all
> tests will be painful.
> 
> I *think* I have seen this failure locally at least once, when running the
> test normally.
> 
> 
> I'll try to reproduce this locally for a bit. If I can't, the only other idea
> I have for debugging this is to change log_min_messages in that section of the
> postgres_fdw test to DEBUG3.

Oh. I tried to reproduce the issue, without success so far, but eventually my
test loop got stuck in something I reported previously and forgot about since:
https://www.postgresql.org/message-id/20220925232237.p6uskba2dw6fnwj2%40awork3.anarazel.de

I wonder if the timing on the freebsd CI task works out to hitting a "smaller
version" of the problem that eventually resolves itself, which then leads to a
sinval reset getting sent, causing the observable problem.

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

07 February 2023, 01:53:00

Hi,

On 2022-12-08 16:36:07 -0800, Andres Freund wrote:
> On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> > Unfortunately cfbot shows that that doesn't work entirely reliably.
> >
> > The most frequent case is postgres_fdw, which somewhat regularly fails with a
> > regression.diff like this:
> >
> > diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out
/tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> > --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out    2022-12-08 20:35:24.772888000 +0000
> > +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out    2022-12-08
20:43:38.199450000+0000
 
> > @@ -9911,8 +9911,7 @@
> >      WHERE application_name = 'fdw_retry_check';
> >   pg_terminate_backend
> >  ----------------------
> > - t
> > -(1 row)
> > +(0 rows)
> >
> >  -- This query should detect the broken connection when starting new remote
> >  -- transaction, reestablish new connection, and then succeed.

> >
> > I guess that a cache reset message arrives and leads to the connection being
> > terminated. Unfortunately that's hard to see right now, as the relevant log
> > messages are output with DEBUG3 - it's quite verbose, so enabling it for all
> > tests will be painful.
> >
> > I *think* I have seen this failure locally at least once, when running the
> > test normally.
> >
> >
> > I'll try to reproduce this locally for a bit. If I can't, the only other idea
> > I have for debugging this is to change log_min_messages in that section of the
> > postgres_fdw test to DEBUG3.
>
> Oh. I tried to reproduce the issue, without success so far, but eventually my
> test loop got stuck in something I reported previously and forgot about since:
> https://www.postgresql.org/message-id/20220925232237.p6uskba2dw6fnwj2%40awork3.anarazel.de
>
> I wonder if the timing on the freebsd CI task works out to hitting a "smaller
> version" of the problem that eventually resolves itself, which then leads to a
> sinval reset getting sent, causing the observable problem.

The issue referenced above is now fixed, and I haven't seen instances of it
since then. I also just now fixed the issue that often lead to failing to
upload logs.

Unfortunately the fdw_retry_check issue from above has re-occurred since then:

https://cirrus-ci.com/task/4901157940756480

https://api.cirrus-ci.com/v1/artifact/task/4901157940756480/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs


Another run hit an issue we've been fighting repeatedly on the buildfarm / CI:
https://cirrus-ci.com/task/5527490404286464

https://api.cirrus-ci.com/v1/artifact/task/5527490404286464/testrun/build/testrun/regress-running/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out    2023-02-06 23:52:43.627604000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out    2023-02-07
00:03:04.535232000+0000
 
@@ -1930,12 +1930,13 @@
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                      QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_thous_tenthous on tenk1
-   Index Cond: (thousand < 2)
-   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: thousand
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)

 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)


I'd be tempted to disable the test, but it also found genuine issues in a
bunch of CF entries, and all these test instabilities seem like ones we'd also
see, with a lower occurrence on the buildfarm.


WRT the fdw_retry_check: I wonder if we should increase the log level of
a) pgfdw_inval_callback deciding to disconnect
b) ReceiveSharedInvalidMessages() deciding to reset

to DEBUG1, at least temporarily?

Alternatively we could add a
SET log_min_messages=debug4;
...
RESET log_min_messages;

around the postgres_fdw connection re-establishment test?


One thing nudging me towards the more global approach is that I have the vague
feelign that there's a wider issue around hitting more sinval resets than we
should - and it'd right now be very hard to know about that.


Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

07 February 2023, 03:29:46

Hi,

On 2023-02-06 17:53:00 -0800, Andres Freund wrote:
> WRT the fdw_retry_check: I wonder if we should increase the log level of
> a) pgfdw_inval_callback deciding to disconnect
> b) ReceiveSharedInvalidMessages() deciding to reset
>
> to DEBUG1, at least temporarily?
>
> Alternatively we could add a
> SET log_min_messages=debug4;
> ...
> RESET log_min_messages;
>
> around the postgres_fdw connection re-establishment test?
>
>
> One thing nudging me towards the more global approach is that I have the vague
> feelign that there's a wider issue around hitting more sinval resets than we
> should - and it'd right now be very hard to know about that.

Luckily it proved to be easy enough to reproduce on a private branch, by
setting the test to repeat a couple times.

I set the aforementioned log messages to LOG. And indeed:

2023-02-07 02:54:18.773 UTC [10800][client backend] [pg_regress/postgres_fdw][7/10526:0] LOG:  cache state reset
2023-02-07 02:54:18.773 UTC [10800][client backend] [pg_regress/postgres_fdw][7/10526:0] LOG:  discarding connection
0x802251f00

that was preceded by another log message less than 200 ms before:
2023-02-07 02:54:18.588 UTC [10800][client backend] [pg_regress/postgres_fdw][7/10523:55242] STATEMENT:  ALTER SERVER
loopbackOPTIONS (application_name 'fdw_retry_check');

The log statements inbetween are about isolation/reindex-concurrently-toast
and pg_regress/indexing.

So the problem is indeed that we somehow quickly overflow the sinval queue. I
guess we need a bit more logging around the size of the sinval queue and its
"fullness"?

I'm a bit surprised that MAXNUMMESSAGES is a hardcoded 4096.  It's not
particularly surprising that that's quickly overflown?

There's something off. Isolationtester's control connection emits *loads* of
invalidation messages:
2023-02-06 19:29:06.430 PST [2125297][client backend][6/0:121864][isolation/receipt-report/control connection] LOG:
previouslyemitted 7662 messages, 21 this time

2023-02-06 19:29:06.566 PST [2125297][client backend][6/0:121873][isolation/receipt-report/control connection] LOG:
previouslyemitted 8155 messages, 99 this time

2023-02-06 19:29:06.655 PST [2125297][client backend][6/0:121881][isolation/receipt-report/control connection] LOG:
previouslyemitted 8621 messages, 99 this time

2023-02-06 19:29:06.772 PST [2125297][client backend][6/0:121892][isolation/receipt-report/control connection] LOG:
previouslyemitted 9208 messages, 85 this time

2023-02-06 19:29:06.867 PST [2125297][client backend][6/0:121900][isolation/receipt-report/control connection] LOG:
previouslyemitted 9674 messages, 85 this time

and this happens with lots of other tests.

Greetings,

Andres Freund

PS: The reindex-concurrently-toast output seems to indicate something is
broken in the test... There's lots of non-existing table references in the
expected file, without that immediately making sense.

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

07 February 2023, 04:47:05

Hi,

On 2023-02-06 19:29:46 -0800, Andres Freund wrote:
> There's something off. Isolationtester's control connection emits *loads* of
> invalidation messages:
> 2023-02-06 19:29:06.430 PST [2125297][client backend][6/0:121864][isolation/receipt-report/control connection] LOG:
previouslyemitted 7662 messages, 21 this time

> 2023-02-06 19:29:06.566 PST [2125297][client backend][6/0:121873][isolation/receipt-report/control connection] LOG:
previouslyemitted 8155 messages, 99 this time

> 2023-02-06 19:29:06.655 PST [2125297][client backend][6/0:121881][isolation/receipt-report/control connection] LOG:
previouslyemitted 8621 messages, 99 this time

> 2023-02-06 19:29:06.772 PST [2125297][client backend][6/0:121892][isolation/receipt-report/control connection] LOG:
previouslyemitted 9208 messages, 85 this time

> 2023-02-06 19:29:06.867 PST [2125297][client backend][6/0:121900][isolation/receipt-report/control connection] LOG:
previouslyemitted 9674 messages, 85 this time

Ah, that's just due to setup-teardown happening in the control connection.

FWIW, I see plenty of sinval resets even if I increase the sinval queue size
substantially. I suspect we've increased the number of sinval messages we sent
a good bit over time, due to additional syscaches.

As we only process catchup interrupts while in ReadCommand(), it's not
surprising that we very quickly get behind. A single statement suffices.

Anyway, that's not really a correctness issue, just a performance issue.

But the postgres_fdw.sql vulnerability to syscache resets seems not
great. Perhaps pgfdw_inval_callback() could check if the definition of the
foreign server actually changed, before dropping the connection?

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

08 February 2023, 02:47:48

Hi,

On 2023-02-06 17:53:00 -0800, Andres Freund wrote:
> Another run hit an issue we've been fighting repeatedly on the buildfarm / CI:
> https://cirrus-ci.com/task/5527490404286464
>
https://api.cirrus-ci.com/v1/artifact/task/5527490404286464/testrun/build/testrun/regress-running/regress/regression.diffs
>
> diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
> --- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out    2023-02-06 23:52:43.627604000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out    2023-02-07
00:03:04.535232000+0000

> @@ -1930,12 +1930,13 @@
>  SELECT thousand, tenthous FROM tenk1
>  WHERE thousand < 2 AND tenthous IN (1001,3000)
>  ORDER BY thousand;
> -                      QUERY PLAN
> --------------------------------------------------------
> - Index Only Scan using tenk1_thous_tenthous on tenk1
> -   Index Cond: (thousand < 2)
> -   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
> -(3 rows)
> +                                      QUERY PLAN
> +--------------------------------------------------------------------------------------
> + Sort
> +   Sort Key: thousand
> +   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
> +         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
> +(4 rows)
>
>  SELECT thousand, tenthous FROM tenk1
>  WHERE thousand < 2 AND tenthous IN (1001,3000)
>
>
> I'd be tempted to disable the test, but it also found genuine issues in a
> bunch of CF entries, and all these test instabilities seem like ones we'd also
> see, with a lower occurrence on the buildfarm.

The last occasion we hit this was at: https://www.postgresql.org/message-id/1346227.1649887693%40sss.pgh.pa.us

I'm working on cleaning up the patch used for debugging in that thread, to
make VACUUM log to the server log if VERBOSE isn't used.

One thing I'm not quite sure what to do about is that we atm use a hardcoded
DEBUG2 (not controlled by VERBOSE) in a bunch of places:

    ereport(DEBUG2,
            (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
                    vacrel->relname, (long long) index, vacuumed_pages)));

    ivinfo.message_level = DEBUG2;

I find DEBUG2 hard to use to run the entire regression tests, it results in a
lot of output. Lots of it far less important than these kinds of details
here. So I'd like to use a different log level for them - but without further
complications that'd mean they'd show up in VACUUUM VERBOSE.

I made them part of VERBOSE for now, but not because I think that's
necessarily the right answer, but because it could be useful for debugging
this stupid flapping test.

I right now set instrument = true when
message_level_is_interesting(DEBUG1). But that probably should be false? I set
it to true because of starttime, but it'd probably be better to just move it
out of the if (instrument). Also would require re-jiggering the condition of
the "main block" doing the logging.

FWIW, running all regression tests that support doing so against a running
server with DEBUG1 results in 8.1MB, DEBUG2 in 17MB.

Greetings,

Andres Freund

Attachment

v2-0001-wip-Log-VACUUM-information-as-DEBUG1-when-VERBOSE.patch

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Peter Geoghegan

Date:

08 February 2023, 22:03:49

On Tue, Feb 7, 2023 at 6:47 PM Andres Freund <andres@anarazel.de> wrote:
> One thing I'm not quite sure what to do about is that we atm use a hardcoded
> DEBUG2 (not controlled by VERBOSE) in a bunch of places:
>
>         ereport(DEBUG2,
>                         (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
>                                         vacrel->relname, (long long) index, vacuumed_pages)));
>
>     ivinfo.message_level = DEBUG2;
>
> I find DEBUG2 hard to use to run the entire regression tests, it results in a
> lot of output. Lots of it far less important than these kinds of details
> here. So I'd like to use a different log level for them - but without further
> complications that'd mean they'd show up in VACUUUM VERBOSE.

I think that these DEBUG2 traces are of limited value, even for
experts. I personally never use them -- I just use VACUUM
VERBOSE/autovacuum logging for everything, since it's far easier to
read, and isn't missing something that the DEBUG2 traces have. In fact
I wonder if we should even have them at all.

I generally don't care about the details when VACUUM runs out of space
for dead_items -- these days the important thing is to just avoid it
completely. I also generally don't care about how many index tuples
were deleted by each index's ambulkdelete() call, since VACUUM VERBOSE
already shows me the number of LP_DEAD stubs encountered/removed in
the heap. I can see the size of indexes and information about page
deletion in VACUUM VERBOSE these days, too.

Don't get me wrong. It *would* be useful to see more information about
each index in VACUUM VERBOSE output -- just not the number of tuples
deleted. Tuples really don't matter much at this level. But seeing
something about the number of WAL records written while vacuuming each
index is another story. That's a cost that is likely to vary in
possibly-interesting ways amongst indexes on the table, unlike
IndexBulkDeleteResult.tuples_removed, which is very noisy, and
signifies almost nothing important on its own.

-- 
Peter Geoghegan

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

09 February 2023, 00:29:35

Hi,

On 2023-02-08 14:03:49 -0800, Peter Geoghegan wrote:
> On Tue, Feb 7, 2023 at 6:47 PM Andres Freund <andres@anarazel.de> wrote:
> > One thing I'm not quite sure what to do about is that we atm use a hardcoded
> > DEBUG2 (not controlled by VERBOSE) in a bunch of places:
> >
> >         ereport(DEBUG2,
> >                         (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> >                                         vacrel->relname, (long long) index, vacuumed_pages)));
> >
> >     ivinfo.message_level = DEBUG2;
> >
> > I find DEBUG2 hard to use to run the entire regression tests, it results in a
> > lot of output. Lots of it far less important than these kinds of details
> > here. So I'd like to use a different log level for them - but without further
> > complications that'd mean they'd show up in VACUUUM VERBOSE.
>
> I think that these DEBUG2 traces are of limited value, even for
> experts. I personally never use them -- I just use VACUUM
> VERBOSE/autovacuum logging for everything, since it's far easier to
> read, and isn't missing something that the DEBUG2 traces have. In fact
> I wonder if we should even have them at all.

I find it useful information when debugging problems. Without it, the log
doesn't tell you which index was processed when a problem started to occur. Or
even that we were scanning indexes at all.

IOW, I care a lot more about log entries denoting the start / end of an index
vacuum than attached numbers. Although I also care about those, to some
degree.

> I generally don't care about the details when VACUUM runs out of space
> for dead_items -- these days the important thing is to just avoid it
> completely.

I wonder if it'd possibly make sense to log more verbosely if we do end up
running out of space, but not otherwise? Particularly because it's important
to avoid multi-pass vacuums? Right now there's basically no log output until
the vacuum finished, which in a bad situation could take days, with many index
scan cycles. Logging enough to be able to see such things happening IMO is
important.

> I also generally don't care about how many index tuples
> were deleted by each index's ambulkdelete() call, since VACUUM VERBOSE
> already shows me the number of LP_DEAD stubs encountered/removed in
> the heap.

Isn't it actually quite useful to see how many entries were removed from the
index relative to the number of LP_DEAD removed in the heap? And relative to
other indexes? E.g. to understand how effective killtuple style logic is?

One annoyance with the multiple logs as part of a [auto]vacuum is that we end
up logging the context / statement repeatedly, leading to pointless output
like this:

2023-02-08 15:55:01.117 PST [3904676][client backend][2/55:0][psql] LOG:  vacuuming "postgres.public.large"
2023-02-08 15:55:01.117 PST [3904676][client backend][2/55:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] LOG:  scanned index "large_pkey" to remove 499994
rowversions

2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] CONTEXT:  while vacuuming index "large_pkey" of
relation"public.large"

2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
...
2023-02-08 15:55:03.496 PST [3904676][client backend][2/55:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] LOG:  vacuuming
"postgres.pg_toast.pg_toast_3370138"
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] LOG:  finished vacuuming
"postgres.pg_toast.pg_toast_3370138":index scans: 0

I think we should emit most of the non-error/warning messages in vacuum with
errhidestmt(true), errhidecontext(true) to avoid that. The error context is
quite helpful for error messages due to corruption, cancellations and such,
but not for messages where we already are careful to include relation names.

I generally like the improved log output for [auto]vacuum, but the fact that
you can't see anymore when index scans start imo problematic. Right now you
can't even infer how long the first index scan takes, which really isn't
great.

I'd thus like to:

1) Use errhidestmt(true), errhidecontext(true) for vacuum
   ereport(non-error-or-warning)

2) Add a message to lazy_vacuum() or lazy_vacuum_all_indexes(), that includes
   - num_index_scans
   - how many indexes we'll scan
   - how many dead tids we're working on removing

3) Add a log at the end of lazy_vacuum_heap_rel() that's logged only (or more
   verbosely) when lazy_vacuum() was run due to running out of space

   If we just do the heap scan once, this can be easily inferred from the
   other messages.

4) When we run out of space for dead tids, increase the log level for the rest
   of vacuum. It's sufficiently bad if that happens that we really ought to
   include it in the log by default.

5) Remove the row versions from vac_bulkdel_one_index()'s message, it's
   already included in 2).

   Instead we could emit the content from vac_cleanup_one_index(), imo that's
   a lot more useful when separated for each scan.

6) Possibly remove the log output from vac_cleanup_one_index()?

2) and 3) together allow to figure out how long individual scan / vacuum
phases are taking. 1) should reduce log verbosity sufficiently to make it
easier to actually read the output.

FWIW, I'm not proposing to do all of that in one patch, once I understand a
bit more what we have concensus and what not I can split it into steps.

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Peter Geoghegan

Date:

09 February 2023, 02:37:41

On Wed, Feb 8, 2023 at 4:29 PM Andres Freund <andres@anarazel.de> wrote:
> I find it useful information when debugging problems. Without it, the log
> doesn't tell you which index was processed when a problem started to occur. Or
> even that we were scanning indexes at all.

I guess it might have some limited value when doing some sort of
top-down debugging of the regression tests, where (say) some random
buildfarm issue is the starting point, and you don't really know what
you're looking for at all. I cannot remember ever approaching things
that way myself, though.

> > I generally don't care about the details when VACUUM runs out of space
> > for dead_items -- these days the important thing is to just avoid it
> > completely.
>
> I wonder if it'd possibly make sense to log more verbosely if we do end up
> running out of space, but not otherwise? Particularly because it's important
> to avoid multi-pass vacuums? Right now there's basically no log output until
> the vacuum finished, which in a bad situation could take days, with many index
> scan cycles. Logging enough to be able to see such things happening IMO is
> important.

That definitely could make sense, but ISTM that it should be some
totally orthogonal thing.

> > I also generally don't care about how many index tuples
> > were deleted by each index's ambulkdelete() call, since VACUUM VERBOSE
> > already shows me the number of LP_DEAD stubs encountered/removed in
> > the heap.
>
> Isn't it actually quite useful to see how many entries were removed from the
> index relative to the number of LP_DEAD removed in the heap? And relative to
> other indexes? E.g. to understand how effective killtuple style logic is?

Way less than you'd think. I'd even go so far as to say that showing
users the number of index tuples deleted by VACUUM at the level of
individual indexes could be very misleading. The simplest way to see
that this is true is with an example:

Assume that we have two indexes on the same table: A UUID v4 index,
and a traditional serial/identity column style primary key index.
Further assume that index tuple deletion does a perfect job with both
indexes, in the sense that no leaf page is ever split while the
pre-split leaf page has even one single remaining delete-safe index
tuple remaining. So the index deletion stuff is equally effective in
both indexes, in a way that could be measured exactly (by custom
instrumentation code, say). What does the instrumentation of the
number of index tuples deleted by VACUUM reveal here?

I would expect the UUID index to have *many* more index tuples deleted
by VACUUM than the traditional serial/identity column style primary
key index did in the same period (unless perhaps there was an
unrealistically uniform distribution of updates across the PK's key
space). We're talking about a 3x difference here, possibly much more.
In the case of the UUID index, we'll have needed fewer opportunistic
index deletion passes because there was naturally more free space on
each leaf page due to generic B-Tree overhead -- allowing
opportunistic index tuple deletion to be much more lazy overall,
relative to how things went with the other index. In reality we get
the same optimal outcome for each index, but
IndexBulkDeleteResult.tuples_removed suggests that just the opposite
has happened. That's just perverse.

This isn't just a cute example. If anything it *understates* the
extent to which these kinds of differences are possible. I could come
up with a case where the difference was far larger still, just by
adding a few more details. Users ought to focus on the picture over
time, and the space utilization for remaining live tuples. To a large
degree it doesn't actually matter whether it's VACUUM or opportunistic
deletion that does most of the deleting, provided it happens and is
reasonably efficient. They're two sides of the same coin.

Space utilization over time for live tuples matters most. Ideally it
can be normalized to account for the effects of these workload
differences, and things like nbtree deduplication. But even just
dividing the size of the index in pages by the number of live tuples
in the index tells me plenty, with no noise from VACUUM implementation
details.

We care about signal to noise ratio. Managing the noise is no less
important than increasing the signal. It might even be more important.

> I think we should emit most of the non-error/warning messages in vacuum with
> errhidestmt(true), errhidecontext(true) to avoid that. The error context is
> quite helpful for error messages due to corruption, cancellations and such,
> but not for messages where we already are careful to include relation names.

Agreed.

> I'd thus like to:
>
> 1) Use errhidestmt(true), errhidecontext(true) for vacuum
>    ereport(non-error-or-warning)

Makes sense.

> 2) Add a message to lazy_vacuum() or lazy_vacuum_all_indexes(), that includes
>    - num_index_scans
>    - how many indexes we'll scan
>    - how many dead tids we're working on removing

It's not obvious how you can know the number of index scans at this
point. Well, it depends on how you define "index scan". It's true that
the number shown as "index scans" by VACUUM VERBOSE could be shown
here instead, earlier on. However, there are cases where VACUUM
VERBOSE shows 0 index scans, but also shows that it has scanned one or
more indexes (usually not all the indexes, just a subset). This
happens whenever an amvacuumcleanup() routine decides it needs to scan
an index to do stuff like recycle previously deleted pages.

After 14, nbtree does a pretty good job of avoiding that when it
doesn't really make sense. But it's still possible. It's also quite
common with GIN indexes, I think -- in fact it can be quite painful
there. This is a good thing for performance, of course, but it also
makes VACUUM VERBOSE show information that makes sense to users, since
things actually happen in a way that makes a lot more sense. I'm quite
happy about the fact that the new VACUUM VERBOSE allows users to
mostly ignore obscure details like whether an index was scanned by
amvacuumcleanup() or by ambulkdelete() -- stuff that basically nobody
understands. That seems worth preserving.

> 3) Add a log at the end of lazy_vacuum_heap_rel() that's logged only (or more
>    verbosely) when lazy_vacuum() was run due to running out of space
>
>    If we just do the heap scan once, this can be easily inferred from the
>    other messages.

I don't mind adding something that makes it easier to notice the
number of index scans early. However, the ambulkdelete vs
amvacuumcleanup index scan situation needs more thought.

> 4) When we run out of space for dead tids, increase the log level for the rest
>    of vacuum. It's sufficiently bad if that happens that we really ought to
>    include it in the log by default.

That makes sense. Same could be done when the failsafe triggers.

> 2) and 3) together allow to figure out how long individual scan / vacuum
> phases are taking. 1) should reduce log verbosity sufficiently to make it
> easier to actually read the output.

It's not just verbosity. It's also showing the same details
consistently for the same table over time, so that successive VACUUMs
can be compared to each other easily. The worst thing about the old
VACUUM VERBOSE was that it was inconsistent about how much it showed
in a way that made little sense, based on low level details like the
order that things happen in, not the order that actually made sense.

As I said, I don't mind making VACUUM VERBOSE behave a little bit more
like a progress indicator, which is how it used to work. Maybe I went
a little too far in the direction of neatly summarizing the whole
VACUUM operation in one go. But I doubt that I went too far with it by
all that much. Overall, the old VACUUM VERBOSE was extremely hard to
use, and was poorly maintained -- let's not go back to that. (See
commit ec196930 for evidence of how sloppily it was maintained.)

--
Peter Geoghegan

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

09 February 2023, 03:18:05

Hi,

On 2023-02-08 18:37:41 -0800, Peter Geoghegan wrote:
> On Wed, Feb 8, 2023 at 4:29 PM Andres Freund <andres@anarazel.de> wrote:
> > 2) Add a message to lazy_vacuum() or lazy_vacuum_all_indexes(), that includes
> >    - num_index_scans
> >    - how many indexes we'll scan
> >    - how many dead tids we're working on removing
>
> It's not obvious how you can know the number of index scans at this
> point. Well, it depends on how you define "index scan".

What I mean is to show the number of times we've done lazy_vacuum() so far,
although probably 1 based.  Particularly if we do implement my proposal to
turn up verbosity once we're doing more than one pass, that'll allow at least
some insight to how bad things are from the log.



> This is a good thing for performance, of course, but it also makes VACUUM
> VERBOSE show information that makes sense to users, since things actually
> happen in a way that makes a lot more sense. I'm quite happy about the fact
> that the new VACUUM VERBOSE allows users to mostly ignore obscure details
> like whether an index was scanned by amvacuumcleanup() or by ambulkdelete()
> -- stuff that basically nobody understands. That seems worth preserving.

I don't mind making the messages as similar as possible, but I do mind if I as
a postgres hacker, or an expert consultant, can't parse that detail out. We
need to be able to debug things like amvacuumcleanup() doing too much work too
often.


> As I said, I don't mind making VACUUM VERBOSE behave a little bit more
> like a progress indicator, which is how it used to work. Maybe I went
> a little too far in the direction of neatly summarizing the whole
> VACUUM operation in one go. But I doubt that I went too far with it by
> all that much. Overall, the old VACUUM VERBOSE was extremely hard to
> use, and was poorly maintained -- let's not go back to that. (See
> commit ec196930 for evidence of how sloppily it was maintained.)

I don't want to go back to that either, as I said I mostly like the new
output.

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Peter Geoghegan

Date:

10 February 2023, 01:28:46

On Wed, Feb 8, 2023 at 7:18 PM Andres Freund <andres@anarazel.de> wrote:
> > This is a good thing for performance, of course, but it also makes VACUUM
> > VERBOSE show information that makes sense to users, since things actually
> > happen in a way that makes a lot more sense. I'm quite happy about the fact
> > that the new VACUUM VERBOSE allows users to mostly ignore obscure details
> > like whether an index was scanned by amvacuumcleanup() or by ambulkdelete()
> > -- stuff that basically nobody understands. That seems worth preserving.
>
> I don't mind making the messages as similar as possible, but I do mind if I as
> a postgres hacker, or an expert consultant, can't parse that detail out. We
> need to be able to debug things like amvacuumcleanup() doing too much work too
> often.

FWIW you can tell even today. You can observe that the number of index
scans is 0, and that one or more indexes have their size reported --
that indicates that an amvacuumcleanup()-only scan took place, say
because we needed to put some preexisting deleted pages in the FSM.

There is also another detail that strongly hints that VACUUM VERBOSE
had to scan an index during its call to amvacuumcleanup(), which is
atypical: it only shows details for that particular index, which is
really noticeable. It won't report anything about those indexes that
had no-op calls to amvacuumcleanup().

It kind of makes sense that we report 0 index scans when there were 0
calls to ambulkdelete(), even though there may still have been some
index scans during a call to some amvacuumcleanup() routine. The
common case is that they're no-op calls for every index, but even when
they're not there is still probably only one or two indexes that have
to do a noticeable amount of I/O. It makes sense to "round down to 0".

Granted, there are some notable exceptions. For example,
gistvacuumcleanup() doesn't even try to avoid scanning the index. But
that's really a problem in gistvacuumcleanup() -- since it really
doesn't make very much sense, even from a GiST point of view. It can
follow exactly the same approach as B-Tree here, since its approach to
page deletion is already directly based on nbtree.

-- 
Peter Geoghegan

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

26 February 2023, 19:43:40

Hi,

On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> The most frequent case is postgres_fdw, which somewhat regularly fails with a
> regression.diff like this:
> 
> diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out
/tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out    2022-12-08 20:35:24.772888000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out    2022-12-08
20:43:38.199450000+0000
 
> @@ -9911,8 +9911,7 @@
>      WHERE application_name = 'fdw_retry_check';
>   pg_terminate_backend
>  ----------------------
> - t
> -(1 row)
> +(0 rows)
> 
>  -- This query should detect the broken connection when starting new remote
>  -- transaction, reestablish new connection, and then succeed.
> 
> 
> See e.g.
> https://cirrus-ci.com/task/5925540020879360
>
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
> https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log
> 
> 
> The following comment in the test provides a hint what might be happening:
> 
> -- If debug_discard_caches is active, it results in
> -- dropping remote connections after every transaction, making it
> -- impossible to test termination meaningfully.  So turn that off
> -- for this test.
> SET debug_discard_caches = 0;
> 
> 
> I guess that a cache reset message arrives and leads to the connection being
> terminated. Unfortunately that's hard to see right now, as the relevant log
> messages are output with DEBUG3 - it's quite verbose, so enabling it for all
> tests will be painful.

Downthread I reported that I was able to pinpoint that the source of the issue
indeed is a cache inval message arriving in the wrong moment.


We've had trouble with this test for years by now. We added workarounds, like

commit 1273a15bf91fa322915e32d3b6dc6ec916397268
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2021-05-04 13:36:26 -0400

    Disable cache clobber to avoid breaking postgres_fdw termination test.

But that didn't suffice to make it reliable. Not entirely surprising, given
there are cache resource sources other than clobber cache.

Unless somebody comes up with a way to make the test more reliable pretty
soon, I think we should just remove it. It's one of the most frequently
flapping tests at the moment.

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

26 February 2023, 19:51:45

Andres Freund <andres@anarazel.de> writes:
> On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
>> The most frequent case is postgres_fdw, which somewhat regularly fails with a
>> regression.diff like this:
>> WHERE application_name = 'fdw_retry_check';
>> pg_terminate_backend
>> ----------------------
>> - t
>> -(1 row)
>> +(0 rows)

> Unless somebody comes up with a way to make the test more reliable pretty
> soon, I think we should just remove it. It's one of the most frequently
> flapping tests at the moment.

If that's the only diff, we could just hide it, say by writing

do $$ begin
PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
WHERE application_name = 'fdw_retry_check';
end $$;

The actually important thing is the failure check after this;
we don't care that much whether the initially-created connection
is still live at this point.

            regards, tom lane

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

26 February 2023, 20:06:14

Hi,

On 2023-02-26 14:51:45 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> >> The most frequent case is postgres_fdw, which somewhat regularly fails with a
> >> regression.diff like this:
> >> WHERE application_name = 'fdw_retry_check';
> >> pg_terminate_backend
> >> ----------------------
> >> - t
> >> -(1 row)
> >> +(0 rows)
> 
> > Unless somebody comes up with a way to make the test more reliable pretty
> > soon, I think we should just remove it. It's one of the most frequently
> > flapping tests at the moment.
> 
> If that's the only diff, we could just hide it, say by writing
> 
> do $$ begin
> PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
> WHERE application_name = 'fdw_retry_check';
> end $$;
> 
> The actually important thing is the failure check after this;
> we don't care that much whether the initially-created connection
> is still live at this point.

Hm, yea, that should work. It's indeed the entirety of the diff

https://api.cirrus-ci.com/v1/artifact/task/4718859714822144/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs

If we go that way we can remove the debug_discard muckery as well, I think?

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

26 February 2023, 20:36:09

Andres Freund <andres@anarazel.de> writes:
> On 2023-02-26 14:51:45 -0500, Tom Lane wrote:
>> If that's the only diff, we could just hide it, say by writing

> Hm, yea, that should work. It's indeed the entirety of the diff
>
https://api.cirrus-ci.com/v1/artifact/task/4718859714822144/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs

> If we go that way we can remove the debug_discard muckery as well, I think?

Perhaps.  I'll check to see if that stanza passes with debug_discard on.

            regards, tom lane

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

26 February 2023, 20:57:01

Andres Freund <andres@anarazel.de> writes:
> Hm, yea, that should work. It's indeed the entirety of the diff
>
https://api.cirrus-ci.com/v1/artifact/task/4718859714822144/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs

> If we go that way we can remove the debug_discard muckery as well, I think?

Okay, so that seems to work for the "reestablish new connection" test:
as coded here, it passes with or without debug_discard_caches enabled,
and I believe it's testing what it intends to either way.  So that's
good.

However, the other stanza with debug_discard_caches muckery is the
one about "test postgres_fdw.application_name GUC", and in that case
ignoring the number of terminated connections would destroy the
point of the test entirely; because without that, you're proving
nothing about what the remote's application_name actually looks like.

I'm inclined to think we should indeed just nuke that test.  It's
overcomplicated and it expends a lot of test cycles on a pretty
marginal feature.

So I propose the attached.

            regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d5fc61446a..d0ba40aca3 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9926,11 +9926,6 @@ WARNING:  there is no transaction in progress
 -- Change application_name of remote connection to special one
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
--- If debug_discard_caches is active, it results in
--- dropping remote connections after every transaction, making it
--- impossible to test termination meaningfully.  So turn that off
--- for this test.
-SET debug_discard_caches = 0;
 -- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;
  ?column?
@@ -9939,13 +9934,12 @@ SELECT 1 FROM ft1 LIMIT 1;
 (1 row)

 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+-- (If a cache flush happens, the remote connection might have already been
+-- dropped; so code this step in a way that doesn't fail if no connection.)
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
- pg_terminate_backend
-----------------------
- t
-(1 row)
-
+END $$;
 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.
 BEGIN;
@@ -9958,13 +9952,10 @@ SELECT 1 FROM ft1 LIMIT 1;
 -- If we detect the broken connection when starting a new remote
 -- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
- pg_terminate_backend
-----------------------
- t
-(1 row)
-
+END $$;
 SAVEPOINT s;
 -- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
@@ -9972,7 +9963,6 @@ SELECT 1 FROM ft1 LIMIT 1;    -- should fail
 ERROR:  08006
 \set VERBOSITY default
 COMMIT;
-RESET debug_discard_caches;
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
@@ -11627,77 +11617,6 @@ ALTER FOREIGN DATA WRAPPER postgres_fdw OPTIONS (nonexistent 'fdw');
 ERROR:  invalid option "nonexistent"
 HINT:  There are no valid options in this context.
 -- ===================================================================
--- test postgres_fdw.application_name GUC
--- ===================================================================
---- Turn debug_discard_caches off for this test to make sure that
---- the remote connection is alive when checking its application_name.
-SET debug_discard_caches = 0;
--- Specify escape sequences in application_name option of a server
--- object so as to test that they are replaced with status information
--- expectedly.
---
--- Since pg_stat_activity.application_name may be truncated to less than
--- NAMEDATALEN characters, note that substring() needs to be used
--- at the condition of test query to make sure that the string consisting
--- of database name and process ID is also less than that.
-ALTER SERVER loopback2 OPTIONS (application_name 'fdw_%d%p');
-SELECT 1 FROM ft6 LIMIT 1;
- ?column?
-----------
-        1
-(1 row)
-
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
-  WHERE application_name =
-    substring('fdw_' || current_database() || pg_backend_pid() for
-      current_setting('max_identifier_length')::int);
- pg_terminate_backend
-----------------------
- t
-(1 row)
-
--- postgres_fdw.application_name overrides application_name option
--- of a server object if both settings are present.
-SET postgres_fdw.application_name TO 'fdw_%a%u%%';
-SELECT 1 FROM ft6 LIMIT 1;
- ?column?
-----------
-        1
-(1 row)
-
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
-  WHERE application_name =
-    substring('fdw_' || current_setting('application_name') ||
-      CURRENT_USER || '%' for current_setting('max_identifier_length')::int);
- pg_terminate_backend
-----------------------
- t
-(1 row)
-
--- Test %c (session ID) and %C (cluster name) escape sequences.
-SET postgres_fdw.application_name TO 'fdw_%C%c';
-SELECT 1 FROM ft6 LIMIT 1;
- ?column?
-----------
-        1
-(1 row)
-
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
-  WHERE application_name =
-    substring('fdw_' || current_setting('cluster_name') ||
-      to_hex(trunc(EXTRACT(EPOCH FROM (SELECT backend_start FROM
-      pg_stat_get_activity(pg_backend_pid()))))::integer) || '.' ||
-      to_hex(pg_backend_pid())
-      for current_setting('max_identifier_length')::int);
- pg_terminate_backend
-----------------------
- t
-(1 row)
-
---Clean up
-RESET postgres_fdw.application_name;
-RESET debug_discard_caches;
--- ===================================================================
 -- test parallel commit
 -- ===================================================================
 ALTER SERVER loopback OPTIONS (ADD parallel_commit 'true');
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 1e50be137b..50e0867341 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3100,18 +3100,16 @@ ROLLBACK;
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');

--- If debug_discard_caches is active, it results in
--- dropping remote connections after every transaction, making it
--- impossible to test termination meaningfully.  So turn that off
--- for this test.
-SET debug_discard_caches = 0;
-
 -- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;

 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+-- (If a cache flush happens, the remote connection might have already been
+-- dropped; so code this step in a way that doesn't fail if no connection.)
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
+END $$;

 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.
@@ -3121,8 +3119,10 @@ SELECT 1 FROM ft1 LIMIT 1;
 -- If we detect the broken connection when starting a new remote
 -- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
+END $$;
 SAVEPOINT s;
 -- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
@@ -3130,8 +3130,6 @@ SELECT 1 FROM ft1 LIMIT 1;    -- should fail
 \set VERBOSITY default
 COMMIT;

-RESET debug_discard_caches;
-
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
@@ -3847,52 +3845,6 @@ CREATE FOREIGN TABLE inv_bsz (c1 int )
 -- No option is allowed to be specified at foreign data wrapper level
 ALTER FOREIGN DATA WRAPPER postgres_fdw OPTIONS (nonexistent 'fdw');

--- ===================================================================
--- test postgres_fdw.application_name GUC
--- ===================================================================
---- Turn debug_discard_caches off for this test to make sure that
---- the remote connection is alive when checking its application_name.
-SET debug_discard_caches = 0;
-
--- Specify escape sequences in application_name option of a server
--- object so as to test that they are replaced with status information
--- expectedly.
---
--- Since pg_stat_activity.application_name may be truncated to less than
--- NAMEDATALEN characters, note that substring() needs to be used
--- at the condition of test query to make sure that the string consisting
--- of database name and process ID is also less than that.
-ALTER SERVER loopback2 OPTIONS (application_name 'fdw_%d%p');
-SELECT 1 FROM ft6 LIMIT 1;
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
-  WHERE application_name =
-    substring('fdw_' || current_database() || pg_backend_pid() for
-      current_setting('max_identifier_length')::int);
-
--- postgres_fdw.application_name overrides application_name option
--- of a server object if both settings are present.
-SET postgres_fdw.application_name TO 'fdw_%a%u%%';
-SELECT 1 FROM ft6 LIMIT 1;
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
-  WHERE application_name =
-    substring('fdw_' || current_setting('application_name') ||
-      CURRENT_USER || '%' for current_setting('max_identifier_length')::int);
-
--- Test %c (session ID) and %C (cluster name) escape sequences.
-SET postgres_fdw.application_name TO 'fdw_%C%c';
-SELECT 1 FROM ft6 LIMIT 1;
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
-  WHERE application_name =
-    substring('fdw_' || current_setting('cluster_name') ||
-      to_hex(trunc(EXTRACT(EPOCH FROM (SELECT backend_start FROM
-      pg_stat_get_activity(pg_backend_pid()))))::integer) || '.' ||
-      to_hex(pg_backend_pid())
-      for current_setting('max_identifier_length')::int);
-
---Clean up
-RESET postgres_fdw.application_name;
-RESET debug_discard_caches;
-
 -- ===================================================================
 -- test parallel commit
 -- ===================================================================

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

26 February 2023, 21:03:13

I wrote:
> I'm inclined to think we should indeed just nuke that test.  It's
> overcomplicated and it expends a lot of test cycles on a pretty
> marginal feature.

Perhaps a better idea: at the start of the test, set
postgres_fdw.application_name to something that exercises all the
available escape sequences, but don't try to verify what the
result looks like.  That at least gives us code coverage for the
escape sequence processing code, even if it doesn't prove that
the output is desirable.

            regards, tom lane

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

26 February 2023, 21:09:38

Hi,

On 2023-02-26 15:57:01 -0500, Tom Lane wrote:
> However, the other stanza with debug_discard_caches muckery is the
> one about "test postgres_fdw.application_name GUC", and in that case
> ignoring the number of terminated connections would destroy the
> point of the test entirely; because without that, you're proving
> nothing about what the remote's application_name actually looks like.
> 
> I'm inclined to think we should indeed just nuke that test.  It's
> overcomplicated and it expends a lot of test cycles on a pretty
> marginal feature.

It does seem fairly complicated...

*If* we wanted to rescue it, we probably could just use a transaction around
the SELECT and the termination, which ought to prevent sinval issues.

Not that I understand why that tries to terminate connections, instead of just
looking at application name.

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

26 February 2023, 23:59:32

Andres Freund <andres@anarazel.de> writes:
> Not that I understand why that tries to terminate connections, instead of just
> looking at application name.

The test is trying to verify the application name reported by the
"remote" session, which isn't constant, so we can't just do "select
application_name from pg_stat_activity".  I agree that terminating the
connection seems like kind of a strange thing to do --- maybe it's to
ensure that we get a new session with the updated application name
for the next test case?  If not, maybe we could do "select 1 from
pg_stat_activity where application_name = computed-pattern", but that
has the same problem that a cache flush might have terminated the
remote session.

            regards, tom lane

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

27 February 2023, 15:01:25

I wrote:
> ... maybe we could do "select 1 from
> pg_stat_activity where application_name = computed-pattern", but that
> has the same problem that a cache flush might have terminated the
> remote session.

Hah - I thought of a solution.  We can avoid this race condition if
we make the remote session itself inspect pg_stat_activity and
return its displayed application_name.  Just need a foreign table
that maps onto pg_stat_activity.  Of course, this'd add yet another
layer of baroque-ness to a test section that I already don't think
is worth the trouble.  Should we go that way, or just rip it out?

            regards, tom lane

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

27 February 2023, 17:42:00

I wrote:
> Hah - I thought of a solution.  We can avoid this race condition if
> we make the remote session itself inspect pg_stat_activity and
> return its displayed application_name.  Just need a foreign table
> that maps onto pg_stat_activity.

I went ahead and coded it that way, and it doesn't look too awful.
Any objections?

            regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d5fc61446a..e3275caa2d 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9926,11 +9926,6 @@ WARNING:  there is no transaction in progress
 -- Change application_name of remote connection to special one
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
--- If debug_discard_caches is active, it results in
--- dropping remote connections after every transaction, making it
--- impossible to test termination meaningfully.  So turn that off
--- for this test.
-SET debug_discard_caches = 0;
 -- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;
  ?column?
@@ -9939,13 +9934,12 @@ SELECT 1 FROM ft1 LIMIT 1;
 (1 row)

 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+-- (If a cache flush happens, the remote connection might have already been
+-- dropped; so code this step in a way that doesn't fail if no connection.)
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
- pg_terminate_backend
-----------------------
- t
-(1 row)
-
+END $$;
 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.
 BEGIN;
@@ -9958,13 +9952,10 @@ SELECT 1 FROM ft1 LIMIT 1;
 -- If we detect the broken connection when starting a new remote
 -- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
- pg_terminate_backend
-----------------------
- t
-(1 row)
-
+END $$;
 SAVEPOINT s;
 -- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
@@ -9972,7 +9963,6 @@ SELECT 1 FROM ft1 LIMIT 1;    -- should fail
 ERROR:  08006
 \set VERBOSITY default
 COMMIT;
-RESET debug_discard_caches;
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
@@ -11629,74 +11619,66 @@ HINT:  There are no valid options in this context.
 -- ===================================================================
 -- test postgres_fdw.application_name GUC
 -- ===================================================================
---- Turn debug_discard_caches off for this test to make sure that
---- the remote connection is alive when checking its application_name.
-SET debug_discard_caches = 0;
+-- To avoid race conditions in checking the remote session's application_name,
+-- use this view to make the remote session itself read its application_name.
+CREATE VIEW my_application_name AS
+  SELECT application_name FROM pg_stat_activity WHERE pid = pg_backend_pid();
+CREATE FOREIGN TABLE remote_application_name (application_name text)
+  SERVER loopback2
+  OPTIONS (schema_name 'public', table_name 'my_application_name');
+SELECT * FROM remote_application_name;
+ application_name
+------------------
+ postgres_fdw
+(1 row)
+
 -- Specify escape sequences in application_name option of a server
 -- object so as to test that they are replaced with status information
--- expectedly.
+-- expectedly.  Note that we are also relying on ALTER SERVER to force
+-- the remote session to be restarted with its new application name.
 --
 -- Since pg_stat_activity.application_name may be truncated to less than
 -- NAMEDATALEN characters, note that substring() needs to be used
 -- at the condition of test query to make sure that the string consisting
 -- of database name and process ID is also less than that.
 ALTER SERVER loopback2 OPTIONS (application_name 'fdw_%d%p');
-SELECT 1 FROM ft6 LIMIT 1;
- ?column?
-----------
-        1
-(1 row)
-
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+SELECT count(*) FROM remote_application_name
   WHERE application_name =
     substring('fdw_' || current_database() || pg_backend_pid() for
       current_setting('max_identifier_length')::int);
- pg_terminate_backend
-----------------------
- t
+ count
+-------
+     1
 (1 row)

 -- postgres_fdw.application_name overrides application_name option
 -- of a server object if both settings are present.
+ALTER SERVER loopback2 OPTIONS (SET application_name 'fdw_wrong');
 SET postgres_fdw.application_name TO 'fdw_%a%u%%';
-SELECT 1 FROM ft6 LIMIT 1;
- ?column?
-----------
-        1
-(1 row)
-
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+SELECT count(*) FROM remote_application_name
   WHERE application_name =
     substring('fdw_' || current_setting('application_name') ||
       CURRENT_USER || '%' for current_setting('max_identifier_length')::int);
- pg_terminate_backend
-----------------------
- t
+ count
+-------
+     1
 (1 row)

+RESET postgres_fdw.application_name;
 -- Test %c (session ID) and %C (cluster name) escape sequences.
-SET postgres_fdw.application_name TO 'fdw_%C%c';
-SELECT 1 FROM ft6 LIMIT 1;
- ?column?
-----------
-        1
-(1 row)
-
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+ALTER SERVER loopback2 OPTIONS (SET application_name 'fdw_%C%c');
+SELECT count(*) FROM remote_application_name
   WHERE application_name =
     substring('fdw_' || current_setting('cluster_name') ||
       to_hex(trunc(EXTRACT(EPOCH FROM (SELECT backend_start FROM
       pg_stat_get_activity(pg_backend_pid()))))::integer) || '.' ||
       to_hex(pg_backend_pid())
       for current_setting('max_identifier_length')::int);
- pg_terminate_backend
-----------------------
- t
+ count
+-------
+     1
 (1 row)

---Clean up
-RESET postgres_fdw.application_name;
-RESET debug_discard_caches;
 -- ===================================================================
 -- test parallel commit
 -- ===================================================================
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 1e50be137b..6248825d51 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3100,18 +3100,16 @@ ROLLBACK;
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');

--- If debug_discard_caches is active, it results in
--- dropping remote connections after every transaction, making it
--- impossible to test termination meaningfully.  So turn that off
--- for this test.
-SET debug_discard_caches = 0;
-
 -- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;

 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+-- (If a cache flush happens, the remote connection might have already been
+-- dropped; so code this step in a way that doesn't fail if no connection.)
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
+END $$;

 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.
@@ -3121,8 +3119,10 @@ SELECT 1 FROM ft1 LIMIT 1;
 -- If we detect the broken connection when starting a new remote
 -- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+DO $$ BEGIN
+PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
     WHERE application_name = 'fdw_retry_check';
+END $$;
 SAVEPOINT s;
 -- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
@@ -3130,8 +3130,6 @@ SELECT 1 FROM ft1 LIMIT 1;    -- should fail
 \set VERBOSITY default
 COMMIT;

-RESET debug_discard_caches;
-
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
@@ -3850,38 +3848,45 @@ ALTER FOREIGN DATA WRAPPER postgres_fdw OPTIONS (nonexistent 'fdw');
 -- ===================================================================
 -- test postgres_fdw.application_name GUC
 -- ===================================================================
---- Turn debug_discard_caches off for this test to make sure that
---- the remote connection is alive when checking its application_name.
-SET debug_discard_caches = 0;
+-- To avoid race conditions in checking the remote session's application_name,
+-- use this view to make the remote session itself read its application_name.
+CREATE VIEW my_application_name AS
+  SELECT application_name FROM pg_stat_activity WHERE pid = pg_backend_pid();
+
+CREATE FOREIGN TABLE remote_application_name (application_name text)
+  SERVER loopback2
+  OPTIONS (schema_name 'public', table_name 'my_application_name');
+
+SELECT * FROM remote_application_name;

 -- Specify escape sequences in application_name option of a server
 -- object so as to test that they are replaced with status information
--- expectedly.
+-- expectedly.  Note that we are also relying on ALTER SERVER to force
+-- the remote session to be restarted with its new application name.
 --
 -- Since pg_stat_activity.application_name may be truncated to less than
 -- NAMEDATALEN characters, note that substring() needs to be used
 -- at the condition of test query to make sure that the string consisting
 -- of database name and process ID is also less than that.
 ALTER SERVER loopback2 OPTIONS (application_name 'fdw_%d%p');
-SELECT 1 FROM ft6 LIMIT 1;
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+SELECT count(*) FROM remote_application_name
   WHERE application_name =
     substring('fdw_' || current_database() || pg_backend_pid() for
       current_setting('max_identifier_length')::int);

 -- postgres_fdw.application_name overrides application_name option
 -- of a server object if both settings are present.
+ALTER SERVER loopback2 OPTIONS (SET application_name 'fdw_wrong');
 SET postgres_fdw.application_name TO 'fdw_%a%u%%';
-SELECT 1 FROM ft6 LIMIT 1;
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+SELECT count(*) FROM remote_application_name
   WHERE application_name =
     substring('fdw_' || current_setting('application_name') ||
       CURRENT_USER || '%' for current_setting('max_identifier_length')::int);
+RESET postgres_fdw.application_name;

 -- Test %c (session ID) and %C (cluster name) escape sequences.
-SET postgres_fdw.application_name TO 'fdw_%C%c';
-SELECT 1 FROM ft6 LIMIT 1;
-SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
+ALTER SERVER loopback2 OPTIONS (SET application_name 'fdw_%C%c');
+SELECT count(*) FROM remote_application_name
   WHERE application_name =
     substring('fdw_' || current_setting('cluster_name') ||
       to_hex(trunc(EXTRACT(EPOCH FROM (SELECT backend_start FROM
@@ -3889,10 +3894,6 @@ SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
       to_hex(pg_backend_pid())
       for current_setting('max_identifier_length')::int);

---Clean up
-RESET postgres_fdw.application_name;
-RESET debug_discard_caches;
-
 -- ===================================================================
 -- test parallel commit
 -- ===================================================================

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Andres Freund

Date:

27 February 2023, 19:50:27

Hi,

On 2023-02-27 12:42:00 -0500, Tom Lane wrote:
> I wrote:
> > Hah - I thought of a solution.  We can avoid this race condition if
> > we make the remote session itself inspect pg_stat_activity and
> > return its displayed application_name.  Just need a foreign table
> > that maps onto pg_stat_activity.

Sounds reasonable. I guess you could also do it with a function that is
allowed to be pushed down. But given that you already solved it this way...

I think it's worth having an example for checks like this in the postgres_fdw
tests, even if it's perhaps not worth it for the application_name GUC on its
own. We saw that the GUC test copied the debug_discard_caches use of another
test...

> I went ahead and coded it that way, and it doesn't look too awful.
> Any objections?

Looks good to me.

I think it'd be an indication of a bug around the invalidation handling if the
terminations were required. So even leaving other things aside, I prefer this
version.

Greetings,

Andres Freund

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From

Tom Lane

Date:

27 February 2023, 20:46:32

Andres Freund <andres@anarazel.de> writes:
> On 2023-02-27 12:42:00 -0500, Tom Lane wrote:
>> I went ahead and coded it that way, and it doesn't look too awful.
>> Any objections?

> Looks good to me.

> I think it'd be an indication of a bug around the invalidation handling if the
> terminations were required. So even leaving other things aside, I prefer this
> version.

Sounds good.  I'll work on getting this back-patched.

            regards, tom lane