tests against running server occasionally fail, postgres_fdw & tenk1 - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | tests against running server occasionally fail, postgres_fdw & tenk1 |
| Date | |
| Msg-id | 20221209001511.n3vqodxobqgscuil@awork3.anarazel.de |
| Responses | 2 × Re: tests against running server occasionally fail, postgres_fdw & tenk1 |
| List | pgsql-hackers |
Hi,
Since
commit 3f0e786ccbf
Author: Andres Freund <andres@anarazel.de>
Date: 2022-12-07 12:13:35 -0800
meson: Add 'running' test setup, as a replacement for installcheck
CI runs the pg_regress/isolationtester tests that support doing so against a
running server.
Unfortunately cfbot shows that that doesn't work entirely reliably.
The most frequent case is postgres_fdw, which somewhat regularly fails with a
regression.diff like this:
diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out 2022-12-08 20:35:24.772888000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out 2022-12-08 20:43:38.199450000 +0000
@@ -9911,8 +9911,7 @@
WHERE application_name = 'fdw_retry_check';
pg_terminate_backend
----------------------
- t
-(1 row)
+(0 rows)
-- This query should detect the broken connection when starting new remote
-- transaction, reestablish new connection, and then succeed.
See e.g.
https://cirrus-ci.com/task/5925540020879360
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log
The following comment in the test provides a hint at what might be happening:
-- If debug_discard_caches is active, it results in
-- dropping remote connections after every transaction, making it
-- impossible to test termination meaningfully. So turn that off
-- for this test.
SET debug_discard_caches = 0;
I guess that a cache reset message arrives and leads to the connection being
terminated. Unfortunately that's hard to see right now, as the relevant log
messages are only emitted at DEBUG3; that level is quite verbose, so enabling
it for all tests would be painful.
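For context, the relevant section of the test boils down to roughly the
following (paraphrased, not the verbatim test; the foreign table name and the
exact pg_terminate_backend() arguments are illustrative):

-- make sure a remote backend exists for the cached 'fdw_retry_check' connection
SELECT 1 FROM ft1 LIMIT 1;
-- then kill it from the local side
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = 'fdw_retry_check';
-- expected: one row containing 't'; observed: (0 rows)

If a cache reset makes postgres_fdw drop its cached connection between those
two steps, the remote backend goes away with it, pg_stat_activity no longer
shows an 'fdw_retry_check' entry, and the query returns zero rows, which is
exactly the diff above.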
I *think* I have seen this failure locally at least once, when running the
test normally.
I'll try to reproduce this locally for a bit. If I can't, the only other idea
I have for debugging this is to change log_min_messages in that section of the
postgres_fdw test to DEBUG3.
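That change could look roughly like this (a sketch of the idea only, not a
finished patch; where exactly to put it in the test is still open):

-- temporarily raise the log level around the connection-retry section only,
-- so the rest of the run stays quiet (log_min_messages requires superuser)
SET log_min_messages = DEBUG3;
-- ... the fdw_retry_check queries ...
RESET log_min_messages;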
The second failure case I observed, at a lower frequency, is in the main
regression tests:
https://cirrus-ci.com/task/5640584912699392
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/regress-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/runningcheck.log
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out 2022-12-08 16:49:28.239508000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out 2022-12-08 16:57:17.286650000 +0000
@@ -1910,11 +1910,15 @@
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
ORDER BY unique1;
- QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
- Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+ QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+ Sort Key: unique1
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+ -> Bitmap Index Scan on tenk1_unique1
+ Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
Which I think we've seen a number of times before, even in the temp-install
case. We fixed one source of this issue in this thread:
https://www.postgresql.org/message-id/1346227.1649887693%40sss.pgh.pa.us
but it looks like there's some remaining instability.
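If the usual suspect is at play (the planner only picks the index-only scan
here when it believes most of tenk1 is all-visible), a probe along these
lines could confirm it; this is a diagnostic sketch, not part of any test:

-- how much of tenk1 the planner believes is all-visible; a low relallvisible
-- relative to relpages makes the index-only scan lose out to the bitmap scan
-- seen in the failing plan
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'tenk1';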
According to the server log (link above), there's no autovacuum on
tenk1.
Unfortunately we don't log non-automatic vacuums unless they are run with
verbose, so we can't see what horizon was used (see heap_vacuum_rel()'s
computation of 'instrument').
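For comparison, a verbose manual vacuum does report its horizon (the
"removable cutoff" line in recent releases), e.g.:

VACUUM (VERBOSE) tenk1;

so the information is obtainable that way, it just isn't emitted for the
non-VERBOSE vacuums the tests run.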
I don't have a better idea than to either change that (i.e. also instrument
non-automatic vacuums run without VERBOSE), or to revert
91998539b227dfc6dd091714da7d106f2c95a321.
Greetings,
Andres Freund