tests against running server occasionally fail, postgres_fdw & tenk1 - Mailing list pgsql-hackers

From:      Andres Freund
Subject:   tests against running server occasionally fail, postgres_fdw & tenk1
Date:
Msg-id:    20221209001511.n3vqodxobqgscuil@awork3.anarazel.de
Responses: Re: tests against running server occasionally fail, postgres_fdw & tenk1
           Re: tests against running server occasionally fail, postgres_fdw & tenk1
List:      pgsql-hackers
Hi,

Since

commit 3f0e786ccbf
Author: Andres Freund <andres@anarazel.de>
Date:   2022-12-07 12:13:35 -0800

    meson: Add 'running' test setup, as a replacement for installcheck

CI runs the pg_regress/isolationtester tests that support doing so against a
running server. Unfortunately, cfbot shows that that doesn't work entirely
reliably.

The most frequent case is postgres_fdw, which somewhat regularly fails with a
regression.diffs like this:

diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out  2022-12-08 20:35:24.772888000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out  2022-12-08 20:43:38.199450000 +0000
@@ -9911,8 +9911,7 @@
 WHERE application_name = 'fdw_retry_check';
  pg_terminate_backend
 ----------------------
- t
-(1 row)
+(0 rows)

 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.

See e.g.
https://cirrus-ci.com/task/5925540020879360
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log

The following comment in the test provides a hint at what might be happening:

-- If debug_discard_caches is active, it results in
-- dropping remote connections after every transaction, making it
-- impossible to test termination meaningfully. So turn that off
-- for this test.
SET debug_discard_caches = 0;

I guess that a cache reset message arrives and leads to the connection being
terminated. Unfortunately that's hard to see right now, as the relevant log
messages are output at DEBUG3 - it's quite verbose, so enabling it for all
tests would be painful.

I *think* I have seen this failure locally at least once, when running the
test normally.

I'll try to reproduce this locally for a bit. If I can't, the only other idea
I have for debugging this is to change log_min_messages in that section of the
postgres_fdw test to DEBUG3.
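To make that debugging idea concrete, here is a minimal sketch of what the
section could look like - assuming the failing statement is the
pg_terminate_backend() call against the fdw_retry_check backend shown in the
diff; the exact statements in postgres_fdw.sql may differ (e.g. the test may
pass a timeout argument):

-- hedged sketch: raise logging only around the flaky termination check, so
-- the DEBUG3 connection-discard messages land in runningcheck.log without
-- making the rest of the suite verbose (log_min_messages requires superuser)
SET log_min_messages = DEBUG3;
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';
RESET log_min_messages;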
The second failure case I observed, at a lower frequency, is in the main
regression tests:

https://cirrus-ci.com/task/5640584912699392
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/regress-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/runningcheck.log

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out  2022-12-08 16:49:28.239508000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out  2022-12-08 16:57:17.286650000 +0000
@@ -1910,11 +1910,15 @@
 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)
 ORDER BY unique1;
-                       QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
-   Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: unique1
+   ->  Bitmap Heap Scan on tenk1
+         Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+         ->  Bitmap Index Scan on tenk1_unique1
+               Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)

 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)

I think we've seen this one a number of times before, even in the temp-install
case. We fixed one source of this issue in this thread:
https://www.postgresql.org/message-id/1346227.1649887693%40sss.pgh.pa.us
but it looks like there's some remaining instability.

According to the server log (link above), there's no autovacuum on tenk1.
Unfortunately we don't log non-automatic vacuums unless they are run with
VERBOSE, so we can't see what horizon was used (see heap_vacuum_rel()'s
computation of 'instrument').

I don't have a better idea than to either change the above, or to revert
91998539b227dfc6dd091714da7d106f2c95a321.

Greetings,

Andres Freund
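PS: For what it's worth, the plan flip above is the shape you'd expect when
tenk1's visibility map isn't fully set: an index-only scan only wins when most
pages are all-visible, which is exactly what an ill-timed vacuum horizon can
break. A hedged sketch for checking that by hand (not proposing this for the
test itself):

-- hedged sketch: check visibility-map coverage, vacuum with VERBOSE so the
-- cutoff/horizon actually gets logged, then re-run the query from the diff;
-- relallvisible close to relpages is what makes the index-only scan win
SELECT relpages, relallvisible FROM pg_class WHERE relname = 'tenk1';
VACUUM (VERBOSE) tenk1;
EXPLAIN (COSTS OFF)
SELECT unique1 FROM tenk1 WHERE unique1 IN (1,42,7) ORDER BY unique1;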