Re: pgsql: Test replay of regression tests, attempt II. - Mailing list pgsql-committers
From | Thomas Munro |
---|---|
Subject | Re: pgsql: Test replay of regression tests, attempt II. |
Date | |
Msg-id | CA+hUKG+nHX+NNjm-ig0zWLxeMiivH8omey5Onfhnxzh6g524Cg@mail.gmail.com |
In response to | Re: pgsql: Test replay of regression tests, attempt II. (Andres Freund <andres@anarazel.de>) |
Responses |
Re: pgsql: Test replay of regression tests, attempt II.
|
List | pgsql-committers |
On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > That's an extremely small shared_buffers for running the regression tests, it'd not
> > > be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
> > > ends up skipping over a page because of page contention?
> >
> > Hmm, good thought. I tried running the test with even smaller
> > shared_buffers, but could not make the reloptions test fall over for
> > me. But this theory implies a strong timing dependency, so it might
> > still only happen on particular machines. (If anyone else tries it:
> > below about 400kB, other tests start failing with "no free unpinned
> > buffers" and the like.)
>
> I ran the test in a loop for 200+ times now, without reproducing the
> problem. Rorqual runs on a shared machine though, so it's quite possible that
> IO will be slower, and thus triggering the issue.
>
> I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
> that'd show information about the number of pages with tuples etc. But I don't
> currently see a way of that causing the regression tests to fail.
>
> Even if I set client_min_messages=error, the messages still get sent to the
> client, because elevel == INFO is special cased in
> should_output_to_client(). And I don't see a way of redirecting the output of
> common.c:NoticeProcessor() in psql either.

I hacked a branch thusly:

@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        verbose = (params->options & VACOPT_VERBOSE) != 0;
        instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
                                  params->log_min_duration >= 0));
+       instrument = true;
        if (instrument)
        {
                pg_rusage_init(&ru0);

Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]
there, with the smoking gun:

pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen

There are three places that skip and bump that counter, but two of
them were disabled when I added DISABLE_PAGE_SKIPPING, leaving this
one:

            LockBuffer(buf, BUFFER_LOCK_SHARE);
            if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
            {
                UnlockReleaseBuffer(buf);
                vacrel->scanned_pages++;
                vacrel->pinskipped_pages++;
                if (hastup)
                    vacrel->nonempty_pages = blkno + 1;
                continue;
            }

Since this page doesn't require wraparound vacuuming, if we fail to
conditionally acquire the cleanup lock, this block skips the page.

[1] https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log
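
As a rough illustration of the skip path quoted above, here is a small
stand-alone model in C. It is only a sketch: ModelPage, ModelCounters,
model_scan() and model_single_pinned_page() are hypothetical names made up
for this example, not PostgreSQL code. The real buffer locking is reduced to
a boolean, and the model does not simulate waiting for a cleanup lock. It
only shows why a single pinned page that does not need freezing ends up
counted as "skipped due to pins".

```c
#include <stdbool.h>

/* Hypothetical stand-in for a heap page; not a PostgreSQL structure. */
typedef struct
{
    bool pinned_by_other;   /* another backend holds a pin on the buffer */
    bool needs_freeze;      /* page has tuples needing wraparound freezing */
    bool has_tuples;        /* page contains at least one tuple */
} ModelPage;

/* Hypothetical counters, loosely named after the fields in the snippet. */
typedef struct
{
    int scanned_pages;
    int pinskipped_pages;
    int nonempty_pages;
    int processed_pages;    /* pages the model treats as actually vacuumed */
} ModelCounters;

/*
 * Walk the pages.  A page is skipped when the (modelled) conditional
 * cleanup lock fails and the page does not need freezing.
 */
static void
model_scan(const ModelPage *pages, int npages, ModelCounters *c)
{
    for (int blkno = 0; blkno < npages; blkno++)
    {
        const ModelPage *p = &pages[blkno];

        /* Assumption: the conditional lock fails iff someone else has a pin. */
        bool got_cleanup_lock = !p->pinned_by_other;

        if (!got_cleanup_lock && !p->needs_freeze)
        {
            /* Same bookkeeping as the quoted block: count it, move on. */
            c->scanned_pages++;
            c->pinskipped_pages++;
            if (p->has_tuples)
                c->nonempty_pages = blkno + 1;
            continue;
        }

        /* Simplification: every other page is treated as processed. */
        c->scanned_pages++;
        c->processed_pages++;
        if (p->has_tuples)
            c->nonempty_pages = blkno + 1;
    }
}

/* One-page table whose only page is pinned and needs no freezing. */
static ModelCounters
model_single_pinned_page(void)
{
    ModelPage     pages[1] = {{true, false, true}};
    ModelCounters c = {0, 0, 0, 0};

    model_scan(pages, 1, &c);
    return c;
}
```

Calling model_single_pinned_page() gives one scanned page, one page skipped
due to pins and no processed pages, which has the same shape as the
"1 remain, 1 skipped due to pins" line in the failure log.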