Re: pgsql: Test replay of regression tests, attempt II. - Mailing list pgsql-committers
| From | Thomas Munro |
|---|---|
| Subject | Re: pgsql: Test replay of regression tests, attempt II. |
| Date | |
| Msg-id | CA+hUKG+nHX+NNjm-ig0zWLxeMiivH8omey5Onfhnxzh6g524Cg@mail.gmail.com |
| In response to | Re: pgsql: Test replay of regression tests, attempt II. (Andres Freund <andres@anarazel.de>) |
| Responses | Re: pgsql: Test replay of regression tests, attempt II. |
| List | pgsql-committers |
On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > That's an extremely small shared_buffers for running the regression tests; it wouldn't
> > > be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
> > > ends up skipping over a page because of page contention?
> >
> > Hmm, good thought. I tried running the test with even smaller
> > shared_buffers, but could not make the reloptions test fall over for
> > me. But this theory implies a strong timing dependency, so it might
> > still only happen on particular machines. (If anyone else tries it:
> > below about 400kB, other tests start failing with "no free unpinned
> > buffers" and the like.)
>
> I have now run the test in a loop 200+ times without reproducing the
> problem. Rorqual runs on a shared machine though, so it's quite possible that
> I/O is slower there and thus triggers the issue.
>
> I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
> that would show information about the number of pages with tuples etc. But I don't
> currently see a way of doing that without causing the regression tests to fail.
>
> Even if I set client_min_messages=error, the messages still get sent to the
> client, because elevel == INFO is special cased in
> should_output_to_client(). And I don't see a way of redirecting the output of
> common.c:NoticeProcessor() in psql either.
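(Side note, not from this thread: libpq itself does let a client choose where notice/INFO text goes via PQsetNoticeProcessor(); psql simply installs its own handler in common.c and exposes no switch to change it. A minimal, hypothetical standalone client - connection settings come from the usual PG* environment variables and the target table name is a placeholder - might look like this:)

/*
 * Hypothetical standalone libpq client (not psql): install a notice
 * processor that swallows INFO/NOTICE text instead of printing it to
 * stderr.  "some_table" below is a placeholder.
 */
#include <stdio.h>

#include <libpq-fe.h>

static void
swallow_notice(void *arg, const char *message)
{
	/* Drop INFO/NOTICE output; a test driver could log it elsewhere. */
	(void) arg;
	(void) message;
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	PQsetNoticeProcessor(conn, swallow_notice, NULL);

	/* The INFO lines emitted by VERBOSE no longer reach the terminal. */
	res = PQexec(conn, "VACUUM (VERBOSE) some_table");
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "VACUUM failed: %s", PQerrorMessage(conn));

	PQclear(res);
	PQfinish(conn);
	return 0;
}

Of course, that only helps a bespoke client; pg_regress drives psql, so the INFO lines would still end up in the expected-output diffs.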
I hacked a branch thusly:
@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	verbose = (params->options & VACOPT_VERBOSE) != 0;
 	instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
 							  params->log_min_duration >= 0));
+	instrument = true;
 	if (instrument)
 	{
 		pg_rusage_init(&ru0);
Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]
there, with the smoking gun:
pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen
There are three places that skip and bump that counter, but two of
them were disabled when I added DISABLE_PAGE_SKIPPING, leaving this
one:
LockBuffer(buf, BUFFER_LOCK_SHARE);
if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
{
	UnlockReleaseBuffer(buf);
	vacrel->scanned_pages++;
	vacrel->pinskipped_pages++;
	if (hastup)
		vacrel->nonempty_pages = blkno + 1;
	continue;
}
Since this page has no tuples that need freezing for wraparound, this block
skips it when the attempt to conditionally acquire the cleanup lock fails.
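For what it's worth, that path can be provoked deliberately by holding a pin on the table's only heap page from a second session, e.g. with an open cursor, and then running an aggressive VACUUM. A hypothetical two-session libpq sketch follows (table, cursor and connection details are made up, and it illustrates the mechanism only, against the vacuum code as it stood at the time - the CI failure presumably came from some transient concurrent pin rather than a cursor):

#include <stdio.h>
#include <stdlib.h>

#include <libpq-fe.h>

/* Run one statement and bail out on error. */
static void
run(PGconn *conn, const char *sql)
{
	PGresult   *res = PQexec(conn, sql);
	ExecStatusType st = PQresultStatus(res);

	if (st != PGRES_COMMAND_OK && st != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
		exit(1);
	}
	PQclear(res);
}

int
main(void)
{
	PGconn	   *pinner = PQconnectdb("");	/* session holding the pin */
	PGconn	   *vacuumer = PQconnectdb("");	/* session running VACUUM */

	if (PQstatus(pinner) != CONNECTION_OK || PQstatus(vacuumer) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed\n");
		return 1;
	}

	run(vacuumer, "DROP TABLE IF EXISTS pin_demo");
	run(vacuumer, "CREATE TABLE pin_demo AS SELECT g AS i FROM generate_series(1, 10) g");

	/* Session 1: a cursor paused mid-scan keeps the current heap page pinned. */
	run(pinner, "BEGIN");
	run(pinner, "DECLARE c CURSOR FOR SELECT * FROM pin_demo");
	run(pinner, "FETCH 1 FROM c");

	/*
	 * Session 2: the aggressive scan can't get the cleanup lock, finds no
	 * tuples that must be frozen, and skips the page; the VERBOSE output
	 * (sent through the default notice processor to stderr) should report
	 * "1 skipped due to pins".
	 */
	run(vacuumer, "VACUUM (DISABLE_PAGE_SKIPPING, VERBOSE) pin_demo");

	run(pinner, "COMMIT");
	PQfinish(pinner);
	PQfinish(vacuumer);
	return 0;
}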
[1]
https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log