Thread: Testing autovacuum wraparound (including failsafe)
Hi,

I started to write a test for $Subject, which I think we sorely need.

Currently my approach is to:
- start a cluster, create a few tables with test data
- acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
  autovacuum from doing anything
- cause dead tuples to exist
- restart
- run pg_resetwal -x 2000027648
- do things like acquiring pins on pages that block vacuum from progressing
- commit prepared transaction
- wait for template0, template1 datfrozenxid to increase
- wait for relfrozenxid for most relations in postgres to increase
- release buffer pin
- wait for postgres datfrozenxid to increase

So far so good. But I've encountered a few things that stand in the way of enabling such a test by default:

1) During startup StartupSUBTRANS() zeroes out all pages between oldestActiveXID and nextXid. That takes 8s on my workstation, but only because I have plenty of memory - pg_subtrans ends up 14GB as I currently do the test. Clearly not something we could do on the BF.

2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the failsafe mode; we can't really create 4GB relations on the BF. While writing the tests I've lowered this to 4MB...

3) pg_resetwal -x requires carefully choosing an xid: it needs to be the first xid on a clog page. It's not hard to determine which xids those are, but it depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a value appropriate for 8KB, but ...

I have 2 1/2 ideas about addressing 1):

- We could expose functionality to advance nextXid to a future value at runtime, without filling in clog/subtrans pages. It would probably have to live in varsup.c and be exposed via regress.so or such?

- The only reason StartupSUBTRANS() does that work is because of the prepared transaction holding back oldestActiveXID. That transaction in turn exists to prevent autovacuum from doing anything before we do the test setup steps. Perhaps it'd be sufficient to set autovacuum_naptime really high initially, perform the test setup, set naptime to something lower, and reload the config. But I'm worried that might not be reliable: if something ends up allocating an xid we'd potentially reach the path in GetNewTransactionId() that wakes up the launcher? But probably there wouldn't be anything doing so? Another aspect that might not make this a good choice is that it actually seems relevant to be able to test cases where there are very old, still-running transactions...

- As a variant of the previous idea: if that turns out to be unreliable, we could instead set nextXid, start in single-user mode, create a blocking 2PC transaction, and then start normally. Because there's no old active xid we'd not run into the StartupSUBTRANS problem.

For 2), I don't really have a better idea than making that configurable somehow?

3) is probably tolerable for now; we could skip the test if BLCKSZ isn't 8KB, or we could hardcode the calculation for different block sizes.
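For reference, a minimal sketch (not from the patch) of how a test script could derive the clog-page-aligned XID used for pg_resetwal -x in item 3, instead of hardcoding it. It assumes the current layout constants in clog.c (two status bits per transaction, so BLCKSZ * 4 xids per clog page):

    use strict;
    use warnings;

    # Assumed from clog.c: 2 status bits per xact => 4 xacts per byte,
    # so one clog page covers BLCKSZ * 4 transaction IDs.
    my $blcksz              = 8192;        # adjust if built with a different BLCKSZ
    my $clog_xacts_per_page = $blcksz * 4; # 32768 for 8KB blocks

    # Round the target XID up to the first XID of the next clog page.
    my $target = 2_000_000_000;
    my $xid = (int($target / $clog_xacts_per_page) + 1) * $clog_xacts_per_page;

    print "pg_resetwal -x $xid\n";         # prints 2000027648 for 8KB blocks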
I noticed one minor bug that's likely new:

2021-04-23 13:32:30.899 PDT [2027738] LOG:  automatic aggressive vacuum to prevent wraparound of table "postgres.public.small_trunc": index scans: 1
        pages: 400 removed, 28 remain, 0 skipped due to pins, 0 skipped frozen
        tuples: 14000 removed, 1000 remain, 0 are dead but not yet removable, oldest xmin: 2000027651
        buffer usage: 735 hits, 1262 misses, 874 dirtied
        index scan needed: 401 pages from table (1432.14% of total) had 14000 dead item identifiers removed
        index "small_trunc_pkey": pages: 43 in total, 37 newly deleted, 37 currently deleted, 0 reusable
        avg read rate: 559.048 MB/s, avg write rate: 387.170 MB/s
        system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
        WAL usage: 1809 records, 474 full page images, 3977538 bytes

'1432.14% of total' - looks like removed pages need to be added before the percentage calculation?

Greetings,

Andres Freund
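A quick sanity check with the figures from the log above supports that reading: dividing by only the pages that remain after truncation reproduces the reported value, while adding back the removed pages gives a sensible percentage.

    # Figures from the log line above.
    my ($removed, $remain, $lpdead_pages) = (400, 28, 401);

    printf "reported: %.2f%%\n", 100 * $lpdead_pages / $remain;              # 1432.14%
    printf "expected: %.2f%%\n", 100 * $lpdead_pages / ($remain + $removed); # 93.69%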
On Fri, Apr 23, 2021 at 01:43:06PM -0700, Andres Freund wrote:
> 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
>    failsafe mode, we can't really create 4GB relations on the BF. While
>    writing the tests I've lowered this to 4MB...

> For 2), I don't really have a better idea than making that configurable
>    somehow?

Does it work to shut down the cluster and create the .0,.1,.2,.3 segments of a new, empty relation with zero blocks using something like truncate -s 1G ?

--
Justin
On Fri, Apr 23, 2021 at 1:43 PM Andres Freund <andres@anarazel.de> wrote:
> I started to write a test for $Subject, which I think we sorely need.

+1

> Currently my approach is to:
> - start a cluster, create a few tables with test data
> - acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
>   autovacuum from doing anything
> - cause dead tuples to exist
> - restart
> - run pg_resetwal -x 2000027648
> - do things like acquiring pins on pages that block vacuum from progressing
> - commit prepared transaction
> - wait for template0, template1 datfrozenxid to increase
> - wait for relfrozenxid for most relations in postgres to increase
> - release buffer pin
> - wait for postgres datfrozenxid to increase

Just having a standard-ish way to do stress testing like this would add something.

> 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
>    failsafe mode, we can't really create 4GB relations on the BF. While
>    writing the tests I've lowered this to 4MB...

The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits how often we'll consider the failsafe in the single-pass/no-indexes case. I see no reason why it cannot be changed now.

VACUUM_FSM_EVERY_PAGES also frustrates FSM testing in the single-pass case in about the same way, so maybe that should be considered as well? Note that the FSM handling for the single pass case is actually a bit different to the two pass/has-indexes case, since the single pass case calls lazy_vacuum_heap_page() directly in its first and only pass over the heap (that's the whole point of having it of course).

> 3) pg_resetwal -x requires carefully choosing an xid: it needs to be the
>    first xid on a clog page. It's not hard to determine which xids those are, but it
>    depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a
>    value appropriate for 8KB, but ...

Ugh.

> For 2), I don't really have a better idea than making that configurable
>    somehow?

That could make sense as a developer/testing option, I suppose. I just doubt that it makes sense as anything else.

> 2021-04-23 13:32:30.899 PDT [2027738] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.public.small_trunc": index scans: 1
>         pages: 400 removed, 28 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 14000 removed, 1000 remain, 0 are dead but not yet removable, oldest xmin: 2000027651
>         buffer usage: 735 hits, 1262 misses, 874 dirtied
>         index scan needed: 401 pages from table (1432.14% of total) had 14000 dead item identifiers removed
>         index "small_trunc_pkey": pages: 43 in total, 37 newly deleted, 37 currently deleted, 0 reusable
>         avg read rate: 559.048 MB/s, avg write rate: 387.170 MB/s
>         system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
>         WAL usage: 1809 records, 474 full page images, 3977538 bytes
>
> '1432.14% of total' - looks like removed pages need to be added before the
> percentage calculation?

Clearly this needs to account for removed heap pages in order to consistently express the percentage of pages with LP_DEAD items in terms of a percentage of the original table size. I can fix this shortly.

--
Peter Geoghegan
Hi,

On 2021-04-23 18:08:12 -0500, Justin Pryzby wrote:
> On Fri, Apr 23, 2021 at 01:43:06PM -0700, Andres Freund wrote:
> > 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
> >    failsafe mode, we can't really create 4GB relations on the BF. While
> >    writing the tests I've lowered this to 4MB...
>
> > For 2), I don't really have a better idea than making that configurable
> >    somehow?
>
> Does it work to shut down the cluster and create the .0,.1,.2,.3 segments of a
> new, empty relation with zero blocks using something like truncate -s 1G ?

I'd like this to be portable to at least Windows - I don't know how well that deals with sparse files. But the bigger issue is that, IIRC, that will trigger vacuum to try to initialize all those pages, which will then force all that space to be allocated anyway...

Greetings,

Andres Freund
Hi,

On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
> related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
> how often we'll consider the failsafe in the single-pass/no-indexes
> case.

I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?

> I see no reason why it cannot be changed now. VACUUM_FSM_EVERY_PAGES
> also frustrates FSM testing in the single-pass case in about the same
> way, so maybe that should be considered as well? Note that the FSM
> handling for the single pass case is actually a bit different to the
> two pass/has-indexes case, since the single pass case calls
> lazy_vacuum_heap_page() directly in its first and only pass over the
> heap (that's the whole point of having it of course).

I'm not opposed to lowering VACUUM_FSM_EVERY_PAGES (the costs don't seem all that high compared to vacuuming?), but I don't think there's as clear a need for testing around that as there is around wraparound.

The failsafe mode affects the table scan itself by disabling cost limiting. As far as I can see, the ways it triggers for the table scan (vs truncation or index processing) are:

1) Before vacuuming starts, for heap phases and indexes, if already necessary at that point
2) For a table with indexes, before/after each index vacuum, if now necessary
3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary

Why would we want to trigger the failsafe mode during a scan of a table with dead tuples and no indexes, but not on a table without dead tuples or with indexes but fewer than m_w_m dead tuples? That makes little sense to me.

It seems that for the no-index case the warning message is quite off?

        ereport(WARNING,
                (errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a failsafe after %d index scans",

It doesn't exactly make one understand that vacuum cost limiting is now disabled? And it's confusing because there would never be index vacuuming. And even in the cases where indexes exist, it's odd to talk about abandoning index vacuuming that hasn't even started yet.

> > For 2), I don't really have a better idea than making that configurable
> > somehow?
>
> That could make sense as a developer/testing option, I suppose. I just
> doubt that it makes sense as anything else.

Yea, I only was thinking of making it configurable to be able to test it. If we change the limit to something considerably lower I wouldn't see a need for that anymore.

Greetings,

Andres Freund
On Fri, Apr 23, 2021 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> > The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
> > related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
> > how often we'll consider the failsafe in the single-pass/no-indexes
> > case.
>
> I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES
> and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?

VACUUM_FSM_EVERY_PAGES controls how often VACUUM does work that usually takes place right after the two pass case finishes a round of index and heap vacuuming. This is work that we certainly don't want to do every time we process a single heap page in the one-pass/no-indexes case. Initially this just meant FSM vacuuming, but it now includes a failsafe check.

Of course all of the precise details here are fairly arbitrary (including VACUUM_FSM_EVERY_PAGES, which has been around for a couple of releases now). The overall goal that I had in mind was to make the one-pass case's use of the failsafe have analogous behavior to the two-pass/has-indexes case -- a goal which was itself somewhat arbitrary.

> The failsafe mode affects the table scan itself by disabling cost
> limiting. As far as I can see the ways it triggers for the table scan (vs
> truncation or index processing) are:
>
> 1) Before vacuuming starts, for heap phases and indexes, if already
>    necessary at that point
> 2) For a table with indexes, before/after each index vacuum, if now
>    necessary
> 3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary
>
> Why would we want to trigger the failsafe mode during a scan of a table
> with dead tuples and no indexes, but not on a table without dead tuples
> or with indexes but fewer than m_w_m dead tuples? That makes little
> sense to me.

What alternative does make sense to you?

It seemed important to put the failsafe check at points where we do other analogous work in all cases. We made a pragmatic trade-off. In theory almost any scheme might not check often enough, and/or might check too frequently.

> It seems that for the no-index case the warning message is quite off?

I'll fix that up some point soon. FWIW this happened because the support for one-pass VACUUM was added quite late, at Robert's request.

Another issue with the failsafe commit is that we haven't considered the autovacuum_multixact_freeze_max_age table reloption -- we only check the GUC. That might have accidentally been the right thing to do, though, since the reloption is interpreted as lower than the GUC in all cases anyway -- arguably the autovacuum_multixact_freeze_max_age GUC should be all we care about anyway. I will need to think about this question some more, though.

> > > For 2), I don't really have a better idea than making that configurable
> > > somehow?
> >
> > That could make sense as a developer/testing option, I suppose. I just
> > doubt that it makes sense as anything else.
>
> Yea, I only was thinking of making it configurable to be able to test
> it. If we change the limit to something considerably lower I wouldn't
> see a need for that anymore.

It would probably be okay to just lower it significantly. Not sure if that's the best approach, though. Will pick it up next week.

--
Peter Geoghegan
Hi,

On 2021-04-23 19:15:43 -0700, Peter Geoghegan wrote:
> > The failsafe mode affects the table scan itself by disabling cost
> > limiting. As far as I can see the ways it triggers for the table scan (vs
> > truncation or index processing) are:
> >
> > 1) Before vacuuming starts, for heap phases and indexes, if already
> >    necessary at that point
> > 2) For a table with indexes, before/after each index vacuum, if now
> >    necessary
> > 3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary
> >
> > Why would we want to trigger the failsafe mode during a scan of a table
> > with dead tuples and no indexes, but not on a table without dead tuples
> > or with indexes but fewer than m_w_m dead tuples? That makes little
> > sense to me.
>
> What alternative does make sense to you?

Check it every so often, independent of whether there are indexes or dead tuples? Or just check it at the boundaries.

I'd make it dependent on the number of pages scanned, rather than the block distance to the last check - otherwise we might end up doing it way too often when there's only a few individual pages not in the freeze map.

Greetings,

Andres Freund
On Fri, Apr 23, 2021 at 7:33 PM Andres Freund <andres@anarazel.de> wrote:
> Check it every so often, independent of whether there are indexes or
> dead tuples? Or just check it at the boundaries.

I think that the former suggestion might be better -- I actually thought about doing it that way myself.

The latter suggestion sounds like you're suggesting that we just check it at the beginning and the end in all cases (we do the beginning in all cases already, but now we'd also do the end outside of the loop in all cases). Is that right?

If that is what you meant, then you should note that there'd hardly be any check in the one-pass case with that scheme (apart from the initial check that we do already). The only work we'd be skipping at the end (in the event of that check triggering the failsafe) would be heap truncation, which (as you've pointed out yourself) doesn't seem particularly likely to matter.

--
Peter Geoghegan
Hi,

On 2021-04-23 19:42:30 -0700, Peter Geoghegan wrote:
> On Fri, Apr 23, 2021 at 7:33 PM Andres Freund <andres@anarazel.de> wrote:
> > Check it every so often, independent of whether there are indexes or
> > dead tuples? Or just check it at the boundaries.
>
> I think that the former suggestion might be better -- I actually
> thought about doing it that way myself.

Cool.

> The latter suggestion sounds like you're suggesting that we just check
> it at the beginning and the end in all cases (we do the beginning in
> all cases already, but now we'd also do the end outside of the loop in
> all cases). Is that right?

Yes.

> If that is what you meant, then you should note that there'd hardly be
> any check in the one-pass case with that scheme (apart from the
> initial check that we do already). The only work we'd be skipping at
> the end (in the event of that check triggering the failsafe) would be
> heap truncation, which (as you've pointed out yourself) doesn't seem
> particularly likely to matter.

I mainly suggested it because to me the current seems hard to understand. I do think it'd be better to check more often. But checking depending on the amount of dead tuples at the right time doesn't strike me as a good idea - a lot of anti-wraparound vacuums will mainly be freezing tuples, rather than removing a lot of dead rows. Which makes it hard to understand when the failsafe kicks in.

Greetings,

Andres Freund
On Fri, Apr 23, 2021 at 7:53 PM Andres Freund <andres@anarazel.de> wrote:
> I mainly suggested it because to me the current seems hard to
> understand. I do think it'd be better to check more often. But checking
> depending on the amount of dead tuples at the right time doesn't strike
> me as a good idea - a lot of anti-wraparound vacuums will mainly be
> freezing tuples, rather than removing a lot of dead rows. Which makes it
> hard to understand when the failsafe kicks in.

I'm convinced -- decoupling the logic from the one-pass-not-two pass case seems likely to be simpler and more useful. For both the one pass and two pass/has indexes case.

--
Peter Geoghegan
On Fri, Apr 23, 2021 at 7:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm convinced -- decoupling the logic from the one-pass-not-two pass
> case seems likely to be simpler and more useful. For both the one pass
> and two pass/has indexes case.

Attached draft patch does it that way.

--
Peter Geoghegan
Attachment
On Sat, Apr 24, 2021 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Apr 23, 2021 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> > > The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits how often we'll consider the failsafe in the single-pass/no-indexes case.
> >
> > I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?
>
> VACUUM_FSM_EVERY_PAGES controls how often VACUUM does work that usually takes place right after the two pass case finishes a round of index and heap vacuuming. This is work that we certainly don't want to do every time we process a single heap page in the one-pass/no-indexes case. Initially this just meant FSM vacuuming, but it now includes a failsafe check.
>
> Of course all of the precise details here are fairly arbitrary (including VACUUM_FSM_EVERY_PAGES, which has been around for a couple of releases now). The overall goal that I had in mind was to make the one-pass case's use of the failsafe have analogous behavior to the two-pass/has-indexes case -- a goal which was itself somewhat arbitrary.
>
> > The failsafe mode affects the table scan itself by disabling cost limiting. As far as I can see the ways it triggers for the table scan (vs truncation or index processing) are:
> >
> > 1) Before vacuuming starts, for heap phases and indexes, if already necessary at that point
> > 2) For a table with indexes, before/after each index vacuum, if now necessary
> > 3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary
> >
> > Why would we want to trigger the failsafe mode during a scan of a table with dead tuples and no indexes, but not on a table without dead tuples or with indexes but fewer than m_w_m dead tuples? That makes little sense to me.
>
> What alternative does make sense to you?
>
> It seemed important to put the failsafe check at points where we do other analogous work in all cases. We made a pragmatic trade-off. In theory almost any scheme might not check often enough, and/or might check too frequently.
>
> > It seems that for the no-index case the warning message is quite off?
>
> I'll fix that up some point soon. FWIW this happened because the support for one-pass VACUUM was added quite late, at Robert's request.

+1 to fix this. Are you already working on fixing this? If not, I'll post a patch.

> Another issue with the failsafe commit is that we haven't considered the autovacuum_multixact_freeze_max_age table reloption -- we only check the GUC. That might have accidentally been the right thing to do, though, since the reloption is interpreted as lower than the GUC in all cases anyway -- arguably the autovacuum_multixact_freeze_max_age GUC should be all we care about anyway. I will need to think about this question some more, though.

FWIW, I intentionally ignored the reloption there since they're interpreted as lower than the GUC, as you mentioned, and the situation where we need to enter the failsafe mode is not a table-specific problem but a system-wide problem.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> +1 to fix this. Are you already working on fixing this? If not, I'll post a patch.

I posted a patch recently (last Thursday my time). Perhaps you can review it?

--
Peter Geoghegan
On Tue, May 18, 2021 at 2:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > +1 to fix this. Are you already working on fixing this? If not, I'll post a patch.
>
> I posted a patch recently (last Thursday my time). Perhaps you can review it?

Oh, I missed that the patch includes that fix. I'll review the patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, May 18, 2021 at 2:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 2:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > +1 to fix this. Are you already working on fixing this? If not, I'll post a patch.
> >
> > I posted a patch recently (last Thursday my time). Perhaps you can review it?
>
> Oh, I missed that the patch includes that fix. I'll review the patch.

I've reviewed the patch. Here is one comment:

         if (vacrel->num_index_scans == 0 &&
-            vacrel->rel_pages <= FAILSAFE_MIN_PAGES)
+            vacrel->rel_pages <= FAILSAFE_EVERY_PAGES)
                 return false;

Since there is the condition "vacrel->num_index_scans == 0", we could enter the failsafe mode even if the table is less than 4GB, if we enter lazy_check_wraparound_failsafe() after executing more than one index scan. Whereas a vacuum on a table that is less than 4GB and has no index never enters the failsafe mode. I think we can remove this condition, since I don't see a reason why such a small table should be allowed to enter the failsafe mode only after its first index scan. What do you think?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, May 18, 2021 at 12:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Since there is the condition "vacrel->num_index_scans == 0", we could enter the failsafe mode even if the table is less than 4GB, if we enter lazy_check_wraparound_failsafe() after executing more than one index scan. Whereas a vacuum on a table that is less than 4GB and has no index never enters the failsafe mode. I think we can remove this condition, since I don't see a reason why such a small table should be allowed to enter the failsafe mode only after its first index scan. What do you think?

I'm convinced -- this does seem like premature optimization now. I pushed a version of the patch that removes that code just now. Thanks

--
Peter Geoghegan
On Thu, Jun 10, 2021 at 10:52 AM Andres Freund <andres@anarazel.de> wrote:
> I started to write a test for $Subject, which I think we sorely need.
> Currently my approach is to:
> - start a cluster, create a few tables with test data
> - acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
>   autovacuum from doing anything
> - cause dead tuples to exist
> - restart
> - run pg_resetwal -x 2000027648
> - do things like acquiring pins on pages that block vacuum from progressing
> - commit prepared transaction
> - wait for template0, template1 datfrozenxid to increase
> - wait for relfrozenxid for most relations in postgres to increase
> - release buffer pin
> - wait for postgres datfrozenxid to increase
Cool. Thank you for working on that!
Could you please share a WIP patch for the $subj? I'd be happy to help with it.
> So far so good. But I've encountered a few things that stand in the way of
> enabling such a test by default:
>
> 1) During startup StartupSUBTRANS() zeroes out all pages between
>    oldestActiveXID and nextXid. That takes 8s on my workstation, but only
>    because I have plenty of memory - pg_subtrans ends up 14GB as I currently do
>    the test. Clearly not something we could do on the BF.
>
> ....
>
> 3) pg_resetwal -x requires carefully choosing an xid: it needs to be the
>    first xid on a clog page. It's not hard to determine which xids those are, but it
>    depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a
>    value appropriate for 8KB, but ...
Maybe we can add a new pg_resetwal option? Something like pg_resetwal --xid-near-wraparound, which would ask pg_resetwal to calculate the exact xid value using the values from pg_control and the clog macros?
I think it might come in handy for manual testing too.
> I have 2 1/2 ideas about addressing 1):
>
> - We could expose functionality to advance nextXid to a future value at
>   runtime, without filling in clog/subtrans pages. Would probably have to live
>   in varsup.c and be exposed via regress.so or such?
This option looks scary to me. Several functions rely on the fact that StartupSUBTRANS() has zeroed pages.
And if we make it conditional just for tests, it means that we won't test the real code path.
> - The only reason StartupSUBTRANS() does that work is because of the prepared
>   transaction holding back oldestActiveXID. That transaction in turn exists to
>   prevent autovacuum from doing anything before we do the test setup steps.
>   Perhaps it'd be sufficient to set autovacuum_naptime really high initially,
>   perform the test setup, set naptime to something lower, and reload the config. But
>   I'm worried that might not be reliable: if something ends up allocating an
>   xid we'd potentially reach the path in GetNewTransactionId() that wakes up the
>   launcher? But probably there wouldn't be anything doing so?
>   Another aspect that might not make this a good choice is that it actually
>   seems relevant to be able to test cases where there are very old, still-running
>   transactions...
Maybe this exact scenario can be covered with a separate long-running test, not included in the buildfarm test suite?
Best regards,
Lubennikova Anastasia
Hi,

On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> Cool. Thank you for working on that!
> Could you please share a WIP patch for the $subj? I'd be happy to help with it.

I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl, but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.

Regards,

Andres
Attachment
On Fri, Jun 11, 2021 at 10:19 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > Cool. Thank you for working on that!
> > Could you please share a WIP patch for the $subj? I'd be happy to help with it.
>
> I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.

Thank you for sharing the WIP patch.

Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time for zeroing out all pages), how about using single-user mode instead of preparing the transaction? That is, after pg_resetwal we check the ages of datfrozenxid by executing a query in single-user mode. That way, we don't need to worry about autovacuum concurrently running while checking the ages of frozenxids. I've attached a PoC patch that does the scenario like:

1. start cluster with autovacuum=off and create tables with a few data and make garbage on them
2. stop cluster and do pg_resetwal
3. start cluster in single-user mode
4. check age(datfrozenxid)
5. stop cluster
6. start cluster and wait for autovacuums to increase template0, template1, and postgres datfrozenxids

I put the new tests in src/test/module/heap since we already have tests for brin in src/test/module/brin. I think that a tap test facility to run queries in single-user mode will also be helpful for testing a new vacuum option/command that is intended to be used in emergency cases and proposed here[1].

Regards,

[1] https://www.postgresql.org/message-id/flat/20220128012842.GZ23027%40telsasoft.com#b76c13554f90d1c8bb5532d6f3e5cbf8

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
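To make the single-user-mode step above concrete, a TAP test could drive postgres --single with IPC::Run along these lines. single_user_query is a hypothetical helper written purely for illustration, not something provided by the test framework, and assumes the cluster has already been stopped:

    use IPC::Run ();

    # Hypothetical helper (illustration only): run one query against a stopped
    # cluster in single-user mode and return the raw backend output.
    # Single-user mode reads one command per input line, so the query must
    # not contain newlines.
    sub single_user_query
    {
        my ($node, $query) = @_;
        my ($stdout, $stderr) = ('', '');
        IPC::Run::run(
            [ 'postgres', '--single', '-D', $node->data_dir, 'postgres' ],
            '<',  \"$query\n",
            '>',  \$stdout,
            '2>', \$stderr) or die "single-user postgres failed: $stderr";
        return $stdout;
    }

    # e.g. step 4 above: check datfrozenxid ages while autovacuum cannot run
    my $ages = single_user_query($node,
        "SELECT datname, age(datfrozenxid) FROM pg_database");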
Hi,

On Tue, Feb 1, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jun 11, 2021 at 10:19 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > > Cool. Thank you for working on that!
> > > Could you please share a WIP patch for the $subj? I'd be happy to help with it.
> >
> > I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.
>
> Thank you for sharing the WIP patch.
>
> Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time for zeroing out all pages), how about using single-user mode instead of preparing the transaction? That is, after pg_resetwal we check the ages of datfrozenxid by executing a query in single-user mode. That way, we don't need to worry about autovacuum concurrently running while checking the ages of frozenxids. I've attached a PoC patch that does the scenario like:
>
> 1. start cluster with autovacuum=off and create tables with a few data and make garbage on them
> 2. stop cluster and do pg_resetwal
> 3. start cluster in single-user mode
> 4. check age(datfrozenxid)
> 5. stop cluster
> 6. start cluster and wait for autovacuums to increase template0, template1, and postgres datfrozenxids

The above steps are wrong.

I think we can expose a function in an extension used only by this test in order to set nextXid to a future value with zeroing out clog/subtrans pages. We don't need to fill all clog/subtrans pages between oldestActiveXID and nextXid. I've attached a PoC patch for adding this regression test and am going to register it to the next CF.

BTW, while testing the emergency situation, I found there is a race condition where anti-wraparound vacuum isn't invoked with the settings autovacuum = off, autovacuum_max_workers = 1. An autovacuum worker sends a signal to the postmaster after advancing datfrozenxid in SetTransactionIdLimit(). But with those settings, if the autovacuum launcher attempts to launch a worker before the autovacuum worker who has signaled to the postmaster finishes, the launcher exits without launching a worker due to no free workers. The new launcher won't be launched until a new XID is generated (and only when new XID % 65536 == 0). Although autovacuum_max_workers = 1 is not mandatory for this test, it makes it easier to verify the order of operations.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
On Thu, Jun 30, 2022 at 10:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Tue, Feb 1, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jun 11, 2021 at 10:19 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > > > Cool. Thank you for working on that!
> > > > Could you please share a WIP patch for the $subj? I'd be happy to help with it.
> > >
> > > I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.
> >
> > Thank you for sharing the WIP patch.
> >
> > Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time for zeroing out all pages), how about using single-user mode instead of preparing the transaction? That is, after pg_resetwal we check the ages of datfrozenxid by executing a query in single-user mode. That way, we don't need to worry about autovacuum concurrently running while checking the ages of frozenxids. I've attached a PoC patch that does the scenario like:
> >
> > 1. start cluster with autovacuum=off and create tables with a few data and make garbage on them
> > 2. stop cluster and do pg_resetwal
> > 3. start cluster in single-user mode
> > 4. check age(datfrozenxid)
> > 5. stop cluster
> > 6. start cluster and wait for autovacuums to increase template0, template1, and postgres datfrozenxids
>
> The above steps are wrong.
>
> I think we can expose a function in an extension used only by this test in order to set nextXid to a future value with zeroing out clog/subtrans pages. We don't need to fill all clog/subtrans pages between oldestActiveXID and nextXid. I've attached a PoC patch for adding this regression test and am going to register it to the next CF.
>
> BTW, while testing the emergency situation, I found there is a race condition where anti-wraparound vacuum isn't invoked with the settings autovacuum = off, autovacuum_max_workers = 1. An autovacuum worker sends a signal to the postmaster after advancing datfrozenxid in SetTransactionIdLimit(). But with those settings, if the autovacuum launcher attempts to launch a worker before the autovacuum worker who has signaled to the postmaster finishes, the launcher exits without launching a worker due to no free workers. The new launcher won't be launched until a new XID is generated (and only when new XID % 65536 == 0). Although autovacuum_max_workers = 1 is not mandatory for this test, it makes it easier to verify the order of operations.

Hi

Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.

Regards

Ian Barwick
On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
> Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.

I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.

I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.

I've written some variant of this helper function many times over the years, for ad hoc testing. I'd love to have it permanently in the git tree.

In addition to Masahiko's test for emergency vacuum, this includes two other tests. 002_limits.pl tests the "warn limit" and "stop limit" in GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion transactions in total, exercising XID wraparound in general. Unfortunately these tests are pretty slow; the tests run for about 4 minutes on my laptop in total, and use about 20 GB of disk space. So perhaps these need to be put in a special test suite that's not run as part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, that's the slowest of the tests. But I'd love to have these in the git tree in some form.

[1] https://www.postgresql.org/message-id/CAJ7c6TPKf0W3MfpP2vr=kq7-NM5G12vTBhi7miu_5m8AG3Cw-w@mail.gmail.com)

- Heikki
On 03/03/2023 13:34, Heikki Linnakangas wrote:
> On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
>> Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.
>
> I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.

Forgot attachment.

- Heikki
Attachment
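As a rough usage sketch of the XID-consuming helper Heikki describes above: the extension and function names here are illustrative placeholders and may not match what the attached patch actually defines.

    # Illustrative names only; the attached patch may spell these differently.
    $node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');

    # Consume ~100 million XIDs.  Per the description above, the helper calls
    # GetNewTransactionId() for real near SLRU page boundaries and wraparound,
    # and skips cheaply over the "uninteresting" stretches in between.
    $node->safe_psql('postgres', 'SELECT consume_xids(100000000)');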
On Fri, Mar 3, 2023 at 8:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
> > Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.
>
> I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.

Thank you for reviving this thread!

> I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.
>
> I've written some variant of this helper function many times over the years, for ad hoc testing. I'd love to have it permanently in the git tree.

These functions seem to be better than mine.

> In addition to Masahiko's test for emergency vacuum, this includes two other tests. 002_limits.pl tests the "warn limit" and "stop limit" in GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion transactions in total, exercising XID wraparound in general. Unfortunately these tests are pretty slow; the tests run for about 4 minutes on my laptop in total, and use about 20 GB of disk space. So perhaps these need to be put in a special test suite that's not run as part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, that's the slowest of the tests. But I'd love to have these in the git tree in some form.

cfbot reports some failures. The main reason seems to be that meson.build in the xid_wraparound directory adds the regression tests, but the .sql and .out files are missing in the patch. Perhaps the patch wants to add only tap tests, as the Makefile doesn't define REGRESS?

Even after fixing this issue, CI tests (Cirrus CI) are not happy and report failures due to a disk full. The size of the xid_wraparound test directory is 105MB out of 262MB:

% du -sh testrun
262M    testrun
% du -sh testrun/xid_wraparound/
105M    testrun/xid_wraparound/
% du -sh testrun/xid_wraparound/*
460K    testrun/xid_wraparound/001_emergency_vacuum
93M     testrun/xid_wraparound/002_limits
12M     testrun/xid_wraparound/003_wraparounds
% ls -lh testrun/xid_wraparound/002_limits/log*
total 93M
-rw-------. 1 masahiko masahiko 93M Mar  7 17:34 002_limits_wraparound.log
-rw-rw-r--. 1 masahiko masahiko 20K Mar  7 17:34 regress_log_002_limits

The biggest file is the server log, since an autovacuum worker writes autovacuum logs for every table every second (autovacuum_naptime is 1s). Maybe we can set the log_autovacuum_min_duration reloption for the test tables instead of globally enabling it.

The 001 test uses a 2PC transaction that holds locks on tables, but since we can consume xids while the server is running, we don't need that. Instead I think we can keep a transaction open in the background like the 002 test does.

I'll try these ideas.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
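The per-table alternative mentioned above could be wired into the test setup along these lines (the table name is just an example), so that only the tables of interest produce autovacuum log output rather than every table on every naptime cycle:

    # Log autovacuum activity only for the table under test, instead of
    # enabling log_autovacuum_min_duration = 0 cluster-wide.
    $node->safe_psql('postgres',
        'ALTER TABLE large_tbl SET (log_autovacuum_min_duration = 0)');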
On Fri, Mar 3, 2023 at 3:34 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.

Surely your tap test should be using single user mode? Perhaps you missed the obnoxious HINT, that's part of the WARNING that the test parses? ;-)

This is a very useful patch. I certainly don't want to make life harder by (say) connecting it to the single user mode problem. But...the single user mode thing really needs to go away. It's just terrible advice, and actively harms users.

--
Peter Geoghegan
On Tue, Mar 07, 2023 at 09:21:00PM -0800, Peter Geoghegan wrote:
> Surely your tap test should be using single user mode? Perhaps you missed the obnoxious HINT, that's part of the WARNING that the test parses? ;-)

I may be missing something, but you cannot use directly a "postgres" command in a TAP test, can you? See 1a9d802, that has fixed a problem when TAP tests run with a privileged account on Windows.

--
Michael
Attachment
On Wed, Mar 8, 2023 at 10:47 PM Michael Paquier <michael@paquier.xyz> wrote:
> I may be missing something, but you cannot use directly a "postgres" command in a TAP test, can you? See 1a9d802, that has fixed a problem when TAP tests run with a privileged account on Windows.

I was joking. But I did have a real point: once we have tests for the xidStopLimit mechanism, why not take the opportunity to correct the long standing issue with the documentation advising the use of single user mode?

--
Peter Geoghegan
On Sat, Mar 11, 2023 at 8:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I was joking. But I did have a real point: once we have tests for the xidStopLimit mechanism, why not take the opportunity to correct the long standing issue with the documentation advising the use of single user mode?

Does https://commitfest.postgresql.org/42/4128/ address that independently enough?

--Jacob
On Mon, Mar 13, 2023 at 3:25 PM Jacob Champion <jchampion@timescale.com> wrote:
> Does https://commitfest.postgresql.org/42/4128/ address that independently enough?

I wasn't aware of that patch. It looks like it does exactly what I was arguing in favor of. So yes.

--
Peter Geoghegan
On Wed, Mar 8, 2023 at 1:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 3, 2023 at 8:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
> > > Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.
> >
> > I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.
>
> Thank you for reviving this thread!
>
> > I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.
> >
> > I've written some variant of this helper function many times over the years, for ad hoc testing. I'd love to have it permanently in the git tree.
>
> These functions seem to be better than mine.
>
> > In addition to Masahiko's test for emergency vacuum, this includes two other tests. 002_limits.pl tests the "warn limit" and "stop limit" in GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion transactions in total, exercising XID wraparound in general. Unfortunately these tests are pretty slow; the tests run for about 4 minutes on my laptop in total, and use about 20 GB of disk space. So perhaps these need to be put in a special test suite that's not run as part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, that's the slowest of the tests. But I'd love to have these in the git tree in some form.
>
> cfbot reports some failures. The main reason seems to be that meson.build in the xid_wraparound directory adds the regression tests, but the .sql and .out files are missing in the patch. Perhaps the patch wants to add only tap tests, as the Makefile doesn't define REGRESS?
>
> Even after fixing this issue, CI tests (Cirrus CI) are not happy and report failures due to a disk full. The size of the xid_wraparound test directory is 105MB out of 262MB:
>
> % du -sh testrun
> 262M    testrun
> % du -sh testrun/xid_wraparound/
> 105M    testrun/xid_wraparound/
> % du -sh testrun/xid_wraparound/*
> 460K    testrun/xid_wraparound/001_emergency_vacuum
> 93M     testrun/xid_wraparound/002_limits
> 12M     testrun/xid_wraparound/003_wraparounds
> % ls -lh testrun/xid_wraparound/002_limits/log*
> total 93M
> -rw-------. 1 masahiko masahiko 93M Mar  7 17:34 002_limits_wraparound.log
> -rw-rw-r--. 1 masahiko masahiko 20K Mar  7 17:34 regress_log_002_limits
>
> The biggest file is the server log, since an autovacuum worker writes autovacuum logs for every table every second (autovacuum_naptime is 1s). Maybe we can set the log_autovacuum_min_duration reloption for the test tables instead of globally enabling it.

I think it could be acceptable, since the 002 and 003 tests are executed only when required. And the 001 test seems to be able to pass on cfbot, but it takes more than 30 sec.

In the attached patch, I made these tests optional; they are enabled if the environment variable ENABLE_XID_WRAPAROUND_TESTS is defined (supporting only autoconf).

> The 001 test uses a 2PC transaction that holds locks on tables, but since we can consume xids while the server is running, we don't need that. Instead I think we can keep a transaction open in the background like the 002 test does.

Updated in the new patch. Also, I added a check if the failsafe mode is triggered.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
I agree having the new functions in the tree is useful. I also tried running the TAP tests in v2, but 001 and 002 fail to run:
Odd number of elements in hash assignment at [...]/Cluster.pm line 2010.
Can't locate object method "pump_nb" via package "PostgreSQL::Test::BackgroundPsql" at [...]
It seems to be complaining about
+my $in = '';
+my $out = '';
+my $timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+my $background_psql = $node->background_psql('postgres', \$in, \$out, $timeout);
...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Apr 21, 2023 at 12:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I agree having the new functions in the tree is useful. I also tried running the TAP tests in v2, but 001 and 002 fail to run:
>
> Odd number of elements in hash assignment at [...]/Cluster.pm line 2010.
> Can't locate object method "pump_nb" via package "PostgreSQL::Test::BackgroundPsql" at [...]
>
> It seems to be complaining about
>
> +my $in = '';
> +my $out = '';
> +my $timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
> +my $background_psql = $node->background_psql('postgres', \$in, \$out, $timeout);
>
> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?

Thanks for reporting. I think that the patch needs to be updated, since commit 664d757531e1 changed the background psql TAP functions. I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
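For readers following along, after commit 664d757531e1 a background session is obtained and driven roughly like this; this is a sketch of the post-commit interface rather than an excerpt from the patch, and the exact queries are just examples:

    # Post-664d757531e1 style: background_psql() takes key => value options and
    # returns a PostgreSQL::Test::BackgroundPsql object instead of in/out handles.
    my $background_psql = $node->background_psql('postgres', on_error_stop => 1);

    # Hold an open transaction (and thus an old XID) while the test runs.
    $background_psql->query_safe('BEGIN; SELECT txid_current();');

    # ... run the wraparound scenario ...

    $background_psql->quit;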
> On 27 Apr 2023, at 16:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Apr 21, 2023 at 12:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
>
> Thanks for reporting. I think that the patch needs to be updated, since commit 664d757531e1 changed the background psql TAP functions. I've attached the updated patch.

Is there a risk that the background psql will time out on slow systems during the consumption of 2B xid's? Since you mainly want to hold it open for the duration of testing, you might want to bump it to avoid false negatives on slow test systems.

--
Daniel Gustafsson
On Thu, Apr 27, 2023 at 9:12 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 27 Apr 2023, at 16:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Fri, Apr 21, 2023 at 12:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> >> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
> >
> > Thanks for reporting. I think that the patch needs to be updated since
> > commit 664d757531e1 changed background psql TAP functions. I've
> > attached the updated patch.
Thanks, it passes for me now.
> Is there a risk that the background psql will time out on slow systems during
> the consumption of 2B xid's? Since you mainly want to hold it open for the
> duration of testing you might want to bump it to avoid false negatives on slow
> test systems.
If they're that slow, I'd worry more about generating 20GB of xact status data. That's why the tests are disabled by default.
--
John Naylor
EDB: http://www.enterprisedb.com
John Naylor <john.naylor@enterprisedb.com> writes:
> On Thu, Apr 27, 2023 at 9:12 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Is there a risk that the background psql will time out on slow systems during the consumption of 2B xid's? Since you mainly want to hold it open for the duration of testing you might want to bump it to avoid false negatives on slow test systems.

> If they're that slow, I'd worry more about generating 20GB of xact status data. That's why the tests are disabled by default.

There is exactly zero chance that anyone will accept the introduction of such an expensive test into either check-world or the buildfarm sequence.

                        regards, tom lane
> On 28 Apr 2023, at 06:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Naylor <john.naylor@enterprisedb.com> writes:
>> If they're that slow, I'd worry more about generating 20GB of xact status data. That's why the tests are disabled by default.
>
> There is exactly zero chance that anyone will accept the introduction of such an expensive test into either check-world or the buildfarm sequence.

Even though the entire suite is disabled by default, shouldn't it also require PG_TEST_EXTRA, to be consistent with other off-by-default suites like, for example, src/test/kerberos?

--
Daniel Gustafsson
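For reference, the existing off-by-default suites gate themselves with a check along these lines at the top of each test script; the suite name xid_wraparound here is the one being proposed, and the wording of the skip message is only illustrative:

    use strict;
    use warnings;
    use PostgreSQL::Test::Utils;
    use Test::More;

    # Same opt-in pattern as src/test/kerberos and friends.
    if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bxid_wraparound\b/)
    {
        plan skip_all => 'test xid_wraparound not enabled in PG_TEST_EXTRA';
    }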
On Thu, Apr 27, 2023 at 11:12 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 27 Apr 2023, at 16:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Fri, Apr 21, 2023 at 12:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> >> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
> >
> > Thanks for reporting. I think that the patch needs to be updated, since commit 664d757531e1 changed the background psql TAP functions. I've attached the updated patch.
>
> Is there a risk that the background psql will time out on slow systems during the consumption of 2B xid's? Since you mainly want to hold it open for the duration of testing, you might want to bump it to avoid false negatives on slow test systems.

Agreed. The timeout can be set by manually setting PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it now requires setting PG_TEST_EXTRA to run it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
> On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Agreed. The timeout can be set by manually setting PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it now requires setting PG_TEST_EXTRA to run it.

+# bump the query timeout to avoid false negatives on slow test syetems.

typo: s/syetems/systems/

+# bump the query timeout to avoid false negatives on slow test syetems.
+$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;

Does this actually work? Utils.pm reads the environment variable at compile time in the BEGIN block, so this setting won't be seen? A quick test program seems to confirm this, but I might be missing something.

--
Daniel Gustafsson
On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: > +# bump the query timeout to avoid false negatives on slow test syetems. > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; > Does this actually work? Utils.pm read the environment variable at compile > time in the BEGIN block so this setting won't be seen? A quick testprogram > seems to confirm this but I might be missing something. I wish that this test were cheaper, without a need to depend on PG_TEST_EXTRA.. Actually, note that you are forgetting to update the documentation of PG_TEST_EXTRA with this new value of xid_wraparound. -- Michael
Attachment
> On 22 Aug 2023, at 07:49, Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: >> +# bump the query timeout to avoid false negatives on slow test syetems. >> +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; >> Does this actually work? Utils.pm read the environment variable at compile >> time in the BEGIN block so this setting won't be seen? A quick testprogram >> seems to confirm this but I might be missing something. > > I wish that this test were cheaper, without a need to depend on > PG_TEST_EXTRA.. Actually, note that you are forgetting to update the > documentation of PG_TEST_EXTRA with this new value of xid_wraparound. Agreed, it would be nice, but I don't see any way to achieve that. I still think the test is worthwhile to add, once the upthread mentioned issues are resolved. -- Daniel Gustafsson
On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote:
> > On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Agreed. The timeout can be set by manually setting
> > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it
> > now require setting PG_TET_EXTRA to run it.
>
> +# bump the query timeout to avoid false negatives on slow test syetems.
> typo: s/syetems/systems/
>
> +# bump the query timeout to avoid false negatives on slow test syetems.
> +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;
> Does this actually work?  Utils.pm read the environment variable at compile
> time in the BEGIN block so this setting won't be seen?  A quick testprogram
> seems to confirm this but I might be missing something.

The correct way to get a longer timeout is "IPC::Run::timer(4 *
$PostgreSQL::Test::Utils::timeout_default);".  Even if changing env worked,
that would be removing the ability for even-slower systems to set timeouts
greater than 10min.
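Spelled out, the suggestion is to build the session's timer from a multiple of
the standard default rather than overriding the default itself; a minimal
sketch:

    use strict;
    use warnings;
    use PostgreSQL::Test::Utils;
    use IPC::Run ();

    # Dilate the standard timeout instead of replacing it, so a user who sets
    # PG_TEST_TIMEOUT_DEFAULT higher on a very slow machine scales this test
    # up along with everything else.
    my $psql_timer =
      IPC::Run::timer(4 * $PostgreSQL::Test::Utils::timeout_default);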
Sorry for the late reply. On Sun, Sep 3, 2023 at 2:48 PM Noah Misch <noah@leadboat.com> wrote: > > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: > > > On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Agreed. The timeout can be set by manually setting > > > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it > > > now require setting PG_TET_EXTRA to run it. > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > typo: s/syetems/systems/ > > > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; > > Does this actually work? Utils.pm read the environment variable at compile > > time in the BEGIN block so this setting won't be seen? A quick testprogram > > seems to confirm this but I might be missing something. > > The correct way to get a longer timeout is "IPC::Run::timer(4 * > $PostgreSQL::Test::Utils::timeout_default);". Even if changing env worked, > that would be removing the ability for even-slower systems to set timeouts > greater than 10min. Agreed. I've attached new version patches. 0001 patch adds an option to background_psql to specify the timeout seconds, and 0002 patch is the main regression test patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
> On 27 Sep 2023, at 14:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached new version patches. 0001 patch adds an option to
> background_psql to specify the timeout seconds, and 0002 patch is the
> main regression test patch.

-=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params)
+=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params, timeout)

Looking at this I notice that I made a typo in 664d757531e, the =item line
should have "@psql_params" and not "@params".  Perhaps you can fix that minor
thing while in there?

+	$timeout = $params{timeout} if defined $params{timeout};

I think this should be documented in the background_psql POD docs.

+	Not enabled by default it is resource intensive.

This sentence is missing a "because", should read: "..by default *because* it
is.."

+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;

Should we bump the timeout like this for all systems?  I interpreted Noah's
comment such that it should be possible for slower systems to override, not
that it should be extended everywhere, but I might have missed something.

--
Daniel Gustafsson
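To make the shape of the proposed interface concrete, the 0002 test would
start its long-lived session roughly as below. This is a sketch of the API
under review (the timeout parameter name comes from the hunks quoted above,
and the node name is arbitrary), not the committed code:

    use strict;
    use warnings;
    use PostgreSQL::Test::Cluster;
    use PostgreSQL::Test::Utils;

    my $node = PostgreSQL::Test::Cluster->new('wraparound');
    $node->init;
    $node->start;

    # Pass the dilated timeout at session start; with the 0001 patch this is
    # the only point at which it can be set.
    my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
    my $background_psql =
      $node->background_psql('postgres', timeout => $psql_timeout_secs);

    $background_psql->query_safe('SELECT txid_current()');
    $background_psql->quit;
    $node->stop;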
On Thu, 28 Sept 2023 at 03:55, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Sorry for the late reply. > > On Sun, Sep 3, 2023 at 2:48 PM Noah Misch <noah@leadboat.com> wrote: > > > > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: > > > > On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Agreed. The timeout can be set by manually setting > > > > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it > > > > now require setting PG_TET_EXTRA to run it. > > > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > > typo: s/syetems/systems/ > > > > > > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; > > > Does this actually work? Utils.pm read the environment variable at compile > > > time in the BEGIN block so this setting won't be seen? A quick testprogram > > > seems to confirm this but I might be missing something. > > > > The correct way to get a longer timeout is "IPC::Run::timer(4 * > > $PostgreSQL::Test::Utils::timeout_default);". Even if changing env worked, > > that would be removing the ability for even-slower systems to set timeouts > > greater than 10min. > > Agreed. > > I've attached new version patches. 0001 patch adds an option to > background_psql to specify the timeout seconds, and 0002 patch is the > main regression test patch. Few comments: 1) Should we have some validation for the inputs given: +PG_FUNCTION_INFO_V1(consume_xids_until); +Datum +consume_xids_until(PG_FUNCTION_ARGS) +{ + FullTransactionId targetxid = FullTransactionIdFromU64((uint64) PG_GETARG_INT64(0)); + FullTransactionId lastxid; + + if (!FullTransactionIdIsNormal(targetxid)) + elog(ERROR, "targetxid %llu is not normal", (unsigned long long) U64FromFullTransactionId(targetxid)); If not it will take inputs like -1 and 999999999999999. Also the notice messages might confuse for the above values, as it will show a different untilxid value like the below: postgres=# SELECT consume_xids_until(999999999999999); NOTICE: consumed up to 0:10000809 / 232830:2764472319 2) Should this be added after worker_spi as we generally add it in the alphabetical order: diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build index fcd643f6f1..4054bde84c 100644 --- a/src/test/modules/meson.build +++ b/src/test/modules/meson.build @@ -10,6 +10,7 @@ subdir('libpq_pipeline') subdir('plsample') subdir('spgist_name_ops') subdir('ssl_passphrase_callback') +subdir('xid_wraparound') subdir('test_bloomfilter') 3) Similarly here too: index e81873cb5a..a4c845ab4a 100644 --- a/src/test/modules/Makefile +++ b/src/test/modules/Makefile @@ -13,6 +13,7 @@ SUBDIRS = \ libpq_pipeline \ plsample \ spgist_name_ops \ + xid_wraparound \ test_bloomfilter \ 4) The following includes are not required transam.h, fmgr.h, lwlock.h + * src/test/modules/xid_wraparound/xid_wraparound.c + * + * ------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/transam.h" +#include "access/xact.h" +#include "fmgr.h" +#include "miscadmin.h" +#include "storage/lwlock.h" +#include "storage/proc.h" Regards, Vignesh
On Fri, Sep 29, 2023 at 12:17:04PM +0200, Daniel Gustafsson wrote: > +# Bump the query timeout to avoid false negatives on slow test systems. > +my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default; > > Should we bump the timeout like this for all systems? I interpreted Noah's > comment such that it should be possible for slower systems to override, not > that it should be extended everywhere, but I might have missed something. This is the conventional way to do it. For an operation far slower than a typical timeout_default situation, the patch can and should dilate the default timeout like this. The patch version as of my last comment was extending the timeout but also blocking users from extending it further via PG_TEST_TIMEOUT_DEFAULT. The latest version restores PG_TEST_TIMEOUT_DEFAULT reactivity, resolving my comment.
On Fri, Sep 29, 2023 at 7:17 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 27 Sep 2023, at 14:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached new version patches. 0001 patch adds an option to > > background_psql to specify the timeout seconds, and 0002 patch is the > > main regression test patch. > > -=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params) > +=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params, timeout) > > Looking at this I notice that I made a typo in 664d757531e, the =item line > should have "@psql_params" and not "@params". Perhaps you can fix that minor > thing while in there? > > > + $timeout = $params{timeout} if defined $params{timeout}; > > I think this should be documented in the background_psql POD docs. While updating the documentation, I found the following description: =item $node->background_psql($dbname, %params) => PostgreSQL::Test::BackgroundPsql inst$ Invoke B<psql> on B<$dbname> and return a BackgroundPsql object. A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, which can be modified later. Is it true that we can modify the timeout after creating BackgroundPsql object? If so, it seems we don't need to introduce the new timeout argument. But how? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Is it true that we can modify the timeout after creating
> BackgroundPsql object? If so, it seems we don't need to introduce the
> new timeout argument. But how?

I can't remember if that's leftovers that incorrectly remains from an earlier
version of the BackgroundPsql work, or if it's a very bad explanation of
->set_query_timer_restart().  The timeout will use the timeout_default value
and that cannot be overridden, it can only be reset per query.

With your patch the timeout still cannot be changed, but at least set during
start which seems good enough until there are tests warranting more
complexity.  The docs should be corrected to reflect this in your patch.

--
Daniel Gustafsson
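For completeness, the knob Daniel mentions only controls when the per-query
timer is restarted, not how long it is. A sketch of its use, assuming $node is
an already-running cluster object:

    # Assumes $node is a running PostgreSQL::Test::Cluster instance.
    my $bg = $node->background_psql('postgres');

    # Restart the default-length query timer before each query, so one slow
    # statement does not eat into the budget of the statements that follow;
    # the timeout value itself still comes from timeout_default.
    $bg->set_query_timer_restart();
    $bg->query_safe('SELECT pg_sleep(1)');
    $bg->quit;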
On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Is it true that we can modify the timeout after creating > > BackgroundPsql object? If so, it seems we don't need to introduce the > > new timeout argument. But how? > > I can't remember if that's leftovers that incorrectly remains from an earlier > version of the BackgroundPsql work, or if it's a very bad explanation of > ->set_query_timer_restart(). The timeout will use the timeout_default value > and that cannot be overridden, it can only be reset per query. Thank you for confirming this. I see there is the same problem also in interactive_psql(). So I've attached the 0001 patch to fix these documentation issues. Which could be backpatched. > With your patch the timeout still cannot be changed, but at least set during > start which seems good enough until there are tests warranting more complexity. > The docs should be corrected to reflect this in your patch. I've incorporated the comments except for the following one and attached updated version of the rest patches: On Fri, Sep 29, 2023 at 7:20 PM vignesh C <vignesh21@gmail.com> wrote: > Few comments: > 1) Should we have some validation for the inputs given: > +PG_FUNCTION_INFO_V1(consume_xids_until); > +Datum > +consume_xids_until(PG_FUNCTION_ARGS) > +{ > + FullTransactionId targetxid = > FullTransactionIdFromU64((uint64) PG_GETARG_INT64(0)); > + FullTransactionId lastxid; > + > + if (!FullTransactionIdIsNormal(targetxid)) > + elog(ERROR, "targetxid %llu is not normal", (unsigned > long long) U64FromFullTransactionId(targetxid)); > > If not it will take inputs like -1 and 999999999999999. > Also the notice messages might confuse for the above values, as it > will show a different untilxid value like the below: > postgres=# SELECT consume_xids_until(999999999999999); > NOTICE: consumed up to 0:10000809 / 232830:2764472319 The full transaction ids shown in the notice messages are separated into epoch and xid so it's not a different value. This epoch-and-xid style is used also in pg_controldata output and makes sense to me to show the progress of xid consumption. Once the new test gets committed, I'll prepare a new buildfarm animal for that if possible. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
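As an aside on the epoch-and-xid notation Sawada-san describes: the NOTICE
simply splits the 64-bit full transaction ID at bit 32, so the value in
Vignesh's example is not a different number. A quick illustration in plain
Perl (assuming a 64-bit build):

    use strict;
    use warnings;

    # A FullTransactionId is 64 bits: the high 32 bits are the epoch and the
    # low 32 bits the xid, which is how pg_controldata and the NOTICE print it.
    my $fxid  = 999_999_999_999_999;
    my $epoch = $fxid >> 32;
    my $xid   = $fxid & 0xFFFFFFFF;
    printf "consumed up to %u:%u\n", $epoch, $xid;    # 232830:2764472319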
> On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: >> >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >>> Is it true that we can modify the timeout after creating >>> BackgroundPsql object? If so, it seems we don't need to introduce the >>> new timeout argument. But how? >> >> I can't remember if that's leftovers that incorrectly remains from an earlier >> version of the BackgroundPsql work, or if it's a very bad explanation of >> ->set_query_timer_restart(). The timeout will use the timeout_default value >> and that cannot be overridden, it can only be reset per query. > > Thank you for confirming this. I see there is the same problem also in > interactive_psql(). So I've attached the 0001 patch to fix these > documentation issues. -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, -which can be modified later. +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. Since it cannot be modified, I think we should just say "A timeout of .." and call it a default timeout. This obviously only matters for the backpatch since the sentence is removed in 0002. > Which could be backpatched. +1 >> With your patch the timeout still cannot be changed, but at least set during >> start which seems good enough until there are tests warranting more complexity. >> The docs should be corrected to reflect this in your patch. > > I've incorporated the comments except for the following one and > attached updated version of the rest patches: LGTM. -- Daniel Gustafsson
On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > >> > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> > >>> Is it true that we can modify the timeout after creating > >>> BackgroundPsql object? If so, it seems we don't need to introduce the > >>> new timeout argument. But how? > >> > >> I can't remember if that's leftovers that incorrectly remains from an earlier > >> version of the BackgroundPsql work, or if it's a very bad explanation of > >> ->set_query_timer_restart(). The timeout will use the timeout_default value > >> and that cannot be overridden, it can only be reset per query. > > > > Thank you for confirming this. I see there is the same problem also in > > interactive_psql(). So I've attached the 0001 patch to fix these > > documentation issues. > > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, > -which can be modified later. > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. > > Since it cannot be modified, I think we should just say "A timeout of .." and > call it a default timeout. This obviously only matters for the backpatch since > the sentence is removed in 0002. Agreed. I've attached new version patches (0002 and 0003 are unchanged except for the commit message). I'll push them, barring any objections. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Nov 29, 2023 at 5:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > > > On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > >> > > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > >> > > >>> Is it true that we can modify the timeout after creating > > >>> BackgroundPsql object? If so, it seems we don't need to introduce the > > >>> new timeout argument. But how? > > >> > > >> I can't remember if that's leftovers that incorrectly remains from an earlier > > >> version of the BackgroundPsql work, or if it's a very bad explanation of > > >> ->set_query_timer_restart(). The timeout will use the timeout_default value > > >> and that cannot be overridden, it can only be reset per query. > > > > > > Thank you for confirming this. I see there is the same problem also in > > > interactive_psql(). So I've attached the 0001 patch to fix these > > > documentation issues. > > > > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, > > -which can be modified later. > > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. > > > > Since it cannot be modified, I think we should just say "A timeout of .." and > > call it a default timeout. This obviously only matters for the backpatch since > > the sentence is removed in 0002. > > Agreed. > > I've attached new version patches (0002 and 0003 are unchanged except > for the commit message). I'll push them, barring any objections. > Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Nov 30, 2023 at 4:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Nov 29, 2023 at 5:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > > > > > On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > >> > > > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > >> > > > >>> Is it true that we can modify the timeout after creating > > > >>> BackgroundPsql object? If so, it seems we don't need to introduce the > > > >>> new timeout argument. But how? > > > >> > > > >> I can't remember if that's leftovers that incorrectly remains from an earlier > > > >> version of the BackgroundPsql work, or if it's a very bad explanation of > > > >> ->set_query_timer_restart(). The timeout will use the timeout_default value > > > >> and that cannot be overridden, it can only be reset per query. > > > > > > > > Thank you for confirming this. I see there is the same problem also in > > > > interactive_psql(). So I've attached the 0001 patch to fix these > > > > documentation issues. > > > > > > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, > > > -which can be modified later. > > > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. > > > > > > Since it cannot be modified, I think we should just say "A timeout of .." and > > > call it a default timeout. This obviously only matters for the backpatch since > > > the sentence is removed in 0002. > > > > Agreed. > > > > I've attached new version patches (0002 and 0003 are unchanged except > > for the commit message). I'll push them, barring any objections. > > > > Pushed. FYI I've configured the buildfarm animal perentie to run regression tests including xid_wraparound: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=perentie&br=HEAD Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
The way src/test/modules/xid_wraparound/meson.build is written, it installs the xid_wraparound.so module into production installations. For test modules, a different installation code needs to be used. See neighboring test modules such as src/test/modules/test_rbtree/meson.build for examples.
On Thu, Feb 8, 2024 at 3:11 AM Peter Eisentraut <peter@eisentraut.org> wrote: > > The way src/test/modules/xid_wraparound/meson.build is written, it > installs the xid_wraparound.so module into production installations. > For test modules, a different installation code needs to be used. See > neighboring test modules such as > src/test/modules/test_rbtree/meson.build for examples. > Good catch, thanks. I've attached the patch to fix it. Does it make sense? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On 08.02.24 05:05, Masahiko Sawada wrote: > On Thu, Feb 8, 2024 at 3:11 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> >> The way src/test/modules/xid_wraparound/meson.build is written, it >> installs the xid_wraparound.so module into production installations. >> For test modules, a different installation code needs to be used. See >> neighboring test modules such as >> src/test/modules/test_rbtree/meson.build for examples. >> > > Good catch, thanks. > > I've attached the patch to fix it. Does it make sense? Yes, that looks correct to me and produces the expected behavior.
On Thu, Feb 8, 2024 at 4:06 PM Peter Eisentraut <peter@eisentraut.org> wrote: > > On 08.02.24 05:05, Masahiko Sawada wrote: > > On Thu, Feb 8, 2024 at 3:11 AM Peter Eisentraut <peter@eisentraut.org> wrote: > >> > >> The way src/test/modules/xid_wraparound/meson.build is written, it > >> installs the xid_wraparound.so module into production installations. > >> For test modules, a different installation code needs to be used. See > >> neighboring test modules such as > >> src/test/modules/test_rbtree/meson.build for examples. > >> > > > > Good catch, thanks. > > > > I've attached the patch to fix it. Does it make sense? > > Yes, that looks correct to me and produces the expected behavior. > Thank you for the check. Pushed at 1aa67a5ea687. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hello, 30.11.2023 10:35, Masahiko Sawada wrote: > >> I've attached new version patches (0002 and 0003 are unchanged except >> for the commit message). I'll push them, barring any objections. >> > Pushed. I've discovered that the test 001_emergency_vacuum.pl can fail due to a race condition. I can't see the server log at [1], but I reproduced the failure locally and with additional logging and log_min_messages = DEBUG3, the log shows: ... 2024-05-22 11:46:28.125 UTC [21256:2853] DEBUG: SlruScanDirectory invoking callback on pg_xact/0690 2024-05-22 11:46:28.125 UTC [21256:2854] DEBUG: transaction ID wrap limit is 2147484396, limited by database with OID 5 2024-05-22 11:46:28.126 UTC [21256:2855] LOG: !!!SendPostmasterSignal| PMSIGNAL_START_AUTOVAC_LAUNCHER 2024-05-22 11:46:28.135 UTC [14871:20077] DEBUG: postmaster received pmsignal signal 2024-05-22 11:46:28.137 UTC [21257:1] DEBUG: autovacuum launcher started 2024-05-22 11:46:28.137 UTC [21257:2] DEBUG: InitPostgres 2024-05-22 11:46:28.138 UTC [21257:3] LOG: !!!AutoVacLauncherMain| !AutoVacuumingActive() && !ShutdownRequestPending; before do_start_worker() 2024-05-22 11:46:28.138 UTC [21257:4] LOG: !!!do_start_worker| return quickly when there are no free workers 2024-05-22 11:46:28.138 UTC [21257:5] DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:6] DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:7] DEBUG: proc_exit(0): 1 callbacks to make 2024-05-22 11:46:28.138 UTC [21257:8] DEBUG: exit(0) 2024-05-22 11:46:28.138 UTC [21257:9] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:10] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:11] DEBUG: proc_exit(-1): 0 callbacks to make 2024-05-22 11:46:28.146 UTC [21256:2856] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 5 2024-05-22 11:46:28.146 UTC [21256:2857] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1 2024-05-22 11:46:28.146 UTC [21256:2858] DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2859] DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2860] DEBUG: proc_exit(0): 1 callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2861] DEBUG: exit(0) 2024-05-22 11:46:28.147 UTC [21256:2862] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2863] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2864] DEBUG: proc_exit(-1): 0 callbacks to make 2024-05-22 11:46:28.151 UTC [14871:20078] DEBUG: forked new backend, pid=21258 socket=8 2024-05-22 11:46:28.171 UTC [14871:20079] DEBUG: server process (PID 21256) exited with exit code 0 2024-05-22 11:46:28.152 UTC [21258:1] [unknown] LOG: connection received: host=[local] 2024-05-22 11:46:28.171 UTC [21258:2] [unknown] DEBUG: InitPostgres 2024-05-22 11:46:28.172 UTC [21258:3] [unknown] LOG: connection authenticated: user="vagrant" method=trust (/pgtest/postgresql.git/src/test/modules/xid_wraparound/tmp_check/t_001_emergency_vacuum_main_data/pgdata/pg_hba.conf:117) 2024-05-22 11:46:28.172 UTC [21258:4] [unknown] LOG: connection authorized: user=vagrant database=postgres application_name=001_emergency_vacuum.pl 2024-05-22 11:46:28.175 UTC [21258:5] 001_emergency_vacuum.pl LOG: statement: INSERT INTO small(data) SELECT 1 That is, autovacuum worker (21256) 
sent PMSIGNAL_START_AUTOVAC_LAUNCHER, postmaster started the autovacuum
launcher, which could not start a new autovacuum worker because process 21256
had not exited yet.

The failure can be reproduced easily with a sleep added inside
SetTransactionIdLimit():

	if (TransactionIdFollowsOrEquals(curXid, xidVacLimit) &&
		IsUnderPostmaster && !InRecovery)
		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+pg_usleep(10000L);

By the way, I also discovered that the rather resource-intensive
xid_wraparound tests are executed twice during the buildfarm "make" run (on
the dodo and perentie animals, see [2]): once at stage
module-xid_wraparound-check and then again at stage
testmodules-install-check-C.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-05-19%2006%3A33%3A34
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=perentie&dt=2024-05-22%2000%3A02%3A19

Best regards,
Alexander
Hi, On Tue, Oct 8, 2024 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > Hello Masahiko-san, > > 01.12.2023 05:14, Masahiko Sawada wrote: > > FYI I've configured the buildfarm animal perentie to run regression > > tests including xid_wraparound: > > > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=perentie&br=HEAD > > Please look at a failure produced by perentie recently: [1]. > > I've analyzed all the available detailed perentie logs (starting from > 2024-04-04) and got the following durations of the pg_ctl stop operation > at the end of the 001_emergency_vacuum.pl (from the > module-xid_wraparound-check stage): see perentie-timings.txt and > perentie-timings.png attached. So it looks like perentie needs larger > PGCTLTIMEOUT for the test (maybe 180 seconds would work?). > > Though it's not clear to me, why this test takes so long on that animal, > even when it succeeds. For example, [2] shows: > [09:28:23] t/001_emergency_vacuum.pl .. ok 225894 ms ( 0.00 usr 0.00 sys + 0.31 cusr 0.43 csys = 0.74 CPU) > [09:30:53] t/002_limits.pl ............ ok 150291 ms ( 0.00 usr 0.00 sys + 1.85 cusr 1.50 csys = 3.35 CPU) > [09:49:33] t/003_wraparounds.pl ....... ok 1119766 ms ( 0.00 usr 0.00 sys + 1.68 cusr 2.39 csys = 4.07 CPU) > > While what I'm seeing locally on my Fedora 40 VM is: > PG_TEST_EXTRA="xid_wraparound" make -s check -C src/test/modules/xid_wraparound/ PROVE_FLAGS="--timer" > # +++ tap check in src/test/modules/xid_wraparound +++ > [04:41:56] t/001_emergency_vacuum.pl .. ok 18852 ms ( 0.01 usr 0.00 sys + 0.14 cusr 0.28 csys = 0.43 CPU) > [04:42:15] t/002_limits.pl ............ ok 18539 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.88 csys = 1.60 CPU) > [04:42:34] t/003_wraparounds.pl ....... ok 74368 ms ( 0.00 usr 0.00 sys + 0.82 cusr 2.57 csys = 3.39 CPU) Thank you for reporting it. I've investigated the logs you shared and figured out that some other (personal) tests were overlapped at that time (perentie is my machine). So the failure should be ignored, sorry for making noise. I'll make sure other tests won't be overlapped again when perentie is executing the regression tests. > Also, maybe it would make sense to run this test for REL_17_STABLE too, as > dodo is not with us since 2024-09-04, and I don't know if there are other > animals running these tests (having xid_wraparound in PG_TEST_EXTRA). Good idea. I've configured perentie to do that. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com