Thread: Testing autovacuum wraparound (including failsafe)
Hi,

I started to write a test for $Subject, which I think we sorely need.

Currently my approach is to:
- start a cluster, create a few tables with test data
- acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
  autovacuum from doing anything
- cause dead tuples to exist
- restart
- run pg_resetwal -x 2000027648
- do things like acquiring pins on pages that block vacuum from progressing
- commit prepared transaction
- wait for template0, template1 datfrozenxid to increase
- wait for relfrozenxid for most relations in postgres to increase
- release buffer pin
- wait for postgres datfrozenxid to increase

So far so good. But I've encountered a few things that stand in the way of enabling such a test by default:

1) During startup StartupSUBTRANS() zeroes out all pages between oldestActiveXID and nextXid. That takes 8s on my workstation, but only because I have plenty of memory - pg_subtrans ends up 14GB as I currently do the test. Clearly not something we could do on the BF.

2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the failsafe mode; we can't really create 4GB relations on the BF. While writing the tests I've lowered this to 4MB...

3) pg_resetwal -x requires carefully choosing an xid: it needs to be the first xid on a clog page. It's not hard to determine which xids those are, but it depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a value appropriate for 8KB, but ...

I have 2 1/2 ideas about addressing 1):

- We could expose functionality to advance nextXid to a future value at runtime, without filling in clog/subtrans pages. It would probably have to live in varsup.c and be exposed via regress.so or such?

- The only reason StartupSUBTRANS() does that work is because of the prepared transaction holding back oldestActiveXID. That transaction in turn exists to prevent autovacuum from doing anything before we do the test setup steps. Perhaps it'd be sufficient to set autovacuum_naptime really high initially, perform the test setup, set naptime to something lower, and reload the config. But I'm worried that might not be reliable: if something ends up allocating an xid we'd potentially reach the path in GetNewTransactionId() that wakes up the launcher? But probably there wouldn't be anything doing so? Another aspect that might not make this a good choice is that it actually seems relevant to be able to test cases where there are very old, still-running transactions...

- As a variant of the previous idea: if that turns out to be unreliable, we could instead set nextXid, start in single-user mode, create a blocking 2PC transaction, and then start normally. Because there's no old active xid we'd not run into the StartupSUBTRANS problem.

For 2), I don't really have a better idea than making that configurable somehow?

3) is probably tolerable for now; we could skip the test if BLCKSZ isn't 8KB, or we could hardcode the calculation for different block sizes.
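For reference, a minimal sketch (not from the patch) of how a test script could derive the clog-page-aligned XID used for pg_resetwal -x in item 3, instead of hardcoding it. It assumes the current layout constants in clog.c (two status bits per transaction, so BLCKSZ * 4 xids per clog page):

    use strict;
    use warnings;

    # Assumed from clog.c: 2 status bits per xact => 4 xacts per byte,
    # so one clog page covers BLCKSZ * 4 transaction IDs.
    my $blcksz              = 8192;        # adjust if built with a different BLCKSZ
    my $clog_xacts_per_page = $blcksz * 4; # 32768 for 8KB blocks

    # Round the target XID up to the first XID of the next clog page.
    my $target = 2_000_000_000;
    my $xid = (int($target / $clog_xacts_per_page) + 1) * $clog_xacts_per_page;

    print "pg_resetwal -x $xid\n";         # prints 2000027648 for 8KB blocks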
I noticed one minor bug that's likely new:

2021-04-23 13:32:30.899 PDT [2027738] LOG:  automatic aggressive vacuum to prevent wraparound of table "postgres.public.small_trunc": index scans: 1
        pages: 400 removed, 28 remain, 0 skipped due to pins, 0 skipped frozen
        tuples: 14000 removed, 1000 remain, 0 are dead but not yet removable, oldest xmin: 2000027651
        buffer usage: 735 hits, 1262 misses, 874 dirtied
        index scan needed: 401 pages from table (1432.14% of total) had 14000 dead item identifiers removed
        index "small_trunc_pkey": pages: 43 in total, 37 newly deleted, 37 currently deleted, 0 reusable
        avg read rate: 559.048 MB/s, avg write rate: 387.170 MB/s
        system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
        WAL usage: 1809 records, 474 full page images, 3977538 bytes

'1432.14% of total' - looks like removed pages need to be added before the percentage calculation?

Greetings,

Andres Freund
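A quick sanity check with the figures from the log above supports that reading: dividing by only the pages that remain after truncation reproduces the reported value, while adding back the removed pages gives a sensible percentage.

    # Figures from the log line above.
    my ($removed, $remain, $lpdead_pages) = (400, 28, 401);

    printf "reported: %.2f%%\n", 100 * $lpdead_pages / $remain;              # 1432.14%
    printf "expected: %.2f%%\n", 100 * $lpdead_pages / ($remain + $removed); # 93.69%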
On Fri, Apr 23, 2021 at 01:43:06PM -0700, Andres Freund wrote:
> 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
>    failsafe mode, we can't really create 4GB relations on the BF. While
>    writing the tests I've lowered this to 4MB...

> For 2), I don't really have a better idea than making that configurable
>    somehow?

Does it work to shut down the cluster and create the .0,.1,.2,.3 segments of a new, empty relation with zero blocks using something like truncate -s 1G ?

--
Justin
On Fri, Apr 23, 2021 at 1:43 PM Andres Freund <andres@anarazel.de> wrote:
> I started to write a test for $Subject, which I think we sorely need.

+1

> Currently my approach is to:
> - start a cluster, create a few tables with test data
> - acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
>   autovacuum from doing anything
> - cause dead tuples to exist
> - restart
> - run pg_resetwal -x 2000027648
> - do things like acquiring pins on pages that block vacuum from progressing
> - commit prepared transaction
> - wait for template0, template1 datfrozenxid to increase
> - wait for relfrozenxid for most relations in postgres to increase
> - release buffer pin
> - wait for postgres datfrozenxid to increase

Just having a standard-ish way to do stress testing like this would add something.

> 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
>    failsafe mode, we can't really create 4GB relations on the BF. While
>    writing the tests I've lowered this to 4MB...

The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits how often we'll consider the failsafe in the single-pass/no-indexes case. I see no reason why it cannot be changed now.

VACUUM_FSM_EVERY_PAGES also frustrates FSM testing in the single-pass case in about the same way, so maybe that should be considered as well? Note that the FSM handling for the single pass case is actually a bit different to the two pass/has-indexes case, since the single pass case calls lazy_vacuum_heap_page() directly in its first and only pass over the heap (that's the whole point of having it of course).

> 3) pg_resetwal -x requires carefully choosing an xid: it needs to be the
>    first xid on a clog page. It's not hard to determine which xids those are, but it
>    depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a
>    value appropriate for 8KB, but ...

Ugh.

> For 2), I don't really have a better idea than making that configurable
>    somehow?

That could make sense as a developer/testing option, I suppose. I just doubt that it makes sense as anything else.

> 2021-04-23 13:32:30.899 PDT [2027738] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.public.small_trunc": index scans: 1
>         pages: 400 removed, 28 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 14000 removed, 1000 remain, 0 are dead but not yet removable, oldest xmin: 2000027651
>         buffer usage: 735 hits, 1262 misses, 874 dirtied
>         index scan needed: 401 pages from table (1432.14% of total) had 14000 dead item identifiers removed
>         index "small_trunc_pkey": pages: 43 in total, 37 newly deleted, 37 currently deleted, 0 reusable
>         avg read rate: 559.048 MB/s, avg write rate: 387.170 MB/s
>         system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
>         WAL usage: 1809 records, 474 full page images, 3977538 bytes
>
> '1432.14% of total' - looks like removed pages need to be added before the
> percentage calculation?

Clearly this needs to account for removed heap pages in order to consistently express the percentage of pages with LP_DEAD items in terms of a percentage of the original table size. I can fix this shortly.

--
Peter Geoghegan
Hi,

On 2021-04-23 18:08:12 -0500, Justin Pryzby wrote:
> On Fri, Apr 23, 2021 at 01:43:06PM -0700, Andres Freund wrote:
> > 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
> >    failsafe mode, we can't really create 4GB relations on the BF. While
> >    writing the tests I've lowered this to 4MB...
>
> > For 2), I don't really have a better idea than making that configurable
> >    somehow?
>
> Does it work to shut down the cluster and create the .0,.1,.2,.3 segments of a
> new, empty relation with zero blocks using something like truncate -s 1G ?

I'd like this to be portable to at least Windows - I don't know how well that deals with sparse files. But the bigger issue is that, IIRC, that will trigger vacuum to try to initialize all those pages, which will then force all that space to be allocated anyway...

Greetings,

Andres Freund
Hi,

On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
> related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
> how often we'll consider the failsafe in the single-pass/no-indexes
> case.

I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?

> I see no reason why it cannot be changed now. VACUUM_FSM_EVERY_PAGES
> also frustrates FSM testing in the single-pass case in about the same
> way, so maybe that should be considered as well? Note that the FSM
> handling for the single pass case is actually a bit different to the
> two pass/has-indexes case, since the single pass case calls
> lazy_vacuum_heap_page() directly in its first and only pass over the
> heap (that's the whole point of having it of course).

I'm not opposed to lowering VACUUM_FSM_EVERY_PAGES (the costs don't seem all that high compared to vacuuming?), but I don't think there's as clear a need for testing around that as there is around wraparound.

The failsafe mode affects the table scan itself by disabling cost limiting. As far as I can see, the ways it triggers for the table scan (vs truncation or index processing) are:

1) Before vacuuming starts, for heap phases and indexes, if already necessary at that point
2) For a table with indexes, before/after each index vacuum, if now necessary
3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary

Why would we want to trigger the failsafe mode during a scan of a table with dead tuples and no indexes, but not on a table without dead tuples or with indexes but fewer than m_w_m dead tuples? That makes little sense to me.

It seems that for the no-index case the warning message is quite off?

        ereport(WARNING,
                (errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a failsafe after %d index scans",

It doesn't exactly make one understand that vacuum cost limiting is now disabled? And it's confusing because there would never be index vacuuming. And even in the cases where indexes exist, it's odd to talk about abandoning index vacuuming that hasn't even started yet.

> > For 2), I don't really have a better idea than making that configurable
> > somehow?
>
> That could make sense as a developer/testing option, I suppose. I just
> doubt that it makes sense as anything else.

Yea, I only was thinking of making it configurable to be able to test it. If we change the limit to something considerably lower I wouldn't see a need for that anymore.

Greetings,

Andres Freund
On Fri, Apr 23, 2021 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> > The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
> > related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
> > how often we'll consider the failsafe in the single-pass/no-indexes
> > case.
>
> I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES
> and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?

VACUUM_FSM_EVERY_PAGES controls how often VACUUM does work that usually takes place right after the two pass case finishes a round of index and heap vacuuming. This is work that we certainly don't want to do every time we process a single heap page in the one-pass/no-indexes case. Initially this just meant FSM vacuuming, but it now includes a failsafe check.

Of course all of the precise details here are fairly arbitrary (including VACUUM_FSM_EVERY_PAGES, which has been around for a couple of releases now). The overall goal that I had in mind was to make the one-pass case's use of the failsafe have analogous behavior to the two-pass/has-indexes case -- a goal which was itself somewhat arbitrary.

> The failsafe mode affects the table scan itself by disabling cost
> limiting. As far as I can see the ways it triggers for the table scan (vs
> truncation or index processing) are:
>
> 1) Before vacuuming starts, for heap phases and indexes, if already
>    necessary at that point
> 2) For a table with indexes, before/after each index vacuum, if now
>    necessary
> 3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary
>
> Why would we want to trigger the failsafe mode during a scan of a table
> with dead tuples and no indexes, but not on a table without dead tuples
> or with indexes but fewer than m_w_m dead tuples? That makes little
> sense to me.

What alternative does make sense to you?

It seemed important to put the failsafe check at points where we do other analogous work in all cases. We made a pragmatic trade-off. In theory almost any scheme might not check often enough, and/or might check too frequently.

> It seems that for the no-index case the warning message is quite off?

I'll fix that up some point soon. FWIW this happened because the support for one-pass VACUUM was added quite late, at Robert's request.

Another issue with the failsafe commit is that we haven't considered the autovacuum_multixact_freeze_max_age table reloption -- we only check the GUC. That might have accidentally been the right thing to do, though, since the reloption is interpreted as lower than the GUC in all cases anyway -- arguably the autovacuum_multixact_freeze_max_age GUC should be all we care about anyway. I will need to think about this question some more, though.

> > > For 2), I don't really have a better idea than making that configurable
> > > somehow?
> >
> > That could make sense as a developer/testing option, I suppose. I just
> > doubt that it makes sense as anything else.
>
> Yea, I only was thinking of making it configurable to be able to test
> it. If we change the limit to something considerably lower I wouldn't
> see a need for that anymore.

It would probably be okay to just lower it significantly. Not sure if that's the best approach, though. Will pick it up next week.

--
Peter Geoghegan
Hi,

On 2021-04-23 19:15:43 -0700, Peter Geoghegan wrote:
> > The failsafe mode affects the table scan itself by disabling cost
> > limiting. As far as I can see the ways it triggers for the table scan (vs
> > truncation or index processing) are:
> >
> > 1) Before vacuuming starts, for heap phases and indexes, if already
> >    necessary at that point
> > 2) For a table with indexes, before/after each index vacuum, if now
> >    necessary
> > 3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary
> >
> > Why would we want to trigger the failsafe mode during a scan of a table
> > with dead tuples and no indexes, but not on a table without dead tuples
> > or with indexes but fewer than m_w_m dead tuples? That makes little
> > sense to me.
>
> What alternative does make sense to you?

Check it every so often, independent of whether there are indexes or dead tuples? Or just check it at the boundaries.

I'd make it dependent on the number of pages scanned, rather than the block distance to the last check - otherwise we might end up doing it way too often when there's only a few individual pages not in the freeze map.

Greetings,

Andres Freund
On Fri, Apr 23, 2021 at 7:33 PM Andres Freund <andres@anarazel.de> wrote:
> Check it every so often, independent of whether there are indexes or
> dead tuples? Or just check it at the boundaries.

I think that the former suggestion might be better -- I actually thought about doing it that way myself.

The latter suggestion sounds like you're suggesting that we just check it at the beginning and the end in all cases (we do the beginning in all cases already, but now we'd also do the end outside of the loop in all cases). Is that right?

If that is what you meant, then you should note that there'd hardly be any check in the one-pass case with that scheme (apart from the initial check that we do already). The only work we'd be skipping at the end (in the event of that check triggering the failsafe) would be heap truncation, which (as you've pointed out yourself) doesn't seem particularly likely to matter.

--
Peter Geoghegan
Hi,

On 2021-04-23 19:42:30 -0700, Peter Geoghegan wrote:
> On Fri, Apr 23, 2021 at 7:33 PM Andres Freund <andres@anarazel.de> wrote:
> > Check it every so often, independent of whether there are indexes or
> > dead tuples? Or just check it at the boundaries.
>
> I think that the former suggestion might be better -- I actually
> thought about doing it that way myself.

Cool.

> The latter suggestion sounds like you're suggesting that we just check
> it at the beginning and the end in all cases (we do the beginning in
> all cases already, but now we'd also do the end outside of the loop in
> all cases). Is that right?

Yes.

> If that is what you meant, then you should note that there'd hardly be
> any check in the one-pass case with that scheme (apart from the
> initial check that we do already). The only work we'd be skipping at
> the end (in the event of that check triggering the failsafe) would be
> heap truncation, which (as you've pointed out yourself) doesn't seem
> particularly likely to matter.

I mainly suggested it because to me the current seems hard to understand. I do think it'd be better to check more often. But checking depending on the amount of dead tuples at the right time doesn't strike me as a good idea - a lot of anti-wraparound vacuums will mainly be freezing tuples, rather than removing a lot of dead rows. Which makes it hard to understand when the failsafe kicks in.

Greetings,

Andres Freund
On Fri, Apr 23, 2021 at 7:53 PM Andres Freund <andres@anarazel.de> wrote:
> I mainly suggested it because to me the current seems hard to
> understand. I do think it'd be better to check more often. But checking
> depending on the amount of dead tuples at the right time doesn't strike
> me as a good idea - a lot of anti-wraparound vacuums will mainly be
> freezing tuples, rather than removing a lot of dead rows. Which makes it
> hard to understand when the failsafe kicks in.

I'm convinced -- decoupling the logic from the one-pass-not-two pass case seems likely to be simpler and more useful. For both the one pass and two pass/has indexes case.

--
Peter Geoghegan
On Fri, Apr 23, 2021 at 7:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm convinced -- decoupling the logic from the one-pass-not-two pass
> case seems likely to be simpler and more useful. For both the one pass
> and two pass/has indexes case.

Attached draft patch does it that way.

--
Peter Geoghegan
Attachment
On Sat, Apr 24, 2021 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Apr 23, 2021 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> > > The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits how often we'll consider the failsafe in the single-pass/no-indexes case.
> >
> > I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?
>
> VACUUM_FSM_EVERY_PAGES controls how often VACUUM does work that usually takes place right after the two pass case finishes a round of index and heap vacuuming. This is work that we certainly don't want to do every time we process a single heap page in the one-pass/no-indexes case. Initially this just meant FSM vacuuming, but it now includes a failsafe check.
>
> Of course all of the precise details here are fairly arbitrary (including VACUUM_FSM_EVERY_PAGES, which has been around for a couple of releases now). The overall goal that I had in mind was to make the one-pass case's use of the failsafe have analogous behavior to the two-pass/has-indexes case -- a goal which was itself somewhat arbitrary.
>
> > The failsafe mode affects the table scan itself by disabling cost limiting. As far as I can see the ways it triggers for the table scan (vs truncation or index processing) are:
> >
> > 1) Before vacuuming starts, for heap phases and indexes, if already necessary at that point
> > 2) For a table with indexes, before/after each index vacuum, if now necessary
> > 3) On a table without indexes, every 8GB, iff there are dead tuples, if now necessary
> >
> > Why would we want to trigger the failsafe mode during a scan of a table with dead tuples and no indexes, but not on a table without dead tuples or with indexes but fewer than m_w_m dead tuples? That makes little sense to me.
>
> What alternative does make sense to you?
>
> It seemed important to put the failsafe check at points where we do other analogous work in all cases. We made a pragmatic trade-off. In theory almost any scheme might not check often enough, and/or might check too frequently.
>
> > It seems that for the no-index case the warning message is quite off?
>
> I'll fix that up some point soon. FWIW this happened because the support for one-pass VACUUM was added quite late, at Robert's request.

+1 to fix this. Are you already working on fixing this? If not, I'll post a patch.

> Another issue with the failsafe commit is that we haven't considered the autovacuum_multixact_freeze_max_age table reloption -- we only check the GUC. That might have accidentally been the right thing to do, though, since the reloption is interpreted as lower than the GUC in all cases anyway -- arguably the autovacuum_multixact_freeze_max_age GUC should be all we care about anyway. I will need to think about this question some more, though.

FWIW, I intentionally ignored the reloption there since they're interpreted as lower than the GUC, as you mentioned, and the situation where we need to enter the failsafe mode is not a table-specific problem but a system-wide problem.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> +1 to fix this. Are you already working on fixing this? If not, I'll post a patch.

I posted a patch recently (last Thursday my time). Perhaps you can review it?

--
Peter Geoghegan
On Tue, May 18, 2021 at 2:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > +1 to fix this. Are you already working on fixing this? If not, I'll post a patch.
>
> I posted a patch recently (last Thursday my time). Perhaps you can review it?

Oh, I missed that the patch includes that fix. I'll review the patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, May 18, 2021 at 2:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 2:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > +1 to fix this. Are you already working on fixing this? If not, I'll post a patch.
> >
> > I posted a patch recently (last Thursday my time). Perhaps you can review it?
>
> Oh, I missed that the patch includes that fix. I'll review the patch.

I've reviewed the patch. Here is one comment:

         if (vacrel->num_index_scans == 0 &&
-            vacrel->rel_pages <= FAILSAFE_MIN_PAGES)
+            vacrel->rel_pages <= FAILSAFE_EVERY_PAGES)
                 return false;

Since there is the condition "vacrel->num_index_scans == 0", we could enter the failsafe mode even if the table is less than 4GB, if we enter lazy_check_wraparound_failsafe() after executing more than one index scan. Whereas a vacuum on a table that is less than 4GB and has no index never enters the failsafe mode. I think we can remove this condition, since I don't see a reason why such a small table should be allowed to enter the failsafe mode only after its first index scan. What do you think?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, May 18, 2021 at 12:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Since there is the condition "vacrel->num_index_scans == 0", we could enter the failsafe mode even if the table is less than 4GB, if we enter lazy_check_wraparound_failsafe() after executing more than one index scan. Whereas a vacuum on a table that is less than 4GB and has no index never enters the failsafe mode. I think we can remove this condition, since I don't see a reason why such a small table should be allowed to enter the failsafe mode only after its first index scan. What do you think?

I'm convinced -- this does seem like premature optimization now. I pushed a version of the patch that removes that code just now. Thanks

--
Peter Geoghegan
On Thu, Jun 10, 2021 at 10:52 AM Andres Freund <andres@anarazel.de> wrote:
> I started to write a test for $Subject, which I think we sorely need.
> Currently my approach is to:
> - start a cluster, create a few tables with test data
> - acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
>   autovacuum from doing anything
> - cause dead tuples to exist
> - restart
> - run pg_resetwal -x 2000027648
> - do things like acquiring pins on pages that block vacuum from progressing
> - commit prepared transaction
> - wait for template0, template1 datfrozenxid to increase
> - wait for relfrozenxid for most relations in postgres to increase
> - release buffer pin
> - wait for postgres datfrozenxid to increase
Cool. Thank you for working on that!
Could you please share a WIP patch for the $subj? I'd be happy to help with it.
> So far so good. But I've encountered a few things that stand in the way of
> enabling such a test by default:
>
> 1) During startup StartupSUBTRANS() zeroes out all pages between
>    oldestActiveXID and nextXid. That takes 8s on my workstation, but only
>    because I have plenty of memory - pg_subtrans ends up 14GB as I currently do
>    the test. Clearly not something we could do on the BF.
>
> ....
>
> 3) pg_resetwal -x requires carefully choosing an xid: it needs to be the
>    first xid on a clog page. It's not hard to determine which xids those are, but it
>    depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a
>    value appropriate for 8KB, but ...
Maybe we can add a new pg_resetwal option? Something like pg_resetwal --xid-near-wraparound, which would ask pg_resetwal to calculate the exact xid value using the values from pg_control and the clog macros?
I think it might come in handy for manual testing too.
> I have 2 1/2 ideas about addressing 1):
>
> - We could expose functionality to advance nextXid to a future value at
>   runtime, without filling in clog/subtrans pages. Would probably have to live
>   in varsup.c and be exposed via regress.so or such?
This option looks scary to me. Several functions rely on the fact that StartupSUBTRANS() has zeroed pages.
And if we make it conditional just for tests, it means that we won't test the real code path.
> - The only reason StartupSUBTRANS() does that work is because of the prepared
>   transaction holding back oldestActiveXID. That transaction in turn exists to
>   prevent autovacuum from doing anything before we do the test setup steps.
>   Perhaps it'd be sufficient to set autovacuum_naptime really high initially,
>   perform the test setup, set naptime to something lower, and reload the config. But
>   I'm worried that might not be reliable: if something ends up allocating an
>   xid we'd potentially reach the path in GetNewTransactionId() that wakes up the
>   launcher? But probably there wouldn't be anything doing so?
>   Another aspect that might not make this a good choice is that it actually
>   seems relevant to be able to test cases where there are very old, still-running
>   transactions...
Maybe this exact scenario can be covered with a separate long-running test, not included in the buildfarm test suite?
Best regards,
Lubennikova Anastasia
Hi,

On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> Cool. Thank you for working on that!
> Could you please share a WIP patch for the $subj? I'd be happy to help with it.

I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl, but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.

Regards,

Andres
Attachment
On Fri, Jun 11, 2021 at 10:19 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > Cool. Thank you for working on that!
> > Could you please share a WIP patch for the $subj? I'd be happy to help with it.
>
> I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.

Thank you for sharing the WIP patch.

Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time for zeroing out all pages), how about using single-user mode instead of preparing the transaction? That is, after pg_resetwal we check the ages of datfrozenxid by executing a query in single-user mode. That way, we don't need to worry about autovacuum concurrently running while checking the ages of frozenxids. I've attached a PoC patch that does the scenario like:

1. start cluster with autovacuum=off and create tables with a few data and make garbage on them
2. stop cluster and do pg_resetwal
3. start cluster in single-user mode
4. check age(datfrozenxid)
5. stop cluster
6. start cluster and wait for autovacuums to increase template0, template1, and postgres datfrozenxids

I put the new tests in src/test/module/heap since we already have tests for brin in src/test/module/brin. I think that a tap test facility to run queries in single-user mode will also be helpful for testing a new vacuum option/command that is intended to be used in emergency cases and proposed here[1].

Regards,

[1] https://www.postgresql.org/message-id/flat/20220128012842.GZ23027%40telsasoft.com#b76c13554f90d1c8bb5532d6f3e5cbf8

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
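To make the single-user-mode step above concrete, a TAP test could drive postgres --single with IPC::Run along these lines. single_user_query is a hypothetical helper written purely for illustration, not something provided by the test framework, and assumes the cluster has already been stopped:

    use IPC::Run ();

    # Hypothetical helper (illustration only): run one query against a stopped
    # cluster in single-user mode and return the raw backend output.
    # Single-user mode reads one command per input line, so the query must
    # not contain newlines.
    sub single_user_query
    {
        my ($node, $query) = @_;
        my ($stdout, $stderr) = ('', '');
        IPC::Run::run(
            [ 'postgres', '--single', '-D', $node->data_dir, 'postgres' ],
            '<',  \"$query\n",
            '>',  \$stdout,
            '2>', \$stderr) or die "single-user postgres failed: $stderr";
        return $stdout;
    }

    # e.g. step 4 above: check datfrozenxid ages while autovacuum cannot run
    my $ages = single_user_query($node,
        "SELECT datname, age(datfrozenxid) FROM pg_database");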
Hi,

On Tue, Feb 1, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jun 11, 2021 at 10:19 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > > Cool. Thank you for working on that!
> > > Could you please share a WIP patch for the $subj? I'd be happy to help with it.
> >
> > I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.
>
> Thank you for sharing the WIP patch.
>
> Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time for zeroing out all pages), how about using single-user mode instead of preparing the transaction? That is, after pg_resetwal we check the ages of datfrozenxid by executing a query in single-user mode. That way, we don't need to worry about autovacuum concurrently running while checking the ages of frozenxids. I've attached a PoC patch that does the scenario like:
>
> 1. start cluster with autovacuum=off and create tables with a few data and make garbage on them
> 2. stop cluster and do pg_resetwal
> 3. start cluster in single-user mode
> 4. check age(datfrozenxid)
> 5. stop cluster
> 6. start cluster and wait for autovacuums to increase template0, template1, and postgres datfrozenxids

The above steps are wrong.

I think we can expose a function in an extension used only by this test in order to set nextXid to a future value with zeroing out clog/subtrans pages. We don't need to fill all clog/subtrans pages between oldestActiveXID and nextXid. I've attached a PoC patch for adding this regression test and am going to register it to the next CF.

BTW, while testing the emergency situation, I found there is a race condition where anti-wraparound vacuum isn't invoked with the settings autovacuum = off, autovacuum_max_workers = 1. An autovacuum worker sends a signal to the postmaster after advancing datfrozenxid in SetTransactionIdLimit(). But with those settings, if the autovacuum launcher attempts to launch a worker before the autovacuum worker who has signaled to the postmaster finishes, the launcher exits without launching a worker due to no free workers. The new launcher won't be launched until a new XID is generated (and only when new XID % 65536 == 0). Although autovacuum_max_workers = 1 is not mandatory for this test, it makes it easier to verify the order of operations.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
On Thu, Jun 30, 2022 at 10:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Tue, Feb 1, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jun 11, 2021 at 10:19 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > > > Cool. Thank you for working on that!
> > > > Could you please share a WIP patch for the $subj? I'd be happy to help with it.
> > >
> > > I've attached the current WIP state, which hasn't evolved much since this message... I put the test in src/backend/access/heap/t/001_emergency_vacuum.pl but I'm not sure that's the best place. But I didn't think src/test/recovery is great either.
> >
> > Thank you for sharing the WIP patch.
> >
> > Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time for zeroing out all pages), how about using single-user mode instead of preparing the transaction? That is, after pg_resetwal we check the ages of datfrozenxid by executing a query in single-user mode. That way, we don't need to worry about autovacuum concurrently running while checking the ages of frozenxids. I've attached a PoC patch that does the scenario like:
> >
> > 1. start cluster with autovacuum=off and create tables with a few data and make garbage on them
> > 2. stop cluster and do pg_resetwal
> > 3. start cluster in single-user mode
> > 4. check age(datfrozenxid)
> > 5. stop cluster
> > 6. start cluster and wait for autovacuums to increase template0, template1, and postgres datfrozenxids
>
> The above steps are wrong.
>
> I think we can expose a function in an extension used only by this test in order to set nextXid to a future value with zeroing out clog/subtrans pages. We don't need to fill all clog/subtrans pages between oldestActiveXID and nextXid. I've attached a PoC patch for adding this regression test and am going to register it to the next CF.
>
> BTW, while testing the emergency situation, I found there is a race condition where anti-wraparound vacuum isn't invoked with the settings autovacuum = off, autovacuum_max_workers = 1. An autovacuum worker sends a signal to the postmaster after advancing datfrozenxid in SetTransactionIdLimit(). But with those settings, if the autovacuum launcher attempts to launch a worker before the autovacuum worker who has signaled to the postmaster finishes, the launcher exits without launching a worker due to no free workers. The new launcher won't be launched until a new XID is generated (and only when new XID % 65536 == 0). Although autovacuum_max_workers = 1 is not mandatory for this test, it makes it easier to verify the order of operations.

Hi

Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.

Regards

Ian Barwick
On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
> Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.

I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.

I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.

I've written some variant of this helper function many times over the years, for ad hoc testing. I'd love to have it permanently in the git tree.

In addition to Masahiko's test for emergency vacuum, this includes two other tests. 002_limits.pl tests the "warn limit" and "stop limit" in GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion transactions in total, exercising XID wraparound in general. Unfortunately these tests are pretty slow; the tests run for about 4 minutes on my laptop in total, and use about 20 GB of disk space. So perhaps these need to be put in a special test suite that's not run as part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, that's the slowest of the tests. But I'd love to have these in the git tree in some form.

[1] https://www.postgresql.org/message-id/CAJ7c6TPKf0W3MfpP2vr=kq7-NM5G12vTBhi7miu_5m8AG3Cw-w@mail.gmail.com)

- Heikki
On 03/03/2023 13:34, Heikki Linnakangas wrote:
> On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
>> Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.
>
> I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.

Forgot attachment.

- Heikki
Attachment
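As a rough usage sketch of the XID-consuming helper Heikki describes above: the extension and function names here are illustrative placeholders and may not match what the attached patch actually defines.

    # Illustrative names only; the attached patch may spell these differently.
    $node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');

    # Consume ~100 million XIDs.  Per the description above, the helper calls
    # GetNewTransactionId() for real near SLRU page boundaries and wraparound,
    # and skips cheaply over the "uninteresting" stretches in between.
    $node->safe_psql('postgres', 'SELECT consume_xids(100000000)');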
On Fri, Mar 3, 2023 at 8:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
> > Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.
>
> I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.

Thank you for reviving this thread!

> I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.
>
> I've written some variant of this helper function many times over the years, for ad hoc testing. I'd love to have it permanently in the git tree.

These functions seem to be better than mine.

> In addition to Masahiko's test for emergency vacuum, this includes two other tests. 002_limits.pl tests the "warn limit" and "stop limit" in GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion transactions in total, exercising XID wraparound in general. Unfortunately these tests are pretty slow; the tests run for about 4 minutes on my laptop in total, and use about 20 GB of disk space. So perhaps these need to be put in a special test suite that's not run as part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, that's the slowest of the tests. But I'd love to have these in the git tree in some form.

cfbot reports some failures. The main reason seems to be that meson.build in the xid_wraparound directory adds the regression tests, but the .sql and .out files are missing in the patch. Perhaps the patch wants to add only tap tests, as the Makefile doesn't define REGRESS?

Even after fixing this issue, CI tests (Cirrus CI) are not happy and report failures due to a disk full. The size of the xid_wraparound test directory is 105MB out of 262MB:

% du -sh testrun
262M    testrun
% du -sh testrun/xid_wraparound/
105M    testrun/xid_wraparound/
% du -sh testrun/xid_wraparound/*
460K    testrun/xid_wraparound/001_emergency_vacuum
93M     testrun/xid_wraparound/002_limits
12M     testrun/xid_wraparound/003_wraparounds
% ls -lh testrun/xid_wraparound/002_limits/log*
total 93M
-rw-------. 1 masahiko masahiko 93M Mar  7 17:34 002_limits_wraparound.log
-rw-rw-r--. 1 masahiko masahiko 20K Mar  7 17:34 regress_log_002_limits

The biggest file is the server log, since an autovacuum worker writes autovacuum logs for every table every second (autovacuum_naptime is 1s). Maybe we can set the log_autovacuum_min_duration reloption for the test tables instead of globally enabling it.

The 001 test uses a 2PC transaction that holds locks on tables, but since we can consume xids while the server is running, we don't need that. Instead I think we can keep a transaction open in the background like the 002 test does.

I'll try these ideas.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
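The per-table alternative mentioned above could be wired into the test setup along these lines (the table name is just an example), so that only the tables of interest produce autovacuum log output rather than every table on every naptime cycle:

    # Log autovacuum activity only for the table under test, instead of
    # enabling log_autovacuum_min_duration = 0 cluster-wide.
    $node->safe_psql('postgres',
        'ALTER TABLE large_tbl SET (log_autovacuum_min_duration = 0)');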
On Fri, Mar 3, 2023 at 3:34 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.

Surely your tap test should be using single user mode? Perhaps you missed the obnoxious HINT, that's part of the WARNING that the test parses? ;-)

This is a very useful patch. I certainly don't want to make life harder by (say) connecting it to the single user mode problem. But...the single user mode thing really needs to go away. It's just terrible advice, and actively harms users.

--
Peter Geoghegan
On Tue, Mar 07, 2023 at 09:21:00PM -0800, Peter Geoghegan wrote:
> Surely your tap test should be using single user mode? Perhaps you missed the obnoxious HINT, that's part of the WARNING that the test parses? ;-)

I may be missing something, but you cannot use directly a "postgres" command in a TAP test, can you? See 1a9d802, that has fixed a problem when TAP tests run with a privileged account on Windows.

--
Michael
Attachment
On Wed, Mar 8, 2023 at 10:47 PM Michael Paquier <michael@paquier.xyz> wrote:
> I may be missing something, but you cannot use directly a "postgres" command in a TAP test, can you? See 1a9d802, that has fixed a problem when TAP tests run with a privileged account on Windows.

I was joking. But I did have a real point: once we have tests for the xidStopLimit mechanism, why not take the opportunity to correct the long standing issue with the documentation advising the use of single user mode?

--
Peter Geoghegan
On Sat, Mar 11, 2023 at 8:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I was joking. But I did have a real point: once we have tests for the xidStopLimit mechanism, why not take the opportunity to correct the long standing issue with the documentation advising the use of single user mode?

Does https://commitfest.postgresql.org/42/4128/ address that independently enough?

--Jacob
On Mon, Mar 13, 2023 at 3:25 PM Jacob Champion <jchampion@timescale.com> wrote:
> Does https://commitfest.postgresql.org/42/4128/ address that independently enough?

I wasn't aware of that patch. It looks like it does exactly what I was arguing in favor of. So yes.

--
Peter Geoghegan
On Wed, Mar 8, 2023 at 1:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 3, 2023 at 8:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
> > > Thanks for the patch. While reviewing the patch backlog, we have determined that the latest version of this patch was submitted before meson support was implemented, so it should have a "meson.build" file added for consideration for inclusion in PostgreSQL 16.
> >
> > I wanted to do some XID wraparound testing again, to test the 64-bit SLRUs patches [1], and revived this.
>
> Thank you for reviving this thread!
>
> > I took a different approach to consuming the XIDs. Instead of setting nextXID directly, bypassing GetNewTransactionId(), this patch introduces a helper function to call GetNewTransactionId() repeatedly. But because that's slow, it does include a shortcut to skip over "uninteresting" XIDs. Whenever nextXid is close to an SLRU page boundary or XID wraparound, it calls GetNewTransactionId(), and otherwise it bumps up nextXid close to the next "interesting" value. That's still a lot slower than just setting nextXid, but exercises the code more realistically.
> >
> > I've written some variant of this helper function many times over the years, for ad hoc testing. I'd love to have it permanently in the git tree.
>
> These functions seem to be better than mine.
>
> > In addition to Masahiko's test for emergency vacuum, this includes two other tests. 002_limits.pl tests the "warn limit" and "stop limit" in GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion transactions in total, exercising XID wraparound in general. Unfortunately these tests are pretty slow; the tests run for about 4 minutes on my laptop in total, and use about 20 GB of disk space. So perhaps these need to be put in a special test suite that's not run as part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, that's the slowest of the tests. But I'd love to have these in the git tree in some form.
>
> cfbot reports some failures. The main reason seems to be that meson.build in the xid_wraparound directory adds the regression tests, but the .sql and .out files are missing in the patch. Perhaps the patch wants to add only tap tests, as the Makefile doesn't define REGRESS?
>
> Even after fixing this issue, CI tests (Cirrus CI) are not happy and report failures due to a disk full. The size of the xid_wraparound test directory is 105MB out of 262MB:
>
> % du -sh testrun
> 262M    testrun
> % du -sh testrun/xid_wraparound/
> 105M    testrun/xid_wraparound/
> % du -sh testrun/xid_wraparound/*
> 460K    testrun/xid_wraparound/001_emergency_vacuum
> 93M     testrun/xid_wraparound/002_limits
> 12M     testrun/xid_wraparound/003_wraparounds
> % ls -lh testrun/xid_wraparound/002_limits/log*
> total 93M
> -rw-------. 1 masahiko masahiko 93M Mar  7 17:34 002_limits_wraparound.log
> -rw-rw-r--. 1 masahiko masahiko 20K Mar  7 17:34 regress_log_002_limits
>
> The biggest file is the server log, since an autovacuum worker writes autovacuum logs for every table every second (autovacuum_naptime is 1s). Maybe we can set the log_autovacuum_min_duration reloption for the test tables instead of globally enabling it.

I think it could be acceptable, since the 002 and 003 tests are executed only when required. And the 001 test seems to be able to pass on cfbot, but it takes more than 30 sec.

In the attached patch, I made these tests optional; they are enabled if the environment variable ENABLE_XID_WRAPAROUND_TESTS is defined (supporting only autoconf).

> The 001 test uses a 2PC transaction that holds locks on tables, but since we can consume xids while the server is running, we don't need that. Instead I think we can keep a transaction open in the background like the 002 test does.

Updated in the new patch. Also, I added a check if the failsafe mode is triggered.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
I agree having the new functions in the tree is useful. I also tried running the TAP tests in v2, but 001 and 002 fail to run:
Odd number of elements in hash assignment at [...]/Cluster.pm line 2010.
Can't locate object method "pump_nb" via package "PostgreSQL::Test::BackgroundPsql" at [...]
It seems to be complaining about
+my $in = '';
+my $out = '';
+my $timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+my $background_psql = $node->background_psql('postgres', \$in, \$out, $timeout);
...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Apr 21, 2023 at 12:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I agree having the new functions in the tree is useful. I also tried running the TAP tests in v2, but 001 and 002 fail to run:
>
> Odd number of elements in hash assignment at [...]/Cluster.pm line 2010.
> Can't locate object method "pump_nb" via package "PostgreSQL::Test::BackgroundPsql" at [...]
>
> It seems to be complaining about
>
> +my $in = '';
> +my $out = '';
> +my $timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
> +my $background_psql = $node->background_psql('postgres', \$in, \$out, $timeout);
>
> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?

Thanks for reporting. I think that the patch needs to be updated, since commit 664d757531e1 changed the background psql TAP functions. I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
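For readers following along, after commit 664d757531e1 a background session is obtained and driven roughly like this; this is a sketch of the post-commit interface rather than an excerpt from the patch, and the exact queries are just examples:

    # Post-664d757531e1 style: background_psql() takes key => value options and
    # returns a PostgreSQL::Test::BackgroundPsql object instead of in/out handles.
    my $background_psql = $node->background_psql('postgres', on_error_stop => 1);

    # Hold an open transaction (and thus an old XID) while the test runs.
    $background_psql->query_safe('BEGIN; SELECT txid_current();');

    # ... run the wraparound scenario ...

    $background_psql->quit;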
> On 27 Apr 2023, at 16:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Apr 21, 2023 at 12:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
>
> Thanks for reporting. I think that the patch needs to be updated, since commit 664d757531e1 changed the background psql TAP functions. I've attached the updated patch.

Is there a risk that the background psql will time out on slow systems during the consumption of 2B xid's? Since you mainly want to hold it open for the duration of testing, you might want to bump it to avoid false negatives on slow test systems.

--
Daniel Gustafsson
On Thu, Apr 27, 2023 at 9:12 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 27 Apr 2023, at 16:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Fri, Apr 21, 2023 at 12:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> >> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
> >
> > Thanks for reporting. I think that the patch needs to be updated since
> > commit 664d757531e1 changed background psql TAP functions. I've
> > attached the updated patch.
Thanks, it passes for me now.
> Is there a risk that the background psql will time out on slow systems during
> the consumption of 2B xid's? Since you mainly want to hold it open for the
> duration of testing you might want to bump it to avoid false negatives on slow
> test systems.
If they're that slow, I'd worry more about generating 20GB of xact status data. That's why the tests are disabled by default.
--
John Naylor
EDB: http://www.enterprisedb.com
John Naylor <john.naylor@enterprisedb.com> writes:
> On Thu, Apr 27, 2023 at 9:12 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Is there a risk that the background psql will time out on slow systems during the consumption of 2B xid's? Since you mainly want to hold it open for the duration of testing you might want to bump it to avoid false negatives on slow test systems.

> If they're that slow, I'd worry more about generating 20GB of xact status data. That's why the tests are disabled by default.

There is exactly zero chance that anyone will accept the introduction of such an expensive test into either check-world or the buildfarm sequence.

                        regards, tom lane
> On 28 Apr 2023, at 06:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Naylor <john.naylor@enterprisedb.com> writes:
>> If they're that slow, I'd worry more about generating 20GB of xact status data. That's why the tests are disabled by default.
>
> There is exactly zero chance that anyone will accept the introduction of such an expensive test into either check-world or the buildfarm sequence.

Even though the entire suite is disabled by default, shouldn't it also require PG_TEST_EXTRA, to be consistent with other off-by-default suites like, for example, src/test/kerberos?

--
Daniel Gustafsson
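For reference, the existing off-by-default suites gate themselves with a check along these lines at the top of each test script; the suite name xid_wraparound here is the one being proposed, and the wording of the skip message is only illustrative:

    use strict;
    use warnings;
    use PostgreSQL::Test::Utils;
    use Test::More;

    # Same opt-in pattern as src/test/kerberos and friends.
    if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bxid_wraparound\b/)
    {
        plan skip_all => 'test xid_wraparound not enabled in PG_TEST_EXTRA';
    }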
On Thu, Apr 27, 2023 at 11:12 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 27 Apr 2023, at 16:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Fri, Apr 21, 2023 at 12:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> >> ...that call to background_psql doesn't look like other ones that have "key => value". Is there something I'm missing?
> >
> > Thanks for reporting. I think that the patch needs to be updated, since commit 664d757531e1 changed the background psql TAP functions. I've attached the updated patch.
>
> Is there a risk that the background psql will time out on slow systems during the consumption of 2B xid's? Since you mainly want to hold it open for the duration of testing, you might want to bump it to avoid false negatives on slow test systems.

Agreed. The timeout can be set by manually setting PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it now requires setting PG_TEST_EXTRA to run it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
> On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Agreed. The timeout can be set by manually setting PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it now requires setting PG_TEST_EXTRA to run it.

+# bump the query timeout to avoid false negatives on slow test syetems.

typo: s/syetems/systems/

+# bump the query timeout to avoid false negatives on slow test syetems.
+$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;

Does this actually work? Utils.pm reads the environment variable at compile time in the BEGIN block, so this setting won't be seen? A quick test program seems to confirm this, but I might be missing something.

--
Daniel Gustafsson
On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: > +# bump the query timeout to avoid false negatives on slow test syetems. > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; > Does this actually work? Utils.pm read the environment variable at compile > time in the BEGIN block so this setting won't be seen? A quick testprogram > seems to confirm this but I might be missing something. I wish that this test were cheaper, without a need to depend on PG_TEST_EXTRA.. Actually, note that you are forgetting to update the documentation of PG_TEST_EXTRA with this new value of xid_wraparound. -- Michael
Attachment
> On 22 Aug 2023, at 07:49, Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: >> +# bump the query timeout to avoid false negatives on slow test syetems. >> +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; >> Does this actually work? Utils.pm read the environment variable at compile >> time in the BEGIN block so this setting won't be seen? A quick testprogram >> seems to confirm this but I might be missing something. > > I wish that this test were cheaper, without a need to depend on > PG_TEST_EXTRA.. Actually, note that you are forgetting to update the > documentation of PG_TEST_EXTRA with this new value of xid_wraparound. Agreed, it would be nice, but I don't see any way to achieve that. I still think the test is worthwhile to add, once the upthread mentioned issues are resolved. -- Daniel Gustafsson
On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote:
> > On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Agreed. The timeout can be set by manually setting
> > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it
> > now require setting PG_TET_EXTRA to run it.
>
> +# bump the query timeout to avoid false negatives on slow test syetems.
> typo: s/syetems/systems/
>
> +# bump the query timeout to avoid false negatives on slow test syetems.
> +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;
> Does this actually work?  Utils.pm read the environment variable at compile
> time in the BEGIN block so this setting won't be seen?  A quick testprogram
> seems to confirm this but I might be missing something.

The correct way to get a longer timeout is "IPC::Run::timer(4 *
$PostgreSQL::Test::Utils::timeout_default);".  Even if changing env worked,
that would be removing the ability for even-slower systems to set timeouts
greater than 10min.
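Spelled out, the suggestion is to build the session's timer from a multiple of
the standard default rather than overriding the default itself; a minimal
sketch:

    use strict;
    use warnings;
    use PostgreSQL::Test::Utils;
    use IPC::Run ();

    # Dilate the standard timeout instead of replacing it, so a user who sets
    # PG_TEST_TIMEOUT_DEFAULT higher on a very slow machine scales this test
    # up along with everything else.
    my $psql_timer =
      IPC::Run::timer(4 * $PostgreSQL::Test::Utils::timeout_default);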
Sorry for the late reply. On Sun, Sep 3, 2023 at 2:48 PM Noah Misch <noah@leadboat.com> wrote: > > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: > > > On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Agreed. The timeout can be set by manually setting > > > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it > > > now require setting PG_TET_EXTRA to run it. > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > typo: s/syetems/systems/ > > > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; > > Does this actually work? Utils.pm read the environment variable at compile > > time in the BEGIN block so this setting won't be seen? A quick testprogram > > seems to confirm this but I might be missing something. > > The correct way to get a longer timeout is "IPC::Run::timer(4 * > $PostgreSQL::Test::Utils::timeout_default);". Even if changing env worked, > that would be removing the ability for even-slower systems to set timeouts > greater than 10min. Agreed. I've attached new version patches. 0001 patch adds an option to background_psql to specify the timeout seconds, and 0002 patch is the main regression test patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
> On 27 Sep 2023, at 14:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached new version patches. 0001 patch adds an option to
> background_psql to specify the timeout seconds, and 0002 patch is the
> main regression test patch.

-=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params)
+=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params, timeout)

Looking at this I notice that I made a typo in 664d757531e, the =item line
should have "@psql_params" and not "@params".  Perhaps you can fix that minor
thing while in there?

+	$timeout = $params{timeout} if defined $params{timeout};

I think this should be documented in the background_psql POD docs.

+	Not enabled by default it is resource intensive.

This sentence is missing a "because", should read: "..by default *because* it
is.."

+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;

Should we bump the timeout like this for all systems?  I interpreted Noah's
comment such that it should be possible for slower systems to override, not
that it should be extended everywhere, but I might have missed something.

--
Daniel Gustafsson
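To make the shape of the proposed interface concrete, the 0002 test would
start its long-lived session roughly as below. This is a sketch of the API
under review (the timeout parameter name comes from the hunks quoted above,
and the node name is arbitrary), not the committed code:

    use strict;
    use warnings;
    use PostgreSQL::Test::Cluster;
    use PostgreSQL::Test::Utils;

    my $node = PostgreSQL::Test::Cluster->new('wraparound');
    $node->init;
    $node->start;

    # Pass the dilated timeout at session start; with the 0001 patch this is
    # the only point at which it can be set.
    my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
    my $background_psql =
      $node->background_psql('postgres', timeout => $psql_timeout_secs);

    $background_psql->query_safe('SELECT txid_current()');
    $background_psql->quit;
    $node->stop;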
On Thu, 28 Sept 2023 at 03:55, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Sorry for the late reply. > > On Sun, Sep 3, 2023 at 2:48 PM Noah Misch <noah@leadboat.com> wrote: > > > > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote: > > > > On 12 Jul 2023, at 09:52, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Agreed. The timeout can be set by manually setting > > > > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it > > > > now require setting PG_TET_EXTRA to run it. > > > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > > typo: s/syetems/systems/ > > > > > > > > > +# bump the query timeout to avoid false negatives on slow test syetems. > > > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600; > > > Does this actually work? Utils.pm read the environment variable at compile > > > time in the BEGIN block so this setting won't be seen? A quick testprogram > > > seems to confirm this but I might be missing something. > > > > The correct way to get a longer timeout is "IPC::Run::timer(4 * > > $PostgreSQL::Test::Utils::timeout_default);". Even if changing env worked, > > that would be removing the ability for even-slower systems to set timeouts > > greater than 10min. > > Agreed. > > I've attached new version patches. 0001 patch adds an option to > background_psql to specify the timeout seconds, and 0002 patch is the > main regression test patch. Few comments: 1) Should we have some validation for the inputs given: +PG_FUNCTION_INFO_V1(consume_xids_until); +Datum +consume_xids_until(PG_FUNCTION_ARGS) +{ + FullTransactionId targetxid = FullTransactionIdFromU64((uint64) PG_GETARG_INT64(0)); + FullTransactionId lastxid; + + if (!FullTransactionIdIsNormal(targetxid)) + elog(ERROR, "targetxid %llu is not normal", (unsigned long long) U64FromFullTransactionId(targetxid)); If not it will take inputs like -1 and 999999999999999. Also the notice messages might confuse for the above values, as it will show a different untilxid value like the below: postgres=# SELECT consume_xids_until(999999999999999); NOTICE: consumed up to 0:10000809 / 232830:2764472319 2) Should this be added after worker_spi as we generally add it in the alphabetical order: diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build index fcd643f6f1..4054bde84c 100644 --- a/src/test/modules/meson.build +++ b/src/test/modules/meson.build @@ -10,6 +10,7 @@ subdir('libpq_pipeline') subdir('plsample') subdir('spgist_name_ops') subdir('ssl_passphrase_callback') +subdir('xid_wraparound') subdir('test_bloomfilter') 3) Similarly here too: index e81873cb5a..a4c845ab4a 100644 --- a/src/test/modules/Makefile +++ b/src/test/modules/Makefile @@ -13,6 +13,7 @@ SUBDIRS = \ libpq_pipeline \ plsample \ spgist_name_ops \ + xid_wraparound \ test_bloomfilter \ 4) The following includes are not required transam.h, fmgr.h, lwlock.h + * src/test/modules/xid_wraparound/xid_wraparound.c + * + * ------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/transam.h" +#include "access/xact.h" +#include "fmgr.h" +#include "miscadmin.h" +#include "storage/lwlock.h" +#include "storage/proc.h" Regards, Vignesh
On Fri, Sep 29, 2023 at 12:17:04PM +0200, Daniel Gustafsson wrote: > +# Bump the query timeout to avoid false negatives on slow test systems. > +my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default; > > Should we bump the timeout like this for all systems? I interpreted Noah's > comment such that it should be possible for slower systems to override, not > that it should be extended everywhere, but I might have missed something. This is the conventional way to do it. For an operation far slower than a typical timeout_default situation, the patch can and should dilate the default timeout like this. The patch version as of my last comment was extending the timeout but also blocking users from extending it further via PG_TEST_TIMEOUT_DEFAULT. The latest version restores PG_TEST_TIMEOUT_DEFAULT reactivity, resolving my comment.
On Fri, Sep 29, 2023 at 7:17 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 27 Sep 2023, at 14:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached new version patches. 0001 patch adds an option to > > background_psql to specify the timeout seconds, and 0002 patch is the > > main regression test patch. > > -=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params) > +=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params, timeout) > > Looking at this I notice that I made a typo in 664d757531e, the =item line > should have "@psql_params" and not "@params". Perhaps you can fix that minor > thing while in there? > > > + $timeout = $params{timeout} if defined $params{timeout}; > > I think this should be documented in the background_psql POD docs. While updating the documentation, I found the following description: =item $node->background_psql($dbname, %params) => PostgreSQL::Test::BackgroundPsql inst$ Invoke B<psql> on B<$dbname> and return a BackgroundPsql object. A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, which can be modified later. Is it true that we can modify the timeout after creating BackgroundPsql object? If so, it seems we don't need to introduce the new timeout argument. But how? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Is it true that we can modify the timeout after creating
> BackgroundPsql object? If so, it seems we don't need to introduce the
> new timeout argument. But how?

I can't remember if that's leftovers that incorrectly remains from an earlier
version of the BackgroundPsql work, or if it's a very bad explanation of
->set_query_timer_restart().  The timeout will use the timeout_default value
and that cannot be overridden, it can only be reset per query.

With your patch the timeout still cannot be changed, but at least set during
start which seems good enough until there are tests warranting more
complexity.  The docs should be corrected to reflect this in your patch.

--
Daniel Gustafsson
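For completeness, the knob Daniel mentions only controls when the per-query
timer is restarted, not how long it is. A sketch of its use, assuming $node is
an already-running cluster object:

    # Assumes $node is a running PostgreSQL::Test::Cluster instance.
    my $bg = $node->background_psql('postgres');

    # Restart the default-length query timer before each query, so one slow
    # statement does not eat into the budget of the statements that follow;
    # the timeout value itself still comes from timeout_default.
    $bg->set_query_timer_restart();
    $bg->query_safe('SELECT pg_sleep(1)');
    $bg->quit;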
On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Is it true that we can modify the timeout after creating > > BackgroundPsql object? If so, it seems we don't need to introduce the > > new timeout argument. But how? > > I can't remember if that's leftovers that incorrectly remains from an earlier > version of the BackgroundPsql work, or if it's a very bad explanation of > ->set_query_timer_restart(). The timeout will use the timeout_default value > and that cannot be overridden, it can only be reset per query. Thank you for confirming this. I see there is the same problem also in interactive_psql(). So I've attached the 0001 patch to fix these documentation issues. Which could be backpatched. > With your patch the timeout still cannot be changed, but at least set during > start which seems good enough until there are tests warranting more complexity. > The docs should be corrected to reflect this in your patch. I've incorporated the comments except for the following one and attached updated version of the rest patches: On Fri, Sep 29, 2023 at 7:20 PM vignesh C <vignesh21@gmail.com> wrote: > Few comments: > 1) Should we have some validation for the inputs given: > +PG_FUNCTION_INFO_V1(consume_xids_until); > +Datum > +consume_xids_until(PG_FUNCTION_ARGS) > +{ > + FullTransactionId targetxid = > FullTransactionIdFromU64((uint64) PG_GETARG_INT64(0)); > + FullTransactionId lastxid; > + > + if (!FullTransactionIdIsNormal(targetxid)) > + elog(ERROR, "targetxid %llu is not normal", (unsigned > long long) U64FromFullTransactionId(targetxid)); > > If not it will take inputs like -1 and 999999999999999. > Also the notice messages might confuse for the above values, as it > will show a different untilxid value like the below: > postgres=# SELECT consume_xids_until(999999999999999); > NOTICE: consumed up to 0:10000809 / 232830:2764472319 The full transaction ids shown in the notice messages are separated into epoch and xid so it's not a different value. This epoch-and-xid style is used also in pg_controldata output and makes sense to me to show the progress of xid consumption. Once the new test gets committed, I'll prepare a new buildfarm animal for that if possible. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
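As an aside on the epoch-and-xid notation Sawada-san describes: the NOTICE
simply splits the 64-bit full transaction ID at bit 32, so the value in
Vignesh's example is not a different number. A quick illustration in plain
Perl (assuming a 64-bit build):

    use strict;
    use warnings;

    # A FullTransactionId is 64 bits: the high 32 bits are the epoch and the
    # low 32 bits the xid, which is how pg_controldata and the NOTICE print it.
    my $fxid  = 999_999_999_999_999;
    my $epoch = $fxid >> 32;
    my $xid   = $fxid & 0xFFFFFFFF;
    printf "consumed up to %u:%u\n", $epoch, $xid;    # 232830:2764472319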
> On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: >> >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >>> Is it true that we can modify the timeout after creating >>> BackgroundPsql object? If so, it seems we don't need to introduce the >>> new timeout argument. But how? >> >> I can't remember if that's leftovers that incorrectly remains from an earlier >> version of the BackgroundPsql work, or if it's a very bad explanation of >> ->set_query_timer_restart(). The timeout will use the timeout_default value >> and that cannot be overridden, it can only be reset per query. > > Thank you for confirming this. I see there is the same problem also in > interactive_psql(). So I've attached the 0001 patch to fix these > documentation issues. -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, -which can be modified later. +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. Since it cannot be modified, I think we should just say "A timeout of .." and call it a default timeout. This obviously only matters for the backpatch since the sentence is removed in 0002. > Which could be backpatched. +1 >> With your patch the timeout still cannot be changed, but at least set during >> start which seems good enough until there are tests warranting more complexity. >> The docs should be corrected to reflect this in your patch. > > I've incorporated the comments except for the following one and > attached updated version of the rest patches: LGTM. -- Daniel Gustafsson
On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > >> > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> > >>> Is it true that we can modify the timeout after creating > >>> BackgroundPsql object? If so, it seems we don't need to introduce the > >>> new timeout argument. But how? > >> > >> I can't remember if that's leftovers that incorrectly remains from an earlier > >> version of the BackgroundPsql work, or if it's a very bad explanation of > >> ->set_query_timer_restart(). The timeout will use the timeout_default value > >> and that cannot be overridden, it can only be reset per query. > > > > Thank you for confirming this. I see there is the same problem also in > > interactive_psql(). So I've attached the 0001 patch to fix these > > documentation issues. > > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, > -which can be modified later. > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. > > Since it cannot be modified, I think we should just say "A timeout of .." and > call it a default timeout. This obviously only matters for the backpatch since > the sentence is removed in 0002. Agreed. I've attached new version patches (0002 and 0003 are unchanged except for the commit message). I'll push them, barring any objections. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Nov 29, 2023 at 5:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > > > On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > >> > > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > >> > > >>> Is it true that we can modify the timeout after creating > > >>> BackgroundPsql object? If so, it seems we don't need to introduce the > > >>> new timeout argument. But how? > > >> > > >> I can't remember if that's leftovers that incorrectly remains from an earlier > > >> version of the BackgroundPsql work, or if it's a very bad explanation of > > >> ->set_query_timer_restart(). The timeout will use the timeout_default value > > >> and that cannot be overridden, it can only be reset per query. > > > > > > Thank you for confirming this. I see there is the same problem also in > > > interactive_psql(). So I've attached the 0001 patch to fix these > > > documentation issues. > > > > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, > > -which can be modified later. > > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. > > > > Since it cannot be modified, I think we should just say "A timeout of .." and > > call it a default timeout. This obviously only matters for the backpatch since > > the sentence is removed in 0002. > > Agreed. > > I've attached new version patches (0002 and 0003 are unchanged except > for the commit message). I'll push them, barring any objections. > Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Nov 30, 2023 at 4:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Nov 29, 2023 at 5:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > > > > > On 28 Nov 2023, at 03:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > >> > > > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > >> > > > >>> Is it true that we can modify the timeout after creating > > > >>> BackgroundPsql object? If so, it seems we don't need to introduce the > > > >>> new timeout argument. But how? > > > >> > > > >> I can't remember if that's leftovers that incorrectly remains from an earlier > > > >> version of the BackgroundPsql work, or if it's a very bad explanation of > > > >> ->set_query_timer_restart(). The timeout will use the timeout_default value > > > >> and that cannot be overridden, it can only be reset per query. > > > > > > > > Thank you for confirming this. I see there is the same problem also in > > > > interactive_psql(). So I've attached the 0001 patch to fix these > > > > documentation issues. > > > > > > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up, > > > -which can be modified later. > > > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up. > > > > > > Since it cannot be modified, I think we should just say "A timeout of .." and > > > call it a default timeout. This obviously only matters for the backpatch since > > > the sentence is removed in 0002. > > > > Agreed. > > > > I've attached new version patches (0002 and 0003 are unchanged except > > for the commit message). I'll push them, barring any objections. > > > > Pushed. FYI I've configured the buildfarm animal perentie to run regression tests including xid_wraparound: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=perentie&br=HEAD Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
The way src/test/modules/xid_wraparound/meson.build is written, it installs the xid_wraparound.so module into production installations. For test modules, a different installation code needs to be used. See neighboring test modules such as src/test/modules/test_rbtree/meson.build for examples.
On Thu, Feb 8, 2024 at 3:11 AM Peter Eisentraut <peter@eisentraut.org> wrote: > > The way src/test/modules/xid_wraparound/meson.build is written, it > installs the xid_wraparound.so module into production installations. > For test modules, a different installation code needs to be used. See > neighboring test modules such as > src/test/modules/test_rbtree/meson.build for examples. > Good catch, thanks. I've attached the patch to fix it. Does it make sense? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On 08.02.24 05:05, Masahiko Sawada wrote: > On Thu, Feb 8, 2024 at 3:11 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> >> The way src/test/modules/xid_wraparound/meson.build is written, it >> installs the xid_wraparound.so module into production installations. >> For test modules, a different installation code needs to be used. See >> neighboring test modules such as >> src/test/modules/test_rbtree/meson.build for examples. >> > > Good catch, thanks. > > I've attached the patch to fix it. Does it make sense? Yes, that looks correct to me and produces the expected behavior.
On Thu, Feb 8, 2024 at 4:06 PM Peter Eisentraut <peter@eisentraut.org> wrote: > > On 08.02.24 05:05, Masahiko Sawada wrote: > > On Thu, Feb 8, 2024 at 3:11 AM Peter Eisentraut <peter@eisentraut.org> wrote: > >> > >> The way src/test/modules/xid_wraparound/meson.build is written, it > >> installs the xid_wraparound.so module into production installations. > >> For test modules, a different installation code needs to be used. See > >> neighboring test modules such as > >> src/test/modules/test_rbtree/meson.build for examples. > >> > > > > Good catch, thanks. > > > > I've attached the patch to fix it. Does it make sense? > > Yes, that looks correct to me and produces the expected behavior. > Thank you for the check. Pushed at 1aa67a5ea687. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hello, 30.11.2023 10:35, Masahiko Sawada wrote: > >> I've attached new version patches (0002 and 0003 are unchanged except >> for the commit message). I'll push them, barring any objections. >> > Pushed. I've discovered that the test 001_emergency_vacuum.pl can fail due to a race condition. I can't see the server log at [1], but I reproduced the failure locally and with additional logging and log_min_messages = DEBUG3, the log shows: ... 2024-05-22 11:46:28.125 UTC [21256:2853] DEBUG: SlruScanDirectory invoking callback on pg_xact/0690 2024-05-22 11:46:28.125 UTC [21256:2854] DEBUG: transaction ID wrap limit is 2147484396, limited by database with OID 5 2024-05-22 11:46:28.126 UTC [21256:2855] LOG: !!!SendPostmasterSignal| PMSIGNAL_START_AUTOVAC_LAUNCHER 2024-05-22 11:46:28.135 UTC [14871:20077] DEBUG: postmaster received pmsignal signal 2024-05-22 11:46:28.137 UTC [21257:1] DEBUG: autovacuum launcher started 2024-05-22 11:46:28.137 UTC [21257:2] DEBUG: InitPostgres 2024-05-22 11:46:28.138 UTC [21257:3] LOG: !!!AutoVacLauncherMain| !AutoVacuumingActive() && !ShutdownRequestPending; before do_start_worker() 2024-05-22 11:46:28.138 UTC [21257:4] LOG: !!!do_start_worker| return quickly when there are no free workers 2024-05-22 11:46:28.138 UTC [21257:5] DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:6] DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:7] DEBUG: proc_exit(0): 1 callbacks to make 2024-05-22 11:46:28.138 UTC [21257:8] DEBUG: exit(0) 2024-05-22 11:46:28.138 UTC [21257:9] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:10] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make 2024-05-22 11:46:28.138 UTC [21257:11] DEBUG: proc_exit(-1): 0 callbacks to make 2024-05-22 11:46:28.146 UTC [21256:2856] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 5 2024-05-22 11:46:28.146 UTC [21256:2857] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1 2024-05-22 11:46:28.146 UTC [21256:2858] DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2859] DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2860] DEBUG: proc_exit(0): 1 callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2861] DEBUG: exit(0) 2024-05-22 11:46:28.147 UTC [21256:2862] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2863] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make 2024-05-22 11:46:28.147 UTC [21256:2864] DEBUG: proc_exit(-1): 0 callbacks to make 2024-05-22 11:46:28.151 UTC [14871:20078] DEBUG: forked new backend, pid=21258 socket=8 2024-05-22 11:46:28.171 UTC [14871:20079] DEBUG: server process (PID 21256) exited with exit code 0 2024-05-22 11:46:28.152 UTC [21258:1] [unknown] LOG: connection received: host=[local] 2024-05-22 11:46:28.171 UTC [21258:2] [unknown] DEBUG: InitPostgres 2024-05-22 11:46:28.172 UTC [21258:3] [unknown] LOG: connection authenticated: user="vagrant" method=trust (/pgtest/postgresql.git/src/test/modules/xid_wraparound/tmp_check/t_001_emergency_vacuum_main_data/pgdata/pg_hba.conf:117) 2024-05-22 11:46:28.172 UTC [21258:4] [unknown] LOG: connection authorized: user=vagrant database=postgres application_name=001_emergency_vacuum.pl 2024-05-22 11:46:28.175 UTC [21258:5] 001_emergency_vacuum.pl LOG: statement: INSERT INTO small(data) SELECT 1 That is, autovacuum worker (21256) 
sent PMSIGNAL_START_AUTOVAC_LAUNCHER, postmaster started the autovacuum
launcher, which could not start a new autovacuum worker because process 21256
had not exited yet.

The failure can be reproduced easily with a sleep added inside
SetTransactionIdLimit():

	if (TransactionIdFollowsOrEquals(curXid, xidVacLimit) &&
		IsUnderPostmaster && !InRecovery)
		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+pg_usleep(10000L);

By the way, I also discovered that the rather resource-intensive
xid_wraparound tests are executed twice during the buildfarm "make" run (on
the dodo and perentie animals, see [2]): once at stage
module-xid_wraparound-check and then again at stage
testmodules-install-check-C.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-05-19%2006%3A33%3A34
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=perentie&dt=2024-05-22%2000%3A02%3A19

Best regards,
Alexander
Hi, On Tue, Oct 8, 2024 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > Hello Masahiko-san, > > 01.12.2023 05:14, Masahiko Sawada wrote: > > FYI I've configured the buildfarm animal perentie to run regression > > tests including xid_wraparound: > > > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=perentie&br=HEAD > > Please look at a failure produced by perentie recently: [1]. > > I've analyzed all the available detailed perentie logs (starting from > 2024-04-04) and got the following durations of the pg_ctl stop operation > at the end of the 001_emergency_vacuum.pl (from the > module-xid_wraparound-check stage): see perentie-timings.txt and > perentie-timings.png attached. So it looks like perentie needs larger > PGCTLTIMEOUT for the test (maybe 180 seconds would work?). > > Though it's not clear to me, why this test takes so long on that animal, > even when it succeeds. For example, [2] shows: > [09:28:23] t/001_emergency_vacuum.pl .. ok 225894 ms ( 0.00 usr 0.00 sys + 0.31 cusr 0.43 csys = 0.74 CPU) > [09:30:53] t/002_limits.pl ............ ok 150291 ms ( 0.00 usr 0.00 sys + 1.85 cusr 1.50 csys = 3.35 CPU) > [09:49:33] t/003_wraparounds.pl ....... ok 1119766 ms ( 0.00 usr 0.00 sys + 1.68 cusr 2.39 csys = 4.07 CPU) > > While what I'm seeing locally on my Fedora 40 VM is: > PG_TEST_EXTRA="xid_wraparound" make -s check -C src/test/modules/xid_wraparound/ PROVE_FLAGS="--timer" > # +++ tap check in src/test/modules/xid_wraparound +++ > [04:41:56] t/001_emergency_vacuum.pl .. ok 18852 ms ( 0.01 usr 0.00 sys + 0.14 cusr 0.28 csys = 0.43 CPU) > [04:42:15] t/002_limits.pl ............ ok 18539 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.88 csys = 1.60 CPU) > [04:42:34] t/003_wraparounds.pl ....... ok 74368 ms ( 0.00 usr 0.00 sys + 0.82 cusr 2.57 csys = 3.39 CPU) Thank you for reporting it. I've investigated the logs you shared and figured out that some other (personal) tests were overlapped at that time (perentie is my machine). So the failure should be ignored, sorry for making noise. I'll make sure other tests won't be overlapped again when perentie is executing the regression tests. > Also, maybe it would make sense to run this test for REL_17_STABLE too, as > dodo is not with us since 2024-09-04, and I don't know if there are other > animals running these tests (having xid_wraparound in PG_TEST_EXTRA). Good idea. I've configured perentie to do that. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com