Re: pg_upgrade and wraparound - Mailing list pgsql-general
From | Adrian Klaver |
---|---|
Subject | Re: pg_upgrade and wraparound |
Date | |
Msg-id | 651b245a-8874-0cd4-827c-a3691dbda6eb@aklaver.com Whole thread Raw |
In response to | Re: pg_upgrade and wraparound (Alexander Shutyaev <shutyaev@gmail.com>) |
Responses |
Re: pg_upgrade and wraparound
|
List | pgsql-general |
On 06/11/2018 11:32 AM, Alexander Shutyaev wrote: > I'm back with more details. > > First, I've deleted the smaller sslentry database, since I don't need > it, just so that it doesn't somehow spoil the picture. Now there is only > 1 user database - bof (OID=16400). After that I've ran the pg_upgrade on > a clean 10.4 cluster and it failed in the same way. > > Now, the answers to your queries. > > >> 2) The upgrade stops because of transaction ID wraparound, which is > strange as that is not showing up in the 9.6 cluster I presume. You > might want the queries found below on the 9.6 and 10 clusters to help > figure this out: > > >> SELECT c.oid::regclass as table_name, > >> greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age > >> FROM pg_class c > >> LEFT JOIN pg_class t ON c.reltoastrelid = t.oid > >> WHERE c.relkind IN ('r', 'm'); > > 9.6 result - see attached 96-query1.txt > 10.4 result - see attached 104-query1.txt I should have been clearer in my previous post, the above query is per database. From the query below I am going to say the above query was done on the bof database. Is that correct? Given the below from 96-query2.txt: template0 | 110588398 Can you run the table_name query in template0 in the 9.6 cluster? > > >> SELECT datname, age(datfrozenxid) FROM pg_database; > > 9.6 result - see attached 96-query2.txt > 10.4 result - see attached 104-query2.txt > > >> It might be useful > >> to see pg_controldata output for the old cluster, as well as > >> "select datname, datfrozenxid from pg_database" output from the > >> old cluster. > > for the query - see above, for pg_controldata: > > 9.6 - see attached 96-pg_controldata.txt > 10.4 - see attached 104-pg_controldata.txt > > >> Alexander, could you hack things up so autovacuum logging is enabled > >> (log_autovacuum_min_duration=0), and see whether it's triggered? > > I'll be happy to, but that will require to run pg_upgrade once more and > that takes more that half a day and during this time clusters are not > available to me. Given the data I'm attaching it may happen that the > colleagues will want to see something else from my clusters or maybe > change some settings before running the pg_upgrade again. Therefore, > I'll wait 12 hours after this message in case there will be any more > requests and the I'll run the pg_upgrade again. > > Thank you all for trying to solve this matter, this is much appreciated! :) > > 2018-06-11 20:29 GMT+03:00 Andres Freund <andres@anarazel.de > <mailto:andres@anarazel.de>>: > > On 2018-06-11 13:14:12 -0400, Tom Lane wrote: > > Andres Freund <andres@anarazel.de <mailto:andres@anarazel.de>> writes: > > > I suspect the issue is that pg_resetwal does: > > > if (set_xid != 0) > > > { > > > ControlFile.checkPointCopy.nextXid = set_xid; > > > > > /* > > > * For the moment, just set oldestXid to a value that will force > > > * immediate autovacuum-for-wraparound. It's not clear whether adding > > > * user control of this is useful, so let's just do something that's > > > * reasonably safe. The magic constant here corresponds to the > > > * maximum allowed value of autovacuum_freeze_max_age. > > > */ > > > ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000; > > > if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId) > > > ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId; > > > ControlFile.checkPointCopy.oldestXidDB = InvalidOid; > > > } > > > > > but we have codepath that doesn't check for oldestXidDB being > > > InvalidOid. Not great. > > > > Hm, I think I'd define the problem as "pg_resetwal is violating the > > expectation that oldestXidDB be valid". > > Well, what could it do otherwise? ForceTransactionIdLimitUpdate() > currently does a syscache check for database existence. That'll just > return a lookup failure for InvalidOid, so we're reasonably good on that > front. > > Using a hardcoded 2000000000 seems worse, will have funny results if > running with a smaller autovacuum_freeze_max_age... > > > > However, this just explains the basically-cosmetic issue that the > > complaint message mentions OID 0. It doesn't really get us to the > > answer to why Alexander is seeing a failure. It might be useful > > to see pg_controldata output for the old cluster, as well as > > "select datname, datfrozenxid from pg_database" output from the > > old cluster. > > pg_upgrade starts the server with autovacuum disabled, I suspect > restoring all the large objects ends up using a lot of transaction > ids. GetNewTransactionId() should start autovacuum, but I'd guess that's > where things are going wrong for some reason. > > Alexander, could you hack things up so autovacuum logging is enabled > (log_autovacuum_min_duration=0), and see whether it's triggered? > > I'm not entirely clear why pg_restore appears to use a separate > transaction for each large object, surely exascerbating the problem. > > Greetings, > > Andres Freund > > -- Adrian Klaver adrian.klaver@aklaver.com
pgsql-general by date: