Thread: BUG #15840: Vacuum does not work after database stopped for wraparound protection. Database seems unrepearable.
BUG #15840: Vacuum does not work after database stopped for wraparound protection. Database seems unrepearable.
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15840 Logged by: Thierry Husson Email address: thusson@informiciel.com PostgreSQL version: 12beta1 Operating system: Ubuntu 18.04.2 LTS Description: I was doing tables COPY between my old server with PG10.8 and the new one with 12Beta1. After each table is done, I make a vacuum of it. However PG12 has stopped working for wraparound protection. I was doing it on around 10 cpu, 1 table by cpu. The is the end of the log of the copy program in Python3 with Psycopg2: ... 2019-06-06 23:15:26 prog_sync Vacuum usr_ops.prg_hrdps_n1500n_voisin_ade_metar... 4s. 2019-06-06 23:15:30 prog_sync Vacuum usr_ops.flt_hrdps_n1500n_voisin_ade_metar... 0s. 2019-06-06 23:15:30 prog_sync CPU 0 - Sync done 8857sec. 2019-06-06 23:15:30 prog_sync Tables Skipped:0, Already sync:0, Copied from pravda:1. Copied from zhen:0. 2019-06-06 23:15:30 prog_sync Sync done for 1 tables of 106451311 records in 8858s. (12018 rec./sec.) Traceback (most recent call last): File "/home/semt700/emet/script/prog_sync.py", line 316, in syncTable ioResult = e.flushCopyBuffer(ioResult, curPG[slave][procId], progTable[slave], columns) File "/fs/home/fs1/eccc/cmd/cmdn/semt700/emet/script/emetlib.py", line 607, in flushCopyBuffer cursorObj.copy_from(ioBuffer, tableName, sep='\t', columns=columnName, null='NULL') psycopg2.OperationalError: database is not accepting commands to avoid wraparound data loss in database "emet_zhen" HINT: Stop the postmaster and vacuum that database in single-user mode. You might also need to commit or roll back old prepared transactions, or drop stale replication slots. CONTEXT: SQL statement "INSERT INTO usr_ops.prg_gdps_g1610n_voisin_ade_synop_swob_metar_201903 SELECT $1.*" PL/pgSQL function prog_insert() line 17 at EXECUTE COPY prg_gdps_g1610n_voisin_ade_synop_swob_metar, line 132822: "284532738 2019-03-20 00:00:00 2019-03-29 12:00:00 11011 37980000 -101750000 75597472 NULL -5.4617 1 ..." I did a DB shutdown and started a vacuum with: postgres --single emet_zhen VACUUM FREEZE VERBOSE; It worked a few hours and when I was thinking it was done as nothing was loggin anymore, I made a ctrl-\ and restarted the DB. I was still getting wraparound protection messages so I shutdown the DB again & redo the vacuum command but it didn't work anymore: zhen:semt700 $ postgres --single emet_zhen 2019-06-07 17:23:36 UTC 7251 WARNING: database with OID 16394 must be vacuumed within 999995 transactions 2019-06-07 17:23:36 UTC 7251 HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. You might also need to commit or roll back old prepared transactions, or drop stale replication slots. PostgreSQL stand-alone backend 12beta1 backend> VACUUM VERBOSE; 2019-06-07 17:23:59 UTC 7251 WARNING: database "emet_zhen" must be vacuumed within 999995 transactions 2019-06-07 17:23:59 UTC 7251 HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. You might also need to commit or roll back old prepared transactions, or drop stale replication slots. 2019-06-07 17:23:59 UTC 7251 LOG: duration: 2417.639 ms statement: VACUUM VERBOSE; I tried with various options but none worked. It also tried to restard the DB and use vacuumdb --all -v , or various options, but always get the same message for each table: INFO: aggressively vacuuming "pg_catalog.pg_publication" INFO: index "pg_publication_oid_index" now contains 0 row versions in 1 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. INFO: index "pg_publication_pubname_index" now contains 0 row versions in 1 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. INFO: "pg_publication": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 2146520116 There were 0 unused item identifiers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. WARNING: database "emet_zhen" must be vacuumed within 999995 transactions HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. You might also need to commit or roll back old prepared transactions, or drop stale replication slots. I out of clues of what to try next. I already got this situation with PG 9.x & PG10.x but system wide in exclusive mode usualy worked. Seems like a PG12 bug that will certainly prevent us from upgrading even if the new fonctionnalities look really great. Thanks a lot! Thierry
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Andres Freund
Date:
Hi, On 2019-06-07 18:22:20 +0000, PG Bug reporting form wrote: > I was doing tables COPY between my old server with PG10.8 and the new one > with 12Beta1. After each table is done, I make a vacuum of it. > However PG12 has stopped working for wraparound protection. I was doing it > on around 10 cpu, 1 table by cpu. That was a new postgres 12 cluster, not a pg_upgraded one? And you just did a bunch of COPYs? How many? I'm not clear as to how the cluster got to wraparound if that's the scenario. We use one xid per transaction, and copy doesn't use multiple transactions internally. Any chance you have triggers on these tables that use savepoints internally? > postgres --single emet_zhen > VACUUM FREEZE VERBOSE; Don't FREEZE in wraparound cases, that just makes it take longer. > It worked a few hours and when I was thinking it was done as nothing was > loggin anymore, I made a ctrl-\ and restarted the DB. > I was still getting wraparound protection messages so I shutdown the DB > again & redo the vacuum command but it didn't work anymore: > zhen:semt700 $ postgres --single emet_zhen > 2019-06-07 17:23:36 UTC 7251 WARNING: database with OID 16394 must be > vacuumed within 999995 transactions > 2019-06-07 17:23:36 UTC 7251 HINT: To avoid a database shutdown, execute a > database-wide VACUUM in that database. > You might also need to commit or roll back old prepared > transactions, or drop stale replication slots. > PostgreSQL stand-alone backend 12beta1 > backend> VACUUM VERBOSE; > 2019-06-07 17:23:59 UTC 7251 WARNING: database "emet_zhen" must be > vacuumed within 999995 transactions > 2019-06-07 17:23:59 UTC 7251 HINT: To avoid a database shutdown, execute a > database-wide VACUUM in that database. > You might also need to commit or roll back old prepared > transactions, or drop stale replication slots. > 2019-06-07 17:23:59 UTC 7251 LOG: duration: 2417.639 ms statement: VACUUM > VERBOSE; What do you mean by "didn't work anymore"? As far as I can tell the VACUUM here succeeded? > HINT: To avoid a database shutdown, execute a database-wide VACUUM in that > database. > You might also need to commit or roll back old prepared transactions, or > drop stale replication slots. Did you check whether any of these are the case? SELECT * FROM pg_replication_slots; SELECT * FROM pg_prepared_xacts; Could you also show SELECT oid, datname, datfrozenxid, age(datfrozenxid), datminmxid, mxid_age(datminmxid) FROM pg_database ORDER BY age(datfrozenxid)DESC; SELECT * FROM pg_control_checkpoint(); Greetings, Andres Freund
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Thierry Husson
Date:
Hi Andres, Thank you for your anwser. Precisions bellow: Andres Freund <andres@anarazel.de> a écrit : > Hi, > > On 2019-06-07 18:22:20 +0000, PG Bug reporting form wrote: >> I was doing tables COPY between my old server with PG10.8 and the new one >> with 12Beta1. After each table is done, I make a vacuum of it. >> However PG12 has stopped working for wraparound protection. I was doing it >> on around 10 cpu, 1 table by cpu. > > That was a new postgres 12 cluster, not a pg_upgraded one? And you just > did a bunch of COPYs? How many? > > I'm not clear as to how the cluster got to wraparound if that's the > scenario. We use one xid per transaction, and copy doesn't use multiple > transactions internally. Any chance you have triggers on these tables > that use savepoints internally? Yes it was a new cluster. Around 30 copy were done. Yes there is a trigger to manage partitions. Around 1200 tables were created. 10 billions records transfered, I need to tranfert 180BR over 1700 tables. I just realize I made vacuum on partitions for the first 8BR rows and forgot for the last 2BR That would explain the wraparound protection. > > >> postgres --single emet_zhen >> VACUUM FREEZE VERBOSE; > > Don't FREEZE in wraparound cases, that just makes it take longer. > > >> It worked a few hours and when I was thinking it was done as nothing was >> loggin anymore, I made a ctrl-\ and restarted the DB. >> I was still getting wraparound protection messages so I shutdown the DB >> again & redo the vacuum command but it didn't work anymore: > >> zhen:semt700 $ postgres --single emet_zhen >> 2019-06-07 17:23:36 UTC 7251 WARNING: database with OID 16394 must be >> vacuumed within 999995 transactions >> 2019-06-07 17:23:36 UTC 7251 HINT: To avoid a database shutdown, execute a >> database-wide VACUUM in that database. >> You might also need to commit or roll back old prepared >> transactions, or drop stale replication slots. >> PostgreSQL stand-alone backend 12beta1 >> backend> VACUUM VERBOSE; >> 2019-06-07 17:23:59 UTC 7251 WARNING: database "emet_zhen" must be >> vacuumed within 999995 transactions >> 2019-06-07 17:23:59 UTC 7251 HINT: To avoid a database shutdown, execute a >> database-wide VACUUM in that database. >> You might also need to commit or roll back old prepared >> transactions, or drop stale replication slots. >> 2019-06-07 17:23:59 UTC 7251 LOG: duration: 2417.639 ms statement: VACUUM >> VERBOSE; > > What do you mean by "didn't work anymore"? As far as I can tell the > VACUUM here succeeded? > > >> HINT: To avoid a database shutdown, execute a database-wide VACUUM in that >> database. >> You might also need to commit or roll back old prepared transactions, or >> drop stale replication slots. > > Did you check whether any of these are the case? > > SELECT * FROM pg_replication_slots; > SELECT * FROM pg_prepared_xacts; These are empty. emet_zhen=# SELECT max(age(pg_database.datfrozenxid)) / 2147483648.0 * 100.0 AS "Percentage of transaction ID's used" FROM pg_database; Percentage of transaction ID's used ------------------------------------- 99.953434057533740997000 > > Could you also show > > SELECT oid, datname, datfrozenxid, age(datfrozenxid), datminmxid, > mxid_age(datminmxid) FROM pg_database ORDER BY age(datfrozenxid) DESC; oid | datname | datfrozenxid | age | datminmxid | mxid_age -------+-----------+--------------+------------+------------+---------- 16394 | emet_zhen | 36464 | 2146483652 | 1 | 0 12672 | template0 | 504982897 | 1641537219 | 1 | 0 12673 | postgres | 2096520116 | 50000000 | 1 | 0 1 | template1 | 2096520116 | 50000000 | 1 | 0 > SELECT * FROM pg_control_checkpoint(); checkpoint_lsn | redo_lsn | redo_wal_file | timeline_id | prev_timeline_id | full_page_writes | next_xid | next_oid | next_multixact_id | next_multi_offset | oldest_xid | oldest_xid_dbid | oldest_active_xid | oldest_multi_xid | oldest_multi_dbid | oldest_commit_ts_xid | newest_commit_ts_xid | checkpoint_time 32D/54074EC0 | 32D/54074E88 | 000000010000032D00000054 | 1 | 1 | t | 0:2146520116 | 475782 | 1 | 0 | 36464 | 16394 | 2146520116 | 1 | 16394 | 0 | 0 | 2019-06-07 18:11:39+00 (1 row) Could it be that PG12 considers "vacuum" as a transaction and trigger wraparound protection against it? > > Greetings, > > Andres Freund
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Andres Freund
Date:
Hi, On 2019-06-07 14:59:11 -0500, Thierry Husson wrote: > Thank you for your anwser. Precisions bellow: > Andres Freund <andres@anarazel.de> a écrit : > > On 2019-06-07 18:22:20 +0000, PG Bug reporting form wrote: > > > I was doing tables COPY between my old server with PG10.8 and the new one > > > with 12Beta1. After each table is done, I make a vacuum of it. > > > However PG12 has stopped working for wraparound protection. I was doing it > > > on around 10 cpu, 1 table by cpu. > > > > That was a new postgres 12 cluster, not a pg_upgraded one? And you just > > did a bunch of COPYs? How many? > > > > I'm not clear as to how the cluster got to wraparound if that's the > > scenario. We use one xid per transaction, and copy doesn't use multiple > > transactions internally. Any chance you have triggers on these tables > > that use savepoints internally? > > Yes it was a new cluster. Around 30 copy were done. > Yes there is a trigger to manage partitions. Around 1200 tables were > created. 10 billions records transfered, I need to tranfert 180BR over 1700 > tables. > I just realize I made vacuum on partitions for the first 8BR rows and forgot > for the last 2BR That would explain the wraparound protection. Do those triggers use savepoints / EXCEPTION handling? Might be worthwhile to check - independent of this issue - if you still need the partition handling via trigger, now that pg's builtin partitioning can handle COPY (and likely *much* faster). > > Could you also show > > > > SELECT oid, datname, datfrozenxid, age(datfrozenxid), datminmxid, > > mxid_age(datminmxid) FROM pg_database ORDER BY age(datfrozenxid) DESC; > oid | datname | datfrozenxid | age | datminmxid | mxid_age > -------+-----------+--------------+------------+------------+---------- > 16394 | emet_zhen | 36464 | 2146483652 | 1 | 0 Ok, so it's xids, and clearly not multixids. Could you connect to emet_zhen and show the output of: SELECT oid, oid::regclass, relkind, relfrozenxid, age(relfrozenxid) FROM pg_class WHERE relfrozenxid <> 0 AND age(relfrozenxid)> 1800000000 ORDER BY age(relfrozenxid) DESC; that will tell us which relations need to be vacuumed, and then we can see why that doesn't work. > Could it be that PG12 considers "vacuum" as a transaction and trigger > wraparound protection against it? I'm still somewhat confused - the output you showed didn't include vacuum failing, as far as I can tell? - Andres
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Thierry Husson
Date:
Thanks again Andres, Andres Freund <andres@anarazel.de> a écrit : > Hi, > > On 2019-06-07 14:59:11 -0500, Thierry Husson wrote: >> Thank you for your anwser. Precisions bellow: >> Andres Freund <andres@anarazel.de> a écrit : >> > On 2019-06-07 18:22:20 +0000, PG Bug reporting form wrote: >> > > I was doing tables COPY between my old server with PG10.8 and >> the new one >> > > with 12Beta1. After each table is done, I make a vacuum of it. >> > > However PG12 has stopped working for wraparound protection. I >> was doing it >> > > on around 10 cpu, 1 table by cpu. >> > >> > That was a new postgres 12 cluster, not a pg_upgraded one? And you just >> > did a bunch of COPYs? How many? >> > >> > I'm not clear as to how the cluster got to wraparound if that's the >> > scenario. We use one xid per transaction, and copy doesn't use multiple >> > transactions internally. Any chance you have triggers on these tables >> > that use savepoints internally? >> >> Yes it was a new cluster. Around 30 copy were done. >> Yes there is a trigger to manage partitions. Around 1200 tables were >> created. 10 billions records transfered, I need to tranfert 180BR over 1700 >> tables. >> I just realize I made vacuum on partitions for the first 8BR rows and forgot >> for the last 2BR That would explain the wraparound protection. > > Do those triggers use savepoints / EXCEPTION handling? > > Might be worthwhile to check - independent of this issue - if you still > need the partition handling via trigger, now that pg's builtin > partitioning can handle COPY (and likely *much* faster). Yes, those triggers use exception handling (if partition doesn't exist, create it) but no savepoint. Thanks for the suggestion, I take that in note! >> > Could you also show >> > >> > SELECT oid, datname, datfrozenxid, age(datfrozenxid), datminmxid, >> > mxid_age(datminmxid) FROM pg_database ORDER BY age(datfrozenxid) DESC; >> oid | datname | datfrozenxid | age | datminmxid | mxid_age >> -------+-----------+--------------+------------+------------+---------- >> 16394 | emet_zhen | 36464 | 2146483652 | 1 | 0 > > Ok, so it's xids, and clearly not multixids. Could you connect to > emet_zhen and show the output of: > > SELECT oid, oid::regclass, relkind, relfrozenxid, age(relfrozenxid) > FROM pg_class WHERE relfrozenxid <> 0 AND age(relfrozenxid) > > 1800000000 ORDER BY age(relfrozenxid) DESC; > that will tell us which relations need to be vacuumed, and then we can > see why that doesn't work. >> Could it be that PG12 considers "vacuum" as a transaction and trigger >> wraparound protection against it? > > I'm still somewhat confused - the output you showed didn't include > vacuum failing, as far as I can tell? > > - Andres oid | oid | relkind | relfrozenxid | age --------+--------------------------------------+---------+--------------+------------ 460564 | pg_temp_3.cur_semt700_progsync_4996 | r | 36464 | 2146483652 460764 | pg_temp_8.cur_semt700_progsync_5568 | r | 19836544 | 2126683572 460718 | pg_temp_4.cur_semt700_progsync_5564 | r | 19836544 | 2126683572 460721 | pg_temp_5.cur_semt700_progsync_5565 | r | 19836544 | 2126683572 461068 | pg_temp_22.cur_semt700_progsync_5581 | r | 19836544 | 2126683572 These are temporary tables to manage concurrency & server load. It seems the sudden disconnection due to wraparound protection didn't get them removed. I removed them manually under single mode and there is no more warning now, vacuum command included. Your command is very interesting to know. It annoying PG create a xId for empty temporary tables. You can't clear it with a vacuum as there is no record. I have to terminate connexions of my deamon processes daily to avoid wraparound protection. Is there a way to tell PG to forget these tables on its age estimation? Thank you so much Andres! You saved me! Thierry
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Andres Freund
Date:
Hi, On 2019-06-07 16:40:27 -0500, Thierry Husson wrote: > oid | oid | relkind | relfrozenxid | > age > --------+--------------------------------------+---------+--------------+------------ > 460564 | pg_temp_3.cur_semt700_progsync_4996 | r | 36464 | > 2146483652 > 460764 | pg_temp_8.cur_semt700_progsync_5568 | r | 19836544 | > 2126683572 > 460718 | pg_temp_4.cur_semt700_progsync_5564 | r | 19836544 | > 2126683572 > 460721 | pg_temp_5.cur_semt700_progsync_5565 | r | 19836544 | > 2126683572 > 461068 | pg_temp_22.cur_semt700_progsync_5581 | r | 19836544 | > 2126683572 > > These are temporary tables to manage concurrency & server load. It seems the > sudden disconnection due to wraparound protection didn't get them removed. I > removed them manually under single mode and there is no more warning now, > vacuum command included. Your command is very interesting to know. Hm. But you do have autovacuum enabled, is that right? If enabled, have you tuned it at all? It seems quite possible that given your load (10 parallel loads), the default settings werent aggressive enough. > It annoying PG create a xId for empty temporary tables. You can't clear it > with a vacuum as there is no record. I have to terminate connexions of my > deamon processes daily to avoid wraparound protection. Is there a way to > tell PG to forget these tables on its age estimation? Normally postgres would drop such "orphaned" temp tables on its own, in autovacuum (triggering it when close to a wraparound, even if disabled). But if it can't keep up for some reason, then that's not necessarily good enough with very rapid xid usage as you seem to have. I'll start a thread about this subtopic on -hackers. Greetings, Andres Freund
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Thierry Husson
Date:
Hi Andres, Andres Freund <andres@anarazel.de> a écrit : > Hi, > > On 2019-06-07 16:40:27 -0500, Thierry Husson wrote: >> oid | oid | relkind | relfrozenxid | >> age >> --------+--------------------------------------+---------+--------------+------------ >> 460564 | pg_temp_3.cur_semt700_progsync_4996 | r | 36464 | >> 2146483652 >> 460764 | pg_temp_8.cur_semt700_progsync_5568 | r | 19836544 | >> 2126683572 >> 460718 | pg_temp_4.cur_semt700_progsync_5564 | r | 19836544 | >> 2126683572 >> 460721 | pg_temp_5.cur_semt700_progsync_5565 | r | 19836544 | >> 2126683572 >> 461068 | pg_temp_22.cur_semt700_progsync_5581 | r | 19836544 | >> 2126683572 >> >> These are temporary tables to manage concurrency & server load. It seems the >> sudden disconnection due to wraparound protection didn't get them removed. I >> removed them manually under single mode and there is no more warning now, >> vacuum command included. Your command is very interesting to know. > > Hm. But you do have autovacuum enabled, is that right? If enabled, have > you tuned it at all? It seems quite possible that given your load (10 > parallel loads), the default settings werent aggressive enough. Yes autovacuum is enabled. Aggressiveness was effectively a recent problem I had and putting its max_worker to 8 wasn't a solution, there were all busy 24/7 and I had to do a daily script to help it. The solution was to push vacuum_cost_limit to 2000, since then it works like a charm. Another issue was autovaccuums were taking the lock over my running vacuums, making them waiting for 5 days instead of taking around 1 hour. I could do another post on that but it's not PG12 specific, I have it with 10.x >> It annoying PG create a xId for empty temporary tables. You can't clear it >> with a vacuum as there is no record. I have to terminate connexions of my >> deamon processes daily to avoid wraparound protection. Is there a way to >> tell PG to forget these tables on its age estimation? > > Normally postgres would drop such "orphaned" temp tables on its own, in > autovacuum (triggering it when close to a wraparound, even if > disabled). But if it can't keep up for some reason, then that's not > necessarily good enough with very rapid xid usage as you seem to have. > > I'll start a thread about this subtopic on -hackers. > Greetings, > > Andres Freund What is the link to this forum? I'm very very interested to follow that subtopic & I could make some tests if necessary. Have a great weekend & thanks for your time :) Thierry
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Andres Freund
Date:
Hi, On 2019-06-07 17:49:52 -0500, Thierry Husson wrote: > Andres Freund <andres@anarazel.de> a écrit : > > I'll start a thread about this subtopic on -hackers. > What is the link to this forum? I'm very very interested to follow that > subtopic & I could make some tests if necessary. It's now (was interrupted by something else) at: https://postgr.es/m/20190607225843.z73jqqyy6hhc6qnp%40alap3.anarazel.de and you're CCed in the discussion. Have a nice weekend as well! Andres
Re: BUG #15840: Vacuum does not work after database stopped forwraparound protection. Database seems unrepearable.
From
Alvaro Herrera
Date:
On 2019-Jun-07, Thierry Husson wrote: > Yes autovacuum is enabled. Aggressiveness was effectively a recent problem I > had and putting its max_worker to 8 wasn't a solution, there were all busy > 24/7 and I had to do a daily script to help it. The solution was to push > vacuum_cost_limit to 2000, since then it works like a charm. Note the I/O cost balancing thing, which seems to bite many people: if you raise max_workers without changing cost_delay or cost_limit, it doesn't have much of an effect, because each worker goes slower to accomodate. Raising the cost limit (or lowering the cost delay) does have a useful impact. In pg12 we changed the default cost_delay to 2ms (from 20ms). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services