Thread: Vacuum full crash
Hello list
an interrupted vacuum full has just caused a PG instance to restart and recover. Background:
select version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 8.3.1 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
(1 row)
I have a largish ( >1TB ) database which is kind of a data warehouse. Recently I had to do some major operations to some of the tables (update all rows in a table etc) which caused major bloat. To remove the bloat, I run VACUUM FULL VERBOSE on the bloated tables. Before the vacuum got finished, I had to abort it due to problems with my own laptop. When I hit ctrl+c to the vacuum, the PG instance went suddenly to recover mode. The logs showed this:
2008-03-29 22:25:15 EET [26841]: [1-1] ERROR: canceling statement due to user request
2008-03-29 22:25:15 EET [26841]: [2-1] STATEMENT: vacuum full verbose xyz ;
2008-03-29 22:25:15 EET [26841]: [1-1] ERROR: canceling statement due to user request
2008-03-29 22:25:15 EET [26841]: [2-1] STATEMENT: vacuum full verbose xyz ;
2008-03-29 22:25:15 EET [26841]: [3-1] PANIC: cannot abort transaction 3778747509, it was already committed
2008-03-29 22:25:15 EET [6476]: [4-1] LOG: server process (PID 26841) was terminated by signal 6: Aborted
2008-03-29 22:25:15 EET [6476]: [5-1] LOG: terminating any other active server processes
2008-03-29 22:25:15 EET [24814]: [48-1] WARNING: terminating connection because of crash of another server process
2008-03-29 22:25:15 EET [24814]: [49-1] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-03-29 22:25:15 EET [24814]: [50-1] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2008-03-29 22:25:15 EET [26841]: [3-1] PANIC: cannot abort transaction 3778747509, it was already committed
2008-03-29 22:25:15 EET [6476]: [4-1] LOG: server process (PID 26841) was terminated by signal 6: Aborted
2008-03-29 22:25:15 EET [6476]: [5-1] LOG: terminating any other active server processes
2008-03-29 22:25:15 EET [24814]: [48-1] WARNING: terminating connection because of crash of another server process
2008-03-29 22:25:15 EET [24814]: [49-1] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-03-29 22:25:15 EET [24814]: [50-1] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2008-03-29 22:25:15 EET [6476]: [6-1] LOG: archiver process (PID 6489) exited with exit code 1
2008-03-29 22:25:15 EET [18228]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:16 EET [6476]: [7-1] LOG: all server processes terminated; reinitializing
2008-03-29 22:25:16 EET [18229]: [1-1] LOG: database system was interrupted; last known up at 2008-03-29 22:20:24 EET
2008-03-29 22:25:16 EET [18229]: [2-1] LOG: database system was not properly shut down; automatic recovery in progress
2008-03-29 22:25:16 EET [18229]: [3-1] LOG: redo starts at CB5/16399698
2008-03-29 22:25:22 EET [18229]: [4-1] LOG: unexpected pageaddr CB4/76FF6000 in log file 3253, segment 35, offset 16736256
2008-03-29 22:25:22 EET [18229]: [5-1] LOG: redo done at CB5/23FF4C80
2008-03-29 22:25:22 EET [18229]: [6-1] LOG: last completed transaction was at log time 2008-03-29 22:22:47.931231+02
2008-03-29 22:25:23 EET [18336]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:27 EET [18337]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:30 EET [18346]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:41 EET [18424]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:43 EET [18427]: [1-1] LOG: autovacuum launcher started
2008-03-29 22:25:43 EET [6476]: [8-1] LOG: database system is ready to accept connections
Seems quite serious to me ("cannot abort a transaction that has already committed"), what can cause such behaviour?
Regards
Mikko
an interrupted vacuum full has just caused a PG instance to restart and recover. Background:
select version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 8.3.1 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
(1 row)
I have a largish ( >1TB ) database which is kind of a data warehouse. Recently I had to do some major operations to some of the tables (update all rows in a table etc) which caused major bloat. To remove the bloat, I run VACUUM FULL VERBOSE on the bloated tables. Before the vacuum got finished, I had to abort it due to problems with my own laptop. When I hit ctrl+c to the vacuum, the PG instance went suddenly to recover mode. The logs showed this:
2008-03-29 22:25:15 EET [26841]: [1-1] ERROR: canceling statement due to user request
2008-03-29 22:25:15 EET [26841]: [2-1] STATEMENT: vacuum full verbose xyz ;
2008-03-29 22:25:15 EET [26841]: [1-1] ERROR: canceling statement due to user request
2008-03-29 22:25:15 EET [26841]: [2-1] STATEMENT: vacuum full verbose xyz ;
2008-03-29 22:25:15 EET [26841]: [3-1] PANIC: cannot abort transaction 3778747509, it was already committed
2008-03-29 22:25:15 EET [6476]: [4-1] LOG: server process (PID 26841) was terminated by signal 6: Aborted
2008-03-29 22:25:15 EET [6476]: [5-1] LOG: terminating any other active server processes
2008-03-29 22:25:15 EET [24814]: [48-1] WARNING: terminating connection because of crash of another server process
2008-03-29 22:25:15 EET [24814]: [49-1] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-03-29 22:25:15 EET [24814]: [50-1] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2008-03-29 22:25:15 EET [26841]: [3-1] PANIC: cannot abort transaction 3778747509, it was already committed
2008-03-29 22:25:15 EET [6476]: [4-1] LOG: server process (PID 26841) was terminated by signal 6: Aborted
2008-03-29 22:25:15 EET [6476]: [5-1] LOG: terminating any other active server processes
2008-03-29 22:25:15 EET [24814]: [48-1] WARNING: terminating connection because of crash of another server process
2008-03-29 22:25:15 EET [24814]: [49-1] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-03-29 22:25:15 EET [24814]: [50-1] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2008-03-29 22:25:15 EET [6476]: [6-1] LOG: archiver process (PID 6489) exited with exit code 1
2008-03-29 22:25:15 EET [18228]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:16 EET [6476]: [7-1] LOG: all server processes terminated; reinitializing
2008-03-29 22:25:16 EET [18229]: [1-1] LOG: database system was interrupted; last known up at 2008-03-29 22:20:24 EET
2008-03-29 22:25:16 EET [18229]: [2-1] LOG: database system was not properly shut down; automatic recovery in progress
2008-03-29 22:25:16 EET [18229]: [3-1] LOG: redo starts at CB5/16399698
2008-03-29 22:25:22 EET [18229]: [4-1] LOG: unexpected pageaddr CB4/76FF6000 in log file 3253, segment 35, offset 16736256
2008-03-29 22:25:22 EET [18229]: [5-1] LOG: redo done at CB5/23FF4C80
2008-03-29 22:25:22 EET [18229]: [6-1] LOG: last completed transaction was at log time 2008-03-29 22:22:47.931231+02
2008-03-29 22:25:23 EET [18336]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:27 EET [18337]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:30 EET [18346]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:41 EET [18424]: [1-1] FATAL: the database system is in recovery mode
2008-03-29 22:25:43 EET [18427]: [1-1] LOG: autovacuum launcher started
2008-03-29 22:25:43 EET [6476]: [8-1] LOG: database system is ready to accept connections
Seems quite serious to me ("cannot abort a transaction that has already committed"), what can cause such behaviour?
Regards
Mikko
"Mikko Partio" <mpartio@gmail.com> writes: > 2008-03-29 22:25:15 EET [26841]: [3-1] PANIC: cannot abort transaction > 3778747509, it was already committed Yeah, this is a well-known problem with VACUUM FULL ... regards, tom lane
On Sun, Mar 30, 2008 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Mikko Partio" <mpartio@gmail.com> writes:Yeah, this is a well-known problem with VACUUM FULL ...
> 2008-03-29 22:25:15 EET [26841]: [3-1] PANIC: cannot abort transaction
> 3778747509, it was already committed
Ok, that's good to know. Is there a bugfix on the way or is it more like a 'feature'?
Regards
Mikko
"Mikko Partio" <mpartio@gmail.com> writes: > On Sun, Mar 30, 2008 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> "Mikko Partio" <mpartio@gmail.com> writes: >>> 2008-03-29 22:25:15 EET [26841]: [3-1] PANIC: cannot abort transaction >>> 3778747509, it was already committed >> >> Yeah, this is a well-known problem with VACUUM FULL ... > Ok, that's good to know. Is there a bugfix on the way or is it more like a > 'feature'? It would be good to fix it I suppose, but what with VACUUM FULL being on the edge of deprecation anyway, it's hard to muster enthusiasm for doing a (probably) large additional amount of work on it, I think most hackers would prefer to spend time on making it unnecessary than making it more bulletproof. Note that there is (AFAIK) no risk of loss of committed data from this bug. The fact that the PANIC kills all your other sessions is annoying, but in the scenarios where this is likely to happen you're trying to stop them all anyway ... regards, tom lane
On Sun, Mar 30, 2008 at 6:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Interesting. What is planned substitute for VACUUM FULL? CLUSTER?
Regards
Mikko
It would be good to fix it I suppose, but what with VACUUM FULL being on
the edge of deprecation anyway, it's hard to muster enthusiasm for doing
a (probably) large additional amount of work on it, I think most
hackers would prefer to spend time on making it unnecessary than making
it more bulletproof.
Interesting. What is planned substitute for VACUUM FULL? CLUSTER?
Regards
Mikko
"Mikko Partio" <mpartio@gmail.com> writes: > Interesting. What is planned substitute for VACUUM FULL? CLUSTER? What's been discussed is something that works roughly like CLUSTER but doesn't bother to sort the data any particular way (ie, just do a seqscan not an indexscan in the data-copying phase). This would certainly dominate VACUUM FULL in contexts where you've removed a large fraction of the rows. In cases where you haven't, the large transient space requirement (2x the table size) would be objectionable, but it's not real clear why you need a VACUUM FULL rather than plain VACUUM in such cases anyway. A lot of the context for this arises from recent and planned improvements, notably (1) CLUSTER is MVCC-safe now, removing one of the core arguments for not using it for routine cleanup. (2) Planned fixes like the dead space map will substantially reduce the cost of running frequent plain vacuums. That together with autovacuum (which we could make more aggressive in consequence) should help to avoid getting into situations where VACUUM FULL is needed in the first place. We certainly couldn't have considered eliminating VACUUM FULL before this, but it's starting to seem a viable option to me. regards, tom lane