Thread: Unable to startup postgres: Could not read from file "pg_clog/00EC"
Hello,

Yesterday, we experienced some issues with our Postgres installation (v9.6 running on macOS 10.12). It seems that the machine was automatically rebooted for a yet unknown reason, and afterwards we were unable to start the Postgres service. The postgres log shows the following:

2020-02-04 15:20:41 CET LOG: database system was interrupted; last known up at 2020-02-04 15:18:34 CET
2020-02-04 15:20:43 CET LOG: database system was not properly shut down; automatic recovery in progress
2020-02-04 15:20:44 CET LOG: invalid record length at 14A/9E426DF8: wanted 24, got 0
2020-02-04 15:20:44 CET LOG: redo is not required
2020-02-04 15:20:44 CET FATAL: could not access status of transaction 247890764
2020-02-04 15:20:44 CET DETAIL: Could not read from file "pg_clog/00EC" at offset 106496: Undefined error: 0.
2020-02-04 15:20:44 CET LOG: startup process (PID 403) exited with exit code 1
2020-02-04 15:20:44 CET LOG: aborting startup due to startup process failure
2020-02-04 15:20:44 CET LOG: database system is shut down

After some searching, I found someone who had had a similar issue and was able to resolve it by overwriting the file in pg_clog. So I tried the following command:

dd if=/dev/zero of=[dbpath]/pg_clog/00EC bs=256k count=1

and now the service is running again. But I am worried that there might still be some issues that we haven't noticed yet. I also have no idea what caused this error in the first place. It might have been the reboot, but maybe the reboot was a result of a Postgres issue.

Is there anything specific I should check in our postgres installation / database to make sure it is running ok now? Any way to see what the consequences were of purging that one pg_clog file?

Best regards,

Nick Renders
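For what it's worth, the segment name and offset in that FATAL line follow directly from the transaction id: each pg_clog segment is 256 kB, each byte holds the status of 4 transactions (2 bits each), and reads happen in 8 kB pages. A rough sketch of the arithmetic, assuming those default sizes:

  xid=247890764
  printf '%X\n' $(( xid / 1048576 ))               # EC -> segment file "00EC" (zero-padded to 4 digits)
  echo $(( (xid % 1048576) / 4 ))                  # 106707 -> byte position inside the segment
  echo $(( (xid % 1048576) / 4 / 8192 * 8192 ))    # 106496 -> the 8 kB page offset quoted in the error

So transaction 247890764 lives in segment 00EC at page offset 106496, which matches the log exactly, and the 256 kB written by the dd command is the full size of one segment.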
On 2020-Feb-05, Nick Renders wrote:

> Is there anything specific I should check in our postgres installation /
> database to make sure it is running ok now? Anyway to see what the
> consequences were of purging that one pg_clog file?

Losing pg_clog files is pretty bad, and should not happen; then again, this might have been something else (ie. the file was maybe not lost). That said, wrongly overwriting files is even worse.

By zeroing an existing pg_clog file, you marked a bunch of transactions as aborted. Your data is now probably inconsistent, if not downright corrupt. I would be looking for my most recent backup ...

If you're very lucky, your database might be pg_dumpable. I would try that, followed by restoring it in a separate clean instance and seeing what happens.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
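A minimal sketch of that dump-and-restore test; the database name, paths and port below are illustrative, not taken from the thread:

  # dump the possibly damaged database; any unreadable row will abort the dump
  pg_dump -Fc -f /tmp/mydb.dump mydb

  # restore into a freshly initialised, separate cluster on another port
  initdb -D /tmp/pg_clean
  pg_ctl -D /tmp/pg_clean -o "-p 5433" -w start
  createdb -p 5433 mydb_check
  pg_restore -p 5433 -d mydb_check /tmp/mydb.dump

If both steps finish cleanly, whatever pg_dump can reach is at least structurally readable; it says nothing about rows that have quietly disappeared because their transactions now look aborted.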
Thank you for the feedback, Alvaro.

Unfortunately, the database is no longer "dumpable". We were able to do a pg_dump yesterday morning (12 hours after the crash + purging the pg_clog), but if we try one now, we get the following error:

unexpected chunk number 1 (expected 0) for toast value 8282331 in pg_toast_38651

Looking at our data, there seem to be 6 tables that have corrupt records. Doing a SELECT * for one of those records will return a similar error:

missing chunk number 0 for toast value 8288522 in pg_toast_5572299

What is the best way to go from here? Is tracking down these corrupt records and deleting them the best / only solution? Is there a way to determine if there are issues with new data (after the crash)?

Any help and advice is very much appreciated.

Thanks,

Nick Renders

On 5 Feb 2020, at 12:51, Alvaro Herrera wrote:

> On 2020-Feb-05, Nick Renders wrote:
>
>> Is there anything specific I should check in our postgres
>> installation /
>> database to make sure it is running ok now? Anyway to see what the
>> consequences were of purging that one pg_clog file?
>
> Losing pg_clog files is pretty bad, and should not happen; then again,
> this might have been something else (ie. the file was maybe not lost).
> That said, wrongly overwriting files is even worse.
>
> By zeroing an existing pg_clog file, you marked a bunch of
> transactions as aborted. Your data is now probably inconsistent, if
> not downright corrupt. I would be looking for my most recent backup ...
>
> If you're very lucky, your database might be pg_dumpable. I would try
> that, followed by restoring it in a separate clean instance and seeing
> what happens.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
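One way to narrow that down, as a sketch only (mydb, mytable and the id column are placeholders): the first query maps a pg_toast relation back to the table that owns it, and the loop then detoasts one row at a time so the failing rows stand out:

  # which table owns pg_toast_38651?
  psql -d mydb -c "SELECT n.nspname, c.relname
                     FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
                    WHERE c.reltoastrelid = 'pg_toast.pg_toast_38651'::regclass;"

  # walk that table one row at a time; a row with broken TOAST data makes the
  # SELECT fail, and its id gets logged
  for id in $(psql -d mydb -At -c "SELECT id FROM mytable ORDER BY id"); do
    psql -d mydb -v ON_ERROR_STOP=1 -c "SELECT * FROM mytable WHERE id = $id" > /dev/null 2>&1 \
      || echo "corrupt row: id=$id"
  done

The same per-row scan limited to rows inserted after the crash would give at least some indication of whether new data is affected.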
On Fri, Feb 7, 2020 at 1:47 AM Nick Renders <postgres@arcict.com> wrote:

> Thank you for the feedback, Alvaro.
>
> Unfortunately, the database is no longer "dumpable". We were able to do
> a pg_dump yesterday morning (12 hours after the crash + purging the
> pg_clog) but if we try one now, we get the following error:
>
> unexpected chunk number 1 (expected 0) for toast value 8282331 in
> pg_toast_38651
>
> Looking at our data, there seem to be 6 tables that have corrupt
> records. Doing a SELECT * for one of those records, will return a
> similar error:
>
> missing chunk number 0 for toast value 8288522 in pg_toast_5572299
>
> What is the best way to go from here? Is tracking down these corrupt
> records and deleting them the best / only solution?
> Is there a way to determine of there are issues with new data (after the
> crash)?
>
> Any help and advice is very much appreciated.

This error indicates that the file did exist already, it was just shorter than we expected:

2020-02-04 15:20:44 CET DETAIL: Could not read from file "pg_clog/00EC" at offset 106496: Undefined error: 0.

What was the length of the file before you overwrote it? Are there 00EB and 00ED files, and if so what size? When your server rebooted, did crash recovery run or had it shut down cleanly? Do you know if the machine lost power, or the kernel crashed, or if it was a normal reboot? What are your settings for "fsync" and "wal_sync_method"? What is the output of pg_controldata -D pgdata?

I wonder if that part of the clog file was supposed to be created before the checkpoint (ie the checkpoint is somehow borked), or was supposed to be created during recovery after that checkpoint (something else is borked, but I don't know what), or if the xid is somehow corrupted.

Here's a dirty trick that might help rescue some data. Assuming you have a copy of the original file before you zeroed it, you could write a 256kb file full of 0x55 (that's 01010101 and represents 4 commits, so if you fill the file up with that it means 'all transactions in this range committed', which is probably closer to the truth than all zeroes), and then copy the original shorter file over the top of it, so that at least the range of transactions represented by the earlier part of the file that did make it to disk is preserved, and we just have bogus force-everything-to-look-committed data after that.

But as Alvaro said, this is a pretty bad situation: this is key metadata used to interpret all other data files, so all bets are off here. This is restore-from-backups territory.
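A rough sketch of that trick, assuming the server is stopped and the pre-zeroing copy of the segment was kept as 00EC.orig (the file names here are illustrative):

  cd [dbpath]/pg_clog

  # 256 kB of 0x55: every transaction in the segment reads as committed
  head -c 262144 /dev/zero | tr '\0' '\125' > 00EC.new     # 0x55 is octal 125

  # lay the saved, shorter original over the start of it, keeping the 0x55 tail
  dd if=00EC.orig of=00EC.new conv=notrunc

  # swap it in, keeping the zeroed file aside just in case
  mv 00EC 00EC.zeroed
  mv 00EC.new 00EC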
Alvaro, Thomas,
We went with the suggestion of putting a 256 kB file full of 0x55 (that's 01010101, i.e. 4 commits per byte) in place.
It did the job: we were able to restart the server.
Judging by our data, it is a "better" way than the zeroed file, with less garbage.
The TOAST issues, however, are still present.
To make good use of our weekend, we set up a new server with version 12.1, but had to fall back to 11.6 (see other post).
We have kept our "old" server active to see if we can learn some more from these hard times.
Thanks for the help
Marc