Re: [GENERAL] Recovery Assistance - Mailing list pgsql-general
From | Adrian Klaver |
---|---|
Subject | Re: [GENERAL] Recovery Assistance |
Date | |
Msg-id | d1ad4bc1-4497-a4ba-0616-3b571bfa8b1b@aklaver.com Whole thread Raw |
In response to | Re: [GENERAL] Recovery Assistance (Brian Mills <brian@trybooking.com>) |
Responses |
Re: [GENERAL] Recovery Assistance
|
List | pgsql-general |
On 01/27/2017 05:40 PM, Brian Mills wrote: > First of all, Thank you for your time to assist me learning. I really > appreciate it. > > root# ps ax | grep post > 1364 ? Ss 0:00 /usr/lib/postfix/master > 5198 pts/3 S 0:00 su postgres > 5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D > /etc/postgresql/9.3/main > 5222 ? Ss 0:10 postgres: startup process recovering > 0000000100000005000000A3 > 7930 pts/4 S+ 0:00 grep --color=auto post So if you check back does the recovering XXXXXXXX part change? If so Postgres is walking through the WAL files as it should. > > Its a single machine postgres database server, so I'm assuming there is > no cluster log. If there is one, where would I look for it? > The only log in /var/log/postgres is postgresql-9.3-main.log That would be it. A single Postgres instance has multiple databases in it, by default it starts with template0, template1 and postgres databases. Then add whatever databases you created and you have a cluster of databases. > which shows (tail): > > 2017-01-27 20:27:01 AEDT LOG: database system was shut down at > 2017-01-27 20:26:29 AEDT > 2017-01-27 20:27:01 AEDT LOG: MultiXact member wraparound protections > are now enabled > 2017-01-27 20:27:01 AEDT LOG: autovacuum launcher started > 2017-01-27 20:27:01 AEDT LOG: database system is ready to accept > connections > 2017-01-27 20:27:02 AEDT LOG: incomplete startup packet > 2017-01-27 20:28:54 AEDT ERROR: unexpected chunk size 104 (expected > 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028 > 2017-01-27 20:28:54 AEDT STATEMENT: COPY public.bodycontent > (bodycontentid, body, contentid, bodytypeid) TO stdout; > 2017-01-27 20:30:13 AEDT LOG: received fast shutdown request > 2017-01-27 20:30:13 AEDT LOG: aborting any active transactions > 2017-01-27 20:30:13 AEDT LOG: autovacuum launcher shutting down > 2017-01-27 20:30:13 AEDT LOG: shutting down > 2017-01-27 20:30:13 AEDT LOG: database system is shut down That would be your Attempt 1 log. > > I ran the screen utility so I could leave the db started using > the pg_ctl command. The later logs in that session have not progressed, > its last entry is still > 2017-01-27 23:00:01 AEDT FATAL: the database system is starting up > Which is still later datetime than the /var/log/postgres... log. So it is just logging to stdout and not to the log file. > > Connection attempt shows the same log > postgres@atlassian:/home/myuser$ psql > psql: FATAL: the database system is starting up > > Nothing in the syslog that seems connected. > > *Brian Mills* > CTO > > > *Mob: *0410660003 <tel:0410660003> > *Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327 > <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600 > <tel:02%208064%203600> *|* *Brisbane* 07 3173 1570 <tel:07%203173%201570> > Level 1 *|* 600 Chapel Street *|* South > Yarra*|* VIC *|* 3141 *|* Australia > > <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking> <https://www.linkedin.com/company/trybooking-com> > > On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@aklaver.com > <mailto:adrian.klaver@aklaver.com>> wrote: > > On 01/27/2017 01:31 PM, Brian Mills wrote: > > Hi, > > I have a Atlassian Confluence Wiki that depends on postgres, but I > haven't much experience with postgres other than for this purpose. > > A few days ago, the hard disk filled, so all services stopped > working. > When the admin realised this he increased the disk size (its in > a cloud, > so that was easy to do) however I think the way it shutdown left > Postgres in an inconsistent state for some reason. > Now when I start it up I get this message in the log over and > over again: > "FATAL: the database system is starting up" > > I have a backup, which I have successfully recovered, but it is > 24 hours > old, the next backup was the cause of the disk filling. So I'm using > this as exercise in learning a bit more about postgres. > > I did some research and found a number of options. I took a file > level > backup with the service not running then tried 2 things. I > haven't found > much else on what to do though. > > *Attempt 1 - Reset Log * > > It sounded like this shouldn't be my first option (it wasn't) > but it did > sound like what I needed to do. > I ran this command > ./pg_resetxlog /var/lib/postgresql/9.3/main -f > It worked a treat, the database did startup ok. > However when I tried to dump the DB: > root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb > > $now-confluencedb.sql > pg_dump: Dumping the contents of table "bodycontent" failed: > PQgetResult() failed. > pg_dump: Error message from server: ERROR: unexpected chunk > size 104 > (expected 1996) in chunk 3 of 22 for toast value 48862 in > pg_toast_20028 > pg_dump: The command was: COPY public.bodycontent > (bodycontentid, body, > contentid, bodytypeid) TO stdout; > The dump failed, so I assume this did leave my database in an > inconsistent state. > > > *Attempt 2 - startup manually and let it try recovery* > > I restored my file level backup and started again. > This time I tried to startup manually on the command line to see the > output (I'd done it as a service startup a number of times to > nearly the > same effect too) > > postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D > /etc/postgresql/9.3/main start > server starting > postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 > 20:36:08 AEDT > LOG: database system was interrupted while in recovery at > 2017-01-27 > 20:13:26 AEDT > 2017-01-27 20:36:08 AEDT HINT: This probably means that some > data is > corrupted and you will have to use the last backup for recovery. > 2017-01-27 20:36:08 AEDT LOG: database system was not properly shut > down; automatic recovery in progress > 2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558 > 2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8 > 2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log > time 2017-01-24 02:08:00.023064+11 > 2017-01-27 23:00:01 AEDT FATAL: the database system is starting up > 2017-01-27 23:00:01 AEDT FATAL: the database system is starting up > > > What does ps ax | grep post show? > > Is the cluster set up to log to a file, if so what does it show? > > Does the system log show anything relevant? > > > I've left it that way for 12 hours, and its still not allowing > connections. > I assume its doing some kind of consistency check? > > > What does it say when you attempt a connection? > > > > Does anyone have any suggestions on what I should be doing to > try and > restore this database? > > - The amount of change is minimal in the DB after 6pm it should be > basically no change overnight. > - The log above seems to suggest it has completed a redo ok, is > that right? > - The last completed transaction time 2017-01-24 > 02:08:00.023064+11 the > log mentions would be fine to use, so loosing even a few hours > before > that would be more than adequate. > > I'm just not clear what the database is doing now, or how I > should be > trying to recover it. > > Any help anyone can suggest would be great! I've given myself this > weekend to attempt to recover more than the last backup, after > that I > need to restore the service for my team to use and suck up the > lost last > day of updates. > > Thanks, > Brian > > > > -- > Adrian Klaver > adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com> > > -- Adrian Klaver adrian.klaver@aklaver.com
pgsql-general by date: