bug, bad memory, or bad disk? - Mailing list pgsql-general
From | Ben Chobot |
---|---|
Subject | bug, bad memory, or bad disk? |
Date | |
Msg-id | 77BBEB20-89D9-47C5-8F36-41DF5E05355A@silentmedia.com |
List | pgsql-general |
We have a Postgres server (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit) which does streaming replication to some slaves, and has another set of slaves reading the wal archive for wal-based replication. We had a bit of fun yesterday where, suddenly, the master started spewing errors like:

2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1] ERROR: invalid memory alloc request size 1968078400
2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1] ERROR: invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1] ERROR: invalid memory alloc request size 1968078400
2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1] ERROR: invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1] ERROR: invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945

There didn't seem to be much correlation to which files were affected, and this was a critical server, so once we realized a simple reindex wasn't going to solve things, we shut it down and brought up a slave as the new master db.

While that seemed to fix these issues, we soon noticed problems with missing clog files. The missing clogs were outside the range of the existing clogs, so we tried using dummy clog files. It didn't help, and running pg_check we found that one block of one table was definitely corrupt. Worse, that corruption had spread to all our replicas.

I know this is a little sparse on details, but my questions are:

1. What kind of fault should I be looking to fix? Because it spread to all the replicas, both those that stream and those that replicate by replaying wals in the wal archive, I assume it's not a storage issue. (My understanding is that streaming replicas stream their changes from memory, not from wals.) So that leaves bad memory on the master, or a bug in postgres. Or a flawed assumption... :)

2. Is it possible that the corruption that was on the master got replicated to the slaves when I tried to cleanly shut down the master before bringing up a new slave as the new master and switching the other slaves over to replicating from that?
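
For anyone wanting to check the "not much correlation" impression against a log like the one quoted above, here is a minimal sketch that tallies the errors per relation file and works out where a reported target block would live on disk. It is not something from the original report: the function names and regexes are hypothetical, and the 8 kB block size and 131072-block (1 GB) segment size are PostgreSQL's usual compile-time defaults, assumed here rather than confirmed for this build.

```python
import re
from collections import Counter

# Assumed PostgreSQL compile-time defaults (not confirmed for this server).
BLOCK_SIZE = 8192             # bytes per block (BLCKSZ)
BLOCKS_PER_SEGMENT = 131072   # blocks per 1 GB segment file (RELSEG_SIZE)

# Patterns for the three error shapes seen in the quoted log.
PAGE_HEADER_RE = re.compile(r"invalid page header in block (\d+) of relation (\S+)")
OPEN_FILE_RE = re.compile(r'could not open file "([^"]+)" \(target block (\d+)\)')
ALLOC_RE = re.compile(r"invalid memory alloc request size (\d+)")


def relation_of(path):
    """Drop a trailing .N segment suffix so all segments group under one relation."""
    return re.sub(r"\.\d+$", "", path)


def tally(lines):
    """Return (errors per relation path, occurrences per alloc request size)."""
    per_relation = Counter()
    alloc_sizes = Counter()
    for line in lines:
        m = PAGE_HEADER_RE.search(line)
        if m:
            per_relation[relation_of(m.group(2))] += 1
            continue
        m = OPEN_FILE_RE.search(line)
        if m:
            per_relation[relation_of(m.group(1))] += 1
            continue
        m = ALLOC_RE.search(line)
        if m:
            alloc_sizes[int(m.group(1))] += 1
    return per_relation, alloc_sizes


def locate_block(block):
    """Map a block number to (segment number, block within segment, byte offset)."""
    segment, offset = divmod(block, BLOCKS_PER_SEGMENT)
    return segment, offset, block * BLOCK_SIZE


if __name__ == "__main__":
    import sys
    relations, allocs = tally(sys.stdin)
    for rel, count in relations.most_common():
        print(count, rel)
    for size, count in allocs.most_common():
        print(count, "alloc request of", size, "bytes")
```

Fed the log on standard input, this prints one count per relation. Under the assumed defaults, `locate_block(3936767042)` gives segment 30035, roughly 29 TiB into the relation, which would suggest the target block number itself is garbage rather than that a real segment file went missing.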