database corruption

From
"Oliver Stöneberg"
Date:
Fri, 12 Feb 2016 10:56:04 +0100
We are running a 64-bit PostgreSQL 9.4.5 server on Windows Server
2012. The system is a virtual machine on a VMware ESX 6.0 server and
has 24 GB of memory. The database server is only accessed locally by
two services and there is only a single database on the server. The
disk is located on storage that is shared with lots of other
servers. The database server has fsync enabled.

A few weeks ago we already had data corruption when the disk was
full. There are other services running on the same machine that could
cause the disk to fill up (e.g. local caching when the network is
acting up). That has happened a few times so far, but the database
was never compromised. On that occasion we thought it was, but
fortunately we only lost a huge table/TOAST relation (300+ GB)
holding very verbose data that is not essential. That happened with
an earlier 9.4 version.

Today we encountered another data corruption after the disk was full.
It's much worse this time around, since it affects data that is
essential for the applications to run. After truncating the 300+ GB
table mentioned above, all the services were restarted and one of the
applications failed to start with the following database error:

Caused by: org.postgresql.util.PSQLException: ERROR: invalid memory alloc request size 18446744073709551613
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
    at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
    at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
    at joulex.cloud.data.tables.JxObjectStoreJsonTable.getAllJsonsAsHashMap(JxObjectStoreJsonTable.java:215)
    ... 16 more
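
As an aside, 18446744073709551613 is 2^64 - 3, i.e. -3 read as an
unsigned 64-bit value, so this looks like a garbage length header
reaching the allocator rather than a real allocation request. For
reference, here is a sketch of how the unreadable rows could be
narrowed down, assuming the main heap is still scannable and only the
TOAST data is broken ("suspect_table" and its column "payload" are
placeholder names):

-- probe tuples one at a time; detoasting fails only on damaged rows
DO $$
DECLARE
  r record;
  bad bigint := 0;
BEGIN
  FOR r IN SELECT ctid FROM suspect_table LOOP
    BEGIN
      -- length() forces the TOAST data to be fetched and decompressed
      PERFORM length(payload) FROM suspect_table WHERE ctid = r.ctid;
    EXCEPTION WHEN OTHERS THEN
      bad := bad + 1;
      RAISE NOTICE 'unreadable tuple at ctid %', r.ctid;
    END;
  END LOOP;
  RAISE NOTICE '% unreadable tuple(s) found', bad;
END $$;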

Checking the database logs, the problems seem to have started two
days ago:
2016-02-10 16:00:34 GMT ERROR:  invalid page in block 1255 of relation base/16387/3634911224
2016-02-10 16:00:34 GMT CONTEXT:  automatic vacuum of table "cloud.t_63d1c1eb806f4e25904b659b75176093.y2016_utilization"
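
For reference, the file name in that message can be tied back to a
relation with a catalog lookup along these lines (run while connected
to the database with OID 16387; it merely confirms what the CONTEXT
line already says):

SELECT oid::regclass AS relation
FROM pg_class
WHERE relfilenode = 3634911224;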

For some reason it doesn't seem to be caused by the disk being full,
since the database server was still able to write another 8 GB of log
files over the next two days and the system was still working
yesterday afternoon.
It also doesn't appear to be a hardware problem, since all the other
systems sharing the virtual host system and the storage show no
issues at all.

Unfortunately we don't have a recent backup of the database (a tool
to back up all the relevant data was only finished recently and had
not yet been set up for this system).

Something else worth noting: we disabled autovacuum on the TOAST
table of the 300+ GB table, since we perform INSERT INTO on that
table and the vacuum on it was causing a performance hit. The
anti-wraparound autovacuum still runs on it from time to time, and
such an autovacuum was still running after the machine ran out of
disk space and the services were restarted.
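
For reference, disabling autovacuum for just the TOAST side of a
table is done with a storage parameter along these lines ("big_table"
stands in for the real table name):

ALTER TABLE big_table SET (toast.autovacuum_enabled = false);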

Any help in recovering the data is appreciated and if there is more
information necessary on this I will try to provide it. Thanks in
advance.


Re: database corruption

From
Bill Moran
Date:
Fri, 12 Feb 2016 07:46:25 -0500
On Fri, 12 Feb 2016 10:56:04 +0100
"Oliver Stöneberg" <oliverst@online.de> wrote:

> We are running a 64-bit PostgreSQL 9.4.5 server on Windows Server
> 2012. The system is a virtual machine on a VMware ESX 6.0 server and
> has 24 GB of memory. The database server is only accessed locally by
> two services and there is only a single database on the server. The
> disk is located on storage that is shared with lots of other
> servers. The database server has fsync enabled.
>
> A few weeks ago we already had data corruption when the disk was
> full. There are other services running on the same machine that could
> cause the disk to fill up (e.g. local caching when the network is
> acting up). That has happened a few times so far, but the database
> was never compromised. On that occasion we thought it was, but
> fortunately we only lost a huge table/TOAST relation (300+ GB)
> holding very verbose data that is not essential. That happened with
> an earlier 9.4 version.
>
> Today we encountered another data corruption after the disk was full.
> It's much worse this time around, since it affects data that is
> essential for the applications to run. After truncating the 300+ GB
> table mentioned above, all the services were restarted and one of the
> applications failed to start with the following database error:
>
> Caused by: org.postgresql.util.PSQLException: ERROR: invalid memory alloc request size 18446744073709551613
>     at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>     at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>     at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>     at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>     at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>     at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>     at joulex.cloud.data.tables.JxObjectStoreJsonTable.getAllJsonsAsHashMap(JxObjectStoreJsonTable.java:215)
>     ... 16 more
>
> Checking the database logs, the problems seem to have started two
> days ago:
> 2016-02-10 16:00:34 GMT ERROR:  invalid page in block 1255 of relation base/16387/3634911224
> 2016-02-10 16:00:34 GMT CONTEXT:  automatic vacuum of table "cloud.t_63d1c1eb806f4e25904b659b75176093.y2016_utilization"
>
> For some reason it doesn't seem to be caused by the disk being full,
> since the database server was still able to write another 8 GB of log
> files over the next two days and the system was still working
> yesterday afternoon.
> It also doesn't appear to be a hardware problem, since all the other
> systems sharing the virtual host system and the storage show no
> issues at all.
>
> Unfortunately we don't have a recent backup of the database (a tool
> to back up all the relevant data was only finished recently and had
> not yet been set up for this system).
>
> Something else worth noting: we disabled autovacuum on the TOAST
> table of the 300+ GB table, since we perform INSERT INTO on that
> table and the vacuum on it was causing a performance hit. The
> anti-wraparound autovacuum still runs on it from time to time, and
> such an autovacuum was still running after the machine ran out of
> disk space and the services were restarted.
>
> Any help in recovering the data is appreciated and if there is more
> information necessary on this I will try to provide it. Thanks in
> advance.

You most likely have byte-level corruption. If you need that data back,
your best bet is to hire a company with PostgreSQL experts who know
the structure of how the data is stored on disk and can manipulate
the files directly to recover whatever hasn't been destroyed. If you
want to do it yourself, it will require you to understand the actual
byte sequences as they are stored on disk, as well as the system
PostgreSQL uses to identify database pages within the file system.
First you will have to identify the file that contains the corrupt
page, then you will have to modify the bytes in the page to make it
non-corrupt (do this with Postgres shut down). In any event, that
sort of thing is touchy work, even if you do understand it well, so
make sure you have a full copy of all database files so you can roll
back if you make things worse.
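
To make the page arithmetic concrete for the block from the log (a
sketch, assuming the default 8 kB page size and 1 GB segment files):

-- block 1255 of relation base/16387/3634911224
SELECT 1255 / 131072          AS segment_no,  -- 0: the file "3634911224" itself
       (1255 % 131072) * 8192 AS byte_offset; -- 10280960

A blunter last resort is the superuser-only zero_damaged_pages
setting (SET zero_damaged_pages = on;), which makes the server zero
out pages with invalid headers as it reads them; every row on such a
page is lost, so only try it against a file-level copy.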

Long term, you need to fix your hardware. Postgres doesn't corrupt
itself just because the disks fill up, so your hardware must be lying
about what writes completed successfully; otherwise, Postgres would
be able to recover after a restart.
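
Settings worth double-checking and posting here, since lost writes
are the usual culprit (all standard GUCs):

SHOW fsync;               -- must be on, as you say it is
SHOW full_page_writes;    -- protects against torn pages after a crash
SHOW synchronous_commit;  -- "off" trades durability, not consistency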

Beyond that, running Postgres on a filesystem that frequently fills up
is going to be problematic all around anyway. If you don't improve the
hardware situation, you're going to continue to have problems like this.

--
Bill Moran


Re: database corruption

From
Andrew Sullivan
Date:
On Fri, Feb 12, 2016 at 07:46:25AM -0500, Bill Moran wrote:
> Long term, you need to fix your hardware. Postgres doesn't corrupt
> itself just because the disks fill up, so your hardware must be lying
> about what writes completed successfully, otherwise, Postgres would
> be able to recover after a restart.

It may not be the hardware.  Depending on how VMware is configured, it
could just be a setting.  Also, something in the OP's message made me
think that this was _actually_ a network-attached disk, which can also
have such problems.  (But in general, I agree.)

A

--
Andrew Sullivan
ajs@crankycanuck.ca


Re: database corruption

From
Craig Ringer
Date:
On 12 February 2016 at 17:56, Oliver Stöneberg <oliverst@online.de> wrote:
A few weeks ago we already had data corruption when the disk was
full. There are other services running on the same machine that could
cause the disk to fill up (e.g. local caching when the network is
acting up). That has happened a few times so far, but the database
was never compromised. On that occasion we thought it was, but
fortunately we only lost a huge table/TOAST relation (300+ GB)
holding very verbose data that is not essential. That happened with
an earlier 9.4 version.

What remedial action was taken to restore the database to normal functionality at that time?

Is the current database a direct descendant of the one that was corrupted here? I.e., has it had a complete dump, initdb and restore since then, or not?
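
By "complete dump, initdb and restore" I mean a cycle along these
lines (a sketch; paths and the user name are placeholders):

pg_dumpall -U postgres -f full_dump.sql      # logical dump of the old cluster
initdb -D /path/to/new/data                  # brand-new, known-good cluster
pg_ctl -D /path/to/new/data start
psql -U postgres -f full_dump.sql postgres   # reload everything into it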

Unfortunately we don't have a recent backup of the database (a tool
to back up all the relevant data was just finished recently and was
not set up for this system yet).

Read and act on https://wiki.postgresql.org/wiki/Corruption immediately. 
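
The first step there is to stop the server and take a complete
file-level copy of the data directory before doing anything else,
e.g. (the data directory path is a placeholder):

pg_ctl stop -m fast -D "C:\PostgreSQL\9.4\data"

and then copy the entire data directory (including pg_xlog) somewhere
safe before attempting any repair or even restarting the server.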

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services