Thread: Question on corruption (PostgreSQL 9.6.1)

Question on corruption (PostgreSQL 9.6.1)

From

Andy Halsall

Date:

13 March 2018, 11:48:31

Following a number of kernel crashes due to memory corruption, we experienced "ERROR: invalid memory alloc request size 8589934587" from pg_dump. We assume because of corruption to a PostgreSQL page due to the same memory problems causing the crashes. Fortunately this was low impact - we could remove the single problematic row through the application. pg_dump now ok.

But recently we hit a different problem when restoring a pg_dump, seemingly caused by empty data in a not-null column. From pg_dump:

...

CREATE TABLE xxxxx (
    col_a bigint NOT NULL,
    col_b bytea,
    col_c integer NOT NULL,
    col_d bytea,
    col_e integer NOT NULL,
    last_modified timestamp without time zone NOT NULL
);

COPY xxxxx (col_a, col_b, col_c, col_d, col_e, last_modified) FROM stdin;
....
4675632 \\x     0       \\x     0       2017-09-27 10:34:38.109677
4675633 \\x     0       \\x     0       2017-09-27 10:34:38.113812
4675634 \\x     0       \\x     0       2017-09-27 10:34:38.118072
\N      \N      \N      \N      \N      \N
4675636 \\x     0       \\x     0       2017-09-27 10:34:38.128796
4675637 \\x     0       \\x     0       2017-09-27 10:34:38.132003
4675638 \\x     0       \\x     0       2017-09-27 10:34:38.134197
....

And then investigating rows 4675634 and 4675635:

db=# select * from xxxxx where col_a = 4675634;
col_a | col_b | col_c | col_d | col_e | last_modified
---------+-------------+-----------------+-----------+---------------+----------------------------
4675634 | \x | 0 | \x | 0 | 2017-09-27 10:34:38.118072
(1 row)

Row 4675635 is very odd - NULL columns and at the same time retrievable by a value in col_a.

Can this be explained in any other way than by corruption or could it be some other behaviour that we should worry about?

(We've never seen anything like it in test, just on the system that's had the problems. Previous problem row was col_a=4675656 - same table and perhaps page and so perhaps same corruption).

Fortunately again low impact - we were able to delete the row through the application.

Thanks.

Re: Question on corruption (PostgreSQL 9.6.1)

From

Tom Lane

Date:

13 March 2018, 16:58:31

Andy Halsall <andyh@live.co.uk> writes:
>  db=# select * from xxxxx where col_a = 4675635;
>  col_a   | col_b       | col_c           | col_d     | col_e         |  last_modified
> --------+-------------+-----------------+-----------+---------------+---------------
>         |             |                 |           |               |
> (1 row)

> Row 4675635 is very odd - NULL columns and at the same time retrievable by a value in col_a.

Doesn't seem particularly surprising if col_a is indexed.  That query
would choose an indexscan plan, which would normally not bother to
re-verify the index condition against heap tuples found via the index.

If you're continuing to use this damaged database, it might be a good
idea to try to REINDEX all your indexes.  That'd be particularly
useful for primary/unique indexes, since if corruption has led to
any apparent duplicate rows, the reindex would fail and complain.
But in any case it'd clean up heap-vs-index inconsistencies like the
above, as well as repairing any cases where the corruption was in an
index rather than heap.

Another test I'd strongly recommend is to see if you can pg_dumpall
and reload into a spare server.  That might catch forms of data
corruption that reindexing would not, such as violated CHECK constraints.

            regards, tom lane

Re: Question on corruption (PostgreSQL 9.6.1)

From

Andy Halsall

Date:

15 March 2018, 18:16:54

Thanks for the advice. I re-indexed and reloaded a pg_dumpall into a spare server - no errors. Will run pg_catcheck
asap.

regards,
Andy

Andy Halsall <andyh@live.co.uk> writes:
>  db=# select * from xxxxx where col_a = 4675635;
>  col_a   | col_b       | col_c           | col_d     | col_e         |  last_modified
> --------+-------------+-----------------+-----------+---------------+---------------
>         |             |                 |           |               |
> (1 row)

> Row 4675635 is very odd - NULL columns and at the same time retrievable by a value in col_a.

Doesn't seem particularly surprising if col_a is indexed.  That query
would choose an indexscan plan, which would normally not bother to
re-verify the index condition against heap tuples found via the index.

If you're continuing to use this damaged database, it might be a good
idea to try to REINDEX all your indexes.  That'd be particularly
useful for primary/unique indexes, since if corruption has led to
any apparent duplicate rows, the reindex would fail and complain.
But in any case it'd clean up heap-vs-index inconsistencies like the
above, as well as repairing any cases where the corruption was in an
index rather than heap.

Another test I'd strongly recommend is to see if you can pg_dumpall
and reload into a spare server.  That might catch forms of data
corruption that reindexing would not, such as violated CHECK constraints.

                        regards, tom lane

Re: Question on corruption (PostgreSQL 9.6.1)

From

Peter Geoghegan

Date:

16 March 2018, 21:24:23

On Thu, Mar 15, 2018 at 8:16 AM, Andy Halsall <andyh@live.co.uk> wrote:
> Thanks for the advice. I re-indexed and reloaded a pg_dumpall into a spare server - no errors. Will run pg_catcheck
asap.

You can also run amcheck. Get the version targeting earlier Postgres
releases off Github (there are packages for most Linux systems). This
can verify that the heap is consistent with indexes.

-- 
Peter Geoghegan