Thread: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database

BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database

From

PG Bug reporting form

Date:

01 June 2021, 20:45:06

The following bug has been logged on the website:

Bug reference:      17042
Logged by:          Kim-Alexander Brodowski
Email address:      kim.brodowski@iserv.eu
PostgreSQL version: 13.3
Operating system:   Debian Bullseye/Buster
Description:

Dear Readers,

We, IServ GmbH, are a German company providing software primarily for the
education sector. We deploy PostgreSQL as our primary database backend on
almost 5,000 machines. Due to an unfortunate bug in our software and under
rare circumstances, a database schema migration script for DAViCal
(https://www.davical.org/) was executed twice in parallel.

Under normal circumstances this shouldn't be a problem, as databases like
PostgreSQL are designed for concurrent reading and writing. However, we have
noticed corruption on some of our PostgreSQL instances. Particularly our
backups using pg_dumpall would fail:

  LC_ALL=C pg_dumpall -l davical -U postgres
  [...]
  pg_dump: error: query failed: ERROR: cache lookup failed for type 243143
  pg_dump: error: query was: SELECT proretset, prosrc, probin,
pg_catalog.pg_get_function_arguments(oid) AS funcargs,
pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs,
pg_catalog.pg_get_function_result(oid) AS funcresult,
array_to_string(protrftypes, ' ') AS protrftypes, prokind, provolatile,
proisstrict, prosecdef, proleakproof, proconfig, procost, prorows,
prosupport, proparallel, (SELECT lanname FROM pg_catalog.pg_language WHERE
oid = prolang) AS lanname FROM pg_catalog.pg_proc WHERE oid =
'243147'::pg_catalog.oid
  pg_dumpall: error: pg_dump failed on database "iserv", exiting

Interestingly, dumping only the actual table contents works fine. The
corruption occurred in the database schema.

Digging through pg_proc looking for the culprit revealed that for varying
user defined functions, the OID referenced in prorettype could not be found
in pg_type. Cross-checking with the database schema showed us that the data
type was indeed still defined in pg_type, but the OID appears to have
changed.

Reproducing the bug:

My co-worker Martin von Wittich came up with a set of minimal SQL statements
to trigger the issue:

bug.sql:

  DROP TYPE foo CASCADE;
  CREATE TYPE foo AS (foo INTEGER);
  
  CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$
  BEGIN
    RETURN ROW(1)::foo;
  END
  $$
  LANGUAGE plpgsql;
  
  SELECT foobar();

In order to trigger the issue, we need to run our statements in parallel:

  psql -f test.sql & psql -f test.sql

You might need to run this more than once to trigger the issue. Occasionally
errors will show up.

To check whether we succeeded, the easiest way is to dump the schema of the
database in use:

  pg_dumpall -s

We were able to reproduce this issue on PostgreSQL 13.3, 13.2 and 11.12.

There are a couple of accounts of similar issues coming up in the past, such
as in
https://www.postgresql-archive.org/problem-with-create-function-and-drop-type-td6162498.html.
Therefore, we believe we are not the only ones affected by this issue nor is
this issue particularly new. Nobody appears to have made a connection with
parallel execution of statements though until now.

Workaround:
The database schema migration script unconditionally recreates the functions
in use (CREATE OR REPLACE ...). Unfortunately, this doesn't appear to fix
the issue. Instead, you have to manually DROP and CREATE all affected
functions.

Cause:
We didn't dive deep into PostgreSQL's codebase, but it seems likely, that a
lack of proper locking might be to blame here. While parallel execution of
queries like that is certainly uncommon, we believe the PostgreSQL cluster
should never end up in an inconsistent state.

On behalf of my employer, I'd like to thank the PostgreSQL developers and
community for their work on this project. We heavily rely on PostgreSQL.

Kind regards,
Kim-Alexander Brodowski
IServ GmbH

Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database

From

Kim-Alexander Brodowski

Date:

01 June 2021, 21:06:38

Of course, you only realize something went wrong once you hit submit:

Please ignore the inconsistencies between file and database names, as 
the logs have been put together from multiple terminal sessions. Context 
should make clear what was meant to be there.

On 01.06.21 22:45, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17042
> Logged by:          Kim-Alexander Brodowski
> Email address:      kim.brodowski@iserv.eu
> PostgreSQL version: 13.3
> Operating system:   Debian Bullseye/Buster
> Description:
>
> Dear Readers,
>
> We, IServ GmbH, are a German company providing software primarily for the
> education sector. We deploy PostgreSQL as our primary database backend on
> almost 5,000 machines. Due to an unfortunate bug in our software and under
> rare circumstances, a database schema migration script for DAViCal
> (https://www.davical.org/) was executed twice in parallel.
>
> Under normal circumstances this shouldn't be a problem, as databases like
> PostgreSQL are designed for concurrent reading and writing. However, we have
> noticed corruption on some of our PostgreSQL instances. Particularly our
> backups using pg_dumpall would fail:
>
>    LC_ALL=C pg_dumpall -l davical -U postgres
>    [...]
>    pg_dump: error: query failed: ERROR: cache lookup failed for type 243143
>    pg_dump: error: query was: SELECT proretset, prosrc, probin,
> pg_catalog.pg_get_function_arguments(oid) AS funcargs,
> pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs,
> pg_catalog.pg_get_function_result(oid) AS funcresult,
> array_to_string(protrftypes, ' ') AS protrftypes, prokind, provolatile,
> proisstrict, prosecdef, proleakproof, proconfig, procost, prorows,
> prosupport, proparallel, (SELECT lanname FROM pg_catalog.pg_language WHERE
> oid = prolang) AS lanname FROM pg_catalog.pg_proc WHERE oid =
> '243147'::pg_catalog.oid
>    pg_dumpall: error: pg_dump failed on database "iserv", exiting
>
> Interestingly, dumping only the actual table contents works fine. The
> corruption occurred in the database schema.
>
> Digging through pg_proc looking for the culprit revealed that for varying
> user defined functions, the OID referenced in prorettype could not be found
> in pg_type. Cross-checking with the database schema showed us that the data
> type was indeed still defined in pg_type, but the OID appears to have
> changed.
>
> Reproducing the bug:
>
> My co-worker Martin von Wittich came up with a set of minimal SQL statements
> to trigger the issue:
>
> bug.sql:
>
>    DROP TYPE foo CASCADE;
>    CREATE TYPE foo AS (foo INTEGER);
>    
>    CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$
>    BEGIN
>      RETURN ROW(1)::foo;
>    END
>    $$
>    LANGUAGE plpgsql;
>    
>    SELECT foobar();
>
> In order to trigger the issue, we need to run our statements in parallel:
>
>    psql -f test.sql & psql -f test.sql
>
> You might need to run this more than once to trigger the issue. Occasionally
> errors will show up.
>
> To check whether we succeeded, the easiest way is to dump the schema of the
> database in use:
>
>    pg_dumpall -s
>
> We were able to reproduce this issue on PostgreSQL 13.3, 13.2 and 11.12.
>
> There are a couple of accounts of similar issues coming up in the past, such
> as in
> https://www.postgresql-archive.org/problem-with-create-function-and-drop-type-td6162498.html.
> Therefore, we believe we are not the only ones affected by this issue nor is
> this issue particularly new. Nobody appears to have made a connection with
> parallel execution of statements though until now.
>
> Workaround:
> The database schema migration script unconditionally recreates the functions
> in use (CREATE OR REPLACE ...). Unfortunately, this doesn't appear to fix
> the issue. Instead, you have to manually DROP and CREATE all affected
> functions.
>
> Cause:
> We didn't dive deep into PostgreSQL's codebase, but it seems likely, that a
> lack of proper locking might be to blame here. While parallel execution of
> queries like that is certainly uncommon, we believe the PostgreSQL cluster
> should never end up in an inconsistent state.
>
> On behalf of my employer, I'd like to thank the PostgreSQL developers and
> community for their work on this project. We heavily rely on PostgreSQL.
>
> Kind regards,
> Kim-Alexander Brodowski
> IServ GmbH
>

Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database

From

Tom Lane

Date:

01 June 2021, 21:41:40

PG Bug reporting form <noreply@postgresql.org> writes:
> My co-worker Martin von Wittich came up with a set of minimal SQL statements
> to trigger the issue:

> bug.sql:

>   DROP TYPE foo CASCADE;
>   CREATE TYPE foo AS (foo INTEGER);

>   CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$
>   BEGIN
>     RETURN ROW(1)::foo;
>   END
>   $$
>   LANGUAGE plpgsql;

>   SELECT foobar();

> In order to trigger the issue, we need to run our statements in parallel:
>   psql -f test.sql & psql -f test.sql

This doesn't seem terribly surprising.  Occasionally one session's
foobar() will run at an instant where type foo doesn't exist, thanks
to the other session's script having dropped the type and not yet
recreated it.  The "cache lookup failed" messages are a bit scary,
but they arise in situations where the type did exist a moment ago
but now it's gone.

There's been occasional proposals to prevent this sort of thing by
obtaining a lock on every type name mentioned in a function and
holding it till (probably) end of transaction.  The cost of that,
compared to the value, has dissuaded us from doing it.

            regards, tom lane

Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database

From

Kim-Alexander Brodowski

Date:

01 June 2021, 22:06:59

On 01.06.21 23:41, Tom Lane wrote:

> This doesn't seem terribly surprising. Occasionally one session's
> foobar() will run at an instant where type foo doesn't exist, thanks
> to the other session's script having dropped the type and not yet
> recreated it.  The "cache lookup failed" messages are a bit scary,
> but they arise in situations where the type did exist a moment ago
> but now it's gone.

I'm not too concerned with what happens when our test statements are 
executed. I don't care about them at all. From the moment the issue 
occurs the database is corrupted though. No backups including the schema 
will succeed and statements involving the function will not operate 
correctly. The schema is permanently corrupted. Novice users will likely 
be unable to recover from that state without recreating the entire 
database.