Thread: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17042 Logged by: Kim-Alexander Brodowski Email address: kim.brodowski@iserv.eu PostgreSQL version: 13.3 Operating system: Debian Bullseye/Buster Description: Dear Readers, We, IServ GmbH, are a German company providing software primarily for the education sector. We deploy PostgreSQL as our primary database backend on almost 5,000 machines. Due to an unfortunate bug in our software and under rare circumstances, a database schema migration script for DAViCal (https://www.davical.org/) was executed twice in parallel. Under normal circumstances this shouldn't be a problem, as databases like PostgreSQL are designed for concurrent reading and writing. However, we have noticed corruption on some of our PostgreSQL instances. Particularly our backups using pg_dumpall would fail: LC_ALL=C pg_dumpall -l davical -U postgres [...] pg_dump: error: query failed: ERROR: cache lookup failed for type 243143 pg_dump: error: query was: SELECT proretset, prosrc, probin, pg_catalog.pg_get_function_arguments(oid) AS funcargs, pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs, pg_catalog.pg_get_function_result(oid) AS funcresult, array_to_string(protrftypes, ' ') AS protrftypes, prokind, provolatile, proisstrict, prosecdef, proleakproof, proconfig, procost, prorows, prosupport, proparallel, (SELECT lanname FROM pg_catalog.pg_language WHERE oid = prolang) AS lanname FROM pg_catalog.pg_proc WHERE oid = '243147'::pg_catalog.oid pg_dumpall: error: pg_dump failed on database "iserv", exiting Interestingly, dumping only the actual table contents works fine. The corruption occurred in the database schema. Digging through pg_proc looking for the culprit revealed that for varying user defined functions, the OID referenced in prorettype could not be found in pg_type. Cross-checking with the database schema showed us that the data type was indeed still defined in pg_type, but the OID appears to have changed. Reproducing the bug: My co-worker Martin von Wittich came up with a set of minimal SQL statements to trigger the issue: bug.sql: DROP TYPE foo CASCADE; CREATE TYPE foo AS (foo INTEGER); CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$ BEGIN RETURN ROW(1)::foo; END $$ LANGUAGE plpgsql; SELECT foobar(); In order to trigger the issue, we need to run our statements in parallel: psql -f test.sql & psql -f test.sql You might need to run this more than once to trigger the issue. Occasionally errors will show up. To check whether we succeeded, the easiest way is to dump the schema of the database in use: pg_dumpall -s We were able to reproduce this issue on PostgreSQL 13.3, 13.2 and 11.12. There are a couple of accounts of similar issues coming up in the past, such as in https://www.postgresql-archive.org/problem-with-create-function-and-drop-type-td6162498.html. Therefore, we believe we are not the only ones affected by this issue nor is this issue particularly new. Nobody appears to have made a connection with parallel execution of statements though until now. Workaround: The database schema migration script unconditionally recreates the functions in use (CREATE OR REPLACE ...). Unfortunately, this doesn't appear to fix the issue. Instead, you have to manually DROP and CREATE all affected functions. Cause: We didn't dive deep into PostgreSQL's codebase, but it seems likely, that a lack of proper locking might be to blame here. While parallel execution of queries like that is certainly uncommon, we believe the PostgreSQL cluster should never end up in an inconsistent state. On behalf of my employer, I'd like to thank the PostgreSQL developers and community for their work on this project. We heavily rely on PostgreSQL. Kind regards, Kim-Alexander Brodowski IServ GmbH
Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
From
Kim-Alexander Brodowski
Date:
Of course, you only realize something went wrong once you hit submit: Please ignore the inconsistencies between file and database names, as the logs have been put together from multiple terminal sessions. Context should make clear what was meant to be there. On 01.06.21 22:45, PG Bug reporting form wrote: > The following bug has been logged on the website: > > Bug reference: 17042 > Logged by: Kim-Alexander Brodowski > Email address: kim.brodowski@iserv.eu > PostgreSQL version: 13.3 > Operating system: Debian Bullseye/Buster > Description: > > Dear Readers, > > We, IServ GmbH, are a German company providing software primarily for the > education sector. We deploy PostgreSQL as our primary database backend on > almost 5,000 machines. Due to an unfortunate bug in our software and under > rare circumstances, a database schema migration script for DAViCal > (https://www.davical.org/) was executed twice in parallel. > > Under normal circumstances this shouldn't be a problem, as databases like > PostgreSQL are designed for concurrent reading and writing. However, we have > noticed corruption on some of our PostgreSQL instances. Particularly our > backups using pg_dumpall would fail: > > LC_ALL=C pg_dumpall -l davical -U postgres > [...] > pg_dump: error: query failed: ERROR: cache lookup failed for type 243143 > pg_dump: error: query was: SELECT proretset, prosrc, probin, > pg_catalog.pg_get_function_arguments(oid) AS funcargs, > pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs, > pg_catalog.pg_get_function_result(oid) AS funcresult, > array_to_string(protrftypes, ' ') AS protrftypes, prokind, provolatile, > proisstrict, prosecdef, proleakproof, proconfig, procost, prorows, > prosupport, proparallel, (SELECT lanname FROM pg_catalog.pg_language WHERE > oid = prolang) AS lanname FROM pg_catalog.pg_proc WHERE oid = > '243147'::pg_catalog.oid > pg_dumpall: error: pg_dump failed on database "iserv", exiting > > Interestingly, dumping only the actual table contents works fine. The > corruption occurred in the database schema. > > Digging through pg_proc looking for the culprit revealed that for varying > user defined functions, the OID referenced in prorettype could not be found > in pg_type. Cross-checking with the database schema showed us that the data > type was indeed still defined in pg_type, but the OID appears to have > changed. > > Reproducing the bug: > > My co-worker Martin von Wittich came up with a set of minimal SQL statements > to trigger the issue: > > bug.sql: > > DROP TYPE foo CASCADE; > CREATE TYPE foo AS (foo INTEGER); > > CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$ > BEGIN > RETURN ROW(1)::foo; > END > $$ > LANGUAGE plpgsql; > > SELECT foobar(); > > In order to trigger the issue, we need to run our statements in parallel: > > psql -f test.sql & psql -f test.sql > > You might need to run this more than once to trigger the issue. Occasionally > errors will show up. > > To check whether we succeeded, the easiest way is to dump the schema of the > database in use: > > pg_dumpall -s > > We were able to reproduce this issue on PostgreSQL 13.3, 13.2 and 11.12. > > There are a couple of accounts of similar issues coming up in the past, such > as in > https://www.postgresql-archive.org/problem-with-create-function-and-drop-type-td6162498.html. > Therefore, we believe we are not the only ones affected by this issue nor is > this issue particularly new. Nobody appears to have made a connection with > parallel execution of statements though until now. > > Workaround: > The database schema migration script unconditionally recreates the functions > in use (CREATE OR REPLACE ...). Unfortunately, this doesn't appear to fix > the issue. Instead, you have to manually DROP and CREATE all affected > functions. > > Cause: > We didn't dive deep into PostgreSQL's codebase, but it seems likely, that a > lack of proper locking might be to blame here. While parallel execution of > queries like that is certainly uncommon, we believe the PostgreSQL cluster > should never end up in an inconsistent state. > > On behalf of my employer, I'd like to thank the PostgreSQL developers and > community for their work on this project. We heavily rely on PostgreSQL. > > Kind regards, > Kim-Alexander Brodowski > IServ GmbH >
Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > My co-worker Martin von Wittich came up with a set of minimal SQL statements > to trigger the issue: > bug.sql: > DROP TYPE foo CASCADE; > CREATE TYPE foo AS (foo INTEGER); > CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$ > BEGIN > RETURN ROW(1)::foo; > END > $$ > LANGUAGE plpgsql; > SELECT foobar(); > In order to trigger the issue, we need to run our statements in parallel: > psql -f test.sql & psql -f test.sql This doesn't seem terribly surprising. Occasionally one session's foobar() will run at an instant where type foo doesn't exist, thanks to the other session's script having dropped the type and not yet recreated it. The "cache lookup failed" messages are a bit scary, but they arise in situations where the type did exist a moment ago but now it's gone. There's been occasional proposals to prevent this sort of thing by obtaining a lock on every type name mentioned in a function and holding it till (probably) end of transaction. The cost of that, compared to the value, has dissuaded us from doing it. regards, tom lane
Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
From
Kim-Alexander Brodowski
Date:
On 01.06.21 23:41, Tom Lane wrote: > This doesn't seem terribly surprising. Occasionally one session's > foobar() will run at an instant where type foo doesn't exist, thanks > to the other session's script having dropped the type and not yet > recreated it. The "cache lookup failed" messages are a bit scary, > but they arise in situations where the type did exist a moment ago > but now it's gone. I'm not too concerned with what happens when our test statements are executed. I don't care about them at all. From the moment the issue occurs the database is corrupted though. No backups including the schema will succeed and statements involving the function will not operate correctly. The schema is permanently corrupted. Novice users will likely be unable to recover from that state without recreating the entire database.