BUG #17989: ERROR: could not open file "pg_logical/snapshots/6-1E182808.snap": Permission denied
From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17989
Logged by:          Dmitry Ivanov
Email address:      dmivanov@ptsecurity.com
PostgreSQL version: 12.1
Operating system:   Windows Server 2019 Standard
Description:

Hi,

I have a problem with logical replication on PostgreSQL 12 (later versions too). I set up logical replication for 40 databases (40 subscriptions/publications, plus a large number of replication slots). After starting replication, at random times I get the following error on the master:

ERROR: could not open file "pg_logical/snapshots/6-1E182808.snap": Permission denied

This leads to a PostgreSQL restart and downtime for the services that consume these databases. There are no obvious permission problems, and for some databases replication works fine; the error appears at random times. I watched file/handle activity with Sysinternals Process Monitor: no third-party software (antivirus, etc.) touches the files in pg_logical/snapshots.

It looks like two backend workers are trying to open the same snapshot at the same time. Postgres log at DEBUG1 level:

2023-06-22 12:39:54 MSK db1 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db2 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db3 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db4 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db5 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db6 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db7 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db8 DEBUG: updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db9 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db10 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db13 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db8 DEBUG: got new restart lsn 6/1ED17A98 at 6/1ED17A98
2023-06-22 12:39:54 MSK db8 DEBUG: serializing snapshot to pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db11 DEBUG: updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db10 DEBUG: received replication command: CREATE_REPLICATION_SLOT "subs_db10_190385_sync_16547" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2023-06-22 12:39:54 MSK db2 DEBUG: failed to increase restart lsn: proposed 6/1ED17AD0, after 6/1ED17AD0, current candidate 6/1ED17AD0, current after 6/1ED17AD0, flushed up to 6/1ED17AD0
2023-06-22 12:39:54 MSK db7 DEBUG: searching for logical decoding starting point, starting at 6/1ED17AD0
2023-06-22 12:39:54 MSK db7 LOG: logical decoding found consistent point at 6/1ED17AD0
2023-06-22 12:39:54 MSK db7 DETAIL: There are no running transactions.
2023-06-22 12:39:54 MSK db7 STATEMENT: CREATE_REPLICATION_SLOT "subs_db7_190470_sync_17114" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2023-06-22 12:39:54 MSK db11 DEBUG: failed to increase restart lsn: proposed 6/1ED17A28, after 6/1ED17A28, current candidate 6/1ED17A28, current after 6/1ED17A28, flushed up to 6/1ED17A28
2023-06-22 12:39:54 MSK db12 DEBUG: updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db2 DEBUG: updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db13 DEBUG: failed to increase restart lsn: proposed 6/1ED17AD0, after 6/1ED17AD0, current candidate 6/1ED17AD0, current after 6/1ED17AD0, flushed up to 6/1ED17AD0
2023-06-22 12:39:54 MSK db12 DEBUG: got new restart lsn 6/1ED17A98 at 6/1ED17A98
2023-06-22 12:39:54 MSK db14 DEBUG: failed to increase restart lsn: proposed 6/1ED17A28, after 6/1ED17A28, current candidate 6/1ED17A28, current after 6/1ED17A28, flushed up to 6/1ED17A28
2023-06-22 12:39:54 MSK db10 DEBUG: failed to increase restart lsn: proposed 6/1ED17AD0, after 6/1ED17AD0, current candidate 6/1ED179F0, current after 6/1ED179F0, flushed up to 6/1ED179B8
2023-06-22 12:39:54 MSK db11 ERROR: could not open file "pg_logical/snapshots/6-1ED17AD0.snap": Permission denied

I looked at the source code and see a comment about the race condition and synchronization via atomic rename:
https://github.com/postgres/postgres/blob/596114164699bbe184969df374fd6a1d8a93a57c/src/backend/replication/logical/snapbuild.c#L1670

The PANIC occurs when fsyncing the target file (which requires opening it, and hence a permission check) after the .tmp file has been renamed into place; a condensed sketch of this code path is at the end of this message:
https://github.com/postgres/postgres/blob/596114164699bbe184969df374fd6a1d8a93a57c/src/backend/replication/logical/snapbuild.c#L1815
https://github.com/postgres/postgres/blob/596114164699bbe184969df374fd6a1d8a93a57c/src/backend/storage/file/fd.c#LL3775C6-L3775C6

On POSIX systems rename() is an atomic operation, but AFAIK MoveFileEx on Windows is not atomic; in effect it is a copy followed by a delete. Is it reasonable to rely on the atomicity of rename for synchronization here? Are there any ways to fix this? Unfortunately, right now I see no option other than disabling the PANIC on fsync failures entirely for the duration of replication (data_sync_retry = on; a config fragment is at the end of this message). Nothing appears to get damaged in this case, but in general this workaround is VERY fragile and dangerous.

BTW: a similar issue was already mentioned here (different bug, same place), but instead of a fix they simply upgraded FreeBSD on the test machines:
https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us
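For readers following along in the source, here is a condensed sketch of the pattern SnapBuildSerialize appears to use, as far as I can tell: write to a per-PID temp file, rename() it over the final name, then open and fsync the final name. This is my own paraphrase, NOT the actual PostgreSQL code; error handling, directory fsyncs and the real snapshot payload are omitted.

/*
 * Condensed, paraphrased sketch of the snapbuild.c code path.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int
serialize_snapshot(const char *final_path, const void *data, size_t len)
{
	struct stat st;
	char		tmppath[256];
	int			fd;

	/*
	 * If another worker already serialized this snapshot, we are done.
	 * Note that two workers can both pass this check before either of
	 * them has renamed its temp file into place.
	 */
	if (stat(final_path, &st) == 0)
		return 0;

	/* per-PID temp name keeps concurrent writers out of each other's way */
	snprintf(tmppath, sizeof(tmppath), "%s.%d.tmp", final_path, (int) getpid());

	fd = open(tmppath, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0 || write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
		return -1;
	close(fd);

	/*
	 * The comment at snapbuild.c#L1670 relies on this rename being atomic:
	 * if several workers race here, the loser just replaces the file with
	 * identical contents.  My report suggests that on Windows this step
	 * can leave the destination transiently inaccessible.
	 */
	if (rename(tmppath, final_path) != 0)
		return -1;

	/*
	 * PostgreSQL then fsyncs the final file by name; the open() below is
	 * where "Permission denied" surfaces, and an fsync failure there is
	 * normally promoted to PANIC.
	 */
	fd = open(final_path, O_RDONLY);
	if (fd < 0 || fsync(fd) != 0)
		return -1;
	close(fd);
	return 0;
}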
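As for possible fixes: I do not know what the right one is, but one hypothetical mitigation (my illustration only, not code from the PostgreSQL tree) would be to absorb transient sharing violations by retrying the open briefly, similar in spirit to the retry loop that pgrename() in src/port/dirmod.c already uses on Windows, if I read that code correctly:

/*
 * Hypothetical illustration only -- not from the PostgreSQL tree.
 * Retry the open for a bounded time when it fails with a transient
 * sharing-violation class of error, on the theory that a concurrent
 * MoveFileEx onto the same name is briefly holding the destination.
 */
#include <windows.h>

static HANDLE
open_snapshot_with_retry(const char *path)
{
	for (int attempt = 0; attempt < 100; attempt++)
	{
		HANDLE	h = CreateFileA(path, GENERIC_READ,
								FILE_SHARE_READ | FILE_SHARE_WRITE |
								FILE_SHARE_DELETE,
								NULL, OPEN_EXISTING,
								FILE_ATTRIBUTE_NORMAL, NULL);
		DWORD	err;

		if (h != INVALID_HANDLE_VALUE)
			return h;
		err = GetLastError();
		if (err != ERROR_SHARING_VIOLATION && err != ERROR_ACCESS_DENIED)
			break;				/* a real failure, not a transient collision */
		Sleep(100);				/* give the competing rename time to finish */
	}
	return INVALID_HANDLE_VALUE;
}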
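And for completeness, the workaround I mentioned, assuming I understand the GUC correctly: data_sync_retry is a postmaster-startup parameter (so it needs a server restart), and the documentation warns it is only safe on filesystems that keep dirty buffers valid after an fsync failure, which is exactly why I call it fragile:

# postgresql.conf fragment -- report fsync failures as retryable errors
# instead of PANICking.  Requires a restart; see the caveats in the docs.
data_sync_retry = on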