Re: Synchronous commit behavior during network outage - Mailing list pgsql-hackers
From | Ondřej Žižka |
---|---|
Subject | Re: Synchronous commit behavior during network outage |
Date | |
Msg-id | 39d141cf-89c0-6aaa-c8e9-500c05d0d744@stratox.cz Whole thread Raw |
In response to | Re: Synchronous commit behavior during network outage (Ondřej Žižka <ondrej.zizka@stratox.cz>) |
Responses |
Re: Synchronous commit behavior during network outage
|
List | pgsql-hackers |
I am sorry, I forgot mentioned, that in the second situation I added a primary key to the table. Ondrej On 20/04/2021 18:49, Ondřej Žižka wrote: > Hello Aleksander, > > Thank you for the reaction. This was tested on version 13.2. > > There are also other possible situations with the same setup and > similar issue: > > ----------------- > When the background process on server fails.... > > On postgresql1: > tecmint=# select * from a; --> LAN on sync replica is OK > id > ---- > 1 > (1 row) > > tecmint=# insert into a values (2); ---> LAN on sync replica is DOWN > and insert is waiting. During this time kill the background process on > the PostgreSQL server for this session > WARNING: canceling the wait for synchronous replication and > terminating connection due to administrator command > DETAIL: The transaction has already committed locally, but might not > have been replicated to the standby. > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Succeeded. > tecmint=# select * from a; > id > ---- > 1 > 2 > (2 rows) > > tecmint=# ---> LAN on sync replica is still DOWN > > The potgres session will restore after the background process failed. > When you run select on master, it still looks OK. But data is still > not replicated on the sync replica. If we lost the master now, we > would lost this data as well. > > ************** > Another case > ************** > > Kill the client process. > > tecmint=# select * from a; > id > ---- > 1 > 2 > 3 > (3 rows) > tecmint=# --> Disconnect the sync replica now. LAN on > replica is DOWN > tecmint=# insert into a values (4); --> Kill the client process > Terminated > xzizka@service-vm:~$ psql -U postgres -h 192.168.122.6 -p 5432 -d tecmint > Password for user postgres: > psql (13.2 (Debian 13.2-1.pgdg100+1)) > Type "help" for help. > > tecmint=# select * from a; > id > ---- > 1 > 2 > 3 > (3 rows) > > tecmint=# --> Number 4 is not there. Now switch the LAN on sync > replica ON. > > ---------- > > Result from sync replica after the LAN is again UP: > tecmint=# select * from a; > id > ---- > 1 > 2 > 3 > 4 > (4 rows) > > > In this situation, try to insert the number 4 again to the table. > > tecmint=# select * from a; > id > ---- > 1 > 2 > 3 > (3 rows) > > tecmint=# insert into a values (4); > ERROR: duplicate key value violates unique constraint "a_pkey" > DETAIL: Key (id)=(4) already exists. > tecmint=# > > This is really strange... Application can be confused, It is not > possible to insert record, which is not there, but some systems which > use the sync node as a read replica maybe already read that record > from the sync replica database and done some steps which can cause > issues and can be hard to track. > > If I say, that it would be hard to send the CTRL+C to the database > from the client, I need to say, that the 2 situations I described here > can happen in real. > > What do you think? > > Thank you and regards > Ondrej > > On 20/04/2021 17:23, Aleksander Alekseev wrote: >> Hi Ondřej, >> >> Thanks for the report. It seems to be a clear violation of what is >> promised in the docs. Although it's unlikely that someone implemented >> an application which deals with important data and "pressed Ctr+C" as >> it's done in psql. So this might be not such a critical issue after >> all. BTW what version of PostgreSQL are you using? >> >> >> On Mon, Apr 19, 2021 at 10:13 PM Ondřej Žižka >> <ondrej.zizka@stratox.cz> wrote: >>> Hello all, >>> I would like to know your opinion on the following behaviour I see >>> for PostgreSQL setup with synchronous replication. >>> >>> This behaviour happens in a special use case. In this use case, >>> there are 2 synchronous replicas with the following config (truncated): >>> >>> - 2 nodes >>> - synchronous_standby_names='*' >>> - synchronous_commit=remote_apply >>> >>> >>> With this setup run the following steps (LAN down - LAN between >>> master and replica): >>> ----------------- >>> postgres=# truncate table a; >>> TRUNCATE TABLE >>> postgres=# insert into a values (1); -- LAN up, insert has been >>> applied to replica. >>> INSERT 0 1 >>> Vypnu LAN na serveru se standby: >>> postgres=# insert into a values (2); --LAN down, waiting for a >>> confirmation from sync replica. In this situation cancel it (press >>> CTRL+C) >>> ^CCancel request sent >>> WARNING: canceling wait for synchronous replication due to user >>> request >>> DETAIL: The transaction has already committed locally, but might >>> not have been replicated to the standby. >>> INSERT 0 1 >>> There will be warning that commit was performed only locally: >>> 2021-04-12 19:55:53.063 CEST [26104] WARNING: canceling wait for >>> synchronous replication due to user request >>> 2021-04-12 19:55:53.063 CEST [26104] DETAIL: The transaction has >>> already committed locally, but might not have been replicated to the >>> standby. >>> >>> postgres=# insert into a values (2); --LAN down, waiting for a >>> confirmation from sync replica. In this situation cancel it (press >>> CTRL+C) >>> ^CCancel request sent >>> WARNING: canceling wait for synchronous replication due to user >>> request >>> DETAIL: The transaction has already committed locally, but might >>> not have been replicated to the standby. >>> INSERT 0 1 >>> postgres=# insert into a values (2); --LAN down, waiting for sync >>> replica, second attempt, cancel it as well (CTRL+C) >>> ^CCancel request sent >>> WARNING: canceling wait for synchronous replication due to user >>> request >>> DETAIL: The transaction has already committed locally, but might >>> not have been replicated to the standby. >>> INSERT 0 1 >>> postgres=# update a set n=3 where n=2; --LAN down, waiting for sync >>> replica, cancel it (CTRL+C) >>> ^CCancel request sent >>> WARNING: canceling wait for synchronous replication due to user >>> request >>> DETAIL: The transaction has already committed locally, but might >>> not have been replicated to the standby. >>> UPDATE 2 >>> postgres=# update a set n=3 where n=2; -- run the same >>> update,because data from the previous attempt was commited on >>> master, it is sucessfull, but no changes >>> UPDATE 0 >>> postgres=# select * from a; >>> n >>> --- >>> 1 >>> 3 >>> 3 >>> (3 rows) >>> postgres=# >>> ------------------------ >>> >>> Now, there is only value 1 in the sync replica table (no other >>> values), data is not in sync. This is expected, after the LAN >>> restore, data will come sync again, but if the main/primary node >>> will fail and we failover to replica before the LAN is back up or >>> the storage for this node would be destroyed and data would not sync >>> to replica before it, we will lose data even if the client received >>> successful commit (with a warning). >>> From the synchronous_commit=remote_write level and "higher", I >>> would expect, that when the remote application (doesn't matter if >>> flush, write or apply) would not be applied I would not receive a >>> confirmation about the commit (even with a warning). Something like, >>> if there is no commit from sync replica, there is no commit on >>> primary and if someone performs the steps above, the whole >>> transaction will not send a confirmation. >>> >>> This can cause issues if the application receives a confirmation >>> about the success and performs some follow-up steps e.g. create a >>> user account and sends a request to the mail system to create an >>> account or create a VPN account. If the scenario above happens, >>> there can exist a VPN account that does not have any presence in the >>> central database and can be a security issue. >>> >>> I hope I explained it sufficiently. :-) >>> >>> Do you think, that would be possible to implement a process that >>> would solve this use case? >>> >>> Thank you >>> Ondrej >> >>
pgsql-hackers by date: