Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries - Mailing list pgsql-bugs
From | Kyotaro Horiguchi
---|---
Subject | Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries
Date |
Msg-id | 20200903.122225.861514703052371341.horikyota.ntt@gmail.com
In response to | BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries (PG Bug reporting form <noreply@postgresql.org>)
Responses | Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries
List | pgsql-bugs
At Wed, 02 Sep 2020 10:58:49 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in

> The following bug has been logged on the website:
>
> Bug reference:      16605
> Logged by:          Tushar Takate
> Email address:      tushar11.takate@gmail.com
> PostgreSQL version: 10.5
> Operating system:   OEL 6.9
> Description:
>
> Hi Team,
>
> Problem/bug/issue statement: PostgreSQL recovery startup process is
> blocking the application queries.
>
> Env Details -
>
> PostgreSQL community version - 10.5
> OS - OEL 6.9
> Replication type - Streaming - async
>
> DB parameters at replica side -
>
> max_standby_streaming_delay = -1
> max_standby_archive_delay = -1
> hot_standby_feedback = off
>
> Startup process by ps -ef
>
> postgres 24489 24487 0 2019 ? 1-20:23:35 postgres:
> startup process   recovering 00000006000021D4000000ED waiting

This is the cause of the whole blocking chain that follows. "recovering xxx
waiting" means the startup process is waiting for a recovery conflict to be
resolved; in other words, the startup process is being blocked by some
backend. In your case the startup process should have taken
AccessExclusiveLock on relation 17280 before being blocked. With
max_standby_streaming_delay = -1, the startup process waits forever.

The most common cause of a recovery conflict when hot_standby_feedback = off
is a snapshot conflict: a vacuum on the primary side cannot be replayed on
the standby because some transaction on the standby side may still be
referring to the rows that are about to be vacuumed away.

I'm not sure there is an easier way, but the cause can be examined by the
following steps.

=# select sent_lsn, replay_lsn, file, upper(to_hex(off))
     from pg_stat_replication,
          lateral pg_walfile_name_offset(replay_lsn) as o(file, off);
 sent_lsn  | replay_lsn |           file           | to_hex
-----------+------------+--------------------------+--------
 0/5882B30 | 0/5874878  | 000000010000000000000005 | 874878

You will see that replay_lsn is behind sent_lsn, and the last two columns
show the location of the blocked record.

$ pg_waldump .../000000010000000000000005 | grep 874878
rmgr: Heap2  len (rec/tot): 506/ 506, tx: 0, lsn: 0/05874878, prev 0/05874850, desc: CLEAN remxid 600, blkref #0: rel 1663/12348/16385 blk 0

The Heap2 CLEAN record is emitted by vacuum. You can avoid this kind of
conflict by turning hot_standby_feedback on, and/or preferably by setting
max_standby_streaming_delay to an effective value, which resolves any kind of
conflict by terminating the conflicting backends. As a workaround after a
snapshot conflict has already happened, manually terminate the backends whose
backend_xmin is <= 600; the startup process will then continue recovery (a
query along these lines is sketched below).

Another common cause is an access exclusive lock replayed from the primary
that is blocked by a standby transaction. In that case you will see a
Standby-LOCK record as the problem WAL record, but pg_locks also shows such
lock conflicts (a sample query is sketched below as well). I think other
kinds of recovery conflicts rarely happen.

> All the pids in the wait queue are executing select queries on the same
> table, that is (mytable).
>
> Can you please help us to understand the issue and possible
> solution/workaround/fix for it.
>
> As a temporary fix we have restarted the DB service (before that we tried
> to kill the process with -15, which did not work).

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
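A minimal sketch of the snapshot-conflict workaround described above, run on the standby as a superuser. It assumes the conflicting xmin is 600, as shown by the "CLEAN remxid 600" record; the xid-to-bigint cast is a simplification that ignores wraparound, and the query itself is illustrative rather than part of the original report.

    -- Find standby backends whose snapshot still needs the rows removed by
    -- the conflicting vacuum (backend_xmin at or below remxid 600) and
    -- terminate them so the startup process can resume replay.
    SELECT pid,
           backend_xmin,
           state,
           pg_terminate_backend(pid) AS terminated
      FROM pg_stat_activity
     WHERE backend_xmin IS NOT NULL
       AND backend_xmin::text::bigint <= 600   -- cutoff taken from the CLEAN record
       AND pid <> pg_backend_pid();

Once the conflicting sessions are gone, the "recovering ... waiting" startup process should proceed without restarting the standby.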
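For the lock-conflict case, a sketch of a pg_locks query that lists the sessions holding or requesting locks on the relation the startup process is trying to lock. The OID 17280 comes from this report; the join and column choice are illustrative.

    -- Show all relation locks on 17280 together with the owning sessions.
    -- The startup process's AccessExclusiveLock request and the conflicting
    -- locks of ordinary backends should both be visible here.
    SELECT l.pid, l.mode, l.granted, a.backend_type, a.state, a.query
      FROM pg_locks l
      LEFT JOIN pg_stat_activity a ON a.pid = l.pid
     WHERE l.locktype = 'relation'
       AND l.relation = 17280
     ORDER BY l.granted, l.pid;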