Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time - Mailing list pgsql-bugs
From | Masahiko Sawada
---|---
Subject | Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time
Date |
Msg-id | CAD21AoA7u6jwqXzE_caWMznzLfecNuE0nOuh=7EQONkU4zvNcQ@mail.gmail.com
In response to | BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time (PG Bug reporting form <noreply@postgresql.org>)
Responses | Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time
List | pgsql-bugs
On Wed, Aug 10, 2022 at 5:50 PM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17580
> Logged by:          whale song
> Email address:      13207961@qq.com
> PostgreSQL version: 10.5
> Operating system:   CentOS
> Description:
>
> 1) Create a table for the test
> create table test666(id int primary key, number int);
>
> 2) Use pg_recvlogical to subscribe
> 2.1) Create a replication slot
> pg_recvlogical -h 9.223.28.176 -p 11002 -d test -U tbase --slot wal2json_pg_test02 --create-slot -P wal2json
> 2.2) Start subscribing, and use filter-tables to filter the test table test666
> pg_recvlogical -h 9.223.28.176 -p 11002 -d test -U tbase --slot wal2json_pg_test02 --start -o pretty-print=1 -o format-version=2 -o filter-tables="public.test666" -f -
>
> 3) Insert a lot of data into the table in one transaction
> insert into test666 values (generate_series(1000000001,2000000000), generate_series(1000000001,2000000000));
> It may take a long time, maybe 50 minutes.
>
> 4) Use pg_replication_slots to get the wal sender process pid (active_pid)
> select *, pg_current_wal_lsn() from pg_replication_slots;
>
> 5) When step 3 finishes, use pstack to watch the wal sender process stack; when it is looping in ReorderBufferCommit, kill the pg_recvlogical started in step 2.2
> pstack 80743
> #0  0x0000000000c400d9 in ResourceArrayAdd (resarr=0x1576970, value=139941599240896) at resowner.c:359
> #1  0x0000000000c3fef4 in ResourceArrayEnlarge (resarr=0x1576970) at resowner.c:252
> #2  0x0000000000c411f4 in ResourceOwnerEnlargeRelationRefs (owner=0x15768f0) at resowner.c:1129
> #3  0x0000000000be0899 in RelationIncrementReferenceCount (rel=0x7f46b14fe2c0) at relcache.c:2589
> #4  0x0000000000be05da in RelationIdGetRelation (relationId=1177785) at relcache.c:2451
> #5  0x00007f4e0d649b9e in pg_decode_write_tuple (ctx=0x13fc5a8, relation=0x7f46b14fcf38, tuple=0x315b910, kind=PGOUTPUTJSON_IDENTITY) at wal2json.c:1503
> #6  0x00007f4e0d64a514 in pg_decode_write_change (ctx=0x13fc5a8, txn=0x15957a8, relation=0x7f46b14fcf38, change=0x33bc9f8) at wal2json.c:1747
> #7  0x00007f4e0d64a9db in pg_decode_change_v2 (ctx=0x13fc5a8, txn=0x15957a8, relation=0x7f46b14fcf38, change=0x33bc9f8) at wal2json.c:1865
> #8  0x00007f4e0d648b2b in pg_decode_change (ctx=0x13fc5a8, txn=0x15957a8, relation=0x7f46b14fcf38, change=0x33bc9f8) at wal2json.c:1098
> #9  0x00000000009cf3d1 in change_cb_wrapper (cache=0x1410fb8, txn=0x15957a8, relation=0x7f46b14fcf38, change=0x33bc9f8) at logical.c:730
> #10 0x00000000009d75df in ReorderBufferCommit (rb=0x1410fb8, xid=174028801, commit_lsn=44579135148872, end_lsn=44579135148936, commit_time=712222669188947, origin_id=0, origin_lsn=0) at reorderbuffer.c:1445
> #11 0x00000000009cb15b in DecodeCommit (ctx=0x13fc5a8, buf=0x7fff2574d600, parsed=0x7fff2574d530, xid=174028801) at decode.c:713
> #12 0x00000000009ca76b in DecodeXactOp (ctx=0x13fc5a8, buf=0x7fff2574d600) at decode.c:335
> #13 0x00000000009ca401 in LogicalDecodingProcessRecord (ctx=0x13fc5a8, record=0x13fc880) at decode.c:175
> #14 0x00000000009ead45 in XLogSendLogical () at walsender.c:3152
> #15 0x00000000009ea09f in WalSndLoop (send_data=0x9eaca9 <XLogSendLogical>) at walsender.c:2520
> #16 0x00000000009e8aed in StartLogicalReplication (cmd=0x151bb48) at walsender.c:1445
> #17 0x00000000009e944e in exec_replication_command (cmd_string=0x1520408 "START_REPLICATION SLOT subs_f524p217b7_dn0001 LOGICAL 285C/F8FFEA50 ( \"format-version\" '2' )") at walsender.c:1887
> #18 0x0000000000a71ed8 in PostgresMain (argc=1, argv=0x1419f58, dbname=0x1419f28 "prd_jxcard_lingwang", username=0x1419f08 "mingchen_read") at postgres.c:6118
> #19 0x00000000009b0faa in BackendRun (port=0x14135b0) at postmaster.c:5071
> #20 0x00000000009b0753 in BackendStartup (port=0x14135b0) at postmaster.c:4743
> #21 0x00000000009ac3ae in ServerLoop () at postmaster.c:1995
> #22 0x00000000009aba35 in PostmasterMain (argc=5, argv=0x13dc770) at postmaster.c:1603
> #23 0x0000000000878c44 in main (argc=5, argv=0x13dc770) at main.c:233
>
> 6) Use pg_replication_slots to query the replication slot information; the active_pid is still active. Use pg_terminate_backend to terminate the active_pid; it returns true, but the active_pid is still active.
>
> 7) The active_pid wal sender process exits only after a long wait.

Hmm, I could not reproduce this behavior in my environment. Since we have CHECK_FOR_INTERRUPTS in WalSndWriteData() and the walsender exits if it fails to flush, I think the walsender process should exit if it either receives SIGTERM or fails to send the data.

Did you see the log "LOG: could not send data to client: Broken pipe" after terminating pg_recvlogical?

If it's reproducible in your environment, could you get the stack trace of the walsender process that keeps running? Also, please try with the latest version, 10.22.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
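The interrupt handling referred to above follows the usual PostgreSQL pattern: a signal handler only sets a flag, long-running loops poll that flag (the role CHECK_FOR_INTERRUPTS() plays inside WalSndWriteData()), and the loop also gives up when sending to the client fails. The following self-contained C program is only a minimal sketch of that general pattern, not the walsender's actual code; the names terminate_requested and handle_sigterm are invented for the example.

```c
/*
 * Illustrative sketch (not PostgreSQL source): a SIGTERM handler sets a
 * flag, the send loop polls the flag between units of work, and the loop
 * also ends when writing to the client fails (e.g. a broken pipe after
 * the client has gone away).
 */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t terminate_requested = 0;

static void
handle_sigterm(int signo)
{
    (void) signo;
    terminate_requested = 1;    /* do nothing else inside the handler */
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigterm;
    sigaction(SIGTERM, &sa, NULL);

    /* Ignore SIGPIPE so a broken pipe shows up as a write() error. */
    signal(SIGPIPE, SIG_IGN);

    for (;;)
    {
        /* Poll the flag, as CHECK_FOR_INTERRUPTS() does in the real loop. */
        if (terminate_requested)
        {
            fprintf(stderr, "terminating on SIGTERM\n");
            break;
        }

        /* Stand-in for sending one decoded change to the client. */
        if (write(STDOUT_FILENO, "change\n", 7) < 0)
        {
            fprintf(stderr, "could not send data to client: %s\n",
                    strerror(errno));
            break;
        }

        sleep(1);               /* stand-in for decoding the next record */
    }

    return 0;
}
```

Pipe its stdout into a reader and then either send the process SIGTERM or kill the reader: both paths end the loop promptly, which is the behavior the walsender is expected to show when pg_terminate_backend is called or pg_recvlogical is killed.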