BUG #16125: Crash of PostgreSQL's wal sender during logical replication - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #16125: Crash of PostgreSQL's wal sender during logical replication |
Date | |
Msg-id | 16125-a78dafa032d53f7e@postgresql.org |
Responses | Re: BUG #16125: Crash of PostgreSQL's wal sender during logical replication |
List | pgsql-bugs |
The following bug has been logged on the website:

Bug reference:      16125
Logged by:          Andrey Salnikov
Email address:      andrey.salnikov@dataegret.com
PostgreSQL version: 10.11
Operating system:   Ubuntu 18.04.2 LTS

Description:

Hi,

PostgreSQL's wal sender crashes continuously during logical replication. After each crash the sender restarts, tries to decode the same part of the WAL, and crashes again. The log contains the following lines:

2019-11-16 03:01:39.512 UTC 22235 @ from [vxid: txid:0] [] LOG: server process (PID 26972) was terminated by signal 11: Segmentation fault
--
2019-11-16 03:03:16.240 UTC 8286 @ from [vxid: txid:0] [] LOG: checkpoint complete: wrote 675790 buffers (32.2%); 0 WAL file(s) added, 198 removed, 0 recycled; write=68.768 s, sync=0.011 s, total=69.214 s; sync files=361, longest=0.002 s, average=0.000 s; distance=3164435 kB, estimate=3164435 kB
2019-11-16 03:03:16.251 UTC 22235 @ from [vxid: txid:0] [] LOG: database system is ready to accept connections
2019-11-16 03:03:16.533 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0] [idle] LOG: received replication command: IDENTIFY_SYSTEM
2019-11-16 03:03:16.533 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0] [idle] LOG: received replication command: START_REPLICATION SLOT "rep_slot" LOGICAL 25EE/D6DD72D0 (proto_version '1', publication_names '"rep_pub"')
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0] [idle] LOG: starting logical decoding for slot "rep_slot"
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0] [idle] DETAIL: streaming transactions committing after 25EE/D614FC78, reading WAL from 25EE/D614B580
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0] [idle] LOG: logical decoding found consistent point at 25EE/D614B580
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0] [idle] DETAIL: There are no running transactions.
2019-11-16 03:03:16.732 UTC 22235 @ from [vxid: txid:0] [] LOG: server process (PID 10070) was terminated by signal 11: Segmentation fault
--
2019-11-16 03:03:18.410 UTC 10072 @ from [vxid: txid:0] [] LOG: database system was not properly shut down; automatic recovery in progress
2019-11-16 03:03:18.413 UTC 10072 @ from [vxid: txid:0] [] LOG: invalid record length at 25EE/D6E155C8: wanted 24, got 0
2019-11-16 03:03:18.413 UTC 10072 @ from [vxid: txid:0] [] LOG: redo is not required
2019-11-16 03:03:18.419 UTC 10072 @ from [vxid: txid:0] [] LOG: checkpoint starting: end-of-recovery immediate
2019-11-16 03:03:18.467 UTC 10072 @ from [vxid: txid:0] [] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.028 s, sync=0.000 s, total=0.050 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
2019-11-16 03:03:18.476 UTC 22235 @ from [vxid: txid:0] [] LOG: database system is ready to accept connections
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0] [idle] LOG: received replication command: IDENTIFY_SYSTEM
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0] [idle] LOG: received replication command: START_REPLICATION SLOT "rep_slot" LOGICAL 25EE/D6DD72D0 (proto_version '1', publication_names '"rep_pub"')
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0] [idle] LOG: starting logical decoding for slot "rep_slot"
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0] [idle] DETAIL: streaming transactions committing after 25EE/D614FC78, reading WAL from 25EE/D614B580
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0] [idle] LOG: logical decoding found consistent point at 25EE/D614B580
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0] [idle] DETAIL: There are no running transactions.
2019-11-16 03:03:18.949 UTC 22235 @ from [vxid: txid:0] [] LOG: server process (PID 10112) was terminated by signal 11: Segmentation fault

Core dump contains:

Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/92/b3b790a5135245f2cbb2a0a2f48fc7ff9f689a.debug...done.
done.
[New LWP 13678]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 10/main: wal sender process replication 1.1.1.1(43224) idle '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  heap_deform_tuple (tuple=tuple@entry=0x8, tupleDesc=tupleDesc@entry=0x555713b2e638, values=values@entry=0x555713b7ca48, isnull=isnull@entry=0x555713b7cf98 "") at ./build/../src/backend/access/common/heaptuple.c:936
936     ./build/../src/backend/access/common/heaptuple.c: No such file or directory.
(gdb) bt
#0  heap_deform_tuple (tuple=tuple@entry=0x8, tupleDesc=tupleDesc@entry=0x555713b2e638, values=values@entry=0x555713b7ca48, isnull=isnull@entry=0x555713b7cf98 "") at ./build/../src/backend/access/common/heaptuple.c:936
#1  0x00005557127c918b in ReorderBufferToastReplace (rb=0x555713b7b788, txn=0x555713b982e8, relation=0x7f4b3fafa248, relation=0x7f4b3fafa248, change=0x555713b9a108) at ./build/../src/backend/replication/logical/reorderbuffer.c:2985
#2  ReorderBufferCommit (rb=0x555713b7b788, xid=xid@entry=1667601527, commit_lsn=commit_lsn@entry=41707737345720, end_lsn=end_lsn@entry=41707737345792, commit_time=commit_time@entry=627188499326201, origin_id=origin_id@entry=0, origin_lsn=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1585
#3  0x00005557127bef32 in DecodeCommit (xid=1667601527, parsed=0x7ffd288fe4a0, buf=<synthetic pointer>, ctx=0x555713a9ca08) at ./build/../src/backend/replication/logical/decode.c:614
#4  DecodeXactOp (buf=<synthetic pointer>, ctx=0x555713a9ca08) at ./build/../src/backend/replication/logical/decode.c:244
#5  LogicalDecodingProcessRecord (ctx=0x555713a9ca08, record=<optimized out>) at ./build/../src/backend/replication/logical/decode.c:113
#6  0x00005557127d0b86 in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2797
#7  0x00005557127d232f in WalSndLoop (send_data=send_data@entry=0x5557127d0b10 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2157
#8  0x00005557127d30aa in StartLogicalReplication (cmd=0x555713b499d8) at ./build/../src/backend/replication/walsender.c:1105
#9  exec_replication_command (cmd_string=cmd_string@entry=0x555713b4da38 "START_REPLICATION SLOT \"pgprod_office_basic\" LOGICAL 25EE/D6DD72D0 (proto_version '1', publication_names '\"pgprod_master_basic\"')") at ./build/../src/backend/replication/walsender.c:1536
#10 0x000055571281de3a in PostgresMain (argc=<optimized out>, argv=argv@entry=0x555713add8d8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4113
#11 0x000055571255c72c in BackendRun (port=0x555713ad56d0) at ./build/../src/backend/postmaster/postmaster.c:4408
#12 BackendStartup (port=0x555713ad56d0) at ./build/../src/backend/postmaster/postmaster.c:4080
#13 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1756
#14 0x00005557127ada21 in PostmasterMain (argc=9, argv=0x555713a7d190) at ./build/../src/backend/postmaster/postmaster.c:1364
#15 0x000055571255e422 in main (argc=9, argv=0x555713a7d190) at ./build/../src/backend/main/main.c:228
(gdb) quit

This behaviour does not depend on the specific data in the tables: we see it in different databases with different sets of tables in their publications. It looks like a real issue in logical replication.

I will be happy to provide additional information about this issue, but I need to know what else to collect to help solve the problem.

Regards,
Andrey.
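P.S. For what it's worth, frame #0 shows heap_deform_tuple() being handed tuple=0x8, an obviously invalid pointer coming out of ReorderBufferToastReplace(), so the segfault presumably fires as soon as the function reads the tuple header through that pointer. Below is a minimal standalone C sketch of that failing pattern; the struct definitions are simplified stand-ins of my own, not the real PostgreSQL declarations:

#include <stdio.h>

/* Simplified stand-ins for HeapTupleHeaderData/HeapTupleData,
 * only to illustrate the pointer chain that gets dereferenced. */
typedef struct HeapTupleHeaderData { int dummy; } *HeapTupleHeader;

typedef struct HeapTupleData
{
    unsigned int    t_len;   /* length of the tuple */
    HeapTupleHeader t_data;  /* pointer to the tuple header */
} HeapTupleData;

typedef HeapTupleData *HeapTuple;

static void deform_like(HeapTuple tuple)
{
    /* heap_deform_tuple()-style code reads tuple->t_data first;
     * with tuple == (HeapTuple) 0x8 this read faults immediately. */
    HeapTupleHeader tup = tuple->t_data;
    printf("t_data = %p\n", (void *) tup);
}

int main(void)
{
    HeapTuple bogus = (HeapTuple) 0x8;  /* the value seen in the core dump */
    deform_like(bogus);                 /* dies here with SIGSEGV */
    return 0;
}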