Re: [HACKERS] Logical replication existing data copy - Mailing list pgsql-hackers
From | Erik Rijkers |
---|---|
Subject | Re: [HACKERS] Logical replication existing data copy |
Date | |
Msg-id | 9f9b45f3de725a1a2f3e19090f1c3e1a@xs4all.nl Whole thread Raw |
In response to | Re: [HACKERS] Logical replication existing data copy (Erik Rijkers <er@xs4all.nl>) |
Responses |
Re: [HACKERS] Logical replication existing data copy
|
List | pgsql-hackers |
On 2017-02-28 07:38, Erik Rijkers wrote: > On 2017-02-27 15:08, Petr Jelinek wrote: > 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch > + > 0002-Fix-after-trigger-execution-in-logical-replication.patch > + > 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch > + > snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch > + > snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+ > snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch > + > snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch > + > snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch > + > 0001-Logical-replication-support-for-initial-data-copy-v6.patch This is the most frequent error that happens while doing pgbench-runs over logical replication: I run it continuously all day, and every few hours an error occurs of the kind seen below: a table (pgbench_history, mostly) ends up 1 row short (673466 instead of 673467). I have the script wait a long time before calling it an error (because in theory it could still 'finish', and end successfully (although that has not happened yet, once the system got into this state). -- pgbench -c 16 -j 8 -T 120 -P 24 -n -M simple -- scale 25 [...] 6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6 -- wait another 5 s (total 20 s) (unchanged 1) -- getting md5 (cb) 6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6 -- wait another 5 s (total 25 s) (unchanged 2) -- getting md5 (cb) 6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6 -- wait another 5 s (total 30 s) (unchanged 3) -- getting md5 (cb) 6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6 -- wait another 5 s (total 35 s) (unchanged 4) I gathered some info in this (proabably deadlocked) state in the hope there is something suspicious in there: UID PID PPID C STIME TTY STAT TIME CMD rijkers 71203 1 0 20:06 pts/57 S 0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/data -p 6973 -c wal_level=replica [...] rijkers 71214 71203 0 20:06 ? Ss 0:00 \_ postgres: logger process rijkers 71216 71203 0 20:06 ? Ss 0:00 \_ postgres: checkpointer process rijkers 71217 71203 0 20:06 ? Ss 0:00 \_ postgres: writer process rijkers 71218 71203 0 20:06 ? Ss 0:00 \_ postgres: wal writer process rijkers 71219 71203 0 20:06 ? Ss 0:00 \_ postgres: autovacuum launcher process rijkers 71220 71203 0 20:06 ? Ss 0:00 \_ postgres: stats collector process rijkers 71221 71203 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher rijkers 71222 71203 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication worker 30042 rijkers 71201 1 0 20:06 pts/57 S 0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/data -p 6972 -c wal_level=logical [...] rijkers 71206 71201 0 20:06 ? Ss 0:00 \_ postgres: logger process rijkers 71208 71201 0 20:06 ? Ss 0:00 \_ postgres: checkpointer process rijkers 71209 71201 0 20:06 ? Ss 0:00 \_ postgres: writer process rijkers 71210 71201 0 20:06 ? Ss 0:00 \_ postgres: wal writer process rijkers 71211 71201 0 20:06 ? Ss 0:00 \_ postgres: autovacuum launcher process rijkers 71212 71201 0 20:06 ? Ss 0:00 \_ postgres: stats collector process rijkers 71213 71201 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher rijkers 71223 71201 0 20:06 ? Ss 0:00 \_ postgres: wal sender process rijkers [local] idle -- replica: port | shared_buffers | work_mem | m_w_m | e_c_s ------+----------------+----------+-------+------- 6973 | 100MB | 50MB | 2GB | 64GB (1 row) select current_setting('port') as port , datname as db , to_char(pg_database_size(datname), '9G999G999G999G999') || ' (' || pg_size_pretty(pg_database_size(datname)) || ')'as dbsize , pid , application_name as app , xact_start , query_start , regexp_replace( cast(now() - query_start as text), E'\.[[:digit:]]+\$', '') as duration , state , wait_event_type as wt_evt_type , wait_event as wt_evt from pg_stat_activity where query !~ 'pg_stat_activity' -- and query != '<IDLE>' port | db | dbsize | pid | app | xact_start | query_start | duration | state | wt_evt_type | wt_evt ------+--------+-----------------------------+-------+----------------------------------+------------+-------------+----------+-------+-------------+------------------ 6973| testdb | 817,243,955 (779 MB) | 71222 | logical replication worker 30042 | | | | idle | Activity | LogicalApplyMain (1 row) select * from pg_locks; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath ------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+-----------------+---------+---------- relation | 16384 | 11547 | | | | | | | | 4/568 | 77713 | AccessShareLock | t | t virtualxid | | | | | 4/568 | | | | | 4/568 | 77713 | ExclusiveLock | t | t (2 rows) select * from pg_stat_subscription; subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time -------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+------------------------------- 30042| sub1 | 71222 | | 29/6C91C980 | 2017-02-28 20:11:50.818774+01 | 2017-02-28 20:11:50.818845+01 | 29/6C91C980 | 2017-02-28 20:11:50.818774+01 (1 row) select * from pg_subscription; subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications ---------+---------+----------+------------+-------------+-------------+----------------- 16384 | sub1 | 10 |t | port=6972 | sub1 | {pub1} (1 row) select * from pg_subscription_rel; srsubid | srrelid | srsubstate | srsublsn ---------+---------+------------+---------- 30042 | 30022 | r | 30042 | 30019 | r | 30042 | 30030 | r | 30042 | 30025 | r | (4 rows) -- master \c - - - 6972 select pid , usesysid /* , usename */ , application_name /* , client_addr */ /* , client_hostname */ /* , client_port */ , backend_start , backend_xmin , state , sent_location , write_location , flush_location , replay_location , sync_priority , sync_state from pg_stat_replication; pid | usesysid | application_name | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state -------+----------+------------------+-----------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------ 71223| 10 | sub1 | 2017-02-28 20:06:50.3357+01 | | streaming | 29/88BC9BB0 | 29/88BC9BB0 | 29/88BC9BB0 | 29/88BC9BB0 | 0 | async (1 row) Erik Rijkers
pgsql-hackers by date: