Re: [HACKERS] snapbuild woes - Mailing list pgsql-hackers
From:           Erik Rijkers
Subject:        Re: [HACKERS] snapbuild woes
Msg-id:         076a3e4c3d4fef091522d3cd7e50c7eb@xs4all.nl
In response to: Re: [HACKERS] snapbuild woes (Andres Freund <andres@anarazel.de>)
List:           pgsql-hackers
On 2017-05-05 02:00, Andres Freund wrote:
> Could you have a look?

Running tests with these three patches (on top of 44c528810):

  0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
  0002-WIP-Possibly-more-robust-snapbuild-approach.patch
  fix-statistics-reporting-in-logical-replication-work.patch

I test with 15-minute pgbench runs while there is a logical replication connection. Primary and replica are on the same machine. I have seen errors on 3 different machines (where "error" means: at least 1 of the 4 pgbench tables is not md5-equal). Better, faster machines seem to yield fewer errors.

Normally I see one process in state 'streaming' in pg_stat_replication (on the master):

  pid   | wal         | replay_loc  | diff     | state     | app     | sync_state
  16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async

Often there are another two processes in pg_stat_replication that remain in state 'startup'. In failing sessions the 'streaming'-state process is missing; there are only the two processes that are and remain in 'startup'.

FWIW, below is the output of a successful and a failed run:

------------------ successful run:

creating tables...
1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s)
2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' publication pub1 with (disabled);
alter subscription sub1 enable;

-- pgbench -c 90 -j 8 -T 900 -P 180 -n    -- scale 25
progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785
progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440
progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130
progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841
progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 38175
latency average = 2128.606 ms
latency stddev = 3948.634 ms
tps = 42.151205 (including connections establishing)
tps = 42.151589 (excluding connections establishing)

-- waiting 0s... (always)
port | pg_stat_replication | pid  | wal         | replay_loc  | diff | ?column? | state     | app     | sync_state
6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 |    0 | t        | streaming | derail2 | async

2017.05.08 23:19:22
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e master
6973 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e replica ok
bee2312c7 2017.05.08 23:20:48

port | pg_stat_replication | pid  | wal         | replay_loc  | diff     | ?column? | state     | app     | sync_state
6972 | pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 95357856 | f        | streaming | derail2 | async
------------------------------------

------------------ failure:

creating tables...
1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s)
2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' publication pub1 with (disabled);
alter subscription sub1 enable;

port | pg_stat_replication | pid   | wal         | replay_loc | diff | ?column? | state   | app     | sync_state
6972 | pg_stat_replication | 11945 | 18/5E2913D0 |            |      |          | catchup | derail2 | async

-- pgbench -c 90 -j 8 -T 900 -P 180 -n    -- scale 25
progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815
progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231
progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400
progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066
progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 58846
latency average = 1378.259 ms
latency stddev = 2304.159 ms
tps = 65.224168 (including connections establishing)
tps = 65.224788 (excluding connections establishing)

-- waiting 0s... (always)
port | pg_stat_replication | pid   | wal         | replay_loc | diff | ?column? | state   | app     | sync_state
6972 | pg_stat_replication | 11948 | 18/7469A038 |            |      |          | startup | derail2 | async
6972 | pg_stat_replication | 12372 | 18/7469A038 |            |      |          | startup | derail2 | async
------------------------------------

During my tests I keep an eye on pg_stat_replication (refreshing every 2s), and when I see those two 'startup' lines in pg_stat_replication without any 'streaming' line I know the test is going to fail. I believe this has been true for all failure cases I have seen (except the much rarer stuck DROP SUBSCRIPTION, which is mentioned in another thread).

Sorry, I have not been able to get anything more clear or definitive.

thanks,

Erik Rijkers
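P.S. The 2-second check described above can be sketched as a small shell helper. The port number (6972) comes from the session output; the function name and the exact psql query are my illustration, not the actual test harness:

```shell
#!/bin/sh
# Minimal sketch, not the original test script: succeed only if at
# least one walsender state on stdin is exactly 'streaming'.
has_streaming() {
    # stdin: one state per line, as produced by e.g.
    #   psql -p 6972 -Atc "SELECT state FROM pg_stat_replication"
    grep -qx streaming
}

# Example polling loop (assumes a primary listening on port 6972):
#   while sleep 2; do
#       psql -p 6972 -Atc "SELECT state FROM pg_stat_replication" \
#           | has_streaming \
#           || echo "only 'startup' walsenders: run will likely fail"
#   done
```

With this, the failure signature above (two 'startup' rows, no 'streaming' row) makes has_streaming fail, flagging the run before the md5 comparison does.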