Re: [HACKERS] snapbuild woes - Mailing list pgsql-hackers

From:           Petr Jelinek
Subject:        Re: [HACKERS] snapbuild woes
Date:
Msg-id:         7d3b9bb1-41a9-1f9a-43ca-c4bb271333a1@2ndquadrant.com
In response to: Re: [HACKERS] snapbuild woes (Erik Rijkers <er@xs4all.nl>)
Responses:      Re: [HACKERS] snapbuild woes
List:           pgsql-hackers
On 09/05/17 19:54, Erik Rijkers wrote:
> On 2017-05-09 11:50, Petr Jelinek wrote:
>
>> I rebased the above mentioned patch to apply to the patches Andres sent,
>> if you could try to add it on top of what you have and check if it still
>> fails, that would be helpful.
>
> It still fails.
>
> With these patches
>
>   - 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch  +
>   - 00002-WIP-Possibly-more-robust-snapbuild-approach.patch      +
>   - fix-statistics-reporting-in-logical-replication-work.patch   +
>   - Skip-unnecessary-snapshot-builds.patch
>
> built again on top of 44c528810a1 (so I had to add the
> 'fix-statistics-rep*' patch because without it I immediately got that
> Assertion failure again).
>
> As always most runs succeed (especially on this large 192GB 16-core
> server).
>
> But attached is an output file of a number of runs of my
> pgbench_derail2.sh test.
>
> Overall result:
>
> -- out_20170509_1635.txt
>   3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n  -- scale 25
>   2 -- All is well.
>   1 -- Not good, but breaking out of wait (21 times no change)
>
> I broke it off after iteration 4, so 5 never ran, and iteration 1
> failed due to a mistake in the harness (something stupid I did) - not
> interesting.
>
> Iteration 2 succeeds (eventually has 'replica ok').
>
> Iteration 3 succeeds (eventually has 'replica ok').
>
> Iteration 4 fails.
> Just after 'alter subscription sub1 enable' I caught (as is usual)
> pg_stat_replication.state as 'catchup'. So far so good.
> After the 15-minute pgbench run pg_stat_replication has only 2
> 'startup' lines (and none 'catchup' or 'streaming'):
>
>  port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
>  6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
>  6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async
>
> (that's from:
>   select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
>        , pg_current_wal_location() wal, replay_location replay_loc
>        , pg_current_wal_location() - replay_location as diff
>        , pg_current_wal_location() <= replay_location
>        , state, application_name as app, sync_state
>     from pg_stat_replication
> )
>
> This remains in this state for as long as my test program lets it
> (i.e., 20 x 30s, or something like that, and then the loop is exited);
> in the output file it says: 'Not good, but breaking out of wait'.
>
> Below is the accompanying ps output (with the 2 'deranged senders', as
> Jeff Janes would surely call them):
>
> UID        PID     PPID  C STIME TTY    STAT TIME CMD
> rijkers  107147       1  0 17:11 pts/35 S+   0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
> rijkers  107149  107147  0 17:11 ?      Ss   0:00  \_ postgres: logger process
> rijkers  107299  107147  0 17:11 ?      Ss   0:01  \_ postgres: checkpointer process
> rijkers  107300  107147  0 17:11 ?      Ss   0:00  \_ postgres: writer process
> rijkers  107301  107147  0 17:11 ?      Ss   0:00  \_ postgres: wal writer process
> rijkers  107302  107147  0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
> rijkers  107303  107147  0 17:11 ?      Ss   0:00  \_ postgres: stats collector process
> rijkers  107304  107147  0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
> rijkers  108348  107147  0 17:12 ?      Ss   0:01  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
> rijkers  108350  107147  0 17:12 ?      Ss   0:00  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
> rijkers  107145       1  0 17:11 pts/35 S+   0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
> rijkers  107151  107145  0 17:11 ?      Ss   0:00  \_ postgres: logger process
> rijkers  107160  107145  0 17:11 ?      Ss   0:08  \_ postgres: checkpointer process
> rijkers  107161  107145  0 17:11 ?      Ss   0:07  \_ postgres: writer process
> rijkers  107162  107145  0 17:11 ?      Ss   0:02  \_ postgres: wal writer process
> rijkers  107163  107145  0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
> rijkers  107164  107145  0 17:11 ?      Ss   0:02  \_ postgres: stats collector process
> rijkers  107165  107145  0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
> rijkers  108349  107145  0 17:12 ?      Ss   0:27  \_ postgres: wal sender process rijkers [local] idle
> rijkers  108351  107145  0 17:12 ?      Ss   0:26  \_ postgres: wal sender process rijkers [local] idle
>
> I have had no time to add (or view) any CPUinfo.
>

Ah okay, so this is the same issue that's reported by both Masahiko Sawada [1] and Jeff Janes [2].

[1] https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com
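As an aside, the two stuck table-sync workers visible in the ps listing can also be checked from the subscriber's catalogs. A minimal sketch, assuming a PostgreSQL 10 subscriber and using the pg_subscription / pg_subscription_rel catalogs:

    -- List subscribed tables whose initial synchronization has not
    -- reached the 'ready' state (srsubstate: i = init, d = data copy,
    -- s = synchronized, r = ready).
    select s.subname,
           sr.srrelid::regclass as table_name,
           sr.srsubstate
      from pg_subscription_rel sr
      join pg_subscription s on s.oid = sr.srsubid
     where sr.srsubstate <> 'r';

In the failing iteration above, the relations being synced by workers 108348 and 108350 would be expected to show up here with a state other than 'r'.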
--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
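The 'Not good, but breaking out of wait' message quoted above comes from the harness's wait loop (roughly 20 polls, 30 seconds apart). A minimal sketch of such a loop, assuming it is run as a PL/pgSQL DO block against the publisher (the actual pgbench_derail2.sh code is not shown in this thread):

    -- Poll pg_stat_replication up to 20 times, 30 seconds apart, and
    -- stop as soon as every wal sender reports state 'streaming'.
    do $$
    declare
        stuck integer;
    begin
        for i in 1..20 loop
            select count(*) into stuck
              from pg_stat_replication
             where state <> 'streaming';
            if stuck = 0 then
                raise notice 'all wal senders are streaming';
                return;
            end if;
            perform pg_sleep(30);
        end loop;
        raise notice 'not good, but breaking out of wait';
    end
    $$;

In the failing case above, both wal senders stay in 'startup', so such a loop would run through all 20 iterations and give up.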