Re: [HACKERS] snapbuild woes - Mailing list pgsql-hackers

From:           Petr Jelinek
Subject:        Re: [HACKERS] snapbuild woes
Date:
Msg-id:         7d3b9bb1-41a9-1f9a-43ca-c4bb271333a1@2ndquadrant.com
In response to: Re: [HACKERS] snapbuild woes (Erik Rijkers <er@xs4all.nl>)
Responses:      Re: [HACKERS] snapbuild woes
List:           pgsql-hackers
On 09/05/17 19:54, Erik Rijkers wrote:
> On 2017-05-09 11:50, Petr Jelinek wrote:
>
>> I rebased the above mentioned patch to apply to the patches Andres sent,
>> if you could try to add it on top of what you have and check if it still
>> fails, that would be helpful.
>
> It still fails.
>
> With these patches
>
>   - 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch  +
>   - 00002-WIP-Possibly-more-robust-snapbuild-approach.patch      +
>   - fix-statistics-reporting-in-logical-replication-work.patch   +
>   - Skip-unnecessary-snapshot-builds.patch
>
> built again on top of 44c528810a1 (so I had to add the
> 'fix-statistics-rep*' patch because without it I immediately got that
> Assertion failure again).
>
> As always most runs succeed (especially on this large 192GB 16-core
> server).
>
> But attached is an output file of a number of runs of my
> pgbench_derail2.sh test.
>
> Overall result:
>
> -- out_20170509_1635.txt
>   3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n  -- scale 25
>   2 -- All is well.
>   1 -- Not good, but breaking out of wait (21 times no change)
>
> I broke it off after iteration 4, so 5 never ran, and iteration 1
> failed due to a mistake in the harness (something stupid I did) - not
> interesting.
>
> Iteration 2 succeeds (eventually has 'replica ok').
>
> Iteration 3 succeeds (eventually has 'replica ok').
>
> Iteration 4 fails.
> Just after 'alter subscription sub1 enable' I caught (as is usual)
> pg_stat_replication.state as 'catchup'. So far so good.
> After the 15-minute pgbench run pg_stat_replication has only 2
> 'startup' lines (and none 'catchup' or 'streaming'):
>
>  port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
>  6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
>  6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async
>
> (that's from:
>   select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
>        , pg_current_wal_location() wal, replay_location replay_loc
>        , pg_current_wal_location() - replay_location as diff
>        , pg_current_wal_location() <= replay_location
>        , state, application_name as app, sync_state
>     from pg_stat_replication
> )
>
> This remains in this state for as long as my test program lets it
> (i.e., 20 x 30s, or something like that, and then the loop is exited);
> in the output file it says: 'Not good, but breaking out of wait'.
>
> Below is the accompanying ps output (with the 2 'deranged senders', as
> Jeff Janes would surely call them):
>
> UID        PID     PPID  C STIME TTY    STAT TIME CMD
> rijkers  107147       1  0 17:11 pts/35 S+   0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
> rijkers  107149  107147  0 17:11 ?      Ss   0:00  \_ postgres: logger process
> rijkers  107299  107147  0 17:11 ?      Ss   0:01  \_ postgres: checkpointer process
> rijkers  107300  107147  0 17:11 ?      Ss   0:00  \_ postgres: writer process
> rijkers  107301  107147  0 17:11 ?      Ss   0:00  \_ postgres: wal writer process
> rijkers  107302  107147  0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
> rijkers  107303  107147  0 17:11 ?      Ss   0:00  \_ postgres: stats collector process
> rijkers  107304  107147  0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
> rijkers  108348  107147  0 17:12 ?      Ss   0:01  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
> rijkers  108350  107147  0 17:12 ?      Ss   0:00  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
> rijkers  107145       1  0 17:11 pts/35 S+   0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
> rijkers  107151  107145  0 17:11 ?      Ss   0:00  \_ postgres: logger process
> rijkers  107160  107145  0 17:11 ?      Ss   0:08  \_ postgres: checkpointer process
> rijkers  107161  107145  0 17:11 ?      Ss   0:07  \_ postgres: writer process
> rijkers  107162  107145  0 17:11 ?      Ss   0:02  \_ postgres: wal writer process
> rijkers  107163  107145  0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
> rijkers  107164  107145  0 17:11 ?      Ss   0:02  \_ postgres: stats collector process
> rijkers  107165  107145  0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
> rijkers  108349  107145  0 17:12 ?      Ss   0:27  \_ postgres: wal sender process rijkers [local] idle
> rijkers  108351  107145  0 17:12 ?      Ss   0:26  \_ postgres: wal sender process rijkers [local] idle
>
> I have had no time to add (or view) any CPUinfo.
>

Ah okay, so this is the same issue that's reported by both Masahiko Sawada [1] and Jeff Janes [2].

[1] https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com
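As an aside, the two stuck table-sync workers visible in the ps listing can also be checked from the subscriber's catalogs. A minimal sketch, assuming a PostgreSQL 10 subscriber and using the pg_subscription / pg_subscription_rel catalogs:

    -- List subscribed tables whose initial synchronization has not
    -- reached the 'ready' state (srsubstate: i = init, d = data copy,
    -- s = synchronized, r = ready).
    select s.subname,
           sr.srrelid::regclass as table_name,
           sr.srsubstate
      from pg_subscription_rel sr
      join pg_subscription s on s.oid = sr.srsubid
     where sr.srsubstate <> 'r';

In the failing iteration above, the relations being synced by workers 108348 and 108350 would be expected to show up here with a state other than 'r'.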
--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
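The 'Not good, but breaking out of wait' message quoted above comes from the harness's wait loop (roughly 20 polls, 30 seconds apart). A minimal sketch of such a loop, assuming it is run as a PL/pgSQL DO block against the publisher (the actual pgbench_derail2.sh code is not shown in this thread):

    -- Poll pg_stat_replication up to 20 times, 30 seconds apart, and
    -- stop as soon as every wal sender reports state 'streaming'.
    do $$
    declare
        stuck integer;
    begin
        for i in 1..20 loop
            select count(*) into stuck
              from pg_stat_replication
             where state <> 'streaming';
            if stuck = 0 then
                raise notice 'all wal senders are streaming';
                return;
            end if;
            perform pg_sleep(30);
        end loop;
        raise notice 'not good, but breaking out of wait';
    end
    $$;

In the failing case above, both wal senders stay in 'startup', so such a loop would run through all 20 iterations and give up.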