Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers
From | Mark Dilger |
---|---|
Subject | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders |
Date | |
Msg-id | 3177FAD5-3600-4C8B-B338-FB108F60455F@gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
|
List | pgsql-hackers |
> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> Whoa. This just turned into a much larger can of worms than I expected. >> How can it be that processes are getting assertion crashes and yet the >> test framework reports success anyway? That's impossibly >> broken/unacceptable. > > I poked into this on my laptop, where I'm able to reproduce both assertion > failures. The reason the one in subtrans.c is not being detected by the > test framework is that it happens late in the run and the test doesn't > do anything more with that server than shut it down. "pg_ctl stop" > actually notices there's a problem; for instance, the log on tern for > this step shows > > ### Stopping node "master" using mode immediate > # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m immediatestop > pg_ctl: PID file "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid"does not exist > Is server running? > # No postmaster PID > > However, PostgresNode::stop blithely ignores the failure exit from > pg_ctl. I think it should use system_or_bail() not just system_log(), > so that we'll notice if the server is not running when we expect it > to be. It's conceivable that there should also be a "stop_if_running" > method that allows the case where the server is not running, but so > far as I can find, no existing TAP test needs such a behavior --- and > it certainly shouldn't be the default. > > The walsender.c crash is harder to detect because the postmaster very > nicely recovers and restarts its children. That's great for robustness, > but it sucks for testing. I think that we ought to disable that in > TAP tests, which we can do by having PostgresNode::init include > "restart_after_crash = off" in the postgresql.conf modifications it > applies. Again, it's conceivable that some tests would not want that, > but there is no existing test that seems to need crash recovery on, > and I don't see a good argument for it being default for test purposes. > > As best I've been able to tell, the reason why the walsender.c crash > is detected when it's detected is that the TAP script chances to try > to connect while the recovery is happening: > > connection error: 'psql: FATAL: the database system is in recovery mode' > while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pmline 1173. > > The window for that is not terribly wide, explaining why the detection > is unreliable even when the crash does occur. > > In short then, I propose the attached patch to make these cases fail > more reliably. We might extend this later to allow the old behaviors > to be explicitly opted-into, but we don't seem to need that today. > > regards, tom lane I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45, and the tests consistently fail (5 out of 5 attempts) for me on my laptop with: /Applications/Xcode.app/Contents/Developer/usr/bin/make -C recovery check for extra in contrib/test_decoding; do /Applications/Xcode.app/Contents/Developer/usr/bin/make -C '../../..'/$extra DESTDIR='/Users/mark/vanilla/bitoctetlength'/tmp_installinstall >>'/Users/mark/vanilla/bitoctetlength'/tmp_install/log/install.log|| exit; done rm -rf /Users/mark/vanilla/bitoctetlength/src/test/recovery/tmp_check/log cd . && TESTDIR='/Users/mark/vanilla/bitoctetlength/src/test/recovery' PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/bin:$PATH" DYLD_LIBRARY_PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/lib"PGPORT='65432' PG_REGRESS='/Users/mark/vanilla/bitoctetlength/src/test/recovery/../../../src/test/regress/pg_regress'prove -I ../../../src/test/perl/-I . t/*.pl t/001_stream_rep.pl .................. ok t/002_archiving.pl ................... ok t/003_recovery_targets.pl ............ ok t/004_timeline_switch.pl ............. Bailout called. Further testing stopped: system pg_ctl failed FAILED--Further testing stopped: system pg_ctl failed make[2]: *** [check] Error 255 make[1]: *** [check-recovery-recurse] Error 2 make: *** [check-world-src/test-recurse] Error 2 The first time, I had some local changes, but I've stashed them and am still getting the error each of these next four times. I'm compiling as follows: make distclean; make clean; ./configure --enable-cassert --enable-tap-tests --enable-depend && make -j4 && make check-world Are the errors expected now? mark
pgsql-hackers by date: