Further stabilization of isolationtester's timeouts test - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Further stabilization of isolationtester's timeouts test |
Date | |
Msg-id | 8693.1464314819@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Further stabilization of isolationtester's timeouts
test
|
List | pgsql-hackers |
I've been noticing recently that certain buildfarm members sometimes fail the "timeouts" isolation test with this symptom: *** /home/pgbf/buildroot/HEAD/pgsql.build/src/test/isolation/expected/timeouts.out Mon May 16 23:45:12 2016 --- /home/pgbf/buildroot/HEAD/pgsql.build/src/test/isolation/results/timeouts.out Mon May 16 23:53:08 2016 *************** *** 70,73 **** step slto: SET lock_timeout = 6000; SET statement_timeout = 5000; step update: DELETE FROM accounts WHERE accountid = 'checking'; <waiting ...> step update: <... completed> ! ERROR: canceling statement due to statement timeout --- 70,73 ---- step slto: SET lock_timeout = 6000; SET statement_timeout = 5000; step update: DELETE FROM accounts WHERE accountid = 'checking'; <waiting ...> step update: <... completed> ! ERROR: canceling statement due to lock timeout as for example here and here: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2016-05-16%2021%3A45%3A06 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2016-05-26%2016%3A45%3A03 A look at the postmaster log is informative: 2016-05-16 23:52:12.612 CEST [573a40e9.24a8:26] LOG: statement: BEGIN ISOLATION LEVEL READ COMMITTED; 2016-05-16 23:52:12.612 CEST [573a40e9.10ef:50] LOG: statement: BEGIN ISOLATION LEVEL READ COMMITTED; 2016-05-16 23:52:12.613 CEST [573a40e9.24a8:27] LOG: statement: UPDATE accounts SET balance = balance + 100; 2016-05-16 23:52:12.613 CEST [573a40e9.10ef:51] LOG: statement: SET lock_timeout = 6000; SET statement_timeout = 5000; 2016-05-16 23:52:12.614 CEST [573a40e9.10ef:52] LOG: statement: DELETE FROM accounts WHERE accountid = 'checking'; 2016-05-16 23:52:12.631 CEST [573a40e9.5afe:34] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_blocking_pids($1)&& '{9384,4335}'::integer[] 2016-05-16 23:52:12.631 CEST [573a40e9.5afe:35] DETAIL: parameters: $1 = '4335' 2016-05-16 23:53:08.658 CEST [573a40e9.10ef:53] ERROR: canceling statement due to lock timeout 2016-05-16 23:53:08.658 CEST [573a40e9.10ef:54] CONTEXT: while deleting tuple (0,1) in relation "accounts" 2016-05-16 23:53:08.658 CEST [573a40e9.10ef:55] STATEMENT: DELETE FROM accounts WHERE accountid = 'checking'; 2016-05-16 23:53:08.659 CEST [573a40e9.24a8:28] LOG: statement: ABORT; 2016-05-16 23:53:08.659 CEST [573a40e9.10ef:56] LOG: statement: ABORT; In this case the process seems to have gone to sleep for just short of a minute rather than the expected 5 seconds. Presumably that just reflects overload on the buildfarm member rather than anything really exciting, but it explains the failure nicely: by the time we got to postgres.c's ProcessInterrupts(), both the lock and statement timeouts had expired, and the code there preferentially reports "lock timeout" in that case. What I propose we do about this is make ProcessInterrupts() compare the scheduled timeout times if both timeouts have triggered, and report whichever event was intended to happen first. We should still apply the "lock error first" tiebreaking rule if there's an exact tie, but that's likely to be rare. The attached patch does that, and also reduces the longer of the two timeouts in these test cases to 5010 ms. That should give us good coverage in the buildfarm of both the case where both timeouts have fired before reaching ProcessInterrupts(), and the case where they have not. We should get bulletproof reproducibility of which timeout is reported, as long as the system clock doesn't go backwards during the test. Unfortunately this doesn't do anything to let us reduce the base 5000-ms timeout :-(. That's driven by the need to be sure that isolationtester has detected that the processes are waiting, which is independent of this problem. I'd like to put this into all branches back to 9.3, since we've seen this type of failure in all of them. Any objections? regards, tom lane diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 68811f1..b185c1b 100644 *** a/src/backend/tcop/postgres.c --- b/src/backend/tcop/postgres.c *************** ProcessInterrupts(void) *** 2909,2914 **** --- 2909,2917 ---- if (QueryCancelPending) { + bool lock_timeout_occurred; + bool stmt_timeout_occurred; + /* * Don't allow query cancel interrupts while reading input from the * client, because we might lose sync in the FE/BE protocol. (Die *************** ProcessInterrupts(void) *** 2929,2945 **** /* * If LOCK_TIMEOUT and STATEMENT_TIMEOUT indicators are both set, we ! * prefer to report the former; but be sure to clear both. */ ! if (get_timeout_indicator(LOCK_TIMEOUT, true)) { - (void) get_timeout_indicator(STATEMENT_TIMEOUT, true); LockErrorCleanup(); ereport(ERROR, (errcode(ERRCODE_LOCK_NOT_AVAILABLE), errmsg("canceling statement due to lock timeout"))); } ! if (get_timeout_indicator(STATEMENT_TIMEOUT, true)) { LockErrorCleanup(); ereport(ERROR, --- 2932,2960 ---- /* * If LOCK_TIMEOUT and STATEMENT_TIMEOUT indicators are both set, we ! * need to clear both, so always fetch both. */ ! lock_timeout_occurred = get_timeout_indicator(LOCK_TIMEOUT, true); ! stmt_timeout_occurred = get_timeout_indicator(STATEMENT_TIMEOUT, true); ! ! /* ! * If both were set, we want to report whichever timeout completed ! * earlier; this ensures consistent behavior if the machine is slow ! * enough that the second timeout triggers before we get here. A tie ! * is arbitrarily broken in favor of reporting a lock timeout. ! */ ! if (lock_timeout_occurred && stmt_timeout_occurred && ! get_timeout_finish_time(STATEMENT_TIMEOUT) < get_timeout_finish_time(LOCK_TIMEOUT)) ! lock_timeout_occurred = false; /* report stmt timeout */ ! ! if (lock_timeout_occurred) { LockErrorCleanup(); ereport(ERROR, (errcode(ERRCODE_LOCK_NOT_AVAILABLE), errmsg("canceling statement due to lock timeout"))); } ! if (stmt_timeout_occurred) { LockErrorCleanup(); ereport(ERROR, diff --git a/src/backend/utils/misc/timeout.c b/src/backend/utils/misc/timeout.c index 3b3f220..7171a7c 100644 *** a/src/backend/utils/misc/timeout.c --- b/src/backend/utils/misc/timeout.c *************** typedef struct timeout_params *** 34,40 **** timeout_handler_proc timeout_handler; TimestampTz start_time; /* time that timeout was last activated */ ! TimestampTz fin_time; /* if active, time it is due to fire */ } timeout_params; /* --- 34,40 ---- timeout_handler_proc timeout_handler; TimestampTz start_time; /* time that timeout was last activated */ ! TimestampTz fin_time; /* time it is, or was last, due to fire */ } timeout_params; /* *************** get_timeout_start_time(TimeoutId id) *** 654,656 **** --- 654,670 ---- { return all_timeouts[id].start_time; } + + /* + * Return the time when the timeout is, or most recently was, due to fire + * + * Note: will return 0 if timeout has never been activated in this process. + * However, we do *not* reset the fin_time when a timeout occurs, so as + * not to create a race condition if SIGALRM fires just as some code is + * about to fetch the value. + */ + TimestampTz + get_timeout_finish_time(TimeoutId id) + { + return all_timeouts[id].fin_time; + } diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index f64921e..260c013 100644 *** a/src/include/utils/timeout.h --- b/src/include/utils/timeout.h *************** extern void disable_all_timeouts(bool ke *** 82,86 **** --- 82,87 ---- /* accessors */ extern bool get_timeout_indicator(TimeoutId id, bool reset_indicator); extern TimestampTz get_timeout_start_time(TimeoutId id); + extern TimestampTz get_timeout_finish_time(TimeoutId id); #endif /* TIMEOUT_H */ diff --git a/src/test/isolation/expected/timeouts.out b/src/test/isolation/expected/timeouts.out index ff64627..dfba324 100644 *** a/src/test/isolation/expected/timeouts.out --- b/src/test/isolation/expected/timeouts.out *************** accountid balance *** 28,34 **** checking 600 savings 600 ! step lsto: SET lock_timeout = 5000; SET statement_timeout = 6000; step locktbl: LOCK TABLE accounts; <waiting ...> step locktbl: <... completed> ERROR: canceling statement due to lock timeout --- 28,34 ---- checking 600 savings 600 ! step lsto: SET lock_timeout = 5000; SET statement_timeout = 5010; step locktbl: LOCK TABLE accounts; <waiting ...> step locktbl: <... completed> ERROR: canceling statement due to lock timeout *************** accountid balance *** 39,45 **** checking 600 savings 600 ! step slto: SET lock_timeout = 6000; SET statement_timeout = 5000; step locktbl: LOCK TABLE accounts; <waiting ...> step locktbl: <... completed> ERROR: canceling statement due to statement timeout --- 39,45 ---- checking 600 savings 600 ! step slto: SET statement_timeout = 5000; SET lock_timeout = 5010; step locktbl: LOCK TABLE accounts; <waiting ...> step locktbl: <... completed> ERROR: canceling statement due to statement timeout *************** ERROR: canceling statement due to lock *** 60,73 **** starting permutation: wrtbl lsto update step wrtbl: UPDATE accounts SET balance = balance + 100; ! step lsto: SET lock_timeout = 5000; SET statement_timeout = 6000; step update: DELETE FROM accounts WHERE accountid = 'checking'; <waiting ...> step update: <... completed> ERROR: canceling statement due to lock timeout starting permutation: wrtbl slto update step wrtbl: UPDATE accounts SET balance = balance + 100; ! step slto: SET lock_timeout = 6000; SET statement_timeout = 5000; step update: DELETE FROM accounts WHERE accountid = 'checking'; <waiting ...> step update: <... completed> ERROR: canceling statement due to statement timeout --- 60,73 ---- starting permutation: wrtbl lsto update step wrtbl: UPDATE accounts SET balance = balance + 100; ! step lsto: SET lock_timeout = 5000; SET statement_timeout = 5010; step update: DELETE FROM accounts WHERE accountid = 'checking'; <waiting ...> step update: <... completed> ERROR: canceling statement due to lock timeout starting permutation: wrtbl slto update step wrtbl: UPDATE accounts SET balance = balance + 100; ! step slto: SET statement_timeout = 5000; SET lock_timeout = 5010; step update: DELETE FROM accounts WHERE accountid = 'checking'; <waiting ...> step update: <... completed> ERROR: canceling statement due to statement timeout diff --git a/src/test/isolation/specs/timeouts.spec b/src/test/isolation/specs/timeouts.spec index 7f821d4..7d157da 100644 *** a/src/test/isolation/specs/timeouts.spec --- b/src/test/isolation/specs/timeouts.spec *************** session "s2" *** 21,28 **** setup { BEGIN ISOLATION LEVEL READ COMMITTED; } step "sto" { SET statement_timeout = 5000; } step "lto" { SET lock_timeout = 5000; } ! step "lsto" { SET lock_timeout = 5000; SET statement_timeout = 6000; } ! step "slto" { SET lock_timeout = 6000; SET statement_timeout = 5000; } step "locktbl" { LOCK TABLE accounts; } step "update" { DELETE FROM accounts WHERE accountid = 'checking'; } teardown { ABORT; } --- 21,28 ---- setup { BEGIN ISOLATION LEVEL READ COMMITTED; } step "sto" { SET statement_timeout = 5000; } step "lto" { SET lock_timeout = 5000; } ! step "lsto" { SET lock_timeout = 5000; SET statement_timeout = 5010; } ! step "slto" { SET statement_timeout = 5000; SET lock_timeout = 5010; } step "locktbl" { LOCK TABLE accounts; } step "update" { DELETE FROM accounts WHERE accountid = 'checking'; } teardown { ABORT; }
pgsql-hackers by date: