Re: [HACKERS] pgbench regression test failure - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] pgbench regression test failure |
Date | |
Msg-id | 28977.1511125910@sss.pgh.pa.us Whole thread Raw |
In response to | Re: [HACKERS] pgbench regression test failure (Fabien COELHO <coelho@cri.ensmp.fr>) |
Responses |
Re: [HACKERS] pgbench regression test failure
|
List | pgsql-hackers |
Fabien COELHO <coelho@cri.ensmp.fr> writes: > Here is the hopefully right version, which passes tests here. I took a look at this. What I get for the "sleep-100" test case posted in https://postgr.es/m/alpine.DEB.2.20.1709240820160.4999@lancre with current HEAD is progress: 1.0 s, 16.0 tps, lat 37.746 ms stddev 48.730, lag 0.039 ms, 10 skipped progress: 2.0 s, 26.0 tps, lat 27.115 ms stddev 44.672, lag 0.068 ms, 19 skipped progress: 3.0 s, 25.0 tps, lat 32.184 ms stddev 46.915, lag 0.034 ms, 17 skipped progress: 4.0 s, 22.0 tps, lat 22.868 ms stddev 42.167, lag 0.033 ms, 17 skipped transaction type: sleep-100.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 number of transactions per client: 100 number of transactions actually processed: 29/100 number of transactions skipped: 71 (71.000 %) number of transactions above the 1.0 ms latency limit: 29 (29.000 %) latency average = 100.636 ms latency stddev = 0.185 ms rate limit schedule lag: avg 0.044 (max 0.956) ms tps = 22.467484 (including connections establishing) tps = 22.488151 (excluding connections establishing) script statistics: - number of transactions skipped: 0 (-nan%) The patch changes that to progress: 1.0 s, 6.0 tps, lat 100.658 ms stddev 0.193, lag 0.101 ms, 10 skipped progress: 2.0 s, 8.0 tps, lat 100.684 ms stddev 0.264, lag 0.208 ms, 14 skipped progress: 3.0 s, 8.0 tps, lat 100.576 ms stddev 0.026, lag 0.101 ms, 22 skipped progress: 4.0 s, 6.0 tps, lat 100.542 ms stddev 0.080, lag 0.102 ms, 10 skipped transaction type: sleep-100.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 number of transactions per client: 100 number of transactions actually processed: 33/100 number of transactions skipped: 67 (67.000 %) number of transactions above the 1.0 ms latency limit: 33 (33.000 %) latency average = 100.613 ms latency stddev = 0.169 ms rate limit schedule lag: avg 0.043 (max 0.921) ms tps = 7.138174 (including connections establishing) tps = 7.144332 (excluding connections establishing) script statistics: - number of transactions skipped: 67 (67.000%) So we fixed the reported TPS rate, which was nowhere near reality, and the per-script stats are sane now. Good so far, but this still has two problems IMO: 1. The per-script stats shouldn't be printed at all if there's only one script. They're redundant with the overall stats. 2. ISTM that we should report that 100% of the transactions were above the latency limit, not 33%; that is, the appropriate base for the "number of transactions above the latency limit" percentage is the number of actual transactions not the number of scheduled transactions. I also noticed that if I specify "-f sleep-100.sql" more than once, the per-script TPS reports are out of line. This is evidently because that calculation isn't excluding skipped xacts; but if we're going to define tps as excluding skipped xacts, surely we should do so there too. I'm also still exceedingly unhappy about the NaN business. You have this comment in printSimpleStats: /* print NaN if no transactions where executed */ but I find that unduly optimistic. It should really read more like "if no transactions were executed, at best we'll get some platform- dependent spelling of NaN. At worst we'll get a SIGFPE." I do not think that a pedantic argument about NaN being the "correct" answer justifies taking such portability risks, even if I bought that argument, which I don't particularly. It's also inconsistent with this basic decision in printResults: /* Remaining stats are nonsensical if we failed to execute any xacts */ if (total->cnt <= 0) return; If we thought NaN was the "correct" answer for no xacts then we'd just bull ahead and print NaNs all over the place. I think we would be best advised to deal with this by printing zeroes in the progress-report case, and by skipping the output altogether in printSimpleStats. (Or we could make printSimpleStats print zeroes; I'm indifferent to that choice.) Attached is an updated patch addressing these points. Comments? regards, tom lane diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index ceb2fcc..1d4814b 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -2612,7 +2612,7 @@ processXactStats(TState *thread, CState *st, instr_time *now, doLog(thread, st, agg, skipped, latency, lag); /* XXX could use a mutex here, but we choose not to */ - if (per_script_stats) + if (per_script_stats || latency_limit) accumStats(&sql_script[st->use_file].stats, skipped, latency, lag); } @@ -3664,12 +3664,14 @@ addScript(ParsedScript script) static void printSimpleStats(const char *prefix, SimpleStats *ss) { - /* print NaN if no transactions where executed */ - double latency = ss->sum / ss->count; - double stddev = sqrt(ss->sum2 / ss->count - latency * latency); + if (ss->count > 0) + { + double latency = ss->sum / ss->count; + double stddev = sqrt(ss->sum2 / ss->count - latency * latency); - printf("%s average = %.3f ms\n", prefix, 0.001 * latency); - printf("%s stddev = %.3f ms\n", prefix, 0.001 * stddev); + printf("%s average = %.3f ms\n", prefix, 0.001 * latency); + printf("%s stddev = %.3f ms\n", prefix, 0.001 * stddev); + } } /* print out results */ @@ -3680,11 +3682,14 @@ printResults(TState *threads, StatsData *total, instr_time total_time, double time_include, tps_include, tps_exclude; + int64 ntx = total->cnt - total->skipped; time_include = INSTR_TIME_GET_DOUBLE(total_time); - tps_include = total->cnt / time_include; - tps_exclude = total->cnt / (time_include - - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients)); + + /* tps is about actually executed transactions */ + tps_include = ntx / time_include; + tps_exclude = ntx / + (time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients)); /* Report test parameters. */ printf("transaction type: %s\n", @@ -3697,13 +3702,13 @@ printResults(TState *threads, StatsData *total, instr_time total_time, { printf("number of transactions per client: %d\n", nxacts); printf("number of transactions actually processed: " INT64_FORMAT "/%d\n", - total->cnt - total->skipped, nxacts * nclients); + ntx, nxacts * nclients); } else { printf("duration: %d s\n", duration); printf("number of transactions actually processed: " INT64_FORMAT "\n", - total->cnt); + ntx); } /* Remaining stats are nonsensical if we failed to execute any xacts */ @@ -3718,7 +3723,7 @@ printResults(TState *threads, StatsData *total, instr_time total_time, if (latency_limit) printf("number of transactions above the %.1f ms latency limit: %d (%.3f %%)\n", latency_limit / 1000.0, latency_late, - 100.0 * latency_late / total->cnt); + (ntx > 0) ? 100.0 * latency_late / ntx : 0.0); if (throttle_delay || progress || latency_limit) printSimpleStats("latency", &total->latency); @@ -3752,40 +3757,48 @@ printResults(TState *threads, StatsData *total, instr_time total_time, for (i = 0; i < num_scripts; i++) { if (num_scripts > 1) + { + StatsData *sstats = &sql_script[i].stats; + printf("SQL script %d: %s\n" " - weight: %d (targets %.1f%% of total)\n" " - " INT64_FORMAT " transactions (%.1f%% of total, tps = %f)\n", i + 1, sql_script[i].desc, sql_script[i].weight, 100.0 * sql_script[i].weight / total_weight, - sql_script[i].stats.cnt, - 100.0 * sql_script[i].stats.cnt / total->cnt, - sql_script[i].stats.cnt / time_include); - else - printf("script statistics:\n"); + sstats->cnt, + 100.0 * sstats->cnt / total->cnt, + (sstats->cnt - sstats->skipped) / time_include); - if (latency_limit) - printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n", - sql_script[i].stats.skipped, - 100.0 * sql_script[i].stats.skipped / sql_script[i].stats.cnt); + if (latency_limit && sstats->cnt > 0) + printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n", + sstats->skipped, + 100.0 * sstats->skipped / sstats->cnt); - if (num_scripts > 1) - printSimpleStats(" - latency", &sql_script[i].stats.latency); + printSimpleStats(" - latency", &sstats->latency); + } /* Report per-command latencies */ if (is_latencies) { Command **commands; - printf(" - statement latencies in milliseconds:\n"); + if (num_scripts > 1) + printf(" - statement latencies in milliseconds:\n"); + else + printf("statement latencies in milliseconds:\n"); for (commands = sql_script[i].commands; *commands != NULL; commands++) + { + SimpleStats *cstats = &(*commands)->stats; + printf(" %11.3f %s\n", - 1000.0 * (*commands)->stats.sum / - (*commands)->stats.count, + (cstats->count > 0) ? + 1000.0 * cstats->sum / cstats->count : 0.0, (*commands)->line); + } } } } @@ -4861,7 +4874,8 @@ threadRun(void *arg) { /* generate and show report */ StatsData cur; - int64 run = now - last_report; + int64 run = now - last_report, + ntx; double tps, total_run, latency, @@ -4876,7 +4890,7 @@ threadRun(void *arg) * XXX: No locking. There is no guarantee that we get an * atomic snapshot of the transaction count and latencies, so * these figures can well be off by a small amount. The - * progress is report's purpose is to give a quick overview of + * progress report's purpose is to give a quick overview of * how the test is going, so that shouldn't matter too much. * (If a read from a 64-bit integer is not atomic, you might * get a "torn" read and completely bogus latencies though!) @@ -4890,15 +4904,21 @@ threadRun(void *arg) cur.skipped += thread[i].stats.skipped; } + /* we count only actually executed transactions */ + ntx = (cur.cnt - cur.skipped) - (last.cnt - last.skipped); total_run = (now - thread_start) / 1000000.0; - tps = 1000000.0 * (cur.cnt - last.cnt) / run; - latency = 0.001 * (cur.latency.sum - last.latency.sum) / - (cur.cnt - last.cnt); - sqlat = 1.0 * (cur.latency.sum2 - last.latency.sum2) - / (cur.cnt - last.cnt); - stdev = 0.001 * sqrt(sqlat - 1000000.0 * latency * latency); - lag = 0.001 * (cur.lag.sum - last.lag.sum) / - (cur.cnt - last.cnt); + tps = 1000000.0 * ntx / run; + if (ntx > 0) + { + latency = 0.001 * (cur.latency.sum - last.latency.sum) / ntx; + sqlat = 1.0 * (cur.latency.sum2 - last.latency.sum2) / ntx; + stdev = 0.001 * sqrt(sqlat - 1000000.0 * latency * latency); + lag = 0.001 * (cur.lag.sum - last.lag.sum) / ntx; + } + else + { + latency = sqlat = stdev = lag = 0; + } if (progress_timestamp) { @@ -4915,7 +4935,10 @@ threadRun(void *arg) (long) tv.tv_sec, (long) (tv.tv_usec / 1000)); } else + { + /* round seconds are expected, but the thread may be late */ snprintf(tbuf, sizeof(tbuf), "%.1f s", total_run); + } fprintf(stderr, "progress: %s, %.1f tps, lat %.3f ms stddev %.3f",
pgsql-hackers by date: