Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors - Mailing list pgsql-hackers
From | Marina Polyakova |
---|---|
Subject | Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors |
Date | |
Msg-id | 976a25a46420c93866ada4051976b1be@postgrespro.ru Whole thread Raw |
In response to | Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors (Fabien COELHO <coelho@cri.ensmp.fr>) |
Responses |
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors |
List | pgsql-hackers |
On 11-07-2018 16:24, Fabien COELHO wrote: > Hello Marina, > >>> * -d/--debug: I'm not in favor in requiring a mandatory text argument >>> on this option. >> >> As you wrote in [1], adding an additional option is also a bad idea: > > Hey, I'm entitled to some internal contradictions:-) ... and discussions will be continue forever %-) >>> I'm sceptical of the "--debug-fails" options. ISTM that --debug is >>> already there and should just be reused. > > I was thinking that you could just use the existing --debug, not > change its syntax. My point was that --debug exists, and you could > just print > the messages when under --debug. Now I understand you better, thanks. I think it will be useful to receive only messages about failures, because they and progress reports can be lost in many other debug messages such as "client %d sending ..." / "client %d executing ..." / "client %d receiving". >> Maybe it's better to use an optional argument/arguments for >> compatibility (--debug[=fails] or --debug[=NUM])? But if we use the >> numbers, now I can see only 2 levels, and there's no guarantee that >> they will no change.. > > Optional arguments to options (!) are not really clean things, so I'd > like to avoid going onto this path, esp. as I cannot see any other > instance in pgbench or elsewhere in postgres, AFAICS they are used in pg_waldump (option --stats[=record]) and in psql (option --help[=topic]). > and I personnaly > consider these as a bad idea. > So if absolutely necessary, a new option is still better than changing > --debug syntax. If not necessary, then it is better:-) Ok! >>> * I'm reserved about the whole ereport thing, see comments in other >>> messages. >> >> Thank you, I'll try to implement the error reporting in the way you >> suggested. > > Dunno if it is a good idea either. The committer word is the good one > in the end:-à I agree with you that ereport has good reasons to be non-trivial in the backend and it does not have the same in pgbench.. >>> * doCustom changes. > >>> >>> On CSTATE_FAILURE, the next command is possibly started. Although >>> there is some consistency with the previous point, I think that it >>> totally breaks the state automaton where now a command can start >>> while the whole transaction is in failing state anyway. There was no >>> point in starting it in the first place. >>> >>> So, for me, the FAILURE state should record/count the failure, then >>> skip >>> to RETRY if a retry is decided, else proceed to ABORT. Nothing else. >>> This is much clearer that way. >>> >>> Then RETRY should reinstate the global state and proceed to start the >>> *first* command again. >>> <...> >>> >>> It is unclear to me why backslash command errors are turned to >>> FAILURE >>> instead of ABORTED: there is no way they are going to be retried, so >>> maybe they should/could skip directly to ABORTED? > >> So do you propose to execute the command "ROLLBACK" without >> calculating its latency etc. if we are in a failed transaction and >> clear the conditional stack after each failure? > >> Also just to be clear: do you want to have the state CSTATE_ABORTED >> for client abortion and another state for interrupting the current >> transaction? > > I do not understand what "interrupting the current transaction" means. > A transaction is either committed or rollbacked, I do not know about > "interrupted". I mean that IIUC the server usually only reports the error and you must manually send the command "END" or "ROLLBACK" to rollback a failed transaction. > When it is rollbacked, probably some stats will be > collected in passing, I'm fine with that. > > If there is an error in a pgbench script, the transaction is aborted, > which means for me that the script execution is stopped where it was, > and either it is restarted from the beginning (retry) or counted as > failure (not retry, just aborted, really). > > If by interrupted you mean that one script begins a transaction and > another ends it, as I said in the review I think that this strange > case should be forbidden, so that all the code and documentation > trying to > manage that can be removed. Ok! >>> The current RETRY state does memory allocations to generate a message >>> with buffer allocation and so on. This looks like a costly and >>> useless >>> operation. If the user required "retries", then this is normal >>> behavior, >>> the retries are counted and will be printed out in the final report, >>> and there is no point in printing out every single one of them. >>> Maybe you want that debugging, but then coslty operations should be >>> guarded. >> >> I think we need these debugging messages because, for example, > > Debugging message should cost only when under debug. When not under > debug, there should be no debugging message, and there should be no > cost for building and discarding such messages in the executed code > path beyond > testing whether program is under debug. > >> if you use the option --latency-limit, you we will never know in >> advance whether the serialization/deadlock failure will be retried or >> not. > > ISTM that it will be shown final report. If I want debug, I ask for > --debug, otherwise I think that the command should do what it was > asked for, i.e. run scripts, collect performance statistics and show > them at the end. > > In particular, when running with retries is enabled, the user is > expecting deadlock/serialization errors, so that they are not "errors" > as such for > them. > >> They also help to understand which limit of retries was violated or >> how close we were to these limits during the execution of a specific >> transaction. But I agree with you that they are costly and can be >> skipped if the failure type is never retried. Maybe it is better to >> split them into multiple error function calls?.. > > Debugging message costs should only be incurred when under --debug, > not otherwise. Ok! IIUC instead of this part of the code initPQExpBuffer(&errmsg_buf); printfPQExpBuffer(&errmsg_buf, "client %d repeats the failed transaction (try %d", st->id, st->retries + 1); if (max_tries) appendPQExpBuffer(&errmsg_buf, "/%d", max_tries); if (latency_limit) { appendPQExpBuffer(&errmsg_buf, ", %.3f%% of the maximum time of tries was used", getLatencyUsed(st, &now)); } appendPQExpBufferStr(&errmsg_buf, ")\n"); pgbench_error(DEBUG_FAIL, "%s", errmsg_buf.data); termPQExpBuffer(&errmsg_buf); can we try something like this? PGBENCH_ERROR_START(DEBUG_FAIL) { PGBENCH_ERROR("client %d repeats the failed transaction (try %d", st->id, st->retries + 1); if (max_tries) PGBENCH_ERROR("/%d", max_tries); if (latency_limit) { PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used", getLatencyUsed(st, &now)); } PGBENCH_ERROR(")\n"); } PGBENCH_ERROR_END(); >>> You have added 20-columns alignment prints. This looks like too much >>> and >>> generates much too large lines. Probably 10 (billion) would be >>> enough. >> >> I have already asked you about this in [2]: > > Probably:-) > >>> The variables for the numbers of failures and retries are of type >>> int64 >>> since the variable for the total number of transactions has the same >>> type. That's why such a large alignment (as I understand it now, >>> enough >>> 20 characters). Do you prefer floating alignemnts, depending on the >>> maximum number of failures/retries for any command in any script? > > An int64 counter is not likely to reach its limit anytime soon:-) If > the column display limit is ever reached, ISTM that then the text is > just misaligned, which is a minor and rare inconvenience. If very wide > columns are used, then it does not fit my terminal and the report text > will always be wrapped around, which makes it harder to read, every > time. Ok! >>> The latency limit to 900 ms try is a bad idea because it takes a lot >>> of time. I did such tests before and they were removed by Tom Lane >>> because of determinism and time issues. I would comment this test out >>> for now. >> >> Ok! If it doesn't bother you - can you tell more about the causes of >> these determinism issues?.. Tests for some other failures that cannot >> be retried are already added to 001_pgbench_with_server.pl. > > Some farm animals are very slow, so you cannot really assume much > about time one way or another. Thanks! >>> I do not understand why there is so much text about in failed sql >>> transaction stuff, while we are mainly interested in serialization & >>> deadlock errors, and this only falls in some "other" category. There >>> seems to be more details about other errors that about deadlocks & >>> serializable errors. >>> >>> The reporting should focus on what is of interest, either all errors, >>> or some detailed split of these errors. >>> >>> <...> >>> >>> * "errors_in_failed_tx" is some subcounter of "errors", for a special >>> case. Why it is there fails me [I finally understood, and I think it >>> should be removed, see end of review]. If we wanted to distinguish, >>> then we should distinguish homogeneously: maybe just count the >>> different error types, eg have things like "deadlock_errors", >>> "serializable_errors", "other_errors", "internal_pgbench_errors" >>> which >>> would be orthogonal one to the other, and "errors" could be >>> recomputed >>> from these. >> >> Thank you, I agree with you. Unfortunately each new error type adds a >> new 1 or 2 columns of maximum width 20 to the per-statement report > > The fact that some data are collected does not mean that they should > all be reported in detail. We can have detailed error count and report > the sum of this errors for instance, or have some more > verbose/detailed reports > as options (eg --latencies does just that). Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
pgsql-hackers by date: