Re: Suggestion to add --continue-client-on-abort option to pgbench - Mailing list pgsql-hackers

From Yugo Nagata
Subject Re: Suggestion to add --continue-client-on-abort option to pgbench
Date
Msg-id 20250705000356.9637424fa2343ee6c2d08957@sraoss.co.jp
Whole thread Raw
List pgsql-hackers
Hi,

On Tue, 1 Jul 2025 17:43:18 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> I've updated the previous patch based on your feedback. Below is a summary of 
> the changes from v4 to v5:

Thank you for updating the patch.

> On 2025/06/14 0:24, Yugo Nagata wrote:
> >              case PGRES_NONFATAL_ERROR:
> >              case PGRES_FATAL_ERROR:
> >                  st->estatus = getSQLErrorStatus(PQresultErrorField(res,
> >                                                                     PG_DIAG_SQLSTATE));
> >                  if (canRetryError(st->estatus))
> >                  {
> >                      if (verbose_errors)
> >                          commandError(st, PQerrorMessage(st->con));
> >                      goto error;
> >                  }
> >                  /* fall through */
> > 
> >              default:
> >                  /* anything else is unexpected */
> >                  pg_log_error("client %d script %d aborted in command %d query %d: %s",
> >                               st->id, st->use_file, st->command, qrynum,
> >                               PQerrorMessage(st->con));
> >                  goto error;
> >          }
> > 
> > When an SQL error other than a serialization or deadlock error occurs, an error message is
> > output via pg_log_error in this code path. However, I think this should be reported only
> > when verbose_errors is set, similar to how serialization and deadlock errors are handled when
> > --continue-on-error is enabled
> 
> I think the error message logged via pg_log_error is useful when verbose_errors 
> is not specified, because it informs users that the client has exited. Without 
> it, users may not notice that something went wrong.

However, if a large number of errors occur, this could result in a significant increase
in stderr output during the benchmark.

Users can still notice that something went wrong by checking the “number of other failures”
reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
users aren’t particularly interested in seeing individual error messages as they happen.

It’s true that seeing error messages during the benchmark might be useful in some cases, but
the same could be said for serialization or deadlock errors, and that’s exactly what the
--verbose-errors option is for.


Here are some comments on the patch.

(1)

                }
-               else if (canRetryError(st->estatus))
+               else if (continue_on_error | canRetryError(st->estatus))
                    st->state = CSTATE_ERROR;
                else
                    st->state = CSTATE_ABORTED;

Due to this change, when --continue-on-error is enabled, st->state is set to
CSTATE_ERROR regardless of the type of error returned by readCommandResponse.
When the error is not ESTATUS_OTHER_SQL_ERROR, e.g. ESTATUS_META_COMMAND_ERROR
due to a failure of \gset with query returning more the one row.

Therefore, this should be like:

               else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR  && continue_on_error) ||
                         canRetryError(st->estatus))

(2)

+          "  --continue-on-error      continue processing transations after a trasaction fails\n"

"trasaction" is a typo and including "transaction" twice looks a bit redundant.
Instead using the word "transaction", how about:

 "--continue-on-error continue running after an SQL error" ?

This version is shorter, avoids repetition, and describes well the actual behavior when
SQL statements fail.



As for the comments:

(3)

-    * A failed transaction is defined as unsuccessfully retried transactions.
+    * A failed transaction is defined as unsuccessfully retried transactions
+    * unless continue-on-error option is specified.
     * It can be one of two types:
     *
     * failed (the number of failed transactions) =
@@ -411,6 +412,12 @@ typedef struct StatsData
     *   'deadlock_failures' (they got a deadlock error and were not
     *                        successfully retried).
     *
+    * When continue-on-error option is specified,
+    * failed (the number of failed transactions) =
+    *   'serialization_failures' + 'deadlock_failures' +
+    *   'other_sql_failures' (they got a error when continue-on-error option
+    *                         was specified).
+    *

To explain explicitly that there are two definitions of failed transactions
depending on the situation, how about:

"""
 A failed transaction is counted differently depending on whether
 the --continue-on-error option is specified.

 Without --continue-on-error:
 
 failed (the number of failed transactions) =
  'serialization_failures' (they got a serialization error and were not
                            successfully retried) +
  'deadlock_failures' (they got a deadlock error and were not
                       successfully retried).

 When --continue-on-error is specified:

 failed (number of failed transactions) =
   'serialization_failures' + 'deadlock_failures' +
   'other_sql_failures'  (they got some other SQL error; the transaction was
                          not retried and counted as failed due to
                          --continue-on-error).
"""

(4)
+   int64       other_sql_failures; /* number of failed transactions for
+                                    * reasons other than
+                                    * serialization/deadlock failure , which
+                                    * is enabled if --continue-on-error is
+                                    * used */

Is "counted" is more proper than "enabled" here?


Af for the documentations:
(5)
   The next line reports the number of failed transactions due to
-  serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
-  for more information).
+  serialization or deadlock errors by default (see
+  <xref linkend="failures-and-retries"/> for more information).

Would it be more readable to simply say:
"The next line reports the number of failed transactions (see ... for more information),
since definition of "failed transaction" has become a bit messy?

(6)
    connection with the database server was lost or the end of script was reached
    without completing the last transaction. In addition, if execution of an SQL
    or meta command fails for reasons other than serialization or deadlock errors,
-   the client is aborted. Otherwise, if an SQL command fails with serialization or
-   deadlock errors, the client is not aborted. In such cases, the current
-   transaction is rolled back, which also includes setting the client variables
-   as they were before the run of this transaction (it is assumed that one
-   transaction script contains only one transaction; see
-   <xref linkend="transactions-and-scripts"/> for more information).
+   the client is aborted by default. However, if the --continue-on-error option
+   is specified, the client does not abort and proceeds to the next transaction
+   regardless of the error. This case is reported as other failures in the output.
+   Otherwise, if an SQL command fails with serialization or deadlock errors, the
+   client is not aborted. In such cases, the current transaction is rolled back,
+   which also includes setting the client variables as they were before the run
+   of this transaction (it is assumed that one transaction script contains only 
+   one transaction; see <xref linkend="transactions-and-scripts"/> for more information).

To emphasize the default behavior, I wonder it would be better to move "by default"
to the beginning of the statements; like

 "By default, if execution of an SQL or meta command fails for reasons other than
 serialization or deadlock errors, the client is aborted."

How about quoting "other failures"? like:

 "These cases are reported as "other failures" in the output."

Also, I feel the meaning of "Otherwise" has becomes somewhat unclear since the
explanation of --continue-on-error was added between the sentences So, how about
clarifying that "the clients are not aborted due to serializable/deadlock even without
--continue-on-error".  For example;

 "On contrast, if an SQL command fails with serialization or deadlock errors, the
  client is not aborted even without  <option>--continue-on-error</option>. 
  Instead, the current transaction is rolled back, which also includes setting
  the client variables as they were before the run of this transaction
  (it is assumed that one transaction script contains only 
   one transaction; see <xref linkend="transactions-and-scripts"/> for more information)."

(7)
    The main report contains the number of failed transactions. If the
-   <option>--max-tries</option> option is not equal to 1, the main report also
+   <option>--max-tries</option> option is not equal to 1 and 
+   <option>--continue-on-error</option> is not specified, the main report also
    contains statistics related to retries: the total number of retried

Is that true?
The retreis statitics would be included even without --continue-on-error.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Assertion failure during initdb with transaction_timeout set
Next
From: Tom Lane
Date:
Subject: Re: cpluspluscheck vs ICU again