Thread: Re: Suggestion to add --continue-client-on-abort option to pgbench

Re: Suggestion to add --continue-client-on-abort option to pgbench

From
Stepan Neretin
Date:


On Sat, May 10, 2025 at 8:45 PM ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:
Hi hackers,

I would like to suggest adding a new option to pgbench, which enables
the client to continue processing transactions even if some errors occur
during a transaction.
Currently, a client stops sending requests when its transaction is
aborted due to reasons other than serialization failures or deadlocks. I
think in some cases, especially when using custom scripts, the client
should be able to rollback the failed transaction and start a new one.

For example, my custom script (insert_to_unique_column.sql) follows:
```
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
```
Assume we need to continuously apply load to the server using 5 clients
for a certain period of time. However, a client sometimes stops when its
transaction in my custom script is aborted due to a check constraint
violation. As a result, the load on the server is lower than expected,
which is the problem I want to address.

The proposed new option solves this problem. When
--continue-client-on-abort is set to true, the client rolls back the
failed transaction and starts a new one. This allows all 5 clients to
continuously apply load to the server, even if some transactions fail.

```
% bin/pgbench -d postgres -f ../insert_to_unique_column.sql -T 10
--failures-detailed --continue-client-on-error
transaction type: ../custom_script_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 33552
number of failed transactions: 21901 (39.495%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 21901 (39.495%)
latency average = 0.180 ms (including failures)
initial connection time = 2.857 ms
tps = 3356.092385 (without initial connection time)
```

I have attached the patch. I would appreciate your feedback.

Best regards,

Rintaro Ikeda
NTT DATA Corporation Japan

Hi Rintaro,

Thanks for the patch and explanation. I understand your goal is to ensure that pgbench clients continue running even when some transactions fail due to application-level errors (e.g., constraint violations), especially when running custom scripts.

However, I wonder if the intended behavior can't already be achieved using standard SQL constructs — specifically ON CONFLICT or careful transaction structure. For example, your sample script:

CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));

can be rewritten as:

\setrandom val 0 50000
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;

This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.

So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?

Best regards,

Stepan Neretin 

RE: Suggestion to add --continue-client-on-abort option to pgbench

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Ikeda-san,

Thanks for starting the new thread! I have never known the issue before I heard at
PGConf.dev.

Few comments:

1.
This parameter seems a type of benchmark option. So should we set
benchmarking_option_set as well?

2.
Not sure, but exit-on-abort seems a similar option. What if both are specified?
Is it allowed?

3.
Can you consider a test case for the new parameter?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Suggestion to add --continue-client-on-abort option to pgbench

From
Rintaro Ikeda
Date:
Dear Kuroda-san, hackers,


On 2025/06/04 21:57, Hayato Kuroda (Fujitsu) wrote:
> Dear Ikeda-san,
> 
> Thanks for starting the new thread! I have never known the issue before I heard at
> PGConf.dev.
> 
> Few comments:
> 
> 1.
> This parameter seems a type of benchmark option. So should we set
> benchmarking_option_set as well?
> 
> 2.
> Not sure, but exit-on-abort seems a similar option. What if both are specified?
> Is it allowed?
> 
> 3.
> Can you consider a test case for the new parameter?
> 
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED
> 
> 

Thank you for your valuable comment!

1. I should've also set benchmarking_option_set. I've modified it accordingly.

2. The exit-on-abort option and continue-on-error option are mutually exclusive. 
Therefore, I've updated the patch to throw a FATAL error when two options are 
set simultaneously. Corresponding explanation was also added.
(I'm wondering the name of parameter should be continue-on-abort so that users 
understand the two option are mutually exclusive.)

3. I've added the test.

Additionally, I modified the patch so that st->state does not transition to 
CSTATE_RETRY when a transaction fails and continue-on-error option is enabled. 
In the previous patch, we retry the failed transaction up to max-try times, 
which is unnecessary for our purpose: clients does not exit when its 
transactions fail.

I've attached the updated patch. 
v3-0001-Add-continue-on-error-option-to-pgbench.patch is identical to 
v4-0001-Add-continue-on-error-option-to-pgbench.patch. The v4-0002 patch is the 
diff from the previous patch.

Best regards,
Rintaro Ikeda
Attachment

RE: Suggestion to add --continue-client-on-abort option to pgbench

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Ikeda-san,

Thanks for updating the patch!

> 1. I should've also set benchmarking_option_set. I've modified it accordingly.

Confirmed it has been fixed. Thanks.

> 2. The exit-on-abort option and continue-on-error option are mutually exclusive. 
> Therefore, I've updated the patch to throw a FATAL error when two options are 
> set simultaneously. Corresponding explanation was also added.
> (I'm wondering the name of parameter should be continue-on-abort so that users
> understand the two option are mutually exclusive.)

Make sense, +1.

Here are new comments.

01. build failure
According to the cfbot [1], the documentation cannot be built. IIUC </para> seemed
to be missed here:
```
+       <para>
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
+      </listitem>
+     </varlistentry>
```

02. patch separation
How about separating the patch series like:

0001 - contains option handling and retry part, and documentation
0002 - contains accumulation/reporting part
0003 - contains tests.

I hope above style is more helpful for reviewers.

03. documentation
```
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
```

I feel we should add the similar description in `exit-on-abort` part.

04. documentation
```
+        Client rolls back the failed transaction and starts a new one when its
+        transaction fails due to the reason other than the deadlock and
+        serialization failure. This allows all clients specified with -c option
+        to continuously apply load to the server, even if some transactions fail.
```

I feel the description contains bit redundant part and misses the default behavior.
How about:
```
       <para>
        Clients survive when their transactions are aborted, and they continue
        their run. Without the option, clients exit when transactions they run
        are aborted.
       </para>
       <para>
        Note that serialization failures or deadlock failures do not abort the
        client, so they are not affected by this option.
        See <xref linkend="failures-and-retries"/> for more information.
       </para>
```

05. StatsData
```
+        * When continue-on-error option is specified,
+        * failed (the number of failed transactions) =
+        *   'other_sql_failures' (they got a error when continue-on-error option
+        *                                                 was specified).
```

Let me confirm one point; can serialization_failures and deadlock_failures be
counted when continue-on-error is true? If so, the comment seems not correct for me.
The formula can be 'serialization_failures' + 'deadlock_failures' + 'deadlock_failures'
in the case.

06. StatsData
Another point; can other_sql_failures be counted when the continue-on-error is NOT
specified? I feel it should be...

06. usage()
Added line is too long. According to program_help_ok(), the output by help should
be less than 80.

07.
Please run pgindent/pgperltidy, I got some diffs.

[1]: https://cirrus-ci.com/task/5210061275922432

Best regards,
Hayato Kuroda
FUJITSU LIMITED


On Mon, 9 Jun 2025 09:34:03 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> > 2. The exit-on-abort option and continue-on-error option are mutually exclusive. 
> > Therefore, I've updated the patch to throw a FATAL error when two options are 
> > set simultaneously. Corresponding explanation was also added.

I don't think that's right since "abort" and "error" are different concept in pgbench.
(Here, "abort" refers to the termination of a client, not a transaction abort.)

The --exit-on-abort option forces to exit pgbench immediately when any client is aborted
due to some error. When the --continue-on-error option is not set, SQL errors other than
deadlock or serialization error cause a client to be aborted. On the other hand, when the option
is set, clients are not aborted due to any SQL errors; instead they continue to run after them.
However, clients can still be aborted for other reasons, such as connection failures or
meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains
useful even when --continue-on-error is enabled. 

> > (I'm wondering the name of parameter should be continue-on-abort so that users
> > understand the two option are mutually exclusive.)

For the same reason as above, I believe --continue-on-error is a more accurate description
of the option's behavior.
 
> 02. patch separation
> How about separating the patch series like:
> 
> 0001 - contains option handling and retry part, and documentation
> 0002 - contains accumulation/reporting part
> 0003 - contains tests.
> 
> I hope above style is more helpful for reviewers.

I'm not sure whether it's necessary to split the patch, as the change doesn't seem very
complex. However, the current separation appears inconsistent. For example, patch 0001
modifies canRetryError(), but patch 0002 reverts that change, and so on.

> 
> 04. documentation
> ```
> +        Client rolls back the failed transaction and starts a new one when its
> +        transaction fails due to the reason other than the deadlock and
> +        serialization failure. This allows all clients specified with -c option
> +        to continuously apply load to the server, even if some transactions fail.
> ```
> 
> I feel the description contains bit redundant part and misses the default behavior.
> How about:
> ```
>        <para>
>         Clients survive when their transactions are aborted, and they continue
>         their run. Without the option, clients exit when transactions they run
>         are aborted.
>        </para>
>        <para>
>         Note that serialization failures or deadlock failures do not abort the
>         client, so they are not affected by this option.
>         See <xref linkend="failures-and-retries"/> for more information.
>        </para>
> ```

I think we can make it clearer as follows:

 Allows clients to continue their run even if an SQL statement fails due to errors other
 than serialization or deadlock. Without this option, the client is aborted after
 such errors.

 Note that serialization and deadlock failures never cause the client to be aborted,
 so they are not affected by this option. See <xref linkend="failures-and-retries"/>
 for more information.

That said, a review by a native English speaker would still be appreciated.

Also, we would need to update several parts of the documentation. For example, the 
"Failures and Serialization/Deadlock Retries" section should be revised to describe the
behavior change. In addition, we should update the explanations of output result examples
and logging, the description of the --failures-detailed option, and so on.

If transactions are not retried after SQL errors other than serialization or deadlock,
this should also be explicitly documented.


> 05. StatsData
> ```
> +        * When continue-on-error option is specified,
> +        * failed (the number of failed transactions) =
> +        *   'other_sql_failures' (they got a error when continue-on-error option
> +        *                                                 was specified).
> ```
> 
> Let me confirm one point; can serialization_failures and deadlock_failures be
> counted when continue-on-error is true? If so, the comment seems not correct for me.
> The formula can be 'serialization_failures' + 'deadlock_failures' + 'deadlock_failures'
> in the case.

+1

> 06. StatsData
> Another point; can other_sql_failures be counted when the continue-on-error is NOT
> specified? I feel it should be...

We could do that. However, if an SQL error other than a serialization or deadlock error
occurs when --continue-on-error is not set, pgbench will be aborted midway and the printed
results will be incomplete. Therefore, this might not make much sense.

> 06. usage()
> Added line is too long. According to program_help_ok(), the output by help should
> be less than 80.

+1


Here are additional comments from me.

@@ -4548,6 +4570,8 @@ getResultString(bool skipped, EStatus estatus)
                return "serialization";
            case ESTATUS_DEADLOCK_ERROR:
                return "deadlock";
+           case ESTATUS_OTHER_SQL_ERROR:
+               return "error (except serialization/deadlock)";

Strings returned by getResultString() are printed in the "time" field of the
log when both the -l and --failures-detailed options are set. Therefore, they
should be single words that do not contain any space characters. I wonder if
something like "other" or "other_sql_error" would be appropriate.


@@ -4099,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
                         * can retry the error.
                         */
                        st->state = timer_exceeded ? CSTATE_FINISHED :
+                           continue_on_error ? CSTATE_FAILURE :
                            doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;
                    }
                    else

This fix is not necessary because doRetry() (and canRetryError(), which is called
within it) will return false when continue_on_error is set (after applying patch 0002).


            case PGRES_NONFATAL_ERROR:
            case PGRES_FATAL_ERROR:
                st->estatus = getSQLErrorStatus(PQresultErrorField(res,
                                                                   PG_DIAG_SQLSTATE));
                if (canRetryError(st->estatus))
                {
                    if (verbose_errors)
                        commandError(st, PQerrorMessage(st->con));
                    goto error;
                }
                /* fall through */

            default:    
                /* anything else is unexpected */
                pg_log_error("client %d script %d aborted in command %d query %d: %s",
                             st->id, st->use_file, st->command, qrynum,
                             PQerrorMessage(st->con));
                goto error;
        }

When an SQL error other than a serialization or deadlock error occurs, an error message is
output via pg_log_error in this code path. However, I think this should be reported only
when verbose_errors is set, similar to how serialization and deadlock errors are handled when
--continue-on-error is enabled


Best regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



RE: Suggestion to add --continue-client-on-abort option to pgbench

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Nagata-san,

> > > 2. The exit-on-abort option and continue-on-error option are mutually
> exclusive.
> > > Therefore, I've updated the patch to throw a FATAL error when two options
> are
> > > set simultaneously. Corresponding explanation was also added.
>
> I don't think that's right since "abort" and "error" are different concept in pgbench.
> (Here, "abort" refers to the termination of a client, not a transaction abort.)
>
> The --exit-on-abort option forces to exit pgbench immediately when any client is
> aborted
> due to some error. When the --continue-on-error option is not set, SQL errors
> other than
> deadlock or serialization error cause a client to be aborted. On the other hand,
> when the option
> is set, clients are not aborted due to any SQL errors; instead they continue to run
> after them.
> However, clients can still be aborted for other reasons, such as connection
> failures or
> meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option
> remains
> useful even when --continue-on-error is enabled.

To clarify: another approach is that allow --continue-on-error option to continue
running even when clients meet such errors. Which one is better?

> > 02. patch separation
> > How about separating the patch series like:
> >
> > 0001 - contains option handling and retry part, and documentation
> > 0002 - contains accumulation/reporting part
> > 0003 - contains tests.
> >
> > I hope above style is more helpful for reviewers.
>
> I'm not sure whether it's necessary to split the patch, as the change doesn't seem
> very
> complex. However, the current separation appears inconsistent. For example,
> patch 0001
> modifies canRetryError(), but patch 0002 reverts that change, and so on.

Either way is fine for me if they are changed from the current method.

>
> >
> > 04. documentation
> > ```
> > +        Client rolls back the failed transaction and starts a new one when its
> > +        transaction fails due to the reason other than the deadlock and
> > +        serialization failure. This allows all clients specified with -c option
> > +        to continuously apply load to the server, even if some transactions
> fail.
> > ```
> >
> > I feel the description contains bit redundant part and misses the default
> behavior.
> > How about:
> > ```
> >        <para>
> >         Clients survive when their transactions are aborted, and they continue
> >         their run. Without the option, clients exit when transactions they run
> >         are aborted.
> >        </para>
> >        <para>
> >         Note that serialization failures or deadlock failures do not abort the
> >         client, so they are not affected by this option.
> >         See <xref linkend="failures-and-retries"/> for more information.
> >        </para>
> > ```
>
> I think we can make it clearer as follows:

I do not have confident for English, native speaker is needed....

> > 06. usage()
> > Added line is too long. According to program_help_ok(), the output by help
> should
> > be less than 80.
>
> +1

FYI - I posted a patch which adds the test. You can apply and confirm how the function says.

[1]:
https://www.postgresql.org/message-id/OSCPR01MB1496610451F5896375B2562E6F56BA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: Suggestion to add --continue-client-on-abort option to pgbench

From
ikedarintarof
Date:
Hi,

Thank you very much for your valuable comments and kind advice. I'm 
currently working on revising the previous patch based on the feedback 
received. I would like to share my thoughts regarding the conditions 
under which the --continue-on-error option should initiate a new 
transaction or a new connection.

In my opinion, when the --continue-on-error option is enabled, pgbench 
clients does not need to start new transactions after network errors and 
other errors except for SQL-level errors.

Network errors are relatively rare, except in failover scenarios. 
Outside of failover, any network issues should be resolved rather than 
worked around. In the context of failover, the key metric is not TPS, 
but system downtime. While one might infer the timing of a failover by 
observing by using --progress option, you can easily determine the 
downtime by executing simple SQL query such as `psql -c 'SELECT 1` every 
second.


On 2025/06/26 18:47, Yugo Nagata wrote:
> On Thu, 26 Jun 2025 05:45:12 +0000
> "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
> 
>> Dear Nagata-san,
>> 
>>> As I understand it, the current patch aims to allow continuation only 
>>> after
>>> SQL-level
>>> errors, such as constraint violations. That seems reasonable, as it 
>>> can simulate
>>> the
>>> behavior of applications that ignore or retry such errors (even 
>>> though retries are
>>> not
>>> implemented in the current patch).
>> 
>> Yes, no one has objections to retry in this case. This is a main part 
>> of the proposal.,
> 
> As I understand it, the proposed --continue-on-error option does not 
> retry the transaction
> in any case; it simply gives up on the transaction. That is, when an 
> SQL-level error occurs,
> the transaction is reported as "failed" rather than "retried", and the 
> random state is discarded.

Retrying the failed transaction is not necessary when the transaction 
failed due to SQL-level errors. Unlike real-world applications, pgbench 
does not need to complete specific transaction successfully. In the case 
of unique constraint violations, retrying the same transaction will 
likely to result in the same error again.

I want to hear your thoughts on this.

Best regards,
Rintaro Ikeda



Re: Suggestion to add --continue-client-on-abort option to pgbench

From
Rintaro Ikeda
Date:
Hi,

Thank you for the kind comments.

I've updated the previous patch.

Below is a summary of the changes:
1. The code path and documentation have been corrected based on your feedback.
2. The following message is now suppressed by default. Instead, an error message
is added when a client aborts during SQL execution. (v6-0003-Suppress-xxx.patch)

```
                if (verbose_errors)
                    pg_log_error("client %d script %d aborted in command %d query %d: %s",
                                 st->id, st->use_file, st->command, qrynum,
                                 PQerrorMessage(st->con));
```


On 2025/07/04 22:01, Hayato Kuroda (Fujitsu) wrote:

>>> Could I confirm what you mean by "start new one"?
>>>
>>> In the current pgbench, when a query raises an error (a deadlock or
>>> serialization failure), it can be retried using the same random state.
>>> This typically means the query will be retried with the same parameter values.
>>>
>>> On the other hand, when the query ultimately fails (possibly after some retries),
>>> the transaction is marked as a "failure", and the next transaction starts with a
>>> new random state (i.e., with new parameter values).
>>>
>>> Therefore, if a query fails due to a unique constraint violation and is retried
>>> with the same parameters, it will keep failing on each retry.
>>
>> Thank you for your explanation. I understand it as you described. I've also
>> attached a schematic diagram of the state machine. I hope it will help clarify
>> the behavior of pgbench. Red arrows represent the transition of state when SQL
>> command fails and --continue-on-error option is specified.
>
> Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion.
>
> The terminology "retry" is being used for the transition CSTATE_ERROR->CSTATE_RETRY,
> and here the random state would be restored to be the begining:
>
> ```
>                 /*
>                  * Reset the random state as they were at the beginning of the
>                  * transaction.
>                  */
>                 st->cs_func_rs = st->random_state;
> ```
>
> In --continue-on-error case, the transaction CSTATE_WAIT_RESULT->CSTATE_ERROR
> can happen even the reason of failure is not serialization and deadlock.
> Ultimately the pass will reach ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT, the
> beginning of the state machine. cs_func_rs is not overwritten in the route so
> that different random value would be generated, or even another script may be
> chosen. Is it correct?

Yes, I believe that’s correct.

>
> 01.
> ```
> $ git am ../patches/pgbench/v5-0001-Add-continue-on-error-opt
> ion-to-pgbench.patch
> Applying: When the option is set, client rolls back the failed transaction and...
> .git/rebase-apply/patch:65: trailing whitespace.
>    <literal>serialization</literal>, <literal>deadlock</literal>, or
> .git/rebase-apply/patch:139: trailing whitespace.
>    <option>--max-tries</option> option is not equal to 1 and
> warning: 2 lines add whitespace errors.
> ```
>
> I got warnings when I applied the patch. Please fix it.

It's been fixed.

>
> 02.
> ```
> +        *       'serialization_failures' + 'deadlock_failures' +
> +        *   'other_sql_failures' (they got a error when continue-on-error option
> ```
> The first line has the tab, but it should be normal blank.

I hadn't noticed it. It's fixed.


> 03.
> ```
> +                               else if (continue_on_error | canRetryError(st->estatus))
> ```
>
> I feel "|" should be "||".

Thank you for pointing out. Fixed it.

> 04.
> ```
>      <term><replaceable>retries</replaceable></term>
>      <listitem>
>       <para>
>        number of retries after serialization or deadlock errors
>        (zero unless <option>--max-tries</option> is not equal to one)
>       </para>
>      </listitem>
> ```
>
> To confirm; --continue-on-error won't be counted here because it is not "retry",
> in other words, it does not reach CSTATE_RETRY, right?

Yes. I agree with Nagata-san [1] — --continue-on-error is not considered a
"retry" because it doesn't reach CSTATE_RETRY.


On 2025/07/05 0:03, Yugo Nagata wrote:
>>>              case PGRES_NONFATAL_ERROR:
>>>              case PGRES_FATAL_ERROR:
>>>                  st->estatus = getSQLErrorStatus(PQresultErrorField(res,
>>>                                                                     PG_DIAG_SQLSTATE));
>>>                  if (canRetryError(st->estatus))
>>>                  {
>>>                      if (verbose_errors)
>>>                          commandError(st, PQerrorMessage(st->con));
>>>                      goto error;
>>>                  }
>>>                  /* fall through */
>>>
>>>              default:
>>>                  /* anything else is unexpected */
>>>                  pg_log_error("client %d script %d aborted in command %d query %d: %s",
>>>                               st->id, st->use_file, st->command, qrynum,
>>>                               PQerrorMessage(st->con));
>>>                  goto error;
>>>          }
>>>
>>> When an SQL error other than a serialization or deadlock error occurs, an error message is
>>> output via pg_log_error in this code path. However, I think this should be reported only
>>> when verbose_errors is set, similar to how serialization and deadlock errors are handled when
>>> --continue-on-error is enabled
>>
>> I think the error message logged via pg_log_error is useful when verbose_errors
>> is not specified, because it informs users that the client has exited. Without
>> it, users may not notice that something went wrong.
>
> However, if a large number of errors occur, this could result in a significant increase
> in stderr output during the benchmark.
>
> Users can still notice that something went wrong by checking the “number of other failures”
> reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
> users aren’t particularly interested in seeing individual error messages as they happen.
>
> It’s true that seeing error messages during the benchmark might be useful in some cases, but
> the same could be said for serialization or deadlock errors, and that’s exactly what the
> --verbose-errors option is for.


I understand your concern. The condition for calling pg_log_error() was modified
to reduce stderr output.
Additionally, an error message was added for cases where some clients aborted
while executing SQL commands, similar to other code paths that transition to
st->state = CSTATE_ABORTED, as shown in the example below:

```
                        pg_log_error("client %d aborted while establishing connection", st->id);
                        st->state = CSTATE_ABORTED;
```


> Here are some comments on the patch.
>
> (1)
>
>                 }
> -               else if (canRetryError(st->estatus))
> +               else if (continue_on_error | canRetryError(st->estatus))
>                     st->state = CSTATE_ERROR;
>                 else
>                     st->state = CSTATE_ABORTED;
>
> Due to this change, when --continue-on-error is enabled, st->state is set to
> CSTATE_ERROR regardless of the type of error returned by readCommandResponse.
> When the error is not ESTATUS_OTHER_SQL_ERROR, e.g. ESTATUS_META_COMMAND_ERROR
> due to a failure of \gset with query returning more the one row.
>
> Therefore, this should be like:
>
>                else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR  && continue_on_error) ||
>                          canRetryError(st->estatus))
>

Thanks for pointing that out — I’ve corrected it.


> (2)
>
> +          "  --continue-on-error      continue processing transations after a trasaction fails\n"
>
> "trasaction" is a typo and including "transaction" twice looks a bit redundant.
> Instead using the word "transaction", how about:
>
>  "--continue-on-error continue running after an SQL error" ?
>
> This version is shorter, avoids repetition, and describes well the actual behavior when
> SQL statements fail.

Fixed it.

> (3)
>
> -    * A failed transaction is defined as unsuccessfully retried transactions.
> +    * A failed transaction is defined as unsuccessfully retried transactions
> +    * unless continue-on-error option is specified.
>      * It can be one of two types:
>      *
>      * failed (the number of failed transactions) =
> @@ -411,6 +412,12 @@ typedef struct StatsData
>      *   'deadlock_failures' (they got a deadlock error and were not
>      *                        successfully retried).
>      *
> +    * When continue-on-error option is specified,
> +    * failed (the number of failed transactions) =
> +    *   'serialization_failures' + 'deadlock_failures' +
> +    *   'other_sql_failures' (they got a error when continue-on-error option
> +    *                         was specified).
> +    *
>
> To explain explicitly that there are two definitions of failed transactions
> depending on the situation, how about:
>
> """
>  A failed transaction is counted differently depending on whether
>  the --continue-on-error option is specified.
>
>  Without --continue-on-error:
>
>  failed (the number of failed transactions) =
>   'serialization_failures' (they got a serialization error and were not
>                             successfully retried) +
>   'deadlock_failures' (they got a deadlock error and were not
>                        successfully retried).
>
>  When --continue-on-error is specified:
>
>  failed (number of failed transactions) =
>    'serialization_failures' + 'deadlock_failures' +
>    'other_sql_failures'  (they got some other SQL error; the transaction was
>                           not retried and counted as failed due to
>                           --continue-on-error).
> """

Thank you for your suggestion. I modified it accordingly.


> (4)
> +   int64       other_sql_failures; /* number of failed transactions for
> +                                    * reasons other than
> +                                    * serialization/deadlock failure , which
> +                                    * is enabled if --continue-on-error is
> +                                    * used */
>
> Is "counted" is more proper than "enabled" here?

Fixed.


>
> Af for the documentations:
> (5)
>    The next line reports the number of failed transactions due to
> -  serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
> -  for more information).
> +  serialization or deadlock errors by default (see
> +  <xref linkend="failures-and-retries"/> for more information).
>
> Would it be more readable to simply say:
> "The next line reports the number of failed transactions (see ... for more information),
> since definition of "failed transaction" has become a bit messy?
>

I fixed it to the simple explanation.

> (6)
>     connection with the database server was lost or the end of script was reached
>     without completing the last transaction. In addition, if execution of an SQL
>     or meta command fails for reasons other than serialization or deadlock errors,
> -   the client is aborted. Otherwise, if an SQL command fails with serialization or
> -   deadlock errors, the client is not aborted. In such cases, the current
> -   transaction is rolled back, which also includes setting the client variables
> -   as they were before the run of this transaction (it is assumed that one
> -   transaction script contains only one transaction; see
> -   <xref linkend="transactions-and-scripts"/> for more information).
> +   the client is aborted by default. However, if the --continue-on-error option
> +   is specified, the client does not abort and proceeds to the next transaction
> +   regardless of the error. This case is reported as other failures in the output.
> +   Otherwise, if an SQL command fails with serialization or deadlock errors, the
> +   client is not aborted. In such cases, the current transaction is rolled back,
> +   which also includes setting the client variables as they were before the run
> +   of this transaction (it is assumed that one transaction script contains only
> +   one transaction; see <xref linkend="transactions-and-scripts"/> for more information).
>
> To emphasize the default behavior, I wonder it would be better to move "by default"
> to the beginning of the statements; like
>
>  "By default, if execution of an SQL or meta command fails for reasons other than
>  serialization or deadlock errors, the client is aborted."
>
> How about quoting "other failures"? like:
>
>  "These cases are reported as "other failures" in the output."
>
> Also, I feel the meaning of "Otherwise" has becomes somewhat unclear since the
> explanation of --continue-on-error was added between the sentences So, how about
> clarifying that "the clients are not aborted due to serializable/deadlock even without
> --continue-on-error".  For example;
>
>  "On contrast, if an SQL command fails with serialization or deadlock errors, the
>   client is not aborted even without  <option>--continue-on-error</option>.
>   Instead, the current transaction is rolled back, which also includes setting
>   the client variables as they were before the run of this transaction
>   (it is assumed that one transaction script contains only
>    one transaction; see <xref linkend="transactions-and-scripts"/> for more information)."
>

I've modified according to your suggestion.

> (7)
>     The main report contains the number of failed transactions. If the
> -   <option>--max-tries</option> option is not equal to 1, the main report also
> +   <option>--max-tries</option> option is not equal to 1 and
> +   <option>--continue-on-error</option> is not specified, the main report also
>     contains statistics related to retries: the total number of retried
>
> Is that true?
> The retreis statitics would be included even without --continue-on-error.

That was wrong. I corrected it.


[1]
https://www.postgresql.org/message-id/20250705002239.27e6e5a4ba22c047ac2fa16a%40sraoss.co.jp

Regards,
Rintaro Ikeda


Attachment
On Wed, 9 Jul 2025 23:58:32 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> Hi,
> 
> Thank you for the kind comments.
> 
> I've updated the previous patch.

Thank you for updating the patch!

> > However, if a large number of errors occur, this could result in a significant increase
> > in stderr output during the benchmark.
> > 
> > Users can still notice that something went wrong by checking the “number of other failures”
> > reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
> > users aren’t particularly interested in seeing individual error messages as they happen.
> > 
> > It’s true that seeing error messages during the benchmark might be useful in some cases, but
> > the same could be said for serialization or deadlock errors, and that’s exactly what the
> > --verbose-errors option is for.
> 
> 
> I understand your concern. The condition for calling pg_log_error() was modified
> to reduce stderr output.
> Additionally, an error message was added for cases where some clients aborted
> while executing SQL commands, similar to other code paths that transition to
> st->state = CSTATE_ABORTED, as shown in the example below:
> 
> ```
>                         pg_log_error("client %d aborted while establishing connection", st->id);
>                         st->state = CSTATE_ABORTED;
> ```

            default:
                /* anything else is unexpected */
-               pg_log_error("client %d script %d aborted in command %d query %d: %s",
-                            st->id, st->use_file, st->command, qrynum,
-                            PQerrorMessage(st->con));
+               if (verbose_errors)
+                   pg_log_error("client %d script %d aborted in command %d query %d: %s",
+                                st->id, st->use_file, st->command, qrynum,
+                                PQerrorMessage(st->con));
                goto error;
        }

Thanks to this fix, error messages caused by SQL errors are now output only when
--verbose-errors is enable. However, the comment describes the condition as "unexpected",
and the message states that the client was "aborted". This does not seems accurate, since
clients are not aborted due to SQL errors when --continue-on-errors is enabled. 

I think the error message should be emitted using commandError() when both
--coneinut-on-errors and --verbose-errors are specified, like this;

            case PGRES_NONFATAL_ERROR:
            case PGRES_FATAL_ERROR:
                st->estatus = getSQLErrorStatus(PQresultErrorField(res,
                                                                   PG_DIAG_SQLSTATE));
                if (continue_on_error || canRetryError(st->estatus))
                {
                    if (verbose_errors)
                        commandError(st, PQerrorMessage(st->con));
                    goto error;
                }
                /* fall through */

In addition, the error message in the "default" case should be shown regardless
of the --verbose-errors since it represents an unexpected situation and should
always reported.

Finally, I believe this fix should be included in patch 0001 rather than 0003,
as it would be a part of the implementation of --continiue-on-error.


As of 0003:

+               {
+                   pg_log_error("client %d aborted while executing SQL commands", st->id);
                    st->state = CSTATE_ABORTED;
+               }
                break;

I understand that the patch is not directly related to --continue-on-error, similar to 0002,
and that it aims to improve the error message to indicate that the client was aborted due to
some error during readCommandResponse().

However, this message doesn't seem entirely accurate, since the error is not always caused
by an SQL command failure itself. For example, it could also be due to a failure of the \gset
meta-command.

In addition, this fix causes error messages to be emitted twice. For example, if \gset fails,
the following similar messages are printed:

 pgbench: error: client 0 script 0 command 0 query 0: expected one row, got 0
 pgbench: error: client 0 aborted while executing SQL commands

Even worse, if an unexpected error occurs in readCommandResponse() (i.e., the default case),
the following messages are emitted, both indicating that the client was aborted;

 pgbench: error: client 0 script 0 aborted in command ... query ...
 pgbench: error: client 0 aborted while executing SQL commands

I feel this is a bit redundant.

Therefore, if we are to improve these messages to indicate explicitly that the client
was aborted, I would suggest modifying the error messages in readCommandResponse() rather
than adding a new one in advanceConnectionState().

I've attached patch 0003 incorporating my suggestion. What do you think?

Additionally, the patch 0001 includes the fix that was originally part of
your proposed 0003, as previously discussed.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>

Attachment