RE: ReplicationSlotRelease() crashes when the instance is in the single user mode - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: ReplicationSlotRelease() crashes when the instance is in the single user mode
Date
Msg-id OSCPR01MB14966B39B3ED21229EAEFA10FF531A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: ReplicationSlotRelease() crashes when the instance is in the single user mode  (Paul A Jungwirth <pj@illuminatedcomputing.com>)
Responses Re: ReplicationSlotRelease() crashes when the instance is in the single user mode
List pgsql-hackers
Dear Paul, Mutaamba,

Here are updated patches. Based on the Robert's suggestion, I separated into two parts.
0001 fixed the original issue and 0002 prohibited the slot manipulation in
single-user mode. I want to focus on 0001 first because on one would argue it.

All comments from you were included in 0002.

> The patch does not apply. The attached v5 fixes it with a small update
> to the context for the slot.h hunk.

Oh, I didn't realize because cfbot said OK. Anyway, new patch could be applied atop HEAD.

> The commit message needs some explanation. Why are we doing this? What
> does the patch do? What alternatives did we consider?

Updated. How do you feel?

> The patch has no docs. If we plan to forbid some functions in
> single-user mode, we should document which ones (e.g. in func.sgml).

A statement was added.

> There are no tests. Do we have any tests at all for single-user mode?
> The only one I see is in recovery/t/017_shm.pl. What if we had tests
> that ran the regress suite in single-user mode? Basically instead of
> psql we run postgres --single? Do we expect a lot of it would fail? Is
> it small enough the test could maintain a diff that expects those
> failures? I'm not saying we should do that as part of this patch, but
> is there any interest in that? Since single-user mode is most often
> used when things are broken, it's a harsh place to hit a bug.
> 
> The patch lacks source comments. Something on
> CheckSlotIsInSingleUserMode explaining why would be good.

Comments were added in all caller functions.

> In ReplicationSlotRelease, why only assert that `slot->active_pid !=
> 0`? Why not assert that it's MyProcPid (even outside single-user
> mode)? Can one backend really release a slot while another is using
> it? Can you drop it? Maybe you can copy it?

You meant we should assert `slot->active_pid == MyProcPid`, right?
Naively considered it can be fix, but it is out-of-scope of the thread. It is
existing code, should be discussed verified in another place.

> Are we calling CheckSlotIsInSingleUserMode from everywhere that needs
> it? We tried to check all the functions that call
> ReplicationSlotCreate, ReplicationSlotRelease, and
> update_synced_slots_inactive_since (since they all have these
> asserts). To call out a few:
> 
> The pg_replication_origin_* functions don't call the Assert-ing functions.

You asked that whether we should call CheckSlotIsInSingleUserMode in
pg_replication_origin_* APIs, right? It depends on the policy. If we want to
prohibit all replication-related command, it should be. It is still under
discussion, so I did not touch.

> binary_upgrade_logical_slot_has_caught_up - Not available from the command
> line.

I found that we can in both single-user and binary-upgrade mode. At that time
the function could be called:

```
$ postgres --single -D data/ postgres -b

PostgreSQL stand-alone backend 19devel
backend> SELECT binary_upgrade_logical_slot_has_caught_up('test');
         1: binary_upgrade_logical_slot_has_caught_up   (typeid = 16, len = 1, typmod = -1, byval = t)
        ----
LOG:  starting logical decoding for slot "test"
DETAIL:  Streaming transactions committing after 0/0182C640, reading WAL from 0/0182C608.
STATEMENT:  SELECT binary_upgrade_logical_slot_has_caught_up('test');

LOG:  logical decoding found consistent point at 0/0182C608
DETAIL:  There are no running transactions.
STATEMENT:  SELECT binary_upgrade_logical_slot_has_caught_up('test');
 
         1: binary_upgrade_logical_slot_has_caught_up = "f"     (typeid = 16, len = 1, typmod = -1, byval = t)
        ----
```

I don't think this is a realistic case, but the check was added.

> WalSndErrorCleanup - Probably not used in single-user mode?

IIUC we won't reach in the single-user mode. Hence it was not called intentionally.

> We also see that PostgresMain calls ReplicationSlotRelease from its
> error handler. Presumably single-user mode would be executing
> PostgresSingleUserMain instead,

ISTM, PostgresMain() would be called from the PostgresSingleUserMain().

> but still perhaps we should call
> CheckSlotIsInSingleUserMode here just as a sanity-check?

I feel this code is to release the acquired slot when ERROR was raised. Since we
have already covered the entrance, it is not needed.

[1]: https://www.postgresql.org/message-id/CA%2BTgmobspWhoMysNHL9b3C9xi%3DOpHdBYhtAgDH4N_A2foyjN-w%40mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: GB18030-2022 Support in PostgreSQL
Next
From: Chao Li
Date:
Subject: Re: GB18030-2022 Support in PostgreSQL