Thread: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From
"Vitaly Davydov"
Date:

Dear Hackers,

 

I'd like to discuss a problem with replication slots' restart LSN. Physical slots are saved to disk at the beginning of a checkpoint. At the end of the checkpoint, old WAL segments are recycled or removed from disk if they are no longer needed by any slot's restart_lsn.

 

If an existing physical slot is advanced in the middle of a checkpoint, the WAL segments corresponding to the restart LSN saved on disk may be removed. This is because the minimal LSN across replication slots is calculated at the end of the checkpoint, just before old WAL segments are removed. If the postgres instance is hard-stopped (pg_ctl -m immediate) right after the checkpoint and then restarted, the slot's restart_lsn may point to a removed WAL segment. I believe such behaviour is not good.

 

The documentation [0] describes that restart_lsn may be set to some past value after a restart. There is a discussion [1] on pgsql-hackers where this behaviour was discussed. The main reason for not flushing physical slots on advancing is performance. I'm fine with that behaviour, except that the corresponding WAL segments should not be removed.

 

I propose to retain WAL segments based on the saved-on-disk (flushed) restart_lsn of slots: add a new field restart_lsn_flushed to the ReplicationSlot structure and copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the on-disk format of the slot contents. I attached a patch. It is not yet complete, but it demonstrates a way to solve the problem.
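Roughly, the idea looks like this (just a sketch to illustrate the approach; the exact field placement and locking are assumptions, not the literal patch):

typedef struct ReplicationSlot
{
    /* ... existing fields, including ReplicationSlotPersistentData data ... */

    /* new: data.restart_lsn as it was last written to disk */
    XLogRecPtr  restart_lsn_flushed;
} ReplicationSlot;

/* In SaveSlotToPath(), after the slot file has been successfully fsync'ed: */
SpinLockAcquire(&slot->mutex);
slot->restart_lsn_flushed = slot->data.restart_lsn;
SpinLockRelease(&slot->mutex);

The WAL removal horizon would then be computed from restart_lsn_flushed rather than from the in-memory restart_lsn for persistent slots.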

 

I reproduced the problem in the following way:

  • Add a delay in CheckPointBuffers (pg_usleep) to emulate a long checkpoint.
  • Execute a checkpoint, and run pg_replication_slot_advance from another connection right after the checkpoint has started.
  • Hard-restart the server right after the checkpoint completes.
  • After restart, the slot's restart_lsn may point to a removed WAL segment.

The proposed patch fixes it.

 

[0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html
[1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From
"Vitaly Davydov"
Date:

Dear Hackers,

 

To ping this topic, I'd like to clarify what may be wrong with the idea described here, because I do not see any interest from the community. The topic concerns physical replication. The primary idea is to define the horizon for WAL segment (file) removal based on the restart LSN values saved on disk. Currently, the WAL segment removal horizon is calculated from the slots' current in-memory restart LSN values, which may not have been saved on disk at the time of the calculation. This happens when a slot is advanced during a checkpoint, as described earlier in the thread.

 

Such behaviour is not a problem when slots are used only for physical replication in the conventional way. But it may be a problem when a physical slot is used for other purposes. For example, I have an extension which retains WAL using physical replication slots. It creates a new physical slot and advances it as needed. After a restart, it uses the slot's restart_lsn to read WAL starting from that LSN. In this case, there is no guarantee that the restart_lsn will point to an existing WAL segment.

 

The advantage of the current behaviour is that slightly less WAL has to be kept. The disadvantage is that, in general, physical slots do not guarantee that WAL is retained starting from their restart_lsn.

 

I would be happy to get some advice on whether I am on the right track. Thank you in advance.

 

With best regards,

Vitaly


On Thursday, November 07, 2024 16:30 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:
 

Dear Hackers,

 

I'd like to introduce an improved version of my patch (see the attached file). My original idea was to take into account the saved-on-disk restart_lsn (slot->restart_lsn_flushed) for persistent slots when removing WAL segment files. It helps tackle errors like: ERROR: requested WAL segment 000...0AA has already been removed.

 

Improvements:

  • flushed_restart_lsn is used only for RS_PERSISTENT slots.
  • Save a physical slot to disk on advancing only once, when restart_lsn_flushed is still invalid. This is needed because slots with an invalid restart LSN are not taken into account when calculating the oldest LSN for WAL truncation. Once restart_lsn becomes valid, it should be saved to disk immediately to update restart_lsn_flushed (see the sketch below).
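A rough sketch of that save-on-first-advance logic (hypothetical placement in pg_physical_replication_slot_advance(); the real patch may differ):

/* Persist the slot once while its on-disk restart_lsn is still invalid,
 * so that it starts holding back WAL removal immediately. */
if (MyReplicationSlot->data.persistency == RS_PERSISTENT &&
    XLogRecPtrIsInvalid(MyReplicationSlot->restart_lsn_flushed))
{
    ReplicationSlotMarkDirty();
    ReplicationSlotSave();    /* updates restart_lsn_flushed via SaveSlotToPath() */
}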

Regression tests seem to be OK, except for:

  • recovery/t/001_stream_rep.pl (checkpoint is needed)
  • recovery/t/019_replslot_limit.pl (it seems the slot was invalidated; some adjustments are needed)
  • pg_basebackup/t/020_pg_receivewal.pl (not sure about it)

 

There are some problems:

  • More WAL segments may be kept, which may lead to slot invalidations in some tests (recovery/t/019_replslot_limit.pl). A couple of tests need to be adjusted.

 

With best regards,

Vitaly Davydov



On Thursday, October 31, 2024 13:32 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:

 

Sorry, I attached the missing patch.

On Thursday, October 31, 2024 13:18 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:





 

On 10/31/24 11:18, Vitaly Davydov wrote:
> Dear Hackers,
> 
>  
> 
> I'd like to discuss a problem with replication slots's restart LSN.
> Physical slots are saved to disk at the beginning of checkpoint. At the
> end of checkpoint, old WAL segments are recycled or removed from disk,
> if they are not kept by slot's restart_lsn values.
> 

I agree that if we can lose WAL still needed for a replication slot,
that is a bug. Retaining the WAL is the primary purpose of slots, and we
just fixed a similar issue for logical replication.

> 
> If an existing physical slot is advanced in the middle of checkpoint
> execution, WAL segments, which are related to saved on disk restart LSN
> may be removed. It is because the calculation of the replication slot
> minimal LSN occurs at the end of checkpoint, prior to old WAL
> segments removal. If we hard-stop (pg_ctl -m immediate) the postgres
> instance right after checkpoint and to restart it, the slot's
> restart_lsn may point to the removed WAL segment. I believe, such
> behaviour is not good.
> 

Not sure I 100% follow, but let me rephrase, just so that we're on the
same page. CreateCheckPoint() does this:

  ... something ...

  CheckPointGuts(checkPoint.redo, flags);

  ... something ...

  RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                     checkPoint.ThisTimeLineID);

The slots get synced in CheckPointGuts(), so IIUC you're saying if the
slot gets advanced shortly after the sync, the RemoveOldXlogFiles() may
remove still-needed WAL, because we happen to consider a fresh
restart_lsn when calculating the logSegNo. Is that right?

> 
> The doc [0] describes that restart_lsn may be set to the some past value
> after reload. There is a discussion [1] on pghackers where such
> behaviour is discussed. The main reason of not flushing physical slots
> on advancing is a performance reason. I'm ok with such behaviour, except
> of that the corresponding WAL segments should not be removed.
> 

I don't know which part of [0] you refer to, but I guess you're
referring to this:

    The current position of each slot is persisted only at checkpoint,
    so in the case of a crash the slot may return to an earlier LSN,
    which will then cause recent changes to be sent again when the
    server restarts. Logical decoding clients are responsible for
    avoiding ill effects from handling the same message more than once.

Yes, it's fine if we discard the new in-memory restart_lsn value, and we
do this for performance reasons - flushing the slot on every advance
would be very expensive. I haven't read [1] as it's quite long, but I
guess that's what it says.

But we must not make any "permanent" actions based on the unflushed
value, I think. Like, we should not remove WAL segments, for example.

>  
> 
> I propose to keep WAL segments by saved on disk (flushed) restart_lsn of
> slots. Add a new field restart_lsn_flushed into ReplicationSlot
> structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It
> doesn't change the format of storing the slot contents on disk. I
> attached a patch. It is not yet complete, but demonstrates a way to solve
> the problem.
> 

That seems like a possible fix for this, yes. And maybe it's the right one.

> 
> I reproduced the problem by the following way:
> 
>   * Add some delay in CheckPointBuffers (pg_usleep) to emulate long
>     checkpoint execution.
>   * Execute checkpoint and pg_replication_slot_advance right after
>     starting of the checkpoint from another connection.
>   * Hard restart the server right after checkpoint completion.
>   * After restart slot's restart_lsn may point to removed WAL segment.
> 
> The proposed patch fixes it.
> 

I tried to reproduce the issue using a stress test (checkpoint+restart
in a loop), but so far without success :-(

Can you clarify where exactly you added the pg_usleep(), and how long
are the waits you added? I wonder if the sleep is really needed,
considering the checkpoints are spread anyway. Also, what do you mean by
"hard reset"?

What confuses me a bit is that we update the restart_lsn (and call
ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
all the time. Walsender does that in PhysicalConfirmReceivedLocation for
example. So we actually see the required LSN to move during checkpoint
very often. So how come we don't see the issues much more often? Surely
I miss something important.

Another option might be that pg_replication_slot_advance() doesn't do
something it should be doing. For example, shouldn't it be marking the slot
as dirty?


regards

-- 
Tomas Vondra





On 11/20/24 14:40, Vitaly Davydov wrote:
> Dear Hackers,
> 
>  
> 
> To ping the topic, I'd like to clarify what may be wrong with the idea
> described here, because I do not see any interest from the community.
> The topic is related to physical replication. The primary idea is to
> define the horizon of WAL segments (files) removal based on saved on
> disk restart LSN values. Now, the WAL segment removal horizon is
> calculated based on the current restart LSN values of slots, that can
> not be saved on disk at the time of the horizon calculation. The case
> take place when a slot is advancing during checkpoint as described
> earlier in the topic.
> 

Yeah, a simple way to fix this might be to make sure we don't use the
required LSN value set after CheckPointReplicationSlots() to remove WAL.
AFAICS the problem is KeepLogSeg() gets the new LSN value by:

   keep = XLogGetReplicationSlotMinimumLSN();

Let's say we get the LSN before calling CheckPointGuts(), and then pass
it to KeepLogSeg, so that it doesn't need to get the fresh value.

Wouldn't that fix the issue?
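Something along these lines, I mean (only a sketch; the extra parameter to
KeepLogSeg() is hypothetical):

  /* in CreateCheckPoint(), before the slots are synced to disk */
  XLogRecPtr  slot_min_lsn = XLogGetReplicationSlotMinimumLSN();

  CheckPointGuts(checkPoint.redo, flags);

  ... something ...

  /* later, use the pre-sync value instead of re-reading it */
  KeepLogSeg(recptr, &_logSegNo, slot_min_lsn);
  RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                     checkPoint.ThisTimeLineID);

and inside KeepLogSeg() the "keep = XLogGetReplicationSlotMinimumLSN();" line
would simply use the value passed in by the caller.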

> 
> Such behaviour is not a problem when slots are used only for physical
> replication in a conventional way. But it may be a problem when physical
> slot is used for some other goals. For example, I have an extension
> which keeps the WAL using physical replication slots. It creates a new
> physical slot and advances it as needed. After restart, it can use
> restart lsn of the slot to read WAL from this LSN. In this case, there
> is no guarantee that restart lsn will point to an existing WAL segment.
> 

Yeah.

> 
> The advantage of the current behaviour is that it requires a little bit
> less WAL to keep. The disadvantage is that physical slots do not
> guarantee WAL keeping starting from its' restart lsns in general.
> 

If it's wrong, it doesn't really matter that it has some advantages.


regards

-- 
Tomas Vondra




On 11/20/24 18:24, Tomas Vondra wrote:
>
> ...
>
> What confuses me a bit is that we update the restart_lsn (and call
> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
> example. So we actually see the required LSN to move during checkpoint
> very often. So how come we don't see the issues much more often? Surely
> I miss something important.
> 

This question "How come we don't see this more often?" kept bugging me,
and the answer is actually pretty simple.

The restart_lsn can move backwards after a hard restart (for the reasons
explained), but physical replication does not actually rely on that. The
replica keeps track of the LSN it received (well, it uses the same LSN),
and on reconnect it sends the startpoint to the primary. And the primary
just proceeds to use that instead of the (stale) restart LSN for the slot.
And the startpoint is guaranteed (I think) to be at least restart_lsn.

AFAICS this would work for pg_replication_slot_advance() too, that is if
you remember the last LSN the slot advanced to, it should be possible to
advance to it just fine. Of course, it requires a way to remember that
LSN, which for a replica is not an issue. But this just highlights we
can't rely on restart_lsn for this purpose.


(Apologies if this repeats something obvious, or something you already
said, Vitaly.)

regards

-- 
Tomas Vondra





On 11/20/24 23:19, Tomas Vondra wrote:
> On 11/20/24 18:24, Tomas Vondra wrote:
>>
>> ...
>>
>> What confuses me a bit is that we update the restart_lsn (and call
>> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
>> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
>> example. So we actually see the required LSN to move during checkpoint
>> very often. So how come we don't see the issues much more often? Surely
>> I miss something important.
>>
> 
> This question "How come we don't see this more often?" kept bugging me,
> and the answer is actually pretty simple.
> 
> The restart_lsn can move backwards after a hard restart (for the reasons
> explained), but physical replication does not actually rely on that. The
> replica keeps track of the LSN it received (well, it uses the same LSN),
> and on reconnect it sends the startpoint to the primary. And the primary
> just proceeds to use that instead of the (stale) restart LSN for the slot.
> And the startpoint is guaranteed (I think) to be at least restart_lsn.
> 
> AFAICS this would work for pg_replication_slot_advance() too, that is if
> you remember the last LSN the slot advanced to, it should be possible to
> advance to it just fine. Of course, it requires a way to remember that
> LSN, which for a replica is not an issue. But this just highlights we
> can't rely on restart_lsn for this purpose.
> 

I kept thinking about this (sorry this is so incremental), particularly
if this applies to logical replication too. And AFAICS it does not, or
at least not to this extent.

For streaming, the subscriber sends the startpoint (just like physical
replication), so it should be protected too.

But then there's the SQL API - pg_logical_slot_get_changes(). And it
turns out it ends up syncing the slot to disk pretty often, because for
RUNNING_XACTS we call LogicalDecodingProcessRecord() + standby_decode(),
which ends up calling SaveSlotToDisk(). And at the end we call
LogicalConfirmReceivedLocation() for good measure, which saves the slot
too, just to be sure.

FWIW I suspect this still is not perfectly safe, because we may still
crash / restart before the updated data.restart_lsn makes it to disk,
but after it was already used to remove old WAL, although that's
probably harder to hit. With streaming the subscriber will still send us
the new startpoint, so that should not fail I think. But with the SQL
API we probably can get into the "segment already removed" issues.

I haven't tried reproducing this yet, I guess it should be possible
using the injection points. Not sure when I get to this, though.


In any case, doesn't this suggest SaveSlotToDisk() really is not that
expensive, if we do it pretty often for logical replication? Which was
presented as the main reason why pg_replication_slot_advance() doesn't
do that. Maybe it should?

If the advance is substantial I don't think it really matters, because
there simply can't be that many large advances. It amortizes, in a
way. But even with smaller advances it should be fine, I think - if the
goal is to not remove WAL prematurely, it's enough to flush when we move
to the next segment.
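For example, a check like this in pg_replication_slot_advance() might be
enough (a sketch only; startlsn/endlsn stand for the slot's restart_lsn
before and after the advance):

  XLogSegNo   old_segno;
  XLogSegNo   new_segno;

  XLByteToSeg(startlsn, old_segno, wal_segment_size);
  XLByteToSeg(endlsn, new_segno, wal_segment_size);

  if (new_segno != old_segno)
  {
      /* crossed a segment boundary - persist the slot, so the on-disk
       * restart_lsn keeps the WAL-removal horizon safe */
      ReplicationSlotMarkDirty();
      ReplicationSlotSave();
  }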


regards

-- 
Tomas Vondra




Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From
"Vitaly Davydov"
Date:

Hi Tomas,

 

Thank you for the reply and your interest in the investigation.

On Wednesday, November 20, 2024 20:24 MSK, Tomas Vondra <tomas@vondra.me> wrote:
 

> If an existing physical slot is advanced in the middle of checkpoint
> execution, WAL segments, which are related to saved on disk restart LSN
> may be removed. It is because the calculation of the replication slot
> minimal LSN occurs at the end of checkpoint, prior to old WAL
> segments removal. If we hard-stop (pg_ctl -m immediate) the postgres
> instance right after checkpoint and to restart it, the slot's
> restart_lsn may point to the removed WAL segment. I believe, such
> behaviour is not good.


Not sure I 100% follow, but let me rephrase, just so that we're on the
same page. CreateCheckPoint() does this:

... something ...

CheckPointGuts(checkPoint.redo, flags);

... something ...

RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);

The slots get synced in CheckPointGuts(), so IIUC you're saying if the
slot gets advanced shortly after the sync, the RemoveOldXlogFiles() may
remove still-needed WAL, because we happen to consider a fresh
restart_lsn when calculating the logSegNo. Is that right?

The key action here is to restart the instance with -m immediate (or kill it and start it again) right after the checkpoint. After restart, the slot's restart_lsn will be read from disk and will point to a removed WAL segment, if the LSN was advanced enough to switch to a new WAL segment.

 


I tried to reproduce the issue using a stress test (checkpoint+restart
in a loop), but so far without success :-(

Can you clarify where exactly you added the pg_usleep(), and how long
are the waits you added? I wonder if the sleep is really needed,
considering the checkpoints are spread anyway. Also, what you mean by
"hard reset"?
 

I added pg_usleep as shown below (in the CheckPointBuffers function):

 

CheckPointBuffers(int flags)
{
    BufferSync(flags);
+    pg_usleep(10000000);
}
 

Below is how I run my test (the pg_usleep should be added to the code first):

 

CONSOLE> initdb -D pgdata
CONSOLE> pg_ctl -D pgdata -l logfile start
... open two psql terminals and connect to the database (lets call them as PSQL-1, PSQL-2)
PSQL-1>  select pg_create_physical_replication_slot('myslot', true, false);
CONSOLE> pgbench -i -s 10 postgres # create some WAL records
PSQL-1>  checkpoint; -- press ENTER key and go to PSQL-2 console and execute next line in 1 second
PSQL-2>  select pg_replication_slot_advance('myslot', pg_current_wal_lsn()); -- advance repslot during checkpoint
... wait for checkpoint to complete
CONSOLE> pg_ctl -D pgdata -m immediate stop 
CONSOLE> pg_ctl -D pgdata start
PSQL-1>  \c
PSQL-1>  create extension pg_walinspect;
PSQL-1>  select pg_get_wal_record_info(restart_lsn) from pg_replication_slots where slot_name = 'myslot';
ERROR:  requested WAL segment pg_wal/000000010000000000000001 has already been removed

 

I'm trying to create a Perl test to reproduce it. Please give me some time to create the test script.

 

I kept thinking about this (sorry it's this incremental), particularly
if this applies to logical replication too. And AFAICS it does not, or
at least not to this extent.

 

Yes, it does not apply to logical replication, because a logical slot is synced to disk when it is advanced.

 

Yeah, a simple way to fix this might be to make sure we don't use the
required LSN value set after CheckPointReplicationSlots() to remove WAL.
AFAICS the problem is KeepLogSeg() gets the new LSN value by:

keep = XLogGetReplicationSlotMinimumLSN();

Let's say we get the LSN before calling CheckPointGuts(), and then pass
it to KeepLogSeg, so that it doesn't need to get the fresh value.

 

Yes, it is another solution and it can fix the problem. The question is which solution to choose. I prefer to add a new in-memory state variable to the slot structure. Such a variable may be useful if we want to check whether the slot data has been synced or not. Calculating the keep value before CheckPointGuts(), IMHO, requires changing the signatures of a number of functions. I can prepare a new patch where your solution is implemented.

 

I'm sorry if I failed to answer some other questions. I will answer later.

 

With best regards,

Vitaly



 

Dear Hackers,

Please let me introduce a new version of the patch.

Patch description:

The slot data is flushed to disk at the beginning of a checkpoint. If
an existing slot is advanced in the middle of checkpoint execution, its
advanced restart LSN is taken to calculate the oldest LSN for WAL
segment removal at the end of the checkpoint. If the node is restarted
just after the checkpoint, the slot data will be read from disk during
recovery with the older, stale restart LSN, which can refer to removed
WAL segments.

The patch introduces a new in-memory state for slots,
flushed_restart_lsn, which is used to calculate the oldest LSN for WAL
segment removal. This state is updated with the current restart_lsn
every time the slot is saved to disk.
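For illustration, the horizon calculation would then look roughly like this
(a sketch assuming restart_lsn_flushed lives in the shared slot entry;
locking kept minimal):

/* hypothetical variant of ReplicationSlotsComputeRequiredLSN() */
XLogRecPtr  min_required = InvalidXLogRecPtr;

LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
for (int i = 0; i < max_replication_slots; i++)
{
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
    XLogRecPtr  restart_lsn;

    if (!s->in_use)
        continue;

    SpinLockAcquire(&s->mutex);
    /* for persistent slots, trust only the value that has reached disk */
    if (s->data.persistency == RS_PERSISTENT)
        restart_lsn = s->restart_lsn_flushed;
    else
        restart_lsn = s->data.restart_lsn;
    SpinLockRelease(&s->mutex);

    if (XLogRecPtrIsInvalid(restart_lsn))
        continue;
    if (XLogRecPtrIsInvalid(min_required) || restart_lsn < min_required)
        min_required = restart_lsn;
}
LWLockRelease(ReplicationSlotControlLock);

XLogSetReplicationSlotMinimumLSN(min_required);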

With best regards,
Vitaly

Hi, Vitaly!

On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> The slot data is flushed to the disk at the beginning of checkpoint. If
> an existing slot is advanced in the middle of checkpoint execution, its
> advanced restart LSN is taken to calculate the oldest LSN for WAL
> segments removal at the end of checkpoint. If the node is restarted just
> after the checkpoint, the slots data will be read from the disk at
> recovery with the oldest restart LSN which can refer to removed WAL
> segments.
>
> The patch introduces a new in-memory state for slots -
> flushed_restart_lsn which is used to calculate the oldest LSN for WAL
> segments removal. This state is updated every time with the current
> restart_lsn at the moment, when the slot is saving to disk.

Thank you for your work on this subject.  I think your approach is
generally correct.  When we're truncating the WAL, we need to rely on
the position that would be used in the case of a server crash.  That is
the position flushed to disk.

While your patch generally looks good, I'd like to make the following notes:

1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
to advance the position of WAL needed by replication slots, the usage
pattern probably could be changed.  Thus, we probably need to call
ReplicationSlotsComputeRequiredLSN() somewhere after change of
restart_lsn_flushed while restart_lsn is not changed.  And probably
can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
restart_lsn is changed.
2) I think it's essential to include in the patch test cases which
fail without the patch.  You could start by integrating the test from [1]
into your patch, and then add more similar tests for different situations.

Links.
1.  https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me

------
Regards,
Alexander Korotkov
Supabase



Hi Alexander,

Thank you for the review. I apologize for the late reply; I missed your email.

> 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
> to advance the position of WAL needed by replication slots, the usage
> pattern probably could be changed.  Thus, we probably need to call
> ReplicationSlotsComputeRequiredLSN() somewhere after change of
> restart_lsn_flushed while restart_lsn is not changed.  And probably
> can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
> restart_lsn is changed.

Yes, it is a good idea to investigate, thank you! I guess it may work for
persistent slots, but I'm not sure about other types of slots (ephemeral and
temporary). I have no clear understanding of the consequences at the moment. I
propose to postpone it for the future, because the proposed changes would be
more invasive.

> 2) I think it's essential to include into the patch test cases which
> fail without patch.  You could start from integrating [1] test into
> your patch, and then add more similar tests for different situations.

The problem with TAP tests is that the issue is hard to reproduce without
injection points. Tomas Vondra's tests create two new injection points, and I
would have to add more injection points for new tests as well. Injection points
help to test the code but make it less readable. I'm not sure what the policy
for creating injection points is. Personally, I would not like to add new
injection points only to check this particular rare case. I'm trying to create
a test without injection points that should fail occasionally, but I haven't
succeeded so far.

I have a question: is there any interest in backporting the solution to the
existing major releases? I can prepare a patch where restart_lsn_flushed is
stored outside of the ReplicationSlot structure and doesn't affect the
existing API.

With best regards,
Vitaly

On Friday, April 04, 2025 06:22 MSK, Alexander Korotkov <aekorotkov@gmail.com> wrote:

> Hi, Vitaly!
>
> On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> > The slot data is flushed to the disk at the beginning of checkpoint. If
> > an existing slot is advanced in the middle of checkpoint execution, its
> > advanced restart LSN is taken to calculate the oldest LSN for WAL
> > segments removal at the end of checkpoint. If the node is restarted just
> > after the checkpoint, the slots data will be read from the disk at
> > recovery with the oldest restart LSN which can refer to removed WAL
> > segments.
> >
> > The patch introduces a new in-memory state for slots -
> > flushed_restart_lsn which is used to calculate the oldest LSN for WAL
> > segments removal. This state is updated every time with the current
> > restart_lsn at the moment, when the slot is saving to disk.
>
> Thank you for your work on this subject.   I think generally your
> approach is correct.  When we're truncating the WAL log, we need to
> reply on the position that would be used in the case of server crush.
> That is the position flushed to the disk.
>
> While your patch is generality looks good, I'd like make following notes:
>
> 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
> to advance the position of WAL needed by replication slots, the usage
> pattern probably could be changed.  Thus, we probably need to call
> ReplicationSlotsComputeRequiredLSN() somewhere after change of
> restart_lsn_flushed while restart_lsn is not changed.  And probably
> can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
> restart_lsn is changed.
> 2) I think it's essential to include into the patch test cases which
> fail without patch.  You could start from integrating [1] test into
> your patch, and then add more similar tests for different situations.
>
> Links.
> 1.  https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me
>
> ------
> Regards,
> Alexander Korotkov
> Supabase




On Thu, Apr 24, 2025 at 5:32 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> Thank you for the review. I apologize for a late reply. I missed your email.
>
> > 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
> > to advance the position of WAL needed by replication slots, the usage
> > pattern probably could be changed.  Thus, we probably need to call
> > ReplicationSlotsComputeRequiredLSN() somewhere after change of
> > restart_lsn_flushed while restart_lsn is not changed.  And probably
> > can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
> > restart_lsn is changed.
>
> Yes, it is a good idea for investigation, thank you! I guess, It may work for
> persistent slots but I'm not sure about other types of slots (ephemeral and
> temporary). I have no clear understanding of consequences at the moment. I
> propose to postpone it for future, because the proposed changes will me more
> invasive.

Yes, that's different for different types of slots.  So, removing
ReplicationSlotsComputeRequiredLSN() doesn't look safe.  But at least,
we need to analyze if we need to add extra calls.

> > 2) I think it's essential to include into the patch test cases which
> > fail without patch.  You could start from integrating [1] test into
> > your patch, and then add more similar tests for different situations.
>
> The problem with TAP tests - it is hard to reproduce without injection points.
> The Tomas Vondra's tests create two new injection points. I have to add more
> injection points for new tests as well. Injection points help to test the code
> but make the code unreadable. I'm not sure, what is the policy of creating
> injection points? Personally, I would not like to add new injection points
> only to check this particular rare case. I'm trying to create a test without
> injection points that should fail occasionally, but I haven't succeeded at
> the moment.

I don't know if there is an explicit policy.  I think we just add them
as needed to reproduce important situations in TAP tests.  So, feel
free to add as many as you need to reproduce all the problematic
situations.  During review we can see if they seem too many, but
don't worry about this at the present stage.

> I have a question - is there any interest to backport the solution into
> existing major releases?

As long as this is a bug, it should be backpatched to all supported
affected releases.

> I can prepare a patch where restart_lsn_flushed stored
> outside of ReplicationSlot structure and doesn't affect the existing API.

Yes, please!

------
Regards,
Alexander Korotkov
Supabase



On Mon, Apr 28, 2025 at 8:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > I have a question - is there any interest to backport the solution into
> > existing major releases?
>
> As long as this is the bug, it should be backpatched to all supported
> affected releases.

Yes, but I think we cannot back-patch the proposed fix to back
branches as it changes the ReplicationSlot struct defined in slot.h,
which breaks ABI compatibility.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Apr 29, 2025 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Apr 28, 2025 at 8:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > > I have a question - is there any interest to backport the solution into
> > > existing major releases?
> >
> > As long as this is the bug, it should be backpatched to all supported
> > affected releases.
>
> Yes, but I think we cannot back-patch the proposed fix to back
> branches as it changes the ReplicationSlot struct defined in slot.h,
> which breaks ABI compatibility.

Yes, and I think Vitaly already proposed to address this issue.  This
aspect also needs to be carefully reviewed for sure.

On Thu, Apr 24, 2025 at 5:32 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> I can prepare a patch where restart_lsn_flushed stored
> outside of ReplicationSlot structure and doesn't affect the existing API.

------
Regards,
Alexander Korotkov
Supabase



On Mon, Apr 28, 2025 at 6:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, Apr 29, 2025 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Apr 28, 2025 at 8:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > > I have a question - is there any interest to backport the solution into
> > > > existing major releases?
> > >
> > > As long as this is the bug, it should be backpatched to all supported
> > > affected releases.
> >
> > Yes, but I think we cannot back-patch the proposed fix to back
> > branches as it changes the ReplicationSlot struct defined in slot.h,
> > which breaks ABI compatibility.
>
> Yes, and I think Vitaly already proposed to address this issue.  This
> aspect also needs to be carefully reviewed for sure.
>
> On Thu, Apr 24, 2025 at 5:32 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> > I can prepare a patch where restart_lsn_flushed stored
> > outside of ReplicationSlot structure and doesn't affect the existing API.

Oh, I totally missed this part. Sorry for the noise. I'll review
the patch once it's submitted.

Regarding the proposed patch, I think we can somewhat follow the
last_saved_confirmed_flush field of ReplicationSlot. For example, we
can set restart_lsn_flushed when restoring the slot from disk, etc.
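For example, something like this in RestoreSlotFromDisk() (a sketch; the
field name restart_lsn_flushed is taken from the proposed patch):

    /* after copying the on-disk state into the shared slot entry,
     * what is on disk is by definition flushed */
    slot->restart_lsn_flushed = cp.slotdata.restart_lsn;

    /* analogous to how last_saved_confirmed_flush is initialized there */
    slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;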

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Dear All,

Thank you for the attention to the patch. I updated the patch with a better
solution for the master branch, which can easily be backported to the other
branches once we agree on the final solution.

Two tests are introduced, based on Tomas Vondra's injection-point test for
logical slots from the discussion (patches [1], [2], [3]). The tests are
implemented as module tests in the src/test/modules/test_replslot_required_lsn
directory. I slightly modified the original test for logical slots and created
a new, simpler test for physical slots without any additional injection points.

I prepared a new solution (patch [4]) which is also based on Tomas Vondra's
proposal. With a fresh eye, I realized that it can fix the issue as well, and
it is easier and less invasive to implement. The new solution differs from my
original one: it is backward compatible (it doesn't require any changes to the
ReplicationSlot structure). My original solution could be made backward
compatible as well, by allocating flushed_restart_lsn in a separate array in
shared memory rather than in the ReplicationSlot structure, but I believe the
new solution is the better one. If you still think my previous solution is
better (I don't), I will prepare a backward-compatible patch based on it.

I also propose one more commit (patch [5]) which removes calls of the
ReplicationSlotsComputeRequiredLSN function that seem to be redundant. This
function updates the oldest required LSN for slots and is called every time a
slot's restart_lsn is changed. Since we use the oldest required LSN in
CreateCheckPoint/CreateRestartPoint to remove old WAL segments, I believe there
is no need to recalculate the oldest value immediately when a slot is advanced,
or in the other cases where restart_lsn is changed. It may affect the
GetWALAvailability function, because the oldest required LSN will not be up to
date, but that function seems to be used only in the pg_get_replication_slots
system view and doesn't affect the logic of old WAL segment removal. I also
have some doubts concerning the advancing of logical replication slots, where
the call of ReplicationSlotsComputeRequiredLSN was removed: I'm not sure how it
can affect ReplicationSlotsComputeRequiredXmin. This commit is not necessary
for the fix, but I think it is worth considering. It may be dropped or applied
only to the master branch.

This patch can easily be backported to the major release branches. I will
quickly prepare the patches for the major releases once we agree on the final
solution.

I apologize for such a late change to the patch when it is already in the
commitfest. I'm not yet well experienced in the internals of PostgreSQL;
sometimes the better solution needs some time to grow. In doing we learn :)

[1] 0001-Add-injection-points-to-test-replication-slot-advanc.v2.patch
[2] 0002-Add-TAP-test-to-check-logical-repl-slot-advance-duri.v2.patch
[3] 0003-Add-TAP-test-to-check-physical-repl-slot-advance-dur.v2.patch
[4] 0004-Keep-WAL-segments-by-slot-s-flushed-restart-LSN.v2.patch
[5] 0005-Remove-redundant-ReplicationSlotsComputeRequiredLSN-.v2.patch

With best regards,
Vitaly
