Thread: Re: Mention idle_replication_slot_timeout in pg_replication_slots docs
On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> Hi,
>
> The pg_replication_slots documentation mentions only max_slot_wal_keep_size
> as a condition under which the wal_status column can show unreserved or lost.
> However, since commit ac0e33136ab, idle_replication_slot_timeout can also
> cause this behavior when it is set. This has not been documented yet.
> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
>

+1 to the doc update.

> So, how about updating the documentation to also mention
> idle_replication_slot_timeout as a factor that can cause wal_status to
> become unreserved or lost? Patch attached.
>

Since idle_replication_slot_timeout can only cause wal_status to
become 'lost' and not 'unreserved', perhaps we can reword the sentence
slightly for clarity, suggestion -
"The last two states are seen when max_slot_wal_keep_size is
non-negative and, the 'lost' state may also appear when
idle_replication_slot_timeout is greater than zero."

Please feel free to rephrase if needed.

--
Thanks,
Nisha

On 2025/06/26 15:46, Nisha Moond wrote:
> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> Hi,
>>
>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size
>> as a condition under which the wal_status column can show unreserved or lost.
>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also
>> cause this behavior when it is set. This has not been documented yet.
>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
>>
>
> +1 to the doc update.

Thanks for the review!

>> So, how about updating the documentation to also mention
>> idle_replication_slot_timeout as a factor that can cause wal_status to
>> become unreserved or lost? Patch attached.
>>
>
> Since idle_replication_slot_timeout can only cause wal_status to
> become 'lost' and not 'unreserved', perhaps we can reword the sentence
> slightly for clarity, suggestion -
> "The last two states are seen when max_slot_wal_keep_size is
> non-negative and, the 'lost' state may also appear when
> idle_replication_slot_timeout is greater than zero."

I was thinking that when idle_replication_slot_timeout triggers,
the following functions are called, and that wal_status can become
"unreserved" before ReplicationSlotRelease() runs. It's a very short
period, though. Am I wrong?

    ReplicationSlotMarkDirty();
    ReplicationSlotSave();
    ReplicationSlotRelease();

Regards,

--
Fujii Masao
NTT DATA Japan Corporation

On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2025/06/26 15:46, Nisha Moond wrote:
> > On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >> Hi,
> >>
> >> The pg_replication_slots documentation mentions only max_slot_wal_keep_size
> >> as a condition under which the wal_status column can show unreserved or lost.
> >> However, since commit ac0e33136ab, idle_replication_slot_timeout can also
> >> cause this behavior when it is set. This has not been documented yet.
> >> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
> >>
> >
> > +1 to the doc update.
>
> Thanks for the review!
>
> >> So, how about updating the documentation to also mention
> >> idle_replication_slot_timeout as a factor that can cause wal_status to
> >> become unreserved or lost? Patch attached.
> >>
> >
> > Since idle_replication_slot_timeout can only cause wal_status to
> > become 'lost' and not 'unreserved', perhaps we can reword the sentence
> > slightly for clarity, suggestion -
> > "The last two states are seen when max_slot_wal_keep_size is
> > non-negative and, the 'lost' state may also appear when
> > idle_replication_slot_timeout is greater than zero."
>
> I was thinking that when idle_replication_slot_timeout triggers,
> the following functions are called, and that wal_status can become
> "unreserved" before ReplicationSlotRelease() runs. It's a very short
> period, though. Am I wrong?
>
>     ReplicationSlotMarkDirty();
>     ReplicationSlotSave();
>     ReplicationSlotRelease();
>

Thank you for pointing it out.
You are correct that while the checkpointer is in the process of
invalidating a slot, it sets its PID as the slot’s active_pid. During
this short window, if a user queries pg_replication_slots, the
underlying function pg_get_replication_slots will compute the
wal_status as 'unreserved' for the invalidated slot because the slot
has a valid active_pid.

That said, it's reasonable to mention in the doc that 'unreserved' may
appear when idle_replication_slot_timeout is greater than zero, as
this can indeed happen. So, let's retain the current description.

However, this behavior isn’t specific to
idle_replication_slot_timeout. For example, when a slot is being
invalidated due to a different cause "wal_level_insufficient",
'unreserved' may also briefly appear in wal_status.

The current patch LGTM.

--
Thanks,
Nisha

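As a rough illustration of the short window described above, the decision can be modelled in a few lines of C. This is a toy sketch, not the PostgreSQL source: the ToySlot struct and toy_wal_status() are hypothetical names. The only rule modelled is the one stated in this thread: an invalidated slot that still has a non-zero active_pid is reported as 'unreserved', and as 'lost' once the slot has been released. All healthy states are collapsed into 'reserved' for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a replication slot's state. */
typedef struct ToySlot
{
    bool invalidated;   /* slot has been marked invalid, for any cause */
    int  active_pid;    /* 0 when no process currently holds the slot */
} ToySlot;

/*
 * Toy model of the wal_status value a query would see.
 * Assumption: only the rule described in the thread is modelled here;
 * the real computation considers more inputs than these two fields.
 */
static const char *
toy_wal_status(const ToySlot *slot)
{
    if (!slot->invalidated)
        return "reserved";      /* healthy states are not distinguished */

    /* Invalidated, but some process (e.g. the checkpointer) still holds it. */
    if (slot->active_pid != 0)
        return "unreserved";

    /* Invalidated and released. */
    return "lost";
}
```

For example, toy_wal_status(&(ToySlot){true, 12345}) models a query that arrives while the checkpointer still holds the slot, and toy_wal_status(&(ToySlot){true, 0}) models the same query after ReplicationSlotRelease() has run.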
On Fri, Jun 27, 2025 at 5:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2025/06/27 15:32, Nisha Moond wrote:
> > On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >> On 2025/06/26 15:46, Nisha Moond wrote:
> >>> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size
> >>>> as a condition under which the wal_status column can show unreserved or lost.
> >>>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also
> >>>> cause this behavior when it is set. This has not been documented yet.
> >>>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
> >>>>
> >>>
> >>> +1 to the doc update.
> >>
> >> Thanks for the review!
> >>
> >>>> So, how about updating the documentation to also mention
> >>>> idle_replication_slot_timeout as a factor that can cause wal_status to
> >>>> become unreserved or lost? Patch attached.
> >>>>
> >>>
> >>> Since idle_replication_slot_timeout can only cause wal_status to
> >>> become 'lost' and not 'unreserved', perhaps we can reword the sentence
> >>> slightly for clarity, suggestion -
> >>> "The last two states are seen when max_slot_wal_keep_size is
> >>> non-negative and, the 'lost' state may also appear when
> >>> idle_replication_slot_timeout is greater than zero."
> >>
> >> I was thinking that when idle_replication_slot_timeout triggers,
> >> the following functions are called, and that wal_status can become
> >> "unreserved" before ReplicationSlotRelease() runs. It's a very short
> >> period, though. Am I wrong?
> >>
> >>     ReplicationSlotMarkDirty();
> >>     ReplicationSlotSave();
> >>     ReplicationSlotRelease();
> >>
> >
> > Thank you for pointing it out.
> > You are correct that while the checkpointer is in the process of
> > invalidating a slot, it sets its PID as the slot’s active_pid. During
> > this short window, if a user queries pg_replication_slots, the
> > underlying function pg_get_replication_slots will compute the
> > wal_status as 'unreserved' for the invalidated slot because the slot
> > has a valid active_pid.
> >
> > That said, it's reasonable to mention in the doc that 'unreserved' may
> > appear when idle_replication_slot_timeout is greater than zero, as
> > this can indeed happen. So, let's retain the current description.
> >
> > However, this behavior isn’t specific to
> > idle_replication_slot_timeout. For example, when a slot is being
> > invalidated due to a different cause "wal_level_insufficient",
> > 'unreserved' may also briefly appear in wal_status.
>
> Yes, and "lost" can appear for various reasons, including wal_level_insufficient,
> so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost"
> status in the note. It would probably be better to remove the mention of "lost"
> from that note.
>

+1

> As for "unreserved", it can also occur for different reasons, but typically,
> it happens when max_slot_wal_keep_size is set to a non-negative value.
> So it might make sense to keep the explanation focused just on "unreserved"
> and max_slot_wal_keep_size. For example:
>
> ----------------------
>           <listitem>
>            <para>
>             <literal>unreserved</literal> means that the slot no longer
>             retains the required WAL files and some of them are to be removed at
> -           the next checkpoint. This state can return
> +           the next checkpoint. This can occur when
> +           <xref linkend="guc-max-slot-wal-keep-size"/> is set to
> +           a non-negative value. This state can return
>             to <literal>reserved</literal> or <literal>extended</literal>.
>            </para>
>           </listitem>
>           <listitem>
> ----------------------
>
> What do you think?
>

The change LGTM, only a minor suggestion to add "typically", as “This
can typically occur when…” to indicate that max_slot_wal_keep_size is
one possible reason, not the only one.

>
> Also, I noticed the note that says “If <structfield>restart_lsn</structfield>
> is NULL, this field is null” seems inaccurate. For example, when "wal_removed"
> happens, restart_lsn is NULL but wal_status is "lost". So maybe we should remove
> that note as well?

You're right, the statement is not accurate.
We could rephrase it as: "If <structfield>restart_lsn</structfield> is
NULL, this field is either null or lost." But since 'unreserved' can
also appear briefly during invalidation, it might be better to remove
it altogether.

--
Thanks,
Nisha

On 2025/06/30 20:32, Nisha Moond wrote:
> On Fri, Jun 27, 2025 at 5:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> On 2025/06/27 15:32, Nisha Moond wrote:
>>> On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>
>>>> On 2025/06/26 15:46, Nisha Moond wrote:
>>>>> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size
>>>>>> as a condition under which the wal_status column can show unreserved or lost.
>>>>>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also
>>>>>> cause this behavior when it is set. This has not been documented yet.
>>>>>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
>>>>>>
>>>>>
>>>>> +1 to the doc update.
>>>>
>>>> Thanks for the review!
>>>>
>>>>>> So, how about updating the documentation to also mention
>>>>>> idle_replication_slot_timeout as a factor that can cause wal_status to
>>>>>> become unreserved or lost? Patch attached.
>>>>>>
>>>>>
>>>>> Since idle_replication_slot_timeout can only cause wal_status to
>>>>> become 'lost' and not 'unreserved', perhaps we can reword the sentence
>>>>> slightly for clarity, suggestion -
>>>>> "The last two states are seen when max_slot_wal_keep_size is
>>>>> non-negative and, the 'lost' state may also appear when
>>>>> idle_replication_slot_timeout is greater than zero."
>>>>
>>>> I was thinking that when idle_replication_slot_timeout triggers,
>>>> the following functions are called, and that wal_status can become
>>>> "unreserved" before ReplicationSlotRelease() runs. It's a very short
>>>> period, though. Am I wrong?
>>>>
>>>>     ReplicationSlotMarkDirty();
>>>>     ReplicationSlotSave();
>>>>     ReplicationSlotRelease();
>>>>
>>>
>>> Thank you for pointing it out.
>>> You are correct that while the checkpointer is in the process of
>>> invalidating a slot, it sets its PID as the slot’s active_pid. During
>>> this short window, if a user queries pg_replication_slots, the
>>> underlying function pg_get_replication_slots will compute the
>>> wal_status as 'unreserved' for the invalidated slot because the slot
>>> has a valid active_pid.
>>>
>>> That said, it's reasonable to mention in the doc that 'unreserved' may
>>> appear when idle_replication_slot_timeout is greater than zero, as
>>> this can indeed happen. So, let's retain the current description.
>>>
>>> However, this behavior isn’t specific to
>>> idle_replication_slot_timeout. For example, when a slot is being
>>> invalidated due to a different cause "wal_level_insufficient",
>>> 'unreserved' may also briefly appear in wal_status.
>>
>> Yes, and "lost" can appear for various reasons, including wal_level_insufficient,
>> so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost"
>> status in the note. It would probably be better to remove the mention of "lost"
>> from that note.
>>
>
> +1

Is this true starting from v16, when logical replication from standby was introduced?
In other words, in v15 and earlier, only max_slot_wal_keep_size could cause
the wal_status to become "unreserved" or "lost"? I'm wondering where to back-patch
this fix to.

>> As for "unreserved", it can also occur for different reasons, but typically,
>> it happens when max_slot_wal_keep_size is set to a non-negative value.
>> So it might make sense to keep the explanation focused just on "unreserved"
>> and max_slot_wal_keep_size. For example:
>>
>> ----------------------
>>           <listitem>
>>            <para>
>>             <literal>unreserved</literal> means that the slot no longer
>>             retains the required WAL files and some of them are to be removed at
>> -           the next checkpoint. This state can return
>> +           the next checkpoint. This can occur when
>> +           <xref linkend="guc-max-slot-wal-keep-size"/> is set to
>> +           a non-negative value. This state can return
>>             to <literal>reserved</literal> or <literal>extended</literal>.
>>            </para>
>>           </listitem>
>>           <listitem>
>> ----------------------
>>
>> What do you think?
>>
>
> The change LGTM, only a minor suggestion to add "typically", as “This
> can typically occur when…” to indicate that max_slot_wal_keep_size is
> one possible reason, not the only one.

OK.

>> Also, I noticed the note that says “If <structfield>restart_lsn</structfield>
>> is NULL, this field is null” seems inaccurate. For example, when "wal_removed"
>> happens, restart_lsn is NULL but wal_status is "lost". So maybe we should remove
>> that note as well?
>
> You're right, the statement is not accurate.
> We could rephrase it as: "If <structfield>restart_lsn</structfield> is
> NULL, this field is either null or lost." But since 'unreserved' can
> also appear briefly during invalidation, it might be better to remove
> it altogether.

I agree with removing the description. Unless I'm missing something,
it has been incorrect since at least v13, so we should back-patch this
fix to all supported versions.

Regards,

--
Fujii Masao
NTT DATA Japan Corporation

On Mon, Jun 30, 2025 at 6:12 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2025/06/30 20:32, Nisha Moond wrote:
> > On Fri, Jun 27, 2025 at 5:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >> On 2025/06/27 15:32, Nisha Moond wrote:
> >>> On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>
> >>>> On 2025/06/26 15:46, Nisha Moond wrote:
> >>>>> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size
> >>>>>> as a condition under which the wal_status column can show unreserved or lost.
> >>>>>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also
> >>>>>> cause this behavior when it is set. This has not been documented yet.
> >>>>>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
> >>>>>>
> >>>>>
> >>>>> +1 to the doc update.
> >>>>
> >>>> Thanks for the review!
> >>>>
> >>>>>> So, how about updating the documentation to also mention
> >>>>>> idle_replication_slot_timeout as a factor that can cause wal_status to
> >>>>>> become unreserved or lost? Patch attached.
> >>>>>>
> >>>>>
> >>>>> Since idle_replication_slot_timeout can only cause wal_status to
> >>>>> become 'lost' and not 'unreserved', perhaps we can reword the sentence
> >>>>> slightly for clarity, suggestion -
> >>>>> "The last two states are seen when max_slot_wal_keep_size is
> >>>>> non-negative and, the 'lost' state may also appear when
> >>>>> idle_replication_slot_timeout is greater than zero."
> >>>>
> >>>> I was thinking that when idle_replication_slot_timeout triggers,
> >>>> the following functions are called, and that wal_status can become
> >>>> "unreserved" before ReplicationSlotRelease() runs. It's a very short
> >>>> period, though. Am I wrong?
> >>>>
> >>>>     ReplicationSlotMarkDirty();
> >>>>     ReplicationSlotSave();
> >>>>     ReplicationSlotRelease();
> >>>>
> >>>
> >>> Thank you for pointing it out.
> >>> You are correct that while the checkpointer is in the process of
> >>> invalidating a slot, it sets its PID as the slot’s active_pid. During
> >>> this short window, if a user queries pg_replication_slots, the
> >>> underlying function pg_get_replication_slots will compute the
> >>> wal_status as 'unreserved' for the invalidated slot because the slot
> >>> has a valid active_pid.
> >>>
> >>> That said, it's reasonable to mention in the doc that 'unreserved' may
> >>> appear when idle_replication_slot_timeout is greater than zero, as
> >>> this can indeed happen. So, let's retain the current description.
> >>>
> >>> However, this behavior isn’t specific to
> >>> idle_replication_slot_timeout. For example, when a slot is being
> >>> invalidated due to a different cause "wal_level_insufficient",
> >>> 'unreserved' may also briefly appear in wal_status.
> >>
> >> Yes, and "lost" can appear for various reasons, including wal_level_insufficient,
> >> so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost"
> >> status in the note. It would probably be better to remove the mention of "lost"
> >> from that note.
> >>
> >
> > +1
>
> Is this true starting from v16, when logical replication from standby was introduced?
> In other words, in v15 and earlier, only max_slot_wal_keep_size could cause
> the wal_status to become "unreserved" or "lost"? I'm wondering where to back-patch
> this fix to.
>

I also think we should back-patch this till v16, since that’s when
additional slot invalidation causes were also introduced (commit
be87200). And since then “max_slot_wal_keep_size” is no longer the
sole reason for “unreserved” or “lost” status.

--
Thanks,
Nisha

On 2025/07/02 16:12, Fujii Masao wrote:
>
> On 2025/07/01 13:52, Nisha Moond wrote:
>> On Mon, Jun 30, 2025 at 6:12 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>> Is this true starting from v16, when logical replication from standby was introduced?
>>> In other words, in v15 and earlier, only max_slot_wal_keep_size could cause
>>> the wal_status to become "unreserved" or "lost"? I'm wondering where to back-patch
>>> this fix to.
>>>
>>
>> I also think we should back-patch this till v16, since that’s when
>> additional slot invalidation causes were also introduced (commit
>> be87200). And since then “max_slot_wal_keep_size” is no longer the
>> sole reason for “unreserved” or “lost” status.
>
> Okay, I've prepared two patches:
>
> - 0001 removes the incorrect line: "If restart_lsn is NULL, this field is null."
>   This should be back-patched to v13.
> - 0002 updates the description of the wal_status to reflect that max_slot_wal_keep_size
>   is not the only cause of the lost state. This should be back-patched to v16.
>
> Barring objections, I will commit these patches.

I've pushed the patches. Thanks!

Regards,

--
Fujii Masao
NTT DATA Japan Corporation