Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Msg-id: 2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl@mdyyqpqzxjek
In response to: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
Hi,

On 2025-10-08 18:54:25 -0400, Melanie Plageman wrote:
> +uint8
> +visibilitymap_set_vmbits(BlockNumber heapBlk,
> +                         Buffer vmBuf, uint8 flags,
> +                         const char *heapRelname)
> +{
> +    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
> +    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
> +    uint8       mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
> +    Page        page;
> +    uint8      *map;
> +    uint8       status;
> +
> +#ifdef TRACE_VISIBILITYMAP
> +    elog(DEBUG1, "vm_set flags 0x%02X for %s %d",
> +         flags, heapRelname, heapBlk);
> +#endif

I like that it doesn't take a Relation anymore, but I'd just pass the
SMgrRelation instead; then you don't need to allocate the string in the
caller, when it's approximately never used.
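Roughly what I have in mind - just a sketch, untested, and the DEBUG1 message
contents are a strawman:

uint8
visibilitymap_set_vmbits(BlockNumber heapBlk,
                         Buffer vmBuf, uint8 flags,
                         SMgrRelation reln)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mapOffset = HEAPBLK_TO_OFFSET(heapBlk);

#ifdef TRACE_VISIBILITYMAP
    /* only need something name-like if tracing is compiled in */
    elog(DEBUG1, "vm_set flags 0x%02X for rel %u block %u",
         flags, reln->smgr_rlocator.locator.relNumber, heapBlk);
#endif
    ...
}

That way callers don't need the psprintf() dance at all.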
Otherwise this looks pretty close to me.


> @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>      }
>
>      /*
> -     * If we have a full-page image, restore it and we're done.
> +     * If we have a full-page image of the heap block, restore it and we're
> +     * done with the heap block.
>       */
> -    action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> -                                           (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> -                                           &buffer);
> -    if (action == BLK_NEEDS_REDO)
> +    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> +                                      (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> +                                      &buffer) == BLK_NEEDS_REDO)
>      {
>          Page        page = BufferGetPage(buffer);
>          OffsetNumber *redirected;

Why move it around this way?


> @@ -138,36 +157,104 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          /* There should be no more data */
>          Assert((char *) frz_offsets == dataptr + datalen);
>
> +        if ((vmflags & VISIBILITYMAP_VALID_BITS))
> +            PageSetAllVisible(page);
> +
> +        MarkBufferDirty(buffer);
> +
> +        /*
> +         * Always emit a WAL record when setting PD_ALL_VISIBLE but only emit
> +         * an FPI if checksums/wal_log_hints are enabled.

This comment reads as if we're WAL logging here, but this is a
Wendy's^Wrecovery.

> Advance the page LSN
> +         * only if the record could include an FPI, since recovery skips
> +         * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> +         * needed to repair a torn page.
> +         */

This is confusing; it should probably just reference the stuff we did in the
!recovery case.

> +        if (do_prune || nplans > 0 ||
> +            ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> +            PageSetLSN(page, lsn);
> +
>          /*
>           * Note: we don't worry about updating the page's prunability hints.
>           * At worst this will cause an extra prune cycle to occur soon.
>           */

Not your fault, but that seems odd? Why aren't we just doing the right thing?


>      /*
> -     * If we released any space or line pointers, update the free space map.
> +     * If we released any space or line pointers or set PD_ALL_VISIBLE or the
> +     * VM, update the freespace map.

I'd replace the first or with a , ;)

> +     * Even when no actual space is freed (e.g., when only marking the page
> +     * all-visible or frozen), we still update the FSM. Because the FSM is
> +     * unlogged and maintained heuristically, it often becomes stale on
> +     * standbys. If such a standby is later promoted and runs VACUUM, it will
> +     * skip recalculating free space for pages that were marked all-visible
> +     * (or all-frozen, depending on the mode). FreeSpaceMapVacuum can then
> +     * propagate overly optimistic free space values upward, causing future
> +     * insertions to select pages that turn out to be unusable. In bulk, this
> +     * can lead to long stalls.
> +     *
> +     * To prevent this, always refresh the FSM's view when a page becomes
> +     * all-visible or all-frozen.

I'd s/refresh/update/, because refresh sounds more like rereading the current
state of the FSM, rather than changing the FSM.

> +            /* We don't have relation name during recovery, so use relfilenode */
> +            relname = psprintf("%u", rlocator.relNumber);
> +            old_vmbits = visibilitymap_set_vmbits(blkno, vmbuffer, vmflags, relname);
>
> -        XLogRecordPageWithFreeSpace(rlocator, blkno, freespace);
> +            /* Only set VM page LSN if we modified the page */
> +            if (old_vmbits != vmflags)
> +            {
> +                Assert(BufferIsDirty(vmbuffer));
> +                PageSetLSN(BufferGetPage(vmbuffer), lsn);
>              }
> -    else
> -        UnlockReleaseBuffer(buffer);
> +            pfree(relname);

Hm. When can we actually enter the old_vmbits == vmflags case? It might also
be fine to just say that we don't expect it to change but are mirroring the
code in visibilitymap_set().

I wonder if the VM specific redo portion should be in a common helper? Might
not be enough code to worry about, though...
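If we do add one, it could be pretty much just the code above factored out
(the name here is invented):

static void
heap_xlog_set_vm_bits(BlockNumber blkno, Buffer vmbuffer, uint8 vmflags,
                      XLogRecPtr lsn, RelFileLocator rlocator)
{
    char       *relname;
    uint8       old_vmbits;

    /* We don't have the relation name during recovery, so use relfilenumber */
    relname = psprintf("%u", rlocator.relNumber);
    old_vmbits = visibilitymap_set_vmbits(blkno, vmbuffer, vmflags, relname);

    /* Only set the VM page LSN if we modified the page */
    if (old_vmbits != vmflags)
    {
        Assert(BufferIsDirty(vmbuffer));
        PageSetLSN(BufferGetPage(vmbuffer), lsn);
    }
    pfree(relname);
}

But as I said, it may not be enough code to be worth it.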
> @@ -2070,8 +2079,24 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
>      xlhp_prune_items dead_items;
>      xlhp_prune_items unused_items;
>      OffsetNumber frz_offsets[MaxHeapTuplesPerPage];
> +    bool        do_prune = nredirected > 0 || ndead > 0 || nunused > 0;
> +    bool        do_set_vm = vmflags & VISIBILITYMAP_VALID_BITS;
>
>      xlrec.flags = 0;
> +    regbuf_flags = REGBUF_STANDARD;
> +
> +    Assert((vmflags & VISIBILITYMAP_VALID_BITS) == vmflags);
> +
> +    /*
> +     * We can avoid an FPI if the only modification we are making to the heap
> +     * page is to set PD_ALL_VISIBLE and checksums/wal_log_hints are disabled.

Maybe s/an FPI/an FPI for the heap page/?

> +     * Note that if we explicitly skip an FPI, we must not set the heap page
> +     * LSN later.
> +     */
> +    if (!do_prune &&
> +        nfrozen == 0 &&
> +        (!do_set_vm || !XLogHintBitIsNeeded()))
> +        regbuf_flags |= REGBUF_NO_IMAGE;

>      /*
>       * Prepare data for the buffer.  The arrays are not actually in the
> @@ -2079,7 +2104,11 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
>       * page image, the arrays can be omitted.
>       */
>      XLogBeginInsert();
> -    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
> +    XLogRegisterBuffer(0, buffer, regbuf_flags);
> +
> +    if (do_set_vm)
> +        XLogRegisterBuffer(1, vmbuffer, 0);

Seems a bit confusing that it's named regbuf_flags but isn't used in all the
XLogRegisterBuffer calls. Maybe make the name more specific
(regbuf_flags_heap?)...

>      }
>      recptr = XLogInsert(RM_HEAP2_ID, info);
>
> -    PageSetLSN(BufferGetPage(buffer), recptr);
> +    if (do_set_vm)
> +    {
> +        Assert(BufferIsDirty(vmbuffer));
> +        PageSetLSN(BufferGetPage(vmbuffer), recptr);
> +    }

> +    /*
> +     * We must bump the page LSN if pruning or freezing. If we are only
> +     * updating PD_ALL_VISIBLE, though, we can skip doing this unless
> +     * wal_log_hints/checksums are enabled. Torn pages are possible if we
> +     * update PD_ALL_VISIBLE without bumping the LSN, but this is deemed okay
> +     * for page hint updates.
> +     */

Arguably it's not a torn page if we only modified something as narrow as a
hint bit, and are redoing that change after recovery. But that's extremely
nitpicky.

I wonder if the comment explaining this should be put in one place and
referenced from all the different places.


> @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
>                               InvalidOffsetNumber);
>
> +    /*
> +     * Before marking dead items unused, check whether the page will become
> +     * all-visible once that change is applied.

So the function is named _would_ but here you say will :)

> This lets us reap the tuples
> +     * and mark the page all-visible within the same critical section,
> +     * enabling both changes to be emitted in a single WAL record. Since the
> +     * visibility checks may perform I/O and allocate memory, they must be
> +     * done outside the critical section.
> +     */
> +    if (heap_page_would_be_all_visible(vacrel, buffer,
> +                                       deadoffsets, num_offsets,
> +                                       &all_frozen, &visibility_cutoff_xid))
> +    {
> +        vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> +        if (all_frozen)
> +        {
> +            vmflags |= VISIBILITYMAP_ALL_FROZEN;
> +            Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> +        }
> +
> +        /* Take the lock on the vmbuffer before entering a critical section */
> +        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);

It sure would be nice if we had documented the lock order between the heap
page and the corresponding VM page anywhere. This is just doing what we did
before, so it's not this patch's fault, but I did get worried about it for a
moment.
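Maybe something as simple as this, e.g. at the top of visibilitymap.c - the
wording is just a strawman, matching what the code does today:

/*
 * Lock ordering: when both the heap page and the corresponding VM page need
 * to be locked, lock the heap page first, then the VM page. The VM buffer
 * should already be pinned before the heap page is locked, since
 * visibilitymap_pin() may do I/O.
 */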
> +/*
> + * Check whether the heap page in buf is all-visible except for the dead
> + * tuples referenced in the deadoffsets array.
> + *
> + * The visibility checks may perform IO and allocate memory so they must not
> + * be done in a critical section. This function is used by vacuum to determine
> + * if the page will be all-visible once it reaps known dead tuples. That way
> + * it can do both in the same critical section and emit a single WAL record.
> + *
> + * Returns true if the page is all-visible other than the provided
> + * deadoffsets and false otherwise.
> + *
> + * Output parameters:
> + *
> + * - *all_frozen: true if every tuple on the page is frozen
> + * - *visibility_cutoff_xid: newest xmin; valid only if page is all-visible
> + * Callers looking to verify that the page is already all-visible can call
> + * heap_page_is_all_visible().
> + *
> + * This logic is closely related to heap_prune_record_unchanged_lp_normal().
> + * If you modify this function, ensure consistency with that code. An
> + * assertion cross-checks that both remain in agreement. Do not introduce new
> + * side-effects.
> + */
> +static bool
> +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> +                               OffsetNumber *deadoffsets,
> +                               int ndeadoffsets,
> +                               bool *all_frozen,
> +                               TransactionId *visibility_cutoff_xid)
> +{
>      Page        page = BufferGetPage(buf);
>      BlockNumber blockno = BufferGetBlockNumber(buf);
>      OffsetNumber offnum,
>                  maxoff;
>      bool        all_visible = true;
> +    int         matched_dead_count = 0;
>
>      *visibility_cutoff_xid = InvalidTransactionId;
>      *all_frozen = true;
>
> +    Assert(ndeadoffsets == 0 || deadoffsets);
> +
> +#ifdef USE_ASSERT_CHECKING
> +    /* Confirm input deadoffsets[] is strictly sorted */
> +    if (ndeadoffsets > 1)
> +    {
> +        for (int i = 1; i < ndeadoffsets; i++)
> +            Assert(deadoffsets[i - 1] < deadoffsets[i]);
> +    }
> +#endif
> +
>      maxoff = PageGetMaxOffsetNumber(page);
>      for (offnum = FirstOffsetNumber;
>           offnum <= maxoff && all_visible;
> @@ -3649,9 +3712,15 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
>           */
>          if (ItemIdIsDead(itemid))
>          {
> -            all_visible = false;
> -            *all_frozen = false;
> -            break;
> +            if (!deadoffsets ||
> +                matched_dead_count >= ndeadoffsets ||
> +                deadoffsets[matched_dead_count] != offnum)
> +            {
> +                *all_frozen = all_visible = false;
> +                break;
> +            }
> +            matched_dead_count++;
> +            continue;
>          }
>
>          Assert(ItemIdIsNormal(itemid));

Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
the end?
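I.e. something like this after the loop (a sketch; the !all_visible guard is
needed because we break out early when the page isn't all-visible):

    /*
     * If the page turned out to be all-visible, every entry in deadoffsets
     * must have been matched against a dead line pointer on the page.
     */
    Assert(!all_visible || matched_dead_count == ndeadoffsets);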
> From 6b5fc27f0d80bab1df86a2e6fb51b64fd20c3cbb Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 15 Sep 2025 12:06:19 -0400
> Subject: [PATCH v17 03/15] Assorted trivial heap_page_prune_and_freeze
>  cleanup

Seems like a good idea, but I'm too lazy to go through this in detail.


> From c69a5219a9b792f3c9f6dc730b8810a88d088ae6 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 16 Sep 2025 14:22:10 -0400
> Subject: [PATCH v17 04/15] Add helper for freeze determination to
>  heap_page_prune_and_freeze
>
> After scanning through the line pointers on the heap page during
> vacuum's first phase, we use several statuses and information we
> collected to determine whether or not we will use the freeze plans we
> assembled.
>
> Do this in a helper for better readability.

> @@ -663,85 +775,11 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
>       * Decide if we want to go ahead with freezing according to the freeze
>       * plans we prepared, or not.
>       */
> -    do_freeze = false;
> -    ...
> +    do_freeze = heap_page_will_freeze(params->relation, buffer,
> +                                      did_tuple_hint_fpi,
> +                                      do_prune,
> +                                      do_hint_prune,
> +                                      &prstate);

Assuming this is just moving the code, I like this quite a bit.


> From d4a4be3eed25853fc1ea84ebc2cbe0226afd823a Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 15 Sep 2025 16:25:44 -0400
> Subject: [PATCH v17 05/15] Update PruneState.all_[visible|frozen] earlier in
>  pruning
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> In the prune/freeze path, we currently delay clearing all_visible and
> all_frozen when dead items are present. This allows opportunistic
> freezing if the page would otherwise be fully frozen, since those dead
> items are later removed in vacuum's third phase.
>
> However, if no freezing will be attempted, there's no need to delay.
> Clearing the flags promptly avoids extra bookkeeping in
> heap_prune_unchanged_lp_normal(). At present this has no runtime effect
> because all callers that consider setting the VM also attempt freezing,
> but future callers (e.g. on-access pruning) may want to set the VM
> without preparing freeze plans.

s/heap_prune_unchanged_lp_normal/heap_prune_record_unchanged_lp_normal/

I think this should make it clearer that this is about reducing overhead for
future use of this code in on-access-pruning.

> We also used to defer clearing all_visible and all_frozen until after
> computing the visibility cutoff XID. By determining the cutoff earlier,
> we can update these flags immediately after deciding whether to
> opportunistically freeze. This is necessary if we want to set the VM in
> the same WAL record that prunes and freezes tuples on the page.

I think this last sentence needs to be first. This is the only really
important thing in this patch, afaict.


> From 86193a71d2ff9649b5b1c1e6963bd610285ad369 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 3 Oct 2025 15:57:02 -0400
> Subject: [PATCH v17 06/15] Make heap_page_is_all_visible independent of
>  LVRelState
>
> Future commits will use this function inside of pruneheap.c where we do
> not have access to the LVRelState. We only need a few parameters from
> the LVRelState, so just pass those in explicitly.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
> Reviewed-by: Robert Haas <robertmhaas@gmail.com>
> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com

Makes sense. I don't think we need to wait for other stuff to be committed to
commit this.


> From dde0dfc578137f7c93f9a0e34af38dcdb841b080 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 8 Oct 2025 15:39:01 -0400
> Subject: [PATCH v17 07/15] Eliminate XLOG_HEAP2_VISIBLE from vacuum
>  prune/freeze
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

Seems very mildly odd that 0002 references phase III in the subject, but this
doesn't...

(I'm just very lightly skimming from this point on)

> During vacuum's first and third phases, we examine tuples' visibility
> to determine if we can set the page all-visible in the visibility map.
>
> Previously, this check compared tuple xmins against a single XID chosen at
> the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> enables future work to set the VM during on-access pruning, since ordinary
> queries have access to GlobalVisState but not OldestXmin.
>
> This also benefits vacuum directly: GlobalVisState may advance
> during a vacuum, allowing more pages to become considered all-visible.
> In the rare case that it moves backward, VACUUM falls back to OldestXmin
> to ensure we don't attempt to freeze a dead tuple that wasn't yet
> prunable according to the GlobalVisState.

It could, but it currently won't advance in vacuum, right?


> From e412f9298b0735d1091f4769ace4d2d1a7e62312 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 09:57:13 -0400
> Subject: [PATCH v17 12/15] Inline TransactionIdFollows/Precedes()
>
> Calling these from on-access pruning code had noticeable overhead in a
> profile. There does not seem to be a reason not to inline them.

Makes sense, just commit this ahead of the more complicated rest.
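FWIW, the bodies are small enough that moving them into the header seems
clearly fine, roughly (lifted from transam.c):

static inline bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
    /*
     * If either ID is a permanent XID then we can just do unsigned
     * comparison.  If both are normal, do a modulo-2^32 comparison.
     */
    int32       diff;

    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return (id1 < id2);

    diff = (int32) (id1 - id2);
    return (diff < 0);
}

(and correspondingly for TransactionIdFollows)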
> From 54fcba140e515eba0eb1f9d48e7d5875b92e7e39 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 14:34:30 -0400
> Subject: [PATCH v17 13/15] Allow on-access pruning to set pages all-visible

Sorry, will have to look at this another time...

Greetings,

Andres Freund