Re: Checkpointer write combining - Mailing list pgsql-hackers

From Chao Li
Subject Re: Checkpointer write combining
Date
Msg-id 0D990493-9DF4-4136-B1A1-CA0112E809ED@gmail.com
Whole thread Raw
In response to Re: Checkpointer write combining  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Checkpointer write combining
List pgsql-hackers

> On Jan 14, 2026, at 06:24, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Attached v12 fixes a variety of buglets I found throughout the patch set.
>
> I've also done quite a bit of refactoring. The scope of the
> refactoring varies from inlining some helpers to introducing new input
> argument structs.
>
> 0001 is independently valuable as it optimizes StrategyRejectBuffer()
> a bit and makes GetVictimBuffer() cleaner
>
> 0002-0007 were largely present in older versions of the patch set
>
> 0008 is new -- it is an early version of batching for normal backends
> flushing a buffer to obtain a clean one. Right now, it checks if the
> two blocks succeeding the target block are in shared buffers and
> dirty, and, if so, writes them out with the target buffer. I haven't
> started testing or benchmarking it because I need to convert bgwriter
> to use write combining to be able to benchmark it effectively. But I
> thought I would get the code out there sooner rather than later.
>
> It's a lot harder with my current code structure to add the target
> block's predecessor if it is dirty and read to write out. I wonder how
> important this is vs just the two succeeding blocks.
>
> - Melanie
>
<v12-0001-Streamline-buffer-rejection-for-bulkreads-of-unl.patch><v12-0002-Split-FlushBuffer-into-two-parts.patch><v12-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v12-0004-Write-combining-for-BAS_BULKWRITE.patch><v12-0005-Add-database-Oid-to-CkptSortItem.patch><v12-0006-Implement-checkpointer-data-write-combining.patch><v12-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch><v12-0008-Eagerly-flush-buffer-successors.patch>

Hi Melanie,

I went through the patch set again today, and paid special attention to 0001 and 0008 that I seemed to not review
before.Here are comments I got today: 

1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
 #ifndef BUFMGR_INTERNALS_H
 #define BUFMGR_INTERNALS_H

+#include "access/xlogdefs.h"
```

I tried to build without adding this include, build still passed. I think that’s because there is a include path:
storage/bufmgr.h-> storage/bufpage.h -> access/xlogger.h. 

So, maybe we can remove this include.

2 - 0001
```
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+    XLogRecPtr    lsn;
+
+    if (!strategy)
+        return false;
```

As the new comment stated, the buffer must be pinned, maybe we can add an assert to ensure that:
```
Assert(BufferIsPinned(buffer));
```

Similarly, maybe we can also assert the locks are not held:
```
Assert(BufferDescriptorGetContentLock(buffer));
Assert(!pg_atomic_read_u32(&buf->state) & BM_LOCKED);
```

3 - 0001 - bufmgr.c - BufferNeedsWALFlush()
```
+    buffer = BufferDescriptorGetBuffer(bufdesc);
+    page = BufferGetPage(buffer);
+
+    Assert(BufferIsValid(buffer));
```

I think the Assert should be moved to before "page = BufferGetPage(buffer);”.

4 - 0001 - bufmgr.c
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * *lsn is set to the current page LSN.
+ *
+ * If the result is required to be correct, the caller must hold a buffer
+ * content lock. If they only hold a shared content lock, we'll need to
+ * acquire the buffer header spinlock, so they must not already hold it.
+ *
+ * If the buffer is unlogged, *lsn shouldn't be used by the caller and is set
+ * to InvalidXLogRecPtr.
+ */
+bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive, XLogRecPtr *lsn)
```

I think the “exclusive" parameter is a bit subtle. The parameter is not explicitly explained in the header comment,
thoughthere is a paragraph that explains different behaviors when the caller hold and not hold content lock. Maybe we
canrename to a more direct name: hold_exclusive_content_lock, or a shorter one hold_content_lock. 

5 - 0002 - commit message
```
actual buffer flushing step. This separation procides symmetry with
```

Typo: procides -> provides

6 - 0002
```
+     * We must hold the buffer header lock when examining the page LSN since
+     * don't have buffer exclusively locked in all cases.
```

Looks like it’s better to add “we” between “since” and “don’t” => since we don’t have ...

7 - 0002
```
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
     uint32        buf_state;

     /*
@@ -4425,42 +4445,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
      * someone else flushed the buffer before we could, so we need not do
      * anything.
      */
-    if (!StartBufferIO(buf, false, false))
-        return;
-
-    /* Setup error traceback support for ereport() */
-    errcallback.callback = shared_buffer_write_error_callback;
-    errcallback.arg = buf;
-    errcallback.previous = error_context_stack;
-    error_context_stack = &errcallback;
-
-    /* Find smgr relation for buffer */
-    if (reln == NULL)
-        reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
-
-    TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-                                        buf->tag.blockNum,
-                                        reln->smgr_rlocator.locator.spcOid,
-                                        reln->smgr_rlocator.locator.dbOid,
-                                        reln->smgr_rlocator.locator.relNumber);
-
-    buf_state = LockBufHdr(buf);
-
-    /*
-     * Run PageGetLSN while holding header lock, since we don't have the
-     * buffer locked exclusively in all cases.
-     */
-    recptr = BufferGetLSN(buf);
+    if (!StartBufferIO(bufdesc, false, false))
+        return false;

-    /* To check if block content changes while flushing. - vadim 01/17/97 */
-    UnlockBufHdrExt(buf, buf_state,
-                    0, BM_JUST_DIRTIED,
-                    0);
+    *lsn = InvalidXLogRecPtr;
+    buf_state = LockBufHdr(bufdesc);
```

The header comment says “lsn returns the buffer's LSN if the table is logged”, which looks inaccurate, because if
StartBufferIO()is true, the function returns early without setting *lsn. 

8 - 0008 - comment message
```
When flushing a dirty buffer, check if it the two blocks following it
```

Nit: “if it the” -> “if the"

9 - 0008
```
+ * Check if the blocks after my block are in shared buffers and dirty and if
+ * it is, write them out too
```

Nit: “if it is” -> “if they are”

10 - 0008
```
+    BlockNumber max_batch_size = 3; /* we only look for two successors */
```

Using type BlockNumber for batch size seems odd. I would suggest change to uint32.

11 - 0008 - WriteBatchInit()
```
+    LockBufHdr(batch_start);
+    batch->max_lsn = BufferGetLSN(batch_start);
+    UnlockBufHdr(batch_start);
```

Should we check unlogged buffer before assigning max_lsn? In previous commits, we have done that in many places.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Remove redundant assignment in CreateWorkExprContext
Next
From: Bertrand Drouvot
Date:
Subject: Re: Safer hash table initialization macro