Re: Checkpointer write combining - Mailing list pgsql-hackers

From Chao Li
Subject Re: Checkpointer write combining
Msg-id 2FA0BAC7-5413-4ABD-94CA-4398FE77750D@gmail.com
In response to Re: Checkpointer write combining  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers


On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:


[1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
<v5-0004-Write-combining-for-BAS_BULKWRITE.patch><v5-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v5-0005-Fix-XLogNeedsFlush-for-checkpointer.patch><v5-0002-Split-FlushBuffer-into-two-parts.patch><v5-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v5-0006-Add-database-Oid-to-CkptSortItem.patch><v5-0007-Implement-checkpointer-data-write-combining.patch>

1 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * The buffer must pinned and content locked and the buffer header spinlock
```

“must pinned” -> “must be pinned”

2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ if (XLogNeedsFlush(lsn))
+ {
+ /*
+ * Remove the dirty buffer from the ring; necessary to prevent an
+ * infinite loop if all ring members are dirty.
+ */
+ strategy->buffers[strategy->current] = InvalidBuffer;
+ return true;
+ }
 
- return true;
+ return false;
 }
```

We could do:
```
	if (!XLogNeedsFlush(lsn))
		return false;

	/* Remove the dirty buffer from the ring ... */
	strategy->buffers[strategy->current] = InvalidBuffer;
	return true;
}
```

This way results in a smaller diff.

3 - 0002
```
+ * Prepare to write and write a dirty victim buffer.
```

Should read: “Prepare to write a dirty victim buffer.”

4 - 0002
```
- /* OK, do the I/O */
- FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
- LWLockRelease(content_lock);
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-  &buf_hdr->tag);
+ CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
```
I see that CleanVictimBuffer() takes the content_lock from the bufdesc and releases it, but that makes the code harder to follow: readers may wonder why content_lock is not released after CleanVictimBuffer() unless they go read CleanVictimBuffer() itself.

I'd suggest passing content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.
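
For example, something like this (the extra parameter and its position are only illustrative):
```
/* caller hands over the content lock, making the ownership transfer explicit */
CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context, content_lock);
```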

5 - 0002
```
  * disastrous system-wide consequences.  To make sure that can't happen,
  * skip the flush if the buffer isn't permanent.
  */
- if (buf_state & BM_PERMANENT)
- XLogFlush(recptr);
+ if (!XLogRecPtrIsInvalid(buffer_lsn))
+ XLogFlush(buffer_lsn);
```

Why is this check changed? And should the comment be updated accordingly? It still says “if the buffer isn’t permanent”, which reflects the old code.

6 - 0003
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * patterns than lazily flushing buffers directly before reusing them.
+ */
```

Here “directly” is kind of ambiguous: it could mean “immediately before” or “without going through something else”. My understanding is “immediately”; if that is correct, please change “directly” to “immediately” or just remove it.

7 - 0003
```
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return;
```

I doubt this “strategy” NULL check is needed, because the function is only called when strategy_supports_eager_flush() returns true, and strategy_supports_eager_flush() already asserts “strategy”.

Also, its counterpart StrategySweepNextBuffer() neither NULL-checks nor asserts strategy.

8 - 0003
```
bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
```

This function is only used in bufmgr.c; can we move it there and make it static?

9 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ batch->start = batch->bufdescs[0]->tag.blockNum;
+ batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+ batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+ batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+ Assert(BlockNumberIsValid(batch->start));
```

Why not assert immediately after batch->start is assigned? That way smgropen() would not be called if the assertion fails.
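
I.e., just reorder the quoted lines:
```
batch->start = batch->bufdescs[0]->tag.blockNum;
Assert(BlockNumberIsValid(batch->start));

batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
```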

10 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ limit = Min(max_batch_size, limit);
```

Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT - 1)? BufWriteBatch.bufdescs is defined with length MAX_IO_COMBINE_LIMIT, and the first slot is already used to store “start”.

11 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ for (batch->n = 1; batch->n < limit; batch->n++)
+ {
+ Buffer bufnum;
+
+ if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+ break;
```

Is the next buffer returned by the sweep guaranteed to be right next to start? If yes, can we assert that? My guess is no; and if so, is it possible for bufnum to come back around to start? If that can happen, we should check that the next buffer is not equal to start.
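
If that wrap-around case is possible, a guard along these lines (just a sketch) could work:
```
if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
	break;

/* stop if the sweep wraps back to the buffer the batch started with */
if (bufnum == BufferDescriptorGetBuffer(batch->bufdescs[0]))
	break;
```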

12 - 0004
```
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
  if (from_ring && strategy_supports_eager_flush(strategy))
  {
+ uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
```

I think max_batch_size could be an attribute of the strategy, set when the strategy is created, so that we don't need to recalculate it on every cleaning round.
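
Roughly like this (assuming the current BufferAccessStrategyData layout in freelist.c; the new field name is made up here):
```
typedef struct BufferAccessStrategyData
{
	BufferAccessStrategyType btype;
	int			nbuffers;

	/* hypothetical: computed once when the strategy is created */
	uint32		max_write_batch_size;

	int			current;
	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
```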

13 - 0004
```
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+ WritebackContext *wb_context)
+{
+ ErrorContextCallback errcallback =
+ {
+ .callback = shared_buffer_write_error_callback,
+ .previous = error_context_stack,
+ };
+
+ error_context_stack = &errcallback;
+ pgBufferUsage.shared_blks_written += batch->n;
```

Should we increase shared_blks_written only after the write-back loop is done?

14 - 0004
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
```

I think this function can be moved to bufmgr.c and made static.

15 - 0004
```
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
  ((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple block's checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
```

We should mark blknos as const to indicate that it is read-only: const BlockNumber *blknos. That also prevents accidental modification of blknos within the function.
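
I.e. (the body here is only a guess to make the example complete; presumably it loops over PageSetChecksumInplace()):
```
void
PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
{
	for (uint32 i = 0; i < length; i++)
		PageSetChecksumInplace(pages[i], blknos[i]);
}
```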

16 - 0005
```
  * instead. So "needs flush" is taken to mean whether minRecoveryPoint
  * would need to be updated.
  */
- if (RecoveryInProgress())
+ if (RecoveryInProgress() && !XLogInsertAllowed())
```

Since a new check is added, the comment should be updated accordingly.

17 - 0006
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
  Oid tsId;
+ Oid db_id;
```

I think “db_id” should be named “dbId” or “dbOid”. Let’s keep the naming convention consistent.
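
E.g. (with the other fields as they currently appear in buf_internals.h):
```
typedef struct CkptSortItem
{
	Oid			tsId;
	Oid			dbOid;			/* instead of db_id */
	RelFileNumber relNumber;
	ForkNumber	forkNum;
	BlockNumber blockNum;
	int			buf_id;
} CkptSortItem;
```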

18 - 0007
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ max_batch_size = checkpointer_max_batch_size();
```

Looks like we don’t need to calculate max_batch_size inside the for loop.

19 - 0007
```
+ * Each batch will have exactly one start and one max lsn and one
+ * length.
  */
```

I don’t get what this comment is trying to explain; it seems unnecessary.

Best regards,
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/



