Re: [RFC] Lock-free XLog Reservation from WAL - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: [RFC] Lock-free XLog Reservation from WAL
Msg-id 7b31f916-2b7d-49c7-b70a-b0342ba6b423@postgrespro.ru
In response to Re: [RFC] Lock-free XLog Reservation from WAL  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List pgsql-hackers
On 10.01.2025 19:53, Matthias van de Meent wrote:
> On Fri, 10 Jan 2025 at 13:42, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>>
>> BTW, your version could use a similar trick to guarantee atomicity:
>> - change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
>> and store the offset to the previous record's start.
> 
> -1, I don't think that is possible without degrading what our current
> WAL system protects against.
> 
> For intra-record torn write protection we have the checksum, but that
> same protection doesn't cover the multiple WAL records on each page.
> That is what the xl_prev pointer is used for - detecting that this
> part of the page doesn't contain the correct data (e.g. the data of a
> previous version of this recycled segment).
> If we replaced xl_prev with just an offset into the segment, then this
> protection would be much less effective, as the previous version of
> the segment realistically used the same segment offsets at the same
> offsets into the file.

Well, to protect against torn writes it is enough to have a "self-lsn"
field rather than "prev-lsn". So an 8-byte "self-lsn" plus a 4-byte
"offset-to-prev" would work.

But this way the header would grow by 4 bytes compared to the current
one, not shrink.
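
To make the size argument concrete, here is a rough sketch. CurrentHdr is a simplified mirror of today's XLogRecord using stand-in integer types; SelfLsnHdr, xl_self_lsn and xl_prev_offset are hypothetical names used only for illustration, not anything from a patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified mirror of the current XLogRecord header (stand-in types). */
typedef struct CurrentHdr
{
    uint32_t    xl_tot_len;     /* total length of the record */
    uint32_t    xl_xid;         /* transaction id */
    uint64_t    xl_prev;        /* LSN of the previous record */
    uint8_t     xl_info;        /* flag bits */
    uint8_t     xl_rmid;        /* resource manager id */
    /* 2 bytes of padding here */
    uint32_t    xl_crc;         /* CRC of the record */
} CurrentHdr;

/* Hypothetical layout: 8-byte self-LSN plus 4-byte offset to prev. */
typedef struct SelfLsnHdr
{
    uint32_t    xl_tot_len;
    uint32_t    xl_xid;
    uint64_t    xl_self_lsn;    /* hypothetical: LSN of this record */
    uint32_t    xl_prev_offset; /* hypothetical: distance back to prev */
    uint8_t     xl_info;
    uint8_t     xl_rmid;
    /* 2 bytes of padding here */
    uint32_t    xl_crc;
} SelfLsnHdr;

/* Header size measured up to and including the CRC field. */
#define SIZE_OF_CURRENT_HDR  (offsetof(CurrentHdr, xl_crc) + sizeof(uint32_t))
#define SIZE_OF_SELF_LSN_HDR (offsetof(SelfLsnHdr, xl_crc) + sizeof(uint32_t))

static_assert(SIZE_OF_CURRENT_HDR == 24, "current header is 24 bytes");
static_assert(SIZE_OF_SELF_LSN_HDR == 28, "self-lsn variant is 28 bytes");
```

The field order in SelfLsnHdr is only one possible arrangement; any order that keeps both new fields ends up 4 bytes larger than the current header.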

Just a thought:
If XLogRecord alignment were stricter (for example, 32 bytes), then an LSN
could count 32-byte units instead of bytes. The low 32 bits of such an LSN
would then cover 128GB of WAL. For most installations the re-use distance
for WAL segments is unlikely to exceed 128GB. But I believe there are some
with a larger one, so it is not reliable.
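
The 128GB figure is just 2^32 units times 32 bytes per unit. A small sketch of the arithmetic, and of why a larger re-use distance breaks the scheme (the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of WAL that the low 32 bits of an LSN can distinguish when the
 * LSN counts units of `align` bytes (hypothetical helper).
 */
static uint64_t
low32_coverage_bytes(uint32_t align)
{
    return (UINT64_C(1) << 32) * align;
}

/* Low 32 bits of a byte LSN re-expressed in `align`-byte units. */
static uint32_t
low32_of_unit_lsn(uint64_t byte_lsn, uint32_t align)
{
    return (uint32_t) (byte_lsn / align);
}

/* Self-check: two positions one coverage window apart are indistinguishable. */
static void
check_low32_wraparound(void)
{
    uint64_t    window = low32_coverage_bytes(32);

    assert(window == (UINT64_C(128) << 30));    /* 128 GiB */
    assert(low32_of_unit_lsn(4096, 32) ==
           low32_of_unit_lsn(4096 + window, 32));
}
```

So stale data from a segment recycled exactly one window later would carry the same low 32 bits as the new record, which is the unreliable case.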

> To protect against torn writes while still only using record segment
> offsets, you'd have to zero and then fsync any segment before reusing it,
> which would severely reduce the benefits we get from recycling
> segments.
> Note that we can't expect the page header to help here, as write tears
> can happen at nearly any offset into the page - not just 8k intervals
> - and so the page header is not always representative of the origins
> of all bytes on the page - only the first 24 (if even that).

-----

regards,
Yura



