Re: XLog changes for 9.3 - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: XLog changes for 9.3
Date	June 7, 2012 12:35:34
Msg-id	4FD0CA2F.50601@enterprisedb.com
In response to	Re: XLog changes for 9.3 (Andres Freund <andres@2ndquadrant.com>)
Responses	Re: XLog changes for 9.3
List	pgsql-hackers

On 07.06.2012 17:18, Andres Freund wrote:
> On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
>> 3. Move the only field, xl_rem_len, from the continuation record header
>> straight to the xlog page header, eliminating XLogContRecord altogether.
>> This makes it easier to calculate in advance how much space a WAL record
>> requires, as it no longer depends on how many pages it has to be split
>> across. This wastes 4-8 bytes on every xlog page, but that's not much.
> +1. I don't think this will waste a measurable amount in real-world
> scenarios. A very big percentage of pages have continuation records.

Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on 
64-bit architectures) even when there is a continuation record, because 
of alignment:

typedef struct XLogPageHeaderData
{
    uint16      xlp_magic;     /* magic value for correctness checks */
    uint16      xlp_info;      /* flag bits, see below */
    TimeLineID  xlp_tli;       /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;  /* XLOG address of this page */

+   uint32      xlp_rem_len;   /* bytes remaining of continued record */
} XLogPageHeaderData;

The page header is currently 16 bytes in length, so adding a 4-byte 
field to it bumps the aligned size to 24 bytes. Nevertheless, I think we 
can well live with that.
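
To make the 16 -> 24 byte arithmetic concrete, here is a minimal standalone sketch. It assumes a maximum alignment of 8 bytes (typical for 64-bit platforms) and defines its own MAXALIGN macro and stand-in typedefs; OldPageHeader, NewPageHeader and the two helper functions are hypothetical names used only for illustration, not the actual PostgreSQL definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8-byte maximum alignment, as on typical 64-bit platforms. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(len) \
	(((uintptr_t) (len) + (MAXIMUM_ALIGNOF - 1)) & \
	 ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/* Stand-in typedefs for this sketch only. */
typedef uint32_t TimeLineID;
typedef struct
{
	uint32_t	xlogid;			/* high 32 bits */
	uint32_t	xrecoff;		/* low 32 bits */
} XLogRecPtr;

/* Page header without the new field: 2 + 2 + 4 + 8 = 16 bytes. */
typedef struct OldPageHeader
{
	uint16_t	xlp_magic;
	uint16_t	xlp_info;
	TimeLineID	xlp_tli;
	XLogRecPtr	xlp_pageaddr;
} OldPageHeader;

/* Page header with xlp_rem_len added: 20 bytes before alignment. */
typedef struct NewPageHeader
{
	uint16_t	xlp_magic;
	uint16_t	xlp_info;
	TimeLineID	xlp_tli;
	XLogRecPtr	xlp_pageaddr;
	uint32_t	xlp_rem_len;
} NewPageHeader;

/* Space each header occupies once rounded up to the maximum alignment. */
static size_t
old_header_size(void)
{
	return (size_t) MAXALIGN(sizeof(OldPageHeader));
}

static size_t
new_header_size(void)
{
	return (size_t) MAXALIGN(sizeof(NewPageHeader));
}
```

Under these assumptions the raw struct grows from 16 to 20 bytes, and rounding up to the 8-byte boundary is what accounts for the extra 4 wasted bytes.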

>> 4. Allow WAL record header to be split across page boundaries.
>> Currently, if there are less than SizeOfXLogRecord bytes left on the
>> current WAL page, it is wasted, and the next record is inserted at the
>> beginning of the next page. The problem with that is again that it makes
>> it impossible to know in advance exactly how much space a WAL record
>> requires, because it depends on how many bytes need to be wasted at the
>> end of current page.
> +0.5. It's somewhat convenient to be able to look at a record before you have
> reassembled it over multiple pages. But it's probably not worth the
> implementation complexity.

Looking at the code, I think it'll be about the same complexity for 
XLogInsert in its current form (it will help the patch I'm working on), 
and it makes ReadRecord() a bit more complicated. But not much.

> If we do that we can remove all the alignment padding as well. Which would be a
> problem for you anyway, wouldn't it?

It's not a problem. You just MAXALIGN the size of the record when you 
calculate how much space it needs, and then all records become naturally 
MAXALIGNed. We could quite easily remove the alignment on-disk if we 
wanted to, since ReadRecord() already always copies the record to an 
aligned buffer, but I wasn't planning to do that.
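
As a rough illustration of the "MAXALIGN the size, then just advance the byte position" idea, here is a single-threaded sketch. reserve_record_space() is a hypothetical helper, the MAXALIGN64 macro is defined locally, and an 8-byte maximum alignment is assumed; the locking or atomic operations a real insertion path needs are left out entirely.

```c
#include <stdint.h>

/* Assumption: 8-byte maximum alignment, as on typical 64-bit platforms. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN64(len) \
	(((uint64_t) (len) + (MAXIMUM_ALIGNOF - 1)) & \
	 ~((uint64_t) (MAXIMUM_ALIGNOF - 1)))

/*
 * Hypothetical helper: reserve space for a record of record_len bytes.
 *
 * *bytepos is a contiguous byte position that excludes all page headers.
 * The start of the reserved region is returned and *bytepos is advanced
 * by the aligned record size, so the next record also starts aligned.
 */
static uint64_t
reserve_record_space(uint64_t *bytepos, uint32_t record_len)
{
	uint64_t	start = *bytepos;

	*bytepos += MAXALIGN64(record_len);
	return start;
}
```

Because every reservation is a multiple of the alignment, a position that starts aligned stays aligned without any per-page bookkeeping.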

>> These changes will help the XLogInsert scaling patch, by making the
>> space calculations simpler. In essence, to reserve space for a WAL
>> record of size X, you just need to do "bytepos += X".  There's a lot
>> more details with that, like mapping from the contiguous byte position
>> to an XLogRecPtr that takes page headers into account, and noticing
>> RedoRecPtr changes safely, but it's a start.
> Hm. Wouldn't you need to remove short/long page headers for that as well?

No, those are ok because they're predictable. Although it would make the 
mapping simpler. To convert from a contiguous xlog byte position that 
excludes all headers, to XLogRecPtr, you need to do something like this 
(I just made this up, probably has bugs, but it's about this complex):

#define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

XLogRecPtr recptr;
uint64 xlogrecptr;
uint64 full_segments = bytepos / UsableBytesInSegment;
int offset_in_segment = bytepos % UsableBytesInSegment;

xlogrecptr = full_segments * XLOG_SEG_SIZE;
/* is it on the first page? */
if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
else
{
    /* first page is fully used */
    xlogrecptr += XLOG_BLCKSZ;
    /* add other full pages */
    offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
    xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
    /* and finally offset within the last page, after its short header */
    xlogrecptr += SizeOfXLogShortPHD + offset_in_segment % UsableBytesInPage;
}
/* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
recptr.xlogid = xlogrecptr >> 32;
recptr.xrecoff = xlogrecptr & 0xffffffff;
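
For anyone who wants to try the mapping with concrete numbers, here is a self-contained sketch of the same calculation as a function. The constants are assumptions chosen for illustration (8 kB pages, 16 MB segments, a 24-byte short page header and a 40-byte long page header), and bytepos_to_recptr() is a hypothetical name. It returns the flat 64-bit position; splitting that into xlogid/xrecoff is just the shift and mask shown above. Note that on pages after the first, the short page header has to be skipped as well.

```c
#include <stdint.h>

/* Assumed values for illustration; real ones come from the build. */
#define XLOG_BLCKSZ			8192u
#define XLOG_SEG_SIZE		(16u * 1024u * 1024u)
#define SizeOfXLogShortPHD	24u
#define SizeOfXLogLongPHD	40u

#define UsableBytesInPage	(XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
	((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
	 (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

/*
 * Hypothetical helper: map a contiguous byte position, which excludes all
 * page headers, to a flat 64-bit WAL position that includes them.
 */
static uint64_t
bytepos_to_recptr(uint64_t bytepos)
{
	const uint32_t first_page_usable = XLOG_BLCKSZ - SizeOfXLogLongPHD;
	uint64_t	full_segments = bytepos / UsableBytesInSegment;
	uint32_t	offset = (uint32_t) (bytepos % UsableBytesInSegment);
	uint64_t	recptr = full_segments * XLOG_SEG_SIZE;

	if (offset < first_page_usable)
	{
		/* on the first page of the segment, after the long header */
		recptr += SizeOfXLogLongPHD + offset;
	}
	else
	{
		/* first page is fully used */
		recptr += XLOG_BLCKSZ;
		offset -= first_page_usable;
		/* add the other full pages */
		recptr += (uint64_t) (offset / UsableBytesInPage) * XLOG_BLCKSZ;
		/* skip the short header of the final page, then add the rest */
		recptr += SizeOfXLogShortPHD + offset % UsableBytesInPage;
	}
	return recptr;
}
```

With these assumed sizes, byte position 0 lands right after the long header of the first page, and the last usable byte of a segment maps to the last byte of that segment.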

Encapsulated in a function, that's not too bad. But if we want to make 
that simpler, one idea would be to allocate the whole 1st page in each 
WAL segment for metadata. That way all the actual xlog pages would hold 
the same amount of xlog data.
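
To show how much simpler the mapping could get with that idea, here is a sketch under the assumption that the first page of every segment holds only metadata and every remaining page carries just a short page header. The constants (8 kB pages, 16 MB segments, 24-byte short header) and the name bytepos_to_recptr_metapage() are hypothetical and only for illustration.

```c
#include <stdint.h>

/* Assumed values for illustration; real ones come from the build. */
#define XLOG_BLCKSZ			8192u
#define XLOG_SEG_SIZE		(16u * 1024u * 1024u)
#define SizeOfXLogShortPHD	24u

#define UsableBytesInPage	(XLOG_BLCKSZ - SizeOfXLogShortPHD)
/* Assumption: the first page of each segment is reserved for metadata. */
#define DataPagesPerSegment	(XLOG_SEG_SIZE / XLOG_BLCKSZ - 1)

/*
 * Hypothetical helper: map a contiguous byte position, which excludes all
 * page headers and metadata pages, to a flat 64-bit WAL position.
 */
static uint64_t
bytepos_to_recptr_metapage(uint64_t bytepos)
{
	/* index of the data page, counted across all segments */
	uint64_t	page_no = bytepos / UsableBytesInPage;
	uint32_t	page_off = (uint32_t) (bytepos % UsableBytesInPage);
	uint64_t	segment = page_no / DataPagesPerSegment;
	/* +1 skips the metadata page at the start of the segment */
	uint64_t	page_in_seg = page_no % DataPagesPerSegment + 1;

	return segment * XLOG_SEG_SIZE
		+ page_in_seg * XLOG_BLCKSZ
		+ SizeOfXLogShortPHD
		+ page_off;
}
```

Since every data page then holds the same number of usable bytes, the first-page special case disappears and the mapping reduces to two divisions.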

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
