Re: XLog changes for 9.3 - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: XLog changes for 9.3 |
Date | |
Msg-id | 4FD0CA2F.50601@enterprisedb.com |
In response to | Re: XLog changes for 9.3 (Andres Freund <andres@2ndquadrant.com>) |
Responses | Re: XLog changes for 9.3 |
List | pgsql-hackers |
On 07.06.2012 17:18, Andres Freund wrote:
> On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
>> 3. Move the only field, xl_rem_len, from the continuation record header
>> straight to the xlog page header, eliminating XLogContRecord altogether.
>> This makes it easier to calculate in advance how much space a WAL record
>> requires, as it no longer depends on how many pages it has to be split
>> across. This wastes 4-8 bytes on every xlog page, but that's not much.
> +1. I don't think this will waste a measurable amount in real-world
> scenarios. A very big percentage of pages have continuation records.

Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on
64-bit architectures) even when there is a continuation record, because
of alignment:

typedef struct XLogPageHeaderData
{
    uint16      xlp_magic;      /* magic value for correctness checks */
    uint16      xlp_info;       /* flag bits, see below */
    TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
+   uint32      xlp_rem_len;    /* bytes remaining of continued record */
} XLogPageHeaderData;

The page header is currently 16 bytes in length, so adding a 4-byte
field to it bumps the aligned size to 24 bytes. Nevertheless, I think we
can well live with that.

>> 4. Allow WAL record header to be split across page boundaries.
>> Currently, if there are less than SizeOfXLogRecord bytes left on the
>> current WAL page, it is wasted, and the next record is inserted at the
>> beginning of the next page. The problem with that is again that it makes
>> it impossible to know in advance exactly how much space a WAL record
>> requires, because it depends on how many bytes need to be wasted at the
>> end of current page.
> +0.5. It's somewhat convenient to be able to look at a record before you have
> reassembled it over multiple pages. But it's probably not worth the
> implementation complexity.

Looking at the code, I think it'll be about the same complexity for
XLogInsert in its current form (it will help the patch I'm working on),
and makes ReadRecord() a bit more complicated. But not much.

> If we do that we can remove all the alignment padding as well. Which would be a
> problem for you anyway, wouldn't it?

It's not a problem. You just MAXALIGN the size of the record when you
calculate how much space it needs, and then all records become naturally
MAXALIGNed. We could quite easily remove the alignment on-disk if we
wanted to, since ReadRecord() already always copies the record to an
aligned buffer, but I wasn't planning to do that.

>> These changes will help the XLogInsert scaling patch, by making the
>> space calculations simpler. In essence, to reserve space for a WAL
>> record of size X, you just need to do "bytepos += X". There's a lot
>> more details with that, like mapping from the contiguous byte position
>> to an XLogRecPtr that takes page headers into account, and noticing
>> RedoRecPtr changes safely, but it's a start.
> Hm. Wouldn't you need to remove short/long page headers for that as well?

No, those are ok because they're predictable. Although it would make the
mapping simpler. To convert from a contiguous xlog byte position that
excludes all headers, to XLogRecPtr, you need to do something like this
(I just made this up, probably has bugs, but it's about this complex):

#define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

    XLogRecPtr recptr;
    uint64 xlogrecptr;
    uint64 full_segments = bytepos / UsableBytesInSegment;
    int offset_in_segment = bytepos % UsableBytesInSegment;

    xlogrecptr = full_segments * XLOG_SEG_SIZE;
    /* is it on the first page? */
    if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
        xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
    else
    {
        /* first page is fully used */
        xlogrecptr += XLOG_BLCKSZ;
        /* add other full pages */
        offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
        /* and finally the offset within the last page, past its short header */
        xlogrecptr += SizeOfXLogShortPHD + offset_in_segment % UsableBytesInPage;
    }
    /* finally convert the 64-bit xlogrecptr to an XLogRecPtr struct */
    recptr.xlogid = xlogrecptr >> 32;
    recptr.xrecoff = xlogrecptr & 0xffffffff;

Encapsulated in a function, that's not too bad. But if we want to make
that simpler, one idea would be to allocate the whole 1st page in each
WAL segment for metadata. That way all the actual xlog pages would hold
the same amount of xlog data.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
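
Editorial sketch, not part of the original message: the mapping described above, wrapped in a function so it can be compiled and checked on its own. The block size, segment size and both header sizes are assumed values chosen for illustration (the 24-byte short header matches the aligned size mentioned in the message; the 40-byte long header is a guess), and bytepos_to_walpos is a hypothetical name. The final offset skips the short header of the page it lands on, which is needed for the last usable byte of a segment to map to the last byte of that segment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed, illustrative values; a real build takes these from PostgreSQL. */
#define XLOG_BLCKSZ         8192u
#define XLOG_SEG_SIZE       (16u * 1024u * 1024u)
#define SizeOfXLogShortPHD  24u     /* assumed: page header with xlp_rem_len */
#define SizeOfXLogLongPHD   40u     /* assumed: header of a segment's first page */

#define UsableBytesInPage   (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
    ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
     (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

static_assert(SizeOfXLogLongPHD >= SizeOfXLogShortPHD,
              "long page header must not be smaller than the short one");

/*
 * Hypothetical helper: convert a contiguous byte position that excludes
 * all page headers into a 64-bit WAL address that includes them.
 */
static uint64_t
bytepos_to_walpos(uint64_t bytepos)
{
    uint64_t full_segments = bytepos / UsableBytesInSegment;
    uint64_t offset = bytepos % UsableBytesInSegment;
    uint64_t pos = full_segments * XLOG_SEG_SIZE;

    if (offset < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    {
        /* still on the first page: skip only the long header */
        pos += SizeOfXLogLongPHD + offset;
    }
    else
    {
        /* first page is fully used */
        pos += XLOG_BLCKSZ;
        offset -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        /* whole pages after the first one */
        pos += (offset / UsableBytesInPage) * XLOG_BLCKSZ;
        /* short header of the target page, then the offset within it */
        pos += SizeOfXLogShortPHD + offset % UsableBytesInPage;
    }
    return pos;
}
```

With these assumed sizes, byte position 0 maps to address 40 (just past the long header), and byte position 8152 maps to 8216 (just past the short header of the second page). The xlogid and xrecoff fields would be the upper and lower 32 bits of the returned value.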