Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: WAL format and API changes (9.5) |
Date | |
Msg-id | 20141113150445.GB29112@alap3.anarazel.de Whole thread Raw |
In response to | Re: WAL format and API changes (9.5) (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: WAL format and API changes (9.5)
Re: WAL format and API changes (9.5) |
List | pgsql-hackers |
On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote: > Here's a new version, with big changes again to the record format. Have a > look at xlogrecord.h for the details, but in a nutshell: > > 1. The overall format is now: XLogRecord, per-block headers, header for main > data portion, per-block data, main data. > > 2. I removed xl_len field from XLogRecord and rearranged the fields, to > shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a new 2- > or 5-byte header for the "main data", after the block headers). > > 3. No alignment padding. (the data chunks are copied to aligned buffers at > replay, so redo functions can still assume aligned access) Whoa. That's a large amount of changes... Not bad at all. > In quick testing, this new WAL format is somewhat more compact than the 9.4 > format. Well, that's mixing different kinds of changes together to some degree. Most of this could have been done independently as well. I'm not generally objecting to doing that, but the space/performance comparison isn't entirely fair. One could (I'm not!) argue that we should just do a refactoring like you suggest above, without the additional block information. > That also seems to have more than bought back the performance > regression I saw earlier. I think we really need to do the performance test with a different CRC implementation. So far all the tests in this thread seem to be correlating pretty linearly to the size of the record. > Here are results from my laptop, using the wal-update-testsuite.sh script: Would be nice if that'd also compute the differences in wal_generated and duration between two runs... :) > Aside from the WAL record format changes, this patch adds the "decoded WAL > record" infrastructure that we talked about with Andres. Ah, cool. > XLogReader now has > a new function, DecodeXLogRecord, which parses the block headers etc. from > the WAL record, and copies the data chunks to aligned buffers. The redo > routines are passed a pointer to the XLogReaderState, instead of the plain > XLogRecord, and the redo routines can use macros and functions defined > xlogreader.h to access the already-decoded WAL record. The new WAL record > format is difficult to parse in a piece-meal fashion, so it really needs > this separate decoding pass to be efficient. Hm. I'm not sure if there's really a point in exposing DecodeXLogRecord() outside of xlogreader.c. In which case would a xlogreader user not want to decode the record? I.e. couldn't we just always do that and not bother exposing the function? Sure, you could skip decoding if the record is of a "uninteresting" rmgr, but since all validation but CRC now happens only during decoding I'm not sure that's a good idea. You e.g. have a block + if (!DecodeXLogRecord(xlogreader_state, &errormsg)) + { + fprintf(stderr, "error in WAL record at %X/%X: %s\n", + (uint32) (xlogreader_state->ReadRecPtr >> 32), + (uint32) xlogreader_state->ReadRecPtr, + errormsg); + /* + * Parsing the record failed, but it passed CRC check, so in + * theory we can continue reading. + */ + continue; + } + I think that's quite the bad idea. The CRC is only a part of the error detection, not its entirety. > Thoughts on this new WAL record format? I've attached the xlogrecord.h file > here separately for easy reading, if you want to take a quick look at just > that without applying the whole patch. After a quick skim I like it in general. There's some details I'd rather change, but I think it's quite the step forward. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: